[futurebasic] text parsing -- a clarification


From: Bowerbird@...
Date: Mon, 25 Feb 2002 12:59:45 EST
um, sorry, folks, my fault, i wasn't clear 
in my request for fb2 challenge functions...

i should have refreshed your memory
on the exact nature of the challenge.

if you'll remember, it was for routines
that would examine an arbitrary text-file
(which could be as big as 5 megabytes in size),
and create a list of all the words used in it,
with a count of how many times each was used.
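(a rough sketch of that original challenge, in python rather than fb, just to pin down the logic. the names count_words and sample are made up for illustration, and splitting on any run of whitespace is an assumption about what counts as a "word".)

```python
from collections import Counter

def count_words(text):
    # str.split() with no arguments splits on any run of whitespace
    # and never yields empty strings.
    return Counter(text.split())

sample = "the cat and the hat\nthe end"
counts = count_words(sample)

# list every word with its count, most frequent first
for word, n in counts.most_common():
    print(word, n)
```

a dictionary-style counter touches each word once, so a 5-megabyte file should not need more than a single pass.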

but never mind that, if you don't want,
let me tell you what _i_ actually need, for _my_ purpose,
which is (of course, as you know) within an e-book program.

i need a list of all unique strings that are contained in the text,
the delimiter being a space and/or a return (or multiples of 'em).

the list _is_ sensitive to case, and leading/trailing punctuation.

additionally, this list should be saved in a sorted order,
with the sort also being case- and punctuation-sensitive.
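(a minimal sketch of the unique-string list and its sorted counterpart, in python rather than fb. unique_strings is a hypothetical name. it assumes "return" means a CR or LF character, and it assumes a plain character-code sort is an acceptable case- and punctuation-sensitive order.)

```python
import re

def unique_strings(text):
    # Split on runs of spaces and/or returns only; tabs and other
    # characters stay part of the string, per the spec above.
    seen = {}
    for piece in re.split(r"[ \r\n]+", text):
        if piece and piece not in seen:
            seen[piece] = len(seen)   # remember first-seen order
    return list(seen)

original = unique_strings("The cat, the Cat\n\n  the cat.")
in_order = sorted(original)           # compares by character code

print(original)
print(in_order)
```

sorting by character code puts capitals ahead of lowercase and keeps "cat," and "cat." as separate entries, which is one way to read "case- and punctuation-sensitive".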

next step, the list should be split and 
saved across up to 245 str# resources,
each holding no more than 245 items.
(this gives a maximum list of 
unique strings just over 60,000: 245 x 245 = 60,025.)
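(a sketch of the split, in python rather than fb. each inner list stands in for one str# resource; split_into_resources and the constant names are hypothetical, and actually writing the resources is left out.)

```python
MAX_RESOURCES = 245        # how many str# resources are allowed
ITEMS_PER_RESOURCE = 245   # how many strings each one may hold

def split_into_resources(strings):
    capacity = MAX_RESOURCES * ITEMS_PER_RESOURCE
    if len(strings) > capacity:
        raise ValueError(f"{len(strings)} strings exceed the {capacity} limit")
    # Slice the list into consecutive runs of at most 245 items.
    return [strings[i:i + ITEMS_PER_RESOURCE]
            for i in range(0, len(strings), ITEMS_PER_RESOURCE)]

chunks = split_into_resources([f"w{i}" for i in range(500)])
print(len(chunks), [len(c) for c in chunks])
```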

also, of course, the original list should be stored in 
a str# resource as well (or in 2 if it's over _maxint items).
likewise with its sorted counterpart.

each delimiter-separated string in the original text must
also be replaced with a two-character pointer-token, where 
the first character points to one of the 245 str# resources,
and the second points to a specific item within that str#.
this tokenized text should then be saved in one or more str#s.
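(a sketch of the two-character token, in python rather than fb. the post does not say how a number from 0 to 244 becomes a character, so the ALPHABET below is purely an assumption: it skips byte values 0, 10 and 13 so that a token can never be mistaken for a return. make_token and read_token are hypothetical names.)

```python
ITEMS_PER_RESOURCE = 245
RESERVED = {0, 10, 13}     # assumption: NUL, LF and CR never appear in a token
ALPHABET = [b for b in range(256) if b not in RESERVED][:ITEMS_PER_RESOURCE]
VALUE_OF = {b: i for i, b in enumerate(ALPHABET)}

def make_token(index):
    # First byte selects the str# resource, second the item inside it.
    resource, item = divmod(index, ITEMS_PER_RESOURCE)
    return bytes([ALPHABET[resource], ALPHABET[item]])

def read_token(token):
    # Inverse of make_token: recover the position in the full list.
    return VALUE_OF[token[0]] * ITEMS_PER_RESOURCE + VALUE_OF[token[1]]

token = make_token(300)
print(len(token), read_token(token))
```

any other one-to-one mapping onto 245 distinct characters would work the same way; only the table changes.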

so also humbly requested is a routine to do this tokenization.

the routine could work _after_ the string-list generation process,
or as code contained _within_ that function; either would be fine.

(if it's done afterwards, we would want to put the _sorted_ 
version of the string-list into the str# resources, and use it.
but if tokenization is done in the _process_ of string-collection,
you'd only be able to tokenize based on the _unsorted_ version.)

i should also mention that the tokenized file
must maintain the returns of the original file,
so the tokenized file will contain as many lines as the original.
(i'm working with project gutenberg text-files,
and the return at the end of each line is significant.)

but spaces are _not_ maintained in the tokenized file.
every string is expected to be followed by a single space,
so that space is assumed and thus automatically generated.
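(a sketch of the line-preserving part, in python rather than fb. tokens are shown as (resource, item) number pairs instead of characters to keep it short; tokenize_lines and rebuild are hypothetical names, and "\n" stands in for whatever return character the file really uses.)

```python
ITEMS_PER_RESOURCE = 245

def tokenize_lines(text, sorted_strings):
    position = {s: i for i, s in enumerate(sorted_strings)}
    lines = []
    for line in text.split("\n"):          # one output line per input line
        words = [w for w in line.split(" ") if w]   # runs of spaces collapse
        lines.append([divmod(position[w], ITEMS_PER_RESOURCE) for w in words])
    return lines

def rebuild(lines, sorted_strings):
    # Every string gets exactly one generated space after it.
    return "\n".join(
        "".join(sorted_strings[r * ITEMS_PER_RESOURCE + i] + " " for r, i in pairs)
        for pairs in lines)

text = "call me  Ishmael.\n\nsome years ago"
strings = sorted(set(text.split()))
tokens = tokenize_lines(text, strings)
restored = rebuild(tokens, strings)
print(repr(restored))
```

note what the round trip loses: the doubled space after "me" comes back as one space, and each line gains a trailing space, while the empty line and the line count survive.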

i've written routines that do all this stuff, but slowly,
in case you thought looking at them would help you out, but
somehow i doubt a line-input type of approach would...           :+)

-bowerbird