SITE UNDER CONSTRUCTION

(*) feature not implemented

eggBuilderTM: Hyperstrings Programmer's Manual

Print
   


Hyperstrings

Programmer’s manual.

By Alexey Pismenny



  1. Overview

Hyperstrings, or in abbreviated form, Hystrings are strings best suited for syntax analysis. Unlike the container string classes where the instance of the class allocates storage for the string, a Hystring is a collection of pointer-length pairs that together form several segments of the string. The storage, however, is not the responsibility of the Hystring class. Thus, the string storage is separated from the string analysis and manipulation.

Imagine, for example, a large text: an article or a book. If our task as a programmer is to somehow analyze the book’s structure it would be to our advantage to allocate the storage for the book once and then make references to the structural elements of the book without copying any of its substrings into memory allocated in addition to the main book storage. Our task might be to pick out the chapter titles in order to make a table of content. The most direct approach is to form a vector of std::wstring objects and populate it from the substrings of the main string, which is the entire book, that are the titles. Now, in addition to the storage allocated for the book we allocated storage for its chapter titles. In order to create a table of content we will have to combine the titles from our vector with page numbers and formatting and allocate yet another string for the table of content. Now each chapter title is stored in three places: in the book itself; in the vector of the titles; and in the table of content string. The processing done in that style consists of little more than allocating memory and copying characters from one place to another.

Perhaps, the next task is to break our book into pages. Again, the naïve approach would be to allocate another vector of strings, which are the substrings corresponding to each page. More copying, and the required storage had to double: each character in the book is now stored twice: in the original book and in the page where it belongs.

If the book has been analyzed in this fashion and now needs to be edited, each edited character has to be updated in two places. If the chapter title is edited, -- in at least three places!

But at the same time the naïve approach has the benefit of simplicity. The program that forms the table of content and the pages practically writes itself in a few self-evident lines of code. So-called container semantics are great for casual string processing. They become an exercise in wasting time and space for large text processing.

Using Hystrings, the same program can be written with comparable clarity. However, the only large space allocation will be for the book itself. The chapter titles, the table of content and the individual pages will be pointers to the same book storage coupled with the segment length.

More generally, a Hystring is a vector of segments, and a segment is a pointer-length pair. The pointers must be pointing to a collection of regular C strings called Space. Space could be a single C-string (for example, the entire book) or it could be a container of strings. The only requirement is that the Hystring segment pointers should point to allocated memory. The DEBUG version of the library even validates each pointer against the registered space container and nulls the invalid pointers.

To search for a substring of a Hystring means to walk its segments and examine the characters pointed to. To concatenate two Hystrings is to push the segments of the second Hystring to the vetctor which is the first Hystring. To delete a substring of a Hystring is to find the segment corresponding to the first deleted character, and to the last deleted character. If they are the same segment, shorten it and insert its tail as a new segment. If the last character deleted is in another segment, shorten the first segment, remove the intervening segments, and insert a new segment made of the tail of the last segment. As you can see, the internal algorithms of Hystring are sometimes non-trivial compared to the scanning and array insertions and contractions done with conventional C strings.

A Hystring can be converted to a conventional string by walking the segments and copying into the new contiguous storage. A conventional string can be registered with the Hystring Space and then a Hystring can be made out of it.

Indeed some Hystring operations do not conserve space at all. For example, to revert a string of 10 characters backward ten pointers and ten length values (of 1) must be allocated in addition to the original string. However, for typical text analysis operations: defining sections of text and making combinations of them, Hystrings offer a significant advantage over container-semantics strings.

{pg}





  1. Introduction to Hystring class and Space class



Let us discuss the Hystring header ({hystring.h}). It first defines the pointer-length pair, called Segment. It is of these segments that the Hystring is made up; among the protected data members we see

vector < Segment > segments;

Hystring has rather self-evident default and copy constructors.

Hystring has two constructors from conventional strings:

Hystring( const char_t* text, unsigned int size );

Hystring( const char_t* text );



The difference between the two is that the second constructor takes a null-terminated string. Caution should be exercised when constructing a Hystring from conventional strings of limited scope. This, for example, would be a programming error:

Hystring getProjectName()

{

char_t name[] = {T("Eggbuilder")};

return Hystring(name);

}


Unlike strings with container semantics, Hystring is a collection of pointers. Since name goes out of scope when the function returns, the Hystring returned is unusable. Of course, a similar mistake can be made if simply a pointer to the string declared in the function’s scope were returned, -- that would be a typical beginner’s mistake in C programming. Hystring, however, has a mechanism to reduce such errors. Separate from Hystring, there is a singleton class Space. Space is a single container for all text material out of which we shall be constructing Hystrings. It can be described as a string heap. The elements of Space are SpaceUnit instances (see {space.h}).

The programmer is supposed to first allocate his C strings in the Space. Then he can use them, or their substrings as a building material for Hystrings. The Hystring constructors expect that the string text is allocated in the Space. The constructor can check if the constructor parameter is in the Space by finding a SpaceUnit whose address range includes the entire parameter string.

It would be cumbersome to always enforce that the text parameter is allocated in the Space right in the constructor; on the other hand, not enforcing the rule at all would decouple Hystring from Space altogether. We don’t want to do that because we don’t want the programmer to have the burden of managing his strings underlying Hystring.

The solution is to enforce the rule in debug mode, but not in release. That is what

void Segment::validate()

does. In debug mode, it asks the Space singleton to validate the new segment. Space then walks its SpaceUnit elements looking for one that encompasses the pointer-length pair. Invalid segments are nulled (set to zero pointer of zero length). The programmer might prefer to throw an exception, but I found this technique most helpful exactly as implemented. If I know that I was building my Hystring from a real string, and yet the debugger shows that it has no length and points to nothing, I immediately realize that I have a problem with the string’s scope.

Let us look at Space header more closely ({space.h} ). Its building block is SpaceUnit. That is also a pointer-length pair (similar to the Hystring’s segments). However, Space Unit also keeps track of the C string allocation. The Boolean data member own indicates whether the SpaceUnit instance also should deallocate the pointer in the destructor. While SpaceUnit is similar to Segment, its purpose is different. The SpaceUnit is supposed to be “large”, whereas the Segment inside a Hystring is typically some “small” substring of the SpaceUnit instance. For example, in our introductory example the programmer would allocate one SpaceUnit for the entire text of the book and that SpaceUnit would be the only unit in his Space. If the programmer needs to work with two files, it would make sense to have a Space comprising two SpaceUnits, one for each file.

Space is a map of SpaceUnit instances, identified by a short integer handle. Space itself is a singleton, whose instance methods can only be accessed through

static Space* instance()

Space has some sophisticated methods and elements that are utilized by classes that derive from Hystring. We shall postpone studying those till we need them in the Hystring derivations. The simple way to use Space is simply to pick a handle for each SpaceUnit that you need, and call Space::reg to create the SpaceUnit. For example, Eggbuilder reads two files for input, the “source” and the “equates”. The files are read into C strings pSource and pContent. In order to process the two strings as Hystrings, two SpaceUnits are registered (see {template.cpp}):

hsp_content = Space::instance()->reg( HSP_CONTENT, pSource );

hsp_equates = Space::instance()->reg( HSP_EQUATES, pEquates );



The all-capital handles are suggested values; the reg method returns the same handle if it is available, otherwise zero. Another approach is to use 0xffff for suggested handle and Space will then return a handle value used internally.

At this point Hystrings can be built from the entire registered strings or from some parts of it. Note that a Hystring can be built over segments from different SpaceUnits. This, in fact, happens a lot in Eggbuilder, when the variables occurring in pSource are replaced by values given to them in pEquates. Here is a simple example:

char_t pSource[] = T(“The dinner today is {dinner}.”);

char_t pEquate[] = T(“dinner=fish”);

unsigned short h1 = Space::instance()->reg( 0xffff, pSource );

unsigned short h2 = Space::instance()->reg( 0xffff, pEquate );

Hystring src( pSource ); // entire source

Hystring eqt( pEquate ); // entire equate


// build the equate

Hystring name, value;

eqt.split( T(’=’), name, value );


// evaluate the source

// split is explained in the next chapter; it will produce

// outs[…] = “The dinner today is ”, “.”

// ins[0] = “dinner”

vector < Hystring > outs, ins;

unsigned int count = src.split( T(’{’), T(’}’), outs, ins );

if( count > 0 && count != npos )

{

// Variables (count) found in the source

// Let us find their values

Hystring result = outs[0]; // “The dinner today is “

for( nVar = 0; nVar < count; ++nVar )

{

if( ins[nVar] == name )

{

// this variable has a value

result.append( value ); // “The dinner today is fish”

}

else

{

result.append( T(’?’) );

}

}

}

// here’s the evaluation result

result.append( outs[count] ); // “The dinner today is fish.”

// free the Space

Space::instance()->unreg( h1 );

Space::instance()->unreg( h2 );


Note that result has three segments, two from the source SpaceUnit and one from the equates SpaceUnit.

{pg}





  1. Hystring functionality

Let us read together the main Hystring header, {hystring.h}.

The public sub-struct CharLocator is a convenient way to identify a character of the Hystring, similar to how a character index identifies a character of a conventional string. CharLocator, however, consists of two numbers, indexing the segment and the character inside the segment.

Hystring also has assignment and comparison operators. Consistently with the design objective, assignment does not copy or move the string content pointed to by the segments. The segment structure on the right hand side of the assignment is, of course, copied to the assignee. Comparisons are done with respect to the text represented by the Hystring; so two Hystrings are equal whenever they yield the same sequence of characters, regardless of the segment structure of the two. The core method for all comparisons is

int Hystring::compare( const Hystring& other, unsigned int& len, unsigned int start ) const

If the programmer wishes to understand the internals of the Hystring implementation, reading through the Hystring::compare method would be a good place to start. Note that there are actually several Hystring::compare methods allowing comparisons with std::string and C strings.

Reading down the header we find length method and several classification methods.

Next we see a group called “locator arithmetic”. They produce CharLocator values for the given Hystring. Most Hystring internals use them as convenient abstractions of the actual segments.

Following that, we find many methods that one would expect in any string class. Usually, multiple versions exist for each family of algorithms. When that is the case, there are typically one or two universal methods that take most parameters, and many simpler methods that are implemented in terms of the complex ones.

Several method families deserve special mention. To convert a Hystring to a conventional string, use any of the print methods.

The find_select is indispensable in any syntax analysis; it allows scanning the string looking for either of several target strings and passing back the position of the target as well as the index of the target.

There is a rich with features split family of methods. In its simplest form, it breaks a Hystring around a given token. For example, to split a Hystring into comma-separated fragments, use:

vector < Hystring > fragments;

unsigned int count = myHystring.split( T(’,’), fragments );


If only two fragments are expected, there is a simpler form

Hystring left, right;

unsigned int count = myHystring.split( T(’,’), left, right );


Returned is the number of split fragments, so 1 indicates that the token was not found; there is always at least one element in the fragment vector.

Often, a programmer needs to break his string into fragments inside the brackets and outside the brackets; that is, instead of a single token there are two tokens and two fragment vectors. For example:

vector < Hystring > outs, ins;

unsigned int count = myHystring.split( T(’{’), T(’}’), outs, ins );



This method returns the count of inside fragments, that is the count of ins[]. The count of outs[] is always the count of ins[] plus 1.

At times it is convenient to look for several types of brackets at the same time. For example, consider this marked-up text:

Dinner today is {dinner}. <B>Don’t forget</B> to bring wine!

Here two types of brackets are used, braces (for variable substitution as you may guess) and the usual HTML bold tags. Now instead of a single open bracket and a single close bracket we need a vector of open brackets, the same-size vector of close brackets, and an output array of indices into them, that will tell us for each of the ins[] element, which type of bracket it was split from.

vector < Hystring > outs, ins;

vector < Hystring > open, close;

vector < Hystring > idx;

open.push_back( HystringAuto( T(’{’) ) ); // see next chapter

open.push_back( HystringAuto( T(“<b>”) ) );

close.push_back( HystringAuto( T(’}’) ) );

close.push_back( HystringAuto( T(“</b>”) ) );


unsigned int count = myHystring.split( open, close, idx, outs, ins );


This will return 2 and the vectors will be:

idx[0] == 0; // in[0] is inside brackets open[0], close[0]

idx[1] == 1; // in[1] is inside brackets open[1], close[1]

outs[] == “Dinner today is ”, “. ”, “ to bring wine!”

ins[] == “dinner”, “Don’t forget”



Even more complex, but also powerful, is find_in_split. This solves the following problem. Suppose you have a text with nested brackets (this time, all of the same kind). Suppose that you want to find a target string, but ignore the target strings that are inside brackets. Or more generally, you want to find the target string on nesting level aLevel. For example, we might want to find “dinner” when it is inside the braces, but not inside two layers of braces, and not outside of all braces. We will then specify aLevel = 1. This scenario must account for the possibility that the target string itself contains brackets, which are not necessarily balanced inside the target. To account for that, the method passes back the bracket level just after the target. If the target contained, for example an unbalanced opening bracket, the aLevel on the output will be the aLevel specified on the input plus 1.

{pg}





  1. Classes that derive from Hystring

HystringAuto

The necessity to allocate the space for the Hystrings is a natural requirement in file processing, when the entire file content is allocated right after it is read. But sometime we need to make a Hystring out of a string of limited scope, in order to use the powerful analytical methods that come with Hystring. Hystring Auto is a kind of Hystring that is well-suited for underlying C strings of limited scope. We already used a HystringAuto in the preceding example of split:

open.push_back( HystringAuto( T(’{’) ) );

open.push_back( HystringAuto( T(“<b>”) ) );



In that example, open was a vector of Hystrings declared in some limited scope that was not shown in the example. The vector elements were initially empty Hystrings, requiring no corresponding SpaceUnit. Now we want to create a Hystring out of a single character or out of a short C string. That would cause the Hystring constructor to build a segment out of the temporary strings “{“ and “<b>”, and in Debug the Segment constructor will fail. We don’t want Hystrings to be built from temporary C strings, because such Hystring would have hanging pointers if it were to leave the scope of the underlying C strings.

Certainly we don’t want to register all the sundry strings that we might use inside various methods ahead of time in some initialization section. Instead, we want a derived class that would register the string with which it is constructed just for the scope of its existence, and then the destructor would unregister it.

And that is what HystringAuto is. While its instance exists, the string is registered; when the instance goes out of scope, it is unregistered. Because the class is anticipated to be frequently used with single characters, it makes a single character into a string internally. When built over a string, that string is not copied and is registered as passed. The programmer must ensure that the scope of the underlying C string is not shorter than the scope of HystringAuto. It is not recommended to concatenate HystringAuto instances. If the programmer feels the need to do string arithmetic on HystringAuto, he should write the necessary Append method that ensures that the appended segments remain registered.

This class has been supplanted by HystringContainer in the Eggbuilder project.



HystringContainer

A desire to build a Hystring that allocates itself but also can be appended, inserted into, its substrings replaced, -- all that without worry that the underlying C string might go away, -- has led to the development of this class. Since HystringContainer’s anticipated use is casual, and often over short limited scope strings, we do not allocate and register a string each time a HystringContainer instance is needed. Instead we maintain a pool of registered strings whose size is a multiple of fixed “chunk size”. As another HystringContainer is created, the init method tries to push the underlying C string on the same handle that was used previously, appending the new string to the pooled string. If the new string cannot be appended within the size already allocated, a similar attempt is done with older pooled strings. Finally, if no suitable pooled string exists, a new pooled string is allocated and accommodates the new string.

This works well for relatively short strings, which are concatenated together into a few in number pooled strings. Nothing happens when a HystringContainer is destructed: its underlying string will be released when the entire Space is destructed at the end of the program. For that reason, it is not desirable to use HystringContainer for temporary variables that need to allocate changing string content. Eggbuilder uses HystringContainer mostly for its fixed syntactic elements such as keywords and delimiters.

In order to understand how HystringContainer manages its pool of registered strings, read HystringContainer::init. HystringContainer maintains a static pool of handles

static vector < ContainerHandle > container_handles;

HystringContainer::init method cooperates with Space::push method. The latter attempts to append a copy of the string onto the given handle while maintaining the given string size. The granularity with which the pooled string size is calculates is also determined by Space::chunk_size(). It is currently set to 0xfff = 4095 bytes.

Important feature of HystringContainer is that its pointer can be taken directly as

const char_t* ptr()

For regular Hystrings that would not be possible, because a Hystring may have multiple segments, and even one segment is meaningless without knowing its length. However, HystringContainer is always allocated in a single contiguous storage in the Space and it is null-terminated. So a direct pointer to its allocation can be produced.



HystringContainerNumber

Very often, string content is various representations of numbers. Pages, and sometimes paragraphs need to be numbered; arrays and tables have rows and column numbers. In text processing, we still do number arithmetic; however, we also distinguish between numbers that are the same numerically but are differently formatted. We need to process numbers in the same way we process other Hystrings; at the same time it makes little sense to store several copies of the same string representing the same format of the same numerical value. Yet that is what would happen if numbers were stored as HystringContainers and converted to numerical value on the fly.

HystringContainerNumber derives from HystringContainer and resolves these issues, at least for whole numbers. Let us read {Hystring_container_number.h} together.

First, we define flags and format specifiers that apply to whole numbers. A FormattedNumber is defined as a struct containing the numerical long value, its format, formatting flags and the width. We define equality and comparison operator that take into account the formatting attributes as well as the numerical value.

We decide to create a special storage for numbers, separate from the pooled strings used by HystringContainer parent. That is because we do not want to store the same FormattedNumber several times. Our number storage, HCN_map, is a map keyed by FormattedNumber, pointing to the familiar to us handle-string pair, which this class calls HCN_storage.

Instead of a public constructor, -- that would always produce a new class instance when called, -- we create HystringContainerNumber instances from a factory. We define several commands that the Factory takes: HNF_Init and HNF_Exit must be called when the HyperString library is initialized in HS_Init() and HS_Exit(); HNF_Make produces a HystringContainerNumber, and HNF_Delete deletes it from the storage.

Next, let us examine the Factory.

The execution of HNF_Init and HNF_Exit is straightforward: the _numbers storage is respectively allocated or freed.

For HNF_Make the Factory makes a local instance of FormattedNumber and looks for it in the _numbers storage. If that FormattedNumber is already allocated, the Factory returns a copy of the allocated instance, and passes back the Space handle in case the caller needs it. If the FromattedNumber is not in the storage yet, the Factory uses the protected constructor to allocate the string representation of the number as a HystringContainer, and inserts its Space handle and the pointer in the _numbers storage.

HNF_Delete removes and erases the given FormattedNumber if it had been stored.

HystringContainerNumber can be incremented. The method works much like the factory’s HNF_Make.

{pg}



Hystring Library

Hystring is a complex class, not even counting the derived classes. A functional wrapper around Hystring is Hystring Library. The library is designed to resolve several parsing problems using Hystring class as its internals.

The library is not large. It is implemented in two files, hslib.h and hslib.cpp.

The main purpose is to support scenarios when multiple types of bracket pairs exist and need to be parsed. The “base” bracket pair is something like “{“ and “}”. Those need to be define once at the library initiation. Additional “custom” bracket pairs can be constructed using the base. For example {begin} and {end} are intuitively recognizable bracket pairs that can be created dynamically and parsed against. Further, at times we want different opening custom bracket to be closed with the same closing custom bracket: for example, it might be a reasonable syntax to have opening brackets “{begin}”, “{block}”, “{if}”, “{for}” all close on the same closing bracket “{end}”. Because different opening brackets may associate with their own sets of closing brackets, they are called “associative brackets”.

Another desired feature is opening bracket embedding other code or variables. For example, it would be nice to be able to define only “{if}” and “{end}” as a custom bracket pair, but actually parse expressions like “{if: x > 0} … {end}”. This task is facilitated by Hystring Library as well. The idea is to use unbalanced opening brackets, but parse to the rest of the text till the base bracket is balanced. The remainder between the unbalanced custom bracket and the point when it becomes balanced in terms of the base is called the complement. For example, if we define “{if: ” as an opening bracket, the library will parse “{if: x > 0} … {end}” and define “x > 0” as the opening bracket's complement.

The plane text between two brackets is called inside fragment. So, given proper bracket definitions,

“{if: x > 0} it's positive! {end}”

will parse as follows:

“{if: “ is the associative opening bracket

“x > 0” is the complement

“it's positive” is the inside fragment

{end} is the associated closing bracket.


Let us examine the header file.

void HS_Init(const char_t* baseOpen, const char_t* baseClose)

This needs to be called once. Currently, there is no corresponding HS_Exit. The parameters define the base brackets.

typedef pair < Hystring, vector < const Hystring > > HystringAssociation;

This is the fundamental type for the library, in this context called shortly association. It associates a given string with an array of its closing strings. For example, in the Egg language "{set" can be closed by "{endset}" or "{endif}" or "{endfor}".

struct HS_Fragment

{

unsigned short idx; // index into opens

unsigned short minor; // index into opens[idx].second

Hystring in; // inside fragment

Hystring complement; // complement

};


Another fundamental Hystring Library type. This is what the parser returns for each parsed fragment. The index 'idx' is into a given array of associations. The other index, 'minor', is into the array of possible closes for the opening bracket that defined the fragment. String 'in' is the inside fragment, i.e. the text between entire (complemented) opening bracket and the closing bracket. String 'complement' is what complemented the opening bracket to balance.



string_t HS_Print(const vector < Hystring >& strings);

string_t HS_Print(const HystringAssociation& assoc);

string_t HS_Print(const vector < HystringAssociation >& opens);

string_t HS_Print(const HS_Fragment& fragment);

string_t HS_Print(const vector < HS_Fragment >& fragments);

These print the corresponding object for debugging and logging purposes.



unsigned int HS_Split(const Hystring& src, const vector < Hystring >& tokens, vector < HS_Fragment >& fragments, unsigned int maxCount = 0);

This function splits the given string between tokens. No associations are involved. If maxCount is specified, the function quits after finding maxCount fragments. By convention, fragments[n].idx is index into the array of tokens for the token to the left of the fragment (or 0xffff for the leftmost fragment); fragments[n].minor is index into the array of tokens for the token to the right of the fragment (or 0xffff for the rightmost fragment).


unsigned int HS_AssociativeSplit(

const Hystring& src,

const vector < HystringAssociation >& opens,

vector < Hystring >& outs,

vector < HS_Fragment >& fragments,

unsigned int maxCount = 0);

This is the most powerful parsing method in this library. Fpr a given string and a given array od associations, pass back the array of strings falling outside of matching brackets, and an array of fragments inside associated brackets. Optionally, limit the parsing to a given number of fragments.



Other methods:

unsigned int HS_FindMatch(const Hystring& src, const vector < HystringAssociation >& opens, unsigned int nodeOpen, const CharLocator* locNode, CharLocator* locNext, CharLocator* locMatch = 0);

This is a helper method. Knowing that nodeOpen is found and locOpen points just after it, find a matching close, recursing on nested opens. Return index into the associated closes vector; pass back the locator of the matching close and the locator right after the matching close.



int HS_BracketLevel(const Hystring& src, const char_t* baseOpen = 0, const char_t* baseClose = 0);

This is a helper method. It computes the balance of given opening and closing brackets in a string. By default, the base brackets are used.



unsigned int HS_BracketLevelSubstr(

const Hystring& src, // source string

int lvlTarget, // level to look for

Hystring& aSub, // resulting substring

const CharLocator* pLocBegin = 0, const CharLocator* pLocEnd = 0, // brackets of extraction in src

const char_t* baseOpen = 0, const char_t* baseClose = 0 // base brackets

);


This is another helper. From a given string, a given base bracket level is extracted. For example, using the string "A{B}C}E" and looking for level -1 we find, starting from pos 0, "A{B}C".









Website material Copyright © Alexey Pismenny, 2014.
eggBuilder software Copyright © Alexey Pismenny, 2002-2014.
All rights reserved.