README: Clean up.
* Consistently use two spaces after a period.  It makes no difference
  to the rendered result but allows text-processing tools to distinguish
  between the period at the end of a sentence and the period at the end
  of an abbreviation within a sentence.
* Re-wrap all paragraphs.
* Use indentation instead of backticks for code blocks.  This makes them
  easier to read in the source.
* Use `inline code markup` where appropriate.
* Add missing punctuation here and there.
* Use proper quotes.
* Fix a typo.
* Remove a sentence that was meant to link to the POSIX spec but didn't
  actually contain a link.  It's a moving target, and I'm sure readers
  know how to find it if they don't already have a copy.
dag-erling committed Sep 5, 2024
1 parent 354c916 commit 193ca19
Showing 1 changed file with 107 additions and 111 deletions.
218 changes: 107 additions & 111 deletions README.md
@@ -14,12 +14,12 @@ matching library with some exciting features such as approximate

The matching algorithm used in TRE uses linear worst-case time in
the length of the text being searched, and quadratic worst-case
time in the length of the used regular expression.
time in the length of the used regular expression.

In other words, the time complexity of the algorithm is O(M^2N), where
M is the length of the regular expression and N is the length of the
text. The used space is also quadratic on the length of the regex, but
does not depend on the searched string. This quadratic behaviour
text. The used space is also quadratic on the length of the regex,
but does not depend on the searched string. This quadratic behaviour
occurs only on pathological cases which are probably very rare in
practice.

@@ -44,35 +44,33 @@ You will need the following tools installed on your system:
Building
--------

First, prepare the tre. Change to the root of the source directory
First, prepare the tree. Change to the root of the source directory
and run
```
./utils/autogen.sh
```

./utils/autogen.sh

This will regenerate various things using the prerequisite tools so
that you end up with a buildable tree.

After this, you can run the configure script and build TRE as usual:
```
./configure
make
make check
make install
```

./configure
make
make check
make install


Building a source code package
------------------------------

In a prepared tree, this command creates a source code tarball:
```
./configure && make dist
```

./configure && make dist

Alternatively, you can run
```
./utils/build-sources.sh
```

./utils/build-sources.sh

which builds the source code packages and puts them in the `dist`
subdirectory. This script needs a working `zip` command.

@@ -89,16 +87,16 @@ Approximate matching
--------------------

Approximate pattern matching allows matches to be approximate, that
is, allows the matches to be close to the searched pattern under
some measure of closeness. TRE uses the edit-distance measure (also
known as the Levenshtein distance) where characters can be
inserted, deleted, or substituted in the searched text in order to
get an exact match.
is, allows the matches to be close to the searched pattern under some
measure of closeness. TRE uses the edit-distance measure (also known
as the Levenshtein distance) where characters can be inserted,
deleted, or substituted in the searched text in order to get an exact
match.

Each insertion, deletion, or substitution adds the distance, or cost,
of the match. TRE can report the matches which have a cost lower than
some given threshold value. TRE can also be used to search for matches
with the lowest cost.
of the match. TRE can report the matches which have a cost lower than
some given threshold value. TRE can also be used to search for
matches with the lowest cost.
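
To make the cost measure concrete, here is a minimal sketch of the plain
Levenshtein distance between two fixed strings, with every insertion,
deletion, and substitution costing 1.  It only illustrates the measure
described above; it is not TRE's matching algorithm, which matches a
text against a regular expression, and the function name
`edit_distance` is made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Unit-cost Levenshtein distance between the strings a and b.
 * Hypothetical helper for illustration only, not part of TRE.
 * Returns -1 if the working row cannot be allocated. */
int edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int *row = malloc((lb + 1) * sizeof *row);
    if (row == NULL)
        return -1;

    /* Distance from the empty prefix of a to each prefix of b. */
    for (size_t j = 0; j <= lb; j++)
        row[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        int diag = row[0];          /* D[i-1][j-1] */
        row[0] = (int)i;            /* D[i][0]: delete i characters */
        for (size_t j = 1; j <= lb; j++) {
            int up = row[j];        /* D[i-1][j] */
            int subst = diag + (a[i - 1] != b[j - 1]);
            int del = up + 1;
            int ins = row[j - 1] + 1;
            int best = subst < del ? subst : del;
            if (ins < best)
                best = ins;
            row[j] = best;
            diag = up;
        }
    }

    int d = row[lb];
    free(row);
    return d;
}
```

For example, `edit_distance("kitten", "sitting")` is 3: two
substitutions and one insertion.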

TRE includes a version of the agrep (approximate grep) command line
tool for approximate regexp matching in the style of grep. Unlike
@@ -110,19 +108,19 @@ deletion and substitution.
Strict standard conformance
---------------------------

POSIX defines the behaviour of regexp functions precisely. TRE
POSIX defines the behaviour of regexp functions precisely. TRE
attempts to conform to these specifications as strictly as possible.
TRE always returns the correct matches for subpatterns, for example.
Very few other implementations do this correctly. In fact, the only
other implementations besides TRE that I am aware of (free or not)
that get it right are Rx by Tom Lord, Regex++ by John Maddock, and the
AT&T ast regex by Glenn Fowler and Doug McIlroy.

The standard TRE tries to conform to is the IEEE Std 1003.1-2001,
or Open Group Base Specifications Issue 6, commonly referred to as
"POSIX". It can be found online here. The relevant parts are the
base specifications on regular expressions (and the rationale) and
the description of the regcomp() API.
The standard TRE tries to conform to is the IEEE Std 1003.1-2001, or
Open Group Base Specifications Issue 6, commonly referred to as
POSIX. The relevant parts are the base specifications on regular
expressions (and the rationale) and the description of the `regcomp()`
API.

For an excellent survey on POSIX regexp matchers, see the testregex
pages by Glenn Fowler of AT&T Labs Research.
@@ -131,58 +129,57 @@ Predictable matching speed
--------------------------

Because of the matching algorithm used in TRE, the maximum time
consumed by any regexec() call is always directly proportional to
consumed by any `regexec()` call is always directly proportional to
the length of the searched string. There is one exception: if back
references are used, the matching may take time that grows
exponentially with the length of the string. This is because
matching back references is an NP complete problem, and almost
certainly requires exponential time to match in the worst case.
exponentially with the length of the string. This is because matching
back references is an NP complete problem, and almost certainly
requires exponential time to match in the worst case.

Predictable and modest memory consumption
-----------------------------------------

A regexec() call never allocates memory from the heap. TRE
allocates all the memory it needs during a regcomp() call, and some
temporary working space from the stack frame for the duration of
the regexec() call. The amount of temporary space needed is
constant during matching and does not depend on the searched
string. For regexps of reasonable size TRE needs less than 50K of
dynamically allocated memory during the regcomp() call, less than
20K for the compiled pattern buffer, and less than two kilobytes of
temporary working space from the stack frame during a regexec()
call. There is no time/memory tradeoff. TRE is also small in code
size; statically linking with TRE increases the executable size
less than 30K (gcc-3.2, x86, GNU/Linux).
A `regexec()` call never allocates memory from the heap. TRE allocates
all the memory it needs during a `regcomp()` call, and some temporary
working space from the stack frame for the duration of the `regexec()`
call. The amount of temporary space needed is constant during
matching and does not depend on the searched string. For regexps of
reasonable size TRE needs less than 50K of dynamically allocated
memory during the `regcomp()` call, less than 20K for the compiled
pattern buffer, and less than two kilobytes of temporary working space
from the stack frame during a `regexec()` call. There is no time /
memory tradeoff. TRE is also small in code size; statically linking
with TRE increases the executable size less than 30K (gcc-3.2, x86,
GNU/Linux).
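
The compile-once, match-many pattern this paragraph describes can be
sketched as follows.  The sketch assumes the standard POSIX `<regex.h>`
interface as provided by the system; the header and any prefixed
function names that a TRE installation uses may differ, and the helper
`count_matches` is made up for this example.  All allocation happens in
`regcomp()` and is released by `regfree()`; the loop only calls
`regexec()`.

```c
#include <regex.h>
#include <stddef.h>

/* Count how many of the n strings in texts contain a match for the
 * extended regular expression pattern.  Hypothetical helper for
 * illustration.  Returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *const *texts, size_t n)
{
    regex_t re;

    /* Compile once.  REG_NOSUB: we only need a yes/no answer. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int count = 0;
    for (size_t i = 0; i < n; i++) {
        /* Reuse the same compiled pattern for every string. */
        if (regexec(&re, texts[i], 0, NULL, 0) == 0)
            count++;
    }

    regfree(&re);
    return count;
}
```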

Wide character and multibyte character set support
--------------------------------------------------

TRE supports multibyte character sets. This makes it possible to
use regexps seamlessly with, for example, Japanese locales. TRE
also provides a wide character API.
TRE supports multibyte character sets. This makes it possible to use
regexps seamlessly with, for example, Japanese locales. TRE also
provides a wide character API.

Binary pattern and data support
-------------------------------

TRE provides APIs which allow binary zero characters both in
regexps and searched strings. The standard API cannot be easily
used to, for example, search for printable words from binary data
(although it is possible with some hacking). Searching for patterns
which contain binary zeroes embedded is not possible at all with
the standard API.
TRE provides APIs which allow binary zero characters both in regexps
and searched strings. The standard API cannot be easily used to, for
example, search for printable words from binary data (although it is
possible with some hacking). Searching for patterns which contain
binary zeroes embedded is not possible at all with the standard API.
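
The limitation of the standard API can be demonstrated with a small
sketch.  It assumes the system's POSIX `<regex.h>`; the helper name
`search_nul_terminated` is made up for this example.  Because
`regexec()` takes no length argument, it treats its input as a
NUL-terminated string and never looks at data after the first zero
byte.  TRE's length-aware APIs are not shown here.

```c
#include <regex.h>
#include <stddef.h>

/* Search buf with the standard API.  Hypothetical helper for
 * illustration.  Returns 1 on a match, 0 on no match, and -1 if the
 * pattern does not compile.  regexec() stops at the first NUL byte. */
int search_nul_terminated(const char *pattern, const char *buf)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, buf, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With the 7-byte buffer `"abc\0def"`, searching for `abc` succeeds, but
searching for `def` fails because those bytes lie after the zero byte.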

Completely thread safe
----------------------

TRE is completely thread safe. All the exported functions are
TRE is completely thread safe. All the exported functions are
re-entrant, and a single compiled regexp object can be used
simultaneously in multiple contexts; e.g. in main() and a signal
simultaneously in multiple contexts; e.g. in `main()` and a signal
handler, or in many threads of a multithreaded application.

Portable
--------

TRE is portable across multiple platforms. Here's a table of
TRE is portable across multiple platforms. Here's a table of
platforms and compilers that have been successfully used to compile
and run TRE:

@@ -206,7 +203,7 @@ and run TRE:

TRE 0.7.5 should compile without changes on all of the above
platforms. Tell me if you are using TRE on a platform that is not
listed above, and I'll add it to the list. Also let me know if TRE
listed above, and I'll add it to the list. Also let me know if TRE
does not work on a listed platform.

Depending on the platform, you may need to install libutf8 to get
@@ -215,85 +212,84 @@ wide character and multibyte character set support.
Free
----

TRE is released under a license which is essentially the same as
the "2 clause" BSD-style license used in NetBSD. See the file
LICENSE for details.
TRE is released under a license which is essentially the same as the
2 clause BSD-style license used in NetBSD. See the file LICENSE for
details.

Roadmap
-------

There are currently two features, both related to collating
elements, missing from 100% POSIX compliance. These are:
There are currently two features, both related to collating elements,
missing from 100% POSIX compliance. These are:

* Support for collating elements (e.g. [[.\<X>.]], where \<X> is a
collating element). It is not possible to support
multi-character collating elements portably, since POSIX does
not define a way to determine whether a character sequence is a
multi-character collating element or not.
* Support for collating elements (e.g. `[[.\<X>.]]`, where `\<X>` is a
collating element). It is not possible to support multi-character
collating elements portably, since POSIX does not define a way to
determine whether a character sequence is a multi-character
collating element or not.

* Support for equivalence classes, for example [[=\<X>=]], where
\<X> is a collating element. An equivalence class matches any
character which has the same primary collation weight as
\<X>. Again, POSIX provides no portable mechanism for
determining the primary collation weight of a collating
element.
* Support for equivalence classes, for example `[[=\<X>=]]`, where
`\<X>` is a collating element. An equivalence class matches any
character which has the same primary collation weight as `\<X>`.
Again, POSIX provides no portable mechanism for determining the
primary collation weight of a collating element.

Note that other portable regexp implementations don't support
collating elements either. The single exception is Regex++, which
collating elements either. The single exception is Regex++, which
comes with its own database for collating elements for different
locales. Support for collating elements and equivalence classes has
not been widely requested and is not very high on the TODO list at
the moment.
locales. Support for collating elements and equivalence classes has
not been widely requested and is not very high on the TODO list at the
moment.

These are other features I'm planning to implement real soon now:

* All the missing GNU extensions enabled in GNU regex, such as
[[:<:]] and [[:>:]]
`[[:<:]]` and `[[:>:]]`.

* A REG_SHORTEST regexec() flag for returning the shortest match
* A `REG_SHORTEST` `regexec()` flag for returning the shortest match
instead of the longest match.

* Perl-compatible syntax
* `[:^class:]`
Matches anything but the characters in class. Note that
`[^[:class:]]` works already, this would be just a
convenience shorthand.
* Perl-compatible syntax:
* `[:^class:]`
Matches anything but the characters in class. Note that
`[^[:class:]]` works already, this would be just a convenience
shorthand.

* `\A`
Match only at beginning of string
* `\A`
Match only at beginning of string.

* `\Z`
Match only at end of string, or before newline at the end
* `\Z`
Match only at end of string, or before newline at the end.

* `\z`
Match only at end of string
* `\z`
Match only at end of string.

* `\l`
Lowercase next char (think vi)
* `\l`
Lowercase next char (think vi).

* `\u`
Uppercase next char (think vi)
* `\u`
Uppercase next char (think vi).

* `\L`
Lowercase till \E (think vi)
* `\L`
Lowercase till `\E` (think vi).

* `\U`
Uppercase till \E (think vi)
* `\U`
Uppercase till `\E` (think vi).

* `(?=pattern)`
* `(?=pattern)`
Zero-width positive look-ahead assertions.

* `(?!pattern)`
* `(?!pattern)`
Zero-width negative look-ahead assertions.

* `(?<=pattern)`
* `(?<=pattern)`
Zero-width positive look-behind assertions.

* `(?<!pattern)`
Zero-width negative look-behind assertions.

Documentation especially for the nonstandard features of TRE, such
as approximate matching, is a work in progress (with "progress"
loosely defined...) If you want to find an extension to use, reading
the `include/tre/tre.h` header might provide some additional hints if
you are comfortable with C source code.
Documentation especially for the nonstandard features of TRE, such as
approximate matching, is a work in progress (with “progress” loosely
defined...) If you want to find an extension to use, reading the
`include/tre/tre.h` header might provide some additional hints if you
are comfortable with C source code.
