Releases: Bookworm-project/BookwormDB
3-Alpha
A major set of changes that will either revive the project or kill it dead!
The changes that make the biggest difference for users are:
- Tables no longer need to reside in memory, and bookworm will still run after a restart.
- Apache and MySQL configuration steps are simplified. Rather than having code go through gymnastics to create MySQL passwords, you're now responsible yourself for getting a valid admin password into ~/.my.cnf (a sketch for checking that file follows this list).
- The API now runs as a separate process by default on port 10012. You can expose this directly, which should be fine for most uses but may be vulnerable to DDoS; or you can mask it behind an Apache/Nginx location.
- The whole stack now runs on Python 3 instead of Python 2.
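If you're not sure whether your ~/.my.cnf is in order, a minimal sketch like the one below (not part of bookworm itself; it just follows the standard MySQL client-file conventions) shows what the build is likely to find:

```python
# Minimal sketch: check whether ~/.my.cnf exists and carries client credentials.
# The expected contents are roughly:
#   [client]
#   user = root
#   password = YOUR_PASSWORD
import configparser
import os

path = os.path.expanduser("~/.my.cnf")
config = configparser.ConfigParser(allow_no_value=True, interpolation=None)
found = config.read(path)

if found and "client" in config:
    print("Found [client] credentials:", dict(config["client"]))
else:
    print("No usable ~/.my.cnf; bookworm will fall back to root with no password.")
```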
Changes to API and serving exposure
The old linechart GUI is no longer supported.
We fixed the API in about 2012, but the linechart still uses some 2011 terms to access the API. Backward compatibility makes this project extremely hard to support. It would definitely be worth re-wrapping the UI elements around the API or updating the project there; but the database end is no longer going to carry all of that old code inside of it.
CGI scripts now run as wsgi.
Scripts are now wsgi instead of cgi, which significantly reduces latency; things like pandas imports now run only once, rather than once for each query. This also means that the old 'bookworm query' method, which starts a new service, will be much slower than simply hitting the web server.
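To illustrate why that matters, here's a minimal WSGI application in the same pattern (the response and module layout are made up for the example, not the actual bookworm API code): the heavy imports happen once when the worker starts, and every request just calls the function.

```python
import json
import pandas as pd  # expensive import: runs once per worker process, not per request


def application(environ, start_response):
    # Each request reuses the modules already loaded above.
    body = json.dumps({"status": "ok"}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```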
Userland serving of the API through gunicorn
You can still run the webserver through Apache if you want; but now the default assumption is that you'll serve the API separately on its own port. This is configurable, but defaults to 10012. A sensible production approach would be to force that port to listen only to local requests, and to have Apache (or Nginx or whatever) pass API queries through to 10012. The following snippet passes all cgi-bin requests through to localhost:10012 in Apache.
<Proxy *>
Order deny,allow
Allow from all
</Proxy>
ProxyPreserveHost On
<Location "/cgi-bin">
ProxyPass "http://127.0.0.1:10012/"
ProxyPassReverse "http://127.0.0.1:10012/"
</Location>
The problem with the old setup was that Apache (i.e., the httpd user) needed to live in the same world of python dependencies. This required all sorts of mucking about with root privileges. By running everything as an ordinary user, the process is greatly simplified.
Export as csv or feather.
For programming, you could always export data as "tsv" or "json". Now "csv" (self-explanatory) and "feather" are possible as well. For reading into R and python, feather provides a binary serialization that should be smaller and faster to read than CSV in most cases, while preserving the type attributes of data from the original Bookworm. I recommend it for anything data-science-y on wordcounts.
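A rough example of reading such an export back into pandas (the filename here is invented):

```python
# Reading a feather export back into pandas; requires pyarrow.
import pandas as pd

df = pd.read_feather("bookworm_counts.feather")
print(df.dtypes)   # column types survive the round trip, unlike CSV
print(df.head())
```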
Fast memory tables are optional
Bookworm likes to use fast hash-based memory tables to speed up lookup operations. But these have to sit open in memory, and regular operations have to be scripted to make sure they are refilled after a restart.
This version adds an on-disk version of the lookup tables as the default use case; these are still indexed and reasonably fast, but take no space in memory (other than that defined by the cache).
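In concrete terms the difference is just the storage engine on the lookup tables. A sketch in MySQL terms (the table and column names are invented, not bookworm's actual schema, and "my_bookworm" is a placeholder database):

```python
# Illustrative only: contrasts the old in-memory lookup tables with the new
# on-disk default. Requires the mysqlclient package and a running MySQL server.
import os
import MySQLdb

conn = MySQLdb.connect(read_default_file=os.path.expanduser("~/.my.cnf"),
                       db="my_bookworm")
cur = conn.cursor()

# Old style: fast lookups, but lives in RAM and is empty after a restart.
cur.execute("""CREATE TABLE lookup_memory (
                 wordid MEDIUMINT, word VARCHAR(30), PRIMARY KEY (word)
               ) ENGINE=MEMORY""")

# New default: indexed on-disk table; still reasonably fast, takes no RAM
# beyond the cache, and nothing needs rebuilding after a restart.
cur.execute("""CREATE TABLE lookup_disk (
                 wordid MEDIUMINT, word VARCHAR(30), PRIMARY KEY (word)
               ) ENGINE=InnoDB""")
```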
Better Unicode support.
- The default tokenization now uses "\w" instead of "\p{L}" for word characters, which appears to work better in Arabic and Sanskrit (among other languages?); a short example follows this list.
- Test suite includes Cherokee and Arabic.
- Databases now explicitly force utf8 collation.
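The sketch below (using the third-party regex module, which understands \p{L}) shows the practical difference on a Devanagari word, where the combining vowel signs are marks rather than letters; the sample strings are just examples, not taken from the test suite.

```python
import regex  # third-party module; supports \p{L} and a Unicode-aware \w

sample = "संस्कृतम् ᏣᎳᎩ"   # Sanskrit (with combining vowel signs) and Cherokee
print(regex.findall(r"\w+", sample))      # new default: keeps combining marks attached
print(regex.findall(r"\p{L}+", sample))   # old letter-only class: can split words at marks
```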
Under-the-hood improvements, primarily modernizing the codebase.
Python3
The codebase is now in python3 to support further development. I do not plan to back-support python2.
This was done with the help of an autotranslate, so it's possible there will be some unexpected consequences that pop up.
Cleaner under-the-hood operations
The previous python module used to have an underlying Makefile
and one or two remaining calls to perl, as well as GNU parallel (which had
an obnoxious citation message). Those are gone. Parallelism in the build
is all handled in python, which appears to speed things up a bit because there's
less passing things in and out of pipes.
bounter for word counts.
One slightly more controversial choice is that the initial wordcounting is no longer exact, but instead uses the 'bounter' package from the makers of gensim. This uses probabilistic counts to decide where the cutoff for the most frequent words will be; it will give different results in some cases.
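Roughly how the package works (the memory budget and toy corpus here are arbitrary, not what bookworm actually passes):

```python
from bounter import bounter

# Counter bounded to ~128 MB of memory; counts become approximate rather than
# exact as the structure fills up.
counts = bounter(size_mb=128)
for doc in ("the quick brown fox", "the lazy dog"):
    counts.update(doc.split())

print(counts["the"])         # approximate count (here, 2)
print(counts.cardinality())  # approximate number of distinct words
```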
No longer requires root
This makes installation much easier. Everything that used to require a root password has been ripped out.
Easier MySQL password handling
The previous MySQL configuration was preachy about security. Now, it does the following (a sketch of the connection logic follows this list):
- The user must set up MySQL to allow root access through a file at ~/.my.cnf or ~/my.cnf. If there is no file there, bookworm tries to log into mysql as 'root' with no password; if that fails, you can't build.
- By default, bookworm handles all connections as a read-only user named 'bookworm' with no password. This user cannot modify any databases, and has SELECT access only to Bookworm databases, so there are no longer any attempts made to force it to have a secure password.
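In rough terms the connection logic looks something like this sketch (illustrative, not the actual bookworm code; "my_bookworm" is a placeholder database name):

```python
import os
import MySQLdb  # from the mysqlclient package


def admin_connection():
    """Admin access for building: ~/.my.cnf or ~/my.cnf if present, else root with no password."""
    for candidate in ("~/.my.cnf", "~/my.cnf"):
        path = os.path.expanduser(candidate)
        if os.path.exists(path):
            return MySQLdb.connect(read_default_file=path)
    return MySQLdb.connect(user="root", passwd="")


def client_connection(database="my_bookworm"):
    """Read-only access for serving queries: the passwordless 'bookworm' user."""
    return MySQLdb.connect(user="bookworm", passwd="", db=database)
```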
Parkman Version 1
A fairly routine update, but easier to install.
Unit Testing
Significantly better unit tests and incorporation into a Travis framework.
pip install . is now the recommended way to install the platform.
For new contributions, we'll try to add more unit tests to allow easier merging of suggested changes.
Configuration handling
MySQL configurations are now reliably handled through a central Configfile instance. This reduces some of the headaches where mysql passwords are read from multiple places.
bookworm config mysql now works better as a result, reducing the need for users to muck around in configuration files and type SQL queries themselves.
Other changes
- API now includes CORS headers for cross-domain requests: 35cf1ef
- Some hardcoded host files have been removed
- Missing feature count ingest code has been restored
- Code is more style-compliant
- Some outdated code blocks are removed.
Whitman revision 1.
A variety of incremental improvements and bugfixes.
Performance improvements
- add_metadata now automatically reboots the tables it created, and only those tables.
- The underlying script no longer constructs a useless dictionary when the identifier is bookid; supplementary data keyed to bookid can now be loaded extremely quickly. This and the preceding are good for the HTRC workset use case.
Code niceness
- The test suite is more fully empowered, and the tests are now independent of each other. (Although there are still a bunch of assertions in the primary test, since it's sequential on itself.)
- Adding a single class in configuration.py to handle searching for mysql configuration files; this should start to pay dividends as we get these files in various different places.
- configuration.py now folds in the code from fix_config.py; there are still other modules that could be consolidated.
- A few vestigial code blocks have been deleted.
Bug fixes
- Supplementing metadata from a tsv with no field descriptions file works again.
- Several necessary files for the python setuptools script are now present.
- Using console_scripts in setuptools to handle provisioning of binaries, since that seems to work better.
Whitman
Overview
This is a major set of ease-of-use improvements defining a new server side API. The most important elements are:
- The whole release is organized as a python module; rather than clone the repo for each bookworm, you now install it system-wide.
- All build and update tools are bundled inside a command line executable installed with the package and accessible as bookworm. After installing, type bookworm --help for an outline of the functions.
- The API code, previously a separate repo, has been folded into this framework, accessible through the command line as bookworm query.
- An easy-to-set-up webserver is available through bookworm serve for testing and local data exploration.
- Setup help is included under bookworm config mysql.
- A new command, bookworm add_metadata, makes it easier to extend the metadata around a library.
- The beginnings of a test suite are included, including most importantly a standardization of the Federalist papers test set to make sure that Bookworm is configured properly.
It also incorporates various bugfixes. A slightly more extended discussion is available here.
I'm naming it "Whitman" because, much like Leaves of Grass, this version is folding a lot of previous stuff from various sources into one updated omnibus volume.
I'm going to let this sit for a little while for comment, but hopefully we can move this straight into a 1.0 alpha release. I would consider doing this now, except that at 1.0, the vocabulary needs to be set in stone to abide by semantic versioning.
Usability improvements
- A more consistent API with command line help. Type bookworm --help after installing for some examples.
- All submodules now use the logging module to write errors. This means that problems can be more easily debugged by running with bookworm --log-level=DEBUG ... without extraneously dumping all that stuff out.
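The pattern inside the submodules is just the standard library logging module, roughly like this (the function and messages are illustrative):

```python
import logging

log = logging.getLogger(__name__)


def ingest(catalog):
    log.debug("per-row details that only appear at --log-level=DEBUG")
    log.info("ingesting catalog %s", catalog)
```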
Sterne-stable
Stable release of the old creation API.
This fixes a number of bugs from 0.3-beta over the last several months, and adds some features specifically to work with the Hathi files.
The changes are particularly important for the incorporation of outside metadata into pre-existing bookworms.
From this point on, new development will take a different folder structure, so I'm tagging this release as a stable point for the old server-side API.
Sterne beta
This includes some minor performance fixes principally focused on Bookworms of over 1 million documents, and some bugfixes. All other branches have been deleted, so we can now focus solely on developing master for the time being.
It's only been tested on smaller productions (indeed, a full test suite is a real need); I would love to hear about any errors.
New features
- Some additional date rounding features; rounding is still not fully documented, but the general terms are available from the code in parseMetadata.py
Performance Improvements
- Booklists are now saved on disk using the anydb python module, which saves a lot of memory in the encoding step for large bookworms.
- The settings to parallel have been tweaked to run slightly faster.
- Catalog parsing now takes place in parallel as well; with sets over 1 million volumes, this can be a significant improvement.
Bugfixes
- Variables with the same name as a reserved MySQL word will now be dropped.
Usability improvements
- Fewer branches.
- Less extraneous output written to stdout during a build.
- Warning messages more consistently sent to stderr rather than stdout.
Sterne
Rather than do a beta on v0.2, we'll roll straight ahead to 0.3. Existing bookworms that want to take full advantage should be rebuilt from their old folders.
v0.2 was centered on the ingest format for tokenization; v0.3 focuses on updating the modules for creating the database, although there are a number of other improvements as well. I'm going to name this one after Laurence Sterne for having better documentation and being better able to incorporate references from a variety of sources.
New Features
Easily supplementing metadata.
The most important improvement is the addition of a new scheme that allows metadata to be added to an existing bookworm in json or TSV format. Documentation is available here. This makes it easy to extend a bookworm with any sort of metadata about any existing metadata field: adding genders to authors, decades to years, genres to books, and so forth.
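As a purely illustrative example of the idea (the field names and file layout below are invented; see the linked documentation for the actual format), a supplement keyed on an existing field might be a small table mapping authors to genders, generated like so:

```python
# Write a tab-separated supplement keyed on an existing metadata field.
# Field names and values are made up for illustration.
import csv

rows = [
    {"author": "Whitman, Walt", "author_gender": "m"},
    {"author": "Dickinson, Emily", "author_gender": "f"},
]
with open("author_genders.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "author_gender"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```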
Extension system
Non-base features can now be added through an extension system. Further information is here
Usability improvements
New Documentation
Documentation for the creation of Bookworms has been added to the new gh-pages branch of the repository and is available in gitbook form at http://bmschmidt.github.io/Presidio/. This is a work-in-progress, but should greatly improve existing documentation. We should start retiring/folding in older documentation to avoid confusion.
Changing Github branches.
We've been suffering a proliferation of branches before this release. Rather than keep a master branch around for old versions, that will now be handled by release tags: all work will take place on dev, and dev--stable will be retired while master takes its place.
Config file to handle commands.
The first step in bookworm creation, handled automatically by the makefile, is the creation of a configuration file at bookworm.cnf that includes local password, database name, and username settings. This greatly simplifies some of the problems users have had with mysql configurations, and provides a location for further customizations to be easily held.
Better syntax for OneClick.py
Thanks to the config file, the awkward old syntax requiring you to name your bookworm and some passwords from the command line can be retired. OneClick.py now provides a consistent command line interface to the bookworm that allows input in the format python OneClick.py command [arguments]. So the new metadata-adding script can be invoked by running the appropriate command. Whereas previous versions assumed you'd only run OneClick once (hence the name), it's now fully agnostic; it handles, for example, the new table recreation scripts.
To match this, OneClick.py will be renamed "bookworm.py" in some future version. It's also now possible that we could simply use one system-wide installation of bookworm.
More convenient handling of memory tables
Bookworm holds tables in memory. These must be recreated on startup. Previously the code for this was written in SQL and had to be handled by a cron job. Now the bindings are written directly in python, and there is a new method, python OneClick.py reloadAllMemory, that will immediately update every v0.3 or later Bookworm on the server if and only if the memory tables are empty. This script will need an assigned cron job to run automatically.
Performance changes
I'd usually say 'improvements,' but this is actually a mixed-bag. The tokenization scripts have been rewritten to take much less disk-space and to place much smaller demands on active memory while running. Some users found the working-disk-space requirements excessive, and I tend to agree.
The binary files introduced in Munro were slightly more efficient than previous versions, but still hugely bloated. Bookworm no longer stores intermediate phases in binary; instead, Bookworm
- assembles a word list by tokenizing the entire corpus, and assigns codes to the million most common words,
- directly encodes the files in large batches, retokenizing in the process.
This isn't a free lunch. While reducing the load on disk space and I/O operations, this may be more time-consuming. (Essentially, we now run a tokenizing script instead of parsing massive cPickle objects). With small texts, it may be faster: with book-length texts, it's probably a bit slower than the previous version. (Though still, I'd wager, faster than v0.1). But it also introduces the possibility of fast operations where a predefined word list from another bookworm can simply be cloned at the start, which means some processes can be made dramatically faster if they don't require their own word list.
Given the simplicity of the new regular expression format, I think that eventually this code should probably all be rewritten in a language other than python: probably either C++ or, if someone's feeling interested, Go.
Munro alpha
This is the first version of a release that should be much faster in a number of ways. It's been tested on a couple sets, but is not yet thoroughly debugged. I'm tagging it as a pre-release for that version, but would love for anyone to try it.
I'm calling it Munro because we should name these releases after authors, obviously, and the distinguishing feature of this release is that it handles large numbers of short texts much better than the old ones.
Speed Improvements
The big change is that where previous versions required individual files for every text, the new version handles texts in chunks which greatly reduces the number of disk reads required. I haven't fully benchmarked it, but for the History dissertation set (30,000 extremely small texts) it reduced the overall build time (including database formatting, which hasn't changed) from about 30 seconds to 8 seconds.
Relatedly, it enables a new input format; instead of a directory of files, you can now upload a single text file with your entire archive. The first word of each line is the filename; it should be followed by a tab and the full text of the text identified by the filename. One outstanding question is whether lines past a certain size might somehow cause python to break. It's certainly not possible to use a file larger than the computer's available RAM; but it's hard to imagine that happening nowadays.
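A tiny sketch of that format, with made-up identifiers and texts:

```python
# Each line of the input file: an identifier, a tab, then the full text of
# that document (tabs and newlines in the text itself must be flattened out).
documents = {
    "doc_001": "Call me Ishmael. Some years ago, never mind how long precisely...",
    "doc_002": "It was the best of times, it was the worst of times.",
}
with open("corpus.txt", "w", encoding="utf-8") as f:
    for name, text in documents.items():
        f.write("%s\t%s\n" % (name, text.replace("\t", " ").replace("\n", " ")))
```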
The intermediary stages, which previously stored tables of wordcounts for every text, now use the cPickle module to store batches (specifically, a new python object of class tokenBatches) of the counts for about 10MB worth of texts at a time. That 10MB number is arbitrary, but seems to work: we might want to choose it more dynamically. For one Bookworm with about 1.6m documents of about a paragraph each, that reduces the number of tokenization files down to 88. The wordcounting and tokenization scripts use native methods of the tokenBatches object to do their counting and encoding; avoiding the need to parse to and from CSVs seems to help speed things up a bit more.
Code Improvements
Tokenization is now handled by a new python submodule, bookworm.tokenizer. The old version was a tangle of perl code with all sorts of Internet Archive-specific gobbledegook based around substitutions that tried to surround words with spaces and capture English sentence-ending code; now, it takes a MALLET-inspired approach and simply tries to capture all the elements of a word with one big regex. (In fact, it exports an object called bigregex that constitutes the Bookworm definition of a token: that has previously been a really opaque definition even for those who know the code, so this is a big step towards transparency.)
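In spirit, the approach looks like the pattern below; this is a simplification for illustration, not the actual bigregex exported by bookworm.tokenizer.

```python
import regex  # the third-party module listed under "New Dependencies"

# One pattern that captures either a run of letters or a run of non-letter,
# non-space characters, in the MALLET-inspired style described above.
big_regex = regex.compile(r"\p{L}+|[^\p{L}\s]+")
print(big_regex.findall("Whitman's Leaves of Grass, 1855 edition."))
```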
The old python code seemed to have been dropping unicode characters at a certain point: they are hopefully once again supported.
A number of extraneous old modules and scripts, including tokenizeAndEncodeFiles and the various perl scripts in /scripts, have been cleared out.
Architecture changes
One last change, which is more in the neutral category.
This version continues a transition, begun in most of my edits over the last year or so, farther away from the old model where the python script OneClick.py spawns everything, and towards a layout where a Makefile defines all the targets. This is primarily due to the issues we've had with getting sensible parallelization to work in python, but has some nice side effects as well; Make handles the progressive stages of builds quite nicely, and GNU parallel distributes jobs across processors extremely simply.
For the end user, the difference is simply that instead of calling python OneClick.py bookworm username password, you call make all bookwormName=bookworm. Some form of automatic web site creation should be included shortly.
New Dependencies
This requires a couple of new dependencies on any system.
Python packages:
- regex (to handle unicode regular expressions)
- cPickle
System utilities
- GNU parallel (Oversees job creation very easily)