
Sterne

@bmschmidt bmschmidt released this 10 Jun 18:29

Rather than do a beta on v0.2, we'll roll straight ahead to 0.3. Existing bookworms that want to take full advantage should be rebuilt from their old folders.

v0.2 was centered on the ingest format for tokenization; v0.3 focuses on updating the modules for creating the database, although there are a number of other improvements as well. I'm going to name this one after Laurence Sterne for having better documentation and being better able to incorporate references from a variety of sources.

New Features

Easily supplementing metadata.

The most important improvement is a new scheme that allows metadata to be added to an existing bookworm in JSON or TSV format. Documentation is available here. This makes it easy to extend a bookworm with any sort of metadata keyed on an existing metadata field: adding genders to authors, decades to years, genres to books, and so forth.
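As a rough illustration of the idea (not the actual Bookworm code or file layout), supplementing metadata amounts to joining a small table onto existing records through a field they already share. The function name `supplement`, the sample records, and the TSV columns below are all made up for this sketch.

```python
import csv
import io

def supplement(records, tsv_text, key):
    """Attach new fields to each record by matching on an existing metadata field."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Map each value of the key field to the extra fields supplied for it.
    lookup = {row[key]: {k: v for k, v in row.items() if k != key} for row in reader}
    # Records whose key has no match in the TSV are returned unchanged.
    return [{**rec, **lookup.get(rec.get(key), {})} for rec in records]

# Hypothetical existing metadata: one record per document.
documents = [
    {"title": "Volume one", "year": "1759"},
    {"title": "Volume nine", "year": "1767"},
    {"title": "Undated pamphlet"},
]

# Hypothetical supplementary file: decades keyed on the existing "year" field.
year_to_decade = "year\tdecade\n1759\t1750\n1767\t1760\n"

enriched = supplement(documents, year_to_decade, key="year")
```

Here every document with a known year gains a `decade` field, while the undated record passes through untouched.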

Extension system

Non-base features can now be added through an extension system. Further information is here.

Usability improvements

New Documentation

Documentation for the creation of Bookworms has been added to the new gh-pages branch of the repository and is available in gitbook form at http://bmschmidt.github.io/Presidio/. This is a work-in-progress, but should greatly improve existing documentation. We should start retiring/folding in older documentation to avoid confusion.

Changing GitHub branches.

We've been suffering a proliferation of branches before this release. Old versions will now be tracked by release tags rather than by a branch kept around for that purpose: all work will take place on dev, dev--stable will be retired, and master will take its place.

Config file to handle commands.

The first step in bookworm creation, handled automatically by the makefile, is the creation of a configuration file at bookworm.cnf that includes local password, database name, and username settings. This greatly simplifies some of the problems users have had with mysql configurations, and provides a location for further customizations to be easily held.
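To give a sense of how scripts might consume such a file, here is a minimal sketch that assumes bookworm.cnf uses the INI-style layout of a MySQL option file. The `[client]` section and the key names are assumptions made for this example, not a description of the file the makefile actually writes.

```python
import configparser

# Assumed layout; the real bookworm.cnf may use different sections and keys.
SAMPLE_CNF = """
[client]
user = example_user
password = example_password
database = example_bookworm
"""

def load_settings(text):
    """Parse INI-style text and return the three connection settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    client = parser["client"]
    return {
        "user": client["user"],
        "password": client["password"],
        "database": client["database"],
    }

settings = load_settings(SAMPLE_CNF)
```

Keeping these values in one file is what lets later commands run without asking for a database name or password each time.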

Better syntax for OneClick.py

Thanks to the config file, the awkward old syntax requiring you to name your bookworm and some passwords from the command line can be retired. OneClick.py now provides a consistent command line interface to the bookworm that allows input in the format python OneClick.py command [arguments]. So the new metadata-adding script can be invoked by running the appropriate command. Whereas previous versions assumed you'd run OneClick only once (hence the name), it's now fully agnostic; it handles, for example, the new table recreation scripts.
To match this, OneClick.py will be renamed "bookworm.py" in some future version. It's also now possible that we could simply use one system-wide installation of bookworm.
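The `command [arguments]` shape can be sketched as a small dispatch table. This is only an illustration of the pattern, not the real OneClick.py: `reloadAllMemory` is a command named in these notes, while `hypotheticalCommand` and both handler functions are placeholders invented for the example.

```python
import argparse

def reload_all_memory(arguments):
    # Placeholder handler; the real command rebuilds memory tables.
    return "reloading memory tables"

def hypothetical_command(arguments):
    # Placeholder for any other command that takes arguments.
    return "ran with: " + ", ".join(arguments)

# Map command names to the functions that implement them.
COMMANDS = {
    "reloadAllMemory": reload_all_memory,
    "hypotheticalCommand": hypothetical_command,
}

def main(argv):
    """Parse `command [arguments]` and dispatch to the matching handler."""
    parser = argparse.ArgumentParser(prog="OneClick.py")
    parser.add_argument("command", choices=sorted(COMMANDS))
    parser.add_argument("arguments", nargs="*")
    namespace = parser.parse_args(argv)
    return COMMANDS[namespace.command](namespace.arguments)

result = main(["hypotheticalCommand", "authors.tsv", "author"])
```

Adding a new command then means adding one function and one dictionary entry, which fits a script that is no longer run just once.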

More convenient handling of memory tables

Bookworm holds tables in memory. These must be recreated on startup. Previously the code for this was written in SQL and had to be handled by a cron job. Now the bindings are written directly in Python, and there is a new method, python OneClick.py reloadAllMemory, that will immediately update every v0.3 or later Bookworm on the server if and only if the memory tables are empty. This script will still need an assigned cron job to run automatically.
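The "if and only if the memory tables are empty" check can be sketched as follows. This uses Python's built-in sqlite3 purely as a stand-in for MySQL, and the table names and the `reload_if_empty` helper are invented for the example; the real reloadAllMemory code may work quite differently.

```python
import sqlite3

def reload_if_empty(conn, memory_table, disk_table):
    """Copy disk_table into memory_table only when memory_table has no rows."""
    count = conn.execute(f"SELECT COUNT(*) FROM {memory_table}").fetchone()[0]
    if count > 0:
        return False  # Already populated, so leave it alone.
    conn.execute(f"INSERT INTO {memory_table} SELECT * FROM {disk_table}")
    conn.commit()
    return True

# Stand-in database with hypothetical table names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words_disk (wordid INTEGER, word TEXT)")
conn.execute("CREATE TABLE words_memory (wordid INTEGER, word TEXT)")
conn.executemany("INSERT INTO words_disk VALUES (?, ?)", [(1, "the"), (2, "of")])

first_run = reload_if_empty(conn, "words_memory", "words_disk")   # table was empty
second_run = reload_if_empty(conn, "words_memory", "words_disk")  # table now filled
```

Because the check makes repeated runs harmless, a cron job can call the command on a schedule without reloading tables that are already in place.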

Performance changes

I'd usually say 'improvements,' but this is actually a mixed bag. The tokenization scripts have been rewritten to take much less disk space and to place much smaller demands on active memory while running. Some users found the working disk space requirements excessive, and I tend to agree.
The binary files introduced in Munro were slightly more efficient than previous versions, but still hugely bloated. Bookworm no longer stores intermediate phases in binary; instead, Bookworm

  1. assembles a word list by tokenizing the entire corpus, and assigns codes to the million most common words,
  2. directly encodes the files in large batches, retokenizing in the process.
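The two passes above can be sketched in a few lines. This is a simplified illustration rather than the Bookworm tokenizer: the `\w+` regular expression, the function names, and the tiny corpus are assumptions, and the cutoff is shrunk from a million words to two so the result is easy to follow.

```python
import re
from collections import Counter

# Assumed token pattern; Bookworm's actual regular expression may differ.
TOKEN = re.compile(r"\w+")

def tokenize(text):
    return TOKEN.findall(text.lower())

def build_wordlist(texts, max_words=1_000_000):
    """Pass 1: count every token and give codes to the most common words."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    ranked = counts.most_common(max_words)
    return {word: code for code, (word, _count) in enumerate(ranked, start=1)}

def encode(texts, wordlist):
    """Pass 2: retokenize each text and emit codes, dropping unlisted words."""
    for text in texts:
        yield [wordlist[word] for word in tokenize(text) if word in wordlist]

corpus = ["the cat sat on the mat", "the dog sat"]
wordlist = build_wordlist(corpus, max_words=2)
encoded = list(encode(corpus, wordlist))
```

Nothing is stored between the passes except the word list itself, which is why the text is tokenized twice. A word list produced elsewhere could be handed straight to `encode`, skipping the first pass entirely.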

This isn't a free lunch. While it reduces the load on disk space and I/O operations, it may be more time-consuming. (Essentially, we now run a tokenizing script instead of parsing massive cPickle objects.) With small texts, it may be faster; with book-length texts, it's probably a bit slower than the previous version (though still, I'd wager, faster than v0.1). But it also opens up the possibility of fast operations where a predefined word list from another bookworm is simply cloned at the start, which means some processes can be made dramatically faster if they don't require their own word list.

Given the simplicity of the new regular expression format, I think that eventually this code should probably all be rewritten in a language other than Python: probably either C++ or, if someone's feeling interested, Go.