Releases: Bookworm-project/BookwormDB
3-Alpha
A major set of changes that will either revive the project or kill it dead!
The changes that make the biggest difference for users are:
- Tables no longer need to reside in memory, and bookworm will still run after a restart.
- Apache and MySQL configuration steps are simplified. Rather than having code go through gymnastics to create MySQL passwords, you're now responsible yourself for getting a valid admin password into ~/.my.cnf (a sketch for checking that file follows this list).
- The API now runs as a separate process by default on port 10012. You can expose this directly, which should be fine for most uses but may be vulnerable to DDoS; or you can mask it behind an Apache/Nginx location.
- The whole stack now runs on Python 3 instead of Python 2.
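If you're not sure whether your ~/.my.cnf is in order, a minimal sketch like the one below (not part of bookworm itself; it just follows the standard MySQL client-file conventions) shows what the build is likely to find:

```python
# Minimal sketch: check whether ~/.my.cnf exists and carries client credentials.
# The expected contents are roughly:
#   [client]
#   user = root
#   password = YOUR_PASSWORD
import configparser
import os

path = os.path.expanduser("~/.my.cnf")
config = configparser.ConfigParser(allow_no_value=True, interpolation=None)
found = config.read(path)

if found and "client" in config:
    print("Found [client] credentials:", dict(config["client"]))
else:
    print("No usable ~/.my.cnf; bookworm will fall back to root with no password.")
```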
Changes to API and serving exposure
The old linechart GUI is no longer supported.
We fixed the API in about 2012, but the linechart still uses some 2011 terms to access the API. Backward compatibility makes this project extremely hard to support. It would definitely be worth re-wrapping the UI elements around the API or updating the project there; but the database end is no longer going to carry all of that old code inside of it.
CGI scripts now run as wsgi.
Scripts are now wsgi instead of cgi, which significantly reduces latency; things like pandas imports now run only once, rather than once for each query. This also means that the old 'bookworm query' method, which starts a new service, will be much slower than simply hitting the web server.
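To illustrate why that matters, here's a minimal WSGI application in the same pattern (the response and module layout are made up for the example, not the actual bookworm API code): the heavy imports happen once when the worker starts, and every request just calls the function.

```python
import json
import pandas as pd  # expensive import: runs once per worker process, not per request


def application(environ, start_response):
    # Each request reuses the modules already loaded above.
    body = json.dumps({"status": "ok"}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```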
Userland serving of the API through gunicorn
You can still run the webserver through Apache if you want; but now the default assumption is that you'll serve the API separately on its own port. This is configurable, but defaults to 10012. A sensible production approach would be to force that port to listen only to local requests, and to have Apache (or Nginx or whatever) pass API queries through to 10012. The following snippet passes all cgi-bin requests through to localhost:10012 in Apache.
<Proxy *>
Order deny,allow
Allow from all
</Proxy>
ProxyPreserveHost On
<Location "/cgi-bin">
ProxyPass "http://127.0.0.1:10012/"
ProxyPassReverse "http://127.0.0.1:10012/"
</Location>
The problem with the old setup was that Apache (i.e., the httpd user) needed to live in the same world of python dependencies. This required all sorts of mucking about with root privileges. By running everything as an ordinary user, the process is greatly simplified.
Export as csv or feather.
For programming, you could always export data as "tsv" or "json". Now "csv" (self-explanatory) and "feather" are possible as well. For reading into R and python, feather provides a binary serialization that should be smaller and faster to read than CSV in most cases, while preserving the type attributes of data from the original Bookworm. I recommend it for anything data-science-y on wordcounts.
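A rough example of reading such an export back into pandas (the filename here is invented):

```python
# Reading a feather export back into pandas; requires pyarrow.
import pandas as pd

df = pd.read_feather("bookworm_counts.feather")
print(df.dtypes)   # column types survive the round trip, unlike CSV
print(df.head())
```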
Fast memory tables are optional
Bookworm likes to use fast hash-based memory tables to speed up lookup operations. But these have to sit open in memory, and regular operations have to be scripted to make sure they are refilled after a restart.
This version adds an on-disk version of the lookup tables as the default use case; these are still indexed and reasonably fast, but take no space in memory (other than that defined by the cache).
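In concrete terms the difference is just the storage engine on the lookup tables. A sketch in MySQL terms (the table and column names are invented, not bookworm's actual schema, and "my_bookworm" is a placeholder database):

```python
# Illustrative only: contrasts the old in-memory lookup tables with the new
# on-disk default. Requires the mysqlclient package and a running MySQL server.
import os
import MySQLdb

conn = MySQLdb.connect(read_default_file=os.path.expanduser("~/.my.cnf"),
                       db="my_bookworm")
cur = conn.cursor()

# Old style: fast lookups, but lives in RAM and is empty after a restart.
cur.execute("""CREATE TABLE lookup_memory (
                 wordid MEDIUMINT, word VARCHAR(30), PRIMARY KEY (word)
               ) ENGINE=MEMORY""")

# New default: indexed on-disk table; still reasonably fast, takes no RAM
# beyond the cache, and nothing needs rebuilding after a restart.
cur.execute("""CREATE TABLE lookup_disk (
                 wordid MEDIUMINT, word VARCHAR(30), PRIMARY KEY (word)
               ) ENGINE=InnoDB""")
```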
Better Unicode support.
- The default tokenization now uses "\w" instead of "\p{L}" for word characters, which appears to work better in Arabic and Sanskrit (among other languages?); a short example follows this list.
- Test suite includes Cherokee and Arabic.
- Databases now explicitly force utf8 collation.
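The sketch below (using the third-party regex module, which understands \p{L}) shows the practical difference on a Devanagari word, where the combining vowel signs are marks rather than letters; the sample strings are just examples, not taken from the test suite.

```python
import regex  # third-party module; supports \p{L} and a Unicode-aware \w

sample = "संस्कृतम् ᏣᎳᎩ"   # Sanskrit (with combining vowel signs) and Cherokee
print(regex.findall(r"\w+", sample))      # new default: keeps combining marks attached
print(regex.findall(r"\p{L}+", sample))   # old letter-only class: can split words at marks
```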
Under-the-hood improvements, primarily modernizing the codebase.
Python3
The codebase is now in python3 to support further development. I do not plan to back-support python2.
This was done with the help of an autotranslate, so it's possible there will be some unexpected consequences that pop up.
Cleaner under-the-hood operations
The previous python module used to have an underlying Makefile
and one or two remaining calls to perl, as well as GNU parallel (which had
an obnoxious citation message). Those are gone. Parallelism in the build
is all handled in python, which appears to speed things up a bit because there's
less passing things in and out of pipes.
bounter for word counts.
One slightly more controversial choice is that the initial wordcounting is no longer exact, but instead uses the 'bounter' package from the makers of gensim. This uses probabilistic counts to decide where the cutoff for the most frequent words will be; it will give different results in some cases.
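Roughly how the package works (the memory budget and toy corpus here are arbitrary, not what bookworm actually passes):

```python
from bounter import bounter

# Counter bounded to ~128 MB of memory; counts become approximate rather than
# exact as the structure fills up.
counts = bounter(size_mb=128)
for doc in ("the quick brown fox", "the lazy dog"):
    counts.update(doc.split())

print(counts["the"])         # approximate count (here, 2)
print(counts.cardinality())  # approximate number of distinct words
```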
No longer requires root
This makes installation much easier. Everything that used to require a root password has been ripped out.
Easier MySQL password handling
The previous MySQL configuration was preachy about security. Now, it does the following (a sketch of the connection logic follows this list):
- The user must set up MySQL to allow root access through a file at ~/.my.cnf or ~/my.cnf. If there is no file there, bookworm tries to log into mysql as 'root' with no password; if that fails, you can't build.
- By default, bookworm handles all connections as a read-only user named 'bookworm' with no password. This user cannot modify any databases, and has SELECT access only to Bookworm databases, so there are no longer any attempts made to force it to have a secure password.
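In rough terms the connection logic looks something like this sketch (illustrative, not the actual bookworm code; "my_bookworm" is a placeholder database name):

```python
import os
import MySQLdb  # from the mysqlclient package


def admin_connection():
    """Admin access for building: ~/.my.cnf or ~/my.cnf if present, else root with no password."""
    for candidate in ("~/.my.cnf", "~/my.cnf"):
        path = os.path.expanduser(candidate)
        if os.path.exists(path):
            return MySQLdb.connect(read_default_file=path)
    return MySQLdb.connect(user="root", passwd="")


def client_connection(database="my_bookworm"):
    """Read-only access for serving queries: the passwordless 'bookworm' user."""
    return MySQLdb.connect(user="bookworm", passwd="", db=database)
```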
Parkman Version 1
A fairly routine update, but easier to install.
Unit Testing
Significantly better unit tests and incorporation into a Travis framework.
pip install . is now the recommended way to install the platform.
For new contributions, we'll try to add more unit tests to allow easier merging of suggested changes.
Configuration handling
MySQL configurations are now reliably handled through a central Configfile instance. This reduces some of the headaches where mysql passwords are read from multiple places.
bookworm config mysql now works better as a result, reducing the need for users to muck around in configuration files and type SQL queries themselves.
Other changes
- API now includes CORS headers for cross-domain requests: 35cf1ef
- Some hardcoded host files have been removed
- Missing feature count ingest code has been restored
- Code is more style-compliant
- Some outdated code blocks are removed.
Whitman revision 1.
A variety of incremental improvements and bugfixes.
Performance improvements
- add_metadata now automatically reboots the tables it created, and only those tables.
- The underlying script no longer constructs a useless dictionary when the identifier is bookid; supplementary data keyed to bookid can now be loaded extremely quickly. This and the preceding are good for the HTRC workset use case.
Code niceness
- The test suite is more fully empowered, and the tests are now independent of each other. (Although there are still a bunch of assertions in the primary test, since it's sequential on itself.)
- Adding a single class in configuration.py to handle searching for mysql configuration files; this should start to pay dividends as we get these files in various different places.
- configuration.py now folds in the code from fix_config.py; there are still other modules that could be consolidated.
- A few vestigial code blocks have been deleted.
Bug fixes
- Supplementing metadata from a tsv with no field descriptions file works again.
- Several necessary files for the python setuptools script are now present.
- Using console_scripts in setuptools to handle provisioning of binaries, since that seems to work better.
Whitman
Overview
This is a major set of ease-of-use improvements defining a new server side API. The most important elements are:
- The whole release is organized as a python module; rather than clone the repo for each bookworm, you now install it system-wide.
- All build and update tools are bundled inside a command line executable installed with the package and accessible as bookworm. After installing, type bookworm --help for an outline of the functions.
- The API code, previously a separate repo, has been folded into this framework, accessible through the command line as bookworm query.
- An easy-to-set-up webserver is available through bookworm serve for testing and local data exploration.
- Setup help is included under bookworm config mysql.
- A new command, bookworm add_metadata, makes it easier to extend the metadata around a library.
- The beginnings of a test suite are included, including most importantly a standardization of the Federalist papers test set to make sure that Bookworm is configured properly.
It also incorporates various bugfixes. A slightly more extended discussion is available here.
I'm naming it "Whitman" because, much like Leaves of Grass, this version is folding a lot of previous stuff from various sources into one updated omnibus volume.
I'm going to let this sit for a little while for comment, but hopefully we can move this straight into a 1.0 alpha release. I would consider doing this now, except that at 1.0, the vocabulary needs to be set in stone to abide by semantic versioning.
Usability improvements
- A more consistent API with command line help. Type bookworm --help after installing for some examples.
- All submodules now use the logging module to write errors. This means that problems can be more easily debugged by running with bookworm --log-level=DEBUG ... without extraneously dumping all that stuff out.
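The pattern inside the submodules is just the standard library logging module, roughly like this (the function and messages are illustrative):

```python
import logging

log = logging.getLogger(__name__)


def ingest(catalog):
    log.debug("per-row details that only appear at --log-level=DEBUG")
    log.info("ingesting catalog %s", catalog)
```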
Sterne-stable
Stable release of the old creation API.
This fixes a number of bugs from 0.3-beta over the last several months, and adds some features specifically to work with the Hathi files.
The changes are particularly important for the incorporation of outside metadata into pre-existing bookworms.
From this point on, new development will take a different folder structure, so I'm tagging this release as a stable point for the old server-side API.
Sterne beta
This includes some minor performance fixes principally focused on Bookworms of over 1 million documents, and some bugfixes. All other branches have been deleted, so we can now focus solely on developing master for the time being.
It's only been tested on smaller productions (indeed, a full test suite is a real need); I would love to hear about any errors.
New features
- Some additional date rounding features; rounding is still not fully documented, but the general terms are available from the code in parseMetadata.py
Performance Improvements
- Booklists are now saved on disk using the anydb python module, which saves a lot of memory in the encoding step for large bookworms.
- The settings to parallel have been tweaked to run slightly faster.
- Catalog parsing now takes place in parallel as well; with sets over 1 million volumes, this can be a significant improvement.
Bugfixes
- Variables with the same name as a reserved MySQL word will now be dropped.
Usability improvements
- Fewer branches.
- Less extraneous output written to stdout during a build.
- Warning messages more consistently sent to stderr rather than stdout.
Sterne
Rather than do a beta on v0.2, we'll roll straight ahead to 0.3. Existing bookworms that want to take full advantage should be rebuilt from their old folders.
v0.2 was centered on the ingest format for tokenization; v0.3 focuses on updating the modules for creating the database, although there are a number of other improvements as well. I'm going to name this one after Laurence Sterne for having better documentation and being better able to incorporate references from a variety of sources.
New Features
Easily supplementing metadata.
The most important improvement is the addition of a new scheme that allows metadata to be added to an existing bookworm in json or TSV format. Documentation is available here. This makes it easy to extend a bookworm with any sort of metadata about any existing metadata field: adding genders to authors, decades to years, genres to books, and so forth.
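As a purely illustrative example of the idea (the field names and file layout below are invented; see the linked documentation for the actual format), a supplement keyed on an existing field might be a small table mapping authors to genders, generated like so:

```python
# Write a tab-separated supplement keyed on an existing metadata field.
# Field names and values are made up for illustration.
import csv

rows = [
    {"author": "Whitman, Walt", "author_gender": "m"},
    {"author": "Dickinson, Emily", "author_gender": "f"},
]
with open("author_genders.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "author_gender"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```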
Extension system
Non-base features can now be added through an extension system. Further information is here
Usability improvements
New Documentation
Documentation for the creation of Bookworms has been added to the new gh-pages branch of the repository and is available in gitbook form at http://bmschmidt.github.io/Presidio/. This is a work-in-progress, but should greatly improve existing documentation. We should start retiring/folding in older documentation to avoid confusion.
Changing Github branches.
We've been suffering a proliferation of branches before this release. Rather than keep a master branch around for old versions, that will now be handled by release tags: all work will take place on dev, and dev--stable will be retired while master takes its place.
Config file to handle commands.
The first step in bookworm creation, handled automatically by the makefile, is the creation of a configuration file at bookworm.cnf that includes local password, database name, and username settings. This greatly simplifies some of the problems users have had with mysql configurations, and provides a location for further customizations to be easily held.
Better syntax for OneClick.py
Thanks to the config file, the awkward old syntax requiring you to name your bookworm and some passwords from the command line can be retired. OneClick.py now provides a consistent command line interface to the bookworm that allows input in the format python OneClick.py command [arguments]. So the new metadata-adding script can be invoked by running the appropriate command. Whereas previous versions assumed you'd only run OneClick once (hence the name), it's now fully agnostic; it handles, for example, the new table recreation scripts.
To match this, OneClick.py will be renamed "bookworm.py" in some future version. It's also now possible that we could simply use one system-wide installation of bookworm.
More convenient handling of memory tables
Bookworm holds tables in memory. These must be recreated on startup. Previously the code for this was written in SQL and had to be handled by a cron job. Now the bindings are written directly in python, and there is a new method, python OneClick.py reloadAllMemory, that will immediately update every v0.3 or later Bookworm on the server if and only if the memory tables are empty. This script will need an assigned cron job to run automatically.
Performance changes
I'd usually say 'improvements,' but this is actually a mixed-bag. The tokenization scripts have been rewritten to take much less disk-space and to place much smaller demands on active memory while running. Some users found the working-disk-space requirements excessive, and I tend to agree.
The binary files introduced in Munro were slightly more efficient than previous versions, but still hugely bloated. Bookworm no longer stores intermediate phases in binary; instead, Bookworm
- assembles a word list by tokenizing the entire corpus, and assigns codes to the million most common words,
- directly encodes the files in large batches, retokenizing in the process.
This isn't a free lunch. While reducing the load on disk space and I/O operations, this may be more time-consuming. (Essentially, we now run a tokenizing script instead of parsing massive cPickle objects). With small texts, it may be faster: with book-length texts, it's probably a bit slower than the previous version. (Though still, I'd wager, faster than v0.1). But it also introduces the possibility of fast operations where a predefined word list from another bookworm can simply be cloned at the start, which means some processes can be made dramatically faster if they don't require their own word list.
Given the simplicity of the new regular expression format, I think that eventually this code should probably all be rewritten in a language other than python: probably either C++ or, if someone's feeling interested, Go.
Munro alpha
This is the first version of a release that should be much faster in a number of ways. It's been tested on a couple sets, but is not yet thoroughly debugged. I'm tagging it as a pre-release for that version, but would love for anyone to try it.
I'm calling it Munro because we should name these releases after authors, obviously, and the distinguishing feature of this release is that it handles large numbers of short texts much better than the old ones.
Speed Improvements
The big change is that where previous versions required individual files for every text, the new version handles texts in chunks which greatly reduces the number of disk reads required. I haven't fully benchmarked it, but for the History dissertation set (30,000 extremely small texts) it reduced the overall build time (including database formatting, which hasn't changed) from about 30 seconds to 8 seconds.
Relatedly, it enables a new input format; instead of a directory of files, you can now upload a single text file with your entire archive. The first word of each line is the filename; it should be followed by a tab and the full text of the text identified by the filename. One outstanding question is whether lines past a certain size might somehow cause python to break. It's certainly not possible to use a file larger than the computer's available RAM; but it's hard to imagine that happening nowadays.
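A tiny sketch of that format, with made-up identifiers and texts:

```python
# Each line of the input file: an identifier, a tab, then the full text of
# that document (tabs and newlines in the text itself must be flattened out).
documents = {
    "doc_001": "Call me Ishmael. Some years ago, never mind how long precisely...",
    "doc_002": "It was the best of times, it was the worst of times.",
}
with open("corpus.txt", "w", encoding="utf-8") as f:
    for name, text in documents.items():
        f.write("%s\t%s\n" % (name, text.replace("\t", " ").replace("\n", " ")))
```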
The intermediary stages, which previously stored tables of wordcounts for every text, now use the cPickle module to store batches (specifically, a new python object of class tokenBatches) of the counts for about 10MB worth of texts at a time. That 10MB number is arbitrary, but seems to work: we might want to choose it more dynamically. For one Bookworm with about 1.6m documents of about a paragraph each, that reduces the number of tokenization files down to 88. The wordcounting and tokenization scripts use native methods of the tokenBatches object to do their counting and encoding; avoiding the need to parse to and from CSVs seems to help speed things up a bit more.
Code Improvements
Tokenization is now handled by a new python submodule, bookworm.tokenizer. The old version was a tangle of perl code with all sorts of Internet Archive-specific gobbledegook based around substitutions that tried to surround words with spaces and capture English sentence-ending code; now, it takes a MALLET-inspired approach and simply tries to capture all the elements of a word with one big regex. (In fact, it exports an object called bigregex that constitutes the Bookworm definition of a token: that has previously been a really opaque definition even for those who know the code, so this is a big step towards transparency.)
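In spirit, the approach looks like the pattern below; this is a simplification for illustration, not the actual bigregex exported by bookworm.tokenizer.

```python
import regex  # the third-party module listed under "New Dependencies"

# One pattern that captures either a run of letters or a run of non-letter,
# non-space characters, in the MALLET-inspired style described above.
big_regex = regex.compile(r"\p{L}+|[^\p{L}\s]+")
print(big_regex.findall("Whitman's Leaves of Grass, 1855 edition."))
```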
The old python code seemed to have been dropping unicode characters at a certain point: they are hopefully once again supported.
A number of extraneous old modules and scripts, including tokenizeAndEncodeFiles and the various perl scripts in /scripts, have been cleared out.
Architecture changes
One last change, which is more in the neutral category.
This version continues a transition, begun in most of my edits over the last year or so, farther away from the old model where the python script OneClick.py spawns everything, and towards a layout where a Makefile defines all the targets. This is primarily due to the issues we've had with getting sensible parallelization to work in python, but has some nice side effects as well; Make handles the progressive stages of builds quite nicely, and GNU parallel distributes jobs across processors extremely simply.
For the end user, the difference is simply that instead of calling python OneClick.py bookworm username password, you call make all bookwormName=bookworm. Some form of automatic web site creation should be included shortly.
New Dependencies
This requires a couple of new dependencies on any system.
Python packages:
- regex (to handle unicode regular expressions)
- cPickle
System utilities
- GNU parallel (Oversees job creation very easily)