
3-Alpha

Pre-release
@bmschmidt bmschmidt released this 15 Sep 19:12
· 3 commits to py3 since this release

A major set of changes that will either revive the project or kill it dead!

The changes that make the biggest difference for users are:

  1. Tables no longer need to reside in memory, and bookworm will still run after a restart.
  2. Apache and MySQL configuration steps are simplified. Rather than having code go through
    gymnastics to create MySQL passwords, you are now responsible for getting a valid admin password into ~/.my.cnf yourself.
  3. The API now runs as a separate process, by default on port 10012. You can expose this directly, which
    should be fine for most uses but may be vulnerable to DDoS; or you can mask it behind an Apache/Nginx location.
  4. The whole stack now runs on Python 3 instead of Python 2.

Changes to API and serving exposure

The old linechart GUI is no longer supported.

We fixed the API in about 2012, but the linechart still uses some 2011 terms to access the API. Backward compatibility makes this project extremely hard to support. It would definitely be worth re-wrapping the UI elements around the API or updating the project there; but the database end is no longer going to carry all of that old code inside of it.

CGI scripts now run as wsgi.

Scripts are now wsgi instead of cgi, which significantly reduces latency; things like pandas imports now only run once, rather than once for each query. This also means that the old 'bookworm query' method which starts a new service will be much slower than simply hitting the web server.
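To illustrate why wsgi cuts latency, here is a minimal sketch of a WSGI application. It is not Bookworm's actual handler: the names `LOADED_AT` and `REQUESTS_SERVED` and the JSON payload are invented for illustration. The point is only that module-level setup (where something like a pandas import would live) runs once per worker process, while the function body runs once per query.

```python
import json
import time

# Module-level code runs once, when the server process imports this file.
# Under CGI, the equivalent setup would be repeated for every request.
LOADED_AT = time.time()
REQUESTS_SERVED = 0  # hypothetical counter, only here to make the point visible


def application(environ, start_response):
    """Minimal WSGI callable: reports when the module was loaded."""
    global REQUESTS_SERVED
    REQUESTS_SERVED += 1
    payload = {
        "loaded_at": LOADED_AT,
        "requests_served": REQUESTS_SERVED,
        "query": environ.get("QUERY_STRING", ""),
    }
    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Any WSGI server can host a callable like this. With gunicorn, that would look something like `gunicorn --bind 127.0.0.1:10012 sketch:application`, where `sketch` is a hypothetical module name.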

Userland serving of the API through gunicorn

You can still run the webserver through Apache if you want; but now the default assumption is that you'll serve the API separately on its own port. The port is configurable; the default is 10012. A sensible production approach would be to force that port to listen only to local requests, and to have Apache (or Nginx or whatever) pass API queries through to 10012. The following snippet passes all cgi-bin requests through to localhost:10012 in Apache.

    <Proxy *>
      Order deny,allow
      Allow from all
    </Proxy>
    ProxyPreserveHost On
    <Location "/cgi-bin">
      ProxyPass "http://127.0.0.1:10012/"
      ProxyPassReverse "http://127.0.0.1:10012/"
    </Location>

The problem with the old setup was that Apache (i.e., the httpd user) needed to live in the same world
of python dependencies as the rest of Bookworm. That required all sorts of mucking about with root privileges. Running everything as an ordinary user greatly simplifies the process.

Export as csv or feather.

For programming, you could always export data as "tsv" or "json". Now "csv" (self-explanatory) and "feather" are available as well. For reading into R and python, feather provides a binary serialization
that should be smaller and faster to read than CSV in most cases, while preserving the type attributes
of data from the original Bookworm. I recommend it for anything data-science-y on wordcounts.
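As a rough illustration of what "preserving the type attributes" buys you, the sketch below round-trips a record through CSV using only the standard library. The sample row and its field names are invented, not real Bookworm output. Everything comes back as a string, so the consumer has to re-guess the types; a typed binary format such as feather avoids that step.

```python
import csv
import io

# Invented sample data; real Bookworm results will have other fields.
rows = [{"year": 1850, "word": "whale", "WordCount": 212}]


def csv_round_trip(records):
    """Write records to CSV text and read them straight back."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


restored = csv_round_trip(rows)
# The integers written above come back as text.
print(type(rows[0]["year"]).__name__, "->", type(restored[0]["year"]).__name__)
```

In pandas, `pandas.read_feather` is the usual entry point for reading a feather file into a data frame with its column types intact.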

Fast memory tables are optional

Bookworm likes to use fast hash-based memory tables to speed up lookup operations. But these tables have to stay resident in memory, and you have to script regular operations to make sure they are refilled after a restart.

This version adds an on-disk version of the lookup tables as the default use case; these are still indexed and reasonably fast, but take no space in memory (other than that defined by the cache).

Better Unicode support.

  • The default tokenization now uses "\w" instead of "\p{L}" for word characters, which appears to work better in Arabic and Sanskrit (among other languages?)
  • Test suite includes Cherokee and Arabic.
  • Databases now explicitly force utf8 collation.
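One plausible reason a letters-only class falls short (my reading, not something the notes above confirm) is that many scripts attach vowel signs and diacritics as combining marks, which belong to the Unicode "M" categories rather than "L". The sketch below does not use Bookworm's tokenizer or any regex engine; it is a hypothetical `tokenize` helper that classifies characters by Unicode category to show how a word gets chopped up when marks are excluded.

```python
import unicodedata


def tokenize(text, keep_marks):
    """Split text into runs of letters, optionally keeping combining marks."""
    tokens, current = [], []
    for char in text:
        category = unicodedata.category(char)
        is_word_char = category.startswith("L") or (
            keep_marks and category.startswith("M")
        )
        if is_word_char:
            current.append(char)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# The Hindi word for "Hindi": three letters interleaved with three marks.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
letters_only = tokenize(hindi, keep_marks=False)  # three one-letter fragments
with_marks = tokenize(hindi, keep_marks=True)     # the whole word, intact
```

Exactly which characters "\w" covers depends on the regex engine in use, so treat this as a picture of the failure mode rather than a description of the new default.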

Under-the-hood improvements, primarily to modernize the codebase.

Python3

The codebase is now in python3 to support further development. I do not plan to back-support python2.

This was done with the help of an automatic translation tool, so it's possible that some unexpected consequences will pop up.

Cleaner under-the-hood operations

The previous python module had an underlying Makefile
and one or two remaining calls to perl, as well as GNU parallel (which had
an obnoxious citation message). Those are gone. Parallelism in the build
is now handled entirely in python, which appears to speed things up a bit because there's
less passing things in and out of pipes.

bounter for word counts.

One slightly more controversial choice is that the initial wordcounting is no longer
exact, but instead uses the 'bounter' package from the makers of gensim. This uses probabilistic counts
to decide where the cutoff for the most frequent words will be; it will give different results in some cases.
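To give a feel for what "probabilistic counts" means, here is a generic count-min sketch in plain Python. This is a swapped-in textbook technique, not bounter's code or API, and the `CountMinSketch` class and its `width`/`depth` parameters are my own naming. It stores counts in a fixed-size table, so memory stays bounded, at the cost of sometimes overestimating when different words land in the same slots.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, token):
        # One independent hash per row, derived by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                token.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, token, count=1):
        for row, col in self._buckets(token):
            self.rows[row][col] += count

    def estimate(self, token):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._buckets(token))


sketch = CountMinSketch()
for word in "the cat and the hat and the bat".split():
    sketch.add(word)
```

Here `sketch.estimate("the")` is at least 3, and with a table this wide relative to the input it will almost always be exactly 3. Shrink the table and the estimates drift upward, which is why a frequency cutoff computed this way can differ from an exact count.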

No longer requires root

This makes installation much easier. Everything that used to require a root password has been ripped out.

Easier MySQL password handling

The previous MySQL configuration was preachy about security. Now, it does the following.

  1. The user must be able to set up their MySQL to allow root access through a file at ~/.my.cnf or ~/my.cnf. If there is no file there, it tries to log into mysql as 'root' with no password; if that fails, you can't build.

  2. By default, bookworm handles all connections as a read-only user named 'bookworm' with no password. This user cannot modify any databases, and has SELECT access only to Bookworm databases, so there are no longer any attempts made to force it to have a secure password.
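The lookup order in step 1 can be sketched roughly as follows. This is not Bookworm's code: `find_admin_credentials` is a hypothetical helper, and I am assuming the credentials sit in a `[client]` section, which is the usual MySQL option-file convention. Python's `configparser` also will not cope with every option file (for example, ones using `!include` lines), so treat this as an outline of the fallback logic only.

```python
import configparser
from pathlib import Path


def find_admin_credentials(home):
    """Return admin credentials from an option file, else passwordless root."""
    for name in (".my.cnf", "my.cnf"):
        path = Path(home) / name
        if not path.is_file():
            continue
        parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
        parser.read(path)
        if parser.has_section("client"):  # assumed section name
            client = parser["client"]
            return {
                "user": client.get("user", fallback="root"),
                "password": client.get("password", fallback=""),
                "source": str(path),
            }
    # No usable file: fall back to 'root' with no password, as described above.
    return {"user": "root", "password": "", "source": None}
```

Calling `find_admin_credentials(Path.home())` would check the two locations in the current user's home directory; if the fallback login is then refused by MySQL, the build cannot proceed.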