Calibre version 2.85.1 - Downloadable from here - needed to convert the ebooks to text files.
Python 2.7 - I have Anaconda Python 2.7 version installed.
Spark 2.2.0 - Can be downloaded from Apache.
Code that helped build the literary clock. Further explanation can be found at www.literaryclock.com.
amazon_affiliate_api.py
A code snippet showing how I used the Amazon API to get the unique ASIN identifier of the book a quote came from. Needs python-amazon-product-api (version 0.2.8) to work and an Amazon Associates Web Service account as described in the API documentation.
convert_books_wrapper.py
This is a wrapper around the Calibre command line tools to convert ebooks to txt files. Detailed documentation for the calibre command line tools can be found here.
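As a rough illustration of what such a wrapper can look like (a minimal sketch, not the code in convert_books_wrapper.py): it assumes Calibre's ebook-convert tool is on the PATH, and the helper names build_convert_command and convert_folder are made up for this example.

```python
import os
import subprocess


def build_convert_command(ebook_path, out_dir):
    """Return the ebook-convert argument list for one ebook (hypothetical helper)."""
    base = os.path.splitext(os.path.basename(ebook_path))[0]
    txt_path = os.path.join(out_dir, base + ".txt")
    # ebook-convert picks the output format from the output file extension.
    return ["ebook-convert", ebook_path, txt_path]


def convert_folder(in_dir, out_dir, extensions=(".mobi", ".epub")):
    """Convert every matching ebook in in_dir to a .txt file in out_dir."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for name in sorted(os.listdir(in_dir)):
        if name.lower().endswith(extensions):
            command = build_convert_command(os.path.join(in_dir, name), out_dir)
            # Raises CalledProcessError if Calibre reports a failure.
            subprocess.check_call(command)
```

Calling convert_folder("books", "books_txt") would then try to convert each .mobi or .epub file in books; the folder names here are placeholders.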
get_times.py
find_times.py calls the get_times function, which converts digital times of the form 11:29 into the many different ways they could be written in a book.
This file also contains the digit2word function, which turns numbers from 0 to 59 into words and is needed by the get_times function.
A possible improvement to this function would be to also generate capitalised times.
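To give a feel for the idea, here is a minimal sketch of the two functions. It is not the code in get_times.py: the real function covers many more phrasings, and the particular variants generated below are only my assumptions about the kind of output it produces.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}


def digit2word(n):
    """Turn an integer from 0 to 59 into words, e.g. 29 -> 'twenty-nine'."""
    if not 0 <= n <= 59:
        raise ValueError("expected a number from 0 to 59")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + "-" + ONES[ones]


def get_times(digital):
    """Return a few ways a time such as '11:29' might be written in a book."""
    hour, minute = [int(part) for part in digital.split(":")]
    hour12 = hour % 12 or 12          # 0 and 12 both read as "twelve"
    next12 = (hour + 1) % 12 or 12    # the hour used in "... to twelve"
    variants = ["%d:%02d" % (hour12, minute), "%d.%02d" % (hour12, minute)]
    if minute == 0:
        variants.append(digit2word(hour12) + " o'clock")
        return variants
    if minute >= 10:
        variants.append(digit2word(hour12) + " " + digit2word(minute))
    if minute <= 30:
        unit = "minute" if minute == 1 else "minutes"
        variants.append("%s %s past %s" % (digit2word(minute), unit, digit2word(hour12)))
    else:
        left = 60 - minute
        unit = "minute" if left == 1 else "minutes"
        variants.append("%s %s to %s" % (digit2word(left), unit, digit2word(next12)))
    return variants
```

Each string returned by get_times can then be searched for in the book text. All variants in this sketch are lowercase, so it shares the capitalisation limitation mentioned above.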
gutenberg_metadata.py
Having downloaded ~50,000 ebooks from Project Gutenberg as discussed in this post, I needed a way to link the downloaded filenames with the author and title of the books they contain, and this is some nifty code to do that. BeautifulSoup (version 4.5.3) does the heavy lifting, with some hacks either side of it to get the author and title as plain strings. The results are returned as a dictionary, with the filename as key and the author and title in a list as the value. I used these results to rename the 50,000 files and move them into many folders to set find_times.py on.
find_times.py
This needs to run using the PySpark API. I must confess I could not get this to work as a standalone file. Instead I removed the indentation of the main function code and ran it in the $SPARK/bin/./pyspark command line. If the folder containing the books returns more times than fit in the available memory, this will crash. Hence I try to divide the books between many folders and run the code iteratively for each folder. The code expects the file name to be of the form author - book title.txt. If the book is part of a series, I have also added code to deal with author - book title (nth in series title); basically all I want from the filename is the author and book title (and it expects them in that order, separated by -). The output in the tab-separated file (time_results.tsv) will not be in any particular order and may contain many false positives.
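The filename convention can be sketched as follows. This is a minimal illustration and not the parsing code in find_times.py: the function name parse_book_filename is hypothetical, and I assume the separator is a hyphen with a space on either side so that hyphenated names survive.

```python
import os
import re

# A trailing "(...)" group is treated as the optional series note.
SERIES_SUFFIX = re.compile(r"\s*\([^()]*\)\s*$")


def parse_book_filename(path):
    """Split 'author - book title (series note).txt' into (author, title)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    stem = SERIES_SUFFIX.sub("", stem)
    # Split on the first " - " only, so a title may itself contain " - ".
    author, separator, title = stem.partition(" - ")
    if not separator:
        raise ValueError("filename does not match 'author - title': %r" % path)
    return author.strip(), title.strip()
```

For example, a file named "Jules Verne - Around the World in Eighty Days.txt" would give the author "Jules Verne" and the title "Around the World in Eighty Days", and a series note in trailing brackets would be dropped.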
twitter_bot.py
This is the code I use to send a tweet and set up the cron scheduler to send the next tweet at the appropriate time. To do so I made use of the Tweepy library (version 3.5.0). The code should work recursively through time_results.tsv, sending a tweet of the next quote in the file until it runs out. Care must be taken that every quote (plus book title and author) fits within 280 characters.
In the books folder I have two books: a Sherlock Holmes collection and Around the World in Eighty Days. These were downloaded from Project Gutenberg in mobi form and converted to text files using convert_books_wrapper.py.
Here are two examples of the rdf files downloaded from Project Gutenberg, covering the metadata of the two books described above.
time_results.tsv contains the results of find_times.py run on the books above for the times 04:25 and 11:29. It can be opened as a spreadsheet for easier viewing.