The scripts here are used to automate the ingestion of plain-text
UTF-8 files into the first stage of the language learning pipeline.
These can be applied to any flat text files from any origin of your
choice. Some tools for downloading Wikipedia and Project Gutenberg
texts can be found in the ../../download
directory.
Processing can be done in a "fully automated" way or in a semi-automated
way. For full automation, just run run-all.sh
and move to the next
stage (stage 3-mst-parsing
). However...
Text processing a large corpus can take a long time to run (hours or
days) and if issues arise, it may be easier to run in a semi-automated
mode. This is done by starting a cogserver in one terminal with
run-cogserver.sh
, and sending text to it in another, with
pair-submit.sh
. If counting is interrupted, just re-run
pair-submit.sh
again, and it will pick up where it left off.
After counting has completed, marginal statistics must be computed.
This can be done by hand, by running compute-marginals.sh
.
The easiest way to run in semi-automated mode is to run the
run-shells.sh
script. It creates multiple terminal sessions in
tmux/byobu. One terminal will have the cogserver running in it, and
another will have the text-feeder script. By default, the text-feeder
script is not started automatically; toggle through the terminals
with F3 and F4 for hints, or just look at run-shells.sh
directly.
The scripts here assume that run-config/0-pipeline.sh
and that
run-config/2-pair-conf.sh
have been configured.
Detailed instructions on what to do can be found in
../../README-Natural.md
After this stage is completed, move on to ../3-mst-parsing
.
A quick overview:
-
run-cogserver.sh
: Starts a guile shell, runs the atomspace, starts the cogserver in the background, and opens the storage database. -
pair-submit.sh
: feeds text to the cogserver for word-pair counting. This pull text files, one by one, from the data directory, and submits them for word-pair counting. Start it manually in thesubmit
byobu window. The cogserver needs to be running first, and a database needs to be open (it should be in another byobu terminal). -
compute-marginals.sh
: A bash script that computes marginal statistics after pair counting has concluded. This needs to be run by hand. -
run-shells.sh
: multi-tasking terminal server. Opens multiple terminal sessions with tmux/byobu, and starts the cogserver in one of them. Use F3 and F4 to switch to different terminals. Switch to the "submit" terminal, and startpair-submit.sh
by hand. -
run-all.sh
: Combine all of the above into one step. Opens the terminal sessions in the same way, but also launches the cogserver, the telnet session, and the marginals recomputation. At this time, this won't auto-exit when it finishes. That's in order to allow review and confirmation. If satsified, kill the entire session with
tmux kill-session -t auto-pair-count