Word-pair counting management scripts

The scripts here are used to automate the ingestion of plain-text UTF-8 files into the first stage of the language learning pipeline. These can be applied to any flat text files from any origin of your choice. Some tools for downloading Wikipedia and Project Gutenberg texts can be found in the ../../download directory.

Processing can be fully automated or semi-automated. For full automation, run run-all.sh, then move on to the next stage (3-mst-parsing). For large corpora, however, the semi-automated mode described next may be the better choice.

Processing a large corpus can take a long time (hours or days), and if problems arise, it may be easier to run in semi-automated mode. To do this, start a cogserver in one terminal with run-cogserver.sh, and send text to it from another terminal with pair-submit.sh. If counting is interrupted, re-run pair-submit.sh; it will pick up where it left off.
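Because an interrupted count can simply be resumed by re-running pair-submit.sh, the re-run can be wrapped in a small retry loop. The following is only a sketch: retry_until_done is a hypothetical helper, not part of these scripts, and it assumes that pair-submit.sh exits with a non-zero status when it is interrupted, which has not been verified here.

```shell
# Hypothetical helper (not part of this directory): re-run a command until
# it exits 0, giving up after a maximum number of attempts.
# ASSUMPTION: the wrapped command (e.g. ./pair-submit.sh) returns a
# non-zero exit status when it is interrupted.
retry_until_done() {
    max_attempts="$1"
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the command given as the remaining arguments.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

With the cogserver already running in another terminal, this might be used as `retry_until_done 5 ./pair-submit.sh`. If the script reports success even after an interruption, the loop stops after the first attempt, and a manual re-run is still needed.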

After counting has completed, marginal statistics must be computed. This can be done by hand, by running compute-marginals.sh.

The easiest way to run in semi-automated mode is to run the run-shells.sh script. It creates multiple terminal sessions in tmux/byobu. One terminal will have the cogserver running in it, and another will have the text-feeder script. By default, the text-feeder script is not started automatically; toggle through the terminals with F3 and F4 for hints, or just look at run-shells.sh directly.

The scripts here assume that run-config/0-pipeline.sh and run-config/2-pair-conf.sh have both been configured.
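A quick pre-flight check can catch a missing config file before a long run is started. This is a minimal sketch: check_pair_config is a hypothetical helper, and the location of the run-config directory relative to the current directory is an assumption, so pass the real path as the argument. It only checks that the two files exist, not that their contents are correct.

```shell
# Hypothetical helper: confirm that both config files named above exist
# under the given run-config directory.
# ASSUMPTION: the default location "run-config" is relative to the current
# directory; pass the real path as the first argument if it differs.
check_pair_config() {
    conf_dir="${1:-run-config}"
    missing=0
    for f in 0-pipeline.sh 2-pair-conf.sh; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing config: $conf_dir/$f" >&2
            missing=1
        fi
    done
    # Returns 0 when both files are present, 1 otherwise.
    return "$missing"
}
```

For example, `check_pair_config /path/to/run-config && ./run-all.sh` would start the run only when both files are present.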

Detailed instructions on what to do can be found in ../../README-Natural.md

After this stage is completed, move on to ../3-mst-parsing.

File overview

A quick overview:

run-cogserver.sh: Starts a guile shell, runs the atomspace, starts the cogserver in the background, and opens the storage database.
pair-submit.sh: Feeds text to the cogserver for word-pair counting. It pulls text files, one by one, from the data directory and submits them for counting. Start it manually in the submit byobu window. The cogserver must already be running with a database open (it should be running in another byobu terminal).
compute-marginals.sh: A bash script that computes marginal statistics after pair counting has concluded. This needs to be run by hand.
run-shells.sh: Multi-tasking terminal server. Opens multiple terminal sessions with tmux/byobu, and starts the cogserver in one of them. Use F3 and F4 to switch between terminals. Switch to the "submit" terminal, and start pair-submit.sh by hand.
run-all.sh: Combines all of the above into one step. Opens the terminal sessions in the same way, but also launches the cogserver, the telnet session, and the marginals computation. At present it does not exit automatically when it finishes, so that the results can be reviewed and confirmed. If satisfied, kill the entire session with:

  tmux kill-session -t auto-pair-count
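If it is not certain that the session is still alive, the kill can be guarded with tmux has-session so that no error is printed when there is nothing to kill. This is a sketch only: end_pair_session is a hypothetical helper, and the default session name is taken from the command above.

```shell
# Hypothetical helper: kill the named tmux session only if it exists.
# The default session name comes from the kill-session command above.
end_pair_session() {
    session="${1:-auto-pair-count}"
    if tmux has-session -t "$session" 2>/dev/null; then
        tmux kill-session -t "$session"
    else
        echo "no tmux session named $session"
    fi
}
```

Calling `end_pair_session` with no argument targets auto-pair-count; a different session name can be passed as the first argument.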
