The `Ngram` package provides basic n-gram functionality for Pharo. This includes the `AINgram` class as well as `String` and `SequenceableCollection` extensions that allow you to split text into unigrams, bigrams, trigrams, etc. Basically, this is a simple utility for splitting texts into sequences of words. This project also provides an n-gram language model (`AINgramModel`) and a text generator based on it (`AINgramTextGenerator`); both are demonstrated below.
To install NgramModel, go to the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):
```smalltalk
Metacello new
	baseline: 'AINgramModel';
	repository: 'github://pharo-ai/NgramModel/src';
	load
```
If you want to add a dependency on this project to your own project, include the following lines in your baseline method:

```smalltalk
spec
	baseline: 'AINgramModel'
	with: [ spec repository: 'github://pharo-ai/NgramModel/src' ].
```
If you are new to baselines and Metacello, check out the Baselines tutorial on Pharo Wiki.
An n-gram is a sequence of n elements, usually words. The number n is called the order of the n-gram. The concept of n-grams is widely used in natural language processing (NLP): a text can be split into n-grams, overlapping sequences of n consecutive words. Consider the following text:
I do not like green eggs and ham
We can split it into unigrams (n-grams with n=1):
(I), (do), (not), (like), (green), (eggs), (and), (ham)
Or bigrams (n-grams with n=2):
(I do), (do not), (not like), (like green), (green eggs), (eggs and), (and ham)
Or trigrams (n-grams with n=3):
(I do not), (do not like), (not like green), (like green eggs), (green eggs and), (eggs and ham)
And so on (tetragrams, pentagrams, etc.).
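To illustrate, here is a minimal sketch that builds the bigrams by hand with plain Pharo collections and the `AINgram` class described below; the package's `String` and `SequenceableCollection` extensions provide ready-made selectors for this:

```smalltalk
"Build bigrams by hand: tokenize on whitespace, then wrap every
pair of adjacent words in an AINgram. Just a sketch; the package's
extension methods do this for you."
words := 'I do not like green eggs and ham' substrings.
bigrams := (1 to: words size - 1) collect: [ :i |
	AINgram withElements: (words copyFrom: i to: i + 1) ].
bigrams size. "7"
```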
N-grams are widely applied in language modeling. For example, take a look at the implementation of the n-gram language model in Pharo.
Each n-gram can be separated into:
- the last word: the last element of the sequence;
- the history (context): an n-gram of order n-1 containing all words except the last one.

This separation is useful in probabilistic modeling when we want to estimate the probability of a word given the n-1 previous words (see n-gram language model). For example, such a model estimates P(ham | green eggs and) as the number of corpus occurrences of "green eggs and ham" divided by the number of occurrences of "green eggs and".
This package provides only one class, `AINgram`, which models an n-gram. You can create an n-gram from any `SequenceableCollection`:
```smalltalk
trigram := AINgram withElements: #(do not like).
tetragram := #(green eggs and ham) asNgram.
```
Or by explicitly providing the history (an n-gram of lower order) and the last element:

```smalltalk
hist := #(green eggs and) asNgram.
w := 'ham'.
ngram := AINgram withHistory: hist last: w.
```
You can also create a zerogram, an n-gram of order 0. It is an empty sequence with no history and no last word:

```smalltalk
AINgram zerogram.
```
You can access the order of an n-gram, its history, and its last element:

```smalltalk
tetragram.         "n-gram(green eggs and ham)"
tetragram order.   "4"
tetragram history. "n-gram(green eggs and)"
tetragram last.    "ham"
```
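Because a history is itself an n-gram of order n-1, you can keep taking histories; by the definitions above they bottom out at the zerogram. A quick check:

```smalltalk
"Histories nest: each one is an n-gram of order n-1,
bottoming out at the zerogram (order 0)."
bigram := #(green eggs) asNgram.
bigram order.                 "2"
bigram history order.         "1"
bigram history history order. "0"
```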
The package also implements an n-gram language model. You can train it on any text corpus; for example, load the Brown corpus that ships with this repository and train a model of order 2 on it:
```smalltalk
file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.
brown := file contents.
model := AINgramModel order: 2.
model trainOn: brown.
```
A trained model can be used to generate text. At each step, the generator selects the top 5 words that are most likely to follow the previous words and returns a random word from those five (this randomness ensures that the generator does not get stuck in a cycle):
```smalltalk
generator := AINgramTextGenerator new model: model.
generator generateTextOfSize: 100.
```
Here is an example of text generated by a model trained on the Brown corpus:

```
educator cannot describe and edited a highway at private time ``
Fallen Figure Technique tells him life pattern more flesh tremble
with neither my God `` Hit ) landowners began this narrative and
planted , post-war years Josephus Daniels was Virginia years
Congress with confluent , jurisdiction involved some used which
he''s something the Lyle Elliott Carter officiated and edited and
portents like Paradise Road in boatloads . Shipments of Student
Movement itself officially shifted religions of fluttering soutane .
Coolest shade which reasonably . Coolest shade less shaky . Doubts
thus preventing them proper bevels easily take comfort was

The Fulton County purchasing departments do to escape Nicolas Manas .
But plain old bean soup , broth , hash , and cultivated in himself ,
back straight , black sheepskin hat from Texas A & I College and
operates the institution , the antipathy to outward ceremonies hailed
by modern plastic materials -- a judgment based on displacement of his
arrival spread through several stitches along edge to her paper for
further meditation . `` Hit the bum '' ! ! Fort up ! ! Fort up ! !
Kizzie turned to similar approaches . When Mrs. Coolidge for
```
The following text was generated by a model trained on a corpus composed of the source code of 85,000 Pharo methods, tokenized at the subtoken level (composite names like `OrderedCollection` were split into subtokens: `ordered`, `collection`):
```
super initialize value holders . ( aggregated series := ( margins if nil
if false ) text styler blue style table detect : [ uniform drop list input .
export csv label : suggested file name < a parametric function . | phase
<num> := bit thing basic size >= desired length ) ascii . space width +
bounds top - an event character : d bytes : stream if absent put : answers )
| width of text . status value := dual value at last : category string :=
value cos ) abs raised to n number of
```
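The subtoken splitting mentioned above can be sketched in a few lines of Pharo. This is only a hypothetical illustration of the idea, not the tokenizer that was actually used to build the corpus:

```smalltalk
"Hypothetical subtoken splitter: cut a camel-case identifier before
every uppercase letter, then lowercase the pieces. Not the actual
tokenizer used to build the Pharo corpus."
subtokens := ('OrderedCollection'
	piecesCutWhere: [ :a :b | b isUppercase ])
	collect: [ :piece | piece asLowercase ].
subtokens. "contains 'ordered' and 'collection'"
```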
Training the model on the entire Pharo corpus and generating 100 words can take over 10 minutes, so start with a smaller exercise: train a 2-gram model on the Brown corpus (it is the smallest one) and generate 10 words.
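Putting it together, the smaller exercise uses only the API shown above:

```smalltalk
"Smaller exercise: a 2-gram model trained on the Brown corpus,
generating 10 words."
file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.
model := AINgramModel order: 2.
model trainOn: file contents.
(AINgramTextGenerator new model: model) generateTextOfSize: 10.
```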