pharo-ai/NgramModel

Ngram Language Model

The Ngram package provides basic n-gram functionality for Pharo. This includes the Ngram class as well as String and SequenceableCollection extensions that allow you to split text into unigrams, bigrams, trigrams, etc. Basically, this is just a simple utility for splitting texts into sequences of words. This project also provides an n-gram language model (AINgramModel) and a text generator (AINgramTextGenerator), both used in the text generation example further down.

Installation

To install the packages of NgramModel, open the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):

Metacello new
  baseline: 'AINgramModel';
  repository: 'github://pharo-ai/NgramModel/src';
  load

How to depend on it?

If you want to add a dependency on this project to your own project, include the following lines in your baseline method (the baseline name must match the one used in the installation script above):

spec
  baseline: 'AINgramModel'
  with: [ spec repository: 'github://pharo-ai/NgramModel/src' ].

If you are new to baselines and Metacello, check out the Baselines tutorial on Pharo Wiki.

What are n-grams?

An n-gram is a sequence of n elements, usually words. The number n is called the order of the n-gram. The concept of n-grams is widely used in natural language processing (NLP). A text can be split into n-grams - sequences of n words. Consider the following text:

I do not like green eggs and ham

We can split it into unigrams (n-grams with n=1):

(I), (do), (not), (like), (green), (eggs), (and), (ham)

Or bigrams (n-grams with n=2):

(I do), (do not), (not like), (like green), (green eggs), (eggs and), (and ham)

Or trigrams (n-grams with n=3):

(I do not), (do not like), (not like green), (like green eggs), (green eggs and), (eggs and ham)

And so on (tetragrams, pentagrams, etc.).
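The splitting shown above can be sketched with nothing but core Pharo collection messages. This is only an illustration of the idea, not the way this package implements it: the variable names are made up for the example, and the text is tokenized naively on whitespace.

```smalltalk
"A minimal sketch: slide a window of n words over the text.
 Uses only core messages (substrings, copyFrom:to:, collect:);
 it does not use the extensions provided by this package."
words := 'I do not like green eggs and ham' substrings.
n := 3.

"One window starts at every position that still has n words left."
trigrams := (1 to: words size - n + 1) collect: [ :start |
    words copyFrom: start to: start + n - 1 ].

trigrams size. "6, the number of trigrams listed above"
```

Changing n to 1 or 2 gives the unigrams and bigrams from the lists above.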

Applications

N-grams are widely applied in language modeling. For example, take a look at the implementation of an n-gram language model in Pharo.

Structure of n-gram

Each n-gram can be separated into:

  • last word - the last element in a sequence;
  • history (context) - n-gram of order n-1 with all words except the last one.

Such a separation is useful in probabilistic modeling when we want to estimate the probability of a word given the n-1 previous words (see n-gram language model).
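To make the role of the history concrete, here is a small sketch of the simplest possible estimate for bigrams: the number of times a history is followed by a given word, divided by the number of times that history occurs at all. It uses only core Pharo classes (Bag, String) and a made-up toy text. It is an assumption-laden illustration, not a description of how AINgramModel computes probabilities (a real model may, for instance, apply smoothing).

```smalltalk
"Toy maximum-likelihood estimate of P(word | history) for bigrams.
 The text and all variable names are invented for this example."
words := 'eggs and ham and eggs and toast' substrings.
bigramCounts := Bag new.
historyCounts := Bag new.

"Every word except the last one acts as the history of one bigram."
1 to: words size - 1 do: [ :i |
    historyCounts add: (words at: i).
    bigramCounts add: (words at: i) , ' ' , (words at: i + 1) ].

"P(ham | and) = count('and ham') / count('and' as a history)"
probability := (bigramCounts occurrencesOf: 'and ham')
    / (historyCounts occurrencesOf: 'and'). "(1/3)"
```

In this text 'and' is followed once each by 'ham', 'eggs' and 'toast', so each of them gets probability 1/3.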

Ngram class

N-grams are modeled by a single class - AINgram.

Instance creation

You can create an n-gram from any SequenceableCollection:

trigram := AINgram withElements: #(do not like).
tetragram := #(green eggs and ham) asNgram.

Or by explicitly providing the history (n-gram of lower order) and last element:

hist := #(green eggs and) asNgram.
w := 'ham'.

ngram := AINgram withHistory: hist last: w.

You can also create a zerogram - an n-gram of order 0. It is an empty sequence with no history and no last word:

AINgram zerogram.

Accessing

You can access the order of an n-gram, its history and its last element:

tetragram. "n-gram(green eggs and ham)"
tetragram order. "4"
tetragram history. "n-gram(green eggs and)"
tetragram last. "ham"
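The accessors and the withHistory:last: constructor fit together: an n-gram can be taken apart and put back together. The sketch below uses only the messages shown in this README and assumes the package is loaded; the variable name rebuilt is made up for the example.

```smalltalk
"Decompose an n-gram into history and last element, then rebuild it.
 Only messages documented above are used."
tetragram := #(green eggs and ham) asNgram.

rebuilt := AINgram
    withHistory: tetragram history
    last: tetragram last.

"Inspect the result; its order should match tetragram order."
rebuilt order.
```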

String extensions

TODO

Example of text generation

1. Loading Brown corpus

file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.
brown := file contents.

2. Training a 2-gram language model on the corpus

model := AINgramModel order: 2.
model trainOn: brown.

3. Generating text of 100 words

At each step the model selects the 5 words that are most likely to follow the previous words and returns a random word from those five (this randomness ensures that the generator does not get stuck in a cycle).

generator := AINgramTextGenerator new model: model.
generator generateTextOfSize: 100.
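The selection step described above (take the five most likely followers, then pick one at random) can be sketched with core Pharo collections. This is a hedged illustration, not the code of AINgramTextGenerator: the dictionary of counts is invented, and raw counts stand in for whatever scores the model really uses.

```smalltalk
"Hypothetical counts of words seen after some history."
followerCounts := Dictionary new.
followerCounts
    at: 'eggs' put: 12;
    at: 'ham' put: 7;
    at: 'tea' put: 5;
    at: 'light' put: 4;
    at: 'grass' put: 3;
    at: 'apples' put: 1;
    at: 'paint' put: 1.

"Rank followers from most to least frequent."
ranked := followerCounts associations
    sorted: [ :a :b | a value >= b value ].

"Keep at most five candidates, then choose one of them at random."
candidates := (ranked first: (5 min: ranked size))
    collect: [ :each | each key ].
nextWord := candidates atRandom.
```

Always taking the single most likely word would make the output deterministic, so a repeated context would always lead to the same continuation; the random choice among several candidates avoids that.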

Result:

100 words generated by a 2-gram model trained on Brown corpus

 educator cannot describe and edited a highway at private time ``
 Fallen Figure Technique tells him life pattern more flesh tremble 
 with neither my God `` Hit ) landowners began this narrative and 
 planted , post-war years Josephus Daniels was Virginia years 
 Congress with confluent , jurisdiction involved some used which 
 he''s something the Lyle Elliott Carter officiated and edited and
 portents like Paradise Road in boatloads . Shipments of Student 
 Movement itself officially shifted religions of fluttering soutane .
 Coolest shade which reasonably . Coolest shade less shaky . Doubts 
 thus preventing them proper bevels easily take comfort was

100 words generated by a 3-gram model trained on Brown corpus

 The Fulton County purchasing departments do to escape Nicolas Manas .
 But plain old bean soup , broth , hash , and cultivated in himself , 
 back straight , black sheepskin hat from Texas A & I College and 
 operates the institution , the antipathy to outward ceremonies hailed 
 by modern plastic materials -- a judgment based on displacement of his 
 arrival spread through several stitches along edge to her paper for 
 further meditation . `` Hit the bum '' ! ! Fort up ! ! Fort up ! ! 
 Kizzie turned to similar approaches . When Mrs. Coolidge for

100 words generated by a 3-gram model trained on Pharo source code corpus

This model was trained on a corpus composed of the source code of 85,000 Pharo methods, tokenized at the subtoken level (composite names like OrderedCollection were split into subtokens: ordered, collection).

 super initialize value holders . ( aggregated series := ( margins if nil
 if false ) text styler blue style table detect : [ uniform drop list input . 
 export csv label : suggested file name < a parametric function . | phase 
 <num> := bit thing basic size >= desired length ) ascii . space width + 
 bounds top - an event character : d bytes : stream if absent put : answers )
 | width of text . status value := dual value at last : category string := 
 value cos ) abs raised to n number of

Warning

Training the model on the entire Pharo corpus and generating 100 words can take over 10 minutes. So start with a smaller exercise: train a 2-gram model on the Brown corpus (it is the smallest one) and generate 10 words.
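Put together, the smaller exercise looks like this. It only combines the steps from the example above, and it assumes that the repository was cloned by Iceberg into the same location as in step 1; adjust the path if your corpus lives elsewhere.

```smalltalk
"Small first run: 2-gram model on the Brown corpus, 10 generated words.
 The path is the one used in step 1 and may differ on your machine."
file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.

model := AINgramModel order: 2.
model trainOn: file contents.

generator := AINgramTextGenerator new model: model.
generator generateTextOfSize: 10.
```

Since each word is chosen at random among the top candidates, the generated text will usually differ between runs.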