smhavens/NLPCategoryGame

---
title: AnalogyArcade
emoji: 🏆
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 4.8.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Introduction

Welcome to Analogy Arcade! The game is simple: given a prompt from the system, try to fill in the blank as best you can! If you guess correctly, a new prompt is generated; if the guess is wrong, the 'previous guesses' list will show what the prompt would have been if your answer were correct. The system is built on language models and word embeddings, so the prompts and answers can get quite strange.

Usage

Running the program is simple: go to Analogy Arcade on huggingface.co and the app will already be running. At the top is a prompt in the format "A is to B as C is to ___"; enter your guess in the first textbox. The second textbox returns "Try Again" or "Correct" depending on whether your guess was right. Finally, at the bottom, a dropdown displays all previous guesses along with what the third word of the prompt would have been had the user's guess been correct.
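For a sense of how that layout maps to code, here is a minimal Gradio sketch of the described interface. The component names, the stand-in answer, and the submit handler are illustrative assumptions, not the app's actual implementation:

```python
import gradio as gr

ANSWER = "white"   # stand-in; the real app generates answers from the model
history = []

def submit(guess):
    # Lowercased exact-match check, then update the previous-guesses dropdown.
    if guess.strip().lower() == ANSWER:
        return "Correct", gr.Dropdown(choices=history)
    history.append(f"Sun is to Moon as [?] is to {guess}")
    return "Try Again", gr.Dropdown(choices=history)

with gr.Blocks() as demo:
    gr.Markdown("Sun is to Moon as Black is to ___")
    guess_box = gr.Textbox(label="Guess")
    result_box = gr.Textbox(label="Result")
    previous = gr.Dropdown(choices=[], label="Previous guesses")
    guess_box.submit(submit, inputs=guess_box, outputs=[result_box, previous])

demo.launch()
```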

Because this uses a fine-tuned sentence-embedding model, some of the analogies are not very intuitive to humans, but it is still interesting to think about how the model arrived at its answer! Take the prompt "humans is to fired as exam is to ___". At first it seems very strange, but the answer is actually "recycling": the model saw "humans" and "fired" as a way of removing a bad human (fired from the job), then asked how you would remove a bad exam. Well, it thinks you would recycle it.

Documentation

The model I settled on was a sentence-transformers model fine-tuned on the ~2k-example dataset relbert/analogy_questions and then run through the Hugging Face fill-mask pipeline to fill in the missing word in each prompt. Sentence-transformers models (like BERT-based encoders) focus on embeddings, which is the most applicable approach for the game's objective, while the dataset offers specific analogies to train on in a usable format: I processed the data by pairing each 'stem' (the first analogy pair) with its correct answer from the dataset.
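A minimal sketch of that pairing and fine-tuning step, assuming the dataset's "sat" configuration and sentence-transformers' MultipleNegativesRankingLoss (both assumptions; the actual training configuration is not shown in this README):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Load one configuration of the analogy dataset (a config name is required).
data = load_dataset("relbert/analogy_questions", "sat", split="test")

# Pair each 'stem' with its correct choice as a positive training pair.
examples = []
for row in data:
    stem = " is to ".join(row["stem"])
    answer = " is to ".join(row["choice"][row["answer"]])
    examples.append(InputExample(texts=[stem, answer]))

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```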

For user input, not much is done to it when checking against the stored answer: both are lowercased and the program checks whether they are equal. Beyond that, the user's input is also stored as a guess and run through its own embedding to create a new prompt of the form "A is to B as [MASK] is to Guess". This lets the user see how the system would have interpreted their answer, giving a peek behind the scenes at how the model thinks.
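A sketch of that checking-and-reversal logic (the function names are hypothetical):

```python
def check_guess(guess: str, answer: str) -> bool:
    # Both strings are lowercased before an exact-match comparison.
    return guess.strip().lower() == answer.strip().lower()

def reverse_prompt(word_a: str, word_b: str, guess: str) -> str:
    # Re-mask the third slot so the model can show what prompt would
    # have fit the user's guess.
    return f"{word_a} is to {word_b} as [MASK] is to {guess}."
```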

Prompts for this system (and all other tested systems) are generated by first selecting a random word from the model's vocabulary as word 'A' and then having the system determine a word similar to 'A', which becomes word 'B'. For the fine-tuned model, this means it uses a randomly selected base prompt (such as "Sun is to Moon as ") and appends the randomly selected word 'A', so the embedding function receives the sentence "Sun is to Moon as WordA is to [MASK]."
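A sketch of that generation loop, using bert-base-uncased as a stand-in checkpoint (the real app would load its own fine-tuned weights):

```python
import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")  # stand-in checkpoint
# Sample word 'A' from the tokenizer vocabulary, skipping subword pieces.
vocab = [w for w in fill.tokenizer.get_vocab() if w.isalpha()]

base = "Sun is to Moon as "                 # randomly selected base prompt
word_a = random.choice(vocab)               # random vocabulary word 'A'
result = fill(f"{base}{word_a} is to [MASK].")
word_b = result[0]["token_str"]             # top prediction becomes word 'B'
print(f"{word_a} is to {word_b} as ... is to ___")
```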

Experiments

Model Types

Baseline

For my dataset, I made use of relbert/analogy_questions on huggingface, which has all data in the format of:

"stem": ["raphael", "painter"],
"answer": 2,
"choice": [["andersen", "plato"],
          ["reading", "berkshire"],
          ["marx", "philosopher"],
          ["tolstoi", "edison"]]

As a baseline, if the answer were chosen at random (i.e., the stem analogy is compared against a random choice among the four candidate pairs), correct categorization would occur only 25% of the time.
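A quick Monte-Carlo check of that figure:

```python
import random

# Guessing uniformly among the four candidate pairs matches the
# correct one a quarter of the time.
trials = 100_000
hits = sum(random.randrange(4) == 0 for _ in range(trials))
print(hits / trials)  # ~0.25
```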

Bag-of-Words Model

For comparison, I reused the bag-of-words model I trained for our previous project, modified to rely entirely on word2vec's most_similar function.
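For reference, most_similar solves analogies by vector arithmetic (B - A + C); a sketch, with the vector file path as an assumption:

```python
from gensim.models import KeyedVectors

# Hypothetical path to the previously trained word2vec vectors.
wv = KeyedVectors.load("word2vec.kv")

# "A is to B as C is to ?" becomes B - A + C in vector space,
# e.g. the classic king - man + woman ≈ queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```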

Fine-Tuning

Dataset

ag_news dataset

This dataset uses a text-with-label format, where each label is an integer from 0 to 3 corresponding to one of four news categories: World (0), Sports (1), Business (2), and Sci/Tech (3).

I chose it for its larger variety of categories compared to sentiment datasets, with themes that are theoretically more closely related to analogies. I also chose ag_news because, as a news source, it should avoid slang and the other potential hiccups of datasets built from tweets or general reviews.
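Loading the dataset and inspecting its labels is straightforward:

```python
from datasets import load_dataset

ds = load_dataset("ag_news", split="train")
print(ds[0])                       # {'text': '...', 'label': 2}
print(ds.features["label"].names)  # ['World', 'Sports', 'Business', 'Sci/Tech']
```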

Pre-trained model

sentence-transformers/all-MiniLM-L6-v2

Because my focus is on using embeddings to evaluate analogies for the AnalogyArcade, I restricted my model search to the sentence-transformers family, as those models are purpose-built for embedding use. I chose all-MiniLM-L6-v2 for its high usage and good reviews: it is a well-trained model that is smaller and more efficient than its predecessor.
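A minimal example of using it for the embedding comparisons the game relies on:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(["sun is to moon", "black is to white"])
print(util.cos_sim(emb[0], emb[1]))  # cosine similarity of the two pairs
```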

In-Context

My in-context model used Google's flan-t5-base together with the analogy-pair dataset.
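A sketch of the in-context setup; the exact few-shot wording is an assumption, not the project's verbatim prompt:

```python
from transformers import pipeline

gen = pipeline("text2text-generation", model="google/flan-t5-base")

prompt = (
    "Complete the analogy.\n"
    "raphael is to painter as marx is to philosopher.\n"
    "sun is to moon as black is to"
)
print(gen(prompt, max_new_tokens=5)[0]["generated_text"])
```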

Human Analogy Judgement

Because analogies are hard to evaluate with standard metrics, I focused on how humans would rank the three models' outputs. Five participants were each given the same 10 prompts, together with the answers generated by the three models, and asked to rank which model produced the best prompts overall. The average ranking was:

  1. Sentence-Transformer Model
  2. Word2Vec Model
  3. In-Context Model using Flan

While none were perfect, the first usually chose understandable words with some logic behind them; the second chose only vaguely related words for the A-B and C-D pairs; and the third struggled, with exceptionally strange word choices, often returning fragments of words or the prompt's own words.

Sentence-Transformer Model

1. humans is to fired as exam is to recycling
2. accessible is to trigger as bicycle is to federation
3. sipped is to recover as able is to alternate
4. renewal is to evidenced as travel is to curve
5. cot is to center as dissent is to fan
6. endure is to mine as accessible is to within
7. pierced is to royal as tissue is to insisted
8. amor is to period as maneuvers is to chorus
9. city is to elimination as tod is to offset
10. arabian is to poem as pluto is to embassy

In-Context Model using Flan

1. ciones is to Mexican as Assisted is to Assistance
2. Stu is to Stubble as île is to Laos
3. wusste is to unknown as nature is to nature
4. AIDS is to Mauritania as essentiellement is to Mauritania
5. something is to Something as OC is to Oman
6. abend is to abend as Lie is to lie
7. Muzeul is to Romania as hora is to Mauritania
8. BB is to Belorussian as werk is to German
9. éti is to étizstan as ţele is to celswana
10. î is to îzbekistan as derma is to Dermatan

Word2Vec Model

1. headless is to hobo as thylacines is to arawaks
2. 42 km is to dalston as sisler is to paperboy
3. tolna is to fejér as dermott is to ꯄꯟꯊꯢꯕ.
4. recursion is to postscript as ornithischian is to ceratopsian.
5. 19812007 is to appl as khimichev is to pashkevich.
6. trier is to nürnberg as hathaways is to tate.
7. yon is to arnoldo as neocon is to pešice.
8. washingtonpost is to secretaria as laugh is to shout.
9. waking is to wakes as prelude is to fugue.
10. 2car is to shunters as mariah is to demi.

Automated Tests

I also employed 11 automated tests, each with a set prompt and answer that the model attempts to fill in correctly. Most failed, though some gave relatively close answers. It seems that while some models had good word choices, they struggled to grasp the relation between words, whereas Word2Vec could capture some fundamental relationships but struggled to generate viable prompts of its own.
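The harness behind these tests can be as simple as the sketch below (the fill_in callable and the two sample pairs are illustrative):

```python
def run_tests(fill_in, tests):
    # fill_in: model-specific function mapping a prompt to its top completion.
    correct = 0
    for prompt, expected in tests:
        answer = fill_in(prompt)
        ok = answer.strip().lower() == expected.lower()
        correct += ok
        print(f"PROMPT: {prompt} ANSWER: {answer} IS TRUE: {ok}")
    return correct, len(tests)

tests = [
    ("Sun is to Moon as Black is to [MASK].", "white"),
    ("Athens is to Greece as Tokyo is to [MASK].", "japan"),
]
```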

Sentence-Transformer Model

1. PROMPT: Sun is to Moon as Black is to [MASK]. ANSWER: exhausted IS TRUE: False
Real Answer: White
2. PROMPT: Black is to White as Sun is to [MASK]. ANSWER: bulls IS TRUE: False
Real Answer: Moon
3. PROMPT: Atom is to Element as Molecule is to [MASK]. ANSWER: gig IS TRUE: False
Real Answer: compound
4. PROMPT: Raphael is to painter as Marx is to [MASK]. ANSWER: decade IS TRUE: False
Real Answer: philosopher
5. PROMPT: huge is to hugely as subsequent is to [MASK]. ANSWER: organised IS TRUE: False
Real Answer: subsequently
6. PROMPT: simple is to difficult as fat is to [MASK]. ANSWER: dick IS TRUE: False
Real Answer: thin
7. PROMPT: poem is to stanza as staircase is to [MASK]. ANSWER: cooking IS TRUE: False
Real Answer: step
8. PROMPT: academia is to college as typewriter is to [MASK]. ANSWER: folder IS TRUE: False
Real Answer: keyboard
9. PROMPT: acquire is to reacquire as examine is to [MASK]. ANSWER: futures IS TRUE: False
Real Answer: reexamine
10. PROMPT: pastry is to food as blender is to [MASK]. ANSWER: casting IS TRUE: False
Real Answer: appliance
11. PROMPT: Athens is to Greece as Tokyo is to [MASK]. ANSWER: homeless IS TRUE: False
Real Answer: Japan

0/11; it failed even prompts drawn from its own dataset.

In-Context Model using Flan

1. PROMPT: Sun is to Moon as Black is to ___. ANSWER: Blacks IS TRUE: False
Real Answer: White
2. PROMPT: Black is to White as Sun is to ___. ANSWER: Sunlighter IS TRUE: False
Real Answer: Moon
3. PROMPT: Atom is to Element as Molecule is to ___. ANSWER: Molovnia IS TRUE: False
Real Answer: compound
4. PROMPT: Raphael is to painter as Marx is to ___. ANSWER: Marxistan IS TRUE: False
Real Answer: philosopher
5. PROMPT: huge is to hugely as subsequent is to ___. ANSWER: apparently IS TRUE: False
Real Answer: subsequently
6. PROMPT: simple is to difficult as fat is to ___. ANSWER: fatter IS TRUE: False
Real Answer: thin
7. PROMPT: poem is to stanza as staircase is to ___. ANSWER: staircase IS TRUE: False
Real Answer: step
8. PROMPT: academia is to college as typewriter is to ___. ANSWER: typewriters IS TRUE: False
Real Answer: keyboard
9. PROMPT: acquire is to reacquire as examine is to ___. ANSWER: examine IS TRUE: False
Real Answer: reexamine
10. PROMPT: pastry is to food as blender is to ___. ANSWER: blenders IS TRUE: False
Real Answer: appliance
11. PROMPT: Athens is to Greece as Tokyo is to ___. ANSWER: Japan IS TRUE: True
Real Answer: Japan

1/11; its only success was on a prompt from its own dataset. It tends to repeat the third word.

Word2Vec Model

1. PROMPT: sun is to moon as black is to ___ ANSWER: white IS TRUE: True
Real Answer: White
2. PROMPT: black is to white as sun is to ___ ANSWER: moon IS TRUE: True
Real Answer: Moon
3. PROMPT: atom is to element as molecule is to ___ ANSWER: nucleus IS TRUE: False
Real Answer: compound
4. PROMPT: raphael is to painter as marx is to ___ ANSWER: beck IS TRUE: False
Real Answer: philosopher
5. PROMPT: huge is to hugely as subsequent is to ___ ANSWER: massive IS TRUE: False
Real Answer: subsequently
6. PROMPT: simple is to difficult as fat is to ___ ANSWER: slice IS TRUE: False
Real Answer: thin
7. PROMPT: poem is to stanza as staircase is to ___ ANSWER: balcony IS TRUE: False
Real Answer: step
8. PROMPT: academia is to college as typewriter is to ___ ANSWER: blowfish IS TRUE: False
Real Answer: keyboard
9. 'reacquire' not present in vocab
10. PROMPT: pastry is to food as blender is to ___ ANSWER: plunger IS TRUE: False
Real Answer: appliance
11. PROMPT: athens is to greece as tokyo is to ___ ANSWER: osaka IS TRUE: False
Real Answer: Japan

2/11; it succeeded only on the original human-written prompts. It didn't grasp analogies, often returning a word very similar to the third word or the first pair, and it was the only model to miss a vocabulary word.

Limitations

One of the biggest limitations is the extreme difficulty LMs have understanding analogies without extensive training (which I cannot do within my means in a reasonable time) or matching more specific prompts and wording. Another difficulty is that, since the chosen words all come from the vocabulary of the models/dataset, the system will either miss reasonable words (as with the Word2Vec model not knowing 'reacquire') or fixate on very niche or foreign words, as seen in some of the non-automated testing. A major deficit was that the most applicable datasets are relatively small, so models trained on them began to learn the task but would need substantially more examples.
