Original Paper = "Lost in the Middle: How Language Models Use Long Contexts". Read paper here.
Goal: to find out whether LLMs can really utilize their full context length.
- Key-Value Retrieval: Given $k$ unique key-value pairs and one query key, find the value corresponding to that key, where $k \in \{75, 140, 300\}$. The position of the query key is varied and its effect on accuracy is noted.
- Multi-Document Question-Answering: Given $k$ documents (text excerpts) and a question, produce the answer, where $k \in \{10, 20, 30\}$. Exactly one of the $k$ documents is relevant to the question; the remaining $k - 1$ are distractors. The position of the relevant document is varied and its effect on accuracy is noted. A prompt-construction sketch for both tasks follows this list.
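The sketch below shows how such prompts can be constructed, assuming UUID keys/values in a JSON object and instruction wording modeled on the original paper; our actual templates may differ in details:

```python
import json
import uuid

def kv_retrieval_prompt(k: int, query_position: int) -> tuple[str, str]:
    """Build a key-value retrieval prompt with k random UUID pairs;
    the pair to retrieve sits at `query_position` (0-indexed)."""
    pairs = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(k)]
    query_key, expected_value = pairs[query_position]
    kv_block = json.dumps(dict(pairs), indent=1)  # insertion order preserved
    prompt = (
        "Extract the value corresponding to the specified key in the "
        "JSON object below.\n\n"
        f"{kv_block}\n\n"
        f'Key: "{query_key}"\nCorresponding value:'
    )
    return prompt, expected_value

def multidoc_qa_prompt(question: str, documents: list[str]) -> str:
    """Build a multi-document QA prompt; the caller places the single
    relevant document at the desired position in `documents`."""
    docs = "\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(documents))
    return (
        "Write a high-quality answer for the given question using only "
        "the provided search results.\n\n"
        f"{docs}\n\nQuestion: {question}\nAnswer:"
    )
```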
We have used Gemini Pro 1.0. For more details, see the Gemini technical report.
Note: You may see "Gemini pro" in the code and test results; at the time of testing, this was an alias for Gemini Pro 1.0.
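For reference, a minimal sketch of querying the model through the google.generativeai SDK, reusing `kv_retrieval_prompt` from the sketch above; the generation settings here are illustrative assumptions, not necessarily the ones used for the numbers below:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key
model = genai.GenerativeModel("gemini-pro")  # alias for Gemini Pro 1.0 at test time

prompt, expected_value = kv_retrieval_prompt(75, 0)
# Temperature 0 keeps retrieval answers deterministic (illustrative choice).
response = model.generate_content(
    prompt,
    generation_config=genai.types.GenerationConfig(temperature=0.0),
)
print(response.text, expected_value)
```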
- Key-Value Retrieval Task for 75 Keys:
Correct Key at location -> | 0 | 24 | 49 | 74 |
---|---|---|---|---|
W/O QAC | 100% | 99.8% | 100% | 100% |
W/ QAC | 100% | 100% | 100% | 100% |
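Throughout these tables, QAC stands for query-aware contextualization from the original paper: the query is placed both before and after the data, so the model sees it up front. A minimal sketch of the two layouts:

```python
def without_qac(context: str, query: str) -> str:
    # Standard layout: data first, query only at the end.
    return f"{context}\n\n{query}"

def with_qac(context: str, query: str) -> str:
    # Query-aware contextualization: the query appears both
    # before and after the data.
    return f"{query}\n\n{context}\n\n{query}"
```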
- Key-Value Retrieval Task for 140 Keys:
Correct Key at location -> | 0 | 34 | 69 | 104 | 139 |
---|---|---|---|---|---|
W/O QAC | 100% | 99.4% | 100% | 100% | 100% |
W/ QAC | 99.8% | 100% | 100% | 100% | 100% |
- Key-Value Retrieval Task for 300 Keys:
Correct Key at location -> | 0 | 49 | 99 | 149 | 199 | 249 | 299 |
---|---|---|---|---|---|---|---|
W/O QAC | 95.4% | 90% | 95.6% | 87% | 98.2% | 98.6% | 100% |
W/ QAC | 97.2% | 100% | 100% | 100% | 100% | 99.8% | 100% |
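Accuracy for the KV task can be scored by checking whether the expected value appears in the model output; a minimal sketch (our actual scoring script may differ in details):

```python
def kv_correct(model_output: str, expected_value: str) -> bool:
    """Count a KV answer correct if the gold value appears verbatim
    (case-insensitively) in the model output."""
    return expected_value.lower() in model_output.lower()
```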
- Question-Answer Task:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 | 24 | 29 |
---|---|---|---|---|---|---|---|
QA Task Closedbook | 43.61% | - | - | - | - | - | - |
QA Task Oracle w/o QAC | 71.56% | - | - | - | - | - | - |
QA Task Oracle w/ QAC | 78.30% | - | - | - | - | - | - |
QA Task on 10 Doc w/o QAC | 62.63% | 57.21% | 65.98% | - | - | - | - |
QA Task on 10 Doc w/ QAC | 67.38% | 64.97% | 69.22% | - | - | - | - |
QA Task on 20 Doc w/o QAC | 58.56% | 53.18% | 54.80% | 55.55% | 64.44% | - | - |
QA Task on 20 Doc w/ QAC | 63.27% | 59.96% | 62.33% | 62.22% | 67.34% | - | - |
QA Task on 30 Doc w/o QAC | 57.92% | 44.63% | 45.64% | 48.73% | 51.33% | 50.99% | 63.76% |
QA Task on 30 Doc w/ QAC | 63.69% | 54.08% | 54.38% | 56.98% | 59.47% | 60.15% | 66.96% |
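The original paper scores QA with best-subspan exact match; assuming our evaluation follows the paper, a minimal sketch of that metric:

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def best_subspan_em(model_output: str, gold_answers: list[str]) -> bool:
    """Correct if any normalized gold answer is a substring of the
    normalized model output."""
    out = normalize(model_output)
    return any(normalize(ans) in out for ans in gold_answers)
```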
We have used the RWKV v4 instruction-tuned model with 14B parameters for all the testing; for details, refer to the Hugging Face Model Card.
- Key-Value Retrieval Task: RWKV models cannot perform this task.
- Question-Answer Task:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 | 24 | 29 |
---|---|---|---|---|---|---|---|
QA Task Closedbook | 30.77% | - | - | - | - | - | - |
QA Task Oracle w/o QAC | 80.56% | - | - | - | - | - | - |
QA Task Oracle w/ QAC | 76.19% | - | - | - | - | - | - |
QA Task on 10 Doc w/o QAC | 38.45% | 34.95% | 43.84% | - | - | - | - |
QA Task on 10 Doc w/ QAC | 38.34% | 33.59% | 40.94% | - | - | - | - |
QA Task on 20 Doc w/o QAC | 32.35% | 31.03% | 30.09% | 31.52% | 43.35% | - | - |
QA Task on 20 Doc w/ QAC | 32.91% | 30.88% | 30.13% | 30.80% | 39.77% | - | - |
QA Task on 30 Doc w/o QAC | 30.24% | 29.45% | 29.03% | 29.03% | 29.67% | 30.92% | 46.21% |
QA Task on 30 Doc w/ QAC | 29.64% | 28.88% | 28.17% | 27.68% | 28.02% | 29.26% | 43.50% |
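A minimal loading sketch, assuming the Hugging Face port of the model (`RWKV/rwkv-raven-14b`); the exact checkpoint and generation settings we used may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/rwkv-raven-14b"  # assumed HF port of RWKV v4 Raven 14B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Question: Who wrote Hamlet?\nAnswer:",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:],
                       skip_special_tokens=True))
```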
Note: We have used RWKV v4; "Raven" means the model is instruction fine-tuned, and "Pile" means it is the base model.
- Question-Answer Task with 20 Documents:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 |
---|---|---|---|---|---|
Raven 3B | 19.17% | 18.56% | 19.28% | 22.25% | 41.77% |
Raven 7B | 23.69% | 24.14% | 24.44% | 27.19% | 39.51% |
Raven 14B | 32.35% | 31.03% | 30.09% | 31.52% | 43.35% |
Pile 3B | 10.96% | 11.03% | 11.56% | 13.25% | 32.73% |
Pile 7B | 17.70% | 18.00% | 18.79% | 21.39% | 41.31% |
Pile 14B | 20.97% | 21.65% | 21.61% | 23.61% | 37.32% |
Note: Raven 14B and Pile 14B are the RWKV v4 models from above, repeated here for comparison with Mistral.
- Question-Answer Task with 20 Documents:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 |
---|---|---|---|---|---|
Raven 14B | 32.35% | 31.03% | 30.09% | 31.52% | 43.35% |
Pile 14B | 20.97% | 21.65% | 21.61% | 23.61% | 37.32% |
Mistral 7B - Instruct | 64.78% | 65.00% | 65.23% | 65.68% | 66.13% |
Mistral 7B - Base | 62.93% | 35.59% | 33.63% | 33.97% | 43.95% |
Note: We have used Mistral-7B-Instruct-v0.2 (see the model card). Here, sdpa = scaled dot-product attention and flash = flash_attention_2 (see the FlashAttention-2 paper and GitHub repo).
- Question-Answer Task with 20 Documents:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 |
---|---|---|---|---|---|
sdpa | 64.78% | 65.00% | 65.23% | 65.68% | 66.13% |
flash | 64.63% | 64.85% | 65.12% | 65.64% | 66.17% |
Conclusion: FlashAttention-2 marginally (negligibly) decreases the accuracy. Remark: the two experiments described in this section were performed on systems with different configurations, so the average time required to execute a prompt cannot be compared.
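For reproducibility, the two attention backends in the table above can be selected at load time via Hugging Face transformers (a sketch; flash_attention_2 additionally requires the flash-attn package and a supported GPU):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# PyTorch scaled dot-product attention backend:
model_sdpa = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="sdpa"
)
# FlashAttention-2 backend:
model_flash = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)
```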
We have used the pre-trained and fine-tuned LLaMA2 models with a 4k context length and 70B parameters (developed by Meta) for all the testing; for details, refer to the [Hugging Face Model Card](https://huggingface.co/meta-llama/Llama-2-70b).
- Key-Value Retrieval Task for 75 Keys:
Correct Key at location -> | 0 | 24 | 49 | 74 |
---|---|---|---|---|
W/O QAC | 60% | 0.68% | 0% | 0% |
As we can see from the above, LLaMA is not suitable for KV-retrieval tasks.
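A minimal loading sketch, assuming the transformers-format checkpoints (`meta-llama/Llama-2-70b-hf` for the base model; the fine-tuned chat variant is `meta-llama/Llama-2-70b-chat-hf`); access to these weights is gated:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # assumed transformers-format checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # shard the 70B weights across available GPUs
)
```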
- Question-Answer Task:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 | 24 | 29 |
---|---|---|---|---|---|---|---|
QA Task Closedbook | 51.33% | - | - | - | - | - | - |
QA Task Oracle w/o QAC | 90.24% | - | - | - | - | - | - |
QA Task Oracle w/ QAC | 89.52% | - | - | - | - | - | - |
QA Task on 10 Doc w/o QAC | 73.67% | 67.86% | 69.86% | - | - | - | - |
QA Task on 10 Doc w/ QAC | 77.47% | 70.50% | 68.51% | - | - | - | - |
QA Task on 20 Doc w/o QAC | 60.30% | 44.40% | 45.87% | 48.62% | 27.98% | - | - |
QA Task on 20 Doc w/ QAC | 70.32% | 56.66% | 58.71% | 54.16% | 35.90% | - | - |
QA Task on 30 Doc w/o QAC | 61.95% | 51.56% | 50.02% | 50.96% | 25.35% | 24.33% | 27.76% |
QA Task on 30 Doc w/ QAC | 70.79% | 59.17% | 53.29% | 56.72% | 34.08% | 34.68% | 35.59% |
- Key-Value Retrieval Task for 75 Keys:
Correct Key at location -> | 0 | 24 | 49 | 74 |
---|---|---|---|---|
W/O QAC | 99.6% | 98% | 98.2% | 96% |
W/ QAC | 99.2% | 100% | 100% | 99.8% |
- Key-Value Retrieval Task for 140 Keys:
Correct Key at location -> | 0 | 34 | 69 | 104 | 139 |
---|---|---|---|---|---|
W/O QAC | 99.2% | 94.8% | 94.6% | 96.2% | 88.6% |
W/ QAC | 98.8% | 100% | 100% | 100% | 92.6% |
- Key-Value Retrieval Task for 300 Keys:
Correct Key at location -> | 0 | 49 | 99 | 149 | 199 | 249 | 299 |
---|---|---|---|---|---|---|---|
W/O QAC | 99.6% | 85.4% | 84.8% | 83.2% | 76.8% | 90.4% | 73.4% |
W/ QAC | 97.6% | 99.8% | 99.8% | 99.8% | 99.4% | 99.8% | 26.2% |
- Question-Answer Task:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 | 24 | 29 |
---|---|---|---|---|---|---|---|
QA Task Closedbook | % | - | - | - | - | - | - |
QA Task Oracle w/o QAC | % | - | - | - | - | - | - |
QA Task Oracle w/ QAC | % | - | - | - | - | - | - |
QA Task on 10 Doc w/o QAC | 71.60% | 68.85% | 68.24% | - | - | - | - |
QA Task on 10 Doc w/ QAC | 77.55% | 67.64% | 64.21% | - | - | - | - |
QA Task on 20 Doc w/o QAC | 64.78% | 65.00% | 65.23% | 65.68% | 66.13% | - | - |
QA Task on 20 Doc w/ QAC | 74.04% | 63.16% | 61.80% | 62.33% | 61.35% | - | - |
QA Task on 30 Doc w/o QAC | 62.18% | 60.15% | 62.22% | 64.18% | 63.08% | 64.25% | 66.29% |
QA Task on 30 Doc w/ QAC | 70.50% | 59.92% | 59.92% | 61.69% | 60.64% | 61.80% | 62.71% |
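To visualize the lost-in-the-middle effect across positions, a minimal plotting sketch using the 30-document w/o QAC row from the table above:

```python
import matplotlib.pyplot as plt

# 30-document w/o QAC row from the table above.
positions = [0, 4, 9, 14, 19, 24, 29]
accuracy = [62.18, 60.15, 62.22, 64.18, 63.08, 64.25, 66.29]

plt.plot(positions, accuracy, marker="o")
plt.xlabel("Position of the relevant document")
plt.ylabel("Accuracy (%)")
plt.title("Accuracy vs. relevant-document position (30 documents, w/o QAC)")
plt.show()
```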