Original Paper = "Lost in the Middle: How Language Models Use Long Contexts". Read paper here.
Goal: to find out whether LLMs can really utilize their full context length.
- Key-Value Retrieval: Given $k$ unique key-value pairs and one query key, find the value corresponding to that key, where $k \in \{75, 140, 300\}$. The position of the query key is varied and its effect on accuracy is noted.
- Multi-Document Question-Answering: Given $k$ documents (text excerpts) and a question, produce the answer, where $k \in \{10, 20, 30\}$. Exactly one of the $k$ documents is relevant to the question; the remaining $k - 1$ are distractors. The position of the relevant document is varied and its effect on accuracy is noted. A prompt-construction sketch for both tasks follows this list.
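The sketch below shows how such prompts can be constructed, assuming UUID keys/values in a JSON object and instruction wording modeled on the original paper; our actual templates may differ in details:

```python
import json
import uuid

def kv_retrieval_prompt(k: int, query_position: int) -> tuple[str, str]:
    """Build a key-value retrieval prompt with k random UUID pairs;
    the pair to retrieve sits at `query_position` (0-indexed)."""
    pairs = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(k)]
    query_key, expected_value = pairs[query_position]
    kv_block = json.dumps(dict(pairs), indent=1)  # insertion order preserved
    prompt = (
        "Extract the value corresponding to the specified key in the "
        "JSON object below.\n\n"
        f"{kv_block}\n\n"
        f'Key: "{query_key}"\nCorresponding value:'
    )
    return prompt, expected_value

def multidoc_qa_prompt(question: str, documents: list[str]) -> str:
    """Build a multi-document QA prompt; the caller places the single
    relevant document at the desired position in `documents`."""
    docs = "\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(documents))
    return (
        "Write a high-quality answer for the given question using only "
        "the provided search results.\n\n"
        f"{docs}\n\nQuestion: {question}\nAnswer:"
    )
```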
We have used Gemini Pro 1.0. For more details, see the Gemini technical report.
Note: You may see "Gemini pro" in the code and test results; at the time of testing, this was an alias for Gemini Pro 1.0.
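For reference, a minimal sketch of querying the model through the google.generativeai SDK, reusing `kv_retrieval_prompt` from the sketch above; the generation settings here are illustrative assumptions, not necessarily the ones used for the numbers below:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key
model = genai.GenerativeModel("gemini-pro")  # alias for Gemini Pro 1.0 at test time

prompt, expected_value = kv_retrieval_prompt(75, 0)
# Temperature 0 keeps retrieval answers deterministic (illustrative choice).
response = model.generate_content(
    prompt,
    generation_config=genai.types.GenerationConfig(temperature=0.0),
)
print(response.text, expected_value)
```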
- Key-Value Retrieval Task for 75 Keys:
Correct Key at location -> | 0 | 24 | 49 | 74 |
---|---|---|---|---|
W/O QAC | 100% | 99.8% | 100% | 100% |
W/ QAC | 100% | 100% | 100% | 100% |
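Throughout these tables, QAC stands for query-aware contextualization from the original paper: the query is placed both before and after the data, so the model sees it up front. A minimal sketch of the two layouts:

```python
def without_qac(context: str, query: str) -> str:
    # Standard layout: data first, query only at the end.
    return f"{context}\n\n{query}"

def with_qac(context: str, query: str) -> str:
    # Query-aware contextualization: the query appears both
    # before and after the data.
    return f"{query}\n\n{context}\n\n{query}"
```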
- Key-Value Retrieval Task for 140 Keys:
Correct Key at location -> | 0 | 34 | 69 | 104 | 139 |
---|---|---|---|---|---|
W/O QAC | 100% | 99.4% | 100% | 100% | 100% |
W/ QAC | 99.8% | 100% | 100% | 100% | 100% |
- Key-Value Retrieval Task for 300 Keys:
Correct Key at location -> | 0 | 49 | 99 | 149 | 199 | 249 | 299 |
---|---|---|---|---|---|---|---|
W/O QAC | 95.4% | 90% | 95.6% | 87% | 98.2% | 98.6% | 100% |
W/ QAC | 97.2% | 100% | 100% | 100% | 100% | 99.8% | 100% |
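Accuracy for the KV task can be scored by checking whether the expected value appears in the model output; a minimal sketch (our actual scoring script may differ in details):

```python
def kv_correct(model_output: str, expected_value: str) -> bool:
    """Count a KV answer correct if the gold value appears verbatim
    (case-insensitively) in the model output."""
    return expected_value.lower() in model_output.lower()
```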
- Question-Answer Task:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 | 24 | 29 |
---|---|---|---|---|---|---|---|
QA Task Closedbook | 43.61% | - | - | - | - | - | - |
QA Task Oracle w/o QAC | 71.56% | - | - | - | - | - | - |
QA Task Oracle w/ QAC | 78.30% | - | - | - | - | - | - |
QA Task on 10 Doc w/o QAC | 62.63% | 57.21% | 65.98% | - | - | - | - |
QA Task on 10 Doc w/ QAC | 67.38% | 64.97% | 69.22% | - | - | - | - |
QA Task on 20 Doc w/o QAC | 58.56% | 53.18% | 54.80% | 55.55% | 64.44% | - | - |
QA Task on 20 Doc w/ QAC | 63.27% | 59.96% | 62.33% | 62.22% | 67.34% | - | - |
QA Task on 30 Doc w/o QAC | 57.92% | 44.63% | 45.64% | 48.73% | 51.33% | 50.99% | 63.76% |
QA Task on 30 Doc w/ QAC | 63.69% | 54.08% | 54.38% | 56.98% | 59.47% | 60.15% | 66.96% |
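The original paper scores QA with best-subspan exact match; assuming our evaluation follows the paper, a minimal sketch of that metric:

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def best_subspan_em(model_output: str, gold_answers: list[str]) -> bool:
    """Correct if any normalized gold answer is a substring of the
    normalized model output."""
    out = normalize(model_output)
    return any(normalize(ans) in out for ans in gold_answers)
```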
We have used the RWKV v4 instruction-tuned model with 14B parameters for all the testing; for details, refer to the Hugging Face Model Card.
- Key-Value Retrieval Task: RWKV models cannot perform this task.
- Question-Answer Task:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 | 24 | 29 |
---|---|---|---|---|---|---|---|
QA Task Closedbook | 30.77% | - | - | - | - | - | - |
QA Task Oracle w/o QAC | 80.56% | - | - | - | - | - | - |
QA Task Oracle w/ QAC | 76.19% | - | - | - | - | - | - |
QA Task on 10 Doc w/o QAC | 38.45% | 34.95% | 43.84% | - | - | - | - |
QA Task on 10 Doc w/ QAC | 38.34% | 33.59% | 40.94% | - | - | - | - |
QA Task on 20 Doc w/o QAC | 32.35% | 31.03% | 30.09% | 31.52% | 43.35% | - | - |
QA Task on 20 Doc w/ QAC | 32.91% | 30.88% | 30.13% | 30.80% | 39.77% | - | - |
QA Task on 30 Doc w/o QAC | 30.24% | 29.45% | 29.03% | 29.03% | 29.67% | 30.92% | 46.21% |
QA Task on 30 Doc w/ QAC | 29.64% | 28.88% | 28.17% | 27.68% | 28.02% | 29.26% | 43.50% |
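A minimal loading sketch, assuming the Hugging Face port of the model (`RWKV/rwkv-raven-14b`); the exact checkpoint and generation settings we used may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/rwkv-raven-14b"  # assumed HF port of RWKV v4 Raven 14B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Question: Who wrote Hamlet?\nAnswer:",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:],
                       skip_special_tokens=True))
```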
Note: We have used RWKV v4; "Raven" means the model is instruction fine-tuned, and "Pile" means it is the base model.
- Question-Answer Task with 20 Documents:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 |
---|---|---|---|---|---|
Raven 3B | 19.17% | 18.56% | 19.28% | 22.25% | 41.77% |
Raven 7B | 23.69% | 24.14% | 24.44% | 27.19% | 39.51% |
Raven 14B | 32.35% | 31.03% | 30.09% | 31.52% | 43.35% |
Pile 3B | 10.96% | 11.03% | 11.56% | 13.25% | 32.73% |
Pile 7B | 17.70% | 18.00% | 18.79% | 21.39% | 41.31% |
Pile 14B | 20.97% | 21.65% | 21.61% | 23.61% | 37.32% |
Note: Raven 14B and Pile 14B are the RWKV v4 models from above, repeated here for comparison with Mistral.
- Question-Answer Task with 20 Documents:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 |
---|---|---|---|---|---|
Raven 14B | 32.35% | 31.03% | 30.09% | 31.52% | 43.35% |
Pile 14B | 20.97% | 21.65% | 21.61% | 23.61% | 37.32% |
Mistral 7B - Instruct | 64.78% | 65.00% | 65.23% | 65.68% | 66.13% |
Mistral 7B - Base | 62.93% | 35.59% | 33.63% | 33.97% | 43.95% |
Note: We have used Mistral-7B-Instruct-v0.2 (see the model card). Here, sdpa = scaled dot-product attention and flash = flash_attention_2 (see the FlashAttention-2 paper and GitHub repo).
- Question-Answer Task with 20 Documents:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 |
---|---|---|---|---|---|
sdpa | 64.78% | 65.00% | 65.23% | 65.68% | 66.13% |
flash | 64.63% | 64.85% | 65.12% | 65.64% | 66.17% |
Conclusion: FlashAttention-2 marginally (negligibly) decreases the accuracy. Remark: the two experiments described in this section were performed on systems with different configurations, so the average time required to execute a prompt cannot be compared.
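For reproducibility, the two attention backends in the table above can be selected at load time via Hugging Face transformers (a sketch; flash_attention_2 additionally requires the flash-attn package and a supported GPU):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# PyTorch scaled dot-product attention backend:
model_sdpa = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="sdpa"
)
# FlashAttention-2 backend:
model_flash = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)
```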
We have used the pre-trained and fine-tuned LLaMA2 models with a 4k context length and 70B parameters (developed by Meta) for all the testing; for details, refer to the [Hugging Face Model Card](https://huggingface.co/meta-llama/Llama-2-70b).
- Key-Value Retrieval Task for 75 Keys:
Correct Key at location -> | 0 | 24 | 49 | 74 |
---|---|---|---|---|
W/O QAC | 60% | 0.68% | 0% | 0% |
As we can see from the above, LLaMA is not suitable for KV-retrieval tasks.
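A minimal loading sketch, assuming the transformers-format checkpoints (`meta-llama/Llama-2-70b-hf` for the base model; the fine-tuned chat variant is `meta-llama/Llama-2-70b-chat-hf`); access to these weights is gated:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # assumed transformers-format checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # shard the 70B weights across available GPUs
)
```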
- Question-Answer Task:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 | 24 | 29 |
---|---|---|---|---|---|---|---|
QA Task Closedbook | 51.33% | - | - | - | - | - | - |
QA Task Oracle w/o QAC | 90.24% | - | - | - | - | - | - |
QA Task Oracle w/ QAC | 89.52% | - | - | - | - | - | - |
QA Task on 10 Doc w/o QAC | 73.67% | 67.86% | 69.86% | - | - | - | - |
QA Task on 10 Doc w/ QAC | 77.47% | 70.50% | 68.51% | - | - | - | - |
QA Task on 20 Doc w/o QAC | 60.30% | 44.40% | 45.87% | 48.62% | 27.98% | - | - |
QA Task on 20 Doc w/ QAC | 70.32% | 56.66% | 58.71% | 54.16% | 35.90% | - | - |
QA Task on 30 Doc w/o QAC | 61.95% | 51.56% | 50.02% | 50.96% | 25.35% | 24.33% | 27.76% |
QA Task on 30 Doc w/ QAC | 70.79% | 59.17% | 53.29% | 56.72% | 34.08% | 34.68% | 35.59% |
- Key-Value Retrieval Task for 75 Keys:
Correct Key at location -> | 0 | 24 | 49 | 74 |
---|---|---|---|---|
W/O QAC | 99.6% | 98% | 98.2% | 96% |
W/ QAC | 99.2% | 100% | 100% | 99.8% |
- Key-Value Retrieval Task for 140 Keys:
Correct Key at location -> | 0 | 34 | 69 | 104 | 139 |
---|---|---|---|---|---|
W/O QAC | 99.2% | 94.8% | 94.6% | 96.2% | 88.6% |
W/ QAC | 98.8% | 100% | 100% | 100% | 92.6% |
- Key-Value Retrieval Task for 300 Keys:
Correct Key at location -> | 0 | 49 | 99 | 149 | 199 | 249 | 299 |
---|---|---|---|---|---|---|---|
W/O QAC | 99.6% | 85.4% | 84.8% | 83.2% | 76.8% | 90.4% | 73.4% |
W/ QAC | 97.6% | 99.8% | 99.8% | 99.8% | 99.4% | 99.8% | 26.2% |
- Question-Answer Task:
Relevant Document Location -> | 0 | 4 | 9 | 14 | 19 | 24 | 29 |
---|---|---|---|---|---|---|---|
QA Task Closedbook | % | - | - | - | - | - | - |
QA Task Oracle w/o QAC | % | - | - | - | - | - | - |
QA Task Oracle w/ QAC | % | - | - | - | - | - | - |
QA Task on 10 Doc w/o QAC | 71.60% | 68.85% | 68.24% | - | - | - | - |
QA Task on 10 Doc w/ QAC | 77.55% | 67.64% | 64.21% | - | - | - | - |
QA Task on 20 Doc w/o QAC | 64.78% | 65.00% | 65.23% | 65.68% | 66.13% | - | - |
QA Task on 20 Doc w/ QAC | 74.04% | 63.16% | 61.80% | 62.33% | 61.35% | - | - |
QA Task on 30 Doc w/o QAC | 62.18% | 60.15% | 62.22% | 64.18% | 63.08% | 64.25% | 66.29% |
QA Task on 30 Doc w/ QAC | 70.50% | 59.92% | 59.92% | 61.69% | 60.64% | 61.80% | 62.71% |
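To visualize the lost-in-the-middle effect across positions, a minimal plotting sketch using the 30-document w/o QAC row from the table above:

```python
import matplotlib.pyplot as plt

# 30-document w/o QAC row from the table above.
positions = [0, 4, 9, 14, 19, 24, 29]
accuracy = [62.18, 60.15, 62.22, 64.18, 63.08, 64.25, 66.29]

plt.plot(positions, accuracy, marker="o")
plt.xlabel("Position of the relevant document")
plt.ylabel("Accuracy (%)")
plt.title("Accuracy vs. relevant-document position (30 documents, w/o QAC)")
plt.show()
```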