Commit

[GH Actions] Automatically add papers from authors (#235)
Co-authored-by: xhluca <[email protected]>
github-actions[bot] and xhluca authored Aug 15, 2023
1 parent 4e9d8f3 commit d68f11c
Showing 2 changed files with 23 additions and 0 deletions.
22 changes: 22 additions & 0 deletions _posts/papers/2023-07-31-2307.16877.md
@@ -0,0 +1,22 @@
---
title: Evaluating Correctness and Faithfulness of Instruction-Following Models for
  Question Answering
venue: arXiv.org
names: Vaibhav Adlakha, Parishad BehnamGhader, Xing Han Lu, Nicholas Meade, Siva Reddy
tags:
- arXiv.org
link: https://arxiv.org/abs/2307.16877
author: Vaibhav Adlakha
categories: Publications

---

*{{ page.names }}*

**{{ page.venue }}**

{% include display-publication-links.html pub=page %}

## Abstract

Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as question answering (QA). By simply prepending retrieved documents to their input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to assess these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness. We then propose simple token-overlap based and model-based metrics that reflect the true performance of these models. Our analysis reveals that instruction-following models are competitive, and sometimes even outperform fine-tuned models for correctness. However, these models struggle to stick to the provided knowledge and often hallucinate in their responses. We hope our work encourages a more holistic evaluation of instruction-following models for QA. Our code and data are available at https://github.com/McGill-NLP/instruct-qa
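
The abstract mentions simple token-overlap based correctness metrics as a more forgiving alternative to exact match on verbose responses. As an illustration only (not the implementation in the linked instruct-qa repository), a recall-style token-overlap score could look like the minimal Python sketch below; the function name `token_recall` and the naive whitespace/punctuation normalization are assumptions made here for clarity.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split on whitespace (naive tokenizer)."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def token_recall(response: str, gold_answer: str) -> float:
    """Fraction of gold-answer tokens that also appear in the model response."""
    gold = Counter(normalize(gold_answer))
    pred = Counter(normalize(response))
    if not gold:
        return 0.0
    overlap = sum(min(pred[tok], count) for tok, count in gold.items())
    return overlap / sum(gold.values())


# A verbose but correct answer still scores 1.0, unlike exact match.
print(token_recall("The capital of France is Paris.", "Paris"))  # -> 1.0
```

Because only gold-answer tokens are counted, the extra explanatory tokens produced by verbose instruction-following models are not penalized, which is what makes a recall-style score more tolerant than EM or precision-weighted F1.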
1 change: 1 addition & 0 deletions records/semantic_paper_ids_ignored.json
@@ -57,6 +57,7 @@
"6f6e2e0311589a9af045f6acd00b7dee6d19fce4",
"72d862256f707613a3c16cc79e490a69151d73bf",
"732020be199519fb197a1f2839c6f91ef0583ca7",
"76513f54fcecf7a380f77ad785f05c3bc869db4a",
"7676c5e0cbd1366d23549c4a773fcfc4d21bdb0e",
"7717ef7ae58f1969c3758b5ff4dc2ffce17088d1",
"780356ea2a4b7758f0e53173fa44357ab2ccb592",
