TREC iKAT 2023/2024 #260
Comments
Awesome! What's the corpus download process like? We can handle the case like we do for other licensed datasets: provide instructions in the software, and ask them to link the downloaded file somewhere that ir-datasets can pick it up.
As far as the structure goes -- can you clarify if the dataset is a typical clueweb22 split, or a special subset for ikat? If the former, we have a PR already set for CW22, and it should go under there. Something like If the latter, then it's probably a different top-level dataset? And it'd probably be structured like:
I realized that we already have an agreement for cw22, so I can request a copy and check :)
Yes, the raw dataset is included in clueweb22 (clueweb22-iKAT), but it needs a lot of processing to create the passage splits. So instead, we had a processed version, hosted on your server https://ikattrecweb.grill.science/, that could be accessed by contacting Andrew Ramsay to get the credentials.
As for the hierarchy, if we want to add the queries and qrels, I don't think we can do it under clueweb22/trec-ikat-2023, so it might be better to have a dedicated one?
I have integrated the code, @seanmacavaney can you take a look? I am not sure about the next steps for the PR to be accepted. The hierarchy is as follows: As for the doc collection, we kindly ask people to link the 16 downloaded chunks of the collection in the folder
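For the linking step mentioned in the comments above, a minimal sketch of what a user might run is shown here. The function name `link_chunks` and both directory arguments are hypothetical; the actual target folder and chunk file names are not given in this thread, so they are left as parameters rather than guessed.

```python
from pathlib import Path


def link_chunks(download_dir, target_dir):
    """Symlink every regular file in download_dir into target_dir.

    Both paths are placeholders chosen by the caller; this thread does not
    state the folder that the dataset code expects.
    """
    download_dir = Path(download_dir)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    linked = []
    for chunk in sorted(download_dir.iterdir()):
        if not chunk.is_file():
            continue  # skip sub-directories
        link = target_dir / chunk.name
        # Leave any existing file or link untouched so the call can be repeated.
        if not link.exists() and not link.is_symlink():
            link.symlink_to(chunk.resolve())
        linked.append(link)
    return linked
```

Calling it on a directory holding the 16 downloaded chunks should return 16 paths, which gives a quick sanity check that nothing is missing before loading the dataset.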
Dataset Information:
The purpose is to add the processed TREC iKAT collection (the same collection is used for 2023 and 2024; it is a subset of ClueWeb22-B).
The iKAT shared task can be defined as personalized, retrieval-based "candidate response retrieval" in the context of a conversation.
The collection contains around 116,838,987 passages (with IDs of the form clueweb22-en0004-50-00170:0).
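To illustrate the passage ID form quoted above, here is a small sketch that splits an ID into its two parts. It assumes that the text before the last colon is the ClueWeb22 document ID and that the number after it is the passage index within that document; the names `PassageId` and `parse_passage_id` are hypothetical and not part of ir_datasets.

```python
from typing import NamedTuple


class PassageId(NamedTuple):
    doc_id: str          # assumed: the ClueWeb22 document identifier
    passage_index: int   # assumed: position of the passage within the document


def parse_passage_id(passage_id: str) -> PassageId:
    """Split an ID such as 'clueweb22-en0004-50-00170:0' at its last colon."""
    doc_id, sep, index = passage_id.rpartition(":")
    if not sep or not doc_id or not index.isdigit():
        raise ValueError(f"unexpected passage id: {passage_id!r}")
    return PassageId(doc_id, int(index))
```

For example, `parse_passage_id("clueweb22-en0004-50-00170:0")` gives a `doc_id` of `clueweb22-en0004-50-00170` and a `passage_index` of `0`.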
Links to Resources:
Guidelines from year 2023: https://www.trecikat.com/guidelines/
Overview of year 2023: https://arxiv.org/abs/2401.01330
Github of year 2023: https://github.com/irlabamsterdam/iKAT
Test topics and qrels 2023: https://trec.nist.gov/data/ikat2023.html
Dataset ID(s) & supported entities:
We can provide the documents and a flattened version of the conversations:
trec_ikat23/doc: collection of passages, 116M passages
trec_ikat23/queries: the flattened conversations (156 entries from the 24 topics)
trec_ikat23/qrels: qrels from the flattened conversations
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
ir_datasets/datasets/[topid].py
tests/integration/[topid].py
ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json
ir_datasets/etc/[topid].yaml
ir_datasets/etc/downloads.json
.github/workflows/verify_downloads.yml. Only one needed per topid.
downloads.json.
Additional comments/concerns/ideas/etc.
The collection requires a licence approved by CMU, is it possible to restrict the access of the collection? (more details below)
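As a rough illustration of the "flattened conversations" listed under the dataset IDs, the sketch below turns nested topics into one query entry per turn. The field names (`number`, `turns`, `turn_id`, `utterance`) and the `topic_turn` query ID format are assumptions made for this example, not a confirmed description of the iKAT topic files or of the submitted code.

```python
def flatten_topics(topics):
    """Return one {'query_id', 'text'} entry per conversation turn.

    `topics` is assumed to be a list of dicts, each with a 'number' and a
    list of 'turns'; each turn is assumed to carry 'turn_id' and 'utterance'.
    """
    entries = []
    for topic in topics:
        for turn in topic["turns"]:
            entries.append({
                # Hypothetical ID scheme: topic number and turn ID joined by '_'.
                "query_id": f"{topic['number']}_{turn['turn_id']}",
                "text": turn["utterance"],
            })
    return entries
```

Under these assumptions, flattening the 24 topics would produce one entry per turn, which is where a figure such as the 156 entries mentioned above would come from.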