TREC iKAT 2023/2024 #260

Open
5 of 8 tasks
SimonLupart opened this issue Apr 12, 2024 · 6 comments

SimonLupart commented Apr 12, 2024

Dataset Information:

The purpose is to add the processed TREC iKAT collection (same collection for years 2023 and 2024 [subset of ClueWeb22-B]).
The iKAT shared task can be described as personalized, retrieval-based "candidate response retrieval" in the context of a conversation.
The collection contains roughly 116.8M passages (116,838,987), with IDs of the form clueweb22-en0004-50-00170:0.
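
To make the ID format concrete, here is a minimal sketch that splits a passage ID at its last colon. It assumes the part before the colon is the ClueWeb22 document ID and the part after it is the passage index within that document; the thread only gives the example ID, so that reading and the names `PassageId` and `parse_passage_id` are illustrative, not part of ir_datasets.

```python
from typing import NamedTuple


class PassageId(NamedTuple):
    doc_id: str          # assumed: the ClueWeb22 document ID
    passage_index: int   # assumed: position of the passage in that document


def parse_passage_id(passage_id: str) -> PassageId:
    """Split an ID such as 'clueweb22-en0004-50-00170:0' at its last colon."""
    doc_id, sep, index = passage_id.rpartition(":")
    if not sep or not doc_id or not index.isdigit():
        raise ValueError(f"unexpected passage id: {passage_id!r}")
    return PassageId(doc_id, int(index))


parsed = parse_passage_id("clueweb22-en0004-50-00170:0")
print(parsed.doc_id, parsed.passage_index)
```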

Links to Resources:

Guidelines from year 2023: https://www.trecikat.com/guidelines/
Overview of year 2023: https://arxiv.org/abs/2401.01330
Github of year 2023: https://github.com/irlabamsterdam/iKAT
Test topics and qrels 2023: https://trec.nist.gov/data/ikat2023.html

Dataset ID(s) & supported entities:

We can provide the documents and a flattened version of the conversations:
trec_ikat23/doc : collection of passages, 116M passages
trec_ikat23/queries : the flattened conversations (156 entries from the 24 topics)
trec_ikat23/qrels : qrels from the flattened conversations
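
As a rough illustration of what "flattened" could mean here, the sketch below turns each turn of a conversation into its own query entry while keeping the earlier utterances as history. The field names (`number`, `turns`, `turn_id`, `utterance`), the `query_id` scheme, and the toy topic are assumptions made for the example, not the actual iKAT topic schema.

```python
from typing import Iterator


def flatten_topics(topics: list[dict]) -> Iterator[dict]:
    """Yield one query entry per conversation turn (hypothetical schema)."""
    for topic in topics:
        history: list[str] = []
        for turn in topic["turns"]:
            yield {
                # Assumed ID scheme: topic number and turn number joined by "_".
                "query_id": f'{topic["number"]}_{turn["turn_id"]}',
                "utterance": turn["utterance"],
                # Copy so later turns do not mutate earlier entries.
                "history": list(history),
            }
            history.append(turn["utterance"])


# Toy topic, made up for illustration only.
example_topics = [
    {
        "number": "1-1",
        "turns": [
            {"turn_id": 1, "utterance": "first question"},
            {"turn_id": 2, "utterance": "follow-up question"},
        ],
    }
]
flat = list(flatten_topics(example_topics))
print(len(flat), flat[1]["query_id"])
```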

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/[topid].py)
  • Tests (in tests/integration/[topid].py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/[topid].yaml)
  • Downloadable content (in ir_datasets/etc/downloads.json)
    • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

The collection requires a licence approved by CMU. Is it possible to restrict access to the collection? (More details below.)

💥 Document Collection: TREC iKAT 2023 ClueWeb22-B

The collection distribution is being handled directly by CMU and not the iKAT organizers. Please follow these steps to get your data license ASAP:

Sign the license form available on the ClueWeb22 project web page.
Send the form to CMU for approval ([email protected])

Please give enough time to the CMU licensing office to accept your request. A download link will be sent to you by the ClueWeb22 team at CMU.

Note:

CMU requires a signature from the organization (i.e., the university or company), not an individual who wants to use the data. This can slow down the process at your end too. So, it’s useful to start the process ASAP.
If you already have an accepted license for ClueWeb22, you don’t need a new form. Please let us know if that’s the case.

@seanmacavaney
Collaborator

Awesome!

What's the corpus download process like? We can handle the case like we do for other licensed datasets: provide instructions in the software, and ask them to link the downloaded file somewhere that ir-datasets can pick it up.

@seanmacavaney
Collaborator

As far as the structure goes -- can you clarify if the dataset is a typical clueweb22 split, or a special subset for ikat?

If the former, we have a PR already set for CW22, and it should go under there. Something like clueweb22/trec-ikat-2023 and clueweb22/trec-ikat-2024

If the latter, then it's probably a different top-level dataset? And it'd probably be structured like: ikat/trec-2023 and ikat/trec-2024 (or similar)

@seanmacavaney
Collaborator

I realized that we already have an agreement for cw22, so I can request a copy and check :)

@SimonLupart
Author

Yes, the raw dataset is included in clueweb22 (clueweb22-iKAT), but it needs a lot of processing to create the passage splits. So instead, we had a processed version, hosted on your server https://ikattrecweb.grill.science/, that could be accessed by contacting Andrew Ramsay to get the credentials.

@SimonLupart
Author

as for the hierarchy, if we want to add the queries and qrels I don't think we can do it under clueweb22/trec-ikat-2023, so it might be better to have a dedicated one?

@SimonLupart
Author

I have integrated the code. @seanmacavaney, can you take a look? I am not sure about the next steps for the PR to be accepted.

The hierarchy is as follows:
trec-ikat/2023 -> doc collection - qrels - both train and test queries
trec-ikat/2023/judged -> subset of queries with relevance judgement in the qrels from NIST assessors.
trec-ikat/2023/judged/ptkb -> qrels of the ptkb (see ikat description)
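
The relationship between trec-ikat/2023 and trec-ikat/2023/judged can be sketched as a filter that keeps only the queries with at least one qrel. This is a plain-Python illustration with made-up records and a hypothetical `judged_queries` helper; it is not the code in the PR.

```python
def judged_queries(queries: list[dict], qrels: list[dict]) -> list[dict]:
    """Keep the queries that have at least one relevance judgment."""
    judged_ids = {qrel["query_id"] for qrel in qrels}
    return [query for query in queries if query["query_id"] in judged_ids]


# Toy records, made up for illustration only.
queries = [
    {"query_id": "1-1_1", "utterance": "first question"},
    {"query_id": "1-1_2", "utterance": "follow-up question"},
]
qrels = [
    {"query_id": "1-1_2", "doc_id": "clueweb22-en0004-50-00170:0", "relevance": 1},
]
judged = judged_queries(queries, qrels)
print([query["query_id"] for query in judged])
```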

As for the doc collection, we kindly ask people to link the 16 downloaded chunks of the collection (.jsonl.bz2 files) into the folder .ir_datasets/trec-ikat/TREC-Ikat-CW22-passage/
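
For anyone who wants to sanity-check the linked chunks before loading them, a minimal sketch of reading them with the standard library follows. It assumes each chunk holds one JSON object per line; `iter_passages` is a hypothetical helper, and placing the folder under the home directory in the usage comment is also an assumption.

```python
import bz2
import json
from pathlib import Path
from typing import Iterator


def iter_passages(folder: Path) -> Iterator[dict]:
    """Yield one parsed JSON record per non-empty line, chunk by chunk."""
    for chunk in sorted(folder.glob("*.jsonl.bz2")):
        # Text mode decompresses on the fly, so chunks are never fully in memory.
        with bz2.open(chunk, mode="rt", encoding="utf-8") as lines:
            for line in lines:
                if line.strip():
                    yield json.loads(line)


# Example usage (assumes the folder lives under the home directory):
# folder = Path.home() / ".ir_datasets" / "trec-ikat" / "TREC-Ikat-CW22-passage"
# first = next(iter_passages(folder))
```

Iterating lazily matters here because the collection holds over 116M passages; counting the records across all 16 chunks is a quick way to confirm that the download is complete.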

Getting the license to use the collection can be time-consuming and would be handled by CMU, not the iKAT organizers. Please follow these steps to get your data license ASAP:
