TREC iKAT 2023/2024 #260

SimonLupart · 2024-04-12T12:53:54Z

Dataset Information:

The goal is to add the processed TREC iKAT collection, a subset of ClueWeb22-B that is shared by the 2023 and 2024 editions.
The iKAT shared task can be defined as personalized, retrieval-based "candidate response retrieval" in the context of a conversation.
The collection contains around 116,838,987 passages (with IDs of the form clueweb22-en0004-50-00170:0).
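
To make the ID format concrete, here is a minimal sketch that splits a passage ID into its two parts. It assumes that the text before the last colon is the ClueWeb22 document ID and the number after it is the passage index within that document; the thread only shows one example ID, so treat this reading as an assumption. The names `PassageId` and `parse_passage_id` are hypothetical and not part of ir_datasets.

```python
from typing import NamedTuple


class PassageId(NamedTuple):
    doc_id: str         # assumed: the ClueWeb22 document ID
    passage_index: int  # assumed: position of the passage within the document


def parse_passage_id(passage_id: str) -> PassageId:
    """Split '<doc_id>:<passage_index>' on the last colon."""
    doc_id, sep, index = passage_id.rpartition(":")
    if not sep or not doc_id or not index.isdecimal():
        raise ValueError(f"unexpected passage id: {passage_id!r}")
    return PassageId(doc_id, int(index))


print(parse_passage_id("clueweb22-en0004-50-00170:0"))
# PassageId(doc_id='clueweb22-en0004-50-00170', passage_index=0)
```

Splitting on the last colon rather than the first keeps the sketch working even if a document ID ever contained a colon itself.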

Links to Resources:

Guidelines from year 2023: https://www.trecikat.com/guidelines/
Overview of year 2023: https://arxiv.org/abs/2401.01330
Github of year 2023: https://github.com/irlabamsterdam/iKAT
Test topics and qrels 2023: https://trec.nist.gov/data/ikat2023.html

Dataset ID(s) & supported entities:

We can provide the documents and a flattened version of the conversations:
trec_ikat23/doc : collection of passages, 116M passages
trec_ikat23/queries : the flattened conversations (156 entries from the 24 topics)
trec_ikat23/qrels : qrels for the flattened conversations

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

Dataset definition (in ir_datasets/datasets/[topid].py)
Tests (in tests/integration/[topid].py)
Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
Documentation (in ir_datasets/etc/[topid].yaml)
- Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
Downloadable content (in ir_datasets/etc/downloads.json)
- Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
- Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

The collection requires a license approved by CMU. Is it possible to restrict access to the collection? (More details below.)

💥 Document Collection: TREC iKAT 2023 ClueWeb22-B

The collection distribution is being handled directly by CMU and not the iKAT organizers. Please follow these steps to get your data license ASAP:

Sign the license form available on the ClueWeb22 project web page.
Send the form to CMU for approval ([email protected])

Please give enough time to the CMU licensing office to accept your request. A download link will be sent to you by the ClueWeb22 team at CMU.

Note:

CMU requires a signature from the organization (i.e., the university or company), not an individual who wants to use the data. This can slow down the process at your end too. So, it’s useful to start the process ASAP.
If you already have an accepted license for ClueWeb22, you don’t need a new form. Please let us know if that’s the case.


seanmacavaney · 2024-04-12T14:11:10Z

Awesome!

What's the corpus download process like? We can handle the case like we do for other licensed datasets: provide instructions in the software, and ask them to link the downloaded file somewhere that ir-datasets can pick it up.

seanmacavaney · 2024-04-12T14:14:15Z

As far as the structure goes -- can you clarify if the dataset is a typical clueweb22 split, or a special subset for ikat?

If the former, we have a PR already set for CW22, and it should go under there. Something like clueweb22/trec-ikat-2023 and clueweb22/trec-ikat-2024

If the latter, then it's probably a different top-level dataset? And it'd probably be structured like: ikat/trec-2023 and ikat/trec-2024 (or similar)

seanmacavaney · 2024-04-15T09:23:38Z

I realized that we already have an agreement for cw22, so I can request a copy and check :)

SimonLupart · 2024-04-15T14:16:21Z

Yes, the raw dataset is included in clueweb22 (clueweb22-iKAT), but it needs a lot of processing to create the passage splits. So instead, we have a processed version, hosted on your server https://ikattrecweb.grill.science/, that can be accessed by contacting Andrew Ramsay to get the credentials.

SimonLupart · 2024-04-15T14:26:42Z

As for the hierarchy, if we want to add the queries and qrels, I don't think we can do it under clueweb22/trec-ikat-2023, so it might be better to have a dedicated one?

SimonLupart · 2024-05-02T09:16:36Z

I have integrated the code. @seanmacavaney, could you take a look? I am not sure about the next steps for the PR to be accepted.

The hierarchy is as follows:
trec-ikat/2023 -> doc collection, qrels, and both train and test queries
trec-ikat/2023/judged -> subset of queries with relevance judgments in the qrels from NIST assessors
trec-ikat/2023/judged/ptkb -> qrels of the PTKB (see the iKAT description)

As for the doc collection, we kindly ask people to link the 16 downloaded chunks of the collection (.jsonl.bz2 files) in the folder .ir_datasets/trec-ikat/TREC-Ikat-CW22-passage/
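
As a quick way to check that the chunks are linked where expected, the following sketch iterates over every .jsonl.bz2 file in that folder and yields one parsed JSON object per line. It is not the loader from the PR: the function name `iter_passages` is hypothetical, placing `.ir_datasets` under the home directory is an assumption, and since the thread does not describe the JSON fields, the records are yielded as plain dicts.

```python
import bz2
import json
from pathlib import Path
from typing import Iterator

# Assumption: the thread gives the folder relative to ".ir_datasets";
# resolving it under the home directory is a guess, adjust as needed.
COLLECTION_DIR = Path.home() / ".ir_datasets" / "trec-ikat" / "TREC-Ikat-CW22-passage"


def iter_passages(collection_dir: Path = COLLECTION_DIR) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of every *.jsonl.bz2 chunk."""
    # Sort the chunk paths so the iteration order is stable across runs.
    for chunk in sorted(collection_dir.glob("*.jsonl.bz2")):
        # bz2.open in text mode decompresses on the fly, line by line.
        with bz2.open(chunk, "rt", encoding="utf-8") as lines:
            for line in lines:
                if line.strip():
                    yield json.loads(line)
```

For example, `next(iter_passages())` would show the first record and its field names. If the folder is missing or empty, the generator simply yields nothing, so counting the matched files first is a reasonable sanity check.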

Getting the license to use the collection can be time-consuming and is handled by CMU, not the iKAT organizers. Please follow these steps to get your data license ASAP:

Sign the license form available on the ClueWeb22 project web page: https://lemurproject.org/clueweb22/obtain.php and send the form to CMU for approval ([email protected]).

Once you have the license, send an email to Andrew Ramsay <[email protected]> to get access to a download link for the preprocessed iKAT passage collection (these are the 16 chunks).

SimonLupart added the add-dataset label Apr 12, 2024

SimonLupart mentioned this issue May 1, 2024

add trec ikat 2023 #264

Open
