Information Retrieval Datasets

This section lists native IR dataset definitions.

MS MARCO Passage

The MS MARCO (Microsoft Machine Reading Comprehension) Passage Ranking dataset. One of the most widely used benchmarks for neural IR research.

Contains ~8.8M passages and ~500K training queries with sparse relevance judgments.

MS MARCO Passage Ranking collection.

A large-scale dataset focused on machine reading comprehension, question answering, and passage ranking. The passage reranking task provides a query and the top-1000 BM25 passages; a system is expected to rank the most relevant passage as high as possible. Not all 1000 passages are judged; evaluation uses MRR.

Publication: Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In CoCo@NIPS.

See MSMARCO-Passage-Ranking (https://github.com/microsoft/MSMARCO-Passage-Ranking) for more details.
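
The MRR evaluation mentioned in the description can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function name, the input dictionaries, and the cutoff of 10 (the commonly reported MRR@10 setting) are assumptions made for the example.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    cutoff: int = 10,
) -> float:
    """Average, over queries, of 1/rank of the first relevant passage.

    A query contributes 0 when no relevant passage appears in the
    first `cutoff` positions of its ranking.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked_pids in rankings.items():
        judged_relevant = relevant.get(qid, set())
        for rank, pid in enumerate(ranked_pids[:cutoff], start=1):
            if pid in judged_relevant:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(rankings)


# Toy data, made up for illustration only
rankings = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2"]}
relevant = {"q1": {"p7"}, "q2": {"p5"}}
score = mean_reciprocal_rank(rankings, relevant)
```

Because judgments are sparse, an unjudged passage is treated here as non-relevant, which is the usual convention for this benchmark.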

Dataset com.microsoft.msmarco.passage.documents

datamaestro_ir.data.stores.MsMarcoPassagesStore

MS-Marco passage collection and small query/qrel files.

Tags: passage, collection

Downloads collectionandqueries.tar.gz once, builds the document store, and extracts query/qrel files for dev-small and eval-small splits.

Format is TSV (pid \t content)
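
As a rough illustration of the two-column TSV layout described above, the sketch below reads passage lines into (pid, content) pairs. The helper name `iter_passages` and the sample text are hypothetical; the store class itself handles this step, and its internals are not shown in this section.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_passages(stream: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (pid, content) pairs from tab-separated lines."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Split on the first tab only, so tabs inside the content survive
        pid, content = line.split("\t", 1)
        yield pid, content


# Made-up sample in the same two-column layout
sample = io.StringIO("0\tfirst passage\n1\tsecond passage\n")
passages = list(iter_passages(sample))
```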

Dataset com.microsoft.msmarco.passage.train.run

datamaestro_ir.data.csv.AdhocRunWithText

Tags: run

TSV format: qid, pid, query, passage
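
A run file in this four-column layout can be grouped into per-query candidate lists for reranking. The sketch below assumes tab-separated columns in the order given above; the function name and the sample rows are hypothetical and only illustrate the layout.

```python
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple


def group_run_by_query(
    stream: TextIO,
) -> Tuple[Dict[str, str], Dict[str, List[Tuple[str, str]]]]:
    """Return (queries, candidates) parsed from qid/pid/query/passage rows."""
    queries: Dict[str, str] = {}
    candidates: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        qid, pid, query, passage = line.split("\t", 3)
        queries[qid] = query
        candidates[qid].append((pid, passage))
    return queries, dict(candidates)


# Made-up rows for illustration
sample = io.StringIO(
    "q1\tp1\twhat is x\tx is a letter\n"
    "q1\tp2\twhat is x\tanother passage\n"
    "q2\tp3\twho is y\ty is a person\n"
)
queries, candidates = group_run_by_query(sample)
```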

Dataset com.microsoft.msmarco.passage.train.queries

datamaestro_ir.data.csv.Topics

Tags: topics

Dataset com.microsoft.msmarco.passage.train.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset com.microsoft.msmarco.passage.train

datamaestro_ir.data.Adhoc

MS-Marco train dataset

Tasks: adhoc retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.train.withrun

datamaestro_ir.data.RerankAdhoc

MSMarco train dataset, including the top-1000 documents to re-rank

Tasks: adhoc retrieval, learning to rank

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.train.triples.id

datamaestro_ir.data.TrainingTripletsLines

Full training triples (query, positive passage, negative passage) with IDs

Tags: triples

Tasks: learning to rank

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.train.triples.small.text

datamaestro_ir.data.TrainingTripletsLines

Small training triples (query, positive passage, negative passage) with text

Tags: triples

Tasks: learning to rank

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev.queries

datamaestro_ir.data.csv.Topics

Tags: topics

Dataset com.microsoft.msmarco.passage.dev.run

datamaestro_ir.data.csv.AdhocRunWithText

Tags: run

Dataset com.microsoft.msmarco.passage.dev.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset com.microsoft.msmarco.passage.dev

datamaestro_ir.data.Adhoc

MS-Marco dev dataset

Tasks: adhoc retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev.withrun

datamaestro_ir.data.RerankAdhoc

MSMarco dev dataset, including the top-1000 documents to re-rank

Tasks: adhoc retrieval, learning to rank

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev.judged

datamaestro_ir.data.Adhoc

MS-Marco dev dataset, restricted to judged queries

Tasks: adhoc retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
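
The "judged" variants restrict a topic set to queries that have relevance judgments. A minimal sketch of that filtering idea is shown below. It assumes that "judged" means "has at least one assessment", and it uses plain dictionaries rather than the library's own classes; the function name is hypothetical.

```python
from typing import Dict


def restrict_to_judged(
    topics: Dict[str, str],
    qrels: Dict[str, Dict[str, int]],
) -> Dict[str, str]:
    """Keep only the topics that have at least one judged passage."""
    return {qid: text for qid, text in topics.items() if qrels.get(qid)}


# Made-up data for illustration
topics = {"q1": "first query", "q2": "second query", "q3": "third query"}
qrels = {"q1": {"p1": 1}, "q3": {}}
judged = restrict_to_judged(topics, qrels)
```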

Dataset com.microsoft.msmarco.passage.eval.withrun

datamaestro_ir.data.csv.AdhocRunWithText

Tags: run

Dataset com.microsoft.msmarco.passage.dev.small

datamaestro_ir.data.Adhoc

MS-Marco dev small dataset

Tasks: adhoc retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.eval.queries.small

datamaestro_ir.data.csv.Topics

MS-Marco eval small queries

Tags: topics

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.trec2019.queries

datamaestro_ir.data.csv.Topics

Tags: topics

Dataset com.microsoft.msmarco.passage.trec2019.run

datamaestro_ir.data.csv.AdhocRunWithText

Tags: run

Dataset com.microsoft.msmarco.passage.trec2019.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset com.microsoft.msmarco.passage.trec2019

datamaestro_ir.data.Adhoc

TREC Deep Learning (2019)

Tasks: adhoc retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html

Dataset com.microsoft.msmarco.passage.trec2019.withrun

datamaestro_ir.data.RerankAdhoc

TREC Deep Learning (2019), including the top-1000 documents to re-rank

Tasks: adhoc retrieval, learning to rank

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html

Dataset com.microsoft.msmarco.passage.trec2019.judged

datamaestro_ir.data.Adhoc

TREC Deep Learning (2019), restricted to judged queries

Tasks: adhoc retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html

Dataset com.microsoft.msmarco.passage.trec2020.queries

datamaestro_ir.data.csv.Topics

TREC Deep Learning 2020 (topics)

Tags: topics

Topics of the TREC 2020 MS-Marco Deep Learning track

Dataset com.microsoft.msmarco.passage.trec2020.run

datamaestro_ir.data.csv.AdhocRunWithText

TREC Deep Learning (2020)

Tags: run

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html

Set of query/passage pairs for the passage re-ranking task (TREC 2020)

Dataset com.microsoft.msmarco.passage.trec2020.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset com.microsoft.msmarco.passage.trec2020

datamaestro_ir.data.Adhoc

TREC Deep Learning (2020)

Tasks: adhoc retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html

Dataset com.microsoft.msmarco.passage.trec2020.withrun

datamaestro_ir.data.RerankAdhoc

TREC Deep Learning (2020), including the top-1000 documents to re-rank

Tasks: adhoc retrieval, learning to rank

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html

Dataset com.microsoft.msmarco.passage.trec2020.judged

datamaestro_ir.data.Adhoc

TREC Deep Learning (2020), restricted to judged queries

Tasks: adhoc retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html

Dataset com.microsoft.msmarco.passage.trec.dl.hard.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

TREC DL-Hard qrels (passage)

Tags: qrels

External link: https://github.com/grill-lab/DL-Hard

Dataset com.microsoft.msmarco.passage.trec.dl.hard

datamaestro_ir.data.Adhoc

A more challenging subset of TREC DL 2019 and 2020 passage queries

Tasks: adhoc retrieval

External link: https://github.com/grill-lab/DL-Hard

See: Mackie et al., “How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset”, SIGIR 2021.

Example usage:

from datamaestro import prepare_dataset
from datamaestro.record import IDItem, TextItem

# Load an adhoc dataset (here, the dev-small split)
adhoc = prepare_dataset("com.microsoft.msmarco.passage.dev.small")

# Iterate over documents
for doc in adhoc.documents.iter_documents():
    doc_id = doc[IDItem].id
    text = doc[TextItem].text

# Load training triplets (full set, with IDs)
triplets = prepare_dataset("com.microsoft.msmarco.passage.train.triples.id")
for triplet in triplets.iter():
    query = triplet.query
    pos_doc = triplet.positive
    neg_doc = triplet.negative

BEIR Benchmark

The BEIR (Benchmarking IR) benchmark is a heterogeneous collection of IR tasks for evaluating zero-shot retrieval models. It includes datasets from question answering, fact verification, citation prediction, and more.

BEIR (Benchmarking IR) benchmark datasets.

Provides native datamaestro dataset definitions for ~15 BEIR datasets plus 12 CQADupStack sub-datasets, using CompressedDocumentStore for efficient document storage.

Each dataset downloads its ZIP once (transient). The docstore is built from the corpus, and queries/qrels are copied out before cleanup.

See: https://github.com/beir-cellar/beir
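
For orientation, the sketch below reads the two kinds of files the description refers to (corpus and qrels). It assumes the layout used by the public BEIR archives: a JSON-lines corpus with `_id`, `title`, and `text` fields, and a tab-separated qrels file with a `query-id`, `corpus-id`, `score` header. How datamaestro_ir reads these files internally is not shown in this section, and the function names are hypothetical.

```python
import csv
import io
import json
from typing import Dict, Iterator, TextIO, Tuple


def iter_corpus(stream: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (doc_id, title, text) from a JSON-lines corpus."""
    for line in stream:
        if line.strip():
            record = json.loads(line)
            yield record["_id"], record.get("title", ""), record["text"]


def load_qrels(stream: TextIO) -> Dict[str, Dict[str, int]]:
    """Read a TSV qrels file with a header row into nested dicts."""
    qrels: Dict[str, Dict[str, int]] = {}
    for row in csv.DictReader(stream, delimiter="\t"):
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# Made-up samples for illustration
corpus_sample = io.StringIO(
    '{"_id": "d1", "title": "A title", "text": "Some text"}\n'
    '{"_id": "d2", "title": "", "text": "Other text"}\n'
)
qrels_sample = io.StringIO("query-id\tcorpus-id\tscore\nq1\td1\t1\nq1\td2\t0\n")
docs = list(iter_corpus(corpus_sample))
qrels = load_qrels(qrels_sample)
```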

Dataset org.beir.trec.covid

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.nq

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.arguana

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.webis.touche2020

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.webis.touche2020.v2

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.climate.fever

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.scidocs

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.nfcorpus.collection

datamaestro_ir.data.beir.BeirDocumentStore

Tags: document, collection

Dataset org.beir.nfcorpus.train

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.nfcorpus.dev

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.nfcorpus.test

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.hotpotqa.collection

datamaestro_ir.data.beir.BeirDocumentStore

Tags: document, collection

Dataset org.beir.hotpotqa.train

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.hotpotqa.dev

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.hotpotqa.test

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.fiqa.collection

datamaestro_ir.data.beir.BeirDocumentStore

Tags: document, collection

Dataset org.beir.fiqa.train

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.fiqa.dev

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.fiqa.test

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.quora.collection

datamaestro_ir.data.beir.BeirDocumentStore

Tags: document, collection

Dataset org.beir.quora.dev

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.quora.test

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.dbpedia.entity.collection

datamaestro_ir.data.beir.BeirDocumentStore

Tags: document, collection

Dataset org.beir.dbpedia.entity.dev

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.dbpedia.entity.test

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.fever.collection

datamaestro_ir.data.beir.BeirDocumentStore

Tags: document, collection

Dataset org.beir.fever.train

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.fever.dev

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.fever.test

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.scifact.collection

datamaestro_ir.data.beir.BeirDocumentStore

Tags: document, collection

Dataset org.beir.scifact.train

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.scifact.test

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.android

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.english

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.gaming

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.gis

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.mathematica

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.physics

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.programmers

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.stats

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.tex

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.unix

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.webmasters

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Dataset org.beir.cqadupstack.wordpress

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://github.com/beir-cellar/beir

Example usage:

from datamaestro import prepare_dataset

# Load a single-split dataset
adhoc = prepare_dataset("org.beir.scidocs")

# Load a multi-split dataset
adhoc = prepare_dataset("org.beir.nfcorpus.test")

# Access components
for doc in adhoc.documents.iter():
    print(doc["id"], doc["text_item"].text)

LoTTE Benchmark

The LoTTE (Long-Tail Topic-stratified Evaluation) benchmark from ColBERTv2. Contains 6 domains (lifestyle, recreation, science, technology, writing, pooled) with dev/test splits and two query types (search, forum) per split.

LoTTE (Long-Tail Topic-stratified Evaluation) benchmark datasets.

Provides native datamaestro dataset definitions for 6 domains x 2 splits x 2 query types = 24 IR tasks. A single 3.6GB tar.gz is downloaded once; per-domain docstores are built, and query/qrel files are copied out.

See: https://github.com/stanford-futuredata/ColBERT
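
The 24 task identifiers follow a regular pattern, so they can be enumerated rather than typed by hand. The sketch below rebuilds them from the domain, split, and query-type names used in the listing that follows; the helper name is hypothetical.

```python
from itertools import product
from typing import List

DOMAINS = ["lifestyle", "recreation", "science", "technology", "writing", "pooled"]
SPLITS = ["dev", "test"]
QUERY_TYPES = ["search", "forum"]


def lotte_dataset_ids() -> List[str]:
    """Build every edu.stanford.lotte.<domain>.<split>.<query type> ID."""
    return [
        f"edu.stanford.lotte.{domain}.{split}.{query_type}"
        for domain, split, query_type in product(DOMAINS, SPLITS, QUERY_TYPES)
    ]


ids = lotte_dataset_ids()
```

Each resulting string can then be passed to `prepare_dataset`, for example to evaluate one model across all domains in a loop.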

Dataset edu.stanford.lotte.lotte.data

datamaestro_ir.data.lotte.LotteDocumentStore

Tags: passage, collection

Dataset edu.stanford.lotte.lifestyle.dev.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.lifestyle.dev.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.lifestyle.test.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.lifestyle.test.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.recreation.dev.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.recreation.dev.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.recreation.test.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.recreation.test.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.science.dev.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.science.dev.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.science.test.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.science.test.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.technology.dev.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.technology.dev.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.technology.test.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.technology.test.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.writing.dev.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.writing.dev.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.writing.test.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.writing.test.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.pooled.dev.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.pooled.dev.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.pooled.test.search

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Dataset edu.stanford.lotte.pooled.test.forum

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz

Example usage:

from datamaestro import prepare_dataset

# Load a specific task
adhoc = prepare_dataset("edu.stanford.lotte.science.test.search")

# Access components
for doc in adhoc.documents.iter():
    print(doc["id"], doc["text_item"].text)

TIPSTER Collections

The TIPSTER document collections used in TREC evaluations, organized by source.

TIPSTER document collections.

Also known as the Text Research Collection Volume or TREC. Sponsored by ARPA/SISTO to advance the state of the art in document detection and data extraction from large, real-world collections. The detection data is a test collection built at NIST for the TIPSTER and related TREC projects: three CD-ROMs of SGML-encoded documents from LDC plus queries and relevance judgments from NIST.

See also https://trec.nist.gov/data/docs_eng.html and https://trec.nist.gov/data/intro_eng.html.

Dataset gov.nist.trec.tipster.ap88

datamaestro_ir.data.trec.TipsterCollection

Associated Press document collection (1988)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ap89

datamaestro_ir.data.trec.TipsterCollection

Associated Press document collection (1989)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ap90

datamaestro_ir.data.trec.TipsterCollection

Associated Press document collection (1990)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.doe1

datamaestro_ir.data.trec.TipsterCollection

Department of Energy documents

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj87

datamaestro_ir.data.trec.TipsterCollection

Wall Street Journal (1987)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj88

datamaestro_ir.data.trec.TipsterCollection

Wall Street Journal (1988)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj89

datamaestro_ir.data.trec.TipsterCollection

Wall Street Journal (1989)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj90

datamaestro_ir.data.trec.TipsterCollection

Wall Street Journal (1990)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj91

datamaestro_ir.data.trec.TipsterCollection

Wall Street Journal (1991)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj92

datamaestro_ir.data.trec.TipsterCollection

Wall Street Journal (1992)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fr88

datamaestro_ir.data.trec.TipsterCollection

Federal Register (1988)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fr89

datamaestro_ir.data.trec.TipsterCollection

Federal Register (1989)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fr94

datamaestro_ir.data.trec.TipsterCollection

Federal Register (1994)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ziff1

datamaestro_ir.data.trec.TipsterCollection

Information from the Computer Select disks (1989-90)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ziff2

datamaestro_ir.data.trec.TipsterCollection

Information from the Computer Select disks (1989-90)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ziff3

datamaestro_ir.data.trec.TipsterCollection

Information from the Computer Select disks (1990-91)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.sjm1

datamaestro_ir.data.trec.TipsterCollection

San Jose Mercury News (1991)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.cr1

datamaestro_ir.data.trec.TipsterCollection

Congressional Record (1993)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ft1

datamaestro_ir.data.trec.TipsterCollection

Financial Times

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fbis1

datamaestro_ir.data.trec.TipsterCollection

Foreign Broadcast Information Service (1996)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.la8990

datamaestro_ir.data.trec.TipsterCollection

Los Angeles Times (1989-90)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC93T3A

AQUAINT

The AQUAINT Corpus consists of newswire text data in English from three sources: Xinhua News Service, New York Times, and Associated Press.

AQUAINT newswire corpus.

LDC catalog number LDC2002T31 (ISBN 1-58563-240-6). Newswire text in English drawn from three sources: the Xinhua News Service (PRC), the New York Times News Service, and the Associated Press Worldstream News Service. Prepared by the LDC for the AQUAINT Project and used in official NIST benchmark evaluations.

Dataset edu.upenn.ldc.aquaint.apw

datamaestro_ir.data.trec.TipsterCollection

Associated Press (1998-2000)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC2002T31

Dataset edu.upenn.ldc.aquaint.nyt

datamaestro_ir.data.trec.TipsterCollection

New York Times (1998-2000)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC2002T31

Dataset edu.upenn.ldc.aquaint.xie

datamaestro_ir.data.trec.TipsterCollection

Xinhua News Agency newswires (1996-2000)

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC2002T31

Dataset edu.upenn.ldc.aquaint

datamaestro_ir.data.trec.TipsterCollection

Aquaint documents

Tags: document, collection

External link: https://catalog.ldc.upenn.edu/LDC2002T31

TREC Ad Hoc

Classic TREC Ad Hoc test collections from NIST. These collections have been fundamental benchmarks in IR research since the 1990s.

TREC Adhoc datasets and tasks.

See https://trec.nist.gov/data/test_coll.html.

Dataset gov.nist.trec.adhoc.1.documents

datamaestro_ir.data.trec.TipsterCollection

TREC-1 to TREC-3 documents (TIPSTER volumes 1 and 2)

Tags: document, collection

Dataset gov.nist.trec.adhoc.1.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.1.assessments

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.1

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC 1 (1992)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.2.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.2.assessments

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.2

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC 2 (1993)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.3.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.3.assessments

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.3

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC 3 (1994)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.4.documents

datamaestro_ir.data.trec.TipsterCollection

TREC-4 documents

Tags: document, collection

Dataset gov.nist.trec.adhoc.4.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.4.assessments

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.4

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC 4 (1995)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.5.documents

datamaestro_ir.data.trec.TipsterCollection

TREC-5 documents

Tags: document, collection

Dataset gov.nist.trec.adhoc.5.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.5.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.5

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC 5 (1996)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.6.documents

datamaestro_ir.data.trec.TipsterCollection

TREC-6 documents

Tags: document, collection

Dataset gov.nist.trec.adhoc.6.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.6.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.6

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC 6 (1997)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.7.documents

datamaestro_ir.data.trec.TipsterCollection

TREC-7 documents

Tags: document, collection

Dataset gov.nist.trec.adhoc.7.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.7.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.7

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC 7 (1998)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.8.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.8.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.8

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC 8 (1999)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.robust.2004.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.robust.2004.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.robust.2004

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC Robust (2004)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.robust.2005.topics

datamaestro_ir.data.trec.TrecTopics

Tags: topics

Dataset gov.nist.trec.adhoc.robust.2005.qrels

datamaestro_ir.data.trec.TrecAdhocAssessments

Tags: qrels

Dataset gov.nist.trec.adhoc.robust.2005

datamaestro_ir.data.Adhoc

Ad-hoc task of TREC Robust (2005)

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.1.documents.store

datamaestro_ir.data.stores.TipsterDocumentStore

Tags: document, collection

Dataset gov.nist.trec.adhoc.4.documents.store

datamaestro_ir.data.stores.TipsterDocumentStore

Tags: document, collection

Dataset gov.nist.trec.adhoc.5.documents.store

datamaestro_ir.data.stores.TipsterDocumentStore

Tags: document, collection

Dataset gov.nist.trec.adhoc.6.documents.store

datamaestro_ir.data.stores.TipsterDocumentStore

Tags: document, collection

Dataset gov.nist.trec.adhoc.7.documents.store

datamaestro_ir.data.stores.TipsterDocumentStore

Tags: document, collection

Dataset edu.upenn.ldc.aquaint.store

datamaestro_ir.data.stores.TipsterDocumentStore

Tags: document, collection

Dataset gov.nist.trec.adhoc.1.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.2.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.3.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.4.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.5.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.6.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.7.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.8.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.robust.2004.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Dataset gov.nist.trec.adhoc.robust.2005.withstore

datamaestro_ir.data.Adhoc

Tasks: adhoc retrieval

Example usage:

from datamaestro import prepare_dataset

# Load TREC Adhoc dataset (e.g., TREC-8)
adhoc = prepare_dataset("gov.nist.trec.adhoc.8")

# Access components
documents = adhoc.documents
topics = adhoc.topics
assessments = adhoc.assessments

TREC CAR

The TREC Complex Answer Retrieval (CAR) paragraph corpus: ~29.8M paragraphs extracted from Wikipedia, used as a document collection in several TREC tracks, including CaST.

TREC Complex Answer Retrieval (CAR) v2.0 paragraph corpus.

The CAR paragraph corpus contains ~29.8M paragraphs extracted from Wikipedia, used as a document collection in several TREC tracks including CaST.

See http://trec-car.cs.unh.edu/datareleases/ for more details.

Dataset gov.nist.trec.car.documents

datamaestro_ir.data.stores.CarParagraphStore

TREC CAR v2.0 paragraph corpus.

Tags: passage, collection

External link: http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz

Contains ~29.8M paragraphs from Wikipedia. Each paragraph is a simple text document identified by a unique paragraph ID.

Requires the trec-car-tools library (pip install trec-car-tools).

Washington Post

Washington Post document collections used in several TREC tracks. These collections require a data-use agreement with NIST and must be provided locally via DatafolderPath.

Washington Post (WAPO) document collections.

The Washington Post provides document collections used in several TREC tracks. These collections require a data use agreement with NIST and must be provided locally via DatafolderPath.

See https://trec.nist.gov/data/wapost/ for more details.

Dataset gov.nist.trec.wapo.wapo.v2.documents

datamaestro_ir.data.stores.WapoDocumentStore

Washington Post v2 document collection.

Tags: document, collection

External link: https://trec.nist.gov/data/wapost/

Contains ~608K news articles from the Washington Post. Requires a NIST data use agreement. Point DatafolderPath to the directory containing the WAPO v2 JSON lines file.

Dataset gov.nist.trec.wapo.wapo.v2.passages

datamaestro_ir.data.stores.WapoPassageStore

Washington Post v2 paragraph-level passages for CaST v0.

Tags: passage, collection

External link: https://trec.nist.gov/data/wapost/

Each WAPO document is split into paragraphs following the official CaST tools script. Paragraph IDs follow the format {doc_id}-{index} (1-indexed). Empty paragraphs are skipped.
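
The paragraph ID scheme described above can be sketched as follows. This is an illustration only, not the official CaST tools script: it assumes the index counts only the paragraphs that are kept (skipped empty paragraphs do not consume a number), which may differ from the actual script. The function name is hypothetical.

```python
from typing import List, Tuple


def paragraph_ids(doc_id: str, paragraphs: List[str]) -> List[Tuple[str, str]]:
    """Assign {doc_id}-{index} IDs (1-indexed) to non-empty paragraphs."""
    passages: List[Tuple[str, str]] = []
    index = 0
    for text in paragraphs:
        if not text.strip():
            continue  # empty paragraphs are skipped
        index += 1  # assumption: only kept paragraphs are numbered
        passages.append((f"{doc_id}-{index}", text))
    return passages


# Made-up document for illustration
passages = paragraph_ids("abc", ["First.", "", "  ", "Second."])
```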

Dataset gov.nist.trec.wapo.wapo.v4.documents

datamaestro_ir.data.stores.KiltDocumentStore

Washington Post v4 document collection.

Tags: document, collection

External link: https://trec.nist.gov/data/wapost/

Contains news articles from the Washington Post (v4 release). Used in CaST 2021 and 2022. Requires a NIST data use agreement.