Information Retrieval Datasets
This section lists native IR dataset definitions.
MS MARCO Passage
The MS MARCO (Microsoft Machine Reading Comprehension) Passage Ranking dataset is one of the most widely used benchmarks for neural IR research.
It contains ~8.8M passages and ~500K training queries with sparse relevance judgments.
MS MARCO Passage Ranking collection.
A large-scale dataset focused on machine reading comprehension, question answering, and passage ranking. The passage reranking task provides a query and its top-1000 BM25 passages; a system is expected to rank the most relevant passage as high as possible. Not all 1000 passages are judged; evaluation uses MRR.
Publication: Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In CoCo@NIPS.
See MSMARCO-Passage-Ranking for more details.
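The MRR evaluation mentioned above can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function name `mrr_at_k`, the dictionary layouts, and the toy data are assumptions made for the example, and the cutoff `k=10` reflects the commonly reported MRR@10 setting.

```python
from typing import Dict, List, Set


def mrr_at_k(
    run: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, ranked_pids in run.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked_pids[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    # Assumption: the mean is taken over the queries present in the run
    return total / len(run) if run else 0.0


# Toy data, made up for illustration
run = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p8"]}
qrels = {"q1": {"p1"}, "q2": {"p4"}}
score = mrr_at_k(run, qrels)
```

Because judgments are sparse, a query whose only judged passage is missing from the candidate list simply contributes zero to the mean.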
-
Dataset com.microsoft.msmarco.passage.documents
datamaestro_ir.data.stores.MsMarcoPassagesStore
MS-Marco passage collection and small query/qrel files.
Tags: passage, collection
Downloads collectionandqueries.tar.gz once, builds the document store, and extracts query/qrel files for dev-small and eval-small splits.
Format is TSV (pid \t content)
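As a rough sketch of how a two-column TSV of this shape can be read without the library, the snippet below splits each line on the first tab. The helper name `iter_passages` and the sample text are hypothetical; the store class above has its own loading logic, which is not shown here.

```python
import io
from typing import Iterable, Iterator, Tuple


def iter_passages(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (pid, content) pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, in case the content contains tabs
        pid, content = line.split("\t", 1)
        yield pid, content


# Made-up sample standing in for the collection file
sample = io.StringIO("0\tfirst passage text\n1\tsecond passage text\n")
passages = list(iter_passages(sample))
```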
-
Dataset com.microsoft.msmarco.passage.train.run
datamaestro_ir.data.csv.AdhocRunWithText
Tags: run
TSV format: qid, pid, query, passage
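The four-column layout above lends itself to grouping candidate passages by query before re-ranking. The following is a hedged sketch under the assumption that the columns are tab-separated in the order listed; `group_run` and the sample rows are invented for illustration and are not part of `datamaestro_ir`.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_run(
    lines: Iterable[str],
) -> Tuple[Dict[str, str], Dict[str, List[Tuple[str, str]]]]:
    """Return (query text per qid, candidate (pid, passage) list per qid)."""
    queries: Dict[str, str] = {}
    candidates: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        qid, pid, query, passage = line.split("\t", 3)
        queries[qid] = query
        candidates[qid].append((pid, passage))
    return queries, dict(candidates)


# Made-up rows for illustration
sample = io.StringIO(
    "q1\tp1\twhat is ir\tpassage one\n"
    "q1\tp2\twhat is ir\tpassage two\n"
    "q2\tp3\tother query\tpassage three\n"
)
queries, candidates = group_run(sample)
```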
-
Dataset com.microsoft.msmarco.passage.train.queries
datamaestro_ir.data.csv.Topics
Tags: topics
-
Dataset com.microsoft.msmarco.passage.train.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset com.microsoft.msmarco.passage.train
-
MS-Marco train dataset
Tasks: adhoc retrieval
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.train.withrun
datamaestro_ir.data.RerankAdhoc
MSMarco train dataset, including the top-1000 documents to re-rank
Tasks: adhoc retrieval, learning to rank
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.train.triples.id
datamaestro_ir.data.TrainingTripletsLines
Full training triples (query, positive passage, negative passage) with IDs
Tags: triples
Tasks: learning to rank
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.train.triples.small.text
datamaestro_ir.data.TrainingTripletsLines
Small training triples (query, positive passage, negative passage) with text
Tags: triples
Tasks: learning to rank
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.dev.queries
datamaestro_ir.data.csv.Topics
Tags: topics
-
Dataset com.microsoft.msmarco.passage.dev.run
datamaestro_ir.data.csv.AdhocRunWithText
Tags: run
-
Dataset com.microsoft.msmarco.passage.dev.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset com.microsoft.msmarco.passage.dev
-
MS-Marco dev dataset
Tasks: adhoc retrieval
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.dev.withrun
datamaestro_ir.data.RerankAdhoc
MSMarco dev dataset, including the top-1000 documents to re-rank
Tasks: adhoc retrieval, learning to rank
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.dev.judged
-
MS-Marco dev dataset, restricted to judged queries
Tasks: adhoc retrieval
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
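One plausible reading of "restricted to judged queries" is that only queries with at least one relevance assessment are kept. The sketch below illustrates that reading; the function name and data layout are assumptions, and the library's actual filtering may differ.

```python
from typing import Dict


def restrict_to_judged(
    topics: Dict[str, str],
    qrels: Dict[str, Dict[str, int]],
) -> Dict[str, str]:
    """Keep only topics that have at least one assessment in the qrels."""
    return {qid: text for qid, text in topics.items() if qrels.get(qid)}


# Toy data, made up for illustration
topics = {"1": "first query", "2": "second query", "3": "third query"}
qrels = {"1": {"d1": 1}, "3": {}}
judged = restrict_to_judged(topics, qrels)
```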
-
Dataset com.microsoft.msmarco.passage.eval.withrun
datamaestro_ir.data.csv.AdhocRunWithText
Tags: run
-
Dataset com.microsoft.msmarco.passage.dev.small
-
MS-Marco dev small dataset
Tasks: adhoc retrieval
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.eval.queries.small
datamaestro_ir.data.csv.Topics
MS-Marco eval small queries
Tags: topics
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.trec2019.queries
datamaestro_ir.data.csv.Topics
Tags: topics
-
Dataset com.microsoft.msmarco.passage.trec2019.run
datamaestro_ir.data.csv.AdhocRunWithText
Tags: run
-
Dataset com.microsoft.msmarco.passage.trec2019.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset com.microsoft.msmarco.passage.trec2019
-
TREC Deep Learning (2019)
Tasks: adhoc retrieval
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html
-
Dataset com.microsoft.msmarco.passage.trec2019.withrun
datamaestro_ir.data.RerankAdhoc
TREC Deep Learning (2019), including the top-1000 documents to re-rank
Tasks: adhoc retrieval, learning to rank
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html
-
Dataset com.microsoft.msmarco.passage.trec2019.judged
-
TREC Deep Learning (2019), restricted to judged queries
Tasks: adhoc retrieval
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html
-
Dataset com.microsoft.msmarco.passage.trec2020.queries
datamaestro_ir.data.csv.Topics
TREC Deep Learning 2020 (topics)
Tags: topics
Topics of the TREC 2020 MS-Marco Deep Learning track
-
Dataset com.microsoft.msmarco.passage.trec2020.run
datamaestro_ir.data.csv.AdhocRunWithText
TREC Deep Learning (2020)
Tags: run
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html
Set of query/passage pairs for the passage re-ranking task (TREC 2020)
-
Dataset com.microsoft.msmarco.passage.trec2020.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset com.microsoft.msmarco.passage.trec2020
-
TREC Deep Learning (2020)
Tasks: adhoc retrieval
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html
-
Dataset com.microsoft.msmarco.passage.trec2020.withrun
datamaestro_ir.data.RerankAdhoc
TREC Deep Learning (2020), including the top-1000 documents to re-rank
Tasks: adhoc retrieval, learning to rank
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html
-
Dataset com.microsoft.msmarco.passage.trec2020.judged
-
TREC Deep Learning (2020), restricted to judged queries
Tasks: adhoc retrieval
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html
-
Dataset com.microsoft.msmarco.passage.trec.dl.hard.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
TREC DL-Hard qrels (passage)
Tags: qrels
External link: https://github.com/grill-lab/DL-Hard
-
Dataset com.microsoft.msmarco.passage.trec.dl.hard
-
A more challenging subset of TREC DL 2019 and 2020 passage queries
Tasks: adhoc retrieval
External link: https://github.com/grill-lab/DL-Hard
See: Mackie et al., “How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset”, SIGIR 2021.
Example usage:
from datamaestro import prepare_dataset
from datamaestro.record import IDItem, TextItem
# Load the full adhoc dataset
adhoc = prepare_dataset("com.microsoft.msmarco.passage")
# Iterate over documents
for doc in adhoc.documents.iter_documents():
    doc_id = doc[IDItem].id
    text = doc[TextItem].text
# Load training triplets
triplets = prepare_dataset("com.microsoft.msmarco.passage.train.idstriples.small")
for triplet in triplets.iter():
    query = triplet.query
    pos_doc = triplet.positive
    neg_doc = triplet.negative
BEIR Benchmark
The BEIR (Benchmarking IR) benchmark is a heterogeneous collection of diverse IR tasks for evaluating zero-shot retrieval models. It includes datasets from question answering, fact verification, citation prediction, and more.
BEIR (Benchmarking IR) benchmark datasets.
Provides native datamaestro dataset definitions for ~15 BEIR datasets plus 12 CQADupStack sub-datasets, using CompressedDocumentStore for efficient document storage.
Each dataset downloads its ZIP once (transient). The docstore is built from the corpus, and queries/qrels are copied out before cleanup.
See: https://github.com/beir-cellar/beir
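For orientation, the files inside a BEIR archive follow the layout published by the BEIR project: a `corpus.jsonl` and `queries.jsonl` keyed by `_id`, plus tab-separated qrels with a `query-id`, `corpus-id`, `score` header. The sketch below reads that layout from in-memory samples; the helper names and sample records are made up, and this is not how `datamaestro_ir` builds its docstore.

```python
import csv
import io
import json
from typing import Dict, Iterable


def read_jsonl(lines: Iterable[str]) -> Dict[str, dict]:
    """Index JSON-lines records by their `_id` field."""
    records: Dict[str, dict] = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            records[record["_id"]] = record
    return records


def read_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Read tab-separated qrels with a query-id/corpus-id/score header."""
    qrels: Dict[str, Dict[str, int]] = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# Made-up samples standing in for corpus.jsonl, queries.jsonl and a qrels file
corpus = read_jsonl(io.StringIO('{"_id": "d1", "title": "T", "text": "some text"}\n'))
queries = read_jsonl(io.StringIO('{"_id": "q1", "text": "a question"}\n'))
qrels = read_qrels(io.StringIO("query-id\tcorpus-id\tscore\nq1\td1\t1\n"))
```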
-
Dataset org.beir.trec.covid
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.nq
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.arguana
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.webis.touche2020
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.webis.touche2020.v2
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.climate.fever
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.scidocs
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.nfcorpus.collection
datamaestro_ir.data.beir.BeirDocumentStore
Tags: document, collection
-
Dataset org.beir.nfcorpus.train
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.nfcorpus.dev
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.nfcorpus.test
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.hotpotqa.collection
datamaestro_ir.data.beir.BeirDocumentStore
Tags: document, collection
-
Dataset org.beir.hotpotqa.train
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.hotpotqa.dev
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.hotpotqa.test
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.fiqa.collection
datamaestro_ir.data.beir.BeirDocumentStore
Tags: document, collection
-
Dataset org.beir.fiqa.train
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.fiqa.dev
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.fiqa.test
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.quora.collection
datamaestro_ir.data.beir.BeirDocumentStore
Tags: document, collection
-
Dataset org.beir.quora.dev
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.quora.test
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.dbpedia.entity.collection
datamaestro_ir.data.beir.BeirDocumentStore
Tags: document, collection
-
Dataset org.beir.dbpedia.entity.dev
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.dbpedia.entity.test
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.fever.collection
datamaestro_ir.data.beir.BeirDocumentStore
Tags: document, collection
-
Dataset org.beir.fever.train
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.fever.dev
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.fever.test
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.scifact.collection
datamaestro_ir.data.beir.BeirDocumentStore
Tags: document, collection
-
Dataset org.beir.scifact.train
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.scifact.test
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.android
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.english
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.gaming
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.gis
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.mathematica
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.physics
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.programmers
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.stats
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.tex
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.unix
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.webmasters
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
-
Dataset org.beir.cqadupstack.wordpress
-
Tasks: adhoc retrieval
External link: https://github.com/beir-cellar/beir
Example usage:
from datamaestro import prepare_dataset
# Load a single-split dataset
adhoc = prepare_dataset("org.beir.scidocs")
# Load a multi-split dataset
adhoc = prepare_dataset("org.beir.nfcorpus.test")
# Access components
for doc in adhoc.documents.iter():
    print(doc["id"], doc["text_item"].text)
LoTTE Benchmark
The LoTTE (Long-Tail Topic-stratified Evaluation) benchmark from ColBERTv2. Contains 6 domains (lifestyle, recreation, science, technology, writing, pooled) with dev/test splits and two query types (search, forum) per split.
LoTTE (Long-Tail Topic-stratified Evaluation) benchmark datasets.
Provides native datamaestro dataset definitions for 6 domains x 2 splits x 2 query types = 24 IR tasks. A single 3.6GB tar.gz is downloaded once; per-domain docstores are built, and query/qrel files are copied out.
See: https://github.com/stanford-futuredata/ColBERT
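The 6 x 2 x 2 = 24 task count can be checked by enumerating the combinations. The sketch below assumes the dotted ID pattern used by the dataset entries in this section (`edu.stanford.lotte.<domain>.<split>.<query type>`); the constant names are chosen for illustration.

```python
from itertools import product

DOMAINS = ["lifestyle", "recreation", "science", "technology", "writing", "pooled"]
SPLITS = ["dev", "test"]
QUERY_TYPES = ["search", "forum"]

# One dataset ID per (domain, split, query type) combination
lotte_ids = [
    f"edu.stanford.lotte.{domain}.{split}.{query_type}"
    for domain, split, query_type in product(DOMAINS, SPLITS, QUERY_TYPES)
]
```

Each resulting string could then be passed to `prepare_dataset`, for instance to evaluate a model across all domains in a loop.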
-
Dataset edu.stanford.lotte.lotte.data
datamaestro_ir.data.lotte.LotteDocumentStore
Tags: passage, collection
-
Dataset edu.stanford.lotte.lifestyle.dev.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.lifestyle.dev.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.lifestyle.test.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.lifestyle.test.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.recreation.dev.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.recreation.dev.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.recreation.test.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.recreation.test.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.science.dev.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.science.dev.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.science.test.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.science.test.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.technology.dev.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.technology.dev.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.technology.test.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.technology.test.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.writing.dev.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.writing.dev.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.writing.test.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.writing.test.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.pooled.dev.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.pooled.dev.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.pooled.test.search
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
-
Dataset edu.stanford.lotte.pooled.test.forum
-
Tasks: adhoc retrieval
External link: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
Example usage:
from datamaestro import prepare_dataset
# Load a specific task
adhoc = prepare_dataset("edu.stanford.lotte.science.test.search")
# Access components
for doc in adhoc.documents.iter():
    print(doc["id"], doc["text_item"].text)
TIPSTER Collections
The TIPSTER document collections used in TREC evaluations, organized by source.
TIPSTER document collections.
Also known as the Text Research Collection Volume or TREC. Sponsored by ARPA/SISTO to advance the state of the art in document detection and data extraction from large, real-world collections. The detection data is a test collection built at NIST for the TIPSTER and related TREC projects: three CD-ROMs of SGML-encoded documents from LDC plus queries and relevance judgments from NIST.
See also https://trec.nist.gov/data/docs_eng.html and https://trec.nist.gov/data/intro_eng.html.
-
Dataset gov.nist.trec.tipster.ap88
datamaestro_ir.data.trec.TipsterCollection
Associated Press document collection (1988)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ap89
datamaestro_ir.data.trec.TipsterCollection
Associated Press document collection (1989)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ap90
datamaestro_ir.data.trec.TipsterCollection
Associated Press document collection (1990)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.doe1
datamaestro_ir.data.trec.TipsterCollection
Department of Energy documents
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj87
datamaestro_ir.data.trec.TipsterCollection
Wall Street Journal (1987)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj88
datamaestro_ir.data.trec.TipsterCollection
Wall Street Journal (1988)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj89
datamaestro_ir.data.trec.TipsterCollection
Wall Street Journal (1989)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj90
datamaestro_ir.data.trec.TipsterCollection
Wall Street Journal (1990)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj91
datamaestro_ir.data.trec.TipsterCollection
Wall Street Journal (1991)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj92
datamaestro_ir.data.trec.TipsterCollection
Wall Street Journal (1992)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.fr88
datamaestro_ir.data.trec.TipsterCollection
Federal Register (1988)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.fr89
datamaestro_ir.data.trec.TipsterCollection
Federal Register (1989)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.fr94
datamaestro_ir.data.trec.TipsterCollection
Federal Register (1994)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ziff1
datamaestro_ir.data.trec.TipsterCollection
Information from the Computer Select disks (1989-90)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ziff2
datamaestro_ir.data.trec.TipsterCollection
Information from the Computer Select disks (1989-90)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ziff3
datamaestro_ir.data.trec.TipsterCollection
Information from the Computer Select disks (1990-91)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.sjm1
datamaestro_ir.data.trec.TipsterCollection
San Jose Mercury News (1991)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.cr1
datamaestro_ir.data.trec.TipsterCollection
Congressional Record (1993)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ft1
datamaestro_ir.data.trec.TipsterCollection
Financial Times
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.fbis1
datamaestro_ir.data.trec.TipsterCollection
Foreign Broadcast Information Service (1996)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.la8990
datamaestro_ir.data.trec.TipsterCollection
Los Angeles Times (1989-90)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC93T3A
AQUAINT
The AQUAINT Corpus consists of newswire text data in English from three sources: Xinhua News Service, New York Times, and Associated Press.
AQUAINT newswire corpus.
LDC catalog number LDC2002T31 (ISBN 1-58563-240-6). Newswire text in English drawn from three sources: the Xinhua News Service (PRC), the New York Times News Service, and the Associated Press Worldstream News Service. Prepared by the LDC for the AQUAINT Project and used in official NIST benchmark evaluations.
-
Dataset edu.upenn.ldc.aquaint.apw
datamaestro_ir.data.trec.TipsterCollection
Associated Press (1998-2000)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC2002T31
-
Dataset edu.upenn.ldc.aquaint.nyt
datamaestro_ir.data.trec.TipsterCollection
New York Times (1998-2000)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC2002T31
-
Dataset edu.upenn.ldc.aquaint.xie
datamaestro_ir.data.trec.TipsterCollection
Xinhua News Agency newswires (1996-2000)
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC2002T31
-
Dataset edu.upenn.ldc.aquaint
datamaestro_ir.data.trec.TipsterCollection
Aquaint documents
Tags: document, collection
External link: https://catalog.ldc.upenn.edu/LDC2002T31
TREC Ad Hoc
Classic TREC Ad Hoc test collections from NIST. These collections have been fundamental benchmarks in IR research since the 1990s.
TREC Adhoc datasets and tasks.
See https://trec.nist.gov/data/test_coll.html.
-
Dataset gov.nist.trec.adhoc.1.documents
datamaestro_ir.data.trec.TipsterCollection
TREC-1 to TREC-3 documents (TIPSTER volumes 1 and 2)
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.1.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.1.assessments
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.1
-
Ad-hoc task of TREC 1 (1992)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.2.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.2.assessments
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.2
-
Ad-hoc task of TREC 2 (1993)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.3.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.3.assessments
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.3
-
Ad-hoc task of TREC 3 (1994)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.4.documents
datamaestro_ir.data.trec.TipsterCollection
TREC-4 documents
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.4.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.4.assessments
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.4
-
Ad-hoc task of TREC 4 (1995)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.5.documents
datamaestro_ir.data.trec.TipsterCollection
TREC-5 documents
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.5.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.5.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.5
-
Ad-hoc task of TREC 5 (1996)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.6.documents
datamaestro_ir.data.trec.TipsterCollection
TREC-6 documents
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.6.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.6.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.6
-
Ad-hoc task of TREC 6 (1997)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.7.documents
datamaestro_ir.data.trec.TipsterCollection
TREC-7 documents
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.7.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.7.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.7
-
Ad-hoc task of TREC 7 (1998)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.8.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.8.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.8
-
Ad-hoc task of TREC 8 (1999)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.robust.2004.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.robust.2004.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.robust.2004
-
Ad-hoc task of TREC Robust (2004)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.robust.2005.topics
datamaestro_ir.data.trec.TrecTopics
Tags: topics
-
Dataset gov.nist.trec.adhoc.robust.2005.qrels
datamaestro_ir.data.trec.TrecAdhocAssessments
Tags: qrels
-
Dataset gov.nist.trec.adhoc.robust.2005
-
Ad-hoc task of TREC Robust (2005)
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.1.documents.store
datamaestro_ir.data.stores.TipsterDocumentStore
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.4.documents.store
datamaestro_ir.data.stores.TipsterDocumentStore
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.5.documents.store
datamaestro_ir.data.stores.TipsterDocumentStore
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.6.documents.store
datamaestro_ir.data.stores.TipsterDocumentStore
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.7.documents.store
datamaestro_ir.data.stores.TipsterDocumentStore
Tags: document, collection
-
Dataset edu.upenn.ldc.aquaint.store
datamaestro_ir.data.stores.TipsterDocumentStore
Tags: document, collection
-
Dataset gov.nist.trec.adhoc.1.withstore
-
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.2.withstore
-
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.3.withstore
-
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.4.withstore
-
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.5.withstore
-
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.6.withstore
-
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.7.withstore
-
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.8.withstore
-
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.robust.2004.withstore
-
Tasks: adhoc retrieval
-
Dataset gov.nist.trec.adhoc.robust.2005.withstore
-
Tasks: adhoc retrieval
Example usage:
from datamaestro import prepare_dataset
# Load TREC Adhoc dataset (e.g., TREC-8)
adhoc = prepare_dataset("gov.nist.trec.adhoc.8")
# Access components
documents = adhoc.documents
topics = adhoc.topics
assessments = adhoc.assessments
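TREC assessments are conventionally distributed as whitespace-separated qrels lines of the form `topic iteration docno relevance`. As a standalone sketch of that file format (independent of how `TrecAdhocAssessments` parses it), the helper below builds a nested dictionary; the function name and the sample identifiers are invented for illustration.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable


def parse_trec_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map topic ID -> {document ID: relevance} from TREC qrels lines."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, relevance = parts
        qrels[topic][docno] = int(relevance)
    return dict(qrels)


# Made-up sample lines, not real judgments
sample = io.StringIO("401 0 DOC-A 0\n401 0 DOC-B 1\n402 0 DOC-C 1\n")
parsed = parse_trec_qrels(sample)
```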
TREC CAR
The TREC Complex Answer Retrieval paragraph corpus — ~29.8M paragraphs extracted from Wikipedia, used as a document collection in several TREC tracks including CaST.
TREC Complex Answer Retrieval (CAR) v2.0 paragraph corpus.
The CAR paragraph corpus contains ~29.8M paragraphs extracted from Wikipedia, used as a document collection in several TREC tracks including CaST.
See http://trec-car.cs.unh.edu/datareleases/ for more details.
-
Dataset gov.nist.trec.car.documents
datamaestro_ir.data.stores.CarParagraphStore
TREC CAR v2.0 paragraph corpus.
Tags: passage, collection
External link: http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz
Contains ~29.8M paragraphs from Wikipedia. Each paragraph is a simple text document identified by a unique paragraph ID.
Requires the trec-car-tools library (pip install trec-car-tools).
Washington Post
Washington Post document collections used in several TREC tracks. These collections require a data-use agreement with NIST and must be provided locally via DatafolderPath.
Washington Post (WAPO) document collections.
The Washington Post provides document collections used in several TREC tracks.
These collections require a data use agreement with NIST and must be provided
locally via DatafolderPath.
See https://trec.nist.gov/data/wapost/ for more details.
-
Dataset gov.nist.trec.wapo.wapo.v2.documents
datamaestro_ir.data.stores.WapoDocumentStore
Washington Post v2 document collection.
Tags: document, collection
External link: https://trec.nist.gov/data/wapost/
Contains ~608K news articles from the Washington Post. Requires a NIST data use agreement. Point DatafolderPath to the directory containing the WAPO v2 JSON lines file.
-
Dataset gov.nist.trec.wapo.wapo.v2.passages
datamaestro_ir.data.stores.WapoPassageStore
Washington Post v2 paragraph-level passages for CaST v0.
Tags: passage, collection
External link: https://trec.nist.gov/data/wapost/
Each WAPO document is split into paragraphs following the official CaST tools script. Paragraph IDs follow the format {doc_id}-{index} (1-indexed). Empty paragraphs are skipped.
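The ID scheme described above can be illustrated with a short sketch. It assumes the index counts only the paragraphs that are kept, so skipped empty paragraphs do not consume a number; that detail is an assumption and should be checked against the CaST tools script. The function name `split_into_passages` is hypothetical.

```python
from typing import Iterable, List, Tuple


def split_into_passages(
    doc_id: str, paragraphs: Iterable[str]
) -> List[Tuple[str, str]]:
    """Assign 1-indexed {doc_id}-{index} IDs to non-empty paragraphs."""
    passages: List[Tuple[str, str]] = []
    index = 0
    for text in paragraphs:
        if not text.strip():
            continue  # empty paragraphs are skipped
        index += 1  # assumption: only kept paragraphs are numbered
        passages.append((f"{doc_id}-{index}", text))
    return passages


# Made-up document for illustration
result = split_into_passages("abc", ["first", "", "second"])
```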
-
Dataset gov.nist.trec.wapo.wapo.v4.documents
datamaestro_ir.data.stores.KiltDocumentStore
Washington Post v4 document collection.
Tags: document, collection
External link: https://trec.nist.gov/data/wapost/
Contains news articles from the Washington Post (v4 release). Used in CaST 2021 and 2022. Requires a NIST data use agreement.