Information Retrieval Datasets
==============================
This section lists native IR dataset definitions.
MS MARCO Passage
----------------
The `MS MARCO `_ (Microsoft Machine Reading
Comprehension) Passage Ranking dataset. One of the most widely used benchmarks
for neural IR research.
Contains ~8.8M passages and ~500K training queries with sparse relevance judgments.
.. dm:datasets:: com.microsoft.msmarco.passage ir
Example usage:
.. code-block:: python
from datamaestro import prepare_dataset
from datamaestro.record import IDItem, TextItem
# Load the full adhoc dataset
adhoc = prepare_dataset("com.microsoft.msmarco.passage")
# Iterate over documents
for doc in adhoc.documents.iter_documents():
doc_id = doc[IDItem].id
text = doc[TextItem].text
# Load training triplets
triplets = prepare_dataset("com.microsoft.msmarco.passage.train.idstriples.small")
for triplet in triplets.iter():
query = triplet.query
pos_doc = triplet.positive
neg_doc = triplet.negative
BEIR Benchmark
--------------
The `BEIR `_ (Benchmarking IR) benchmark
is a heterogeneous collection of diverse IR tasks for evaluating zero-shot
retrieval models. It includes datasets from question answering, fact
verification, citation prediction, and more.
.. dm:datasets:: org.beir ir
Example usage:
.. code-block:: python
from datamaestro import prepare_dataset
# Load a single-split dataset
adhoc = prepare_dataset("org.beir.scidocs")
# Load a multi-split dataset
adhoc = prepare_dataset("org.beir.nfcorpus_test")
# Access components
for doc in adhoc.documents.iter():
print(doc["id"], doc["text_item"].text)
LoTTE Benchmark
---------------
The `LoTTE `_ (Long-Tail
Topic-stratified Evaluation) benchmark from ColBERTv2. Contains 6 domains
(lifestyle, recreation, science, technology, writing, pooled) with dev/test
splits and two query types (search, forum) per split.
.. dm:datasets:: edu.stanford.lotte ir
Example usage:
.. code-block:: python
from datamaestro import prepare_dataset
# Load a specific task
adhoc = prepare_dataset("edu.stanford.lotte.science_test_search")
# Access components
for doc in adhoc.documents.iter():
print(doc["id"], doc["text_item"].text)
TIPSTER Collections
-------------------
The TIPSTER document collections used in TREC evaluations, organized by source.
.. dm:datasets:: gov.nist.trec.tipster ir
AQUAINT
-------
The AQUAINT Corpus consists of newswire text data in English from three sources:
Xinhua News Service, New York Times, and Associated Press.
.. dm:datasets:: edu.upenn.ldc.aquaint ir
TREC Ad Hoc
-----------
Classic TREC Ad Hoc test collections from NIST. These collections have been
fundamental benchmarks in IR research since the 1990s.
.. dm:datasets:: gov.nist.trec.adhoc ir
Example usage:
.. code-block:: python
from datamaestro import prepare_dataset
# Load TREC Adhoc dataset (e.g., TREC-8)
adhoc = prepare_dataset("gov.nist.trec.adhoc.8")
# Access components
documents = adhoc.documents
topics = adhoc.topics
assessments = adhoc.assessments
TREC CAR
--------
The `TREC Complex Answer Retrieval `_ paragraph
corpus — ~29.8M paragraphs extracted from Wikipedia, used as a document
collection in several TREC tracks including CaST.
.. dm:datasets:: gov.nist.trec.car ir
Washington Post
---------------
`Washington Post `_ document collections
used in several TREC tracks. These collections require a data-use agreement
with NIST and must be provided locally via ``DatafolderPath``.
.. dm:datasets:: gov.nist.trec.wapo ir