Datamaestro IR Datasets
This section lists the datasets available through the datamaestro-ir plugin.
Datasets are organized by domain:
Information Retrieval Datasets - Information retrieval benchmark collections (MS MARCO, TREC, etc.)
Distillation Datasets - Knowledge distillation datasets for training neural rankers
Conversational IR Datasets - Conversational search and query reformulation
To load a dataset:
from datamaestro import prepare_dataset
# Load by dataset ID
dataset = prepare_dataset("com.microsoft.msmarco.passage")
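When several datasets are needed at once, the single call above can be wrapped in a small helper. The following is a minimal sketch, not part of datamaestro: `load_many` and its `loader` parameter are hypothetical names introduced here for illustration, and the only library call assumed is `prepare_dataset` as imported above. Because this page does not document which exceptions `prepare_dataset` may raise, the sketch catches `Exception` broadly and reports failures per ID.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def load_many(
    dataset_ids: Iterable[str],
    loader: Optional[Callable[[str], Any]] = None,
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Load each dataset ID, collecting failures instead of stopping at the first.

    `loader` is a hypothetical injection point (useful for testing); when it is
    omitted, datamaestro's `prepare_dataset` is used.
    """
    if loader is None:
        # Imported lazily so the helper can be imported without datamaestro installed.
        from datamaestro import prepare_dataset

        loader = prepare_dataset

    loaded: Dict[str, Any] = {}
    failed: Dict[str, Exception] = {}
    for dataset_id in dataset_ids:
        try:
            loaded[dataset_id] = loader(dataset_id)
        except Exception as exc:
            # Broad on purpose: the error types are an unknown in this sketch.
            failed[dataset_id] = exc
    return loaded, failed
```

A call such as `loaded, failed = load_many(["com.microsoft.msmarco.passage"])` would then return the prepared dataset under its ID in `loaded`, or the raised exception under the same ID in `failed`.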
To discover available datasets:
# List datasets matching "ir"
datamaestro search ir
# Search by keyword
datamaestro search "trec"