Datamaestro IR
datamaestro-ir is a datamaestro plugin that provides access to Information Retrieval datasets for research in:
- Information Retrieval (IR): document collections, topics, relevance judgments, training triplets
- Conversational IR: query rewriting, conversational search datasets
Installation
Install from PyPI:
pip install datamaestro-ir
For development:
git clone https://github.com/experimaestro/datamaestro_ir.git
cd datamaestro_ir
pip install -e ".[dev]"
Quick Start
List available datasets:
# List all datasets in the IR repository
datamaestro search ir
# Search for specific datasets
datamaestro search "msmarco"
Load a dataset in Python:
from datamaestro import prepare_dataset
# Load MS MARCO passage dataset
dataset = prepare_dataset("ir.com.microsoft.msmarco.passage")
Key Concepts
- Data Types
Schema classes that define the structure of datasets (e.g., Documents, Topics, Adhoc). See the Datamaestro IR API for the complete API reference.
- Dataset Configurations
Specific dataset definitions that implement data types with download URLs and processing logic. See Datamaestro IR Datasets for available datasets.
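
The split between data types and dataset configurations can be illustrated with a small, self-contained sketch. It does not use the datamaestro-ir API: the class names Documents, Topics and Adhoc are borrowed from the list above, but their fields, the DatasetConfig class, the build_toy function and the example.org URL are all hypothetical and exist only to show the idea. Refer to the Datamaestro IR API for the real definitions.

```python
from dataclasses import dataclass
from typing import Callable, Dict


# --- "Data types": hypothetical schema classes that only describe structure ---
@dataclass
class Documents:
    texts: Dict[str, str]  # document ID -> document text (assumed layout)


@dataclass
class Topics:
    queries: Dict[str, str]  # topic ID -> query text (assumed layout)


@dataclass
class Adhoc:
    documents: Documents
    topics: Topics
    assessments: Dict[str, Dict[str, int]]  # topic ID -> document ID -> relevance


# --- "Dataset configuration": hypothetical binding of a data type to a source ---
@dataclass
class DatasetConfig:
    dataset_id: str              # illustrative identifier, not a real dataset
    url: str                     # where the raw data would come from
    build: Callable[[], Adhoc]   # processing logic producing the data type


def build_toy() -> Adhoc:
    """Stand-in for download and processing; returns in-memory toy data."""
    return Adhoc(
        documents=Documents({"d1": "neural ranking models", "d2": "cooking pasta"}),
        topics=Topics({"q1": "ranking with neural networks"}),
        assessments={"q1": {"d1": 1, "d2": 0}},
    )


TOY = DatasetConfig(
    dataset_id="toy.adhoc",
    url="https://example.org/toy.tar.gz",  # placeholder URL, never fetched
    build=build_toy,
)

if __name__ == "__main__":
    adhoc = TOY.build()
    print(type(adhoc).__name__, len(adhoc.documents.texts), "documents")
```

In this sketch the schema classes say nothing about where data comes from, while the configuration carries the source and the processing step; that is the separation the two concepts above describe.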