Information Retrieval Datasets
==============================

This section lists native IR dataset definitions.

MS MARCO Passage
----------------

The `MS MARCO `_ (Microsoft Machine Reading Comprehension) Passage Ranking
dataset is one of the most widely used benchmarks for neural IR research.
It contains ~8.8M passages and ~500K training queries with sparse relevance
judgments.

.. dm:datasets:: com.microsoft.msmarco.passage ir

Example usage:

.. code-block:: python

    from datamaestro import prepare_dataset
    from datamaestro.record import IDItem, TextItem

    # Load the full adhoc dataset
    adhoc = prepare_dataset("com.microsoft.msmarco.passage")

    # Iterate over documents
    for doc in adhoc.documents.iter_documents():
        doc_id = doc[IDItem].id
        text = doc[TextItem].text

    # Load training triplets
    triplets = prepare_dataset("com.microsoft.msmarco.passage.train.idstriples.small")
    for triplet in triplets.iter():
        query = triplet.query
        pos_doc = triplet.positive
        neg_doc = triplet.negative

BEIR Benchmark
--------------

The `BEIR `_ (Benchmarking IR) benchmark is a heterogeneous collection of
diverse IR tasks for evaluating zero-shot retrieval models. It includes
datasets from question answering, fact verification, citation prediction,
and more.

.. dm:datasets:: org.beir ir

Example usage:

.. code-block:: python

    from datamaestro import prepare_dataset

    # Load a single-split dataset
    adhoc = prepare_dataset("org.beir.scidocs")

    # Load one split of a multi-split dataset (here, the test split)
    adhoc = prepare_dataset("org.beir.nfcorpus_test")

    # Access components
    for doc in adhoc.documents.iter():
        print(doc["id"], doc["text_item"].text)

LoTTE Benchmark
---------------

The `LoTTE `_ (Long-Tail Topic-stratified Evaluation) benchmark from
ColBERTv2 contains 6 domains (lifestyle, recreation, science, technology,
writing, pooled) with dev/test splits and two query types (search, forum)
per split.

.. dm:datasets:: edu.stanford.lotte ir

Example usage:

.. code-block:: python

    from datamaestro import prepare_dataset

    # Load a specific task
    adhoc = prepare_dataset("edu.stanford.lotte.science_test_search")

    # Access components
    for doc in adhoc.documents.iter():
        print(doc["id"], doc["text_item"].text)

TIPSTER Collections
-------------------

The TIPSTER document collections, used in TREC evaluations, are organized
by source.

.. dm:datasets:: gov.nist.trec.tipster ir

AQUAINT
-------

The AQUAINT Corpus consists of newswire text data in English from three
sources: Xinhua News Service, New York Times, and Associated Press.

.. dm:datasets:: edu.upenn.ldc.aquaint ir

TREC Ad Hoc
-----------

Classic TREC Ad Hoc test collections from NIST. These collections have been
fundamental benchmarks in IR research since the 1990s.

.. dm:datasets:: gov.nist.trec.adhoc ir

Example usage:

.. code-block:: python

    from datamaestro import prepare_dataset

    # Load a TREC Ad Hoc dataset (e.g., TREC-8)
    adhoc = prepare_dataset("gov.nist.trec.adhoc.8")

    # Access components
    documents = adhoc.documents
    topics = adhoc.topics
    assessments = adhoc.assessments

TREC CAR
--------

The `TREC Complex Answer Retrieval `_ paragraph corpus contains ~29.8M
paragraphs extracted from Wikipedia and is used as a document collection in
several TREC tracks, including CaST.

.. dm:datasets:: gov.nist.trec.car ir

Washington Post
---------------

`Washington Post `_ document collections used in several TREC tracks. These
collections require a data-use agreement with NIST and must be provided
locally via ``DatafolderPath``.

.. dm:datasets:: gov.nist.trec.wapo ir
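
Several of the collections listed on this page are large (MS MARCO alone has ~8.8M passages), so it can be useful to look at a handful of records before iterating over everything. The following is a minimal sketch, not part of the library: ``preview_documents`` and ``FakeDocuments`` are hypothetical names introduced here, and the only assumption about the collection object is that it exposes ``iter_documents()`` as in the MS MARCO example.

```python
from itertools import islice


def preview_documents(documents, limit=3):
    """Return at most `limit` records from a document collection.

    Assumption: `documents` exposes `iter_documents()`, as in the
    MS MARCO example. Nothing else about its API is assumed.
    """
    # islice stops after `limit` items, so the full collection is never read
    return list(islice(documents.iter_documents(), limit))


class FakeDocuments:
    """Hypothetical stand-in so the sketch runs without downloading data."""

    def __init__(self, texts):
        self.texts = texts

    def iter_documents(self):
        # Yield (id, text) pairs lazily, like a streamed collection
        for doc_id, text in enumerate(self.texts):
            yield str(doc_id), text


sample = preview_documents(FakeDocuments(["a", "b", "c", "d"]), limit=2)
print(sample)
```

With a real dataset, the stand-in would presumably be replaced by ``adhoc.documents``; the records returned would then be whatever that collection yields rather than the simple pairs used here.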