Information Retrieval API

This module provides data types for Information Retrieval datasets and experiments.

The core abstractions are:

Documents - Collections of documents to be searched
Topics - Queries or information needs
Assessments - Relevance judgments (qrels) linking topics to relevant documents
Adhoc - A complete IR test collection combining documents, topics, and assessments

For training neural rankers:

TrainingTriplets - Training data as (query, positive_doc, negative_doc) triplets
PairwiseSampleDataset - General pairwise training data

Data objects

class datamaestro_ir.data.base.IDRecord

Bases: TypedDict

A record with just an ID

class datamaestro_ir.data.base.IDTextRecord

Bases: dict

A record with an ID and a text item

class datamaestro_ir.data.base.ScoredDocument(document: dict, score: float)

Bases: object

A data structure that associates a score with a document, so that documents can be sorted by score (e.g., for nDCG)

document: dict: The document (IDRecord, TextRecord, or IDTextRecord)

score: float: The associated score

class datamaestro_ir.data.base.SimpleTextItem(text: str)

Bases: TextItem

A topic/document with a text record

class datamaestro_ir.data.base.TextItem

Bases: ABC

abstract property text: str: Returns the text

class datamaestro_ir.data.base.TextRecord

Bases: TypedDict

A record with just a text item

Collection

XPM Config datamaestro_ir.data.Adhoc(*, id, documents, topics, assessments)

Bases: Base

An Adhoc IR collection with documents, topics and their assessments

id: str: The unique (sub-)dataset ID

documents: datamaestro_ir.data.Documents: The set of documents

topics: datamaestro_ir.data.Topics: The set of topics

assessments: datamaestro_ir.data.AdhocAssessments: The set of assessments (for each topic)

Topics

XPM Config datamaestro_ir.data.Topics(*, id)

Bases: Base, ABC

A set of topics with associated IDs

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_ir.data.csv.Topics(*, id, path, separator)

Bases: Topics

Pairs of (query ID, query text), delimited by a separator

id: str: The unique (sub-)dataset ID

path: path

separator: str
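A minimal sketch of reading such a file, assuming one pair per line and splitting at the first occurrence of the separator. `parse_topics` is a hypothetical helper, not part of the library.

```python
from typing import Dict, Iterable


def parse_topics(lines: Iterable[str], separator: str = "\t") -> Dict[str, str]:
    """Map query ID to query text, splitting each line at the first separator."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        qid, text = line.split(separator, 1)
        topics[qid] = text
    return topics


topics = parse_topics(["q1\twhat is bm25\n", "q2\tdense retrieval\twith tabs\n"])
```

Splitting only once keeps any later separator characters inside the query text.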

XPM Config datamaestro_ir.data.FilteredTopics(*, id, topics, qids_path)

Bases: Topics

Merges multiple Topics sources, keeping only query IDs listed in a file

id: str: The unique (sub-)dataset ID

topics: List[datamaestro_ir.data.Topics]

qids_path: path: Path to a file with one query ID per line

XPM Config datamaestro_ir.transforms.TopicWrapper

Bases: Config, ABC

Modify topics on the fly using a topic wrapper

Dataset-specific Topics

XPM Config datamaestro_ir.data.beir.BeirTopics(*, id, path)

Bases: Topics

BEIR queries: JSONL with _id and text fields.

id: str: The unique (sub-)dataset ID

path: path

XPM Config datamaestro_ir.data.beir.BeirParquetTopics(*, id, path)

Bases: Topics

BEIR queries from Parquet.

id: str: The unique (sub-)dataset ID

path: path

XPM Config datamaestro_ir.data.lotte.LotteTopics(*, id, path)

Bases: Topics

LoTTE queries: TSV with query_id and text fields.

id: str: The unique (sub-)dataset ID

path: path

XPM Config datamaestro_ir.data.trec.TrecTopics(*, id, path, parts)

Bases: Topics

id: str: The unique (sub-)dataset ID

path: path

parts: List[str]

XPM Config datamaestro_ir.data.cord19.Topics(*, id, path)

Bases: Topics, File

XML format used in Adhoc topics

id: str: The unique (sub-)dataset ID

path: path: The path of the file

Documents

XPM Config datamaestro_ir.data.Documents(*, id, count)

Bases: Base

A set of documents with identifiers

id: str: The unique (sub-)dataset ID

count: int: Number of documents

XPM Config datamaestro_ir.data.csv.Documents(*, id, count, path, separator)

Bases: Documents

One line per document, format pid<SEP>text

id: str: The unique (sub-)dataset ID

count: int: Number of documents

path: path

separator: str

Dataset-specific documents

XPM Config datamaestro_ir.data.cord19.Documents(*, id, path, delimiter, ignore, names_row, count)

Bases: Documents, Generic

id: str: The unique (sub-)dataset ID

path: path: The path of the file

delimiter: str = ,

ignore: int = 0

names_row: int = -1

count: int: Number of documents

XPM Config datamaestro_ir.data.trec.TipsterCollection(*, id, count, path, patterns)

Bases: Documents

id: str: The unique (sub-)dataset ID

count: int: Number of documents

path: path

patterns: List[str]

XPM Config datamaestro_ir.data.beir.BeirDocumentStore(*, id, count, file_access, path, lookup_key)

Bases: CompressedDocumentStore

Document store for BEIR datasets.

Content bytes encode title and text as: title_bytes + b"0" + text_bytes. The only key is "id" (the external document ID).

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

lookup_key: str = id

XPM Config datamaestro_ir.data.lotte.LotteDocumentStore(*, id, count, file_access, path, lookup_key)

Bases: CompressedDocumentStore

Document store for LoTTE datasets.

Content bytes encode text as UTF-8. The only key is “id” (the document ID).

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

lookup_key: str = id

XPM Config datamaestro_ir.data.stores.MsMarcoPassagesStore(*, id, count, file_access, path)

Bases: CompressedDocumentStore

Document store for MS MARCO passages where internal ID = external ID

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.stores.MsMarcoPassageV2Store(*, id, count, file_access, path)

Bases: CompressedDocumentStore

Document store for MS MARCO passage collection v2.

Each passage has its text, parent document id, and character spans within that document.

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.stores.CarParagraphStore(*, id, count, file_access, path)

Bases: CompressedDocumentStore

Document store for TREC CAR v2.0 paragraphs.

Each document is a simple text paragraph identified by its paragraph ID.

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.stores.WapoDocumentStore(*, id, count, file_access, path)

Bases: CompressedDocumentStore

Document store for Washington Post (WAPO) v2/v4 full documents.

Stores full WAPO documents with all metadata fields.

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.stores.WapoPassageStore(*, id, count, file_access, path)

Bases: CompressedDocumentStore

Document store for WAPO paragraph-level passages (CaST v0).

Each WAPO document is split into paragraphs. Document IDs follow the format {doc_id}-{paragraph_index} (1-indexed) matching the official CaST tools script.

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.stores.KiltDocumentStore(*, id, count, file_access, path)

Bases: CompressedDocumentStore

Document store for KILT (Knowledge Intensive Language Tasks) knowledge source.

Stores KILT documents with title, URL, and body text.

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.stores.MsMarcoDocumentStore(*, id, count, file_access, path)

Bases: CompressedDocumentStore

Document store for MS MARCO document collection (v1).

Each document has URL, title, and body fields.

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.stores.MsMarcoDocumentV2Store(*, id, count, file_access, path)

Bases: CompressedDocumentStore

Document store for MS MARCO document collection v2.

Each document has URL, title, headings, and body fields.

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.stores.CastSegmentedPassageStore(*, id, count, file_access, base_store, offsets_path, dupes_path)

Bases: DocumentStore

Document store for CaST segmented passages (v2/v3).

Reads a base document store and an offset file to create passage-level documents. Each passage is defined by character ranges applied to the base document text.

Offset file format (gzipped JSONL):

{"id":"MARCO_00_1454834","ranges":[[[0,917]],[[918,2082]]],"md5":"..."}

Passage IDs follow the format {doc_id}-{passage_index} (1-indexed).

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

base_store: datamaestro_ir.data.DocumentStore: The base document store containing full documents

offsets_path: path: Path to the gzipped JSONL offset file

dupes_path: path: Path to the duplicates file (one doc ID per line to exclude)

XPM Config datamaestro_ir.data.stores.TipsterDocumentStore(*, id, count, file_access, path)

Bases: CompressedDocumentStore

Document store for TIPSTER/AQUAINT document collections.

Each document is stored as JSON with title and body fields, matching the structured output of the TIPSTER SGML parser.

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.PrefixedDocumentStore(*, id, count, file_access, sources, prefixes)

Bases: DocumentStore

Combines multiple DocumentStores with ID prefixes.

Each document ID is expected to start with one of the given prefixes, which determines which underlying store to query.

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

sources: List[datamaestro_ir.data.DocumentStore]

prefixes: List[str]
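The prefix-based routing can be sketched with dict-backed stand-ins for the underlying stores. Whether the real store strips the prefix before querying the underlying store is not stated here; the sketch assumes it does. `route` and the sample prefixes are hypothetical.

```python
from typing import Dict, List, Tuple


def route(doc_id: str, prefixes: List[str]) -> Tuple[int, str]:
    """Return (store index, remaining ID) for the first matching prefix.

    Assumption: the prefix is removed before the underlying store is queried.
    """
    for index, prefix in enumerate(prefixes):
        if doc_id.startswith(prefix):
            return index, doc_id[len(prefix):]
    raise KeyError(f"No prefix matches document ID {doc_id!r}")


# Dict-backed stand-ins for two document stores
sources: List[Dict[str, str]] = [{"12": "a MARCO passage"}, {"7": "a CAR paragraph"}]
prefixes = ["MARCO_", "CAR_"]

store_index, local_id = route("CAR_7", prefixes)
text = sources[store_index][local_id]
```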

Assessments

XPM Config datamaestro_ir.data.AdhocAssessments(*, id)

Bases: Base, ABC

Ad-hoc assessments (qrels)

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_ir.data.beir.BeirAssessments(*, id, path)

Bases: AdhocAssessments

BEIR qrels: 3-column TSV with header (query-id, corpus-id, score).

id: str: The unique (sub-)dataset ID

path: path
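A minimal sketch of reading this qrels layout with the standard library, assuming integer scores. `read_beir_qrels` is a hypothetical helper and not the class's own reader.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def read_beir_qrels(fp: TextIO) -> Dict[str, Dict[str, int]]:
    """Read a 3-column TSV with a header row into {query-id: {corpus-id: score}}."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in csv.DictReader(fp, delimiter="\t"):
        qrels[row["query-id"]][row["corpus-id"]] = int(row["score"])
    return dict(qrels)


sample = io.StringIO("query-id\tcorpus-id\tscore\nq1\td3\t1\nq1\td7\t0\nq2\td3\t2\n")
qrels = read_beir_qrels(sample)
```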

XPM Config datamaestro_ir.data.beir.BeirParquetAssessments(*, id, path)

Bases: AdhocAssessments

BEIR qrels from Parquet.

id: str: The unique (sub-)dataset ID

path: path

XPM Config datamaestro_ir.data.lotte.LotteAssessments(*, id, path)

Bases: AdhocAssessments

LoTTE qrels: JSONL with qid and answer_pids fields.

id: str: The unique (sub-)dataset ID

path: path
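The JSONL layout can be read as sketched below. IDs are converted to strings here, which is an assumption made for uniformity with string document IDs elsewhere; `read_lotte_qrels` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable, List


def read_lotte_qrels(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each qid to its list of answer passage IDs (both as strings)."""
    qrels = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        qrels[str(record["qid"])] = [str(pid) for pid in record["answer_pids"]]
    return qrels


lotte_qrels = read_lotte_qrels(
    ['{"qid": 0, "answer_pids": [12, 45]}', '{"qid": 1, "answer_pids": [7]}']
)
```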

XPM Config datamaestro_ir.data.trec.TrecAdhocAssessments(*, id, path)

Bases: AdhocAssessments

id: str: The unique (sub-)dataset ID

path: path

class datamaestro_ir.data.AdhocAssessedTopic(topic_id: str, assessments: List[AdhocAssessment])

Bases: object

class datamaestro_ir.data.AdhocAssessment(doc_id: str)

Bases: object

Runs

XPM Config datamaestro_ir.data.AdhocRun(*, id)

Bases: Base

IR adhoc run

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_ir.data.csv.AdhocRunWithText(*, id, path, separator)

Bases: AdhocRun

(qid, doc.id, query, passage)

id: str: The unique (sub-)dataset ID

path: path

separator: str

XPM Config datamaestro_ir.data.trec.TrecAdhocRun(*, id, path)

Bases: AdhocRun

id: str: The unique (sub-)dataset ID

path: path

Results

XPM Config datamaestro_ir.data.AdhocResults(*, id)

Bases: Base

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_ir.data.trec.TrecAdhocResults(*, id, metrics, results, detailed)

Bases: AdhocResults

Adhoc results (TREC format)

id: str: The unique (sub-)dataset ID

metrics: List[datamaestro_ir.data.Measure]: List of reported metrics

results: path: Main results

detailed: path: Results per topic (if any)

Evaluation

XPM Config datamaestro_ir.data.Measure

Bases: Config

An Information Retrieval measure

Reranking

XPM Config datamaestro_ir.data.RerankAdhoc(*, id, documents, topics, assessments, run)

Bases: Adhoc

Re-ranking ad-hoc task based on an existing run

id: str: The unique (sub-)dataset ID

documents: datamaestro_ir.data.Documents: The set of documents

topics: datamaestro_ir.data.Topics: The set of topics

assessments: datamaestro_ir.data.AdhocAssessments: The set of assessments (for each topic)

run: datamaestro_ir.data.AdhocRun: The run to re-rank

Document Index

XPM Config datamaestro_ir.data.DocumentStore(*, id, count, file_access)

Bases: Documents

A document store

A document store can:

- match external/internal ID
- return the document content
- return the number of documents

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

XPM Config datamaestro_ir.data.CompressedDocumentStore(*, id, count, file_access, path)

Bases: DocumentStore, ABC

A document store backed by impact-index’s compressed document store

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the impact-index store directory

XPM Config datamaestro_ir.data.AdhocIndex(*, id, count, file_access)

Bases: DocumentStore

An index can be used to retrieve documents based on terms

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

XPM Config datamaestro_ir.data.anserini.Index(*, id, count, file_access, path, storePositions, storeDocvectors, storeRaw, storeContents, stemmer)

Bases: AdhocIndex

Anserini-backed index

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the index

storePositions: bool = False: Store term positions

storeDocvectors: bool = False: Store document term vectors

storeRaw: bool = False: Store raw document

storeContents: bool = False: Store processed documents (e.g. without HTML tags)

stemmer: str = porter: The stemmer to use

Training triplets

XPM Config datamaestro_ir.data.TrainingTriplets(*, id)

Bases: Base, ABC

Triplet for training IR systems: query / query ID, positive document, negative document

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_ir.data.PairwiseSampleDataset(*, id)

Bases: Base, ABC

Datasets where each record is a query with positive and negative samples

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_ir.data.TrainingTripletsLines(*, id, sep, path, doc_ids, topic_ids)

Bases: TrainingTriplets

Training triplets with one line per triple (query texts)

id: str: The unique (sub-)dataset ID

sep: str

path: path

doc_ids: bool: True if we have documents IDs

topic_ids: bool: True if we have query IDs

XPM Config datamaestro_ir.data.huggingface.HuggingFacePairwiseSampleDataset(*, id, repo_id, name, data_files, split, streaming, local_path, ids, query_id, pos_id, neg_id)

Bases: HuggingFaceDataset, PairwiseSampleDataset

Triplet for training IR systems: query / query ID, positive document, negative document

id: str: The unique (sub-)dataset ID

repo_id: str: The HuggingFace repository id (e.g. user/dataset).

name: str: HuggingFace dataset name (a.k.a. config).

data_files: str: Specific data files to load.

split: str: Dataset split to load.

streaming: bool = False: When True, load the dataset in streaming mode — no local cache.

local_path: path: If set, load from this local mirror instead of the HuggingFace Hub. Meta because the logical dataset is the same regardless of where the bytes come from.

ids: bool = True: True if the triplet is made of IDs, False otherwise

query_id: str = qid: The name of the field containing the query ID

pos_id: str = pos: The name of the field containing the positive samples

neg_id: str = neg: The name of the field containing the negative samples

class datamaestro_ir.data.PairwiseSample(*, topics: List[IDTextRecord], positives: List[IDTextRecord], negatives: Dict[str, List[IDTextRecord]])

Bases: ABC

A query with positive and negative samples

negatives: Dict[str, List[IDTextRecord]]: Non relevant documents, organized in a dictionary where keys are the algorithm used to retrieve the negatives

positives: List[IDTextRecord]: Relevant documents

topics: List[IDTextRecord]: The topic(s)

Distillation

Config classes that stream teacher-scored samples:

XPM Config datamaestro_ir.data.distillation.PairwiseDistillationSamples(*, id)

Bases: Base, Iterable[PairwiseDistillationSample]

Pairwise distillation file

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_ir.data.distillation.PairwiseDistillationSamplesTSV(*, id, path, with_docid, with_queryid)

Bases: PairwiseDistillationSamples, File

A TSV file (Score 1, Score 2, Query, Document 1, Document 2)

id: str: The unique (sub-)dataset ID

path: path: The path of the file

with_docid: bool

with_queryid: bool

XPM Config datamaestro_ir.data.distillation.ListwiseDistillationSamples(*, id)

Bases: Base, Iterable[ListwiseDistillationSample]

Listwise distillation file

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV(*, id, path, top_k, with_docid, with_queryid)

Bases: ListwiseDistillationSamples, File

A TSV file (“query_id”, “q0”, “doc_id”, “rank”, “score”, “system”)

id: str: The unique (sub-)dataset ID

path: path: The path of the file

top_k: int

with_docid: bool

with_queryid: bool

XPM Config datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSVWithAnnotations(*, id, path, top_k, with_docid, with_queryid, qrels)

Bases: ListwiseDistillationSamplesTSV

id: str: The unique (sub-)dataset ID

path: path: The path of the file

top_k: int

with_docid: bool

with_queryid: bool

qrels: datamaestro_ir.data.AdhocAssessments

XPM Config datamaestro_ir.data.distillation.PointwiseDistillationSamples(*, id)

Bases: Base, Iterable[PointwiseDistillationSample]

Iterable of pointwise distillation samples.

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_ir.data.distillation.ConcatPointwiseDistillationSamples(*, id, sources)

Bases: PointwiseDistillationSamples

Concatenate several PointwiseDistillationSamples sources in sequence (SQL UNION ALL semantics — no deduplication).

id: str: The unique (sub-)dataset ID

sources: List[datamaestro_ir.data.distillation.PointwiseDistillationSamples]: Sources to iterate in order.

XPM Config datamaestro_ir.data.distillation.RandomInterleavePointwiseDistillationSamples(*, id, sources, seed)

Bases: PointwiseDistillationSamples

Randomly interleave samples from multiple sources.

At each step, uniformly pick a source among the ones not yet exhausted and yield its next item. Order within each source is preserved — apply the source’s own shuffle (e.g. HF streaming .shuffle(seed=...)) if you also want in-source randomisation.

id: str: The unique (sub-)dataset ID

sources: List[datamaestro_ir.data.distillation.PointwiseDistillationSamples]: Sources to interleave.

seed: int: Seed for the source-selection RNG.

XPM Config datamaestro_ir.data.huggingface.HuggingFacePointwiseDistillationSamples(*, id, repo_id, name, data_files, split, streaming, local_path, query_field, document_field, score_field)

Bases: HuggingFaceDataset, PointwiseDistillationSamples

(query, document, teacher-score) samples from a HuggingFace dataset.

Schema-agnostic: override the field-name Meta attributes when the source dataset uses different column names.

id: str: The unique (sub-)dataset ID

repo_id: str: The HuggingFace repository id (e.g. user/dataset).

name: str: HuggingFace dataset name (a.k.a. config).

data_files: str: Specific data files to load.

split: str: Dataset split to load.

streaming: bool = False: When True, load the dataset in streaming mode — no local cache.

local_path: path: If set, load from this local mirror instead of the HuggingFace Hub. Meta because the logical dataset is the same regardless of where the bytes come from.

query_field: str = query: Name of the column holding the query text.

document_field: str = document: Name of the column holding the document text.

score_field: str = similarity: Name of the column holding the teacher score.

XPM Config datamaestro_ir.data.lighton.EmbeddingsPreTrainingSamples(*, id, repo_id, name, data_files, split, streaming, local_path, query_field, document_field, score_field, drop_field, duplicate_field, filter_drop, filter_duplicate, min_similarity, top_percentile, percentile_sample_size, percentile_sample_seed, shuffle_seed, shuffle_buffer_size)

Bases: HuggingFacePointwiseDistillationSamples

lightonai/embeddings-pre-training pointwise samples with the dataset-specific recipe knobs.

The HuggingFace dataset exposes a drop bool and a duplicate (nullable int) column; the recommended pre-training subset is drop=False AND duplicate IS NULL. For the FineWeb-Edu-style “top-K%” recipe, set top_percentile to a float in (0, 1].

id: str: The unique (sub-)dataset ID

repo_id: str: The HuggingFace repository id (e.g. user/dataset).

name: str: HuggingFace dataset name (a.k.a. config).

data_files: str: Specific data files to load.

split: str: Dataset split to load.

streaming: bool = False: When True, load the dataset in streaming mode — no local cache.

local_path: path: If set, load from this local mirror instead of the HuggingFace Hub. Meta because the logical dataset is the same regardless of where the bytes come from.

query_field: str = query: Name of the column holding the query text.

document_field: str = document: Name of the column holding the document text.

score_field: str = similarity: Name of the column holding the teacher score.

drop_field: str = drop: Name of the drop-flag column.

duplicate_field: str = duplicate: Name of the duplicate-index column.

filter_drop: bool = True: When True, skip rows where drop_field is True.

filter_duplicate: bool = True: When True, skip rows where duplicate_field is not None.

min_similarity: float: Minimum similarity (inclusive). Rows below are skipped.

top_percentile: float: If set (e.g. 0.35), keep only the top fraction of rows by similarity. The threshold is estimated from a reservoir sample (see percentile_sample_size) so this works in streaming mode without a full first pass — reproduction of the upstream recipe is therefore accurate up to sampling error.

percentile_sample_size: int = 1000000: Reservoir-sample size used to estimate the top_percentile threshold. Larger = more faithful to the exact quantile at the cost of a longer warmup.

percentile_sample_seed: int = 0: Seed for the reservoir sampler so threshold estimation is deterministic.

shuffle_seed: int: If set, pass through to the HuggingFace dataset’s .shuffle(seed=…) so iteration order is randomised per source. Meant to pair with a cross-source random interleave (e.g. RandomInterleavePointwiseDistillationSamples) for fully shuffled training streams. Param — different seeds produce different sample orderings, so this affects identity.

shuffle_buffer_size: int = 10000: Buffer size for streaming-mode shuffle (HF approximates a full permutation with a rolling buffer of this size). Param because the approximation quality — and thus the exact stream of rows — is a function of the buffer size.
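The reservoir-based threshold estimate behind top_percentile can be illustrated as follows. This is a sketch of the general technique (Algorithm R followed by a quantile lookup on the sample), not the class's actual code; `reservoir_sample` and `top_fraction_threshold` are hypothetical names.

```python
import random
from typing import Iterable, List


def reservoir_sample(values: Iterable[float], size: int, seed: int = 0) -> List[float]:
    """Algorithm R: keep a uniform random sample of at most `size` values."""
    rng = random.Random(seed)
    reservoir: List[float] = []
    for seen, value in enumerate(values):
        if seen < size:
            reservoir.append(value)
        else:
            slot = rng.randint(0, seen)  # inclusive bounds
            if slot < size:
                reservoir[slot] = value
    return reservoir


def top_fraction_threshold(sample: List[float], top_percentile: float) -> float:
    """Smallest sampled score that still falls in the top fraction."""
    if not sample or not 0 < top_percentile <= 1:
        raise ValueError("need a non-empty sample and top_percentile in (0, 1]")
    ordered = sorted(sample, reverse=True)
    keep = max(1, int(len(ordered) * top_percentile))
    return ordered[keep - 1]


scores = [i / 100 for i in range(100)]
threshold = top_fraction_threshold(reservoir_sample(scores, size=1000), 0.35)
kept_scores = [s for s in scores if s >= threshold]
```

When the stream is longer than the reservoir, the threshold is only an estimate of the exact quantile, which is the sampling error mentioned for top_percentile.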

Records yielded by the iterators above:

class datamaestro_ir.data.distillation.PairwiseDistillationSample(query: QueryT, documents: Tuple[DocT, DocT])

Bases: Generic[DocT, QueryT]

documents: Tuple[DocT, DocT]: Positive/negative document with teacher scores

query: QueryT: The query

class datamaestro_ir.data.distillation.ListwiseDistillationSample(query: QueryT, documents: List[DocT])

Bases: Generic[DocT, QueryT]

documents: List[DocT]: List of documents with their ranking position

query: QueryT: The query

class datamaestro_ir.data.distillation.PointwiseDistillationSample(query: QueryT, document: DocT)

Bases: Generic[DocT, QueryT]

A (query, document, teacher-score) triple.

The document carries the teacher’s similarity / relevance score as a ScoredDocument; this is the pointwise analogue of PairwiseDistillationSample (which pairs two docs per query).

document: DocT: The document (typically a ScoredDocument) with its teacher score

query: QueryT: The query

Transforms

XPM Config datamaestro_ir.transforms.StoreTrainingTripletTopicAdapter(*, id, store, data)

Bases: TrainingTriplets

Retrieve an adhoc topic text from a topic store (given the topic ID)

id: str

store: datamaestro_ir.data.Topics: The topic store to use

data: datamaestro_ir.data.TrainingTriplets: Input data

XPM Config datamaestro_ir.transforms.StoreTrainingTripletDocumentAdapter(*, id, store, data)

Bases: TrainingTriplets

Transforms training triplets to add the document text from a document store

id: str

store: datamaestro_ir.data.DocumentStore: The topic store to use

data: datamaestro_ir.data.TrainingTriplets: Input data

XPM Task datamaestro_ir.transforms.ShuffledTrainingTripletsLines(*, data, doc_ids, topic_ids, seed, compressed, sample_rate, sample_max)

Bases: Task

Submit type: Any

Shuffle a set of training triplets

data: datamaestro_ir.data.TrainingTriplets: Input data

path: pathgenerated: Output path

doc_ids: bool: Whether to use document ids

topic_ids: bool: True if we have query IDs

seed: int: The random seed

compressed: bool = True: Compress the output

sample_rate: float = 1.0: Sampling rate - set to 1 to keep all the samples

sample_max: int = 0: Maximum number of samples

tmp_path: pathgenerated: Path where temporary files will be stored