Information Retrieval API
This module provides data types for Information Retrieval datasets and experiments.
The core abstractions are:
- Documents: Collections of documents to be searched
- Topics: Queries or information needs
- Assessments: Relevance judgments (qrels) linking topics to relevant documents
- Adhoc: A complete IR test collection combining documents, topics, and assessments
For training neural rankers:
- TrainingTriplets: Training data as (query, positive_doc, negative_doc) triplets
- PairwiseSampleDataset: General pairwise training data
Data objects
- class datamaestro_ir.data.base.IDRecord
Bases: TypedDict
A record with just an ID
- class datamaestro_ir.data.base.IDTextRecord
Bases: dict
A record with an ID and a text item
- class datamaestro_ir.data.base.ScoredDocument(document: dict, score: float)
Bases: object
A data structure that associates a score with a document, so that documents can be sorted by score (e.g., for nDCG)
- document: dict
The document (IDRecord, TextRecord, or IDTextRecord)
- score: float
The associated score
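As a minimal sketch of how such a pairing can be used, the snippet below ranks documents by decreasing score. It defines a local stand-in dataclass with the same two fields instead of importing the library class; the `rank` helper and the sample records are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredDocument:
    """Local stand-in mirroring the two documented fields (not the library class)."""

    document: dict
    score: float


def rank(scored: List[ScoredDocument]) -> List[ScoredDocument]:
    """Return the documents sorted by decreasing score (hypothetical helper)."""
    return sorted(scored, key=lambda sd: sd.score, reverse=True)


# Hypothetical records holding only an ID
results = [
    ScoredDocument({"id": "d1"}, 0.2),
    ScoredDocument({"id": "d2"}, 1.5),
    ScoredDocument({"id": "d3"}, 0.9),
]
ranked_ids = [sd.document["id"] for sd in rank(results)]
```

A ranked list like this is the usual input to rank-based measures such as nDCG.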
- class datamaestro_ir.data.base.SimpleTextItem(text: str)
Bases: TextItem
A topic/document with a text record
- class datamaestro_ir.data.base.TextRecord
Bases: TypedDict
A record with just a text item
Collection
- XPM Config datamaestro_ir.data.Adhoc(*, id, documents, topics, assessments)
Bases: Base
An Adhoc IR collection with documents, topics and their assessments
- id: str
The unique (sub-)dataset ID
- documents: datamaestro_ir.data.Documents
The set of documents
- topics: datamaestro_ir.data.Topics
The set of topics
- assessments: datamaestro_ir.data.AdhocAssessments
The set of assessments (for each topic)
Topics
- XPM Config datamaestro_ir.data.Topics(*, id)
Bases: Base, ABC
A set of topics with associated IDs
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_ir.data.csv.Topics(*, id, path, separator)
Bases: Topics
Pairs of query ID and query text, delimited by a separator
- id: str
The unique (sub-)dataset ID
- path: path
- separator: str
- XPM Config datamaestro_ir.data.FilteredTopics(*, id, topics, qids_path)
Bases: Topics
Merges multiple Topics sources, keeping only query IDs listed in a file
- id: str
The unique (sub-)dataset ID
- topics: List[datamaestro_ir.data.Topics]
- qids_path: path
Path to a file with one query ID per line
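The merge-and-filter behaviour described for FilteredTopics can be sketched with plain Python. This is an illustration only, not the library implementation: topics are represented as dicts with hypothetical `id` and `text` keys, and the query-ID file is simulated in memory.

```python
import io
from typing import Dict, Iterable, Iterator, List, Set


def read_qids(lines: Iterable[str]) -> Set[str]:
    """Read one query ID per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_topics(
    sources: List[List[Dict[str, str]]], qids: Set[str]
) -> Iterator[Dict[str, str]]:
    """Yield topics from each source in turn, keeping only the listed IDs."""
    for source in sources:
        for topic in source:
            if topic["id"] in qids:
                yield topic


# Hypothetical data: two topic sources and an in-memory query-ID file
train = [{"id": "q1", "text": "neural ranking"}, {"id": "q2", "text": "bm25"}]
dev = [{"id": "q3", "text": "dense retrieval"}]
qids = read_qids(io.StringIO("q1\nq3\n\n"))
kept_ids = [topic["id"] for topic in filter_topics([train, dev], qids)]
```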
- XPM Config datamaestro_ir.transforms.TopicWrapper
Bases: Config, ABC
Modify topics on the fly using a topic wrapper
Dataset-specific Topics
- XPM Config datamaestro_ir.data.beir.BeirTopics(*, id, path)
Bases: Topics
BEIR queries: JSONL with _id and text fields.
- id: str
The unique (sub-)dataset ID
- path: path
- XPM Config datamaestro_ir.data.beir.BeirParquetTopics(*, id, path)
Bases: Topics
BEIR queries from Parquet.
- id: str
The unique (sub-)dataset ID
- path: path
- XPM Config datamaestro_ir.data.lotte.LotteTopics(*, id, path)
Bases: Topics
LoTTE queries: TSV with query_id and text fields.
- id: str
The unique (sub-)dataset ID
- path: path
Documents
- XPM Config datamaestro_ir.data.Documents(*, id, count)
Bases: Base
A set of documents with identifiers
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- XPM Config datamaestro_ir.data.csv.Documents(*, id, count, path, separator)
Bases: Documents
One line per document, format pid<SEP>text
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- path: path
- separator: str
Dataset-specific documents
- XPM Config datamaestro_ir.data.cord19.Documents(*, id, path, delimiter, ignore, names_row, count)
- id: str
The unique (sub-)dataset ID
- path: path
The path of the file
- delimiter: str = ,
- ignore: int = 0
- names_row: int = -1
- count: int
Number of documents
- XPM Config datamaestro_ir.data.trec.TipsterCollection(*, id, count, path, patterns)
Bases: Documents
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- path: path
- patterns: List[str]
- XPM Config datamaestro_ir.data.beir.BeirDocumentStore(*, id, count, file_access, path, lookup_key)
Bases: CompressedDocumentStore
Document store for BEIR datasets.
Content bytes encode title and text as: title_bytes + b”0” + text_bytes. The only key is “id” (the external document ID).
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- lookup_key: str = id
- XPM Config datamaestro_ir.data.lotte.LotteDocumentStore(*, id, count, file_access, path, lookup_key)
Bases: CompressedDocumentStore
Document store for LoTTE datasets.
Content bytes encode text as UTF-8. The only key is “id” (the document ID).
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- lookup_key: str = id
- XPM Config datamaestro_ir.data.stores.MsMarcoPassagesStore(*, id, count, file_access, path)
Bases: CompressedDocumentStore
Document store for MS MARCO passages where internal ID = external ID
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.stores.MsMarcoPassageV2Store(*, id, count, file_access, path)
Bases: CompressedDocumentStore
Document store for MS MARCO passage collection v2.
Each passage has its text, parent document id, and character spans within that document.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.stores.CarParagraphStore(*, id, count, file_access, path)
Bases: CompressedDocumentStore
Document store for TREC CAR v2.0 paragraphs.
Each document is a simple text paragraph identified by its paragraph ID.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.stores.WapoDocumentStore(*, id, count, file_access, path)
Bases: CompressedDocumentStore
Document store for Washington Post (WAPO) v2/v4 full documents.
Stores full WAPO documents with all metadata fields.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.stores.WapoPassageStore(*, id, count, file_access, path)
Bases: CompressedDocumentStore
Document store for WAPO paragraph-level passages (CaST v0).
Each WAPO document is split into paragraphs. Document IDs follow the format {doc_id}-{paragraph_index} (1-indexed), matching the official CaST tools script.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.stores.KiltDocumentStore(*, id, count, file_access, path)
Bases: CompressedDocumentStore
Document store for KILT (Knowledge Intensive Language Tasks) knowledge source.
Stores KILT documents with title, URL, and body text.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.stores.MsMarcoDocumentStore(*, id, count, file_access, path)
Bases: CompressedDocumentStore
Document store for MS MARCO document collection (v1).
Each document has URL, title, and body fields.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.stores.MsMarcoDocumentV2Store(*, id, count, file_access, path)
Bases: CompressedDocumentStore
Document store for MS MARCO document collection v2.
Each document has URL, title, headings, and body fields.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.stores.CastSegmentedPassageStore(*, id, count, file_access, base_store, offsets_path, dupes_path)
Bases: DocumentStore
Document store for CaST segmented passages (v2/v3).
Reads a base document store and an offset file to create passage-level documents. Each passage is defined by character ranges applied to the base document text.
Offset file format (gzipped JSONL):
{"id":"MARCO_00_1454834","ranges":[[[0,917]],[[918,2082]]],"md5":"..."}
Passage IDs follow the format {doc_id}-{passage_index} (1-indexed).
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- base_store: datamaestro_ir.data.DocumentStore
The base document store containing full documents
- offsets_path: path
Path to the gzipped JSONL offset file
- dupes_path: path
Path to the duplicates file (one doc ID per line to exclude)
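To make the offset format concrete, here is a rough sketch of how passages could be derived from one offset line and the base document text. It is not the store's implementation: the `passages` helper and the sample document are hypothetical, and treating each `[start, end]` pair as a half-open character range is an assumption, since the text above does not say whether the end offset is inclusive.

```python
import json
from typing import Dict, Iterator, Tuple


def passages(offset_line: str, base_texts: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    """Yield (passage_id, passage_text) pairs for one offset-file line."""
    entry = json.loads(offset_line)
    text = base_texts[entry["id"]]
    # Passage indices are 1-based, as in {doc_id}-{passage_index}
    for index, ranges in enumerate(entry["ranges"], start=1):
        # Assumption: each [start, end] pair is a half-open character range
        passage = "".join(text[start:end] for start, end in ranges)
        yield f"{entry['id']}-{index}", passage


# Hypothetical base document and offset line (md5 field omitted)
base_texts = {"DOC_1": "Hello world. Second passage."}
line = '{"id": "DOC_1", "ranges": [[[0, 12]], [[13, 28]]]}'
result = list(passages(line, base_texts))
```

In the real store, document IDs listed in the duplicates file would additionally be excluded; that step is left out here.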
- XPM Config datamaestro_ir.data.stores.TipsterDocumentStore(*, id, count, file_access, path)
Bases: CompressedDocumentStore
Document store for TIPSTER/AQUAINT document collections.
Each document is stored as JSON with title and body fields, matching the structured output of the TIPSTER SGML parser.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.PrefixedDocumentStore(*, id, count, file_access, sources, prefixes)
Bases: DocumentStore
Combines multiple DocumentStores with ID prefixes.
Each document ID is expected to start with one of the given prefixes, which determines which underlying store to query.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- sources: List[datamaestro_ir.data.DocumentStore]
- prefixes: List[str]
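The prefix-based routing can be illustrated with a small stand-in in which each underlying store is a plain dict. The `PrefixRouter` class is hypothetical, and stripping the prefix before the lookup is an assumption; the actual store may pass the full ID through.

```python
from typing import Dict, List, Optional


class PrefixRouter:
    """Route a document ID to one of several stores based on its prefix."""

    def __init__(self, sources: List[Dict[str, str]], prefixes: List[str]):
        if len(sources) != len(prefixes):
            raise ValueError("sources and prefixes must have the same length")
        self.routes = list(zip(prefixes, sources))

    def get(self, doc_id: str) -> Optional[str]:
        for prefix, source in self.routes:
            if doc_id.startswith(prefix):
                # Assumption: the prefix is removed before querying the store
                return source.get(doc_id[len(prefix):])
        raise KeyError(f"No prefix matches document ID {doc_id!r}")


# Hypothetical stores sharing the same internal IDs
router = PrefixRouter(
    sources=[{"1": "marco text"}, {"1": "wapo text"}],
    prefixes=["MARCO_", "WAPO_"],
)
```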
Assessments
- XPM Config datamaestro_ir.data.AdhocAssessments(*, id)
Bases: Base, ABC
Ad-hoc assessments (qrels)
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_ir.data.beir.BeirAssessments(*, id, path)
Bases: AdhocAssessments
BEIR qrels: 3-column TSV with header (query-id, corpus-id, score).
- id: str
The unique (sub-)dataset ID
- path: path
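For reference, a file in the format described above can be read with the standard library alone. This is a sketch of the file format rather than of the class itself; the `read_beir_qrels` helper and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def read_beir_qrels(handle: TextIO) -> Dict[str, Dict[str, int]]:
    """Parse a 3-column TSV with a (query-id, corpus-id, score) header."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        qrels[row["query-id"]][row["corpus-id"]] = int(row["score"])
    return dict(qrels)


# Hypothetical in-memory qrels file
sample = io.StringIO("query-id\tcorpus-id\tscore\nq1\td3\t1\nq1\td7\t0\nq2\td3\t2\n")
qrels = read_beir_qrels(sample)
```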
- XPM Config datamaestro_ir.data.beir.BeirParquetAssessments(*, id, path)
Bases: AdhocAssessments
BEIR qrels from Parquet.
- id: str
The unique (sub-)dataset ID
- path: path
- XPM Config datamaestro_ir.data.lotte.LotteAssessments(*, id, path)
Bases: AdhocAssessments
LoTTE qrels: JSONL with qid and answer_pids fields.
- id: str
The unique (sub-)dataset ID
- path: path
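A JSONL file with these two fields can be loaded as follows. Again this is only a sketch of the file format: the `read_lotte_qrels` helper and the sample lines are hypothetical, and converting IDs to strings is a choice made here for uniformity.

```python
import io
import json
from typing import Dict, List, TextIO


def read_lotte_qrels(handle: TextIO) -> Dict[str, List[str]]:
    """Map each qid to the list of passage IDs that answer it."""
    qrels: Dict[str, List[str]] = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        qrels[str(record["qid"])] = [str(pid) for pid in record["answer_pids"]]
    return qrels


# Hypothetical in-memory qrels file
sample = io.StringIO(
    '{"qid": 0, "answer_pids": [12, 40]}\n{"qid": 1, "answer_pids": [7]}\n'
)
lotte_qrels = read_lotte_qrels(sample)
```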
- XPM Config datamaestro_ir.data.trec.TrecAdhocAssessments(*, id, path)
Bases: AdhocAssessments
- id: str
The unique (sub-)dataset ID
- path: path
- class datamaestro_ir.data.AdhocAssessedTopic(topic_id: str, assessments: List[AdhocAssessment])
Bases: object
- class datamaestro_ir.data.AdhocAssessment(doc_id: str)
Bases: object
Runs
- XPM Config datamaestro_ir.data.AdhocRun(*, id)
Bases: Base
IR adhoc run
- id: str
The unique (sub-)dataset ID
Results
- XPM Config datamaestro_ir.data.trec.TrecAdhocResults(*, id, metrics, results, detailed)
Bases: AdhocResults
Adhoc results (TREC format)
- id: str
The unique (sub-)dataset ID
- metrics: List[datamaestro_ir.data.Measure]
List of reported metrics
- results: path
Main results
- detailed: path
Results per topic (if any)
Evaluation
- XPM Config datamaestro_ir.data.Measure
Bases: Config
An Information Retrieval measure
Reranking
- XPM Config datamaestro_ir.data.RerankAdhoc(*, id, documents, topics, assessments, run)
Bases: Adhoc
Re-ranking ad-hoc task based on an existing run
- id: str
The unique (sub-)dataset ID
- documents: datamaestro_ir.data.Documents
The set of documents
- topics: datamaestro_ir.data.Topics
The set of topics
- assessments: datamaestro_ir.data.AdhocAssessments
The set of assessments (for each topic)
- run: datamaestro_ir.data.AdhocRun
The run to re-rank
Document Index
- XPM Config datamaestro_ir.data.DocumentStore(*, id, count, file_access)
Bases: Documents
A document store
A document store can match external and internal IDs, return the document content, and return the number of documents.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- XPM Config datamaestro_ir.data.CompressedDocumentStore(*, id, count, file_access, path)
Bases: DocumentStore, ABC
A document store backed by impact-index’s compressed document store
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the impact-index store directory
- XPM Config datamaestro_ir.data.AdhocIndex(*, id, count, file_access)
Bases: DocumentStore
An index can be used to retrieve documents based on terms
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- XPM Config datamaestro_ir.data.anserini.Index(*, id, count, file_access, path, storePositions, storeDocvectors, storeRaw, storeContents, stemmer)
Bases: AdhocIndex
Anserini-backed index
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the index
- storePositions: bool = False
Store term positions
- storeDocvectors: bool = False
Store document term vectors
- storeRaw: bool = False
Store raw document
- storeContents: bool = False
Store processed documents (e.g. without HTML tags)
- stemmer: str = porter
The stemmer to use
Training triplets
- XPM Config datamaestro_ir.data.TrainingTriplets(*, id)
Bases: Base, ABC
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_ir.data.PairwiseSampleDataset(*, id)
Bases: Base, ABC
Datasets where each record is a query with positive and negative samples
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_ir.data.TrainingTripletsLines(*, id, sep, path, doc_ids, topic_ids)
Bases: TrainingTriplets
Training triplets with one line per triple (query texts)
- id: str
The unique (sub-)dataset ID
- sep: str
- path: path
- doc_ids: bool
True if we have document IDs
- topic_ids: bool
True if we have query IDs
- XPM Config datamaestro_ir.data.huggingface.HuggingFacePairwiseSampleDataset(*, id, repo_id, name, data_files, split, streaming, local_path, ids, query_id, pos_id, neg_id)
Bases: HuggingFaceDataset, PairwiseSampleDataset
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique (sub-)dataset ID
- repo_id: str
The HuggingFace repository id (e.g. user/dataset).
- name: str
HuggingFace dataset name (a.k.a. config).
- data_files: str
Specific data files to load.
- split: str
Dataset split to load.
- streaming: bool = False
When True, load the dataset in streaming mode — no local cache.
- local_path: path
If set, load from this local mirror instead of the HuggingFace Hub.
This is a Meta parameter because the logical dataset is the same regardless of where the bytes come from.
- ids: bool = True
True if the triplet is made of IDs, False otherwise
- query_id: str = qid
The name of the field containing the query ID
- pos_id: str = pos
The name of the field containing the positive samples
- neg_id: str = neg
The name of the field containing the negative samples
- class datamaestro_ir.data.PairwiseSample(*, topics: List[IDTextRecord], positives: List[IDTextRecord], negatives: Dict[str, List[IDTextRecord]])
Bases: ABC
A query with positive and negative samples
- negatives: Dict[str, List[IDTextRecord]]
Non-relevant documents, organized in a dictionary keyed by the algorithm used to retrieve the negatives
- positives: List[IDTextRecord]
Relevant documents
- topics: List[IDTextRecord]
The topic(s)
Distillation
Config classes that stream teacher-scored samples:
- XPM Config datamaestro_ir.data.distillation.PairwiseDistillationSamples(*, id)
Bases: Base, Iterable[PairwiseDistillationSample]
Pairwise distillation file
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_ir.data.distillation.PairwiseDistillationSamplesTSV(*, id, path, with_docid, with_queryid)
Bases: PairwiseDistillationSamples, File
A TSV file (Score 1, Score 2, Query, Document 1, Document 2)
- id: str
The unique (sub-)dataset ID
- path: path
The path of the file
- with_docid: bool
- with_queryid: bool
- XPM Config datamaestro_ir.data.distillation.ListwiseDistillationSamples(*, id)
Bases: Base, Iterable[ListwiseDistillationSample]
Listwise distillation file
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV(*, id, path, top_k, with_docid, with_queryid)
Bases: ListwiseDistillationSamples, File
A TSV file (“query_id”, “q0”, “doc_id”, “rank”, “score”, “system”)
- id: str
The unique (sub-)dataset ID
- path: path
The path of the file
- top_k: int
- with_docid: bool
- with_queryid: bool
- XPM Config datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSVWithAnnotations(*, id, path, top_k, with_docid, with_queryid, qrels)
Bases: ListwiseDistillationSamplesTSV
- id: str
The unique (sub-)dataset ID
- path: path
The path of the file
- top_k: int
- with_docid: bool
- with_queryid: bool
- XPM Config datamaestro_ir.data.distillation.PointwiseDistillationSamples(*, id)
Bases: Base, Iterable[PointwiseDistillationSample]
Iterable of pointwise distillation samples.
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_ir.data.distillation.ConcatPointwiseDistillationSamples(*, id, sources)
Bases: PointwiseDistillationSamples
Concatenate several PointwiseDistillationSamples sources in sequence (SQL UNION ALL semantics: no deduplication).
- id: str
The unique (sub-)dataset ID
- sources: List[datamaestro_ir.data.distillation.PointwiseDistillationSamples]
Sources to iterate in order.
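The concatenation semantics are those of chaining iterables, which can be sketched with `itertools.chain`. The `concat_samples` helper and the tuple samples below are hypothetical; the point is only that sources are read in order and duplicates are kept.

```python
from itertools import chain
from typing import Iterable, Iterator, List, Tuple

Sample = Tuple[str, str, float]  # hypothetical (query, document, score) record


def concat_samples(sources: List[Iterable[Sample]]) -> Iterator[Sample]:
    """Iterate each source to exhaustion, in order, without deduplication."""
    return chain.from_iterable(sources)


first = [("q1", "d1", 0.9), ("q2", "d2", 0.4)]
second = [("q1", "d1", 0.9)]  # same record as in the first source
combined = list(concat_samples([first, second]))
```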
- XPM Config datamaestro_ir.data.distillation.RandomInterleavePointwiseDistillationSamples(*, id, sources, seed)
Bases: PointwiseDistillationSamples
Randomly interleave samples from multiple sources.
At each step, uniformly pick a source among the ones not yet exhausted and yield its next item. Order within each source is preserved; apply the source’s own shuffle (e.g. HF streaming .shuffle(seed=...)) if you also want in-source randomisation.
- id: str
The unique (sub-)dataset ID
- sources: List[datamaestro_ir.data.distillation.PointwiseDistillationSamples]
Sources to interleave.
- seed: int
Seed for the source-selection RNG.
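The interleaving procedure described above can be sketched as a small generator. This is an illustrative reading of the description, not the library code; the `random_interleave` function is hypothetical and uses Python's `random.Random` for the source-selection RNG.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def random_interleave(sources: List[Iterable[T]], seed: int) -> Iterator[T]:
    """Yield items by repeatedly picking a non-exhausted source uniformly."""
    rng = random.Random(seed)
    active = [iter(source) for source in sources]
    while active:
        index = rng.randrange(len(active))
        try:
            yield next(active[index])
        except StopIteration:
            # The picked source is exhausted: drop it and pick again
            del active[index]


merged = list(random_interleave([[1, 2, 3], ["a", "b"]], seed=0))
numbers = [item for item in merged if isinstance(item, int)]
letters = [item for item in merged if isinstance(item, str)]
```

Whatever the seed, `numbers` and `letters` keep their original order, since only the choice of source is randomised.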
- XPM Config datamaestro_ir.data.huggingface.HuggingFacePointwiseDistillationSamples(*, id, repo_id, name, data_files, split, streaming, local_path, query_field, document_field, score_field)
Bases: HuggingFaceDataset, PointwiseDistillationSamples
(query, document, teacher-score) samples from a HuggingFace dataset.
Schema-agnostic: override the field-name Meta attributes when the source dataset uses different column names.
- id: str
The unique (sub-)dataset ID
- repo_id: str
The HuggingFace repository id (e.g. user/dataset).
- name: str
HuggingFace dataset name (a.k.a. config).
- data_files: str
Specific data files to load.
- split: str
Dataset split to load.
- streaming: bool = False
When True, load the dataset in streaming mode — no local cache.
- local_path: path
If set, load from this local mirror instead of the HuggingFace Hub.
This is a Meta parameter because the logical dataset is the same regardless of where the bytes come from.
- query_field: str = query
Name of the column holding the query text.
- document_field: str = document
Name of the column holding the document text.
- score_field: str = similarity
Name of the column holding the teacher score.
- XPM Config datamaestro_ir.data.lighton.EmbeddingsPreTrainingSamples(*, id, repo_id, name, data_files, split, streaming, local_path, query_field, document_field, score_field, drop_field, duplicate_field, filter_drop, filter_duplicate, min_similarity, top_percentile, percentile_sample_size, percentile_sample_seed, shuffle_seed, shuffle_buffer_size)
Bases: HuggingFacePointwiseDistillationSamples
lightonai/embeddings-pre-training pointwise samples with the dataset-specific recipe knobs.
The HuggingFace dataset exposes a drop bool and a duplicate (nullable int) column; the recommended pre-training subset is drop=False AND duplicate IS NULL. For the FineWeb-Edu-style “top-K%” recipe, set top_percentile to a float in (0, 1].
- id: str
The unique (sub-)dataset ID
- repo_id: str
The HuggingFace repository id (e.g. user/dataset).
- name: str
HuggingFace dataset name (a.k.a. config).
- data_files: str
Specific data files to load.
- split: str
Dataset split to load.
- streaming: bool = False
When True, load the dataset in streaming mode — no local cache.
- local_path: path
If set, load from this local mirror instead of the HuggingFace Hub.
This is a Meta parameter because the logical dataset is the same regardless of where the bytes come from.
- query_field: str = query
Name of the column holding the query text.
- document_field: str = document
Name of the column holding the document text.
- score_field: str = similarity
Name of the column holding the teacher score.
- drop_field: str = drop
Name of the drop-flag column.
- duplicate_field: str = duplicate
Name of the duplicate-index column.
- filter_drop: bool = True
When True, skip rows where drop_field is True.
- filter_duplicate: bool = True
When True, skip rows where duplicate_field is not None.
- min_similarity: float
Minimum similarity (inclusive). Rows below are skipped.
- top_percentile: float
If set (e.g. 0.35), keep only the top fraction of rows by similarity. The threshold is estimated from a reservoir sample (see percentile_sample_size) so this works in streaming mode without a full first pass; reproduction of the upstream recipe is therefore accurate up to sampling error.
- percentile_sample_size: int = 1000000
Reservoir-sample size used to estimate the top_percentile threshold. Larger = more faithful to the exact quantile at the cost of a longer warmup.
- percentile_sample_seed: int = 0
Seed for the reservoir sampler so threshold estimation is deterministic.
- shuffle_seed: int
If set, pass through to the HuggingFace dataset’s .shuffle(seed=…) so iteration order is randomised per source. Meant to pair with a cross-source random interleave (e.g. RandomInterleavePointwiseDistillationSamples) for fully shuffled training streams. This is a Param: different seeds produce different sample orderings, so this affects identity.
- shuffle_buffer_size: int = 10000
Buffer size for streaming-mode shuffle (HF approximates a full permutation with a rolling buffer of this size). This is a Param because the approximation quality, and thus the exact stream of rows, is a function of the buffer size.
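The top_percentile threshold estimation can be illustrated with a classic reservoir sampler (Algorithm R) followed by a quantile lookup. This is a sketch under assumptions, not the class's code: both helper functions are hypothetical, and the exact quantile convention used by the library is not stated above.

```python
import random
from typing import Iterable, List


def reservoir_sample(values: Iterable[float], size: int, seed: int) -> List[float]:
    """Keep a uniform random sample of at most `size` values from a stream."""
    rng = random.Random(seed)
    reservoir: List[float] = []
    for seen, value in enumerate(values):
        if seen < size:
            reservoir.append(value)
        else:
            slot = rng.randint(0, seen)  # inclusive bounds
            if slot < size:
                reservoir[slot] = value
    return reservoir


def top_fraction_threshold(sample: List[float], top_percentile: float) -> float:
    """Smallest score that still belongs to the top fraction of the sample."""
    if not 0 < top_percentile <= 1:
        raise ValueError("top_percentile must be in (0, 1]")
    if not sample:
        raise ValueError("cannot estimate a threshold from an empty sample")
    ordered = sorted(sample, reverse=True)
    keep = max(1, round(top_percentile * len(ordered)))
    return ordered[keep - 1]


# Hypothetical similarity scores; the reservoir here is as large as the stream
scores = [i / 100 for i in range(100)]
sample = reservoir_sample(scores, size=100, seed=0)
threshold = top_fraction_threshold(sample, 0.35)
kept = [score for score in scores if score >= threshold]
```

With a reservoir smaller than the stream, the threshold is only an estimate, which is the sampling error mentioned in the top_percentile description.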
Records yielded by the iterators above:
- class datamaestro_ir.data.distillation.PairwiseDistillationSample(query: QueryT, documents: Tuple[DocT, DocT])
Bases: Generic[DocT, QueryT]
- documents: Tuple[DocT, DocT]
Positive/negative document with teacher scores
- query: QueryT
The query
- class datamaestro_ir.data.distillation.ListwiseDistillationSample(query: QueryT, documents: List[DocT])
Bases: Generic[DocT, QueryT]
- documents: List[DocT]
List of documents with their ranking position
- query: QueryT
The query
- class datamaestro_ir.data.distillation.PointwiseDistillationSample(query: QueryT, document: DocT)
Bases: Generic[DocT, QueryT]
A (query, document, teacher-score) triple.
The document carries the teacher’s similarity / relevance score as a ScoredDocument; this is the pointwise analogue of PairwiseDistillationSample (which pairs two docs per query).
- document: DocT
The document (typically a ScoredDocument) with its teacher score
- query: QueryT
The query
Transforms
- XPM Config datamaestro_ir.transforms.StoreTrainingTripletTopicAdapter(*, id, store, data)
Bases: TrainingTriplets
Retrieve an adhoc topic text from a topic store (given the topic ID)
- id: str
- store: datamaestro_ir.data.Topics
The topic store to use
- data: datamaestro_ir.data.TrainingTriplets
Input data
- XPM Config datamaestro_ir.transforms.StoreTrainingTripletDocumentAdapter(*, id, store, data)
Bases: TrainingTriplets
Transforms training triplets to add the document text from a document store
- id: str
- store: datamaestro_ir.data.DocumentStore
The document store to use
- data: datamaestro_ir.data.TrainingTriplets
Input data
- XPM Task datamaestro_ir.transforms.ShuffledTrainingTripletsLines(*, data, doc_ids, topic_ids, seed, compressed, sample_rate, sample_max)
Bases: Task
Submit type: Any
Shuffle a set of training triplets
- data: datamaestro_ir.data.TrainingTriplets
Input data
- path: path (generated)
Output path
- doc_ids: bool
Whether to use document ids
- topic_ids: bool
True if we have query IDs
- seed: int
The random seed
- compressed: bool = True
Compress the output
- sample_rate: float = 1.0
Sampling rate - set to 1 to keep all the samples
- sample_max: int = 0
Maximum number of samples
- tmp_path: path (generated)
Path where temporary files will be stored