Distillation Datasets
This section lists datasets for knowledge distillation in neural IR, where teacher model scores are used to train student rankers.
Pointwise Distillation
Pointwise distillation datasets provide teacher-scored (query, document,
similarity) triples — a single document per query together with a teacher
similarity score.
LightOn embeddings-pre-training dataset family.
Registers the HuggingFace dataset
https://huggingface.co/datasets/lightonai/embeddings-pre-training as a
single variant family. Callers select a specific variant via a
query-style selector on the dataset id:
datamaestro prepare 'ai.lighton.embeddings_pre_training[name=agnews]'
All 73 HF configs are exposed through the name axis; additional axes
cover loading mode and the dataset’s recipe knobs (drop filter,
duplicate filter, similarity floor, and top-percentile subset).
See datamaestro.variants.AxesVariants for the query syntax.
The recipe from the mGTE technical report lives in a sibling module
(denseon_lateon) because it UNIONs multiple configs with
config-specific filter rules and is therefore structurally a separate
dataset.
-
Dataset ai.lighton.embeddings_pre_training
datamaestro_ir.data.lighton.EmbeddingsPreTrainingSamples
LightOn ``embeddings-pre-training`` teacher-scored pretraining pairs.
Tags: pointwise, distillation, pre-training
Tasks: learning to rank
External link: https://huggingface.co/datasets/lightonai/embeddings-pre-training
Teacher-scored (query, document, similarity) pairs across 73 source corpora. Variant axes: name (config), streaming, filter_drop, filter_duplicate, min_similarity, top_percentile. See EmbeddingsPreTrainingSamples for filter semantics.
Variant space for lightonai/embeddings-pre-training.
Variants:
name: str (domain: 73 values)
HuggingFace config name; selects one of the 73 source corpora.
streaming: bool (default=True; domain=[False, True]; excluded from id)
Pure loading-mode flag (same data either way); excluded from the formatted selector and dataset id. The underlying HuggingFaceDataset.streaming field is also Meta, so experimaestro’s identity hash already ignores it.
filter_drop: bool (default=True; domain=[True, False])
When True, skip rows with drop=True (the upstream dataset’s recommended pre-training subset).
filter_duplicate: bool (default=True; domain=[True, False])
When True, skip rows whose duplicate column is not null.
min_similarity: Optional[float] (default=None)
Minimum teacher similarity (inclusive). Rows below are skipped.
top_percentile: Optional[float] (default=None)
Keep only the top fraction of rows by similarity (e.g. 0.35 for the FineWeb-Edu top-35% recipe). Threshold is estimated from a reservoir sample, so it works in streaming mode.
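The filter axes above can be read as a per-row predicate plus a similarity cut-off estimated from a sample. The following is a minimal sketch of that reading, not the library's implementation: the function names keep_row and estimate_threshold, the sample_size parameter, and the use of Algorithm R for the reservoir are assumptions made for illustration. Only the column names drop, duplicate and similarity come from the description above.

```python
import random


def keep_row(row, filter_drop=True, filter_duplicate=True, min_similarity=None):
    """Return True when a row survives the row-level filters (hypothetical helper)."""
    if filter_drop and row.get("drop"):
        return False  # row is flagged drop=True upstream
    if filter_duplicate and row.get("duplicate") is not None:
        return False  # row is marked as a duplicate of another row
    if min_similarity is not None and row["similarity"] < min_similarity:
        return False  # below the inclusive similarity floor
    return True


def estimate_threshold(similarities, top_fraction, sample_size=1000, seed=0):
    """Estimate the cut-off for the top `top_fraction` of similarities.

    A fixed-size reservoir (Algorithm R) is filled in one pass, so the input
    can be a stream of unknown length. Sample size and seed are assumptions.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, value in enumerate(similarities):
        if i < sample_size:
            reservoir.append(value)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < sample_size:
                reservoir[j] = value
    if not reservoir:
        raise ValueError("cannot estimate a threshold from an empty stream")
    reservoir.sort(reverse=True)
    cut = max(1, round(top_fraction * len(reservoir)))
    return reservoir[cut - 1]  # rows with similarity >= this value are kept
```

With a reservoir at least as large as the stream the estimate is exact; for longer streams it is an approximation whose quality depends on the sample size.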
DenseON-LateON mGTE-style training recipe
Union three groups of configs with different per-group filter rules:
Standard sources (all configs except fw-edu, wikipedia_hlp_cm, and wikipedia_hlp_dl): keep rows with drop=False and duplicate IS NULL and similarity >= 3.0.
fw-edu (FineWeb-Edu, no rule-based filter or dedup applied upstream): keep only the top ~35% by similarity. No drop/duplicate filter, since the config itself isn’t pre-filtered.
wikipedia_hlp_cm and wikipedia_hlp_dl: included as-is, no filter.
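The three groups can be summarised as a mapping from config name to filter settings. This is an illustrative sketch only: group_rule is a hypothetical helper, and the dictionary keys simply reuse the axis names of the embeddings-pre-training variant space; the actual module may wire the rules differently.

```python
# Configs that are included as-is, with no filter at all.
PASS_THROUGH = {"wikipedia_hlp_cm", "wikipedia_hlp_dl"}
# Configs filtered only by a top-fraction cut on similarity.
TOP_FRACTION = {"fw-edu": 0.35}


def group_rule(config_name):
    """Return the filter settings for one config (hypothetical helper)."""
    rule = {
        "filter_drop": False,
        "filter_duplicate": False,
        "min_similarity": None,
        "top_percentile": None,
    }
    if config_name in PASS_THROUGH:
        return rule
    if config_name in TOP_FRACTION:
        rule["top_percentile"] = TOP_FRACTION[config_name]
        return rule
    # Standard sources: drop=False, duplicate IS NULL, similarity >= 3.0.
    rule.update(filter_drop=True, filter_duplicate=True, min_similarity=3.0)
    return rule
```

For instance, group_rule("agnews") enables both row filters with a similarity floor of 3.0, while group_rule("fw-edu") only sets top_percentile to 0.35.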
Variant axes:
seed: null (default) concatenates the three groups in order (ConcatPointwise). A non-null integer switches to RandomInterleavePointwiseDistillationSamples, which picks a source uniformly at each step, and propagates the seed to each HuggingFace source’s .shuffle(seed=…) for in-source randomisation.
download: false (default) streams from the Hub; true downloads each source config to the local HF cache (streaming=False). Use with care: the full dataset is ~2TB.
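The difference between the two seed modes can be sketched with plain Python iterables. The function combine below is hypothetical and stands in for ConcatPointwise and RandomInterleavePointwiseDistillationSamples; in particular, how the real classes treat an exhausted source is not stated above, so dropping it from the pool is an assumption of this sketch.

```python
import itertools
import random


def combine(sources, seed=None):
    """Yield items from several sources, concatenated or randomly interleaved."""
    if seed is None:
        # No seed: exhaust each source in order, one after the other.
        yield from itertools.chain.from_iterable(sources)
        return
    rng = random.Random(seed)
    active = [iter(source) for source in sources]
    while active:
        idx = rng.randrange(len(active))  # uniform pick among remaining sources
        try:
            yield next(active[idx])
        except StopIteration:
            del active[idx]  # assumption: an exhausted source is dropped


if __name__ == "__main__":
    groups = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]
    print(list(combine(groups)))           # concatenation, in group order
    print(list(combine(groups, seed=42)))  # seeded interleaving
```

Both modes yield the same multiset of items; only the order differs, which is why the seed changes what the dataset yields.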
Id derived from the package path:
ai.lighton.embeddings_pre_training.denseon_lateon.
Note on config names: the reference recipe refers to fw_edu,
hlp_wikipedia_cm and hlp_wikipedia_dl; the actual HuggingFace
config names are fw-edu, wikipedia_hlp_cm and wikipedia_hlp_dl
(verified against the HF datasets API at dataset-card time).
-
Dataset ai.lighton.embeddings_pre_training.denseon_lateon
datamaestro_ir.data.distillation.PointwiseDistillationSamples
DenseON-LateON mGTE-style pre-training recipe built by UNIONing three groups of ``lightonai/embeddings-pre-training`` configs with group-specific filters.
Tags: distillation, pointwise
Tasks: learning to rank
External link: https://huggingface.co/datasets/lightonai/embeddings-pre-training
Variant space for the DenseON-LateON pre-training recipe.
Variants:
seed: Optional[int] (default=None; elides default)
Randomisation seed. None (default) concatenates the three groups in order (ConcatPointwise). A non-null integer switches to RandomInterleavePointwiseDistillationSamples, which picks a source uniformly at each step, and propagates the seed to each HuggingFace source’s .shuffle(seed=…) for in-source randomisation. Changes what the dataset yields, so it stays in the id; elide_default=True drops it from the id when left at None to preserve the pre-variants id …denseon_lateon.
download: bool (default=False; domain=[False, True]; excluded from id)
False (default) streams from the Hub; True downloads each source config to the local HF cache (streaming=False). Use with care: the full dataset is ~2TB. Excluded from the id (in_id=False) because it only toggles the loading mode (same data, different delivery) while still reaching config() via the resolved kwargs.
Pairwise Distillation
Pairwise distillation datasets contain triples of (query, positive document, negative document) with teacher model scores for each document.
Hofstaetter Neural Ranking KD
Teacher scores for MS MARCO passage ranking from neural-ranking-kd. Contains ~40M triples with BERT-based teacher scores in TSV format (pos_score, neg_score, query_id, pos_passage_id, neg_passage_id).
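The five-column layout described above can be read with the standard csv module. This is a minimal sketch rather than the PairwiseDistillationSamplesTSV implementation: PairwiseSample and read_teacher_tsv are hypothetical names, and the rows in the usage example are made up for illustration.

```python
import csv
import io
from typing import NamedTuple


class PairwiseSample(NamedTuple):
    """One teacher-scored triple; ids refer to MS MARCO queries and passages."""
    pos_score: float
    neg_score: float
    query_id: str
    pos_passage_id: str
    neg_passage_id: str


def read_teacher_tsv(handle):
    """Yield PairwiseSample records from a tab-separated text stream."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        pos_score, neg_score, query_id, pos_id, neg_id = row
        yield PairwiseSample(float(pos_score), float(neg_score), query_id, pos_id, neg_id)


if __name__ == "__main__":
    # Made-up rows in the documented column order.
    demo = io.StringIO("9.5\t-1.25\t100\t2001\t2002\n7.0\t6.5\t101\t2003\t2004\n")
    for sample in read_teacher_tsv(demo):
        print(sample.query_id, sample.pos_score, sample.neg_score)
```

Ids are kept as strings because the files carry no text content; resolving them to query and passage text requires the MS MARCO collection.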
com.github.hofstaetter.distillation
-
Dataset com.github.hofstaetter.distillation.msmarco.ensemble.teacher
datamaestro_ir.data.distillation.PairwiseDistillationSamplesTSV
Training files without the text content; queries and passages are referenced by their MS MARCO ids instead
Tags: distillation, pairwise
Tasks: learning to rank
External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd
The teacher files (using the data from “Train Triples Small” with ~40 million triples) with the format pos_score neg_score query_id pos_passage_id neg_passage_id (with tab separation)
-
Dataset com.github.hofstaetter.distillation.msmarco.bert.teacher
datamaestro_ir.data.distillation.PairwiseDistillationSamplesTSV
Training files without the text content; queries and passages are referenced by their MS MARCO ids instead
Tags: distillation, pairwise
Tasks: learning to rank
External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd
The teacher files (using the data from “Train Triples Small” with ~40 million triples) with the format pos_score neg_score query_id pos_passage_id neg_passage_id (with tab separation)
Listwise Distillation
Listwise distillation datasets contain ranked lists of documents for each query, produced by a teacher model.
com.github.webis-de.rank-distillm
-
Dataset com.github.webis-de.rank-distillm.msmarco.bm25.annotated
datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSVWithAnnotations
Top 500 BM25 passages for judged MS MARCO training queries
Tags: listwise, distillation
Tasks: learning to rank
External link: https://github.com/webis-de/rank-distillm
Covers all queries in the MS MARCO training query set that have at least one relevance judgement; the passages are retrieved by BM25.
-
Dataset com.github.webis-de.rank-distillm.msmarco.colbertv2.annotated
datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSVWithAnnotations
Top 500 passages retrieved by ColBERTv2
Tags: listwise, distillation
Tasks: learning to rank
External link: https://github.com/webis-de/rank-distillm
Covers all queries in the MS MARCO training query set.
WARNING: the top 500 passages for a query do not necessarily contain a relevant document.
-
Dataset com.github.webis-de.rank-distillm.rankzephyr.bm25_10000.sampled100.annotated
datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV
Top 100 BM25 passages reranked by RankZephyr for 10k sampled MSMARCO queries
Tags: listwise, distillation
Tasks: learning to rank
External link: https://github.com/webis-de/rank-distillm
All passages are then reranked using RankZephyr and can be used for distillation.
-
Dataset com.github.webis-de.rank-distillm.rankzephyr.colbert10000.sampled100.annotated
datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV
Top 100 ColBERT passages reranked by RankZephyr for 10k sampled MSMARCO queries
Tags: listwise, distillation
Tasks: learning to rank
External link: https://github.com/webis-de/rank-distillm
All passages are then reranked using RankZephyr and can be used for distillation.
-
Dataset com.github.webis-de.rank-distillm.rankzephyr.colbert10000.sampled50.annotated
datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV
Top 50 ColBERT passages reranked by RankZephyr for 10k sampled MSMARCO queries
Tags: listwise, distillation
Tasks: learning to rank
External link: https://github.com/webis-de/rank-distillm
All passages are then reranked using RankZephyr and can be used for distillation.
-
Dataset com.github.webis-de.rank-distillm.rankzephyr.colbert10000.sampled10.annotated
datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV
Top 10 ColBERT passages reranked by RankZephyr for 10k sampled MSMARCO queries
Tags: listwise, distillation
Tasks: learning to rank
External link: https://github.com/webis-de/rank-distillm
All passages are then reranked using RankZephyr and can be used for distillation.