Distillation Datasets

This section lists datasets for knowledge distillation in neural IR, where teacher model scores are used to train student rankers.

Pointwise Distillation

Pointwise distillation datasets provide teacher-scored (query, document, similarity) triples: each record pairs one query with one document and a teacher similarity score.
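As a rough illustration of how such records might be consumed, the sketch below regresses a student score onto the teacher similarity with a mean squared error. The PointwiseSample class, its field names, and the toy overlap_student scorer are hypothetical and are not part of datamaestro_ir; MSE is only one possible pointwise objective.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PointwiseSample:
    """One teacher-scored record (hypothetical field names)."""
    query: str
    document: str
    similarity: float  # score assigned by the teacher


def pointwise_mse(samples: Iterable[PointwiseSample],
                  student: Callable[[str, str], float]) -> float:
    """Mean squared error between student scores and teacher similarities."""
    errors = [(student(s.query, s.document) - s.similarity) ** 2 for s in samples]
    if not errors:
        raise ValueError("at least one sample is required")
    return sum(errors) / len(errors)


def overlap_student(query: str, document: str) -> float:
    """Toy student: number of whitespace tokens shared by query and document."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))


# Made-up records, for illustration only.
samples = [
    PointwiseSample("what is bm25", "BM25 is a ranking function.", 4.0),
    PointwiseSample("what is bm25", "Paris is in France.", 1.0),
]
loss = pointwise_mse(samples, overlap_student)
```

In a real training loop the student would be a neural ranker and the loss would be computed on tensors; the pure-Python version only shows the shape of the objective.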

LightOn embeddings-pre-training dataset family.

Registers the HuggingFace dataset https://huggingface.co/datasets/lightonai/embeddings-pre-training as a single variant family. Callers select a specific variant via a query-style selector on the dataset id:

datamaestro prepare 'ai.lighton.embeddings_pre_training[name=agnews]'

All 73 HF configs are exposed through the name axis; additional axes cover loading mode and the dataset’s recipe knobs (drop filter, duplicate filter, similarity floor, and top-percentile subset). See datamaestro.variants.AxesVariants for the query syntax.

The recipe from the mGTE technical report lives in a sibling module (denseon_lateon) because it UNIONs multiple configs with config-specific filter rules and is therefore structurally a separate dataset.

Dataset ai.lighton.embeddings_pre_training

→ datamaestro_ir.data.lighton.EmbeddingsPreTrainingSamples

LightOn embeddings-pre-training teacher-scored pre-training pairs.

Tags: pointwise, distillation, pre-training

Tasks: learning to rank

External link: https://huggingface.co/datasets/lightonai/embeddings-pre-training

Teacher-scored (query, document, similarity) triples across 73 source corpora. Variant axes: name (config), streaming, filter_drop, filter_duplicate, min_similarity, top_percentile. See EmbeddingsPreTrainingSamples for filter semantics.

Variant space for lightonai/embeddings-pre-training.

Variants:

name : str (domain: 73 values)

HuggingFace config name — selects one of the 73 source corpora.
streaming : bool (default=True; domain=[False, True]; excluded from id)

Pure loading-mode flag (same data either way); excluded from the formatted selector and dataset id. The underlying HuggingFaceDataset.streaming field is also Meta, so experimaestro’s identity hash already ignores it.
filter_drop : bool (default=True; domain=[True, False])

When True, skip rows with drop=True (the upstream dataset’s recommended pre-training subset).
filter_duplicate : bool (default=True; domain=[True, False])

When True, skip rows whose duplicate column is not null.
min_similarity : Optional[float] (default=None)

Minimum teacher similarity (inclusive). Rows below are skipped.
top_percentile : Optional[float] (default=None)

Keep only the top fraction of rows by similarity (e.g. 0.35 for the FineWeb-Edu top-35% recipe). Threshold is estimated from a reservoir sample, so it works in streaming mode.
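The filter axes above can be pictured with the following sketch. It is not the EmbeddingsPreTrainingSamples implementation: keep_row and estimate_threshold are hypothetical helpers, the row is assumed to be a dict with drop, duplicate and similarity keys, and the threshold estimate uses a plain reservoir sample (Algorithm R) whose size and seed are arbitrary choices made here.

```python
import random
from typing import Iterable, List, Optional


def keep_row(row: dict,
             filter_drop: bool = True,
             filter_duplicate: bool = True,
             min_similarity: Optional[float] = None) -> bool:
    """Apply the drop, duplicate and similarity-floor filters to one row."""
    if filter_drop and row.get("drop"):
        return False
    if filter_duplicate and row.get("duplicate") is not None:
        return False
    # The floor is inclusive: rows exactly at min_similarity are kept.
    if min_similarity is not None and row["similarity"] < min_similarity:
        return False
    return True


def estimate_threshold(similarities: Iterable[float],
                       top_fraction: float,
                       sample_size: int = 1000,
                       seed: int = 0) -> float:
    """Estimate the score above which the top `top_fraction` of rows lie.

    Only `sample_size` scores are held in memory, so the input can be a stream.
    """
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    rng = random.Random(seed)
    reservoir: List[float] = []
    for i, value in enumerate(similarities):
        if i < sample_size:
            reservoir.append(value)
        else:
            # Algorithm R: item i replaces a slot with probability sample_size / (i + 1).
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = value
    if not reservoir:
        raise ValueError("no similarities to sample from")
    reservoir.sort(reverse=True)
    k = max(1, round(top_fraction * len(reservoir)))
    return reservoir[k - 1]
```

A second pass over the data would then keep rows whose similarity is at or above the estimated threshold. Because the threshold comes from a sample, the kept share is only approximately the requested fraction.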

DenseON-LateON mGTE-style training recipe

Union three groups of configs with different per-group filter rules:

Standard sources (all configs except fw-edu, wikipedia_hlp_cm, and wikipedia_hlp_dl): keep rows with drop=False and duplicate IS NULL and similarity >= 3.0.
fw-edu (FineWeb-Edu, no rule-based filter or dedup applied upstream): keep only the top ~35% by similarity. No drop/duplicate filter — the config itself isn’t pre-filtered.
wikipedia_hlp_cm and wikipedia_hlp_dl: included as-is, no filter.
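The three group rules can be summarised as a single predicate. This is a sketch rather than the module's code: keep_in_recipe is a hypothetical name, rows are assumed to be dicts with drop, duplicate and similarity keys, and fw_edu_threshold stands for a similarity cutoff that has already been estimated to select roughly the top 35% of fw-edu.

```python
PASS_THROUGH_CONFIGS = frozenset({"wikipedia_hlp_cm", "wikipedia_hlp_dl"})
TOP_PERCENTILE_CONFIG = "fw-edu"
STANDARD_MIN_SIMILARITY = 3.0


def keep_in_recipe(config: str, row: dict, fw_edu_threshold: float) -> bool:
    """Decide whether a row of `config` belongs to the unioned recipe."""
    if config in PASS_THROUGH_CONFIGS:
        # Included as-is, no filter.
        return True
    if config == TOP_PERCENTILE_CONFIG:
        # Only the similarity cutoff applies; drop/duplicate are ignored.
        return row["similarity"] >= fw_edu_threshold
    # Standard sources: rule-based filter, dedup, and similarity floor.
    return (
        not row["drop"]
        and row["duplicate"] is None
        and row["similarity"] >= STANDARD_MIN_SIMILARITY
    )
```

For example, with a made-up cutoff of 4.2, an agnews row with drop=False, duplicate=None and similarity 3.0 is kept, while an fw-edu row with similarity 4.0 is not.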

Variant axes:

seed: null (default) concatenates the three groups in order (ConcatPointwise). A non-null integer switches to RandomInterleavePointwiseDistillationSamples — uniformly picking a source at each step — and propagates the seed to each HuggingFace source’s .shuffle(seed=…) for in-source randomisation.
download: false (default) streams from the Hub; true downloads each source config to the local HF cache (streaming=False). Use with care — the full dataset is ~2TB.
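The difference between the two seed modes can be sketched with plain iterables. The functions below are stand-ins, not ConcatPointwise or RandomInterleavePointwiseDistillationSamples themselves; in particular, removing a source from the draw once it is exhausted is an assumption made for this sketch, and the per-source shuffle is left out.

```python
import itertools
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def random_interleave(sources: List[Iterable[T]], seed: int) -> Iterator[T]:
    """Yield items by picking one of the remaining sources uniformly at each step."""
    rng = random.Random(seed)
    active = [iter(source) for source in sources]
    while active:
        index = rng.randrange(len(active))
        try:
            yield next(active[index])
        except StopIteration:
            # Assumption: an exhausted source simply leaves the draw.
            del active[index]


def combine(sources: List[Iterable[T]], seed: Optional[int] = None) -> Iterator[T]:
    """seed=None concatenates in order; an integer seed interleaves randomly."""
    if seed is None:
        return itertools.chain.from_iterable(sources)
    return random_interleave(sources, seed)
```

With seed=None the output order is fully determined by the group order; with an integer seed the same seed reproduces the same interleaving, which is why the seed affects what the dataset yields.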

Id derived from the package path: ai.lighton.embeddings_pre_training.denseon_lateon.

Note on config names: the reference recipe refers to fw_edu, hlp_wikipedia_cm and hlp_wikipedia_dl; the actual HuggingFace config names are fw-edu, wikipedia_hlp_cm and wikipedia_hlp_dl (verified against the HF datasets API at dataset-card time).

Dataset ai.lighton.embeddings_pre_training.denseon_lateon

→ datamaestro_ir.data.distillation.PointwiseDistillationSamples

DenseON-LateON mGTE-style pre-training recipe built by UNIONing three groups of lightonai/embeddings-pre-training configs with group-specific filters.

Tags: distillation, pointwise

Tasks: learning to rank

External link: https://huggingface.co/datasets/lightonai/embeddings-pre-training

Variant space for the DenseON-LateON pre-training recipe.

Variants:

seed : Optional[int] (default=None; elides default)

Randomisation seed. None (default) concatenates the three groups in order (ConcatPointwise). A non-null integer switches to RandomInterleavePointwiseDistillationSamples — uniformly picking a source at each step — and propagates the seed to each HuggingFace source’s .shuffle(seed=…) for in-source randomisation. Changes what the dataset yields, so it stays in the id; elide_default=True drops it from the id when left at None to preserve the pre-variants id …denseon_lateon.
download : bool (default=False; domain=[False, True]; excluded from id)

False (default) streams from the Hub; True downloads each source config to the local HF cache (streaming=False). Use with care — the full dataset is ~2TB. Excluded from the id (in_id=False) because it only toggles the loading mode — same data, different delivery — while still reaching config() via the resolved kwargs.

Pairwise Distillation

Pairwise distillation datasets contain triples of (query, positive document, negative document) with teacher model scores for each document.

Hofstaetter Neural Ranking KD

Teacher scores for MS MARCO passage ranking from neural-ranking-kd. Contains ~40M triples with BERT-based teacher scores in TSV format (pos_score, neg_score, query_id, pos_passage_id, neg_passage_id).
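To make the line format concrete, the sketch below parses one tab-separated record and computes a squared error on the score margin, in the style of the Margin-MSE objective commonly used with these teacher scores. PairwiseSample, parse_line and margin_mse are hypothetical names, not the PairwiseDistillationSamplesTSV API, and the ids and scores in the usage note are made up.

```python
from typing import NamedTuple


class PairwiseSample(NamedTuple):
    pos_score: float
    neg_score: float
    query_id: str
    pos_passage_id: str
    neg_passage_id: str


def parse_line(line: str) -> PairwiseSample:
    """Parse 'pos_score<TAB>neg_score<TAB>query_id<TAB>pos_passage_id<TAB>neg_passage_id'."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    pos_score, neg_score, query_id, pos_id, neg_id = fields
    return PairwiseSample(float(pos_score), float(neg_score), query_id, pos_id, neg_id)


def margin_mse(sample: PairwiseSample, student_pos: float, student_neg: float) -> float:
    """Squared difference between the student margin and the teacher margin."""
    teacher_margin = sample.pos_score - sample.neg_score
    student_margin = student_pos - student_neg
    return (student_margin - teacher_margin) ** 2
```

For instance, parse_line("12.5\t-3.0\t400\t1000\t2000\n") gives a teacher margin of 15.5; a student that scores the positive at 10.0 and the negative at 0.0 then incurs a loss of 30.25. The files carry ids only, so query and passage text has to be looked up in MS MARCO separately.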

com.github.hofstaetter.distillation

Dataset com.github.hofstaetter.distillation.msmarco.ensemble.teacher

→ datamaestro_ir.data.distillation.PairwiseDistillationSamplesTSV

Training files without text content; queries and passages are referenced by their MS MARCO ids.

Tags: distillation, pairwise

Tasks: learning to rank

External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd

Teacher score files built from the “Train Triples Small” data (~40 million triples). Each line is tab-separated, in the format pos_score neg_score query_id pos_passage_id neg_passage_id.

Dataset com.github.hofstaetter.distillation.msmarco.bert.teacher

→ datamaestro_ir.data.distillation.PairwiseDistillationSamplesTSV

Training files without text content; queries and passages are referenced by their MS MARCO ids.

Tags: distillation, pairwise

Tasks: learning to rank

External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd

Teacher score files built from the “Train Triples Small” data (~40 million triples). Each line is tab-separated, in the format pos_score neg_score query_id pos_passage_id neg_passage_id.

Listwise Distillation

Listwise distillation datasets contain ranked lists of documents for each query, produced by a teacher model.
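One way to train on a teacher ranking is a pairwise RankNet-style loss over every ordered pair in the list. The sketch below assumes the teacher supplies only an ordering, best first, and that the student assigns a score to each document id. The ranknet_loss function is hypothetical, and this is one common choice of objective, not necessarily the one used with these datasets.

```python
import math
from typing import Dict, List


def ranknet_loss(teacher_ranking: List[str], student_scores: Dict[str, float]) -> float:
    """Average logistic loss over all pairs (i, j) where the teacher ranks i above j."""
    total = 0.0
    pairs = 0
    for i, better in enumerate(teacher_ranking):
        for worse in teacher_ranking[i + 1:]:
            diff = student_scores[better] - student_scores[worse]
            # log(1 + exp(-diff)): small when the student agrees with the teacher.
            total += math.log1p(math.exp(-diff))
            pairs += 1
    if pairs == 0:
        raise ValueError("a ranking needs at least two documents")
    return total / pairs
```

A student whose scores follow the teacher order gets a lower loss than one that reverses it, and a student that scores every document equally gets exactly log 2.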

com.github.webis-de.rank-distillm

Dataset com.github.webis-de.rank-distillm.msmarco.bm25.annotated

→ datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSVWithAnnotations

Top 500 BM25 passages for judged MS MARCO training queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

Covers every query in the MS MARCO training query set that has at least one relevance judgement; passages are retrieved by BM25.

Dataset com.github.webis-de.rank-distillm.msmarco.colbertv2.annotated

→ datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSVWithAnnotations

Top 500 passages retrieved by ColBERTv2

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

Covers all queries in the MS MARCO training query set.

WARNING: the 500 passages retrieved for a query do not necessarily include a relevant one.

Dataset com.github.webis-de.rank-distillm.rankzephyr.bm25_10000.sampled100.annotated

→ datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV

Top 100 BM25 passages reranked by RankZephyr for 10k sampled MSMARCO queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

The retrieved passages are reranked with RankZephyr, and the resulting rankings can be used for distillation.

Dataset com.github.webis-de.rank-distillm.rankzephyr.colbert10000.sampled100.annotated

→ datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV

Top 100 ColBERT passages reranked by RankZephyr for 10k sampled MSMARCO queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

The retrieved passages are reranked with RankZephyr, and the resulting rankings can be used for distillation.

Dataset com.github.webis-de.rank-distillm.rankzephyr.colbert10000.sampled50.annotated

→ datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV

Top 50 ColBERT passages reranked by RankZephyr for 10k sampled MSMARCO queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

The retrieved passages are reranked with RankZephyr, and the resulting rankings can be used for distillation.

Dataset com.github.webis-de.rank-distillm.rankzephyr.colbert10000.sampled10.annotated

→ datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV

Top 10 ColBERT passages reranked by RankZephyr for 10k sampled MSMARCO queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

The retrieved passages are reranked with RankZephyr, and the resulting rankings can be used for distillation.