Distillation Datasets

This section lists datasets for knowledge distillation in neural IR, where teacher model scores are used to train student rankers.

Pointwise Distillation

Pointwise distillation datasets provide teacher-scored (query, document, similarity) triples: each sample pairs one query with one document and the teacher's similarity score for that pair.

LightOn embeddings-pre-training dataset family.

Registers the HuggingFace dataset https://huggingface.co/datasets/lightonai/embeddings-pre-training as a single variant family. Callers select a specific variant via a query-style selector on the dataset id:

datamaestro prepare 'ai.lighton.embeddings_pre_training[name=agnews]'

All 73 HF configs are exposed through the name axis; additional axes cover loading mode and the dataset’s recipe knobs (drop filter, duplicate filter, similarity floor, and top-percentile subset). See datamaestro.variants.AxesVariants for the query syntax.

The recipe from the mGTE technical report lives in a sibling module (denseon_lateon) because it UNIONs multiple configs with config-specific filter rules and is therefore structurally a separate dataset.

Dataset ai.lighton.embeddings_pre_training

datamaestro_ir.data.lighton.EmbeddingsPreTrainingSamples

LightOn embeddings-pre-training teacher-scored pretraining pairs.

Tags: pointwise, distillation, pre-training

Tasks: learning to rank

External link: https://huggingface.co/datasets/lightonai/embeddings-pre-training

Teacher-scored (query, document, similarity) triples across 73 source corpora. Variant axes: name (config), streaming, filter_drop, filter_duplicate, min_similarity, top_percentile. See EmbeddingsPreTrainingSamples for filter semantics.

Variant space for lightonai/embeddings-pre-training.

Variants:

  • name : str (domain: 73 values)

    HuggingFace config name — selects one of the 73 source corpora.

  • streaming : bool (default=True; domain=[False, True]; excluded from id)

    Pure loading-mode flag (same data either way); excluded from the formatted selector and dataset id. The underlying HuggingFaceDataset.streaming field is also Meta, so experimaestro’s identity hash already ignores it.

  • filter_drop : bool (default=True; domain=[True, False])

    When True, skip rows with drop=True (the upstream dataset’s recommended pre-training subset).

  • filter_duplicate : bool (default=True; domain=[True, False])

    When True, skip rows whose duplicate column is not null.

  • min_similarity : Optional[float] (default=None)

    Minimum teacher similarity (inclusive). Rows whose similarity is below this value are skipped.

  • top_percentile : Optional[float] (default=None)

    Keep only the top fraction of rows by similarity (e.g. 0.35 for the FineWeb-Edu top-35% recipe). Threshold is estimated from a reservoir sample, so it works in streaming mode.
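The filter axes above can be illustrated with a small, self-contained sketch. It is not the code of EmbeddingsPreTrainingSamples: the helper names keep_row and estimate_threshold, the sample_size default, and the use of classic reservoir sampling (Algorithm R) are assumptions made for illustration. Only the column names (drop, duplicate, similarity) and the axis names come from the description above.

```python
import random
from typing import Iterable, List, Optional


def keep_row(
    row: dict,
    filter_drop: bool = True,
    filter_duplicate: bool = True,
    min_similarity: Optional[float] = None,
) -> bool:
    """Return True when a row passes the row-level filters (hypothetical helper)."""
    if filter_drop and row.get("drop"):
        return False  # skip rows flagged drop=True
    if filter_duplicate and row.get("duplicate") is not None:
        return False  # skip rows whose duplicate column is not null
    if min_similarity is not None and row["similarity"] < min_similarity:
        return False  # the floor is inclusive: equal values are kept
    return True


def estimate_threshold(
    similarities: Iterable[float],
    top_percentile: float,
    sample_size: int = 10_000,  # assumed value, not taken from the library
    seed: int = 0,
) -> float:
    """Estimate the similarity cut-off of the top fraction from a reservoir sample."""
    rng = random.Random(seed)
    reservoir: List[float] = []
    for i, value in enumerate(similarities):
        if len(reservoir) < sample_size:
            reservoir.append(value)
        else:
            # Algorithm R: item i replaces a random slot with probability sample_size / (i + 1)
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = value
    if not reservoir:
        raise ValueError("cannot estimate a threshold from an empty stream")
    reservoir.sort(reverse=True)
    k = max(1, round(top_percentile * len(reservoir)))
    return reservoir[k - 1]
```

Because the reservoir has a fixed size, the threshold can be estimated in one pass over a stream without holding every score in memory. Rows would then be kept when their similarity is at least the estimated threshold.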

DenseON-LateON mGTE-style training recipe

Union three groups of configs with different per-group filter rules:

  1. Standard sources (all configs except fw-edu, wikipedia_hlp_cm, and wikipedia_hlp_dl): keep rows with drop=False and duplicate IS NULL and similarity >= 3.0.

  2. fw-edu (FineWeb-Edu, no rule-based filter or dedup applied upstream): keep only the top ~35% by similarity. No drop/duplicate filter — the config itself isn’t pre-filtered.

  3. wikipedia_hlp_cm and wikipedia_hlp_dl: included as-is, no filter.
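The three groups above amount to a mapping from config name to filter settings. The sketch below expresses that mapping with the variant axis names of the base dataset; the function recipe_filters and the two module-level constants are hypothetical names introduced here, not part of datamaestro_ir.

```python
from typing import Dict, Optional, Union

FilterSettings = Dict[str, Union[bool, Optional[float]]]

# Configs included as-is (group 3)
UNFILTERED_CONFIGS = {"wikipedia_hlp_cm", "wikipedia_hlp_dl"}
# Configs reduced to a top fraction by similarity (group 2)
TOP_PERCENTILE_CONFIGS = {"fw-edu": 0.35}


def recipe_filters(config_name: str) -> FilterSettings:
    """Return the filter settings the recipe applies to one config (illustrative)."""
    if config_name in UNFILTERED_CONFIGS:
        return {
            "filter_drop": False,
            "filter_duplicate": False,
            "min_similarity": None,
            "top_percentile": None,
        }
    if config_name in TOP_PERCENTILE_CONFIGS:
        return {
            "filter_drop": False,
            "filter_duplicate": False,
            "min_similarity": None,
            "top_percentile": TOP_PERCENTILE_CONFIGS[config_name],
        }
    # Group 1: every other config uses the standard rule
    return {
        "filter_drop": True,
        "filter_duplicate": True,
        "min_similarity": 3.0,
        "top_percentile": None,
    }
```

For example, recipe_filters("agnews") yields the standard rule, while recipe_filters("fw-edu") disables the drop and duplicate filters and keeps only the top 35% by similarity.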

Variant axes:

  • seed: null (default) concatenates the three groups in order (ConcatPointwise). A non-null integer switches to RandomInterleavePointwiseDistillationSamples — uniformly picking a source at each step — and propagates the seed to each HuggingFace source’s .shuffle(seed=…) for in-source randomisation.

  • download: false (default) streams from the Hub; true downloads each source config to the local HF cache (streaming=False). Use with care — the full dataset is ~2TB.
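The difference between the two seed modes can be sketched with plain Python iterables. This is an illustration of the idea only, not the implementation of ConcatPointwise or RandomInterleavePointwiseDistillationSamples: the function names are hypothetical, and dropping a source once it is exhausted is an assumption about behaviour the text above does not specify.

```python
import itertools
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def concat(sources: List[Iterable[T]]) -> Iterator[T]:
    """Yield every item of the first source, then the second, and so on."""
    return itertools.chain.from_iterable(sources)


def random_interleave(sources: List[Iterable[T]], seed: int) -> Iterator[T]:
    """Pick a source uniformly at random at each step and yield its next item."""
    rng = random.Random(seed)
    active = [iter(source) for source in sources]
    while active:
        index = rng.randrange(len(active))
        try:
            yield next(active[index])
        except StopIteration:
            # Assumption: an exhausted source is removed and sampling continues
            active.pop(index)


def combine(sources: List[Iterable[T]], seed: Optional[int] = None) -> Iterator[T]:
    """seed=None concatenates in order; an integer seed interleaves at random."""
    if seed is None:
        return concat(sources)
    return random_interleave(sources, seed)
```

With seed=None the output order is fully determined by the order of the groups. With an integer seed the mix between sources is random but reproducible, and items from the same source keep their relative order unless that source is itself shuffled.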

Id derived from the package path: ai.lighton.embeddings_pre_training.denseon_lateon.

Note on config names: the reference recipe refers to fw_edu, hlp_wikipedia_cm and hlp_wikipedia_dl; the actual HuggingFace config names are fw-edu, wikipedia_hlp_cm and wikipedia_hlp_dl (verified against the HF datasets API at dataset-card time).

Dataset ai.lighton.embeddings_pre_training.denseon_lateon

datamaestro_ir.data.distillation.PointwiseDistillationSamples

DenseON-LateON mGTE-style pre-training recipe built by UNIONing three groups of lightonai/embeddings-pre-training configs with group-specific filters.

Tags: distillation, pointwise

Tasks: learning to rank

External link: https://huggingface.co/datasets/lightonai/embeddings-pre-training

Variant space for the DenseON-LateON pre-training recipe.

Variants:

  • seed : Optional[int] (default=None; elides default)

    Randomisation seed. None (default) concatenates the three groups in order (ConcatPointwise). A non-null integer switches to RandomInterleavePointwiseDistillationSamples — uniformly picking a source at each step — and propagates the seed to each HuggingFace source’s .shuffle(seed=…) for in-source randomisation. Changes what the dataset yields, so it stays in the id; elide_default=True drops it from the id when left at None to preserve the pre-variants id …denseon_lateon.

  • download : bool (default=False; domain=[False, True]; excluded from id)

    False (default) streams from the Hub; True downloads each source config to the local HF cache (streaming=False). Use with care — the full dataset is ~2TB. Excluded from the id (in_id=False) because it only toggles the loading mode — same data, different delivery — while still reaching config() via the resolved kwargs.

Pairwise Distillation

Pairwise distillation datasets contain triples of (query, positive document, negative document) with teacher model scores for each document.

Hofstaetter Neural Ranking KD

Teacher scores for MS MARCO passage ranking from neural-ranking-kd. Contains ~40M triples with BERT-based teacher scores in TSV format (pos_score, neg_score, query_id, pos_passage_id, neg_passage_id).
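As a rough illustration of the five-column layout described above, the following sketch parses such lines with the standard library only. It does not use PairwiseDistillationSamplesTSV; the PairwiseSample type, the read_teacher_scores function, and the margin property are hypothetical names introduced for this example.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class PairwiseSample(NamedTuple):
    """One line of a teacher score file (illustrative type)."""

    pos_score: float
    neg_score: float
    query_id: str
    pos_passage_id: str
    neg_passage_id: str

    @property
    def margin(self) -> float:
        """Teacher score difference between the positive and negative passage."""
        return self.pos_score - self.neg_score


def read_teacher_scores(handle: TextIO) -> Iterator[PairwiseSample]:
    """Parse tab-separated lines: pos_score, neg_score, query_id, pos_id, neg_id."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        pos_score, neg_score, query_id, pos_id, neg_id = row
        yield PairwiseSample(float(pos_score), float(neg_score), query_id, pos_id, neg_id)


# Usage with an in-memory file holding made-up values
example = io.StringIO("12.5\t-3.0\t42\t100\t200\n")
samples = list(read_teacher_scores(example))
```

The ids are kept as strings so they can be resolved against the MS MARCO collection separately, since the files carry no passage or query text.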

com.github.hofstaetter.distillation

Dataset com.github.hofstaetter.distillation.msmarco.ensemble.teacher

datamaestro_ir.data.distillation.PairwiseDistillationSamplesTSV

Training files that omit the text content and reference MS MARCO ids instead.

Tags: distillation, pairwise

Tasks: learning to rank

External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd

Teacher score files built from the “Train Triples Small” set (~40 million triples). Each line holds five tab-separated fields: pos_score, neg_score, query_id, pos_passage_id, neg_passage_id.

Dataset com.github.hofstaetter.distillation.msmarco.bert.teacher

datamaestro_ir.data.distillation.PairwiseDistillationSamplesTSV

Training files that omit the text content and reference MS MARCO ids instead.

Tags: distillation, pairwise

Tasks: learning to rank

External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd

Teacher score files built from the “Train Triples Small” set (~40 million triples). Each line holds five tab-separated fields: pos_score, neg_score, query_id, pos_passage_id, neg_passage_id.

Listwise Distillation

Listwise distillation datasets contain ranked lists of documents for each query, produced by a teacher model.

com.github.webis-de.rank-distillm

Dataset com.github.webis-de.rank-distillm.msmarco.bm25.annotated

datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSVWithAnnotations

Top 500 BM25 passages for judged MS MARCO training queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

Covers every query in the MS MARCO training query set that has at least one relevance judgement; the passages are retrieved by BM25.

Dataset com.github.webis-de.rank-distillm.msmarco.colbertv2.annotated

datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSVWithAnnotations

Top 500 passages retrieved by ColBERTv2

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

Covers all queries in the MS MARCO training query set.

WARNING: the 500 passages retrieved for a query do not necessarily include a relevant document.

Dataset com.github.webis-de.rank-distillm.rankzephyr.bm25_10000.sampled100.annotated

datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV

Top 100 BM25 passages reranked by RankZephyr for 10k sampled MSMARCO queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

The retrieved passages are reranked with RankZephyr, and the resulting rankings can be used for distillation.

Dataset com.github.webis-de.rank-distillm.rankzephyr.colbert10000.sampled100.annotated

datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV

Top 100 ColBERT passages reranked by RankZephyr for 10k sampled MSMARCO queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

The retrieved passages are reranked with RankZephyr, and the resulting rankings can be used for distillation.

Dataset com.github.webis-de.rank-distillm.rankzephyr.colbert10000.sampled50.annotated

datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV

Top 50 ColBERT passages reranked by RankZephyr for 10k sampled MSMARCO queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

The retrieved passages are reranked with RankZephyr, and the resulting rankings can be used for distillation.

Dataset com.github.webis-de.rank-distillm.rankzephyr.colbert10000.sampled10.annotated

datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSV

Top 10 ColBERT passages reranked by RankZephyr for 10k sampled MSMARCO queries

Tags: listwise, distillation

Tasks: learning to rank

External link: https://github.com/webis-de/rank-distillm

The retrieved passages are reranked with RankZephyr, and the resulting rankings can be used for distillation.