Distillation Datasets
=====================

This section lists datasets for knowledge distillation in neural information
retrieval (IR), where the outputs of a teacher model (scores or rankings) are
used to train student rankers.


Pointwise Distillation
----------------------

Pointwise distillation datasets provide teacher-scored ``(query, document,
similarity)`` triples: each sample pairs a query with a single document and
the similarity score that a teacher model assigned to that pair.
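
As an illustration of how such triples can be consumed, the sketch below
regresses student scores onto the teacher similarity with a mean squared
error. The ``PointwiseSample`` class and the ``pointwise_mse`` function are
hypothetical names introduced for this example; they are not part of the
datasets' API, and other objectives are possible.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Sequence


    @dataclass(frozen=True)
    class PointwiseSample:
        """One (query, document, similarity) triple (hypothetical container)."""

        query: str
        document: str
        similarity: float  # score assigned by the teacher model


    def pointwise_mse(
        student_scores: Sequence[float], samples: Sequence[PointwiseSample]
    ) -> float:
        """Mean squared error between student scores and teacher similarities."""
        if len(student_scores) != len(samples):
            raise ValueError("one student score is required per sample")
        if not samples:
            return 0.0
        squared_errors = (
            (score - sample.similarity) ** 2
            for score, sample in zip(student_scores, samples)
        )
        return sum(squared_errors) / len(samples)


    # Made-up samples, for illustration only
    samples = [
        PointwiseSample("what is ir", "IR stands for information retrieval.", 0.5),
        PointwiseSample("what is ir", "Infrared light is invisible.", 1.0),
    ]
    loss = pointwise_mse([0.5, 0.0], samples)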

.. dm:datasets:: ai.lighton.embeddings_pre_training ir

.. dm:datasets:: ai.lighton.embeddings_pre_training.denseon_lateon ir


Pairwise Distillation
---------------------

Pairwise distillation datasets contain ``(query, positive document,
negative document)`` triples, with a teacher model score for each of the two
documents.


Hofstaetter Neural Ranking KD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Teacher scores for MS MARCO passage ranking, taken from
`neural-ranking-kd <https://github.com/sebastian-hofstaetter/neural-ranking-kd>`_.
The dataset contains about 40 million triples with BERT-based teacher scores,
stored in TSV format with the columns ``pos_score``, ``neg_score``,
``query_id``, ``pos_passage_id``, and ``neg_passage_id``.
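
The TSV layout described above can be read with a few lines of plain Python.
The following is a minimal sketch, assuming one tab-separated triple per line
with no header row; ``ScoredTriple``, ``read_scored_triples``, and
``margin_mse`` are hypothetical helpers written for this example, and the
sample lines are made up. The margin-based loss shown here is one way to use
the two teacher scores (the student is trained to reproduce the teacher's
score difference), not necessarily the only one.

.. code-block:: python

    from typing import Iterable, Iterator, NamedTuple, Sequence, Tuple


    class ScoredTriple(NamedTuple):
        """One row of the TSV file, in column order."""

        pos_score: float
        neg_score: float
        query_id: str  # IDs are kept as strings (an assumption of this sketch)
        pos_passage_id: str
        neg_passage_id: str


    def read_scored_triples(lines: Iterable[str]) -> Iterator[ScoredTriple]:
        """Parse tab-separated lines into ScoredTriple records."""
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            pos_score, neg_score, query_id, pos_id, neg_id = line.split("\t")
            yield ScoredTriple(
                float(pos_score), float(neg_score), query_id, pos_id, neg_id
            )


    def margin_mse(
        student_scores: Sequence[Tuple[float, float]],
        triples: Sequence[ScoredTriple],
    ) -> float:
        """Mean squared error between student and teacher score margins.

        Each entry of ``student_scores`` is a (positive, negative) score pair.
        """
        if len(student_scores) != len(triples):
            raise ValueError("one student score pair is required per triple")
        total = 0.0
        for (s_pos, s_neg), triple in zip(student_scores, triples):
            teacher_margin = triple.pos_score - triple.neg_score
            total += ((s_pos - s_neg) - teacher_margin) ** 2
        return total / len(triples)


    # Made-up rows, for illustration only
    rows = ["12.5\t3.0\t1\t100\t200\n", "8.0\t9.5\t2\t300\t400\n"]
    triples = list(read_scored_triples(rows))
    loss = margin_mse([(2.0, 1.0), (0.0, 1.5)], triples)

In practice ``read_scored_triples`` would be given an open file handle, which
lets the 40 million rows be streamed rather than loaded into memory at once.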

.. dm:datasets:: com.github.hofstaetter.distillation ir


Listwise Distillation
---------------------

Listwise distillation datasets contain ranked lists of documents for each query,
produced by a teacher model.
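
Because a ranked list carries an ordering rather than scores, one possible
way to train on it is a RankNet-style pairwise loss over all document pairs
in the teacher's order. The sketch below assumes the ranking is a sequence of
document IDs, best first, and that student scores are available per document;
``ranknet_loss`` is a hypothetical helper, and this is not necessarily the
objective used by the authors of any listed dataset.

.. code-block:: python

    import math
    from typing import Mapping, Sequence


    def ranknet_loss(
        teacher_ranking: Sequence[str], student_scores: Mapping[str, float]
    ) -> float:
        """Average RankNet loss over all pairs ordered by the teacher.

        For each pair where the teacher ranks ``better`` above ``worse``, the
        loss is log(1 + exp(-(s_better - s_worse))).
        """
        total = 0.0
        pairs = 0
        for i, better in enumerate(teacher_ranking):
            for worse in teacher_ranking[i + 1:]:
                diff = student_scores[better] - student_scores[worse]
                total += math.log1p(math.exp(-diff))
                pairs += 1
        return total / pairs if pairs else 0.0


    # Made-up ranking and scores, for illustration only
    ranking = ["d1", "d2", "d3"]
    agreeing = ranknet_loss(ranking, {"d1": 3.0, "d2": 2.0, "d3": 1.0})
    reversed_order = ranknet_loss(ranking, {"d1": 1.0, "d2": 2.0, "d3": 3.0})

A student whose scores agree with the teacher's order (``agreeing``) gets a
lower loss than one whose scores reverse it (``reversed_order``).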

.. dm:datasets:: com.github.webis-de.rank-distillm ir