
Cross-Encoder Reranking

Notes on cross-encoder reranking after embedding retrieval.

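A bi-encoder maps queries and documents into a shared vector space independently, so document vectors can be computed once and indexed ahead of time. A cross-encoder instead takes a (query, document) pair and scores it jointly with full self-attention across both texts. The first is cheap and parallel-friendly; the second needs one forward pass per pair at query time, but is typically more accurate at ordering results.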

The common pattern is to stage them: bi-encoder for recall, cross-encoder for precision.

Two-stage retrieval

  1. Embed the query with a bi-encoder and pull the top-$k$ candidates (often $k = 50$ to $200$) from a vector index.
  2. Score each (query, candidate) pair with a cross-encoder.
  3. Return the top-$n$ after reranking (typically $n = 5$ to $20$).

Stage one keeps latency low; stage two recovers ordering quality that single-vector retrieval often misses.
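The three steps can be sketched end to end with toy stand-ins. Everything below is hypothetical scaffolding: `embed` is a bag-of-words counter rather than a real bi-encoder, `pair_score` is a token-overlap placeholder for the cross-encoder, and a brute-force scan replaces the vector index. Only the control flow (wide recall, then a narrow rerank) is meant to mirror the pattern.

```python
import math
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    # Toy stand-in for a bi-encoder: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pair_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query tokens found in the document.
    q_tokens = query.lower().split()
    d_tokens = set(doc.lower().split())
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens) if q_tokens else 0.0

def two_stage(
    query: str,
    docs: list[str],
    k: int = 50,
    n: int = 5,
    scorer: Callable[[str, str], float] = pair_score,
) -> list[str]:
    # Stage 1 (recall): score documents independently of each other, keep the top-k.
    q_vec = embed(query)
    candidates = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:k]
    # Stage 2 (precision): score each (query, candidate) pair jointly, keep the top-n.
    reranked = sorted(candidates, key=lambda d: scorer(query, d), reverse=True)
    return reranked[:n]
```

In a real system the document embeddings would be precomputed and stored in the index; recomputing them per query here just keeps the sketch short.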

Scoring

A cross-encoder produces a single scalar per pair. The standard formulation:

$$s(q, d) = \mathrm{CE}\bigl(\mathrm{concat}(q, d)\bigr) \in \mathbb{R}$$

Train it as pointwise regression on labelled relevance, or pairwise on ordered triples $(q, d^+, d^-)$ with a margin loss, where $d^+$ is the more relevant document and $m > 0$ is the margin:

$$\mathcal{L} = \max\bigl(0,\; m - s(q, d^+) + s(q, d^-)\bigr)$$
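A small worked instance of the margin loss may help make the sign convention concrete. This is a sketch on plain floats, not a training loop; the default margin of 1.0 is an assumption chosen for illustration.

```python
def margin_loss(s_pos: float, s_neg: float, margin: float = 1.0) -> float:
    # Hinge on the score gap: the loss is zero once the positive document
    # outscores the negative one by at least `margin`.
    return max(0.0, margin - s_pos + s_neg)

print(margin_loss(2.5, 0.5))   # gap of 2.0 exceeds the margin: prints 0.0
print(margin_loss(0.75, 0.5))  # gap of 0.25 falls short: prints 0.75
print(margin_loss(0.25, 1.0))  # wrongly ordered pair: prints 1.75
```

The loss only pushes on pairs that are misordered or separated by less than the margin; well-separated pairs contribute nothing.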

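For ranking quality, evaluate with nDCG@$k$, where $rel_i$ is the graded relevance of the result at rank $i$ and $Z_k$ is the DCG@$k$ of the ideal ordering: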

$$\mathrm{nDCG}@k = \frac{1}{Z_k} \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
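The formula translates directly into a few lines of Python. This sketch assumes the input list holds the graded relevance of every judged document in ranked order, so the normaliser can be computed by sorting that same list; the function names are illustrative.

```python
import math

def dcg_at_k(rels: list[float], k: int) -> float:
    # rels[0] is the relevance of the top-ranked result, so ranks start at i = 1.
    return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels: list[float], k: int) -> float:
    # The normaliser is the DCG of the best possible ordering of the same labels.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0], k=4))  # already ideally ordered: prints 1.0
print(ndcg_at_k([0, 1], k=2))        # the only relevant result sits at rank 2
```

Comparing this number on the stage-one ordering and on the reranked ordering is one way to measure what the cross-encoder adds.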

Latency budget

Cross-encoders are not free. A rough budget for an interactive query ($\leq 250\text{ ms}$ end-to-end):

stage             top-k   per-pair cost   total
bi-encoder ANN    100     0.1 ms          10 ms
cross-encoder     100     1.5 ms          150 ms
business logic    -       -               20 ms
headroom          -       -               70 ms
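The budget arithmetic can be checked, and inverted to find the largest candidate count that still fits. This is a sketch under the table's own simplification that both stages cost a fixed amount per candidate; real ANN latency rarely scales linearly with the candidate count. Costs are in integer microseconds to avoid floating-point rounding, and the function and parameter names are illustrative.

```python
def total_us(k: int, ann_us: int = 100, ce_us: int = 1_500, business_us: int = 20_000) -> int:
    # Per-candidate retrieval and rerank cost, plus the fixed business-logic cost.
    return k * (ann_us + ce_us) + business_us

def max_k(
    budget_us: int = 250_000,
    headroom_us: int = 70_000,
    ann_us: int = 100,
    ce_us: int = 1_500,
    business_us: int = 20_000,
) -> int:
    # Largest candidate count whose cost fits in the budget after fixed costs and headroom.
    usable = budget_us - headroom_us - business_us
    return max(0, usable // (ann_us + ce_us))

print(total_us(100))     # 180000 us, leaving 70 ms of headroom in a 250 ms budget
print(max_k())           # 100
print(max_k(ce_us=500))  # 266 with a hypothetical reranker three times cheaper
```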

Tuning levers when the budget gets tight:

  • Reduce kk before the reranker (precision vs. recall tradeoff).
  • Distil a smaller cross-encoder from a larger teacher.
  • Batch the rerank step. On a GPU, throughput often improves substantially at batch sizes of 16–32, though the best value depends on sequence length and hardware.
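For a scorer that does not batch internally, the batching lever could look like the sketch below. `score_batch` is a hypothetical callable that scores a list of pairs in one forward pass; the helper names are illustrative and not part of any library.

```python
from typing import Callable, Iterator, Sequence

Pair = tuple[str, str]

def batched(items: Sequence[Pair], size: int) -> Iterator[Sequence[Pair]]:
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]

def score_in_batches(
    pairs: Sequence[Pair],
    score_batch: Callable[[Sequence[Pair]], list[float]],
    batch_size: int = 32,
) -> list[float]:
    # One call per batch instead of one call per pair; input order is preserved.
    scores: list[float] = []
    for chunk in batched(pairs, batch_size):
        scores.extend(score_batch(chunk))
    return scores
```

With 100 candidates and a batch size of 32, this makes four calls (32, 32, 32, and 4 pairs) rather than 100.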

Calling it

A minimal reranker in Python:

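from sentence_transformers import CrossEncoder

# Load the model once at import time rather than on every call.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[tuple[str, float]]:
    """Return the top_n candidates paired with their scores, best first."""
    if not candidates:
        return []
    # The cross-encoder scores each (query, candidate) pair jointly.
    pairs = [(query, c) for c in candidates]
    scores = model.predict(pairs, batch_size=32)
    # Sort by score, highest first; float() converts NumPy scalars to plain floats.
    ordered = sorted(zip(candidates, map(float, scores)), key=lambda pair: pair[1], reverse=True)
    return ordered[:top_n]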

The bi-encoder gives recall; the cross-encoder gives ordering. Most of the perceived "quality" of a retrieval system lives in the second stage.

The bi-encoder asks which documents might match. The cross-encoder asks which one is the answer.

  • Bi-encoders / dual encoders
  • ANN indexes (HNSW, IVF-PQ)
  • Listwise ranking losses