Research | Chapter 6 | Prerequisites: 4-Training | 56 min read

5. Evaluation

5.1 Endpoint Metrics

| Metric | Benchmark | Comparison |
|---|---|---|
| Theorem proved rate | MiniF2F-test (244 problems) | ReProver, DeepSeek-Prover-V2, HTPS |
| Theorem proved rate | Mathlib test split (~10k) | ReProver |
| Premise retrieval recall@k | Mathlib (ground-truth premises) | ReProver's retriever |
| Proof search nodes expanded | MiniF2F | All baselines |
| Wall-clock time per proof | MiniF2F | All baselines |
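Recall@k is the fraction of ground-truth premises that appear in the top-k retrieved list, averaged over goals. A minimal sketch (the premise names and retrieval lists below are hypothetical, not from the actual benchmark):

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of ground-truth premises found among the top-k retrieved."""
    if not ground_truth:
        return 1.0  # vacuously perfect when the proof needs no premises
    top_k = set(retrieved[:k])
    hits = sum(1 for p in ground_truth if p in top_k)
    return hits / len(ground_truth)

# Hypothetical goal that needs two Mathlib lemmas
retrieved = ["Nat.add_comm", "Nat.mul_comm", "Nat.add_assoc", "Nat.zero_add"]
ground_truth = ["Nat.add_comm", "Nat.zero_add"]
print(recall_at_k(retrieved, ground_truth, k=3))  # 0.5: only add_comm is in the top 3
```

In practice the metric is averaged over all test goals for several values of k (e.g. k = 1, 10, 100).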

5.2 Trajectory Metrics (PAB)

| Metric | What it measures |
|---|---|
| Navigational consistency | Do similar goals produce similar directions? |
| Crystallization curve | How quickly do ternary weights commit? |
| Progress prediction accuracy | Does the model learn to estimate remaining steps? |
| Domain progression order | Which Mathlib namespaces are learned first? |
| Bank-wise accuracy | Which navigational dimensions are learned first? |
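One way to read the crystallization curve: if ternary weights are obtained by thresholding continuous latents, the curve tracks the fraction of latents whose magnitude has cleared the quantization threshold at each checkpoint. A minimal sketch under that assumption (the threshold and snapshot values are illustrative, not the project's actual definition):

```python
def crystallization(latent_weights, threshold=0.5):
    """Fraction of latent weights 'committed' to a nonzero ternary value,
    i.e. whose magnitude clears the quantization threshold."""
    committed = [w for w in latent_weights if abs(w) >= threshold]
    return len(committed) / len(latent_weights)

# Hypothetical training snapshots: latents drift away from zero over training
snapshots = [
    [0.1, -0.2, 0.3, 0.05],   # early: nothing committed
    [0.6, -0.4, 0.7, 0.1],    # mid: half committed
    [0.9, -0.8, 0.95, 0.6],   # late: fully crystallized
]
curve = [crystallization(s) for s in snapshots]
print(curve)  # [0.0, 0.5, 1.0]
```

Plotting this fraction against training step gives the crystallization curve; a steep rise means the navigational directions commit early.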

5.3 Ablation Studies

| Variant | What it tests |
|---|---|
| Dense retrieval (no proof network) | Is navigational retrieval better than embedding similarity? |
| Tactic classification (no navigation) | Are directional coordinates better than vocabulary indices? |
| No spreading activation | Does spreading activation improve search? |
| No progress head | Does progress prediction improve goal selection? |
| Continuous decoder (no ternary) | Are ternary weights beneficial for navigation? |
| No IDF weighting | Does IDF improve premise retrieval? |
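Each ablation disables exactly one component while holding everything else fixed. A sketch of how such a harness might be configured (the flag and variant names are hypothetical, not the project's actual API):

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class Config:
    """Full system; each flag gates one ablatable component."""
    use_proof_network: bool = True        # navigational retrieval
    use_navigation: bool = True           # directional tactic coordinates
    use_spreading_activation: bool = True
    use_progress_head: bool = True
    use_ternary: bool = True
    use_idf: bool = True

# One variant per table row: flip a single flag off the baseline
ABLATIONS = {
    "dense_retrieval":         replace(Config(), use_proof_network=False),
    "tactic_classification":   replace(Config(), use_navigation=False),
    "no_spreading_activation": replace(Config(), use_spreading_activation=False),
    "no_progress_head":        replace(Config(), use_progress_head=False),
    "continuous_decoder":      replace(Config(), use_ternary=False),
    "no_idf":                  replace(Config(), use_idf=False),
}

for name, cfg in ABLATIONS.items():
    print(name, asdict(cfg))  # each run would feed into the same evaluation pipeline
```

Keeping every variant one flag away from the baseline makes any metric delta attributable to that single component.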


Sources: docs/WAYFINDER_RESEARCH.md

See Also