Research | Chapter 6 | Prerequisites: 4-Training
5. Evaluation
5.1 Endpoint Metrics
| Metric | Benchmark | Comparison baselines |
|---|---|---|
| Theorem proved rate | MiniF2F-test (244 problems) | ReProver, DeepSeek-Prover-V2, HTPS |
| Theorem proved rate | Mathlib test split (~10k theorems) | ReProver |
| Premise retrieval recall@k | Mathlib (ground-truth premises) | ReProver's retriever |
| Proof search nodes expanded | MiniF2F | All baselines |
| Wall-clock time per proof | MiniF2F | All baselines |
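Of these, recall@k is the most mechanical to pin down: it is the fraction of ground-truth premises that appear in the retriever's top-k list. A minimal sketch (the premise names below are hypothetical examples, not results from the paper):

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of ground-truth premises found among the top-k retrieved."""
    if not ground_truth:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(ground_truth)) / len(ground_truth)

# Hypothetical example: 2 of the 3 ground-truth premises appear in the top-4.
retrieved = ["Nat.add_comm", "Nat.mul_comm", "List.map_append", "Nat.add_assoc"]
truth = ["Nat.add_comm", "Nat.add_assoc", "Nat.zero_add"]
print(recall_at_k(retrieved, truth, k=4))  # → 0.6666666666666666
```

Averaging this quantity over all proof states in the test split yields the benchmark number reported against ReProver's retriever.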
5.2 Trajectory Metrics (PAB)
| Metric | What it measures |
|---|---|
| Navigational consistency | Do similar goals produce similar directions? |
| Crystallization curve | How quickly do ternary weights commit? |
| Progress prediction accuracy | Does the model learn to estimate remaining steps? |
| Domain progression order | Which Mathlib namespaces are learned first? |
| Bank-wise accuracy | Which navigational dimensions are learned first? |
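Navigational consistency can be operationalized in several ways; one plausible sketch (the function and its pairing scheme are assumptions, not the paper's definition) is to pair each goal with its nearest neighbour in goal-embedding space and average the cosine similarity of their predicted direction vectors:

```python
import numpy as np

def navigational_consistency(goal_embs, directions):
    """Mean cosine similarity between each goal's direction vector and that of
    its nearest neighbour in goal-embedding space (hypothetical metric)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scores = []
    for i, g in enumerate(goal_embs):
        # Nearest *other* goal by embedding distance.
        dists = [np.linalg.norm(g - h) for j, h in enumerate(goal_embs) if j != i]
        j = int(np.argmin(dists))
        j = j if j < i else j + 1  # re-index past the excluded self entry
        scores.append(cos(directions[i], directions[j]))
    return float(np.mean(scores))
```

A score near 1.0 means similar goals get near-identical directions; a score near 0 means the mapping from goals to directions is incoherent.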
5.3 Ablation Studies
| Variant | What it tests |
|---|---|
| Dense retrieval (no proof network) | Is navigational retrieval better than embedding similarity? |
| Tactic classification (no navigation) | Are directional coordinates better than vocabulary indices? |
| No spreading activation | Does spreading activation improve search? |
| No progress head | Does progress prediction improve goal selection? |
| Continuous decoder (no ternary) | Are ternary weights beneficial for navigation? |
| No IDF weighting | Does IDF improve premise retrieval? |
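Since each ablation disables exactly one component, the variants are naturally expressed as single-flag flips of a base configuration. A minimal sketch (the flag names are hypothetical; the source does not specify a config schema):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AblationConfig:
    use_proof_network: bool = True       # off → dense embedding retrieval only
    use_navigation: bool = True          # off → plain tactic classification
    use_spreading_activation: bool = True
    use_progress_head: bool = True
    use_ternary_weights: bool = True     # off → continuous decoder
    use_idf_weighting: bool = True

BASE = AblationConfig()
VARIANTS = {
    "dense_retrieval":       replace(BASE, use_proof_network=False),
    "tactic_classification": replace(BASE, use_navigation=False),
    "no_spreading":          replace(BASE, use_spreading_activation=False),
    "no_progress_head":      replace(BASE, use_progress_head=False),
    "continuous_decoder":    replace(BASE, use_ternary_weights=False),
    "no_idf":                replace(BASE, use_idf_weighting=False),
}
```

Keeping every variant one flip away from `BASE` guarantees that any metric gap is attributable to the single disabled component.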
Sources: docs/WAYFINDER_RESEARCH.md
See Also