Experiment Plan | Chapter 6 | Prerequisites: Phase-3 | 28 min read | Home
Phase 4: Evaluation and Ablation (Week 7-9)
Goal: Deep analysis of results across all three streams.
4.1 Stream 1 Analysis: Navigation vs. Dense Retrieval
Questions to answer:
- Where does navigational retrieval outperform dense retrieval? (Which domains, which proof types?)
- Where does it underperform? (Is the deficit in bank positions, anchors, or IDF weighting?)
- What is the compute profile? (Neural forward passes saved per proof)
- Does the multiplicative scoring prevent good matches that additive scoring would find?
4.2 Stream 2 Analysis: Navigational Decoder vs. Token Classification
Questions to answer:
- Do navigational coordinates generalize to unseen theorem types better than vocabulary indices?
- Does the ternary decoder's crystallization pattern reveal structured learning?
- Which navigational bank is learned first? Does the order match architectural expectations?
- Does the anchor prediction quality correlate with proof success?
4.3 Stream 3 Analysis: PAB Trajectory Evaluation
Questions to answer:
- Does navigational accuracy trajectory predict final proof success better than loss trajectory?
- Does crystallization rate predict navigational consistency?
- Does progress prediction accuracy improve monotonically? Does it plateau?
- Does PAB detect failure modes that endpoint metrics miss?
4.4 Ablation Matrix
Run full evaluation for each ablation variant from Phase 2:
| Variant | MiniF2F | Mathlib | Retrieval R@16 | FP/proof | Notes |
| Full Wayfinder | — | — | — | — | Primary |
| Dense retrieval (no proof network) | — | — | — | — | Navigation thesis |
| Tactic classification (no navigation) | — | — | — | — | Architecture thesis |
| No spreading activation | — | — | — | — | Spreading thesis |
| No progress head | — | — | — | — | Progress thesis |
| Continuous decoder (no ternary) | — | — | — | — | Ternary thesis |
| No IDF weighting | — | — | — | — | IDF thesis |
| No bank alignment (anchors only) | — | — | — | — | Bank thesis |
| Binary critic (BCE, not soft MSE) | — | — | — | — | HTPS soft-target thesis |
| No proof history input | — | — | — | — | LeanProgress history thesis |
| No hammer delegation | — | — | — | — | Hammer complementarity thesis |
| No accessible-premises filter | — | — | — | — | ReProver filtering thesis |
| 3-bank navigation (original design) | — | — | — | — | 6-bank expansion thesis |
4.5 Write-up
Deliverables:
- Updated
WAYFINDER_RESEARCH.mdwith experimental results docs/EXPERIMENT_RESULTS.md— raw results ledger- PAB profile comparisons across all variants
- Failure analysis with navigational interpretation
Sources: docs/WAYFINDER_PLAN.md