6. Related Work
A comprehensive survey of the external landscape is in docs/Wayfinder_External_Research.md. This section summarizes the most architecturally relevant systems and their implications for Wayfinder.
6.1 ReProver / LeanDojo (Yang et al., NeurIPS 2023)
Retrieval-augmented tactic generation for Lean 4. ByT5-small (299M) dual-encoder retriever with 1472-dim embeddings, MSE-based contrastive loss with 3 negatives per example. Restricts retrieval to accessible premises only (~33k of ~130k), giving ~2% recall improvement for free. On premises never seen during training, recall@10 drops from 38.4% to 27.6% — a 30% degradation revealing the fundamental limitation of dense retrieval for unseen entities. Key difference from Wayfinder: Dense retrieval (learned embeddings) vs structured navigation (bank positions + IDF-weighted anchors). ReProver requires neural forward passes for every retrieval; Wayfinder resolves via SQL queries. Design influence: Accessible-premises filtering adopted directly; ByT5-small as encoder candidate.
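The accessible-premises idea can be made concrete with a small sketch (module and premise tables here are hypothetical, not ReProver's actual data structures): only premises whose defining module is transitively imported by the goal's module are legal arguments, so filtering removes candidates that could never be applied.

```python
# Sketch of accessible-premises filtering (hypothetical tables):
# restrict retrieval to premises reachable through the import graph.

def accessible_modules(import_graph, module):
    """Transitive closure over the import graph, including the module itself."""
    seen, stack = set(), [module]
    while stack:
        m = stack.pop()
        if m not in seen:
            seen.add(m)
            stack.extend(import_graph.get(m, []))
    return seen

def accessible_premises(premises_by_module, import_graph, goal_module):
    mods = accessible_modules(import_graph, goal_module)
    return [p for m in sorted(mods) for p in premises_by_module.get(m, [])]

import_graph = {"A": ["B"], "B": ["C"], "C": []}
premises_by_module = {"A": ["thm_a"], "B": ["thm_b"], "C": ["thm_c"], "D": ["thm_d"]}
candidates = accessible_premises(premises_by_module, import_graph, "A")
# "thm_d" is excluded: module D is not reachable from A
```

The same filter maps naturally onto Wayfinder's SQL-based resolution: reachability can be precomputed as a table and joined against at query time.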
6.2 AlphaProof (DeepMind, 2024)
~3B encoder-decoder transformer with AND/OR tree search and AlphaZero-style MCTS. Critically, the value function predicts the number of tactics remaining to completion rather than binary provability: under binary labels, every state on a completed proof receives value 1.0, destroying the training signal. Test-time RL (TTRL): for each hard problem, generates millions of variants and trains a specialist agent, requiring hundreds of TPU-days per IMO problem. Backpropagates the value of the hardest branch rather than the product of children. Design influence: Soft critic targets (distance, not binary) adopted directly. The encoder-decoder pattern with dual policy/value output is the efficiency model for search-heavy systems.
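A minimal sketch of distance-based critic targets (the exponential-decay form and the scale `tau` are our assumptions; AlphaProof regresses on remaining-step counts): states on a successful proof are labeled by distance to completion instead of a uniform 1.0, preserving a usable training gradient.

```python
# Sketch: soft critic targets from a proof trace (tau is an assumed scale).
import math

def soft_targets(proof_length, tau=5.0):
    # state i on the trace has (proof_length - i) tactics remaining
    return [math.exp(-(proof_length - i) / tau) for i in range(proof_length + 1)]

targets = soft_targets(3)
# the final state (0 tactics remaining) gets exactly 1.0;
# every earlier state gets a strictly smaller value
```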
6.3 HTPS (Lample et al., NeurIPS 2022)
AlphaZero-inspired proof search with shared 600M encoder-decoder as policy and critic. Critic restricts decoder to two tokens (PROVABLE/UNPROVABLE). Critical finding: soft critic targets dramatically outperform hard targets — 78.1% vs 63.1% on Equations benchmark. Hard critic is worse than no critic at all. Value backpropagation through AND nodes uses product of children's values. Online training (continuous refresh every 5 minutes) outperforms batch expert iteration. Design influence: Soft critic targets adopted; our spreading activation replaces MCTS for search prioritization.
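The two AND-node backup rules mentioned here and in 6.2 can be contrasted in a few lines: HTPS takes the product of children's values (all subgoals must close), while AlphaProof backpropagates the hardest branch.

```python
# Sketch: two backup rules for AND nodes in an AND/OR proof graph.
from functools import reduce

def and_value_htps(child_values):
    # product: the state is provable only if every subgoal is
    return reduce(lambda a, b: a * b, child_values, 1.0)

def and_value_hardest(child_values):
    # hardest branch: the state is as hard as its weakest subgoal
    return min(child_values)

vals = [0.9, 0.5, 0.8]
product = and_value_htps(vals)     # ~0.36: compounds over subgoals
hardest = and_value_hardest(vals)  # 0.5: dominated by the weakest child
```

The product rule penalizes wide decompositions; the hardest-branch rule is indifferent to easy siblings, which matters when decomposition breadth varies.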
6.4 BFS-Prover (ByteDance, 2025)
Achieved 72.95% on MiniF2F-test with plain best-first search plus expert iteration, DPO from Lean compiler feedback, and length normalization. At fixed inference budget, outperforms InternLM2.5-StepProver's value-guided search (65.9%) and DeepSeek-Prover-V1.5's MCTS (63.5%). Implication for Wayfinder: With a strong enough policy, sophisticated search may be unnecessary. If Wayfinder's navigational directions are accurate, simple best-first search with spreading activation may suffice — no MCTS needed.
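A sketch of length-normalized best-first scoring (the exact normalization in BFS-Prover may differ; this is the standard average-log-probability form): without normalization, longer partial proofs are penalized simply for having more steps.

```python
# Sketch: best-first frontier ordered by length-normalized log-probability.
import heapq

def normalized_score(tactic_logprobs):
    return sum(tactic_logprobs) / len(tactic_logprobs)

frontier = []  # max-heap via negated scores
heapq.heappush(frontier, (-normalized_score([-0.1, -0.2]), "state_a"))
heapq.heappush(frontier, (-normalized_score([-0.9]), "state_b"))
best = heapq.heappop(frontier)[1]
# "state_a" is expanded first: its average log-probability (-0.15)
# beats state_b's (-0.9) despite the longer tactic history
```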
6.5 LeanProgress (2025)
Fine-tuned DeepSeek Coder 1.3B to predict remaining proof steps. Including proof history boosts accuracy from 61.8% to 75.1%. Combined score C(sᵢ) = α·N(sᵢ) + (1-α)·P(sᵢ) yields 3.8% improvement on Mathlib4. Design influence: Progress head inspired by this work; proof history input adopted (mean-pooled embeddings of previously closed goals concatenated to bridge input).
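The combined score and the history input can both be sketched (assuming N is the policy score and P the predicted progress, both in [0, 1], and that history is mean-pooled over embeddings of closed goals, as described above; all names are ours):

```python
# Sketch: combined policy/progress score and proof-history pooling.

def combined_score(n_policy, p_progress, alpha=0.5):
    # C(s) = alpha * N(s) + (1 - alpha) * P(s)
    return alpha * n_policy + (1 - alpha) * p_progress

def with_history(goal_vec, closed_goal_vecs):
    """Concatenate the goal embedding with mean-pooled closed-goal embeddings."""
    if closed_goal_vecs:
        pooled = [sum(col) / len(closed_goal_vecs) for col in zip(*closed_goal_vecs)]
    else:
        pooled = [0.0] * len(goal_vec)  # no history yet: zero vector
    return goal_vec + pooled

c = combined_score(0.8, 0.4, alpha=0.5)  # ~0.6
x = with_history([1.0, 0.0], [[0.0, 2.0], [2.0, 0.0]])
# x == [1.0, 0.0, 1.0, 1.0]: goal embedding + pooled history
```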
6.6 DeepSeek-Prover-V2 (Ren et al., 2025)
Two-model architecture: DeepSeek-V3 (671B MoE, ~21B activated) for have-chain decomposition, 7B prover for individual subgoals. GRPO with binary Lean compiler rewards. 88.9% on MiniF2F-test. Design influence: SubgoalDecomposer inspired by this, but as ternary decision within a small model. The division of labor (reasoning about proof structure vs executing individual steps) maps to our separation: neural network reasons about direction, proof network handles resolution.
6.7 Aristotle (Harmonic, 2025)
200B+ transformer with Monte Carlo Graph Search (MCTS extended to hypergraphs with state equivalence classes). Lemma-based informal reasoning: natural-language proof sketches → formal lemmas → iterative refinement. Test-time training on search traces. Disproof augmentation: each goal gets a negation transition to prune impossible branches. Implication for Wayfinder: State equivalence merging (if two proof states have identical goals) could reduce the search graph; worth exploring in Phase 5.
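State-equivalence merging can be sketched as node canonicalization (hashing on the sorted list of open goals is our assumption, not Aristotle's published mechanism): proof states with identical goals share one node, shrinking the search graph.

```python
# Sketch: merging proof states with identical goal sets into one node.

graph = {}

def node_for(goals):
    key = tuple(sorted(goals))  # canonical form: order-insensitive
    if key not in graph:
        graph[key] = {"goals": key, "children": []}
    return graph[key]

n1 = node_for(["a = b", "b = c"])
n2 = node_for(["b = c", "a = b"])  # same state, different goal order
merged = n1 is n2  # both resolve to the same node
```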
6.8 LeanHammer (2025) and the Hammer Tradition
Combines neural premise selection (LeanPremise) with symbolic methods (MePo relevance filter) and routes to both Aesop and ATP translation (Zipperposition). Achieves 37.3% cumulative proof rate on Mathlib, outperforming any single method by 21%. Magnushammer's two-stage SELECT+RERANK pipeline achieves 59.5% on PISA vs Sledgehammer's 38.3%.
Hammer complementarity. The broader hammer tradition (Sledgehammer for Isabelle, CoqHammer for Coq, LeanHammer for Lean) translates ITP goals into first-order logic for external ATP solvers. Wayfinder does not replace hammers; it handles the cases they cannot. Hammers dominate the AUTOMATION = -1 region of proof space (fully decidable, translatable to first-order logic). Wayfinder navigates the AUTOMATION ≥ 0 region (structured manual reasoning, lemma selection, subgoal decomposition). The optimal system is a hybrid: Wayfinder's navigational retrieval feeds premise candidates to hammers for the decidable portion, while handling the non-decidable portion via structured proof search. Integration point: when the navigator predicts AUTOMATION = -1 for a goal, delegate directly to LeanHammer/Aesop rather than attempting navigational resolution.
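The routing rule above reduces to a small dispatch sketch (all callables are hypothetical stand-ins, not real APIs): the navigator's predicted AUTOMATION level decides whether a goal is delegated to the hammer stack or handled by navigational proof search.

```python
# Sketch: hybrid routing on the predicted AUTOMATION level.

def route(goal, predict_automation, hammer, navigate):
    if predict_automation(goal) == -1:  # fully decidable region
        return hammer(goal)             # e.g. delegate to LeanHammer / Aesop
    return navigate(goal)               # structured navigational search

# stub demonstration with placeholder callables
result = route(
    "n + 0 = n",
    predict_automation=lambda g: -1,
    hammer=lambda g: "hammer",
    navigate=lambda g: "navigate",
)
# result == "hammer"
```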
6.9 Knowledge Graphs for Mathematics
AutoMathKG (2025): 13,388 entities, 29,459 edges, SBERT embeddings, evaluated with TransE/DistMult/R-GCN. A separate Lean 4 Mathlib graph in Neo4j captures theorem dependencies. GNN-augmented ReProver (arXiv 2510.23637) layers an RGCN over ByT5 embeddings. Tree-Based Premise Selection (NeurIPS 2025) proposes training-free selection using Weisfeiler-Lehman kernels.
Critical gap: Not a single published system uses a knowledge graph to guide proof search or premise selection. The integration stops at basic dependency filtering. No system implements spreading activation over mathematical knowledge structures. No system positions entities in structured coordinate systems beyond standard embeddings. This is Wayfinder's most distinctive design opportunity — the proof network architecture is genuinely novel in this landscape.
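What spreading activation over a premise graph looks like can be sketched directly (decay factor and step count are assumptions): activation flows outward from seed premises matched to the goal, ranking related premises without any neural forward pass.

```python
# Sketch: spreading activation over a premise-dependency graph.

def spread_activation(adj, seeds, decay=0.5, steps=2):
    act = {n: 1.0 for n in seeds}
    for _ in range(steps):
        nxt = dict(act)
        for n, a in act.items():
            for m in adj.get(n, []):
                nxt[m] = nxt.get(m, 0.0) + decay * a  # attenuated spread
        act = nxt
    return act

adj = {"add_comm": ["add_assoc"], "add_assoc": ["mul_comm"]}
act = spread_activation(adj, ["add_comm"])
# two hops out, "mul_comm" receives a small activation (0.25)
```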
6.10 TacticToe and Structured Tactic Prediction
TacticToe (HOL4) abstracts tactics to ~100 templates and separately predicts arguments. Graph2Tac (Coq, ICML 2024) uses GNN-based classification over ~100 tactic categories. ASTactic generates tactics as programs via grammar-controlled RNN. Proof transformation prediction (FroCoS 2023) guesses the correct tactic 74% of the time given before-and-after states.
No existing system cleanly separates tactic category prediction from specific tactic/argument generation. TacticToe's abstraction is the closest precedent, and its finding about argument prediction's importance validates the decomposition. Wayfinder's two-level architecture (ternary directions → structured resolution) is the first to develop this separation into a full navigation paradigm.
6.11 BitNet and Ternary Architectures
BitNet b1.58 (ternary weights) at 2B matches full-precision models on mathematical reasoning with 2.7× speedup and 3.5× memory reduction. No one has applied ternary architectures to formal theorem proving. Wayfinder opportunity: A ternary encoder (BitNet-style) feeding into a ternary decoder (existing STE infrastructure) would create a fully ternary pipeline — maximally efficient for real-time proof search.
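The quantization step of such a pipeline can be sketched with BitNet b1.58's absmean scaling (per-tensor; variable names are ours): weights map to {-1, 0, +1} times one shared scale.

```python
# Sketch: b1.58-style ternary quantization with absmean scaling.

def ternarize(weights):
    gamma = sum(abs(w) for w in weights) / len(weights)  # absmean scale
    quantized = [max(-1, min(1, round(w / (gamma + 1e-8)))) for w in weights]
    return quantized, gamma

q, gamma = ternarize([0.9, -0.05, -1.2, 0.4])
# q == [1, 0, -1, 1]: every weight is now one of three values
```

At inference time this turns matrix multiplication into additions and subtractions plus one rescale, which is where the speed and memory gains come from.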
6.12 ModelAtlas (Vinaik, 2025)
Semantic network of ~40k ML models positioned across 8 orthogonal signed banks with anchor-based similarity. IDF-weighted structured navigation outperforms keyword search and embedding-based similarity. Connection to Wayfinder: The entire navigational paradigm — banks, anchors, IDF scoring, multiplicative composition, spreading activation — is adapted from ModelAtlas.
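The IDF-weighted anchor scoring with multiplicative composition can be illustrated as follows (the exact combination rule in ModelAtlas is not reproduced here; this shows the idea): rare anchors shared between query and entity contribute more than common ones.

```python
# Sketch: IDF-weighted anchor overlap with multiplicative composition.
import math

def idf(anchor, all_entity_anchors):
    df = sum(anchor in anchors for anchors in all_entity_anchors)
    return math.log(len(all_entity_anchors) / (1 + df))

def anchor_score(query_anchors, entity_anchors, all_entity_anchors):
    shared = query_anchors & entity_anchors
    score = 1.0
    for a in shared:
        score *= 1.0 + idf(a, all_entity_anchors)  # multiplicative composition
    return score if shared else 0.0

corpus = [{"ring", "comm"}, {"ring"}, {"group"}, {"field"}]
s = anchor_score({"ring", "comm"}, corpus[0], corpus)
# "comm" (rare in the corpus) contributes more than "ring" (common)
```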
6.13 Mutation Theory (Vinaik, 2025-2026; formalized in Lean 4)
200+ Lean 4 theorems formalizing specification complexity, exact learning, teaching dimension, decomposition, trajectory convergence, and phase transitions. Connection to Wayfinder: Provides formal convergence guarantees. See Sections 2.7-2.8.
Sources: docs/WAYFINDER_RESEARCH.md