Research | Chapter 2 | Prerequisites: Abstract
1. Introduction
1.1 The Navigation Problem in Proof Search
Every competitive neural theorem prover uses the same fundamental approach: encode the proof state as a dense vector, predict the next tactic token from a vocabulary, and embed the prediction in a tree search (BFS, MCTS, or best-first). Premise selection — choosing which lemmas from a library of ~100k candidates are relevant — is handled by a learned retriever that computes dot-product similarity between the proof state embedding and pre-computed premise embeddings.
This approach works. ReProver (Yang et al., 2024) achieves state-of-the-art premise retrieval via an encoder that jointly embeds proof states and premises. DeepSeek-Prover-V2 (Ren et al., 2025) achieves 88.9% on MiniF2F-test via subgoal decomposition and reinforcement learning. HTPS (Lample et al., 2022) uses AlphaZero-inspired MCTS with a policy-critic architecture.
But the approach has a structural limitation: the intelligence is in the wrong place. The neural network must run at every search node — every candidate tactic, every candidate premise, every expansion of the proof tree. Search budgets are dominated by neural forward passes. A single MiniF2F proof attempt may require thousands of forward passes through a multi-billion parameter model.
Meanwhile, the structure of mathematical knowledge is rich, regular, and well-understood. Mathematicians don't search for premises by computing similarity in a continuous embedding space. They navigate structured relationships: "this is an inequality, so I need monotonicity lemmas; this is about finite sets, so I need the Finset API; I already used linarith here, so the premises should be arithmetic." This navigation is combinatorial — it operates on discrete properties, categorical relationships, and structured hierarchies.
What if we could build a system where the neural network handles the hard part (understanding the proof state, predicting which direction to go) while search and retrieval are handled by deterministic symbolic operations on structured data?
1.2 The ModelAtlas Paradigm
ModelAtlas (Vinaik, 2025) solves an analogous problem in a different domain: navigating a space of ~40,000 ML models on HuggingFace to find the ones most relevant to a user's needs. Its core insight is that structured navigation through signed semantic coordinates outperforms both flat database queries and dense embedding retrieval for this task.
The key mechanisms:
1. Signed orthogonal banks. Eight independent dimensions (ARCHITECTURE, CAPABILITY, EFFICIENCY, etc.), each with a zero state placed at the most common query target. Models are positioned as signed distances from zero. A 3B model sits at EFFICIENCY = (-1, 2), a sign of -1 and a magnitude of 2: "two steps in the negative direction from the 7B zero." Positions enable gradient scoring: a query for "small" gives full marks to negative-efficiency models, partial marks to zero-efficiency models, and decaying marks to positive ones.
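The gradient-scoring idea can be sketched in a few lines. The function name, the (sign, magnitude) encoding, and the specific decay are illustrative assumptions for this sketch, not ModelAtlas's actual implementation:

```python
# Hypothetical sketch of bank-position gradient scoring; the decay schedule
# and credit values are illustrative, not ModelAtlas's real parameters.

def bank_score(query_sign: int, position: tuple[int, int]) -> float:
    """Score a (sign, magnitude) bank position against a query direction.

    position = (-1, 2) means "two steps in the negative direction from zero".
    Aligned positions get full marks, the zero state gets partial marks,
    and opposite positions get marks that decay with distance.
    """
    sign, magnitude = position
    if sign == 0 or magnitude == 0:
        return 0.5                    # at the zero state: partial credit
    if sign == query_sign:
        return 1.0                    # aligned with the query direction
    return 0.5 / (1 + magnitude)      # opposite direction: decaying credit

# A query for "small" (negative EFFICIENCY) against three models:
print(bank_score(-1, (-1, 2)))   # 3B model, two steps negative  -> 1.0
print(bank_score(-1, (0, 0)))    # 7B model at the zero state    -> 0.5
print(bank_score(-1, (+1, 3)))   # 70B model, three steps positive -> 0.125
```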
2. Anchor dictionary. A shared vocabulary of ~130 semantic labels (e.g., "instruction-following", "GGUF-available", "Llama-family"). Models link to their applicable anchors. Similarity is emergent — two models sharing 15 anchors are related without an explicit edge. No embeddings are computed or stored.
3. IDF-weighted scoring. Rare anchors (present on few models) count far more than ubiquitous ones. Sharing the anchor "proof-assistant" (12 models) matters more than sharing "decoder-only" (17,000 models). This prevents common features from dominating retrieval.
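Points 2 and 3 combine naturally: similarity is the IDF-weighted overlap of two anchor sets. A minimal sketch, assuming the standard log(N/df) weighting and using the document frequencies quoted above (the function names and data layout are invented for illustration):

```python
import math

# Illustrative IDF-weighted anchor overlap: each model is a set of anchor
# labels, and rare anchors contribute more weight than ubiquitous ones.

N_MODELS = 40_000  # approximate corpus size from the text

def idf(doc_freq: int) -> float:
    """Rare anchors weigh more: log(N / df)."""
    return math.log(N_MODELS / doc_freq)

def anchor_similarity(a: set[str], b: set[str], doc_freq: dict[str, int]) -> float:
    """Emergent similarity: sum of IDF weights over shared anchors."""
    return sum(idf(x_df) for x in a & b for x_df in [doc_freq[x]])

doc_freq = {"proof-assistant": 12, "decoder-only": 17_000}
m1 = {"proof-assistant", "decoder-only"}
m2 = {"proof-assistant", "decoder-only"}

# Sharing the rare "proof-assistant" anchor dominates the score:
print(idf(12))       # ~8.11
print(idf(17_000))   # ~0.86
print(anchor_similarity(m1, m2, doc_freq))
```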
4. Multiplicative composition. The final score is bank_alignment × anchor_relevance × seed_similarity. A zero in any component kills the result. A model that perfectly matches efficiency but completely fails capability scores 0, not 0.5. This prevents averaging away bad matches.
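The multiplicative rule is trivial to state in code, but the contrast with averaging is the point (component names follow the text; the numbers are made up):

```python
def final_score(bank_alignment: float, anchor_relevance: float,
                seed_similarity: float) -> float:
    """Multiplicative composition: a zero in any component kills the result."""
    return bank_alignment * anchor_relevance * seed_similarity

# Perfect efficiency but total capability failure scores 0, not ~0.6:
print(final_score(1.0, 0.0, 0.9))                 # -> 0.0
print((1.0 + 0.0 + 0.9) / 3)                      # averaging would give 0.633
print(final_score(0.8, 0.7, 0.9))                 # all decent -> 0.504
```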
5. Spreading activation. Bellman-Ford propagation from seed nodes through both explicit links (fine_tuned_from, variant_of) and implicit connections (shared anchors). Bank scoping prevents semantic bleeding — spreading through irrelevant dimensions is suppressed.
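A minimal sketch of the propagation step, assuming a simple multiplicative decay per hop over explicit links only (the decay factor, round count, and graph are illustrative; ModelAtlas additionally spreads through shared anchors and applies bank scoping):

```python
# Bellman-Ford-style relaxation: each round, every node may improve a
# neighbor's activation from its own activation times a decay factor.

def spread(graph: dict[str, list[str]], seeds: dict[str, float],
           decay: float = 0.5, rounds: int = 3) -> dict[str, float]:
    act = dict(seeds)
    for _ in range(rounds):
        updated = False
        for node, neighbors in graph.items():
            for nb in neighbors:
                candidate = act.get(node, 0.0) * decay
                if candidate > act.get(nb, 0.0):
                    act[nb] = candidate
                    updated = True
        if not updated:   # fixed point reached early
            break
    return act

links = {
    "llama-3-8b": ["llama-3-8b-instruct"],       # fine_tuned_from
    "llama-3-8b-instruct": ["llama-3-8b-gguf"],  # variant_of
}
print(spread(links, {"llama-3-8b": 1.0}))
# activation decays along links: instruct 0.5, gguf 0.25
```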
6. Zero neural inference at query time. The entire query engine runs on SQLite, integer arithmetic, set intersection, and logarithmic decay. Full semantic decomposition was done once at ingestion. Queries execute in milliseconds.
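To make "SQLite plus set intersection" concrete, here is a toy neural-free query path. The schema and data are invented for illustration; only the shape of the pipeline (SQL fetch, then set arithmetic in plain Python) reflects the text:

```python
import sqlite3

# Toy sketch: anchors live in SQLite; scoring is set intersection, no model.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE model_anchor (model TEXT, anchor TEXT);
    INSERT INTO model_anchor VALUES
        ('deepseek-prover', 'proof-assistant'),
        ('deepseek-prover', 'decoder-only'),
        ('llama-3-8b',      'decoder-only');
""")

query_anchors = {"proof-assistant"}

# Group anchors by model, then score by overlap with the query:
models: dict[str, set[str]] = {}
for model, anchor in con.execute("SELECT model, anchor FROM model_anchor"):
    models.setdefault(model, set()).add(anchor)

hits = {m: len(a & query_anchors) for m, a in models.items() if a & query_anchors}
print(hits)  # only the model carrying the queried anchor survives
```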
This paper asks: can the same paradigm be applied to mathematical theorem proving?
1.3 The Lean 4 / Mathlib Testbed
Lean 4 with Mathlib provides an ideal testbed for navigational proof search:
- Formally verified output: The Lean kernel provides unambiguous pass/fail verification. No proxy metrics, no subjective scores.
- Structured mathematical entities: Mathlib's ~100k lemmas have rich type information, namespace hierarchies, and dependency graphs — natural structure for bank positioning and anchor assignment.
- Finite tactic vocabulary: The set of commonly-used Lean 4 tactics is bounded (~50 core tactics, ~200 with variants). Tactic selection is thus a navigational decision, not open-ended text generation.
- Hierarchical proof structure: Proofs are trees of subgoals, not flat sequences. Each subgoal has a structured goal state with hypotheses, target type, and local context.
- Existing benchmarks: MiniF2F (488 problems), Mathlib test split (~10k theorems), and full Mathlib (~100k theorems) provide standardized evaluation at multiple scales.
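A Lean 4 goal state carries exactly the structure described above: named hypotheses, a target type, and a local context. A minimal illustrative example, assuming Mathlib's `Finset.card_pos : 0 < s.card ↔ s.Nonempty`:

```lean
-- At the point marked below, the goal state shows hypotheses above
-- the turnstile and the target type below it:
--   s : Finset ℕ
--   h : s.Nonempty
--   ⊢ 0 < s.card
example (s : Finset ℕ) (h : s.Nonempty) : 0 < s.card :=
  Finset.card_pos.mpr h
```

The hypotheses' types, the target's head symbol (`<` on `Finset.card`), and the `Finset` namespace are all discrete, structured features of the kind bank positioning and anchor assignment operate on.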
1.4 Research Questions
This work addresses five interconnected questions:
1. Navigational (Q1): Can structured semantic navigation — bank positions, IDF-weighted anchors, and spreading activation — match or exceed dense embedding retrieval for premise selection from Mathlib?
2. Architectural (Q2): Can a ternary {-1, 0, +1} decoder that produces navigational coordinates (directions in proof space) outperform classification-based tactic prediction, by leveraging the semantic network for resolution?
3. Search efficiency (Q3): Does separating the neural network (runs once per proof state) from the search engine (runs deterministic symbolic operations) enable faster proof search with equivalent or better success rates?
4. Process (Q4): Does the architecture's learning trajectory satisfy PAB's quality criteria? Can progress prediction (remaining proof steps) serve as both a training signal and a search heuristic?
5. Convergence (Q5): Do the formal guarantees from Mutation Theory (trajectory monotonicity, phase transitions, fixed-point partitions) apply to this navigational search process, providing formal convergence bounds?
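To make Q2 concrete, one way a ternary output could be resolved symbolically is nearest-coordinate lookup against tactic positions in the semantic network. Everything here is hypothetical: the dimension semantics, the tactic coordinates, and the agreement metric are invented to illustrate the question, not the proposed architecture:

```python
# Hypothetical Q2 sketch: the network emits one of {-1, 0, +1} per semantic
# dimension; the symbolic network ranks tactics by coordinate agreement.

def resolve(direction: tuple[int, ...],
            tactic_coords: dict[str, tuple[int, ...]]) -> list[str]:
    """Rank tactics by how many dimensions match the predicted direction."""
    def agreement(coord: tuple[int, ...]) -> int:
        return sum(1 for d, c in zip(direction, coord) if d == c)
    return sorted(tactic_coords, key=lambda t: -agreement(tactic_coords[t]))

# Invented coordinates over invented dimensions (arithmetic, structural, rewriting):
tactic_coords = {
    "linarith": (+1, 0, -1),
    "simp":     (-1, 0, +1),
    "ring":     (+1, -1, 0),
}
print(resolve((+1, 0, -1), tactic_coords))  # 'linarith' ranks first
```

The point of the sketch is that resolution is a deterministic symbolic operation: the network's job ends at the ternary direction.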
1.5 Scope and Stance
This is a research project with three streams:
- Primary objective: A working theorem prover that navigates proof space via structured semantic networks, evaluated on standard Lean 4 benchmarks against ReProver, LeanProgress, and DeepSeek-Prover-V2 baselines.
- Co-equal objective: Empirical validation that navigational search (symbolic operations on structured data) can compete with or exceed learned retrieval (neural forward passes) for premise selection.
- Tertiary objective: Empirical validation of PAB's trajectory evaluation framework in the theorem-proving domain, with progress prediction as a novel PAB metric.
Non-goal: Beating DeepSeek-Prover-V2's 88.9% on MiniF2F. That system uses a 671B parameter model and reinforcement learning. Our contribution is architectural: demonstrating that a small model (sub-100M trainable parameters) with structured navigation can achieve competitive results by spending compute on symbolic search rather than neural inference.
Sources: docs/WAYFINDER_RESEARCH.md