Research | Chapter 3 | Prerequisites: 1-Introduction

2. Theoretical Framework

2.1 Proof Search as Spatial Navigation

We propose a fundamental reframing: proof search is navigation through a structured mathematical space, not sequential token prediction.

In token prediction, the model outputs a sequence of tactic tokens conditioned on the goal state. Each token is selected from a flat vocabulary via softmax. The model must learn, from training data alone, which tactics are applicable in which situations — the structure of mathematical knowledge is implicit in the weights.

In navigational search, the model outputs directions in a structured coordinate system. The directions resolve to concrete tactics and premises through deterministic lookup in a semantic network that explicitly encodes the structure of mathematical knowledge. The model's job is narrower: predict which region of mathematical space contains the answer. The symbolic system handles the last mile.

This reframing has three theoretical consequences:

Consequence 1: Separation of concerns. The neural network handles understanding (what kind of mathematical situation is this?) while the symbolic system handles retrieval (given this kind of situation, which specific lemmas and tactics are relevant?). Neither system needs to do the other's job. This is the same separation that makes ModelAtlas work: the LLM decomposes the user's natural language query into structured parameters; ModelAtlas does deterministic math.

Consequence 2: Auditable search. Every premise retrieval traces back to specific shared anchors and bank alignments. "This lemma scored 0.87 because it shares the rare anchors monotone-convergence and finset-sum with the current goal, and its DOMAIN bank position aligns with the goal's." Compare to dense retrieval: "the embedding similarity was 0.87" (uninspectable).

Consequence 3: Amortized computation. The semantic network is built once from Mathlib's type information, namespace structure, and dependency graph. At query time, navigation is pure arithmetic on precomputed structures. The neural network runs once per proof state to produce coordinates; the search engine can explore thousands of candidate lemmas in the time it takes to run one forward pass.

2.2 Signed Semantic Banks for Mathematical Entities

We define six orthogonal proof banks — signed dimensions with zero states placed at the mode of mathematical activity:


Bank           Zero State                  Negative                   Positive
────────────── ─────────────────────────── ────────────────────────── ─────────────────────────
STRUCTURE      equality / rewrite goal     arithmetic, decidable      quantified, higher-order
DOMAIN         general algebra             concrete (ℕ, ℤ, ℚ)        abstract (topology, category)
DEPTH          2-3 tactic proof            trivial (1 tactic)         deep (10+ tactics)
AUTOMATION     partially automated         fully auto (omega, simp)   requires manual insight
CONTEXT        moderate hypotheses (3-5)   no hypotheses              rich context (10+ hyps)
DECOMPOSITION  single-goal                 atomic (no subgoals)       multi-subgoal

Why these banks are orthogonal: A theorem can be structurally simple (STRUCTURE=0) but in an abstract domain (DOMAIN=+2); it can be deeply nested (DEPTH=+2) but fully automated (AUTOMATION=-2, e.g., a long simp chain). The banks capture independent dimensions of variation.

Zero placement rationale: Zero is "the most common proof goal a working mathematician encounters." The majority of Mathlib lemmas are about algebraic equalities or inequalities, require 2-3 tactics, and use a mix of automation and manual reasoning. Queries resolve near the origin.

Positioning examples:

Lemma                           STRUCTURE  DOMAIN  DEPTH   AUTOMATION  CONTEXT  DECOMP
─────────────────────────────── ────────── ─────── ─────── ─────────── ──────── ───────
Nat.add_comm                    (0,0)      (-1,1)  (-1,2)  (-1,2)      (-1,1)   (-1,1)
Set.Finite.induction_on         (+1,2)     (+1,1)  (+1,1)  (+1,2)      (+1,1)   (+1,2)
MonoidHom.map_mul               (0,0)      (0,0)   (-1,1)  (-1,1)      (0,0)    (-1,1)
TopologicalSpace.isOpen_iUnion  (+1,1)     (+1,2)  (0,0)   (+1,1)      (+1,1)   (+1,1)

Bank positions are extracted deterministically from Lean's type information:

  • STRUCTURE: Analyze the goal type — quantifier nesting, dependent types, propositional vs. data.
  • DOMAIN: Namespace hierarchy (Mathlib.Algebra → DOMAIN=0, Mathlib.Topology → DOMAIN=+2).
  • DEPTH: Proof length from ground-truth proofs (when available) or heuristic estimation from type complexity.
  • AUTOMATION: Whether the lemma's proof (or similar lemmas' proofs) use automation tactics.
  • CONTEXT: Number and complexity of hypotheses in the local context.
  • DECOMPOSITION: Whether the proof requires multiple have statements or case splits.
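As a concrete illustration of the deterministic Tier-1 extraction, the DOMAIN bank can be read off a lemma's namespace by longest-prefix match. This is a minimal sketch; the namespace-to-position map and the function name are illustrative assumptions, not Wayfinder's actual tables.

```python
# Hypothetical Tier-1 DOMAIN extraction: longest-prefix match on the
# lemma's namespace. The map below is illustrative, not exhaustive.
NAMESPACE_DOMAIN = {
    "Mathlib.Data.Nat": -2,       # concrete arithmetic
    "Mathlib.Data.Int": -2,
    "Mathlib.Algebra": 0,         # general algebra = zero state
    "Mathlib.Order": 0,
    "Mathlib.Topology": 2,        # abstract
    "Mathlib.CategoryTheory": 2,
}

def domain_position(namespace: str) -> int:
    """Longest-prefix match; unknown namespaces default to the zero state."""
    best = ""
    for prefix in NAMESPACE_DOMAIN:
        if namespace.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return NAMESPACE_DOMAIN.get(best, 0)

print(domain_position("Mathlib.Topology.Basic"))   # 2
print(domain_position("Mathlib.Data.Nat.Lemmas"))  # -2
```

The same pattern applies to the other banks: each extractor is a pure function from type/namespace/proof metadata to a signed integer, so positions can be recomputed whenever Mathlib changes.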

2.3 Anchor Dictionaries as Mathematical Vocabulary

An anchor is a semantic label shared across mathematical entities. Anchors create emergent similarity — two lemmas sharing 12 anchors are related without an explicit edge. The anchor dictionary is the semantic vocabulary of the system.

Proof anchors by bank:

Bank           Example Anchors
────────────── ──────────────────────────────────────────────────────────────
STRUCTURE      equality, inequality, membership, universal-quantifier, existential, iff, implication, negation, dependent-type
DOMAIN         nat-arithmetic, group-theory, ring-theory, topology, measure-theory, linear-algebra, order-theory, set-theory, category-theory
DEPTH          one-liner, multi-step, deep-induction, structural-recursion
AUTOMATION     omega-solvable, simp-solvable, decide-solvable, norm_num-solvable, needs-manual-intro, needs-manual-cases
CONTEXT        hypothesis-rich, hypothesis-free, uses-local-def, type-class-instance
DECOMPOSITION  needs-cases, needs-induction, needs-have-chain, single-goal, multi-branch

Cross-cutting anchors (not bank-specific): monotonicity, commutativity, associativity, distributivity, idempotency, transitivity, injectivity, surjectivity, continuity, compactness, finiteness, convergence, bounded, well-founded

Anchor extraction:

  • Tier 1 (deterministic): From Lean's type checker — Nat in the type → nat-arithmetic; ∀ in the goal → universal-quantifier; namespace Mathlib.Topology → topology.
  • Tier 2 (pattern matching): From proof text — omega in proof → omega-solvable; induction → needs-induction; rw [mul_comm] → commutativity.
  • Tier 3 (optional): LLM-based annotation for subtle mathematical properties not captured by syntax.

IDF weighting: The anchor equality appears on thousands of lemmas. The anchor Lyapunov-monotonicity appears on maybe 5. IDF = log(N / count_lemmas_with_anchor). When a goal state shares both anchors with a candidate lemma, the IDF weighting correctly focuses on the rare, informative one. This mirrors how human mathematicians search: they key off the unusual, distinctive features of a problem.
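The IDF formula above is a one-liner; a minimal sketch (with made-up corpus counts, not Mathlib statistics) shows how shared-anchor scoring automatically keys off the rare anchor:

```python
import math

def idf(anchor: str, anchor_counts: dict, n_lemmas: int) -> float:
    # IDF = log(N / count): ubiquitous anchors score near zero,
    # rare anchors score high.
    return math.log(n_lemmas / anchor_counts[anchor])

def anchor_overlap_score(goal_anchors: set, lemma_anchors: set,
                         counts: dict, n: int) -> float:
    """IDF-weighted sum over shared anchors."""
    return sum(idf(a, counts, n) for a in goal_anchors & lemma_anchors)

# Toy corpus of 10,000 lemmas; counts are illustrative.
counts = {"equality": 4000, "monotone-convergence": 5}
print(round(idf("equality", counts, 10_000), 2))              # 0.92
print(round(idf("monotone-convergence", counts, 10_000), 2))  # 7.6
```

Sharing `monotone-convergence` contributes roughly eight times more to the score than sharing `equality`, which is exactly the "key off the unusual feature" behavior described above.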

Anchor gap analysis (critical bootstrap step): The ~300 initial anchors will have blind spots. The system's retrieval ceiling is bounded by anchor expressiveness — if a relevant mathematical property isn't captured by any anchor, it's invisible to navigate(). Known gaps in the initial dictionary include:

  • Type coercion anchors: Many Mathlib proofs struggle with coercion between types (↑n, (n : ℤ), Nat.cast). This is a major source of proof difficulty and needs dedicated anchors: needs-cast, nat-int-coercion, subtype-coercion, coe-simp.
  • Algebraic hierarchy anchors: The Mathlib typeclass hierarchy (Monoid → Group → Ring → Field) creates proof obligations that are structurally similar but domain-specific. Anchors should track position in the hierarchy: monoid-level, group-level, ring-level, field-level.
  • Proof pattern anchors: Beyond tactic names, recurring proof patterns need labels: split-and-recombine, contrapositive-argument, epsilon-delta, sequence-limit, diagonal-argument.

Anchor gap analysis procedure (executed iteratively in Phase 0-1):

  1. For 500 randomly selected proof steps, attempt to resolve the ground-truth tactic and premises via navigate() using only the existing anchor dictionary.
  2. For each failure (correct premise not in top-16), examine what anchors would have connected the goal to the premise. These are gap anchors.
  3. Cluster the gap anchors by mathematical theme. Add the most frequent clusters as new anchors.
  4. Re-run the analysis. Iterate until top-16 recall on the 500-step sample exceeds 70%.
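The procedure is a simple measure-mine-add loop. The sketch below shows its shape; `navigate` and `missing_anchors` are hypothetical stand-ins for the real retrieval engine and the manual/automated gap examination, and the batch size is an assumption.

```python
# Illustrative gap-analysis loop; `navigate` and `missing_anchors` are
# hypothetical stand-ins, not Wayfinder's actual API.
from collections import Counter

def refine_anchor_dictionary(sample, anchors, navigate, missing_anchors,
                             target_recall=0.70, top_k=16, batch=10):
    """Measure top-k recall, mine gap anchors from failures, add the most
    frequent ones, and repeat until recall clears the target."""
    while True:
        hits, gap_anchors = 0, Counter()
        for step in sample:  # e.g. 500 randomly selected proof steps
            if step.premise in navigate(step.goal, anchors)[:top_k]:
                hits += 1
            else:
                gap_anchors.update(missing_anchors(step.goal, step.premise))
        recall = hits / len(sample)
        if recall >= target_recall:
            return anchors, recall
        anchors |= {a for a, _ in gap_anchors.most_common(batch)}
```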

This iterative refinement is the single highest-leverage activity in Phase 0. The proof network's quality caps the entire system.

2.4 Premise Retrieval as Structured Navigation

With lemmas positioned in banks and connected through anchors, premise retrieval becomes a navigate() query — the same operation ModelAtlas uses to find ML models:


retrieve(goal) =
    bank_alignment(goal_position, lemma_position)      # Are we in the right region?
  × anchor_relevance(goal_anchors, lemma_anchors, IDF)  # Do semantic labels match?
  × seed_similarity(context_lemmas, lemma_anchors, IDF) # Similar to what we've already used?

Three signals, multiplicative:

1. Bank alignment. For each bank where the goal has a position, score the candidate lemma's alignment:

  • Same direction: 1.0 (fully aligned)
  • At zero: 0.5 (neutral)
  • Opposite direction: 1 / (1 + |distance|) (decaying penalty)
  • Multiplicative across banks — a lemma that matches DOMAIN but opposes STRUCTURE gets penalized.
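The bank-alignment rules above can be sketched directly. This is a minimal interpretation under two assumptions: "at zero" refers to the candidate lemma sitting at a bank's zero state, and "distance" is the gap between goal and lemma positions; the function name is illustrative.

```python
def bank_alignment(goal_pos: dict, lemma_pos: dict) -> float:
    """Multiplicative alignment over the banks where the goal has a
    position. Assumes: lemma at zero -> neutral 0.5; same sign -> 1.0;
    opposite sign -> decaying penalty on position distance."""
    score = 1.0
    for bank, g in goal_pos.items():
        l = lemma_pos.get(bank, 0)
        if l == 0:
            score *= 0.5                       # at zero: neutral
        elif (g > 0) == (l > 0):
            score *= 1.0                       # same direction: aligned
        else:
            score *= 1.0 / (1.0 + abs(g - l))  # opposite: decaying penalty
    return score

goal = {"DOMAIN": 2, "STRUCTURE": -1}
print(bank_alignment(goal, {"DOMAIN": 1, "STRUCTURE": -2}))  # 1.0
print(bank_alignment(goal, {"DOMAIN": -1, "STRUCTURE": 0}))  # 0.125
```

The second candidate opposes DOMAIN (factor 1/4) and is neutral on STRUCTURE (factor 0.5), so the multiplicative combination drives it well below the aligned candidate.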

2. Anchor relevance (IDF-weighted). Three anchor lists:

  • require: Hard filter. The goal has nat-arithmetic and inequality → only lemmas with both anchors survive.
  • prefer: Soft boost. The goal also has monotonicity → lemmas with this anchor get IDF-weighted bonus.
  • avoid: Penalty. The search has already tried omega and failed → penalize omega-solvable lemmas.

3. Seed similarity. If the proof has already successfully applied certain lemmas, find candidates with overlapping anchor sets. IDF-weighted Jaccard between the seed's anchors and the candidate's anchors.
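IDF-weighted Jaccard is standard enough to state precisely; this sketch assumes anchors missing from the IDF table contribute zero weight (the toy weights are illustrative):

```python
def weighted_jaccard(seed_anchors: set, cand_anchors: set,
                     idf: dict) -> float:
    """IDF-weighted Jaccard: shared rare anchors count far more than
    shared ubiquitous ones. `idf` maps anchor -> log(N / count)."""
    inter = sum(idf.get(a, 0.0) for a in seed_anchors & cand_anchors)
    union = sum(idf.get(a, 0.0) for a in seed_anchors | cand_anchors)
    return inter / union if union else 0.0

idf = {"equality": 0.9, "monotone-convergence": 7.6, "finiteness": 3.0}
seed = {"equality", "monotone-convergence"}   # anchors of an applied lemma
cand = {"equality", "monotone-convergence", "finiteness"}
print(round(weighted_jaccard(seed, cand, idf), 3))  # 0.739
```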

Why multiplicative scoring matters for premises: In dense retrieval, a lemma about natural numbers that happens to have similar embedding structure to a topology proof can score highly — the embedding conflates all dimensions. With multiplicative scoring, if the DOMAIN bank doesn't align, the score goes to zero. Nat.add_comm is never relevant to a topology proof, no matter how textually similar the goal states look.

Scoring mechanism — open design question: Pure multiplicative scoring (Π bank_score_i) is both the system's precision advantage and its fragility. A zero in any bank kills the total score. When bank positions are noisy (and they will be, especially early), this means a misclassified STRUCTURE position can exclude correct premises that match perfectly on every other dimension.

Several alternative scoring mechanisms should be evaluated empirically:

Mechanism                                      Formula                                            Tradeoff
────────────────────────────────────────────── ────────────────────────────────────────────────── ─────────────────────────────────────────────
Pure multiplicative (ModelAtlas)               Π score_i                                          Maximum precision, zero tolerance for noise
Confidence-weighted multiplicative             Π score_i^(confidence_i)                           Banks with uncertain positions contribute less; requires per-bank confidence estimation
Geometric mean                                 (Π score_i)^(1/k)                                  Softer than product; a single low score reduces but doesn't kill
Log-additive with learned weights              Σ w_i · log(score_i)                               Equivalent to weighted geometric mean; weights learned per bank
Soft-floor multiplicative                      Π max(score_i, ε)                                  Floor prevents hard zeros; ε tunable per bank
Hybrid: multiplicative filter + additive rank  Hard bank filter, then additive anchor+seed rank   Separates coarse filtering from fine ranking

The confidence-weighted variant is theoretically most principled: if the GoalAnalyzer's bank head outputs a softmax distribution where the winning class has only 40% probability (uncertain), that bank should influence the final score less than one where the winning class has 95% probability. This naturally degrades to pure multiplicative when confidence is high and to no-bank-influence when confidence is low.

The right mechanism likely depends on proof network maturity. Early (noisy bank positions): confidence-weighted or soft-floor. Late (refined positions): pure multiplicative. This could be a training schedule parameter.
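The confidence-weighted variant has a clean closed form, Π score_i^(confidence_i). A minimal sketch shows the two limiting behaviors described above (function name and toy numbers are illustrative):

```python
def confidence_weighted(scores: dict, confidence: dict) -> float:
    """Π score_i ** confidence_i. Confidence 1.0 everywhere reduces to
    pure multiplicative; confidence 0.0 makes a bank a no-op (factor 1)."""
    total = 1.0
    for bank, s in scores.items():
        total *= s ** confidence.get(bank, 1.0)
    return total

scores = {"DOMAIN": 0.9, "STRUCTURE": 0.1}

# Uncertain STRUCTURE head (40% winning-class probability) is softened:
print(confidence_weighted(scores, {"DOMAIN": 0.95, "STRUCTURE": 0.40}))
# Full confidence recovers the pure product 0.9 * 0.1:
print(confidence_weighted(scores, {"DOMAIN": 1.0, "STRUCTURE": 1.0}))
```

Note how a low STRUCTURE score (0.1) only mildly dampens the total when its confidence is 0.40 (0.1^0.4 ≈ 0.4), but kills it under full confidence, matching the intended annealing from soft to pure multiplicative as the proof network matures.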

Accessible-premises filtering (free lunch from ReProver). ReProver demonstrated that restricting retrieval to premises accessible via Lean's import structure (~33k of ~130k total) gives a ~2% recall improvement at zero compute cost. Wayfinder's proof network should encode this constraint: each entity stores its import-accessible set, and navigate() pre-filters to accessible premises before scoring. This is a hard filter in SQL (WHERE entity_id IN accessible_set), applied before any bank alignment or anchor scoring. LeanDojo's extraction already provides this accessibility information.

Performance: Batch SQL replaces per-lemma neural forward passes. Pre-filter by accessible premises AND required anchors (SQL HAVING), batch-fetch positions and anchor sets, score in Python from in-memory dicts. Four indexed queries instead of 100k dot products.
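The hard pre-filter can be expressed as a single grouped query. This sketch uses an in-memory SQLite database with a hypothetical one-table schema (`lemma_anchors`); the real Wayfinder schema is assumed to differ.

```python
# Hypothetical schema: one (lemma_id, anchor) row per anchor assignment.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lemma_anchors (lemma_id INTEGER, anchor TEXT);
INSERT INTO lemma_anchors VALUES (1,'nat-arithmetic'),(1,'inequality'),
                                 (2,'nat-arithmetic'),(3,'topology');
""")

required = ("nat-arithmetic", "inequality")
accessible = (1, 2)  # import-accessible premise ids (hard pre-filter)

# Survivors must be accessible AND carry ALL required anchors (HAVING).
rows = conn.execute(f"""
    SELECT lemma_id FROM lemma_anchors
    WHERE anchor IN ({",".join("?" * len(required))})
      AND lemma_id IN ({",".join("?" * len(accessible))})
    GROUP BY lemma_id
    HAVING COUNT(DISTINCT anchor) = ?
""", (*required, *accessible, len(required))).fetchall()

print([r[0] for r in rows])  # [1]
```

Lemma 2 is accessible but lacks `inequality`, and lemma 3 fails both filters, so only lemma 1 proceeds to bank-alignment and anchor scoring in Python.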

2.5 Ternary Weights as Navigational Coordinates

ModelAtlas's bank positions are signed integers. The Balanced Sashimi decoder uses ternary {-1, 0, +1} weights. These are the same conceptual operation: discretizing a continuous space into signed directions relative to a meaningful zero.

We propose that the ternary decoder does not predict tactic tokens from a vocabulary. Instead, it predicts navigational directions in proof space:

  • STRUCTURE direction {-1, 0, +1}: Should the next step simplify (toward decidable/arithmetic), maintain (rewrite), or complexify (introduce quantifiers)?
  • AUTOMATION direction {-1, 0, +1}: Should the next step use full automation (omega, simp), mixed, or manual reasoning?
  • DECOMPOSITION direction {-1, 0, +1}: Should the goal be decomposed into subgoals (via have, cases, induction), maintained as-is, or atomized?

These ternary coordinates, combined with predicted anchor logits, form a StructuredQuery that resolves to concrete tactics and premises through the semantic network:
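A StructuredQuery might look like the following sketch; all field names here are illustrative assumptions about its shape, not the actual specification:

```python
# Hypothetical StructuredQuery shape; field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    # Ternary navigational directions predicted by the decoder.
    structure: int = 0       # -1 simplify, 0 maintain, +1 complexify
    automation: int = 0      # -1 full automation, 0 mixed, +1 manual
    decomposition: int = 0   # -1 atomize, 0 maintain, +1 decompose
    # Anchor lists resolved from the predicted anchor logits.
    require: set = field(default_factory=set)
    prefer: set = field(default_factory=set)
    avoid: set = field(default_factory=set)

q = StructuredQuery(automation=-1,
                    require={"nat-arithmetic", "inequality"},
                    prefer={"monotonicity"})
# The semantic network resolves q by deterministic lookup, e.g.
# automation=-1 routes toward omega/simp-style tactics first.
print(q.automation, sorted(q.require))
```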




---
*Sources: `docs/WAYFINDER_RESEARCH.md`*
## See Also

- [1-Introduction](1-Introduction)
- [3-Architecture-Specification](3-Architecture-Specification)