StellarPath Memory System (SPM) solves the core memory failure modes in long-horizon AI conversations: wrong recall, fact confusion, and the inability to recognize unanswerable questions. It has been evaluated on five benchmarks (four public, plus the internal LHCSB) and leads all comparison systems on every one.
// Five-benchmark results
All leading
// What we solve
AI applications are moving from single-turn interactions to persistent, long-horizon conversations. Memory failure is not an edge case — it is the central challenge.
Standard RAG systems produce incorrect answers on 25.35% of long-context queries — not "I don't know," but confident wrong answers. That is the most dangerous failure mode in production.
As conversation history grows past tens of thousands of tokens, retrieval systems start confusing similar facts, dropping constraints, and losing temporal order. SPM is designed for exactly this range.
When a question cannot be answered from memory, SPM-Core explicitly abstains rather than hallucinating. LHCSB T5 abstention accuracy: 100%. All comparison systems score below 2.5%.
// LHCSB benchmark
LHCSB (Long-Horizon Conversational State Benchmark) is an internal benchmark built to fill a gap in public evaluation coverage. Corpus derived from LoCoMo and MSC Personas, mechanically constructed, SHA-256 locked, seed=42 reproducible.
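A minimal sketch of what "SHA-256 locked, seed=42 reproducible" can look like in practice. The helper names and the sample corpus bytes are hypothetical; they illustrate the general technique and are not the LHCSB tooling itself.

```python
import hashlib
import random


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a corpus snapshot as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_lock(data: bytes, locked_digest: str) -> bool:
    """True only if the corpus bytes are identical to the locked version."""
    return sha256_hex(data) == locked_digest


def seeded_sample(items, k, seed=42):
    """Draw k items with a fixed seed so every run selects the same subset."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(list(items), k)


# Hypothetical corpus bytes, standing in for a real benchmark file.
corpus = b'{"id": 1, "question": "hypothetical example"}\n'
lock = sha256_hex(corpus)

same_corpus_ok = matches_lock(corpus, lock)            # unchanged data passes
tampered_ok = matches_lock(corpus + b" ", lock)        # any byte change fails
first_draw = seeded_sample(range(100), 5)
second_draw = seeded_sample(range(100), 5)             # identical to first_draw
```

Locking the digest catches silent changes to the test data, and the fixed seed makes any sampled split repeatable across machines running the same Python version.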
Test split: 291 samples (T4: 120, T5: 120, T6: 51), covering fact tracking, absence recognition, and temporal reasoning. SPM-Core primary score: 85.57%. Strongest comparison system (standard_rag): 41.58%.
T4, fact tracking: Cross-session exact fact retrieval. Derived from LoCoMo and MSC Personas corpora.
T5, absence recognition: Explicit abstention when a question cannot be answered. All comparison systems score below 2.5%.
T6, temporal reasoning: Relative-time expression inference, for questions like "three weeks ago."
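One way abstention could be scored alongside ordinary fact questions is sketched below. The `ABSTAIN` marker, the `Sample` record, and the exact-match rule are all assumptions made for illustration; the actual LHCSB scoring methodology is not described on this page.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

ABSTAIN = "<abstain>"  # hypothetical marker a system emits when it declines to answer


@dataclass
class Sample:
    task: str              # e.g. "T4", "T5", "T6"
    gold: Optional[str]    # None means the question is unanswerable from memory
    predicted: str


def is_correct(sample: Sample) -> bool:
    """Unanswerable questions are correct only when the system abstains."""
    if sample.gold is None:
        return sample.predicted == ABSTAIN
    # Assumed rule: case-insensitive exact match for answerable questions.
    return sample.predicted.strip().lower() == sample.gold.strip().lower()


def accuracy_by_task(samples: List[Sample]) -> Dict[str, float]:
    """Fraction of correct samples per task label."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for s in samples:
        totals[s.task] = totals.get(s.task, 0) + 1
        hits[s.task] = hits.get(s.task, 0) + int(is_correct(s))
    return {task: hits[task] / totals[task] for task in totals}


# Hypothetical data: one hallucinated answer on an unanswerable T5 question.
demo = [
    Sample("T4", "Lisbon", "lisbon"),
    Sample("T4", "Lisbon", "Porto"),
    Sample("T5", None, ABSTAIN),
    Sample("T5", None, "She moved in 2019."),
]
scores = accuracy_by_task(demo)
```

Under this rule a confident answer to an unanswerable question counts as an error, and abstaining on an answerable question does too.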
// Cross-benchmark results
Comparison systems: standard_rag, hybrid_rag, mem0_platform. All test sets use locked data versions with reproducible results.
| Benchmark | SPM-Core | Best baseline | System | Lead | Note |
|---|---|---|---|---|---|
| LHCSB | 85.57% | 41.58% | standard_rag | +43.99pp | T5 abstention 100% |
| LoCoMo 100q | 86.47% | 73.33% | standard_rag | +13.14pp | Multi-session recall |
| BABILong full-250 | 99.60% | 46.99% | hybrid_rag | +52.61pp | QA1–QA4 all pass |
| LongMemEval 100q | 73.00% | 52.00% | hybrid_rag | +21.00pp | Evidence recall 94.58% |
| RULER 720-case | 86.47% | 81.08% | standard_rag | +5.39pp | Full category coverage |
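The Lead column is the difference between the two scores in percentage points. The short check below recomputes it from the table values; `Decimal` is used only to avoid floating-point rounding noise, and the variable names are illustrative.

```python
from decimal import Decimal

# (benchmark, SPM-Core %, best baseline %) copied from the table above.
rows = [
    ("LHCSB", "85.57", "41.58"),
    ("LoCoMo 100q", "86.47", "73.33"),
    ("BABILong full-250", "99.60", "46.99"),
    ("LongMemEval 100q", "73.00", "52.00"),
    ("RULER 720-case", "86.47", "81.08"),
]


def lead_pp(spm: str, baseline: str) -> Decimal:
    """Lead in percentage points: SPM-Core score minus best baseline score."""
    return Decimal(spm) - Decimal(baseline)


leads = {name: lead_pp(spm, base) for name, spm, base in rows}

for name, lead in leads.items():
    print(f"{name}: +{lead}pp")
```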
// ROI estimate
SPM reduces error rates while cutting token consumption. Enter your usage volume to estimate real savings.
// Benchmark Credibility
Scenario math is anchored to the canonical live benchmark: 400 queries, 5 model backends, 13,998 successful calls, and average input-token measurements relative to full-history context.
// ROI Scenario Model
Use your current request volume and prompt-spend assumptions to estimate savings, annualized impact, and payback timing under Standard RAG and SPM memory systems.
// Selected Scenario
Monthly savings: $43,347 (8,669,472,893 tokens saved / month)
Annualized savings: $520,168 (directional annual prompt-spend delta)
Payback window: under 1 month on the current scenario
Projected reduction: 99.1% prompt-token reduction
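The scenario arithmetic can be sketched as follows. The page does not state an input-token price; the $5 per million tokens used here is an inference, chosen because it reproduces the displayed monthly and annualized figures from the displayed token count. The function names and the per-query example values are hypothetical.

```python
ASSUMED_USD_PER_MILLION_TOKENS = 5.0  # inferred from the displayed figures, not a stated price


def monthly_savings_usd(tokens_saved: int, usd_per_million: float) -> float:
    """Prompt spend avoided per month for a given number of saved input tokens."""
    return tokens_saved / 1_000_000 * usd_per_million


def tokens_saved_per_month(queries: int, full_history_tokens: int, memory_tokens: int) -> int:
    """Input tokens avoided when a memory system replaces full-history context."""
    return queries * (full_history_tokens - memory_tokens)


def prompt_token_reduction(full_history_tokens: int, memory_tokens: int) -> float:
    """Fraction of input tokens removed relative to full-history context."""
    return 1 - memory_tokens / full_history_tokens


# Displayed scenario: 8,669,472,893 tokens saved per month.
monthly = monthly_savings_usd(8_669_472_893, ASSUMED_USD_PER_MILLION_TOKENS)
annual = monthly * 12

# Hypothetical workload, for illustration only.
example_saved = tokens_saved_per_month(queries=1_000, full_history_tokens=10_000, memory_tokens=100)
example_reduction = prompt_token_reduction(10_000, 100)

print(f"monthly: ${monthly:,.0f}  annualized: ${annual:,.0f}")
```

With a different token price or query volume the savings scale linearly, so the same functions can be rerun with your own numbers.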
// System Compare
Use the same workload assumptions to compare savings, payback timing, and live quality indicators across Standard RAG and SPM systems.
| System | Live input tokens | Monthly savings | Payback | Pass rate | Exact accuracy | Description |
|---|---|---|---|---|---|---|
| Standard RAG | 208.8 | $43,232 | <1 mo | 70.60% | 93.70% | Mature retrieval baseline for evidence retrieval and answer generation. |
| SPM-Core | 162.2 | $43,347 | <1 mo | 83.73% | 91.22% | Lower-token memory system with the lowest wrong-answer rate in the canonical live run. |
| SPM Full | 263.5 | $43,096 | <1 mo | 91.00% | 97.63% | Quality reference for fact-sensitive workloads where wrong answers are costly. |
// Go deeper
Full result tables across five benchmarks, scoring methodology, and per-system comparisons.
What each test surface measures: LHCSB, LoCoMo, BABILong, RULER, and LongMemEval.
System architecture, API integration guide, and benchmark reproduction instructions.
// Next step
We accept evaluation partnerships, research collaborations, and technical integration inquiries. Bring your use case.