SPM-Core · Formally evaluated

AI memory that stays accurate at scale.

StellarPath Memory System (SPM) addresses the core memory failure modes in long-horizon AI conversations: wrong recall, fact confusion, and the inability to recognize unanswerable questions. SPM-Core was evaluated on five benchmarks (four public, plus the internal LHCSB) and leads every comparison system on each one.

// Five-benchmark results

All leading

LHCSB primary score: 85.57% (+43.99pp vs RAG)

BABILong full-250: 99.60% (+52.61pp vs RAG)

T5 abstention accuracy: 100% (RAG ≤ 2.5%)

RULER 720-case: 86.47% (+5.39pp vs RAG)

LongMemEval 100q: 73.00% (+21.00pp vs RAG)

SPM-Core · 5 benchmarks · 1,461 samples

wrong-rate 0.75% vs 25.35%

evidence recall 94.58%

// What we solve

The memory failures that matter in production

AI applications are moving from single-turn interactions to persistent, long-horizon conversations. Memory failure is not an edge case — it is the central challenge.

◈

AI remembers — but remembers wrong

Standard RAG systems produce incorrect answers on 25.35% of long-context queries — not "I don't know," but confident wrong answers. That is the most dangerous failure mode in production.

◉

The longer the context, the worse it gets

As conversation history grows past tens of thousands of tokens, retrieval systems start confusing similar facts, dropping constraints, and losing temporal order. SPM is designed for exactly this range.

◎

Refusing is more valuable than guessing

When a question cannot be answered from memory, SPM-Core explicitly abstains rather than hallucinating. LHCSB T5 abstention accuracy: 100%. All comparison systems score below 2.5%.
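To make the difference between abstaining and guessing concrete, here is a minimal sketch of how answers might be tallied when some questions have no answer in memory. The outcome labels, the `ABSTAIN` sentinel, and the sample data are illustrative assumptions; this is not SPM-Core's evaluation harness.

```python
from collections import Counter

ABSTAIN = None  # hypothetical sentinel meaning "the system declined to answer"

def score(prediction, gold):
    """Classify one answer. gold is None when the question is unanswerable."""
    if prediction is ABSTAIN:
        return "correct_abstain" if gold is None else "missed"
    if gold is None:
        return "hallucinated"  # answered a question that has no answer
    return "correct" if prediction == gold else "wrong"

def abstention_accuracy(pairs):
    """Share of unanswerable questions on which the system abstained."""
    unanswerable = [(p, g) for p, g in pairs if g is None]
    if not unanswerable:
        return 0.0
    hits = sum(1 for p, g in unanswerable if score(p, g) == "correct_abstain")
    return hits / len(unanswerable)

# Made-up sample of (prediction, gold) pairs.
pairs = [("Paris", "Paris"), (ABSTAIN, None), ("blue", None), ("2019", "2021")]
tally = Counter(score(p, g) for p, g in pairs)
acc = abstention_accuracy(pairs)
print(dict(tally), acc)
```

Keeping "hallucinated" and "wrong" as separate buckets from "missed" reflects the point above: a confident wrong answer is counted differently from a declined one.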

// LHCSB benchmark

A stress test designed for long-horizon conversational state

LHCSB (Long-Horizon Conversational State Benchmark) is an internal benchmark built to fill a gap in public evaluation coverage. Its corpus is derived from LoCoMo and MSC Personas, constructed mechanically, locked with a SHA-256 hash, and reproducible with seed=42.
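As an illustration of what hash-locking and a fixed seed buy, the sketch below builds a split deterministically and checks it against a stored digest. The helper names (`build_split`, `verify`) and the stand-in corpus are assumptions made for this example; the actual LHCSB construction pipeline is not described here.

```python
import hashlib
import random

def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def build_split(items, seed=42):
    """Shuffle deterministically with a local RNG so global state is untouched."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled

def verify(split, expected_digest):
    """True if the split serializes to exactly the locked digest."""
    return sha256_of("\n".join(split).encode("utf-8")) == expected_digest

corpus = [f"sample-{i}" for i in range(10)]  # stand-in data, not LHCSB
split_a = build_split(corpus)
split_b = build_split(corpus)  # same seed, so the same order

# In practice the digest would be stored alongside the released dataset.
locked_digest = sha256_of("\n".join(split_a).encode("utf-8"))
print(split_a == split_b, verify(split_b, locked_digest))
```

Any change to the data or its order produces a different digest, so a mismatch signals that results were not computed on the locked version.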

Test split: 291 samples (T4: 120, T5: 120, T6: 51), covering fact tracking, absence recognition, and temporal reasoning. SPM-Core primary score: 85.57%. Strongest comparison system (standard_rag): 41.58%.

Full LHCSB results →

T4 · Fact tracking

74.17%

Cross-session exact fact retrieval. Derived from LoCoMo and MSC Personas corpora.

T5 · Absence recognition

100%

Explicit abstention when a question cannot be answered. All comparison systems score below 2.5%.

T6 · Temporal reasoning

78.43%

Relative-time expression inference — questions like "three weeks ago."
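The per-task scores and the 85.57% primary score are consistent with a sample-weighted average over the 291 test samples. The sketch below checks that arithmetic. The correct-answer counts (89, 120, 40) are inferred from the published percentages and split sizes, and the assumption that the primary score is a sample-weighted average is ours; neither is stated explicitly.

```python
# (correct, total) per task. The correct counts are inferred from the
# published percentages and split sizes, not taken from a results file.
tasks = {
    "T4": (89, 120),   # fact tracking
    "T5": (120, 120),  # absence recognition
    "T6": (40, 51),    # temporal reasoning
}

per_task = {name: round(100 * c / n, 2) for name, (c, n) in tasks.items()}
total_correct = sum(c for c, _ in tasks.values())
total_samples = sum(n for _, n in tasks.values())
primary = round(100 * total_correct / total_samples, 2)

print(per_task)
print(total_samples, primary)
```

Under these assumptions the per-task values round to 74.17, 100.0, and 78.43, and the overall value rounds to 85.57 over 291 samples.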

// Cross-benchmark results

Five formal test sets. Leading on every one.

Comparison systems: standard_rag, hybrid_rag, mem0_platform. All test sets use locked data versions with reproducible results.

Benchmark	SPM-Core	Best baseline	System	Lead	Note
LHCSB	85.57%	41.58%	standard_rag	+43.99pp	T5 abstention 100%
LoCoMo 100q	86.47%	73.33%	standard_rag	+13.14pp	Multi-session recall
BABILong full-250	99.60%	46.99%	hybrid_rag	+52.61pp	QA1–QA4 all pass
LongMemEval 100q	73.00%	52.00%	hybrid_rag	+21.00pp	Evidence recall 94.58%
RULER 720-case	86.47%	81.08%	standard_rag	+5.39pp	Full category coverage
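The Lead column is the difference in percentage points between SPM-Core and the best baseline, and the sample counts sum to the 1,461 quoted elsewhere on this page. A short sketch of that check follows. The per-benchmark sample counts for LoCoMo, BABILong, LongMemEval, and RULER are read off the benchmark names (100q, full-250, 100q, 720-case), which is an assumption.

```python
# name: (SPM-Core %, best baseline %, sample count)
# Sample counts other than LHCSB's 291 are inferred from the benchmark names.
rows = {
    "LHCSB": (85.57, 41.58, 291),
    "LoCoMo 100q": (86.47, 73.33, 100),
    "BABILong full-250": (99.60, 46.99, 250),
    "LongMemEval 100q": (73.00, 52.00, 100),
    "RULER 720-case": (86.47, 81.08, 720),
}

# Lead in percentage points, rounded to match the table's two decimals.
leads = {name: round(spm - base, 2) for name, (spm, base, _) in rows.items()}
total_samples = sum(n for _, _, n in rows.values())

for name, lead in leads.items():
    print(f"{name}: +{lead:.2f}pp")
print("total samples:", total_samples)
```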

LHCSB vs standard_rag: SPM-Core 85.57%, baseline 41.58%

BABILong vs hybrid_rag: SPM-Core 99.60%, baseline 46.99%

LongMemEval vs hybrid_rag: SPM-Core 73.00%, baseline 52.00%

// ROI estimate

Translate benchmark evidence into budget language.

SPM reduces error rates while cutting token consumption. Enter your usage volume to estimate real savings.

// Benchmark Credibility

Scenario math is anchored to the canonical live benchmark: 400 queries, 5 model backends, 13,998 successful calls, and average input-token measurements relative to full-history context.

400 queries · 5 models · 13,998 calls · standard RAG baseline

// ROI Scenario Model

Turn benchmark evidence into a budget scenario.

Use your current request volume and prompt-spend assumptions to estimate savings, annualized impact, and payback timing under Standard RAG and SPM memory systems.

Inputs: monthly requests, current avg prompt tokens, prompt cost per 1M tokens, implementation budget

Primary selection

Lower-token memory system with the lowest wrong-answer rate in the canonical live run.

// Selected Scenario

Monthly savings

$43,347

8,669,472,893 tokens saved / month

Annualized savings

$520,168

directional annual prompt-spend delta

Payback window

<1 mo on the current scenario

Projected reduction

99.1%

prompt-token reduction

Selected operating point detail

Current monthly prompt spend: $43,750

Projected monthly prompt spend: $403

Projected avg prompt tokens: 161.1

Live pass rate: 83.73%

Live exact value accuracy: 91.22%
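The scenario arithmetic can be sketched as follows. The page does not state the inputs behind this scenario, so the request volume (500,000 per month), current average prompt size (17,500 tokens), price ($5.00 per 1M tokens), and implementation budget ($20,000) are hypothetical values chosen so that current spend comes out at $43,750. Only the projected average of 161.1 tokens is taken from the figures above. The `Scenario` class is likewise an illustration, not the calculator's actual code.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    monthly_requests: int
    current_avg_prompt_tokens: float
    projected_avg_prompt_tokens: float
    price_per_million_tokens: float
    implementation_budget: float

    def spend(self, avg_tokens: float) -> float:
        """Monthly prompt spend in dollars for a given average prompt size."""
        tokens = self.monthly_requests * avg_tokens
        return tokens / 1_000_000 * self.price_per_million_tokens

    def monthly_savings(self) -> float:
        return self.spend(self.current_avg_prompt_tokens) - self.spend(
            self.projected_avg_prompt_tokens
        )

    def reduction(self) -> float:
        """Fractional prompt-token reduction."""
        return 1 - self.projected_avg_prompt_tokens / self.current_avg_prompt_tokens

    def payback_months(self) -> float:
        return self.implementation_budget / self.monthly_savings()

# Hypothetical inputs, except projected_avg_prompt_tokens (161.1, from the page).
s = Scenario(
    monthly_requests=500_000,
    current_avg_prompt_tokens=17_500,
    projected_avg_prompt_tokens=161.1,
    price_per_million_tokens=5.00,
    implementation_budget=20_000,
)

print(f"current spend:   ${s.spend(s.current_avg_prompt_tokens):,.0f}")
print(f"monthly savings: ${s.monthly_savings():,.0f}")
print(f"annualized:      ${12 * s.monthly_savings():,.0f}")
print(f"reduction:       {100 * s.reduction():.1f}%")
print(f"payback:         {s.payback_months():.2f} months")
```

Because these inputs are rounded guesses, the token and annualized totals will differ slightly from the figures displayed above, which presumably come from unrounded values.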

// System Compare

Side-by-side scenario view

Use the same workload assumptions to compare savings, payback timing, and live quality indicators across Standard RAG and SPM systems.

System	Live input tokens	Monthly savings	Payback	Pass rate	Exact accuracy	Note
Standard RAG	208.8	$43,232	<1 mo	70.60%	93.70%	Mature retrieval baseline for evidence retrieval and answer generation.
SPM-Core	162.2	$43,347	<1 mo	83.73%	91.22%	Lower-token memory system with the lowest wrong-answer rate in the canonical live run.
SPM Full	263.5	$43,096	<1 mo	91.00%	97.63%	Quality reference for fact-sensitive workloads where wrong answers are costly.

// Go deeper

Start with the evidence, then read outward.

Evaluation results

Dataset coverage

Technical docs

// Next step

Ready to see how SPM fits your system?

We accept evaluation partnerships, research collaborations, and technical integration inquiries. Bring your use case.