I Cut AI Memory Costs 88% Without Losing Accuracy — Here's Where It Still Fails
Building SemParity — semantic memory compression for LLMs
Every time you ask an AI a question in a long conversation, you're paying to send thousands of tokens the model mostly ignores.
With a smarter way to store and retrieve context, roughly 89% of the tokens in a 1,500-token conversation never need to be sent.
I built that smarter way. It's called SemParity, and the idea didn't come from a paper. It came from a LinkedIn post about how AWS S3 stores exabytes of data without keeping full copies of everything.
The results: 88-89% token reduction, accuracy matching RAG baseline, and only 59ms of extra latency. On real messy multi-session conversations, it outperforms the classic sliding window approach by 43 percentage points (50% vs 6.7%).
Here's exactly how it works, what it gets right, and where it still fails.
Where the Idea Actually Came From
I was scrolling LinkedIn and came across a post explaining erasure coding, the technique AWS S3 uses to protect exabytes of data without the cost of full replication. The breakdown was simple: split data into chunks, generate parity chunks using an algorithm, and if any block is lost, reconstruct it from the remaining blocks plus parity.
What stuck with me wasn't the storage system itself, but the underlying principle. If distributed storage could recover missing information without keeping full replicas everywhere, could a similar approach work for AI memory?
LLM context isn't exabytes of data on spinning disks. But the underlying cost problem is structurally similar: you don't want to pay to store and transmit full copies of everything, and you want a way to recover what's missing without keeping a complete replica around.
That question eventually evolved into SemParity. Not from a research lab or a published paper, but from seeing a systems concept in one domain and wondering whether the same idea could be applied to another.
The Problem Nobody Talks About Honestly
Everyone talks about how capable LLMs are. Nobody talks about the bill.
When your application has a long conversation with an LLM (a customer support thread, a coding session, a research chat), you pay for every token in the context window on every single request.
That means:
Turn 1: send 100 tokens. Pay for 100 tokens.
Turn 10: send 1,000 tokens. Pay for 1,000 tokens.
Turn 50: send 5,000 tokens. Pay for 5,000 tokens.
Most of those tokens are conversation history the model barely uses to answer your specific question. You're paying for tokens the model glances at and ignores.
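To make the growth concrete, here is a small sketch of the arithmetic. It assumes every turn adds a fixed 100 tokens and the full history is resent on each request, as in the turn list above. The function name and the `tokens_per_turn` parameter are illustrative, not part of SemParity.

```python
def cumulative_tokens_sent(turns: int, tokens_per_turn: int = 100) -> int:
    """Total input tokens billed when the whole history is resent on every turn."""
    # Turn k resends k * tokens_per_turn tokens, so the total is an arithmetic series.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


for turns in (1, 10, 50):
    print(turns, cumulative_tokens_sent(turns))
```

Under these assumptions, 50 turns bill 127,500 input tokens in total, even though the conversation itself is only 5,000 tokens long.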
The existing solutions are unsatisfying:
Truncation — just cut old context. You lose information permanently. The model forgets.
Summarization — compress old context into a summary. Better, but summaries lose specific details. "The meeting was productive" loses every fact from that meeting.
RAG (Retrieval Augmented Generation) — store chunks, retrieve the most relevant ones. Good, but it can only return what it finds. If a relevant chunk scores poorly in semantic search, it's silently dropped.
I wanted something different. Something that could reconstruct missing context rather than just retrieve existing context.
The SemParity Idea
What if instead of storing raw conversation text, you stored:
Semantic chunks — meaningful pieces of the conversation
Parity relationships — LLM-generated cues about how each chunk relates to its neighbors
Then at query time, instead of sending everything or retrieving by similarity alone, you could reconstruct missing chunks from the ones you did find.
Think of it like this:
Normal memory (RAG):
Query → find chunk 5 → send chunk 5 → answer
SemParity:
Query → find chunk 5 → check parity(5,6)
      → chunk 6 not retrieved? reconstruct it from chunk 5 + parity cue
      → send chunk 5 + reconstructed chunk 6 → better answer
RAG retrieves what exists in top-K results. SemParity reconstructs what is absent but inferable.
How I Built It
The system has 5 components:
1. Semantic Chunker
Splits text at semantic boundaries using sentence-transformers/all-MiniLM-L6-v2 (384-dimensional embeddings). Not fixed token counts — actual meaning boundaries. Each chunk is 20-150 tokens. spaCy extracts named entities from each chunk.
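One common way to find meaning boundaries is to compare the embeddings of adjacent sentences and start a new chunk where similarity drops. The sketch below shows that idea only. It is not necessarily how SemParity's chunker works: the 0.5 threshold is an assumed value, the 20-150 token size limits and the spaCy entity step are left out, and `toy_embed` is a stand-in so the example runs offline. With the real model, `embed` could wrap `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode`.

```python
import math
from typing import Callable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_at_semantic_boundaries(
    sentences: List[str],
    embed: Callable[[List[str]], List[Sequence[float]]],
    threshold: float = 0.5,  # assumed value, not taken from SemParity
) -> List[List[str]]:
    """Start a new chunk wherever adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    vectors = embed(sentences)
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([])  # similarity dropped: treat this as a topic boundary
        chunks[-1].append(sentences[i])
    return chunks


def toy_embed(sentences: List[str]) -> List[List[float]]:
    """Toy keyword-presence embedding, used only to keep the example offline."""
    vocab = ["turing", "test", "pricing", "plan"]
    return [[float(word in s.lower()) for word in vocab] for s in sentences]


sentences = [
    "Turing proposed a test.",
    "The Turing test is famous.",
    "The pricing plan changed.",
]
chunks = split_at_semantic_boundaries(sentences, toy_embed)
print(chunks)
```

With the toy embedding, the first two sentences land in one chunk and the pricing sentence starts a second one.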
2. Parity Generator
For every pair of adjacent chunks (A, B), calls an LLM with this prompt:
Given two text chunks, extract the minimal relational structure
that would allow reconstructing either chunk if it was lost.
Return ONLY valid JSON with these fields:
{
  "relationship_type": "causal|temporal|entity|descriptive|conditional",
  "shared_entities": [...],
  "reconstruct_a_from_b": "minimal cue to rebuild A if lost",
  "reconstruct_b_from_a": "minimal cue to rebuild B if lost",
  "confidence": 0.0-1.0
}
It also generates cross-chunk parity for semantically similar but non-adjacent chunks (cosine similarity ≥ 0.6). A 50-chunk document gets roughly 49 adjacent + ~10 cross-chunk parity blocks.
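Selecting the cross-chunk candidates can be sketched as a pairwise scan over chunk embeddings that skips adjacent pairs and keeps those at or above the 0.6 cosine threshold. The function name and the brute-force loop are assumptions for illustration. The actual implementation may select pairs differently.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_chunk_pairs(
    embeddings: List[Sequence[float]], min_similarity: float = 0.6
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of non-adjacent chunks that are similar enough."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        # j - i == 1 is an adjacent pair, which already gets its own parity block
        if j - i > 1 and cosine(embeddings[i], embeddings[j]) >= min_similarity:
            pairs.append((i, j))
    return pairs


embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
pairs = cross_chunk_pairs(embeddings)
print(pairs)
```

Here chunks 0 and 2 pair up, as do chunks 1 and 3. Adjacent parity always yields n - 1 blocks for n chunks, which is where the 49 for a 50-chunk document comes from. The scan compares every pair, so its cost grows quadratically with the number of chunks.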
3. Memory Store
ChromaDB stores chunks with their embeddings and parity blocks with their cues. Two collections: chunks and parity. Persistent across sessions.
4. Reconstructor
The core of the system. At query time:
# Step 1: fetch top-N chunks by semantic similarity
fetched_chunks = store.query_chunks(query, n_results=5)

# Step 2: build BIDIRECTIONAL parity map (key insight)
parity_map[(a_id, b_id)] = block
parity_map[(b_id, a_id)] = block  # both directions

# Step 3: for each fetched chunk, check its neighbors
for fetched in fetched_chunks:
    for neighbor_pos in [pos-1, pos+1]:
        if neighbor_not_fetched and parity_exists:
            # RECONSTRUCT the missing neighbor
            reconstructed = llm_reconstruct(fetched, parity_cue)
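Steps 2 and 3 can be made runnable without an LLM by separating the planning (which neighbors are missing and have parity) from the reconstruction call itself. This is a sketch under assumptions: parity blocks are plain dicts with `chunk_a` and `chunk_b` positions, and `build_parity_map` and `plan_reconstructions` are hypothetical names, not functions from the repository.

```python
from typing import Dict, List, Set, Tuple


def build_parity_map(blocks: List[dict]) -> Dict[Tuple[int, int], dict]:
    """Index each parity block under both (a, b) and (b, a)."""
    parity_map: Dict[Tuple[int, int], dict] = {}
    for block in blocks:
        a, b = block["chunk_a"], block["chunk_b"]
        parity_map[(a, b)] = block
        parity_map[(b, a)] = block  # reverse key, so a lookup from either side hits
    return parity_map


def plan_reconstructions(
    fetched_positions: Set[int], parity_map: Dict[Tuple[int, int], dict]
) -> List[Tuple[int, int]]:
    """Return (source, missing_neighbor) pairs that have a parity block."""
    plan = []
    for pos in sorted(fetched_positions):
        for neighbor in (pos - 1, pos + 1):
            if neighbor not in fetched_positions and (pos, neighbor) in parity_map:
                plan.append((pos, neighbor))
    return plan


blocks = [{"chunk_a": 4, "chunk_b": 5}, {"chunk_a": 5, "chunk_b": 6}]
parity_map = build_parity_map(blocks)
plan = plan_reconstructions({6}, parity_map)  # chunk 6 fetched, chunk 5 missing
print(plan)
```

With only chunk 6 fetched, the plan contains the pair (6, 5), which is the lookup that fails if the map is indexed in one direction only. A missing chunk with two fetched neighbors appears twice in the plan, so a real implementation would likely deduplicate.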
Reconstruction guards to prevent hallucination:
Cue must be ≥ 20 characters
Parity confidence must be ≥ 0.7
temperature=0, max_tokens=100
Output tagged [Reconstructed] so it's never mistaken for ground truth
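The first two guards and the tagging step can be sketched as small pure functions that run before and after the LLM call. The thresholds come from the list above. The function names, and the choice to put the tag at the start of the text, are assumptions for illustration.

```python
from typing import Optional

MIN_CUE_CHARS = 20
MIN_CONFIDENCE = 0.7
RECONSTRUCTED_TAG = "[Reconstructed]"


def usable_cue(cue: Optional[str], confidence: Optional[float]) -> bool:
    """Apply the two pre-call guards: cue length and parity confidence."""
    if not isinstance(cue, str) or len(cue.strip()) < MIN_CUE_CHARS:
        return False
    if not isinstance(confidence, (int, float)) or confidence < MIN_CONFIDENCE:
        return False
    return True


def tag_reconstruction(text: str) -> str:
    """Prefix reconstructed text so it can be told apart from stored chunks."""
    return f"{RECONSTRUCTED_TAG} {text.strip()}"


print(usable_cue("Turing proposed the imitation game in 1950", 0.9))
print(usable_cue("short cue", 0.9))
print(tag_reconstruction("  The imitation game later became the Turing test. "))
```

Checking types as well as values means a missing or malformed field from the parity JSON fails the guard instead of raising later.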
5. Evaluator
Compares full context vs RAG vs SemParity on the same questions with the same key-fact checking. No LLM-as-judge subjectivity — just "does the answer contain these specific facts?"
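A key-fact check of this kind can be as simple as case-insensitive substring matching. The sketch below assumes an answer counts as correct only when every key fact appears in it. The real evaluator may score partial matches differently, and the sample answers and facts are made up for illustration.

```python
from typing import Iterable, List


def contains_all_facts(answer: str, key_facts: Iterable[str]) -> bool:
    """True if every key fact appears in the answer, ignoring case."""
    normalized = answer.casefold()
    return all(fact.casefold() in normalized for fact in key_facts)


def accuracy(answers: List[str], facts_per_question: List[List[str]]) -> float:
    """Fraction of answers that contain all of their key facts."""
    if not answers:
        return 0.0
    correct = sum(
        contains_all_facts(answer, facts)
        for answer, facts in zip(answers, facts_per_question)
    )
    return correct / len(answers)


answers = ["The Dartmouth workshop took place in 1956.", "I am not sure."]
facts = [["Dartmouth", "1956"], ["perceptron"]]
score = accuracy(answers, facts)
print(score)
```

Substring matching is strict about wording, so a correct answer that paraphrases a fact is scored as a miss. That is the price of avoiding a subjective judge.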
The Results
I tested on three datasets of increasing difficulty.
Dataset 1: Structured Knowledge (AI History)
100+ sentences about AI history from the 1950s to modern LLMs.
| System | Accuracy | Tokens/query | Latency |
|---|---|---|---|
| Full context | 76.7% | 1,203 | 5,714ms |
| Sliding window (N=5) | 53.3% | low | fast |
| RAG baseline | 73.3% | 168 | 1,677ms |
| SemParity | 73.3% | 168 | 1,736ms |
SemParity matches RAG exactly at 73.3% accuracy while using 88.2% fewer tokens than full context. The reconstruction added 25 chunks that RAG simply missed. RAG reconstructed 0.
Dataset 2: Real Conversation (Startup Founder, 3 Sessions)
This is the hard test. A realistic 3-session conversation with a startup founder — full of pronouns, cross-session references, named entities, and multi-hop reasoning. Questions like "Why is hiring a data engineer more urgent than a backend engineer?" require connecting facts from Session 2 and Session 3.
| System | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| Full context | 100% | 66.7% | 91.7% | 83.3% |
| Sliding window (N=5) | 0.0% | 0.0% | 16.7% | 6.7% |
| RAG baseline | 77.8% | 33.3% | 44.4% | 46.7% |
| SemParity | 77.8% | 41.7% | 44.4% | 50.0% |
The sliding window result is the most telling: 6.7% accuracy on real conversations. It got 0% on Easy and Medium questions because those answers were in Sessions 1 and 2, and the window only sees the last 5 chunks. A recency-only window fails almost completely on multi-session conversations.
SemParity outperforms RAG by 3.3 percentage points overall, with the improvement concentrated in Medium questions where reconstruction fills gaps RAG misses.
140 chunks were actively reconstructed across the test session. RAG reconstructed 0.
Dataset 3: Technical Documentation (REST API Docs)
API documentation covering authentication, endpoints, rate limits, SDKs, pricing, webhooks.
| System | Accuracy | Compression |
|---|---|---|
| Full context | 80.0% | 0% |
| RAG baseline | 65.0% | 82.3% |
| SemParity | 60.0% | 82.3% |
Here SemParity underperforms RAG by 5 percentage points (60.0% vs 65.0%). Honest explanation: technical documentation has very precise terminology (OAuth 2.0, Bearer token, rate limit headers) where semantic reconstruction loses specificity that direct retrieval preserves. This is a real limitation, not a rounding error.
Latency
Full context baseline: 5,714ms
RAG baseline: 1,677ms
SemParity: 1,736ms ← only 59ms more than RAG
SemParity is 3.3x faster than full context and adds only 59ms over RAG. In production, that's negligible.
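The latency comparison is plain arithmetic on the three measurements above. A quick check, using only those reported numbers:

```python
full_context_ms = 5714
rag_ms = 1677
semparity_ms = 1736

overhead_ms = semparity_ms - rag_ms           # extra time spent on reconstruction
overhead_pct = 100 * overhead_ms / rag_ms     # the same overhead, relative to RAG
speedup = full_context_ms / semparity_ms      # how much faster than full context

print(f"overhead: {overhead_ms} ms ({overhead_pct:.1f}% over RAG)")
print(f"speedup vs full context: {speedup:.1f}x")
```

The relative overhead works out to about 3.5% of the RAG latency.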
The Ablation Study
I ran the system with different parity configurations to understand what each component contributes.
On conversational data:
| Configuration | Parity blocks | Reconstructed | Accuracy |
|---|---|---|---|
| Adjacent parity only | 62 | 111 | 56.7% |
| Cross-chunk only | 15 | 36 | 40.0% |
| Full system | 77 | 140 | 53.3% |
Interesting tension here: adjacent-only gets higher accuracy (56.7%) than the full system (53.3%) despite fewer reconstructions. Cross-chunk parity adds more reconstructed chunks but some of them introduce noisy context that hurts the QA step.
This is an honest finding. More reconstruction isn't always better. Cue quality matters more than reconstruction volume.
The 6 Silent Bugs I Found
Building this taught me more about LLM system failures than any tutorial could. Every one of these bugs returned None or 0 silently — no crashes, no error messages, just wrong results.
Bug 1: response_format not supported. I passed response_format={"type": "json_object"} to a model that doesn't support it. Every parity generation call raised an exception, an except Exception: return None handler swallowed it, and 0 parity blocks were stored. Lesson: never swallow exceptions in LLM code.
Bug 2: Unidirectional parity lookup. Parity was stored as (chunk_a=5, chunk_b=6). When chunk 6 was fetched and I looked up (chunk_6, neighbor_5), there was no match, so reconstruction never triggered. Lesson: undirected relationships need bidirectional indexes.
Bug 3: Storage cost vs query cost. The compression ratio came out negative because I counted all stored tokens instead of per-query tokens. For large datasets, parity tokens exceed original tokens if you measure storage instead of usage. Lesson: measure what matters to the user, not what's easy to count.
Bug 4: JSON wrapped in markdown fences. The LLM judge returned ```json {...} ``` despite instructions not to. The json.loads() failure was swallowed silently, and accuracy showed 0.0% for every question. Lesson: LLMs ignore formatting instructions ~20% of the time. Always strip fences.
Bug 5: Hallucinated reconstructions. Weak parity cues plus no confidence gate meant the LLM invented plausible-sounding but completely wrong content. One reconstruction invented an entirely fictional conversation that was never in the original text. Lesson: gate on confidence. Short cues mean the model guesses.
Bug 6: Entities as dicts instead of strings. ", ".join(shared_entities) crashed because the LLM returned [{"name": "Alan Turing", "type": "PERSON"}] instead of ["Alan Turing"]. Even with JSON mode, never trust the structure of LLM output. Lesson: defensive parsing everywhere.
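Bugs 1, 4, and 6 share a fix: parse defensively and fail loudly. The sketch below is one way to do that, not the code from the repository. Instead of stripping the fence literally, it takes the text between the first "{" and the last "}", which also tolerates stray prose around the JSON. It raises on failure rather than returning None. Both function names are hypothetical.

```python
import json
from typing import Any, List

FENCE = "`" * 3  # a markdown code fence, built indirectly for this example


def extract_json_object(raw: str) -> dict:
    """Parse the outermost {...} in an LLM reply; raise instead of returning None."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"no JSON object in LLM reply: {raw[:80]!r}")
    parsed = json.loads(raw[start : end + 1])  # JSONDecodeError propagates to the caller
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed


def normalize_entities(entities: Any) -> List[str]:
    """Accept ["Alan Turing"] or [{"name": "Alan Turing", ...}] and return names."""
    names: List[str] = []
    if not isinstance(entities, list):
        return names
    for item in entities:
        if isinstance(item, str):
            names.append(item)
        elif isinstance(item, dict) and isinstance(item.get("name"), str):
            names.append(item["name"])
    return names


reply = (
    FENCE + "json\n"
    '{"shared_entities": [{"name": "Alan Turing", "type": "PERSON"}, "ENIAC"]}\n'
    + FENCE
)
block = extract_json_object(reply)
joined = ", ".join(normalize_entities(block["shared_entities"]))
print(joined)
```

The fenced reply parses cleanly and the mixed entity list joins to "Alan Turing, ENIAC". A reply with no JSON object at all raises a ValueError that the caller has to handle explicitly.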
What It Can't Do Yet
I want to be honest about the limitations because the AI field has too much hype and not enough honesty.
Named entity disambiguation fails. When two similar companies appear in the same context, semantic search sometimes retrieves the wrong one. An entity index would fix this — it's next on the roadmap.
Cross-session references are hard. "Sandra said last week..." requires connecting one session to another. The current chunker has no session awareness. Session-aware chunking is the fix.
Technical documentation underperforms RAG by 5 percentage points. Precise terminology loses specificity in reconstruction cues. This is a fundamental tradeoff between compression and precision.
Parity generation is slow offline. 133 seconds for a 50-chunk document. This is a one-time cost — you generate parity when you store, not when you query — and it's parallelizable, but it's real.
The Honest Comparison to Existing Work
| System | Retrieves | Reconstructs | Compression |
|---|---|---|---|
| RAG | ✅ Top-K chunks | ❌ | ~85% |
| MemGPT | ✅ Memory tiers | ❌ | varies |
| Sliding window | ❌ Recency only | ❌ | varies |
| LongLLMLingua | ✅ Token pruning | ❌ (lossy) | ~70% |
| SemParity | ✅ Semantic search | ✅ Parity-based | 88-89% |
The reconstruction column is what makes SemParity different. RAG can't tell you what it missed. SemParity knows what's adjacent to what it found and can fill the gaps.
What I'd Build Differently
Session-aware chunking first. The single biggest accuracy improvement available. Add session IDs and timestamps to chunk metadata before doing anything else. It would have fixed several failures before they happened.
Tighter confidence gating. I set the threshold at 0.7 but should have started at 0.8. Every hallucination I saw came from low-confidence parity blocks that slipped through.
Parallel parity generation from day one. The 133-second generation time is embarrassing. It's trivially parallelizable with asyncio. I was building fast and left optimization for later.
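A minimal sketch of what the asyncio version could look like: one task per chunk pair, with a semaphore to cap concurrency. `fake_generate` stands in for the real LLM call so the example runs offline. The function names and the concurrency limit of 8 are assumptions, and a real limit would depend on the provider's rate limits.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def generate_all_parity(
    pairs: List[Tuple[str, str]],
    generate_one: Callable[[str, str], Awaitable[dict]],
    max_concurrency: int = 8,  # assumed value; tune to the API's rate limits
) -> List[dict]:
    """Run one parity call per chunk pair, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(a: str, b: str) -> dict:
        async with semaphore:
            return await generate_one(a, b)

    # gather preserves input order, so result i belongs to pairs[i]
    return await asyncio.gather(*(bounded(a, b) for a, b in pairs))


async def fake_generate(a: str, b: str) -> dict:
    """Stand-in for the real LLM call; sleeps instead of hitting an API."""
    await asyncio.sleep(0.01)
    return {"chunk_a": a, "chunk_b": b, "confidence": 0.9}


chunks = ["c0", "c1", "c2", "c3"]
adjacent_pairs = list(zip(chunks, chunks[1:]))
results = asyncio.run(generate_all_parity(adjacent_pairs, fake_generate))
print(len(results))
```

Because each parity block depends only on its own pair of chunks, the calls are independent, and wall-clock time is bounded by the rate limit instead of the number of pairs.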
The Roadmap
Four specific things that would close the accuracy gap on hard conversational data:
Entity index: direct entity_name → chunk_ids lookup, bypassing embedding similarity for proper nouns
Session-aware chunking: session ID and timestamp in every chunk's metadata
Query decomposer: break multi-hop questions into atomic sub-queries, merge results
Adaptive n_results: heuristic complexity scoring to fetch 3 chunks for simple questions, 7-10 for complex ones
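The entity index is not built yet, so the following is only a sketch of what the first roadmap item might look like: an inverted index from lowercased entity name to the IDs of chunks that mention it. The chunk IDs, entity lists, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_entity_index(chunks: Iterable[Tuple[str, List[str]]]) -> Dict[str, Set[str]]:
    """Map each entity name (case-insensitive) to the chunk IDs that mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for chunk_id, entities in chunks:
        for name in entities:
            index[name.casefold()].add(chunk_id)
    return dict(index)


def lookup(index: Dict[str, Set[str]], query_entities: Iterable[str]) -> Set[str]:
    """Return every chunk ID that mentions at least one of the query entities."""
    hits: Set[str] = set()
    for name in query_entities:
        hits |= index.get(name.casefold(), set())
    return hits


chunks = [("c1", ["Sandra", "Acme"]), ("c2", ["Acme"]), ("c3", ["Sandra"])]
index = build_entity_index(chunks)
hits = lookup(index, ["sandra"])
print(sorted(hits))
```

An exact-match lookup like this would return every chunk naming "Sandra" regardless of how the chunk scores in embedding search. It would still need alias handling ("Sandra" vs a full name) to help with disambiguation.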
How to Run It
git clone https://github.com/PuneetKumar1790/SemParity
cd SemParity/semparity
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Create .env:
GROQ_API_KEY=your_key_here
Run:
python main.py # full pipeline + Gradio UI at localhost:7860
python tests/test_e2e.py # verify everything works
What I Learned
Here's what I'd tell anyone attempting something similar:
Silent failures are the real enemy. Every bug I hit returned a plausible-looking result with wrong data. Nothing crashed. The system looked like it was working. Defensive logging and explicit validation from the start would have saved days.
Test on hard data early. I only tested on clean structured data for the first few days. The clean data showed 100% accuracy. The hard conversational dataset showed 50%. The gap between those two numbers is where the real research is.
Honest benchmarks build more trust than impressive numbers. I could have only shown the 100% accuracy result on structured data. Instead I showed 6.7% for sliding window, 50% for hard conversational queries, and 5 percentage points below RAG on technical docs. Those honest numbers are what make the 89% compression result credible.
The erasure coding analogy is approximate, not exact. Real erasure coding gives mathematical recovery guarantees. SemParity gives heuristic approximations. The analogy is useful for intuition but not precise. Being clear about this distinction matters.
What's Next
I'm writing a formal research paper on this. It will include mathematical notation for chunks, parity, and reconstruction; formal comparisons against RAG, MemGPT, LongLLMLingua, and RAPTOR; additional datasets; and results once the entity index and session-aware chunking are implemented.
If you work somewhere spending serious money on LLM API costs and want to talk, or if you find a bug, or you have a better idea for reconstruction, I want to hear it.
The code is messy in places. The results are honest. The idea is real.
Built with: Python, ChromaDB, sentence-transformers, spaCy, Groq API, Gradio
Benchmarks run on: Groq llama-3.1-8b-instant

