I Gave an LLM a Holographic Brain — Here’s What Actually Worked (and What Didn’t)

We have a memory problem in AI. If you’ve built applications with Large Language Models, you’ve hit the wall: the context window. The moment your conversation exceeds the token limit, the model forgets. The beginning of the chat falls off a cliff, and the AI loses who you are. The industry’s standard fix is RAG (Retrieval Augmented Generation). We chop up data, embed it, and stuff it into a vector database like Pinecone or Chroma. It works. But RAG has a hidden cost: linear growth. To store 1 million facts, you need 1 million vectors. Your database grows with every interaction.

I started asking a different question:

Can we compress memory — not into a list of files, but into a single, fixed-size mathematical object that never grows?

It turns out, the answer was published in 1995. A researcher named Tony Plate described Holographic Reduced Representations (HRR) — a way to store thousands of key-value associations inside a single vector using circular convolution. The idea was elegant but largely forgotten, buried under the neural network revolution.

I decided to resurrect it. I built a Python agent that gives any LLM a “Holographic Brain” — a fixed-size associative memory that stores facts in superposition, like light waves in a hologram. This is the story of what worked, what failed catastrophically, and what I learned debugging it.

The Core Idea: Memory as a Wave, Not a File

In standard computing, memory is like a filing cabinet. If you have 100 documents, you need 100 folders.

In holographic memory, it’s like a choir. One singer produces one note. A hundred singers produce a complex chord. The room didn’t get bigger, but it now carries 100 distinct pieces of information simultaneously.

This works because of a remarkable property of high-dimensional spaces: random vectors are almost perfectly orthogonal. In a 4,096-dimensional vector space, any two random vectors have a cosine similarity of approximately 0.0. They don't interfere with each other. You can pile thousands of them on top of each other, and each one remains individually recoverable (approximately, with a little noise), just like a hologram encodes a full 3D image in a flat plate.

The mathematical operations are:

Binding (storing a key-value pair):

\[\text{bind}(k, v) = \text{IFFT}(\text{FFT}(k) \cdot \text{FFT}(v))\]

This is circular convolution — it creates a new vector that is dissimilar to both $k$ and $v$ individually, but encodes their association.

Unbinding (retrieving a value from a key):

\[\text{unbind}(k, \text{trace}) = \text{IFFT}\left(\frac{\text{FFT}(\text{trace}) \cdot \overline{\text{FFT}(k)}}{|\text{FFT}(k)|^2}\right)\]

This is the approximate inverse: given a key and a superimposed memory trace, it recovers the value, plus a small amount of noise contributed by the other stored facts.

Superposition (storing multiple facts in one vector):

\[M = \text{bind}(k_1, v_1) + \text{bind}(k_2, v_2) + \ldots + \text{bind}(k_n, v_n)\]

All facts exist simultaneously in a single vector $M$. No list. No database. Just one 4,096-dimensional vector.
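
To make this concrete, here is a minimal, self-contained sketch of all three operations in plain NumPy (the three-fact setup and variable names are mine, chosen for illustration):

import numpy as np

d = 4096
rng = np.random.default_rng(0)

def rand_vec():
    # Random unit vector; any two of these are quasi-orthogonal at d = 4096
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def bind(k, v):
    # Circular convolution via the FFT
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(v)))

def unbind(k, trace):
    # Approximate inverse of bind()
    fft_k = np.fft.fft(k)
    return np.real(np.fft.ifft(np.fft.fft(trace) * np.conj(fft_k) / (np.abs(fft_k)**2 + 1e-8)))

# Superpose three facts in a single vector M
keys = [rand_vec() for _ in range(3)]
vals = [rand_vec() for _ in range(3)]
M = sum(bind(k, v) for k, v in zip(keys, vals))

# Decode fact 0: the result is noisy, but clearly closest to vals[0]
decoded = unbind(keys[0], M)
sims = [float(np.dot(decoded, v)) / np.linalg.norm(decoded) for v in vals]
print(sims)  # sims[0] is by far the largest; the other two hover near 0.0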


The Architecture: How HoloMem Actually Works

The system sits between the user and an LLM (I used DeepSeek V3 via OpenRouter, but any model works). It has three layers:

Layer 1: Fact Extraction

When a user says something declarative like “Alice is the CEO”, the LLM extracts a structured key-value pair:

"Alice is the CEO." → Key: "Who is Alice?" | Value: "The CEO"

Layer 2: Holographic Storage

This is where it gets interesting. Each key-value pair is encoded into the memory shard using two operations:

association = bind(random_id(key), random_id(value))
shard = shard + association

The shard is a single 4,096-dimensional vector. Each new fact is added to it via pure vector addition. The shard never gets “bigger”: it is always exactly 4,096 floats (32KB of RAM at 64-bit precision).
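
The random_id function deserves a note: it has to map the same text to the same vector every time, while keeping different texts quasi-orthogonal. Here is a sketch of the deterministic generator the system relies on (SHA-256 hash → RNG seed; the exact byte-slicing details are my illustration):

import hashlib
import numpy as np

def random_id(text: str, d: int = 4096) -> np.ndarray:
    # Deterministic: the same text always maps to the same unit vector
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    v = np.random.default_rng(seed).standard_normal(d)
    return v / np.linalg.norm(v)

# Different strings come out quasi-orthogonal, exactly what HRR needs
print(np.dot(random_id("Alice"), random_id("The CEO")))  # ≈ 0.0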

Layer 3: Two-Stage Retrieval

When the user asks a question, retrieval happens in two stages:

  1. Semantic Search: The question is embedded and matched against stored key embeddings via FAISS to find the closest known question.
  2. Holographic Decode: The matched key’s random ID is used to unbind from the memory shard, recovering the value’s random ID, which is then matched back to the stored answer text.
"Who is Alice?"
  → FAISS finds closest key: "Who is Alice?" (score: 1.0)
  → Unbind random_id("Who is Alice?") from shard
  → Decoded trace matches random_id("The CEO")
  → Return: "The CEO"
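
In code, the two stages look roughly like this (a simplified sketch reusing bind, unbind, and random_id from the snippets above; my real index wraps more bookkeeping):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

stored_keys = ["Who is Alice?", "Where does the CEO live?"]
stored_values = ["The CEO", "London"]
shard = sum(bind(random_id(k), random_id(v))
            for k, v in zip(stored_keys, stored_values))

# Stage 1: semantic search (inner product on normalized vectors = cosine)
key_embs = encoder.encode(stored_keys, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(key_embs.shape[1])
index.add(key_embs)

query = encoder.encode(["Who is Alice?"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 1)
matched_key = stored_keys[ids[0][0]]

# Stage 2: holographic decode, then match the noisy trace back to a value ID
decoded = unbind(random_id(matched_key), shard)
best = max(stored_values, key=lambda v: float(np.dot(decoded, random_id(v))))
print(best)  # "The CEO"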

Multi-Hop Reasoning

The most exciting capability is chained retrieval. If I teach the system:

  • “Alice is the CEO.”
  • “The CEO lives in London.”

And then ask: “Who is Alice?”

The system:

  1. Queries “Who is Alice?” → finds “The CEO”
  2. Notices “The CEO” appears as a concept in another stored key
  3. Automatically follows the link: “Where does the CEO live?” → “London”
  4. Returns: “The CEO (Linked: London)”

It traverses the knowledge graph autonomously, up to 3 hops deep. No graph database. No explicit edges. Just the interference patterns in the holographic vectors.
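
The traversal loop is roughly this shape (a sketch: retrieve() stands in for the two-stage lookup above, and the real code's link detection and output formatting differ):

def answer_with_links(question: str, max_hops: int = 3) -> str:
    first = answer = retrieve(question)   # two-stage lookup from Layer 3
    links, seen = [], {question}
    for _ in range(max_hops):
        # Greedy: follow the first stored key that mentions the current answer
        next_key = next((k for k in stored_keys
                         if answer.lower() in k.lower() and k not in seen), None)
        if next_key is None:
            break
        seen.add(next_key)
        answer = retrieve(next_key)
        links.append(answer)
    return f"{first} (Linked: {', '.join(links)})" if links else first

# answer_with_links("Who is Alice?") → "The CEO (Linked: London)"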


The Journey: From 5% Recall to 100%

This is the part that doesn’t usually make it into blog posts. I’m including it because I think the debugging process is more instructive than the final result.

Attempt 1: Naive Implementation — 10% Recall

My first implementation used semantic embeddings (from all-MiniLM-L6-v2) directly as both keys and values for the HRR operations. The benchmark result:

10 facts → 10% accuracy
20 facts →  5% accuracy
⚠️ Accuracy collapsed.

Catastrophic failure. The holographic math was supposed to handle hundreds of facts. What went wrong?

The Diagnosis

I wrote a diagnostic script that measured the cosine similarity between key vectors:

Semantic keys:
  cos("Where does Person_0 live?", "Where does Person_1 live?") = 0.7644
  cos("Where does Person_0 live?", "Where does Person_50 live?") = 0.7144

Random ID keys:
  cos(Person_0, Person_1) = -0.0241
  cos(Person_0, Person_50) = -0.0062

There it was. Semantic embeddings of similar sentences are NOT orthogonal. “Where does Person_0 live?” and “Where does Person_1 live?” have a cosine similarity of 0.76. HRR requires quasi-orthogonal keys to avoid cross-talk — when you unbind with key $k_0$, the result is:

\[v_0 + \underbrace{\cos(k_0, k_1) \cdot v_1 + \cos(k_0, k_2) \cdot v_2 + \ldots}_{\text{cross-talk noise, each term } \approx 0.76 \cdot v_i}\]

With 10 facts, the noise terms (each at 76% strength) overwhelm the signal. The correct answer is buried.

The same diagnostic revealed something else: diverse natural-language questions work perfectly:

Diverse questions (naturally dissimilar):
  cos("Who is Alice?", "Where does the CEO live?") = 0.1483
  Recall: 10/10 = 100%

The HRR math was correct all along. The problem was the embedding similarity.
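
The whole diagnostic fits in a few lines; something along these lines reproduces the numbers above (random_id as sketched earlier):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b = model.encode(["Where does Person_0 live?", "Where does Person_1 live?"],
                    normalize_embeddings=True)
print(float(np.dot(a, b)))  # ~0.76: nowhere near orthogonal

print(float(np.dot(random_id("Person_0"), random_id("Person_1"))))  # ~0.00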

The Fix: Hybrid Architecture

The insight: use the right tool for each job.

  • Semantic embeddings are good at understanding what a question means (for matching user queries to stored keys)
  • Random ID vectors are good at being orthogonal (for HRR binding and unbinding)

So I split the architecture: every key and every value gets two vectors:

             Semantic Embedding                Random ID (SHA-256 → RNG)
Purpose      FAISS query matching              HRR bind/unbind
Property     Captures meaning                  Guarantees orthogonality
Similarity   “Person_0” ≈ “Person_1” (0.76)    “Person_0” ⊥ “Person_1” (~0.00)

Storage:   shard += bind(random_id(key), random_id(value))
Retrieval: FAISS finds the semantic key → looks up its random ID → unbinds from shard

The Result: 100% Recall at 2,600 Facts

============================================================
 CAPACITY BENCHMARK: Storing up to 2,600 facts
============================================================

   10 facts → 100.0% recall
  100 facts → 100.0% recall
  500 facts → 100.0% recall
 1000 facts → 100.0% recall
 2000 facts → 100.0% recall
 2500 facts → 100.0% recall | avg conf: 0.504
 2600 facts → 100.0% recall | avg conf: 0.505  (52 shards)

  ✓ Usable capacity (≥80% recall): 2,600+ facts and counting

2,600 facts across 52 auto-rotated shards. 100% recall. Zero failures. Confidence rock-steady at ~0.50.

Each shard holds 50 facts in a single 4,096-dimensional vector — 32KB of RAM per shard. The entire 2,600-fact memory takes 52 shards × 32KB = ~1.6MB total. A comparable vector database storing the same facts would need 5,200 separate vectors (key + value each): roughly 68MB. That’s a ~42× compression ratio with no loss in recall.


Storage Comparison

Storage Method           2,600 Facts             10,000 Facts              Growth
Vector Database (RAG)    5,200 vectors (~68MB)   20,000 vectors (~260MB)   Linear: $O(n)$
HoloMem                  52 shards (~1.6MB)      200 shards (~6.4MB)       Linear in shards, $O(1)$ per shard
Compression Ratio        ~42×                    ~40×                      Roughly constant

The per-shard storage is $O(1)$ — a fixed 32KB regardless of how many facts it holds (up to capacity). Total storage grows linearly in shard count, but with a ~40–50× constant-factor advantage over naive vector storage.
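
For the record, the arithmetic behind those figures (assuming 64-bit floats, which is what the 32KB number implies):

DIM, BYTES_PER_FLOAT = 4096, 8           # one float64 per dimension
shard_bytes = DIM * BYTES_PER_FLOAT      # 32,768 bytes = 32 KB per shard
total_bytes = 52 * shard_bytes           # 1,703,936 bytes ≈ 1.6 MB for 2,600 facts
print(shard_bytes, total_bytes)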


What This Is NOT

Let me be precise about what I’m not claiming:

This is not a replacement for RAG. RAG retrieves from large document corpora using semantic search. HoloMem stores discrete facts in compressed form. They solve different problems.

This is not infinite storage. Each shard has a theoretical capacity of $\sqrt{d} \approx 64$ facts (we cap at 50 for safety). Total storage still grows linearly — just with a much smaller constant.

This is not a standalone memory system. It still relies on an LLM for fact extraction (turning natural language into key-value pairs) and for presenting answers. The holographic layer handles storage and retrieval only.

The benchmark uses synthetic data. “Person_0 → City_0” is a controlled test. Real-world facts will have more varied structure, which actually helps the system (more diverse keys = better orthogonality).


Honest Limitations

1. You Can’t Delete a Memory

In a hologram, every memory is mathematically entangled with every other memory in the same shard. You can't surgically remove “Alice is the CEO”: you'd have to subtract bind(key, value) exactly, which requires the original vectors. (Because the IDs are deterministic, that subtraction is possible in principle while the original key and value text is retained; once the text is gone, it isn't.) For GDPR “right to be forgotten” compliance, this is a genuine problem. The workaround is to rebuild the shard without the unwanted fact, but that's expensive.

2. Shard Capacity is Bounded (But Sharding Solves It)

The theoretical limit is $\sqrt{d}$ facts per shard. For $d = 4096$, that’s ~64. We set it to 50 for safety. Auto-rotation means we’ve tested up to 52 shards (2,600 facts) with zero degradation, so total system capacity is practically unbounded — but you pay for it in shard count. You can increase per-shard capacity by increasing dimensionality (e.g., $d = 16384$ gives ~128 facts per shard), reducing the shard count needed.

3. The System Requires Exact Key Matching

If the user asks “Tell me about Alice” and the stored key is “Who is Alice?”, retrieval depends on the semantic similarity between these two sentences being high enough for FAISS to find the match. Highly ambiguous or novel phrasings may miss. This is the same limitation RAG has.

4. Multi-Hop Reasoning is Greedy

The chain traversal (“Alice → CEO → London”) works by string-matching the current answer against stored keys. It’s greedy (follows the first strong link) and can’t do branching search. A more sophisticated graph traversal would improve this.


Why This Matters (and Where It’s Going)

The Biological Parallel

The human brain doesn’t store memories as files in folders. Neuroscience increasingly suggests that memories are stored as distributed interference patterns across neural populations — remarkably similar to holographic superposition. HRR was originally inspired by this observation.

What I’ve built is a crude computational analogue: a system where memories exist as superimposed patterns in a fixed-size vector space, decoded via resonance rather than lookup. It’s not biologically accurate, but it follows the same architectural principle.

Practical Applications

  1. Edge AI: A holographic memory fits in kilobytes. An agent running on a phone or IoT device could maintain persistent memory without cloud storage.

  2. Privacy-Preserving Memory: The shard vectors are uninterpretable; you can't read them without the decoding apparatus. This is “encrypted by default” in spirit, though not in any cryptographic sense.

  3. Long-Running Agents: LLM agents that operate over days or weeks could use holographic memory as a persistent scratchpad that never hits a context limit.

  4. Compression Layer for RAG: HoloMem could sit as a “L1 cache” in front of a vector database, holding frequently-accessed facts in compressed form.

Future Directions

  • Scaling experiments: We’ve confirmed 2,600 facts at 100% recall; the next milestone is 10K+ using higher dimensionality ($d = 16384$ or $d = 65536$)
  • Interference analysis: Store contradictory facts (“Alice is the CEO” / “Bob is the CEO”) and study resolution strategies
  • Rule-based extraction: Remove the LLM dependency for fact extraction using spaCy dependency parsing
  • Attention-weighted shards: Instead of uniform shard search, learn which shard is most likely to contain the answer
  • Forgetting mechanisms: Implement principled memory decay (exponential, importance-weighted) for long-running agents

The Code

The full implementation is ~500 lines of Python using NumPy, FAISS, and SentenceTransformers. The core mathematical operations (bind, unbind, superposition) are each essentially a single NumPy expression:

import numpy as np

def bind(k, v):
    # Circular convolution via the FFT
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(v)))

def unbind(k, trace):
    # Circular correlation: the approximate inverse of bind()
    fft_k = np.fft.fft(k)
    return np.real(np.fft.ifft(
        np.fft.fft(trace) * np.conj(fft_k) / (np.abs(fft_k)**2 + 1e-8)
    ))

# Store: just add to the shard
shard = shard + bind(key, value)

The breakthrough was realizing that the HRR math requires quasi-orthogonal vectors, which semantic embeddings don’t provide for similar sentences. The fix — using deterministic random IDs (seeded from SHA-256 hashes) for binding, while keeping semantic embeddings for query matching — turns a 5%-accuracy system into a 100%-accuracy system that scales to 2,600+ facts with perfect recall.

The math was written in 1995. We just had to use it correctly.


Try It Yourself

# Run the capacity benchmark (no API key needed)
python holographic_brain_v3.py --bench --max 2600

# Interactive chat mode (needs OpenRouter API key)
python holographic_brain_v3.py

# In chat:
>> Alice is the CEO.
>> The CEO lives in London.
>> Who is Alice?
Bot: Alice is the CEO. (Linked: London)
>> health           # Show shard diagnostics
>> quit             # Save and exit

The source code is available on GitHub. Questions, ideas, or counterarguments? I’d love to hear them.