The Problem: AI Agents That Forget Everything
Most AI agent systems treat every interaction as a blank slate. Your agent completes a task, generates insights, discovers user preferences — and then forgets all of it. The next task starts from zero. This is the fundamental gap between a tool and a teammate: teammates learn.
We set out to build a memory system for our autonomous agents that would make them genuinely learn over time — not just retrieve facts, but reason about what they've learned, identify patterns, and build an evolving understanding of the people they work with.
Along the way, we studied platforms like Honcho, mem0, and Polsia — each taking a different approach to AI memory. Honcho's concept of "dreaming" (periodic reflection on accumulated knowledge) and Polsia's three-tier memory hierarchy both heavily influenced our architecture.
Three Tiers of Memory: User, Account, Platform
Before diving into implementation, it's worth understanding the scoping model. Inspired by Polsia's approach to memory hierarchy, we designed three distinct tiers:
User-Scoped Memory — Personal to each team member. When Nash (our autopilot agent) runs a personal autopilot cycle for you, it saves memories scoped to your UserId: your communication preferences, your pipeline priorities, the follow-ups you care about. These memories are invisible to other users' agents.
Account-Scoped Memory — Shared across the entire workspace. Business facts, company strategy, market intelligence, customer patterns — things that benefit every team member and every agent in the organization. When Nash discovers that "enterprise deals close 40% faster when a technical champion is identified early," that's account-level knowledge.
Platform-Scoped Memory (Global Learning) — Anonymized patterns that propagate across all accounts. When agents across different organizations independently discover the same operational insight, it gets promoted to the global layer. This creates a flywheel — the more accounts use the system, the smarter every agent gets out of the box.
This three-tier model ensures privacy (personal memories stay personal), organizational knowledge compounds (account memories get richer), and the entire platform gets smarter over time (global learnings benefit everyone).
Architecture: Five Layers Deep
Our memory system operates across five distinct layers, each serving a different purpose:
Layer 1: Atomic Memories
The foundation. Every observation, pattern, preference, fact, and episode gets stored as an individual memory with a 1536-dimensional vector embedding via pgvector. Each memory has confidence scoring, occurrence tracking, and category taxonomy.
Key design choices:
- Hash-partitioned by tenant (16 partitions) for physical isolation
- HNSW indexes per partition for fast cosine similarity search
- Content encrypted at rest (AES-256-GCM) while embeddings remain unencrypted for search
- Confidence + Occurrences fields that increase when the same insight is reinforced
- Agent-scoped memories (AgentId column) so agents can build self-knowledge — Nash accumulates memories about what approaches worked, which tools were most effective, and what strategies produced the best outcomes, carrying that self-knowledge forward across cycles
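To make the shape concrete, here is a minimal TypeScript sketch of an atomic memory record and the tenant-to-partition mapping. The field names and the toy hash function are illustrative assumptions, not the actual schema:

```typescript
type MemoryScope = "user" | "account" | "platform";
type MemoryKind = "observation" | "pattern" | "preference" | "fact" | "episode";

// Illustrative record shape, not the real Coherence schema.
interface AtomicMemory {
  id: string;
  tenantId: string;          // hash-partition key (16 physical partitions)
  userId?: string;           // set for user-scoped memories
  agentId?: string;          // set for agent self-knowledge
  scope: MemoryScope;
  kind: MemoryKind;
  category: string;          // taxonomy label, e.g. "communication"
  contentCiphertext: string; // content encrypted at rest (AES-256-GCM)
  embedding: number[];       // 1536-dim vector, left unencrypted for search
  confidence: number;        // 0..1, boosted when reinforced
  occurrences: number;       // incremented when the same insight recurs
}

// Tenant rows map to one of 16 partitions by hashing the tenant id.
// (Postgres does this internally with declarative hash partitioning;
// this toy hash just illustrates the idea.)
function partitionFor(tenantId: string, partitions = 16): number {
  let h = 0;
  for (const ch of tenantId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % partitions;
}
```

In the real system each partition would carry its own HNSW index, so similarity search only touches one tenant's slice of the data.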
Layer 2: Automatic Extraction
After every agent task, a lightweight LLM pass (Claude Haiku or GPT-5-mini) analyzes the execution output and decides what's worth remembering. The extraction is selective — trivial tasks like simple record updates are pre-filtered, and the LLM can explicitly declare "nothing worth remembering" for routine operations.
For cycles that execute multiple tasks, we batch the extraction into a single LLM call. Instead of N separate passes, one pass reviews all outputs together, looking for cross-task patterns and themes. This typically reduces extraction cost by 60-70% during busy cycles.
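A sketch of what batched extraction can look like, assuming a generic `callLlm` client and a sentinel reply for the "nothing worth remembering" case (both are illustrative stand-ins, not the real interface):

```typescript
interface TaskOutput { taskId: string; summary: string; }

// Build one extraction prompt covering every task in the cycle, so a single
// LLM pass can spot cross-task patterns instead of N isolated passes.
function buildBatchExtractionPrompt(outputs: TaskOutput[]): string {
  const body = outputs
    .map((o, i) => `### Task ${i + 1} (${o.taskId})\n${o.summary}`)
    .join("\n\n");
  return [
    "Review the task outputs below from one autopilot cycle.",
    "Extract only durable insights (preferences, patterns, facts).",
    'Reply with "NOTHING_WORTH_REMEMBERING" if nothing qualifies.',
    "",
    body,
  ].join("\n");
}

async function extractMemories(
  outputs: TaskOutput[],
  callLlm: (prompt: string) => Promise<string>,
): Promise<string[]> {
  if (outputs.length === 0) return []; // pre-filter: trivial cycles skip the LLM
  const reply = await callLlm(buildBatchExtractionPrompt(outputs));
  if (reply.trim() === "NOTHING_WORTH_REMEMBERING") return [];
  return reply.split("\n").filter((line) => line.trim().length > 0);
}
```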
Layer 3: Semantic Deduplication & Search
Our `rememberOrReinforce()` method prevents memory bloat. Before creating a new memory, it searches for semantically similar existing memories (cosine similarity >= 0.9). If a match exists, it increments the occurrence count and boosts confidence instead of creating a duplicate. The system naturally converges — repeated patterns get stronger, not noisier.
On the retrieval side, not every query needs the same depth of processing. A `synthesize` parameter controls this: default mode returns a fast, cheap vector similarity search, while `synthesize` mode adds an LLM pass that weaves results into a coherent narrative — more expensive, but dramatically better for complex queries spanning multiple memories.
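The two retrieval depths reduce to a single branch. This simplified synchronous sketch uses injected `vectorSearch` and `synthesizeWithLlm` stand-ins (the real calls would be async):

```typescript
type VectorSearch = (query: string) => string[];
type Synthesizer = (query: string, hits: string[]) => string;

// Default path: cheap vector search. Opt-in path: an extra LLM pass that
// weaves the hits into one narrative.
function recall(
  query: string,
  vectorSearch: VectorSearch,
  synthesizeWithLlm: Synthesizer,
  synthesize = false,
): { memories: string[]; narrative?: string } {
  const memories = vectorSearch(query);
  if (!synthesize || memories.length === 0) return { memories };
  return { memories, narrative: synthesizeWithLlm(query, memories) };
}
```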
Layer 4: The Dreaming Job
This is where the learning happens. Inspired by Honcho's concept of "sleep for a memory system," a background job runs periodically to reflect on accumulated knowledge:
Consolidation: Finds groups of semantically similar memories (0.85 threshold, looser than the 0.9 dedup threshold) and merges them via LLM into single, stronger statements. The originals get archived, and the consolidated memory inherits the sum of all occurrences.
Promotion: Observations reinforced 3+ times get promoted to "pattern" type with a confidence boost. The system recognizes when something noticed repeatedly has become a reliable signal, rather than treating every one-off observation as one.
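A minimal version of the promotion rule, with an illustrative confidence boost:

```typescript
interface DreamMemory { kind: string; occurrences: number; confidence: number; }

const PROMOTION_OCCURRENCES = 3;

// During a dream pass, observations reinforced often enough become patterns.
function promoteIfReinforced(m: DreamMemory): DreamMemory {
  if (m.kind === "observation" && m.occurrences >= PROMOTION_OCCURRENCES) {
    return { ...m, kind: "pattern", confidence: Math.min(1, m.confidence + 0.1) };
  }
  return m;
}
```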
Deduction Specialist: For entities with enough accumulated data (50+ new conclusions since last dream), an LLM analyzes existing memories to identify logical implications, contradictions, and knowledge updates. If a user changed roles, the deduction pass catches the conflicting facts and resolves them.
Induction Specialist: Scans for recurring behavioral patterns across multiple observations. A preference must have evidence from at least two independent data points before it's promoted — single observations stay as observations.
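The independence check at the heart of the induction pass is small; `source` here is a hypothetical stand-in for whatever provenance field identifies a data point:

```typescript
interface Evidence { source: string; statement: string; }

// A candidate preference needs observations from at least two independent
// sources before promotion; otherwise it stays an observation.
function hasIndependentSupport(evidence: Evidence[], minSources = 2): boolean {
  return new Set(evidence.map((e) => e.source)).size >= minSources;
}
```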
Both specialist passes feed results into the global learning layer when patterns are strong enough (high confidence + multiple occurrences), completing the user → account → platform propagation loop.
Layer 5: Peer Cards (Biographical Profiles)
Instead of requiring agents to search for user information every time, a stable biographical profile is always injected into the agent's context. No similarity search needed, no query to get right — it's just there.
Each peer card is a list of up to 40 atomic facts, prefixed by category:
NAME: Keith
TIMEZONE: US/Pacific
PREFERENCE: Outcome-oriented execution without repeated approval-seeking
PREFERENCE: Async communication over synchronous meetings
INSTRUCTION: Do not create documents unless explicitly asked
TRAIT: Delegates broadly to trusted agents
INTEREST: AI agent architectures, autonomous systems
ROLE: Founder, technical product lead
Cards are auto-populated by the dreaming job's specialist passes, but can also be manually edited. They bootstrap automatically on first access with whatever data the system already knows.
The key insight: separate stable biographical facts from transient memories. A user's timezone rarely changes — it shouldn't be competing for relevance in a semantic search alongside yesterday's task results.
What We Learned
Memory without reasoning is just storage. Adding the dreaming job — especially the deduction and induction specialists — transformed our system from a recall engine into a learning engine.
Three-tier scoping is essential. Without it, you either over-share (personal preferences leak to other users) or under-share (valuable business patterns stay locked in one user's context). Polsia's hierarchy model was the right starting point.
Peer cards are deceptively powerful. The simplest feature (a list of strings) had the biggest impact on agent quality. When your agent always knows your name, timezone, and communication preferences, every interaction starts from a foundation of understanding instead of a blank slate.
Threshold-based processing beats interval-based. Triggering dreams based on conclusion counts + cooldown periods is smarter than a fixed interval. It ensures the system only does expensive reasoning when there's enough new data to justify it.
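The trigger condition is a few lines. The 6-hour cooldown is an illustrative assumption, while the 50-conclusion threshold matches the one mentioned above for the deduction specialist:

```typescript
// Trigger a dream on accumulated evidence plus a cooldown, not a fixed timer,
// so expensive reasoning only runs when there is enough new data.
function shouldDream(
  newConclusions: number,
  lastDreamAt: Date | null,
  now: Date,
  minConclusions = 50,
  cooldownMs = 6 * 60 * 60 * 1000, // assumed cooldown
): boolean {
  if (newConclusions < minConclusions) return false;
  if (lastDreamAt && now.getTime() - lastDreamAt.getTime() < cooldownMs) return false;
  return true;
}
```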
Batch extraction is a free optimization. Combining multiple task outputs into a single extraction pass doesn't just save tokens — it actually produces better memories because the LLM can identify cross-task patterns that individual extraction would miss.
The Self-Improvement Loop: Closing the Feedback Gap
A memory system that stores and retrieves is necessary but not sufficient. The missing piece was self-evaluation — the agent needs to know whether it's actually getting better.
We recently drew inspiration from two very different approaches to this problem:
MiniMax's M2.7 model uses a recursive self-evolution cycle where the model analyzes its own failure trajectories, modifies its scaffold code, evaluates results, and decides whether to keep or revert changes — running autonomously for 100+ iterations. Their Dev Harness architecture treats persistent memory, hierarchical skills, and evaluation infrastructure as first-class components of the training loop itself, not afterthoughts. Their Forge RL framework goes further, making context management an explicit action the agent learns through reinforcement — the agent decides what to remember, compress, or discard, rather than following hand-coded rules.
Andrej Karpathy's autoresearch takes the opposite extreme: radical simplicity. One metric (val_bpb), one file to edit (train.py), and a strict hill-climbing loop — try a change, measure, keep or revert, repeat indefinitely. The human's job is to write the agent's instructions (program.md), not the code. The key insight: an append-only results log (results.tsv) gives the agent complete self-awareness about what's been tried and what worked.
We took the core pattern common to both — measure → inject feedback → self-correct — and applied it to our autopilot cycles:
Cycle Quality Scoring: After each autopilot cycle completes, a lightweight LLM call scores the cycle's performance on a 1-10 scale across four criteria: task relevance, output quality, efficiency, and improvement over previous cycles. The score and actionable feedback are persisted and injected into the next cycle's planning context, so the agent can see its own trajectory and adjust.
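A sketch of the score record and how prior scores could be folded into the next cycle's planning context. The field names are illustrative, not the persisted schema:

```typescript
interface CycleScore {
  cycleId: string;
  taskRelevance: number;  // each criterion scored 1-10 by a lightweight LLM
  outputQuality: number;
  efficiency: number;
  improvement: number;    // improvement over previous cycles
  feedback: string;       // actionable note carried into the next cycle
}

function overallScore(s: CycleScore): number {
  return (s.taskRelevance + s.outputQuality + s.efficiency + s.improvement) / 4;
}

// Render recent scores as a context block, so the agent sees its own trajectory.
function planningContextBlock(previous: CycleScore[]): string {
  return previous
    .map((s) => `Cycle ${s.cycleId}: ${overallScore(s).toFixed(1)}/10. ${s.feedback}`)
    .join("\n");
}
```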
Structured Performance Log: A rolling 7-day view of task execution metrics — success rates by task type, token efficiency, tool error rates, and user ratings — compiled from the AgentTask table and injected alongside the quality scores. When Nash sees that research tasks average 45K tokens but one outlier consumed 89K, it self-corrects on scope.
Tool Call Instrumentation: Every tool invocation is wrapped with timing and success tracking. Per-tool breakdowns (call count, error count, average duration) flow through to the performance log, giving the agent visibility into its own operational efficiency.
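The wrapper pattern is straightforward. This sketch is synchronous for brevity (real tool calls would be async) and accumulates per-tool stats in a module-level map:

```typescript
interface ToolStats { calls: number; errors: number; totalMs: number; }

const stats = new Map<string, ToolStats>();

// Wrap a tool so every invocation records timing and success/failure,
// without changing the tool's behavior.
function instrument<A extends unknown[], R>(
  name: string,
  tool: (...args: A) => R,
): (...args: A) => R {
  return (...args: A): R => {
    const entry = stats.get(name) ?? { calls: 0, errors: 0, totalMs: 0 };
    stats.set(name, entry);
    const start = Date.now();
    entry.calls += 1;
    try {
      return tool(...args);
    } catch (err) {
      entry.errors += 1;
      throw err;
    } finally {
      entry.totalMs += Date.now() - start;
    }
  };
}
```

Per-tool call counts, error counts, and average durations then fall out of the map for free when compiling the performance log.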
Agent-Driven Memory: Rather than always running an expensive LLM extraction pass after every task, we let the agent decide what's worth remembering in real-time. When the agent explicitly calls `saveMemory` during execution, the post-task extraction is skipped in favor of a lightweight episode log. Agent-chosen memories are higher signal and lower cost — the agent develops intuition for what matters through the natural feedback of "I remembered X and it was useful later."
The result is a closed loop: tool stats feed the performance log, the performance log feeds the quality scorer, and the quality scorer feeds the next planning cycle. Each cycle is slightly better-informed than the last.
What's Next
We're exploring several directions:
- Compound tools: Higher-level skills that chain common tool sequences (research → create record → write doc), reducing token waste on re-discovering obvious patterns — inspired by MiniMax's hierarchical skill architecture
- Strategy documents: Evolving the flat `FocusAreas` text field into a structured, version-controlled strategy doc that the user and agent co-author over time — the `program.md` pattern from autoresearch applied to business operations
- RL-informed context management: Using quality scores and performance data to automatically tune which context blocks are most valuable, rather than injecting everything — taking a page from Forge's approach to learned context policies
We're excited about where this is heading. Every improvement to our memory system compounds — as more teams use Coherence, the agents get smarter, the global learning layer gets richer, and the feedback loops tighten. What Nash learns working with your business today makes it better for everyone tomorrow.
Onward and upward.
Coherence Team
The team behind Coherence — building AI-native tools for modern businesses.