The Problem: AI Agents That Forget Everything
Most AI agent systems treat every interaction as a blank slate. Your agent completes a task, generates insights, discovers user preferences — and then forgets all of it. The next task starts from zero. This is the fundamental gap between a tool and a teammate: teammates learn.
We set out to build a memory system for our autonomous agents that would make them genuinely learn over time — not just retrieve facts, but reason about what they've learned, identify patterns, and build an evolving understanding of the people they work with.
Along the way, we studied platforms like Honcho, mem0, and Polsia — each taking a different approach to AI memory. Honcho's concept of "dreaming" (periodic reflection on accumulated knowledge) and Polsia's three-tier memory hierarchy both heavily influenced our architecture.
Three Tiers of Memory: User, Account, Platform
Before diving into implementation, it's worth understanding the scoping model. Inspired by Polsia's approach to memory hierarchy, we designed three distinct tiers:
User-Scoped Memory — Personal to each team member. When Nash (our autopilot agent) runs a personal autopilot cycle for you, it saves memories scoped to your UserId: your communication preferences, your pipeline priorities, the follow-ups you care about. These memories are invisible to other users' agents.
Account-Scoped Memory — Shared across the entire workspace. Business facts, company strategy, market intelligence, customer patterns — things that benefit every team member and every agent in the organization. When Nash discovers that "enterprise deals close 40% faster when a technical champion is identified early," that's account-level knowledge.
Platform-Scoped Memory (Global Learning) — Anonymized patterns that propagate across all accounts. When agents across different organizations independently discover the same operational insight, it gets promoted to the global layer. This creates a flywheel — the more accounts use the system, the smarter every agent gets out of the box.
This three-tier model ensures privacy (personal memories stay personal), organizational knowledge compounds (account memories get richer), and the entire platform gets smarter over time (global learnings benefit everyone).
Architecture: Five Layers Deep
Our memory system operates across five distinct layers, each serving a different purpose:
Layer 1: Atomic Memories
The foundation. Every observation, pattern, preference, fact, and episode gets stored as an individual memory with a 1536-dimensional vector embedding via pgvector. Each memory has confidence scoring, occurrence tracking, and category taxonomy.
Key design choices:
- Hash-partitioned by tenant (16 partitions) for physical isolation
- HNSW indexes per partition for fast cosine similarity search
- Content encrypted at rest (AES-256-GCM) while embeddings remain unencrypted for search
- Confidence + Occurrences fields that increase when the same insight is reinforced
- Agent-scoped memories (AgentId column) so agents can build self-knowledge — Nash accumulates memories about what approaches worked, which tools were most effective, and what strategies produced the best outcomes, carrying that self-knowledge forward across cycles
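To make the shape concrete, here is a minimal TypeScript sketch of an atomic memory record and the tenant-to-partition mapping. The field names and the toy hash function are illustrative assumptions, not the actual schema:

```typescript
type MemoryScope = "user" | "account" | "platform";
type MemoryKind = "observation" | "pattern" | "preference" | "fact" | "episode";

// Illustrative record shape, not the real Coherence schema.
interface AtomicMemory {
  id: string;
  tenantId: string;          // hash-partition key (16 physical partitions)
  userId?: string;           // set for user-scoped memories
  agentId?: string;          // set for agent self-knowledge
  scope: MemoryScope;
  kind: MemoryKind;
  category: string;          // taxonomy label, e.g. "communication"
  contentCiphertext: string; // content encrypted at rest (AES-256-GCM)
  embedding: number[];       // 1536-dim vector, left unencrypted for search
  confidence: number;        // 0..1, boosted when reinforced
  occurrences: number;       // incremented when the same insight recurs
}

// Tenant rows map to one of 16 partitions by hashing the tenant id.
// (Postgres does this internally with declarative hash partitioning;
// this toy hash just illustrates the idea.)
function partitionFor(tenantId: string, partitions = 16): number {
  let h = 0;
  for (const ch of tenantId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % partitions;
}
```

In the real system each partition would carry its own HNSW index, so similarity search only touches one tenant's slice of the data.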
Layer 2: Automatic Extraction
After every agent task, a lightweight LLM pass (Claude Haiku or GPT-5-mini) analyzes the execution output and decides what's worth remembering. The extraction is selective — trivial tasks like simple record updates are pre-filtered, and the LLM can explicitly declare "nothing worth remembering" for routine operations.
For cycles that execute multiple tasks, we batch the extraction into a single LLM call. Instead of N separate passes, one pass reviews all outputs together, looking for cross-task patterns and themes. This typically reduces extraction cost by 60-70% during busy cycles.
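A sketch of what batched extraction can look like, assuming a generic `callLlm` client and a sentinel reply for the "nothing worth remembering" case (both are illustrative stand-ins, not the real interface):

```typescript
interface TaskOutput { taskId: string; summary: string; }

// Build one extraction prompt covering every task in the cycle, so a single
// LLM pass can spot cross-task patterns instead of N isolated passes.
function buildBatchExtractionPrompt(outputs: TaskOutput[]): string {
  const body = outputs
    .map((o, i) => `### Task ${i + 1} (${o.taskId})\n${o.summary}`)
    .join("\n\n");
  return [
    "Review the task outputs below from one autopilot cycle.",
    "Extract only durable insights (preferences, patterns, facts).",
    'Reply with "NOTHING_WORTH_REMEMBERING" if nothing qualifies.',
    "",
    body,
  ].join("\n");
}

async function extractMemories(
  outputs: TaskOutput[],
  callLlm: (prompt: string) => Promise<string>,
): Promise<string[]> {
  if (outputs.length === 0) return []; // pre-filter: trivial cycles skip the LLM
  const reply = await callLlm(buildBatchExtractionPrompt(outputs));
  if (reply.trim() === "NOTHING_WORTH_REMEMBERING") return [];
  return reply.split("\n").filter((line) => line.trim().length > 0);
}
```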
Layer 3: Semantic Deduplication & Search
Our `rememberOrReinforce()` method prevents memory bloat. Before creating a new memory, it searches for semantically similar existing memories (cosine similarity >= 0.9). If a match exists, it increments the occurrence count and boosts confidence instead of creating a duplicate. The system naturally converges — repeated patterns get stronger, not noisier.
On the retrieval side, not every query needs the same depth of processing. A `synthesize` parameter controls this: default mode returns a fast, cheap vector similarity search, while `synthesize` mode adds an LLM pass that weaves results into a coherent narrative — more expensive, but dramatically better for complex queries spanning multiple memories.
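The two retrieval depths reduce to a single branch. This simplified synchronous sketch uses injected `vectorSearch` and `synthesizeWithLlm` stand-ins (the real calls would be async):

```typescript
type VectorSearch = (query: string) => string[];
type Synthesizer = (query: string, hits: string[]) => string;

// Default path: cheap vector search. Opt-in path: an extra LLM pass that
// weaves the hits into one narrative.
function recall(
  query: string,
  vectorSearch: VectorSearch,
  synthesizeWithLlm: Synthesizer,
  synthesize = false,
): { memories: string[]; narrative?: string } {
  const memories = vectorSearch(query);
  if (!synthesize || memories.length === 0) return { memories };
  return { memories, narrative: synthesizeWithLlm(query, memories) };
}
```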
Layer 4: The Dreaming Job
This is where the learning happens. Inspired by Honcho's concept of "sleep for a memory system," a background job runs periodically to reflect on accumulated knowledge:
Consolidation: Finds groups of semantically similar memories (0.85 threshold, looser than the 0.9 dedup threshold) and merges them via LLM into single, stronger statements. The originals get archived, and the consolidated memory inherits the sum of all occurrences.
Promotion: Observations reinforced 3+ times get promoted to "pattern" type with a confidence boost. The system recognizes when something noticed repeatedly has become a reliable signal, rather than treating every one-off observation as one.
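A minimal version of the promotion rule, with an illustrative confidence boost:

```typescript
interface DreamMemory { kind: string; occurrences: number; confidence: number; }

const PROMOTION_OCCURRENCES = 3;

// During a dream pass, observations reinforced often enough become patterns.
function promoteIfReinforced(m: DreamMemory): DreamMemory {
  if (m.kind === "observation" && m.occurrences >= PROMOTION_OCCURRENCES) {
    return { ...m, kind: "pattern", confidence: Math.min(1, m.confidence + 0.1) };
  }
  return m;
}
```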
Deduction Specialist: For entities with enough accumulated data (50+ new conclusions since last dream), an LLM analyzes existing memories to identify logical implications, contradictions, and knowledge updates. If a user changed roles, the deduction pass catches the conflicting facts and resolves them.
Induction Specialist: Scans for recurring behavioral patterns across multiple observations. A preference must have evidence from at least two independent data points before it's promoted — single observations stay as observations.
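The independence check at the heart of the induction pass is small; `source` here is a hypothetical stand-in for whatever provenance field identifies a data point:

```typescript
interface Evidence { source: string; statement: string; }

// A candidate preference needs observations from at least two independent
// sources before promotion; otherwise it stays an observation.
function hasIndependentSupport(evidence: Evidence[], minSources = 2): boolean {
  return new Set(evidence.map((e) => e.source)).size >= minSources;
}
```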
Both specialist passes feed results into the global learning layer when patterns are strong enough (high confidence + multiple occurrences), completing the user → account → platform propagation loop.
Layer 5: Peer Cards (Biographical Profiles)
Instead of requiring agents to search for user information every time, a stable biographical profile is always injected into the agent's context. No similarity search needed, no query to get right — it's just there.
Each peer card is a list of up to 40 atomic facts, prefixed by category:
NAME: Keith
TIMEZONE: US/Pacific
PREFERENCE: Outcome-oriented execution without repeated approval-seeking
PREFERENCE: Async communication over synchronous meetings
INSTRUCTION: Do not create documents unless explicitly asked
TRAIT: Delegates broadly to trusted agents
INTEREST: AI agent architectures, autonomous systems
ROLE: Founder, technical product lead
Cards are auto-populated by the dreaming job's specialist passes, but can also be manually edited. They bootstrap automatically on first access with whatever data the system already knows.
The key insight: separate stable biographical facts from transient memories. A user's timezone rarely changes — it shouldn't be competing for relevance in a semantic search alongside yesterday's task results.
What We Learned
Memory without reasoning is just storage. Adding the dreaming job — especially the deduction and induction specialists — transformed our system from a recall engine into a learning engine.
Three-tier scoping is essential. Without it, you either over-share (personal preferences leak to other users) or under-share (valuable business patterns stay locked in one user's context). Polsia's hierarchy model was the right starting point.
Peer cards are deceptively powerful. The simplest feature (a list of strings) had the biggest impact on agent quality. When your agent always knows your name, timezone, and communication preferences, every interaction starts from a foundation of understanding instead of a blank slate.
Threshold-based processing beats interval-based. Triggering dreams based on conclusion counts + cooldown periods is smarter than a fixed interval. It ensures the system only does expensive reasoning when there's enough new data to justify it.
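The trigger condition is a few lines. The 6-hour cooldown is an illustrative assumption, while the 50-conclusion threshold matches the one mentioned above for the deduction specialist:

```typescript
// Trigger a dream on accumulated evidence plus a cooldown, not a fixed timer,
// so expensive reasoning only runs when there is enough new data.
function shouldDream(
  newConclusions: number,
  lastDreamAt: Date | null,
  now: Date,
  minConclusions = 50,
  cooldownMs = 6 * 60 * 60 * 1000, // assumed cooldown
): boolean {
  if (newConclusions < minConclusions) return false;
  if (lastDreamAt && now.getTime() - lastDreamAt.getTime() < cooldownMs) return false;
  return true;
}
```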
Batch extraction is a free optimization. Combining multiple task outputs into a single extraction pass doesn't just save tokens — it actually produces better memories because the LLM can identify cross-task patterns that individual extraction would miss.
The Self-Improvement Loop: Closing the Feedback Gap
A memory system that stores and retrieves is necessary but not sufficient. The missing piece was self-evaluation — the agent needs to know whether it's actually getting better.
We recently drew inspiration from two very different approaches to this problem:
MiniMax's M2.7 model uses a recursive self-evolution cycle where the model analyzes its own failure trajectories, modifies its scaffold code, evaluates results, and decides whether to keep or revert changes — running autonomously for 100+ iterations. Their Dev Harness architecture treats persistent memory, hierarchical skills, and evaluation infrastructure as first-class components of the training loop itself, not afterthoughts. Their Forge RL framework goes further, making context management an explicit action the agent learns through reinforcement — the agent decides what to remember, compress, or discard, rather than following hand-coded rules.
Andrej Karpathy's autoresearch takes the opposite extreme: radical simplicity. One metric (val_bpb), one file to edit (train.py), and a strict hill-climbing loop — try a change, measure, keep or revert, repeat indefinitely. The human's job is to write the agent's instructions (program.md), not the code. The key insight: an append-only results log (results.tsv) gives the agent complete self-awareness about what's been tried and what worked.
We took the core pattern common to both — measure → inject feedback → self-correct — and applied it to our autopilot cycles:
Cycle Quality Scoring: After each autopilot cycle completes, a lightweight LLM call scores the cycle's performance on a 1-10 scale across four criteria: task relevance, output quality, efficiency, and improvement over previous cycles. The score and actionable feedback are persisted and injected into the next cycle's planning context, so the agent can see its own trajectory and adjust.
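A sketch of the score record and how prior scores could be folded into the next cycle's planning context. The field names are illustrative, not the persisted schema:

```typescript
interface CycleScore {
  cycleId: string;
  taskRelevance: number;  // each criterion scored 1-10 by a lightweight LLM
  outputQuality: number;
  efficiency: number;
  improvement: number;    // improvement over previous cycles
  feedback: string;       // actionable note carried into the next cycle
}

function overallScore(s: CycleScore): number {
  return (s.taskRelevance + s.outputQuality + s.efficiency + s.improvement) / 4;
}

// Render recent scores as a context block, so the agent sees its own trajectory.
function planningContextBlock(previous: CycleScore[]): string {
  return previous
    .map((s) => `Cycle ${s.cycleId}: ${overallScore(s).toFixed(1)}/10. ${s.feedback}`)
    .join("\n");
}
```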
Structured Performance Log: A rolling 7-day view of task execution metrics — success rates by task type, token efficiency, tool error rates, and user ratings — compiled from the AgentTask table and injected alongside the quality scores. When Nash sees that research tasks average 45K tokens but one outlier consumed 89K, it self-corrects on scope.
Tool Call Instrumentation: Every tool invocation is wrapped with timing and success tracking. Per-tool breakdowns (call count, error count, average duration) flow through to the performance log, giving the agent visibility into its own operational efficiency.
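The wrapper pattern is straightforward. This sketch is synchronous for brevity (real tool calls would be async) and accumulates per-tool stats in a module-level map:

```typescript
interface ToolStats { calls: number; errors: number; totalMs: number; }

const stats = new Map<string, ToolStats>();

// Wrap a tool so every invocation records timing and success/failure,
// without changing the tool's behavior.
function instrument<A extends unknown[], R>(
  name: string,
  tool: (...args: A) => R,
): (...args: A) => R {
  return (...args: A): R => {
    const entry = stats.get(name) ?? { calls: 0, errors: 0, totalMs: 0 };
    stats.set(name, entry);
    const start = Date.now();
    entry.calls += 1;
    try {
      return tool(...args);
    } catch (err) {
      entry.errors += 1;
      throw err;
    } finally {
      entry.totalMs += Date.now() - start;
    }
  };
}
```

Per-tool call counts, error counts, and average durations then fall out of the map for free when compiling the performance log.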
Agent-Driven Memory: Rather than always running an expensive LLM extraction pass after every task, we let the agent decide what's worth remembering in real-time. When the agent explicitly calls `saveMemory` during execution, the post-task extraction is skipped in favor of a lightweight episode log. Agent-chosen memories are higher signal and lower cost — the agent develops intuition for what matters through the natural feedback of "I remembered X and it was useful later."
The result is a closed loop: tool stats feed the performance log, the performance log feeds the quality scorer, and the quality scorer feeds the next planning cycle. Each cycle is slightly better-informed than the last.
What's Next
We're exploring several directions:
- Compound tools: Higher-level skills that chain common tool sequences (research → create record → write doc), reducing token waste on re-discovering obvious patterns — inspired by MiniMax's hierarchical skill architecture
- Strategy documents: Evolving the flat `FocusAreas` text field into a structured, version-controlled strategy doc that the user and agent co-author over time — the `program.md` pattern from autoresearch applied to business operations
- RL-informed context management: Using quality scores and performance data to automatically tune which context blocks are most valuable, rather than injecting everything — taking a page from Forge's approach to learned context policies
We're excited about where this is heading. Every improvement to our memory system compounds — as more teams use Coherence, the agents get smarter, the global learning layer gets richer, and the feedback loops tighten. What Nash learns working with your business today makes it better for everyone tomorrow.
Onward and upward.
Coherence Team
The team behind Coherence — building AI-native tools for modern businesses.