30 verified questions testing code-understanding accuracy across BA, QA, Dev, Architect, and DevOps perspectives on a real-world 39k-star open-source codebase. Each answer is scored on four dimensions:

- **Accuracy** — does the answer match verified ground truth? 0 = wrong/hallucinated, 0.5 = partially correct, 1 = fully correct.
- **Hallucination** — did the tool invent facts not present in the codebase? Each question lists common hallucination traps.
- **Specificity** — does the answer cite actual file paths, class names, or module names, or offer only vague hand-wavy generalities?
- **Completeness** — how many of the required "must include" facts did the answer cover?
All tools were given the same 30 questions in a single batch. This is a stress test: in real-world usage, questions would be asked one at a time, and each tool would likely perform better. The batch approach ensures identical conditions across all tools.
**RECALL MCP (Ours):** Pre-scanned knowledge base (42 MB, 31,045 facts, 3,200 modules) queried via MCP tools. All 30 questions answered by keyword search against the knowledge base — no source files were read during the test. Questions were batched in groups of 5 with multi-keyword queries. ~15 seconds per question. ~6,000 tokens per question (~180k total).

**Claude Code:** Claude Opus 4.6 with full codebase access via grep, read, and glob tools. All 30 questions answered sequentially in a single session — no subagents. Searched the 264 MB codebase in real-time using standard file operations. ~6 minutes per question. ~30,000 tokens per question (~900k total).

**GitHub Copilot:** Full codebase access with agent mode (spawned subagents, file reads, regex search). All 30 questions submitted in a single prompt. Copilot ran extensive real-time search across the 264 MB codebase — reading source files, searching patterns, and retrying failed queries. ~6 minutes per question. ~30,000 tokens per question (~900k total).
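The knowledge-base approach above can be sketched as a keyword search over pre-extracted facts. This is a minimal illustration only: the `Fact` schema, ranking rule, and example entries are hypothetical, not RECALL's actual index format or query API.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    module: str   # module the fact was extracted from
    text: str     # one pre-extracted statement about the code

def query(index: list[Fact], keywords: list[str]) -> list[Fact]:
    """Rank facts by how many query keywords they mention; drop zero-hit facts."""
    def hits(f: Fact) -> int:
        blob = (f.module + " " + f.text).lower()
        return sum(1 for k in keywords if k.lower() in blob)
    scored = [(hits(f), f) for f in index]
    return [f for s, f in sorted(scored, key=lambda p: -p[0]) if s > 0]

# Hypothetical index entries for illustration.
index = [
    Fact("auth", "Tokens are signed as JWTs in AccessTokenService"),
    Fact("billing", "Stripe webhooks update the subscription status"),
]
top = query(index, ["jwt", "token"])
```

Answering from such an index avoids re-reading source files per question, which is where the time and token savings come from.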
Each answer scored against a golden truth dataset of 30 verified Q&A pairs. Every golden truth answer was manually verified against the actual Twenty CRM source code. Scores were assigned by comparing must-include facts, checking for hallucination traps, and evaluating specificity of file/symbol references.
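The rubric above can be illustrated with a small scoring sketch. In the actual benchmark, scores were assigned by manual comparison against the golden truth; the automated substring matching and the example record below are assumptions for illustration only.

```python
def score_answer(answer: str, must_include: list[str], traps: list[str]) -> dict:
    """Score one answer against golden truth: completeness is the share of
    must-include facts covered, any trap phrase counts as a hallucination,
    and accuracy is binned to 0 / 0.5 / 1."""
    text = answer.lower()
    covered = sum(1 for fact in must_include if fact.lower() in text)
    completeness = covered / len(must_include)
    hallucinated = any(trap.lower() in text for trap in traps)
    if hallucinated or covered == 0:
        accuracy = 0.0
    elif covered == len(must_include):
        accuracy = 1.0
    else:
        accuracy = 0.5
    return {"accuracy": accuracy,
            "completeness": completeness,
            "hallucinated": hallucinated}

# Hypothetical question record, not from the real golden truth dataset.
result = score_answer(
    "Workspaces are resolved in WorkspaceService via the core schema.",
    must_include=["WorkspaceService", "core schema"],
    traps=["tenants table"],
)
```

Binning accuracy to 0 / 0.5 / 1 mirrors the wrong / partially correct / fully correct scale defined above.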
| Tool | Model / Version | Accuracy | Hallucination Rate | Specificity | Completeness | Total Time |
|---|---|---|---|---|---|---|
| RECALL MCP (Ours) | Claude Opus 4.6 + testigo-recall MCP | 1.00 | 0% | 0.95 | 0.92 | ~15s |
| Claude Code (with tools) | Claude Opus 4.6 (grep/read/glob) | 1.00 | 0% | 1.00 | 0.90 | ~6min |
| GitHub Copilot | Copilot (agent mode + subagents) | 1.00 | 0% | 1.00 | 0.92 | ~6min |
| Cursor | — | — | — | — | — | — |
| Sourcegraph Cody | — | — | — | — | — | — |
RECALL MCP, Claude Code, and Copilot all answered 30/30 questions correctly with zero hallucinations when given codebase access.
RECALL answered in ~15 seconds what took Claude Code and Copilot ~6 minutes each. Pre-indexed knowledge eliminates the need to search 264 MB of source files in real-time.
RECALL uses ~6,000 tokens per question vs ~30,000 for Claude Code and ~30,000 for Copilot. Over 30 questions: RECALL ~180k tokens total vs Claude Code ~900k and Copilot ~900k. Pre-extracted facts deliver answers in 5x fewer tokens than raw file reading.
Per-question scores: Accuracy (Acc), Specificity (Spec), and Completeness (Comp); H? = hallucinated. Detailed notes accompany each score.
| Q# | Question | Diff | RECALL Acc | RECALL Spec | RECALL Comp | RECALL H? | Claude Acc | Claude Spec | Claude Comp | Claude H? | Copilot Acc | Copilot Spec | Copilot Comp | Copilot H? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Each question's entry includes the verified answer, the required facts, and the hallucination traps.