30 verified questions testing code-understanding accuracy across BA, QA, Dev, Architect, and DevOps perspectives on a real-world 39k-star open-source codebase. Each answer is scored on four dimensions:

- **Accuracy** — does the answer match verified ground truth? 0 = wrong/hallucinated, 0.5 = partially correct, 1 = fully correct.
- **Hallucination** — did the tool invent facts not present in the codebase? Each question lists common hallucination traps.
- **Specificity** — does the answer cite actual file paths, class names, or module names, or offer only vague hand-wavy generalities?
- **Completeness** — how many of the required "must include" facts did the answer cover?
All tools were given the same 30 questions in a single batch. This is a stress test: in real-world usage, questions would be asked one at a time, and each tool would likely perform better. The batch approach ensures identical conditions across all tools.
**RECALL MCP (Ours):** Pre-scanned knowledge base (42 MB, 31,045 facts, 3,200 modules) queried via MCP tools. All 30 questions answered by keyword search against the knowledge base — no source files were read during the test. Questions were batched in groups of 5 with multi-keyword queries. ~15 seconds per question. ~6,000 tokens per question (~180k total).

**Claude Code:** Claude Opus 4.6 with full codebase access via grep, read, and glob tools. All 30 questions answered sequentially in a single session — no subagents. Searched the 264 MB codebase in real-time using standard file operations. ~6 minutes per question. ~30,000 tokens per question (~900k total).

**GitHub Copilot:** Full codebase access with agent mode (spawned subagents, file reads, regex search). All 30 questions submitted in a single prompt. Copilot ran extensive real-time search across the 264 MB codebase — reading source files, searching patterns, and retrying failed queries. ~6 minutes per question. ~30,000 tokens per question (~900k total).
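The knowledge-base approach above can be sketched as a keyword search over pre-extracted facts. This is a minimal illustration only: the `Fact` schema, ranking rule, and example entries are hypothetical, not RECALL's actual index format or query API.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    module: str   # module the fact was extracted from
    text: str     # one pre-extracted statement about the code

def query(index: list[Fact], keywords: list[str]) -> list[Fact]:
    """Rank facts by how many query keywords they mention; drop zero-hit facts."""
    def hits(f: Fact) -> int:
        blob = (f.module + " " + f.text).lower()
        return sum(1 for k in keywords if k.lower() in blob)
    scored = [(hits(f), f) for f in index]
    return [f for s, f in sorted(scored, key=lambda p: -p[0]) if s > 0]

# Hypothetical index entries for illustration.
index = [
    Fact("auth", "Tokens are signed as JWTs in AccessTokenService"),
    Fact("billing", "Stripe webhooks update the subscription status"),
]
top = query(index, ["jwt", "token"])
```

Answering from such an index avoids re-reading source files per question, which is where the time and token savings come from.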
Each answer scored against a golden truth dataset of 30 verified Q&A pairs. Every golden truth answer was manually verified against the actual Twenty CRM source code. Scores were assigned by comparing must-include facts, checking for hallucination traps, and evaluating specificity of file/symbol references.
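The rubric above can be illustrated with a small scoring sketch. In the actual benchmark, scores were assigned by manual comparison against the golden truth; the automated substring matching and the example record below are assumptions for illustration only.

```python
def score_answer(answer: str, must_include: list[str], traps: list[str]) -> dict:
    """Score one answer against golden truth: completeness is the share of
    must-include facts covered, any trap phrase counts as a hallucination,
    and accuracy is binned to 0 / 0.5 / 1."""
    text = answer.lower()
    covered = sum(1 for fact in must_include if fact.lower() in text)
    completeness = covered / len(must_include)
    hallucinated = any(trap.lower() in text for trap in traps)
    if hallucinated or covered == 0:
        accuracy = 0.0
    elif covered == len(must_include):
        accuracy = 1.0
    else:
        accuracy = 0.5
    return {"accuracy": accuracy,
            "completeness": completeness,
            "hallucinated": hallucinated}

# Hypothetical question record, not from the real golden truth dataset.
result = score_answer(
    "Workspaces are resolved in WorkspaceService via the core schema.",
    must_include=["WorkspaceService", "core schema"],
    traps=["tenants table"],
)
```

Binning accuracy to 0 / 0.5 / 1 mirrors the wrong / partially correct / fully correct scale defined above.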
| Tool | Model / Version | Accuracy | Hallucination Rate | Specificity | Completeness | Total Time |
|---|---|---|---|---|---|---|
| RECALL MCP (Ours) | Claude Opus 4.6 + testigo-recall MCP | 1.00 | 0% | 0.95 | 0.92 | ~15s |
| Claude Code (with tools) | Claude Opus 4.6 (grep/read/glob) | 1.00 | 0% | 1.00 | 0.90 | ~6min |
| GitHub Copilot | Copilot (agent mode + subagents) | 1.00 | 0% | 1.00 | 0.92 | ~6min |
| Cursor | — | — | — | — | — | — |
| Sourcegraph Cody | — | — | — | — | — | — |
RECALL MCP, Claude Code, and Copilot all answered 30/30 questions correctly with zero hallucinations when given codebase access.
RECALL answered in ~15 seconds what took Claude Code and Copilot ~6 minutes each. Pre-indexed knowledge eliminates the need to search 264 MB of source files in real-time.
RECALL uses ~6,000 tokens per question vs ~30,000 for Claude Code and ~30,000 for Copilot. Over 30 questions: RECALL ~180k tokens total vs Claude Code ~900k and Copilot ~900k. Pre-extracted facts deliver answers in 5x fewer tokens than raw file reading.
Per-question scores: Accuracy (Acc), Specificity (Spec), and Completeness (Comp); H? = hallucinated. Detailed notes accompany each score.
| Q# | Question | Diff | RECALL Acc | RECALL Spec | RECALL Comp | RECALL H? | Claude Acc | Claude Spec | Claude Comp | Claude H? | Copilot Acc | Copilot Spec | Copilot Comp | Copilot H? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Each question's entry includes the verified answer, the required facts, and the hallucination traps.