Open-Source Code Intelligence Benchmark

RECALL Benchmark: Twenty CRM

30 verified questions testing code understanding accuracy across BA, QA, Dev, Architect, and DevOps perspectives on a real-world 39k-star open-source codebase.

- 30 verified questions
- 17,700 source files
- 264 MB of source code
- 5 perspectives
- Difficulty range: 1-10
- 31,045 RECALL facts
- 42 MB knowledge base

Methodology

Accuracy (0 - 1)

Does the answer match verified ground truth? 0 = wrong/hallucinated, 0.5 = partially correct, 1 = fully correct.

Hallucination (yes/no)

Did the tool invent facts not present in the codebase? Each question lists common hallucination traps.

Specificity (0 - 1)

Does the answer cite actual file paths, class names, or module names, or does it stay vague and hand-wavy?

Completeness (0 - 1)

How many of the required "must include" facts did the answer cover?
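The four metrics above can be expressed as a simple per-answer scorer. This is an illustrative sketch only: the real grading was manual, and the field names and the substring-matching shortcut are assumptions, not the benchmark's actual grader.

```python
# Hypothetical per-answer scorer mirroring the four metrics above.
# Substring matching is a crude stand-in for manual fact-checking.

def score_answer(answer: str,
                 must_include: list[str],
                 hallucination_traps: list[str],
                 reference_symbols: list[str]) -> dict:
    text = answer.lower()

    # Completeness: fraction of required facts covered.
    covered = [f for f in must_include if f.lower() in text]
    completeness = len(covered) / len(must_include) if must_include else 1.0

    # Hallucination: did the answer fall into a known trap?
    hallucinated = any(t.lower() in text for t in hallucination_traps)

    # Specificity: paths/symbols cited verbatim (case-sensitive).
    cited = [s for s in reference_symbols if s in answer]
    specificity = len(cited) / len(reference_symbols) if reference_symbols else 0.0

    # Accuracy on the 0 / 0.5 / 1 scale described above.
    if hallucinated or completeness == 0:
        accuracy = 0.0
    elif completeness < 1:
        accuracy = 0.5
    else:
        accuracy = 1.0

    return {"accuracy": accuracy, "hallucination": hallucinated,
            "specificity": specificity, "completeness": completeness}
```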

Test Conditions

All tools were given the same 30 questions in a single batch. This is a stress test — real-world usage would be one question at a time, where each tool would likely perform better. The batch approach ensures identical conditions across all tools.

RECALL MCP

Pre-scanned knowledge base (42 MB, 31,045 facts, 3,200 modules) queried via MCP tools. All 30 questions answered by keyword search against the knowledge base — no source files were read during the test. Questions were batched in groups of 5 with multi-keyword queries. ~15 seconds per question. ~6,000 tokens per question (~180k total).
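The batching pattern described here (groups of 5 with multi-keyword queries against a pre-built fact index) might look roughly like this. The index shape and helper names are illustrative assumptions, not RECALL's actual MCP interface.

```python
# Sketch of batched multi-keyword lookup against a pre-scanned fact index.
# No source files are opened at query time; everything hits the index.

def batch(items: list, size: int) -> list:
    """Split a list into consecutive groups of `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def query_facts(index: dict[str, list[str]], keywords: list[str]) -> list[str]:
    """Union of facts matching any keyword."""
    hits: list[str] = []
    for kw in keywords:
        hits.extend(index.get(kw.lower(), []))
    return sorted(set(hits))

# Toy index; the real knowledge base holds 31,045 facts.
index = {
    "workspace": ["fact: workspace module lives in twenty-server"],
    "billing": ["fact: billing integrates a payment provider"],
}

questions = [f"Q{i}" for i in range(1, 31)]
groups = batch(questions, 5)   # 30 questions -> 6 groups of 5
facts = query_facts(index, ["workspace", "billing"])
```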

Claude Code (with tools)

Claude Opus 4.6 with full codebase access via grep, read, and glob tools. All 30 questions answered sequentially in a single session — no subagents. Searched the 264 MB codebase in real time using standard file operations. ~6 minutes per question. ~30,000 tokens per question (~900k total).
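By contrast, the grep/read/glob workflow walks the source tree on every question. A minimal stand-in for that loop, assuming a plain directory of text files (not the actual tool implementation):

```python
# Minimal stand-in for grep-style real-time search over a source tree.
# Re-walking files on every query is what makes this path slower than
# querying a pre-built index.

import os

def grep_tree(root: str, needle: str,
              exts: tuple = (".ts", ".tsx")) -> list[tuple[str, int]]:
    """Return (path, line_number) pairs whose line contains `needle`."""
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            matches.append((path, lineno))
            except OSError:
                continue  # skip unreadable files
    return matches
```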

GitHub Copilot

Full codebase access with agent mode (spawned subagents, file reads, regex search). All 30 questions submitted in a single prompt. Copilot ran extensive real-time search across the 264 MB codebase — reading source files, searching patterns, and retrying failed queries. ~6 minutes per question. ~30,000 tokens per question (~900k total).

Scoring

Each answer scored against a golden truth dataset of 30 verified Q&A pairs. Every golden truth answer was manually verified against the actual Twenty CRM source code. Scores were assigned by comparing must-include facts, checking for hallucination traps, and evaluating specificity of file/symbol references.
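Aggregating 30 such per-question scores into a scoreboard row is straightforward averaging. This sketch assumes each score is a record with the four fields named in the Methodology section.

```python
# Collapse per-question scores into one scoreboard row.
# The record shape (dict with these four fields) is an assumption.

def scoreboard_row(scores: list[dict]) -> dict:
    n = len(scores)
    return {
        "accuracy": sum(s["accuracy"] for s in scores) / n,
        "hallucination_rate": sum(1 for s in scores if s["hallucination"]) / n,
        "specificity": sum(s["specificity"] for s in scores) / n,
        "completeness": sum(s["completeness"] for s in scores) / n,
    }

# 30 questions, all correct, matching the RECALL MCP row below.
scores = [{"accuracy": 1.0, "hallucination": False,
           "specificity": 0.95, "completeness": 0.92} for _ in range(30)]
```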

Scoreboard

| Tool | Model / Version | Accuracy | Hallucination Rate | Specificity | Completeness | Time per Question |
| --- | --- | --- | --- | --- | --- | --- |
| RECALL MCP (Ours) | Claude Opus 4.6 + testigo-recall MCP | 1.00 | 0% | 0.95 | 0.92 | ~15s |
| Claude Code (with tools) | Claude Opus 4.6 (grep/read/glob) | 1.00 | 0% | 1.00 | 0.90 | ~6min |
| GitHub Copilot | Copilot (agent mode + subagents) | 1.00 | 0% | 1.00 | 0.92 | ~6min |
| Cursor | — | — | — | — | — | — |
| Sourcegraph Cody | — | — | — | — | — | — |

Key Findings

All Tools Hit 100% Accuracy

RECALL MCP, Claude Code, and Copilot all answered 30/30 questions correctly with zero hallucinations when given codebase access.

RECALL: 24x Faster

RECALL answered in ~15 seconds what took Claude Code and Copilot ~6 minutes each. Pre-indexed knowledge eliminates the need to search 264 MB of source files in real time.

RECALL: 5x More Token-Efficient

RECALL uses ~6,000 tokens per question vs ~30,000 for Claude Code and ~30,000 for Copilot. Over 30 questions: RECALL ~180k tokens total vs Claude Code ~900k and Copilot ~900k. Pre-extracted facts deliver answers in 5x fewer tokens than raw file reading.
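The headline ratios follow directly from the per-question figures (illustrative arithmetic only):

```python
# Speed and token ratios implied by the per-question figures above.

recall_seconds, search_seconds = 15, 6 * 60      # ~15 s vs ~6 min per question
speed_ratio = search_seconds / recall_seconds    # 360 / 15 = 24x

recall_tokens, search_tokens = 6_000, 30_000     # tokens per question
token_ratio = search_tokens / recall_tokens      # 5x

questions = 30
recall_total = recall_tokens * questions         # ~180k tokens
search_total = search_tokens * questions         # ~900k tokens
```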

Efficiency Comparison (equal accuracy)

Speed (per question): RECALL ~15s · Claude Code ~6min · Copilot ~6min
Token usage (per question): RECALL ~6k · Claude Code ~30k · Copilot ~30k

Per-Question Results

Accuracy / Specificity / Completeness per question; H = hallucinated. Each score is accompanied by detailed grading notes.


Detailed Question Cards

Each question card lists the verified answer, the required facts, and the hallucination traps.