Testigo Recall vs GitHub Copilot vs CodeRabbit vs Greptile — 5 tests, 35 issues, head-to-head
ado122/analyzer — 2,904 KB facts (Tests 1, 2, 4–5)
twentyhq/twenty — 31,045 KB facts (Test 3)
Testigo Recall — 31/35 issues found across all tests
T1: 7/8 · T2: 5/5 · T3: 2/2 · T4: 9/10 · T5: 8/10
Only tool with cross-module awareness

CodeRabbit Pro — 29/35 issues found across all tests
T1: 7/8 · T2: 5/5 · T3: 0/2 · T4: 9/10 · T5: 8/10
Strong paid reviewer — no cross-module awareness

GitHub Copilot — 28.5/35 issues found across all tests
T1: 8/8 · T2: 4.5/5 · T3: 0/2 · T4: 8/10 · T5: 8/10
Best free option, very verbose

Greptile ($30/dev/mo) — 22.5/35 issues found across all tests
T1: 4/8 · T2: 4.5/5 · T3: 0/2 · T4: 7/10 · T5: 7/10
Most expensive, fewest issues found
Test 1
PR #13 — Edge Case Gauntlet
4 files, 8 real bugs + 2 non-bugs (traps), ado122/analyzer
Real Bugs (8)
Rate limiter >= → > (off-by-one)
CSRF HMAC-SHA256 → base64 (security downgrade)
API timeout 15min → 30s (breaks analysis jobs)
SQL injection ×2 in new exportService.js
CSV injection in new exportService.js
Off-by-one pagination in new exportService.js
Path traversal in new exportService.js
Promise error swallowing
Traps (not real bugs): harmless log simplification (all tools correctly skipped) · loose equality == on two numbers (behaves identically to === here — Copilot and CodeRabbit flagged it as a false positive)
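For illustration, the rate-limiter off-by-one (bug 1) comes down to a single comparison operator. A minimal sketch — the function and parameter names are assumed, not taken from ado122/analyzer:

```javascript
// Hypothetical sketch of planted bug 1 (names assumed, not the real repo's).
// With limit = 100, the buggy `>` check lets the 101st request through,
// because a count of exactly 100 is not "greater than" the limit.
function isRateLimited(requestCount, limit) {
  // planted bug: `return requestCount > limit;`  (off-by-one)
  return requestCount >= limit; // correct: block once the limit is reached
}
```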
Results
| Bug | Testigo Recall | Copilot | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- |
| 1. Rate limiter >= → > | YES (KB) | YES | YES | ~ summary only |
| 2. CSRF HMAC → base64 | YES (KB) | YES | YES (critical) | YES |
| 3. API timeout 15min → 30s | YES (KB) | YES | YES | MISSED ("reasonable") |
| 4. SQL injection ×2 | YES (both) | YES | YES (critical) | YES |
| 5. CSV injection | YES | YES | YES (critical) | YES |
| 6. Off-by-one pagination | YES | YES | YES | ~ summary only |
| 7. Path traversal | MISSED | YES | MISSED | MISSED |
| 8. Promise error bug | YES | YES | YES | ~ summary only |
Scores
Copilot — 8/8, ~5 false positives (noisy)
CodeRabbit — 7/8, ~2 false positives
Testigo Recall — 7/8, 0 false positives + bonus CSRF timing-attack finding
Greptile — 4/8, 0 false positives
Key Observations
Copilot 8/8 — only tool to find the path traversal. Noisy though (~5 false positives).
Recall & CodeRabbit tied at 7/8 — both missed only path traversal. Recall had 0 false positives + a bonus CSRF timing attack finding.
Testigo Recall was the ONLY tool that caught the API timeout as a KB-grounded regression — Greptile called it "reasonable" because it lacks codebase context.
Test 2
PR #14 — Domain-Specific Logic Bugs
5 files, KB-heavy planted bugs — all are documented value/behavior changes, ado122/analyzer
Testigo Recall quoted exact facts: "JWT tokens expire after 7 days" (98%), "strategy: 6hr (21600000ms)", "Free: 1 channel", "AES-256-GCM with 16-byte IV". CodeRabbit found all 5 + a bonus runtime crash.
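To make the quoted facts concrete, here they are as constants (illustrative only — the identifier names are mine, not the repo's; the values are the documented ones a KB-grounded reviewer would flag if a diff silently changed them):

```javascript
// The quoted KB facts as concrete values (identifiers assumed for illustration).
const JWT_EXPIRY_DAYS = 7;                    // "JWT tokens expire after 7 days"
const STRATEGY_CACHE_MS = 6 * 60 * 60 * 1000; // "strategy: 6hr (21600000ms)"
const FREE_TIER_CHANNELS = 1;                 // "Free: 1 channel"
const AES_GCM_IV_BYTES = 16;                  // "AES-256-GCM with 16-byte IV"
```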
Test 3
PR #18108 — Real-World Billing Fix
Real PR from twentyhq/twenty (open-source CRM, 31K KB facts) — billing credits display fix, 3 files, +82/−28 lines. Not planted bugs — real issues found during review.
Issues Identified
Behavioral change: upTo raw value → toDisplayCredits() conversion (undocumented)
Cross-module side effect: shouldUpdateAtSubscriptionPeriodEnd compares in internal units — potential 1000× mismatch
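A minimal sketch of the 1000× mismatch, assuming internal credits are stored at 1000× the display unit — the names below are taken loosely from the findings and are assumptions, not the actual twentyhq/twenty code:

```javascript
// Assumed conversion factor between internal credit units and display credits.
const CREDITS_PER_DISPLAY_UNIT = 1000;

function toDisplayCredits(internalCredits) {
  return internalCredits / CREDITS_PER_DISPLAY_UNIT;
}

// The diff converted `upTo` to display units for the UI. A caller outside
// the diff still compares against thresholds in internal units, so feeding
// it the converted value is off by a factor of 1000.
function shouldUpdateAtSubscriptionPeriodEnd(credits, internalThreshold) {
  return credits >= internalThreshold; // correct only for internal-unit input
}

const internalUpTo = 500000;                        // internal units
const displayUpTo = toDisplayCredits(internalUpTo); // 500 display credits
// Passing displayUpTo against an internal threshold of 250000 returns false,
// even though the internal value 500000 is well above it.
```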
Results
| Issue | Testigo Recall | Copilot | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- |
| 1. Behavioral change (upTo conversion) | YES (KB, 100%) | MISSED | MISSED | MISSED |
| 2. Cross-module unit mismatch | YES (KB, 95%) | MISSED | MISSED | MISSED |
Scores
Testigo Recall — 2/2, both findings from the KB layer (required cross-file knowledge)
CodeRabbit — 0/2, 1 nitpick (test edge case), 0 real findings
Copilot — 0/2, "Reviewed 3 files, generated no comments"
Greptile — 0/2, "5/5 confidence — safe to merge with minimal risk"
Key Observations
All 3 competitors approved this PR with zero real findings — Copilot, CodeRabbit, and Greptile all concluded the change was safe to ship.
Testigo Recall found both issues by cross-referencing the diff against KB facts about billing-subscription-update.service.ts — a file not in the diff.
This test highlights the core value of KB-grounded review — cross-module side effects are invisible to tools that only analyze the diff in isolation.
Test 4
Key Observations
Recall & CodeRabbit tie at 9/10 — both catch the NoSQL injection and conditional rendering that Copilot and Greptile missed.
Copilot uniquely found string comparison (#10) — cross-file reasoning connecting MongoDB String schema to numeric comparison in UI.
Recall's KB layer found 6 bonus issues — missing tier gates, wrong API utility, wrong tier limit values, missing ownership checks. Things only a knowledge-base-aware tool can catch.
Recall's Haiku-powered semantic dedup merged 22 raw findings into 14 unique ones with zero false kills (no real finding was merged away).
Test 5
PR #19 — Scheduled Reports & Data Export (10 bugs)
11 files, 10 planted bugs (all different from Test 4), 1,485 lines added, ado122/analyzer
Planted Bugs
Missing CSRF on report routes (KB)
Path traversal in report file download
SSRF via webhook delivery URL
Insecure randomness (Math.random for share tokens)
IDOR — no ownership check on getById/delete/generate/download
Prototype pollution in deep merge
Missing await on async delete
ReDoS regex in template name validation
Info disclosure (stack trace in error response)
Off-by-one date range (29 days instead of 30)
Results
| Bug | Testigo Recall | Copilot | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- |
| 1. Missing CSRF | YES (KB) | YES | YES | YES |
| 2. Path traversal | YES | YES | YES (critical) | YES |
| 3. SSRF (webhook) | YES | YES | YES | YES |
| 4. Insecure randomness | YES | YES | YES | YES |
| 5. IDOR (4 endpoints) | YES (all 4) | YES (all 4) | YES (critical) | YES |
| 6. Prototype pollution | YES | MISSED | YES | MISSED |
| 7. Missing await | YES (KB) | YES | YES | YES |
| 8. ReDoS regex | MISSED | MISSED | MISSED | MISSED |
| 9. Info disclosure (stack trace) | YES | YES | YES | MISSED |
| 10. Off-by-one date (29 not 30) | MISSED | YES | MISSED | MISSED |
Scores
Testigo Recall — 8/10, 19 comments (9 critical, 0 false positives)
CodeRabbit — 8/10, 20 comments + full fix diffs
Copilot — 8/10, 33 comments (very verbose, lots of noise)
Greptile — 7/10, 10 comments (concise but shallow)
Key Observations
ReDoS universally missed — all 4 tools failed to identify the catastrophic backtracking in /^([a-zA-Z0-9]+[\s_-]*)+$/. ReDoS remains a blind spot across every reviewer tested.
Copilot is the only tool to catch the off-by-one date — "last30days" subtracting 29 instead of 30. Subtle logic bug that other tools overlooked.
Recall & CodeRabbit both caught prototype pollution while Copilot and Greptile missed it.
Recall's KB layer detected the missing CSRF by comparing against documented CSRF patterns, and additionally flagged a wrong Claude model ID, a wrong env var name, and webhooks missing an Enterprise tier gate.
82 total inline comments — Copilot (33), CodeRabbit (20), Recall (19), Greptile (10). Copilot is the noisiest; Greptile the most concise.
Recall's 2 misses (ReDoS and the off-by-one date) were the lowest-severity bugs — every critical security issue was caught.
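The ReDoS pattern deserves a closer look. The nested quantifier — an inner `+` group repeated by an outer `+` over overlapping character classes — is what makes backtracking explode on non-matching input such as `'a'.repeat(40) + '!'`. A rewrite without nested quantifiers (my own equivalent under an assumption about the intended grammar, not code from the PR) runs in linear time:

```javascript
// Vulnerable planted pattern: ( [a-zA-Z0-9]+ [\s_-]* )+ lets the engine
// split a run of letters between the inner `+` and the outer `+` in
// exponentially many ways before the match finally fails.
const vulnerable = /^([a-zA-Z0-9]+[\s_-]*)+$/;

// Equivalent rewrite (assumed intent: alphanumeric characters separated or
// trailed by spaces/underscores/hyphens) with no nested quantifier:
const safe = /^[a-zA-Z0-9](?:[\s_-]*[a-zA-Z0-9])*[\s_-]*$/;
```

Note the vulnerable pattern should never be tested against long non-matching input in production code paths; the rewrite accepts the same names without the blow-up.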
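The off-by-one date bug that only Copilot caught can be reproduced in a few lines (a sketch with assumed names — the real helper in the PR may differ):

```javascript
// "last30days" sketch: subtracting 29 yields a window covering only 29 full
// days before `end`; subtracting 30 gives the intended 30-day range.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function rangeStart(end, daysBack) {
  const start = new Date(end);
  start.setUTCDate(start.getUTCDate() - daysBack);
  return start;
}

function rangeLengthDays(start, end) {
  return Math.round((end - start) / MS_PER_DAY);
}

const end = new Date('2024-06-30T00:00:00Z');
const buggy = rangeStart(end, 29); // the planted bug: 29 instead of 30
const fixed = rangeStart(end, 30);
```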
Conclusions (across all 5 tests)
1. Testigo Recall leads on total detection — 31/35 overall. The only tool that finds cross-module side effects (Test 3) and KB-grounded regressions (wrong values, wrong patterns, wrong conventions).
2. CodeRabbit is the strongest paid competitor — 29/35, with excellent fix suggestions. But it's completely blind to cross-module issues (0/2 on Test 3, where Recall found both).
3. Copilot is the best free option — 28.5/35, and it occasionally catches unique bugs (path traversal in T1, string comparison in T4, off-by-one date in T5). Trade-off: extremely verbose (33 comments on T5 alone, many of them noise).
4. Greptile trails at $30/dev/mo — 22.5/35, consistently last across all 5 tests, underperforming the free Copilot on every one.
5. ReDoS is a universal blind spot — all 4 tools missed it in Test 5. AI code reviewers don't analyze regex computational complexity.
6. Test 3 demonstrates the KB advantage — on a real developer PR that all 3 competitors approved with zero findings, Testigo Recall found a potential 1000× unit mismatch in a file not even in the diff. Cross-module awareness is where KB-grounded review shines.
7. Recall complements existing tools. Copilot and CodeRabbit handle code-level bugs well; Recall adds the codebase-level layer — convention violations, KB regressions, cross-module side effects. Together they cover the full spectrum.