Testigo Recall vs GitHub Copilot vs CodeRabbit vs Greptile — 5 tests, 35 issues, head-to-head
ado122/analyzer — 2,904 KB facts (Tests 1, 2, 4–5)
twentyhq/twenty — 31,045 KB facts (Test 3)
Testigo Recall — 31/35 issues found across all tests
T1: 7/8 · T2: 5/5 · T3: 2/2 · T4: 9/10 · T5: 8/10
Only tool with cross-module awareness

CodeRabbit Pro — 29/35 issues found across all tests
T1: 7/8 · T2: 5/5 · T3: 0/2 · T4: 9/10 · T5: 8/10
Strong paid reviewer — no cross-module awareness

GitHub Copilot — 28.5/35 issues found across all tests
T1: 8/8 · T2: 4.5/5 · T3: 0/2 · T4: 8/10 · T5: 8/10
Best free option, very verbose

Greptile ($30/dev/mo) — 22.5/35 issues found across all tests
T1: 4/8 · T2: 4.5/5 · T3: 0/2 · T4: 7/10 · T5: 7/10
Most expensive, fewest issues found
Test 1
PR #13 — Edge Case Gauntlet
4 files, 8 real bugs + 2 non-bugs (traps), ado122/analyzer
Real Bugs (8)
Rate limiter >= → > (off-by-one)
CSRF HMAC-SHA256 → base64 (security downgrade)
API timeout 15min → 30s (breaks analysis jobs)
SQL injection ×2 in new exportService.js
CSV injection in new exportService.js
Off-by-one pagination in new exportService.js
Path traversal in new exportService.js
Promise error swallowing
Traps (not real bugs): harmless log simplification (all tools correctly skipped) · loose equality == on two numbers (behaves identically to === here — Copilot and CodeRabbit flagged it as a false positive)
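For illustration, the rate-limiter off-by-one (bug 1) comes down to a single comparison operator. A minimal sketch — the function and parameter names are assumed, not taken from ado122/analyzer:

```javascript
// Hypothetical sketch of planted bug 1 (names assumed, not the real repo's).
// With limit = 100, the buggy `>` check lets the 101st request through,
// because a count of exactly 100 is not "greater than" the limit.
function isRateLimited(requestCount, limit) {
  // planted bug: `return requestCount > limit;`  (off-by-one)
  return requestCount >= limit; // correct: block once the limit is reached
}
```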
Results
| Bug | Testigo Recall | Copilot | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- |
| 1. Rate limiter >= → > | YES (KB) | YES | YES | ~ summary only |
| 2. CSRF HMAC → base64 | YES (KB) | YES | YES (critical) | YES |
| 3. API timeout 15min → 30s | YES (KB) | YES | YES | MISSED ("reasonable") |
| 4. SQL injection ×2 | YES (both) | YES | YES (critical) | YES |
| 5. CSV injection | YES | YES | YES (critical) | YES |
| 6. Off-by-one pagination | YES | YES | YES | ~ summary only |
| 7. Path traversal | MISSED | YES | MISSED | MISSED |
| 8. Promise error bug | YES | YES | YES | ~ summary only |
Scores
Copilot — 8/8, ~5 false positives (noisy)
CodeRabbit — 7/8, ~2 false positives
Testigo Recall — 7/8, 0 false positives + bonus CSRF timing-attack finding
Greptile — 4/8, 0 false positives
Key Observations
Copilot 8/8 — only tool to find the path traversal. Noisy though (~5 false positives).
Recall & CodeRabbit tied at 7/8 — both missed only path traversal. Recall had 0 false positives + a bonus CSRF timing attack finding.
Testigo Recall was the ONLY tool that caught the API timeout as a KB-grounded regression — Greptile called it "reasonable" because it lacks codebase context.
Test 2
PR #14 — Domain-Specific Logic Bugs
5 files, KB-heavy planted bugs — all are documented value/behavior changes, ado122/analyzer
Testigo Recall quoted exact facts: "JWT tokens expire after 7 days" (98%), "strategy: 6hr (21600000ms)", "Free: 1 channel", "AES-256-GCM with 16-byte IV". CodeRabbit found all 5 + a bonus runtime crash.
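To make the quoted facts concrete, here they are as constants (illustrative only — the identifier names are mine, not the repo's; the values are the documented ones a KB-grounded reviewer would flag if a diff silently changed them):

```javascript
// The quoted KB facts as concrete values (identifiers assumed for illustration).
const JWT_EXPIRY_DAYS = 7;                    // "JWT tokens expire after 7 days"
const STRATEGY_CACHE_MS = 6 * 60 * 60 * 1000; // "strategy: 6hr (21600000ms)"
const FREE_TIER_CHANNELS = 1;                 // "Free: 1 channel"
const AES_GCM_IV_BYTES = 16;                  // "AES-256-GCM with 16-byte IV"
```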
Test 3
PR #18108 — Real-World Billing Fix
Real PR from twentyhq/twenty (open-source CRM, 31K KB facts) — billing credits display fix, 3 files, +82/−28 lines. Not planted bugs — real issues found during review.
Issues Identified
Behavioral change: upTo raw value → toDisplayCredits() conversion (undocumented)
Cross-module side effect: shouldUpdateAtSubscriptionPeriodEnd compares in internal units — potential 1000× mismatch
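A minimal sketch of the 1000× mismatch, assuming internal credits are stored at 1000× the display unit — the names below are taken loosely from the findings and are assumptions, not the actual twentyhq/twenty code:

```javascript
// Assumed conversion factor between internal credit units and display credits.
const CREDITS_PER_DISPLAY_UNIT = 1000;

function toDisplayCredits(internalCredits) {
  return internalCredits / CREDITS_PER_DISPLAY_UNIT;
}

// The diff converted `upTo` to display units for the UI. A caller outside
// the diff still compares against thresholds in internal units, so feeding
// it the converted value is off by a factor of 1000.
function shouldUpdateAtSubscriptionPeriodEnd(credits, internalThreshold) {
  return credits >= internalThreshold; // correct only for internal-unit input
}

const internalUpTo = 500000;                        // internal units
const displayUpTo = toDisplayCredits(internalUpTo); // 500 display credits
// Passing displayUpTo against an internal threshold of 250000 returns false,
// even though the internal value 500000 is well above it.
```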
Results
| Issue | Testigo Recall | Copilot | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- |
| 1. Behavioral change (upTo conversion) | YES (KB, 100%) | MISSED | MISSED | MISSED |
| 2. Cross-module unit mismatch | YES (KB, 95%) | MISSED | MISSED | MISSED |
Scores
Testigo Recall — 2/2, both findings from the KB layer (required cross-file knowledge)
CodeRabbit — 0/2, 1 nitpick (test edge case), 0 real findings
Copilot — 0/2, "Reviewed 3 files, generated no comments"
Greptile — 0/2, "5/5 confidence — safe to merge with minimal risk"
Key Observations
All 3 competitors approved this PR with zero real findings — Copilot, CodeRabbit, and Greptile all concluded the change was safe to ship.
Testigo Recall found both issues by cross-referencing the diff against KB facts about billing-subscription-update.service.ts — a file not in the diff.
This test highlights the core value of KB-grounded review — cross-module side effects are invisible to tools that only analyze the diff in isolation.
Test 4
Key Observations
Recall & CodeRabbit tie at 9/10 — both catch the NoSQL injection and conditional rendering that Copilot and Greptile missed.
Copilot uniquely found string comparison (#10) — cross-file reasoning connecting MongoDB String schema to numeric comparison in UI.
Recall's KB layer found 6 bonus issues — missing tier gates, wrong API utility, wrong tier limit values, missing ownership checks. Things only a knowledge-base-aware tool can catch.
Recall's Haiku-powered semantic dedup merged 22 raw findings into 14 unique ones with zero false kills (no real finding was merged away).
Test 5
PR #19 — Scheduled Reports & Data Export (10 bugs)
11 files, 10 planted bugs (all different from Test 4), 1,485 lines added, ado122/analyzer
Planted Bugs
Missing CSRF on report routes (KB)
Path traversal in report file download
SSRF via webhook delivery URL
Insecure randomness (Math.random for share tokens)
IDOR — no ownership check on getById/delete/generate/download
Prototype pollution in deep merge
Missing await on async delete
ReDoS regex in template name validation
Info disclosure (stack trace in error response)
Off-by-one date range (29 days instead of 30)
Results
| Bug | Testigo Recall | Copilot | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- |
| 1. Missing CSRF | YES (KB) | YES | YES | YES |
| 2. Path traversal | YES | YES | YES (critical) | YES |
| 3. SSRF (webhook) | YES | YES | YES | YES |
| 4. Insecure randomness | YES | YES | YES | YES |
| 5. IDOR (4 endpoints) | YES (all 4) | YES (all 4) | YES (critical) | YES |
| 6. Prototype pollution | YES | MISSED | YES | MISSED |
| 7. Missing await | YES (KB) | YES | YES | YES |
| 8. ReDoS regex | MISSED | MISSED | MISSED | MISSED |
| 9. Info disclosure (stack trace) | YES | YES | YES | MISSED |
| 10. Off-by-one date (29 not 30) | MISSED | YES | MISSED | MISSED |
Scores
Testigo Recall — 8/10, 19 comments (9 critical, 0 false positives)
CodeRabbit — 8/10, 20 comments + full fix diffs
Copilot — 8/10, 33 comments (very verbose, lots of noise)
Greptile — 7/10, 10 comments (concise but shallow)
Key Observations
ReDoS universally missed — all 4 tools failed to identify the catastrophic backtracking in /^([a-zA-Z0-9]+[\s_-]*)+$/. ReDoS remains a blind spot across every reviewer tested.
Copilot is the only tool to catch the off-by-one date — "last30days" subtracting 29 instead of 30. Subtle logic bug that other tools overlooked.
Recall & CodeRabbit both caught prototype pollution while Copilot and Greptile missed it.
Recall's KB layer detected the missing CSRF by comparing against documented CSRF patterns, and additionally flagged a wrong Claude model ID, a wrong env var name, and webhooks missing an Enterprise tier gate.
82 total inline comments — Copilot (33), CodeRabbit (20), Recall (19), Greptile (10). Copilot is the noisiest; Greptile the most concise.
Recall's 2 misses (ReDoS and the off-by-one date) were the lowest-severity bugs — every critical security issue was caught.
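The ReDoS pattern deserves a closer look. The nested quantifier — an inner `+` group repeated by an outer `+` over overlapping character classes — is what makes backtracking explode on non-matching input such as `'a'.repeat(40) + '!'`. A rewrite without nested quantifiers (my own equivalent under an assumption about the intended grammar, not code from the PR) runs in linear time:

```javascript
// Vulnerable planted pattern: ( [a-zA-Z0-9]+ [\s_-]* )+ lets the engine
// split a run of letters between the inner `+` and the outer `+` in
// exponentially many ways before the match finally fails.
const vulnerable = /^([a-zA-Z0-9]+[\s_-]*)+$/;

// Equivalent rewrite (assumed intent: alphanumeric characters separated or
// trailed by spaces/underscores/hyphens) with no nested quantifier:
const safe = /^[a-zA-Z0-9](?:[\s_-]*[a-zA-Z0-9])*[\s_-]*$/;
```

Note the vulnerable pattern should never be tested against long non-matching input in production code paths; the rewrite accepts the same names without the blow-up.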
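The off-by-one date bug that only Copilot caught can be reproduced in a few lines (a sketch with assumed names — the real helper in the PR may differ):

```javascript
// "last30days" sketch: subtracting 29 yields a window covering only 29 full
// days before `end`; subtracting 30 gives the intended 30-day range.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function rangeStart(end, daysBack) {
  const start = new Date(end);
  start.setUTCDate(start.getUTCDate() - daysBack);
  return start;
}

function rangeLengthDays(start, end) {
  return Math.round((end - start) / MS_PER_DAY);
}

const end = new Date('2024-06-30T00:00:00Z');
const buggy = rangeStart(end, 29); // the planted bug: 29 instead of 30
const fixed = rangeStart(end, 30);
```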
Conclusions (across all 5 tests)
1. Testigo Recall leads on total detection — 31/35 overall. The only tool that finds cross-module side effects (Test 3) and KB-grounded regressions (wrong values, wrong patterns, wrong conventions).
2. CodeRabbit is the strongest paid competitor — 29/35, with excellent fix suggestions. But it's completely blind to cross-module issues (0/2 on Test 3, where Recall found both).
3. Copilot is the best free option — 28.5/35, and it occasionally catches unique bugs (path traversal in T1, string comparison in T4, off-by-one date in T5). Trade-off: extremely verbose (33 comments on T5 alone, many of them noise).
4. Greptile trails at $30/dev/mo — 22.5/35, consistently last across all 5 tests, underperforming the free Copilot on every one.
5. ReDoS is a universal blind spot — all 4 tools missed it in Test 5. AI code reviewers don't analyze regex computational complexity.
6. Test 3 demonstrates the KB advantage — on a real developer PR that all 3 competitors approved with zero findings, Testigo Recall found a potential 1000× unit mismatch in a file not even in the diff. Cross-module awareness is where KB-grounded review shines.
7. Recall complements existing tools. Copilot and CodeRabbit handle code-level bugs well; Recall adds the codebase-level layer — convention violations, KB regressions, cross-module side effects. Together they cover the full spectrum.