I wrote this for engineers and engineering leads who do code reviews regularly — especially those practicing agentic coding with tools like Claude Code, Cursor, or Copilot, where an AI agent writes and modifies code under human direction.
The runbook: Enterprise-Grade Codebase Review Runbook (GitHub, MIT license, open source)
The problem with every code review checklist I’ve found
Search for “code review checklist” and you’ll find two kinds of results.
The first kind has ten items. Check for bugs. Write tests. Use meaningful variable names. This is fine if you’re teaching a junior developer what code review is. It’s useless for a production system handling payments, storing health data, or serving millions of requests. It says nothing about secret scanning, dependency licensing, multi-tenancy isolation, observability, or disaster recovery.
The second kind is an enterprise standard with hundreds of items that apply to every project equally. SOC 2 controls. HIPAA safeguards. PCI-DSS compliance. Distributed tracing. SSO federation. All of it, for every project, every time. A team of three building an internal CLI tool gets the same checklist as a team of fifty building a financial trading platform.
The shallow checklist misses things that matter. The enterprise checklist buries what matters under things that don’t. Both fail for the same reason: they treat all projects the same.
And neither was written for a world where AI agents generate significant portions of the codebase. Agentic coding changes the review problem — code is produced faster, in larger volumes, and with different failure modes than purely human-written code. The review process needs to keep up.
The fix is tiering
The insight is simple: project complexity should determine review depth.
A weekend side project and a healthcare platform have fundamentally different risk profiles. Different user counts, different data sensitivity, different regulatory exposure, different blast radius when something goes wrong. The review process should reflect that.
I built an Enterprise-Grade Codebase Review Runbook — open source, MIT licensed — that works this way. It starts with a Project Complexity Assessment: a scoring system across five dimensions (scale, business criticality, data sensitivity, architecture complexity, operational requirements). Your score determines your tier:
- Tier 1 — Essential (~50 checks): Solo projects, prototypes, internal tools. The basics that apply to everything.
- Tier 2 — Standard (~150 checks): Small team projects, production apps, startups. Adds testing depth, performance, reliability.
- Tier 3 — Enterprise (~400 checks): Large teams, regulated industries, enterprise customers. Adds multi-tenancy, SSO, compliance, observability.
- Tier 4 — Mission-Critical (~900+ checks): Financial systems, healthcare, critical infrastructure. Everything.
Every check in the runbook is tagged with a tier marker. If you’re Tier 2, you review Tier 1 and Tier 2 items. You skip Tier 3 and 4. No guilt, no gaps — you’re reviewing exactly what’s appropriate for your project’s complexity.
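The runbook defines the actual scoring bands, but the mechanics are simple enough to sketch. Here's a minimal illustration of how a five-dimension assessment could map to a tier — the dimension names and the four tiers come from the runbook; the 1–5 scale and the cutoff totals below are my illustrative assumptions, not the runbook's real scoring.

```python
# Sketch of a Project Complexity Assessment. The five dimensions and four
# tiers are the runbook's; the 1-5 scale and cutoffs here are illustrative.

DIMENSIONS = (
    "scale",
    "business_criticality",
    "data_sensitivity",
    "architecture_complexity",
    "operational_requirements",
)

def assess_tier(scores: dict[str, int]) -> int:
    """Map per-dimension scores (1 = low .. 5 = high) to a review tier (1-4)."""
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(scores[d] for d in DIMENSIONS)  # ranges 5..25
    if total <= 9:
        return 1  # Essential
    if total <= 14:
        return 2  # Standard
    if total <= 19:
        return 3  # Enterprise
    return 4      # Mission-Critical

# Example: a small production SaaS handling moderately sensitive data.
print(assess_tier({
    "scale": 2,
    "business_criticality": 3,
    "data_sensitivity": 3,
    "architecture_complexity": 2,
    "operational_requirements": 2,
}))  # → 2
```

The useful property is that the output is a single number your whole team can agree on before the review starts, instead of arguing check-by-check about what applies.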
What it covers
The runbook spans 16 categories. Some you’d expect in a code review: architecture, code quality, testing, security, deployment. Others you might not:
- Secrets and credentials — not just “don’t hardcode passwords” but a systematic approach to scanning git history, validating CI/CD configs, and verifying preventive controls
- AI and human maintainability — whether code is structured for both human engineers and agentic coding tools to understand and modify effectively
- Multi-tenancy — data isolation, tenant configuration, tenant lifecycle management
- Licensing and legal — dependency license compatibility, attribution requirements, IP provenance
- Open source addendum — community documentation, security policies, contribution workflows, project health — a complete section for open source projects
- Industry-specific addenda — financial services, healthcare, e-commerce, government — each with targeted regulatory checks
There’s also a 15-minute Minimum Viable Review for when you need a quick health check: 15 essential items covering security, code health, and operations basics.
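The secrets-and-credentials category above calls for systematic scanning rather than eyeballing. As a toy illustration of the pattern-matching half, here's a sketch with a few example rules — real reviews should use a dedicated scanner such as gitleaks or trufflehog, which ship hundreds of rules plus entropy checks and scan full git history, not just current file contents.

```python
import re

# Illustrative patterns only; dedicated scanners are far more thorough.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_assignment": re.compile(
        r"(?i)(password|secret|api_key|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}

def find_secret_candidates(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) pairs for suspicious lines."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits

sample = 'password = "hunter2hunter2"\nconnect(host="db", port=5432)\n'
print(find_secret_candidates(sample))  # → [(1, 'generic_assignment')]
```

A scan like this is a detective control; the runbook also asks you to verify preventive controls, such as pre-commit hooks and CI checks that block new secrets from landing at all.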
Designed for synthesis coding and agentic workflows
Here’s where this connects to synthesis coding and synthesis engineering — the disciplined practice of building software through human-AI collaboration, where the human provides direction, judgment, and architectural decisions while the AI agent handles systematic execution. If you’re doing agentic coding with tools like Claude Code or Cursor, synthesis coding is the methodology that makes it effective rather than chaotic.
I built this runbook working with Claude Code, and I designed it to work naturally within agentic coding workflows. The checks are written as clear, evaluable statements — not vague guidance like “ensure code quality” but specific items like “No O(n^2) in hot paths” or “All network calls have timeouts configured.” This precision matters because it’s what makes the runbook machine-readable. An AI agent can evaluate each check against actual code and return concrete findings.
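The runbook itself is prose, but because each check is a discrete statement tagged with a tier, it's easy to model as data. A hypothetical sketch — the check statements below are ones quoted in this article, while the IDs and data structure are mine:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Check:
    id: str          # hypothetical identifier, for illustration
    tier: int        # lowest tier at which this check applies
    category: str
    statement: str   # a concrete, evaluable claim about the code

CHECKS = [
    Check("PERF-001", 2, "performance", "No O(n^2) in hot paths"),
    Check("REL-004", 2, "reliability", "All network calls have timeouts configured"),
    Check("MT-010", 3, "multi-tenancy", "Queries are scoped by tenant ID"),
]

def applicable(checks: list[Check], project_tier: int) -> list[Check]:
    """A Tier-N review runs every check tagged Tier N or below."""
    return [c for c in checks if c.tier <= project_tier]

for check in applicable(CHECKS, project_tier=2):
    print(check.id, "-", check.statement)
```

At Tier 2 this yields the performance and reliability checks and silently drops the multi-tenancy one — the "no guilt, no gaps" filtering from the tiering section, expressed as a one-line list comprehension.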
The workflow: you determine your project’s tier (human judgment — you understand your project’s context, users, and risk profile). Then you hand the runbook to your AI coding agent with the tier and your codebase. The agent systematically evaluates each applicable check, identifies specific file paths and line numbers, and reports findings with severity levels and remediation steps.
Human decides what matters. AI agent does the systematic checking. Human interprets and prioritizes the findings. This is synthesis coding applied to the meta-problem of code quality — the same direction-and-execution dynamic that makes agentic engineering work in general. The human sets direction; the agent covers ground.
The runbook also works without AI. The checklist is useful on its own — print it, work through it manually, use it in team review sessions. The agentic approach just makes it practical to cover hundreds of checks against a real codebase in minutes instead of days.
How to use it
1. Assess your project. Complete the Project Complexity Assessment at the top of the runbook. Score five dimensions, get your tier.
2. Run the review. Hand the runbook to your AI coding agent (or work through it yourself). Review only the items at your tier or below.
3. Act on findings. The runbook includes output format templates — simplified reports for Tier 1-2, detailed enterprise reports for Tier 3-4 — so findings are structured for action.
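To make "structured for action" concrete, here's a hypothetical sketch of a simplified Tier 1-2 report: findings carry a severity, a location, and a remediation, and the report orders them by severity. The field names and format are my illustration, not the runbook's actual templates.

```python
from collections import defaultdict

# Illustrative finding shape; the runbook ships its own report templates.
findings = [
    {"check": "TEST-012", "severity": "medium",
     "location": "tests/test_api.py",
     "summary": "Tests only verify imports, not behavior",
     "remediation": "Assert on actual response bodies"},
    {"check": "SEC-003", "severity": "high",
     "location": "config/settings.py:14",
     "summary": ".env file tracked in git",
     "remediation": "Remove from history and rotate the exposed values"},
]

def simplified_report(findings: list[dict]) -> str:
    """Tier 1-2 style: one line per finding, most severe first."""
    by_severity = defaultdict(list)
    for f in findings:
        by_severity[f["severity"]].append(f)
    lines = []
    for severity in ("critical", "high", "medium", "low"):
        for f in by_severity[severity]:
            lines.append(
                f"[{severity.upper()}] {f['check']} {f['location']}: {f['summary']}"
            )
    return "\n".join(lines)

print(simplified_report(findings))
```

Keeping locations and remediations in every finding is what turns a review from a verdict into a work queue.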
Prompts that work
Here are prompts I actually use with Claude Code. Each one demonstrates the synthesis coding pattern: human sets direction and scope, AI agent does the systematic work.
Tier assessment — let the agent score, but you decide:
Read the codebase review runbook at https://rajiv.com/open-source/codebase-review-runbook. Then assess this project against the Project Complexity Assessment. Score each dimension and recommend a tier. Explain your reasoning — I’ll confirm or override before we proceed.
The override matters. The agent doesn’t know your roadmap. A prototype today might be handling PII next quarter — that’s a human judgment call.
Full review at a specific tier:
Using the codebase review runbook, review this project at Tier 2 (Standard). For each category, list specific findings with file paths, line numbers, and severity. Skip Tier 3 and 4 checks entirely.
Explicitly stating the tier and telling it to skip higher tiers prevents the agent from over-reviewing. Without that constraint, it will flag enterprise concerns on your weekend project.
Focused review — one category deep:
Using the security section of the codebase review runbook, do a deep review of this project at Tier 2. Check every applicable item. Pay special attention to hardcoded secrets, .env files tracked in git, and AI tool configuration files (.cursorrules, .cursor/, .aider*, etc.) that might expose infrastructure details.
When you already know where the risk is, point the agent there instead of running the full 150-check sweep.
Delta review — trajectory, not snapshot:
I ran a code review of this project last month. The findings are in review-findings-jan.md. Run the same review again using the runbook at Tier 2, and for each finding from January, classify it as: fixed, partially fixed, still present, worse, or new. Summarize the trajectory at the top.
This is the most underrated use. A single review says “here are your problems.” A delta review says “here’s your direction.” Leadership needs the second one.
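The classification step that prompt asks for can be sketched mechanically. Here's a hypothetical diff of two findings lists, keyed by finding and location, with severities ranked so regressions surface as "worse" — real runs need fuzzier matching, since file paths and line numbers shift between reviews, and "partially fixed" still takes human judgment.

```python
def delta(previous: dict[str, str], current: dict[str, str]) -> dict[str, str]:
    """Classify findings across two reviews.

    Both dicts map a finding key (e.g. "SEC-003 @ settings.py") to a
    severity. Returns key -> one of: fixed, still_present, worse, new.
    """
    rank = {"low": 0, "medium": 1, "high": 2, "critical": 3}
    result = {}
    for key, old_sev in previous.items():
        if key not in current:
            result[key] = "fixed"
        elif rank[current[key]] > rank[old_sev]:
            result[key] = "worse"
        else:
            result[key] = "still_present"
    for key in current:
        if key not in previous:
            result[key] = "new"
    return result

january = {"SEC-003 @ settings.py": "high", "TEST-012 @ test_api.py": "medium"}
february = {"TEST-012 @ test_api.py": "high", "DEP-007 @ requirements.txt": "low"}
print(delta(january, february))
# SEC-003 fixed, TEST-012 worse, DEP-007 new
```

The summary at the top of the report is then just counts of each bucket over time — exactly the trajectory view leadership wants.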
What I learned using it
Since releasing this runbook, I’ve used it on real engagements — reviewing a security startup’s full-stack platform, upgrading three prototype repos to production quality. The tiering system held up in both cases. The runbook caught things that would have slipped past a manual review: subtle cryptographic choices, AI tool configuration files leaking infrastructure details, tests that looked comprehensive but only verified imports.
I also discovered something I hadn’t planned for. When you use the runbook on a codebase you own, the review categories map naturally to execution phases — build first, document second, restructure third, test fourth. A review checklist becomes an upgrade playbook. I’ll share those stories in companion articles.
Get it, use it, improve it
The full runbook is here: Enterprise-Grade Codebase Review Runbook
It’s MIT licensed and part of the Ragbot.AI open-source project. Contributions welcome — report issues, suggest additions, submit PRs. If you use it and find checks that are missing or mistiered, that feedback makes it better for everyone.
Updated February 2026: I’ve since used this runbook on real engagements and wrote about what I learned. See What reviewing real codebases taught me about code review (lessons from an advisory review) and When a code review runbook becomes an upgrade playbook (using the runbook to fix, not just find). Those experiences also drove the runbook to v2.1.
This article is part of the synthesis coding series.
Rajiv Pant is President of Flatiron Software and Snapshot AI, where he leads organizational growth and AI innovation. He is former Chief Product & Technology Officer at The Wall Street Journal, The New York Times, and Hearst Magazines. Earlier in his career, he headed technology for Condé Nast’s brands including Reddit. Rajiv coined the terms “synthesis engineering” and “synthesis coding” to describe the systematic integration of human expertise with AI capabilities in professional software development. Connect with him on LinkedIn or read more at rajiv.com.