On the same day I used my code review runbook to review a client’s product, I used it on three internal prototype repos — and discovered a completely different use case.
The client review produced a report: findings, grades, recommendations. The prototypes needed more than a report. They needed the fixes implemented. The same runbook categories that identified problems also defined the fix scope. Review categories became execution phases. A read-only assessment tool became a write/fix playbook.
This wasn’t planned. But it turns out to be the runbook’s most powerful property.
The prototypes
A VP of Marketing, not an engineer, had built three agent prototypes using AI coding tools. Think of it as vibe coding by a domain expert: the code worked, the business logic was sophisticated, but the engineering fundamentals needed serious attention. TypeScript that wouldn’t compile. A 3,610-line single-file React component. Hardcoded paths to a developer’s desktop. Zero tests.
The goal: bring these repos to a standard where professional software engineers could take over development. Not just review and report — actually do the work.
Review categories become execution phases
The runbook has 15 review categories covering architecture, security, testing, code quality, deployment, and so on. When I ran the assessment, the categories naturally mapped to six execution phases:
Phase 1: Make it build. Nothing else matters until the code compiles. One repo had missing types, function signatures that didn’t match their callers, and local type definitions conflicting with framework globals. Another had hardcoded paths to /Users/dev/Desktop/... that broke on any other machine. A third had dependency conflicts that prevented npm run build from completing.
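The hardcoded-path fix can be sketched as an environment-variable lookup with a repo-relative fallback. This is a minimal sketch, not the prototypes' actual code; the variable name DATA_DIR and the fallback directory are illustrative.

```typescript
import * as path from "path";

// Before: a literal "/Users/dev/Desktop/..." path that breaks on any other machine.
// After: read from the environment, falling back to a path relative to the repo.
export function resolveDataDir(): string {
  return process.env.DATA_DIR ?? path.join(process.cwd(), "data");
}
```

The same change does double duty later: secrets and credentials get the same environment-variable treatment in the security phase.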
Phase 2: Make it documented. ARCHITECTURE.md, README.md, .env.example files. Once the code builds, document what it does before changing anything else. This creates a reference point for everything that follows.
Phase 3: Make it linted. ESLint, Prettier, consistent formatting. You can’t lint code that doesn’t compile, which is why this comes after Phase 1.
Phase 4: Make it structured. Monolith decomposition, module extraction, separation of concerns. You can’t restructure code you haven’t documented, which is why this comes after Phase 2.
Phase 5: Make it tested. Unit tests for pure functions, behavior verification, regression protection. You can’t meaningfully test code with 3,610 lines in a single file, which is why this comes after Phase 4.
Phase 6: Make it secure. Environment variables instead of hardcoded secrets, .env.example files, gitignore cleanup. Security is last not because it’s least important, but because earlier phases often eliminate security issues as a side effect (hardcoded paths become env vars in Phase 1, for instance).
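A minimal .env.example of the kind this phase leaves behind, committed so the next engineer knows which variables to set. The variable names here are illustrative, not the prototypes' actual ones.

```
# Copy to .env and fill in real values; .env itself stays gitignored.
DATA_DIR=./data
API_KEY=replace-me
```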
Each phase unlocks the next. Skip Phase 1 and you can’t do Phase 3. Skip Phase 4 and Phase 5 is painful. The ordering isn’t arbitrary — it’s a dependency chain.
The tiering system prevented over-engineering
These were internal prototypes used by a small team. Not customer-facing products, not handling PII, not subject to compliance requirements. The Project Complexity Assessment scored them at Tier 2 (Standard), which meant ~150 checks.
Without the tiering, I know what would have happened. I’d have added authentication. Database migration frameworks. CI/CD pipelines. Docker containerization. Multi-environment deployment configs. All reasonable for a production product, all pointless for internal tools that three people use.
The tier said: stop here. ESLint, Prettier, tests, documentation, clean architecture. Practical improvements an engineer can use on day one. Not infrastructure for a product that doesn’t exist yet.
The tiering system’s greatest value isn’t what it tells you to do. It’s what it prevents you from doing.
Respect the domain logic
Here’s a trap I almost fell into. When engineers see prototype code — inconsistent formatting, no tests, mixed patterns — the instinct is to dismiss it as “bad code” that needs rewriting.
These prototypes had impressive domain knowledge embedded in them. One had a multi-factor scoring model with weighted criteria and bias detection. Another had a full human-in-the-loop decision workflow with status transitions and approval chains. A third had a predictive analytics framework with population-level modeling.
The business logic was sophisticated and correct. A domain expert had spent months refining those algorithms. Rewriting them from scratch would have taken weeks of requirements gathering to reconstruct what was already working.
So I added an explicit constraint: no functionality changes, no feature additions, no scoring algorithm changes. Structural refactoring only. Treat the domain logic as an asset to preserve, not a liability to rewrite.
Before restructuring anything, I wrote tests that captured the current behavior — similarity calculations, input validation rules, state transitions. Hundreds of tests, all for pure functions with deterministic behavior.
Those tests served two purposes: regression protection during the upgrade, and executable documentation of how the scoring systems work. An engineer picking up any scoring function can read the test descriptions instead of reverse-engineering the implementation.
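The pattern can be sketched with a stand-in pure function. cosineSimilarity below is illustrative, not one of the repos' actual scoring helpers; the point is the shape of a characterization test: pure function in, deterministic assertion out, behavior pinned before any restructuring.

```typescript
// Characterization-test target: a pure, deterministic scoring helper.
// Pin its current behavior with assertions before moving or refactoring it.
export function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  const denom = norm(a) * norm(b);
  // Identical vectors score 1, orthogonal vectors score 0,
  // and a zero vector is defined to score 0 rather than divide by zero.
  return denom === 0 ? 0 : dot / denom;
}
```

Each assertion's failure message then doubles as documentation of the scoring behavior, which is exactly the executable-documentation property described above.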
The single highest-impact change
Across all six phases, one change created more engineering value than everything else combined: decomposing the monolith.
The main component file was 3,610 lines. The entire UI in a single file, with 24 useState hooks, inline mock data, deeply nested conditional rendering, and alert() calls for user feedback. It worked. It was also impossible for two engineers to modify simultaneously without merge conflicts on every commit.
I decomposed it into 17 focused modules: a mock data layer, custom hooks, shared components, view-specific components, modals, and a notification system replacing alert(). The main file went from 3,610 lines to 212.
Every other improvement — linting, testing, documentation — helps the current developers work faster. Decomposition changes who can work on the code at all. Before the split, one person could modify the UI at a time. After, one engineer can work on the scoring workflow while another modifies the audience brief modal, and they’ll never touch the same file.
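As a sketch of what one extraction looks like: a weighted-score helper of the kind that lived inline in the component becomes its own module. The Criterion shape and the function below are illustrative, not the actual scoring model.

```typescript
// scoring/weightedScore.ts: pure logic extracted from the monolith component.
// Once it lives in its own module, one engineer can change scoring while
// another changes the UI, without either touching the same file.
export interface Criterion {
  weight: number; // relative importance; weights need not sum to 1
  score: number;  // 0..1 rating for this criterion
}

export function weightedScore(criteria: Criterion[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0; // avoid dividing by zero on empty input
  return criteria.reduce((sum, c) => sum + c.weight * c.score, 0) / totalWeight;
}
```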
If you’re evaluating a codebase for upgrade, start with the largest file. It’s almost always the highest-impact refactoring target. I’d flag anything over 500 lines as a decomposition candidate and anything over 1,000 lines as a decomposition requirement.
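That heuristic is easy to automate. A minimal sketch using the 500-line and 1,000-line thresholds from above; the extension list and skipped directories are assumptions, adjust for your stack.

```typescript
import * as fs from "fs";
import * as path from "path";

type Flagged = { file: string; lines: number; verdict: "candidate" | "required" };

// Walk a source tree and flag decomposition targets by line count:
// over 500 lines is a candidate, over 1,000 lines a requirement.
export function flagLargeFiles(
  root: string,
  exts = [".ts", ".tsx", ".js", ".jsx"]
): Flagged[] {
  const results: Flagged[] = [];
  const walk = (dir: string) => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        // Skip dependency and VCS directories.
        if (entry.name !== "node_modules" && entry.name !== ".git") walk(full);
      } else if (exts.includes(path.extname(entry.name))) {
        const lines = fs.readFileSync(full, "utf8").split("\n").length;
        if (lines > 500) {
          results.push({ file: full, lines, verdict: lines > 1000 ? "required" : "candidate" });
        }
      }
    }
  };
  walk(root);
  return results.sort((a, b) => b.lines - a.lines); // largest file first
}
```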
What scales and what doesn’t
Working across three repos revealed a clear split.
Phases 1-3 (build fixes, documentation, tooling) and Phase 6 (security cleanup) followed the same template across all three repos. Same ESLint config pattern. Same ARCHITECTURE.md structure. Same .env.example convention. These were highly parallelizable — I could apply the same approach to each repo without customization.
Phase 4 (decomposition) was unique to the repo with the 3,610-line component. Phase 5 (testing) required understanding each repo’s specific scoring algorithms, data structures, and edge cases. These couldn’t be templated.
When you’re upgrading multiple repos, plan accordingly. Infrastructure tasks (build, docs, tooling, security) follow repeatable patterns and can run in parallel. Domain tasks (decomposition, testing) require focused analysis per repo. Don’t expect to template everything.
Framing matters
After six phases of work across three repos, the total scope of changes was substantial: new files, deleted code, restructured modules, 286 tests, full documentation. The right framing for the stakeholder was: “This is an overhaul, close to a complete rewrite. We preserved all business logic, but integrations may be broken. The next step is testing with the domain expert.”
If you frame a major upgrade as “some cleanup,” the stakeholder expects everything to work perfectly. If you’re honest that it’s a near-rewrite with preserved logic, they prepare for a testing phase. The second framing is more accurate and sets up productive next steps rather than defensive finger-pointing.
When the runbook is used for upgrades, always plan a testing phase with the domain expert as the final step. The engineer who wrote the upgrade understands the code. The domain expert understands the requirements. Both are needed.
The runbook’s hidden property
I built the runbook to find problems. I didn’t expect it to define solutions. But when you own the fix, the same 15 categories that structure an assessment also structure the remediation. The review phase doesn’t produce throwaway artifacts — the findings become the work plan.
That’s a useful property for any tool to have. An assessment that doubles as an action plan means you never have to translate findings into tasks. The categories are the tasks.
If you’ve been using the code review runbook for assessment only, try it on a codebase you own. Run the assessment, then work through the findings as an execution plan. Phase the work by dependency order: build, document, lint, structure, test, secure. Let the tiering system tell you where to stop.
You might find, as I did, that the review tool is more useful as a fix tool.
This article is part of the code review series. See also: Code review that scales (introducing the runbook) and What reviewing real codebases taught me about code review (lessons from advisory reviews).
Rajiv Pant is President of Flatiron Software and Snapshot AI, where he leads organizational growth and AI innovation. He is former Chief Product & Technology Officer at The Wall Street Journal, The New York Times, and Hearst Magazines. Earlier in his career, he headed technology for Condé Nast’s brands including Reddit. Rajiv coined the terms “synthesis engineering” and “synthesis coding” to describe the systematic integration of human expertise with AI capabilities in professional software development. Connect with him on LinkedIn or read more at rajiv.com.