How I Write Software With LLMs: A Practical Multi-Model Workflow
On March 10, 2026, Stavros published what may be the most honest and practical guide to building software with large language models. Not a hype piece. Not a "look what the AI made in 10 minutes" demo. A real workflow, tested across multiple shipped projects, with clear-eyed acknowledgment of where it works and where it breaks.
The post matters because most writing about AI-assisted development falls into two camps: breathless enthusiasm or dismissive skepticism. Stavros lands somewhere more useful — he uses LLMs heavily, enjoys it, and still describes the failure modes with precision.
The Starting Point: Making Things, Not Programming
Stavros opens with a distinction that reframes the entire conversation. He doesn't care about programming as an end in itself. He cares about making things. LLMs changed the equation by making programming feel closer to direct construction — less time fighting syntax, more time shaping what the software actually does.
This matters because it shifts the value proposition. If your goal is the craft of coding, LLMs might feel like cheating. If your goal is shipping working software, LLMs are a force multiplier. Stavros falls firmly in the second camp, and his workflow reflects that priority.
He has used this approach to build and maintain several projects: a personal assistant called Stavrobot, a voice-note recording device, an art-clock project, and a small town simulation called Pine Town. These aren't toy demos. They're maintained, evolving codebases with real users and real requirements.
The Three-Model Architecture
The core of Stavros's workflow is a separation of concerns across three model roles. Each role uses a different model, selected for the specific task. This isn't about finding the "best" model — it's about matching model strengths to workflow stages.
1. The Planning Model (Architect)
The first model acts as the architect. Stavros spends up to 30 minutes in conversation with this model before any code is written. The discussion covers goals, trade-offs, edge cases, and architectural decisions. The key instruction he gives: don't start implementing until I explicitly approve the plan.
This constraint is critical. LLMs are eager to generate code. Without a hard gate, the planning conversation slides into implementation before the design is settled. By enforcing a "no code until approved" rule, Stavros keeps the thinking separate from the typing.
The planning model needs to be strong at reasoning and trade-off analysis. This is where you want the most capable model in your rotation. A weak planning phase creates downstream problems that no amount of implementation skill can fix.
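The post describes this gate in prose only, but the loop is easy to picture. Below is a minimal sketch of one way to encode it; `call_model` is a stub standing in for whatever LLM client you use, and every name here is illustrative rather than taken from Stavros's actual setup.

```python
# Sketch of the "no code until approved" planning gate. The model iterates
# on the plan, and nothing moves downstream until a human signs off.

PLANNING_PROMPT = (
    "You are the architect. Discuss goals, trade-offs, and edge cases. "
    "Do not write implementation code until the plan is explicitly approved."
)

def call_model(system_prompt: str, conversation: list[str]) -> str:
    """Stub standing in for a real LLM API call."""
    return f"Proposed plan (revision {len(conversation) + 1}): ..."

def planning_session(approve, max_rounds: int = 10) -> str:
    """Loop until the human approves; only then hand the plan downstream."""
    conversation: list[str] = []
    for _ in range(max_rounds):
        plan = call_model(PLANNING_PROMPT, conversation)
        conversation.append(plan)
        if approve(plan):  # the hard gate: explicit human sign-off
            return plan
    raise RuntimeError("No plan approved within the round limit")
```

The point of the structure, not the stub, is what matters: approval is a separate, explicit step, so the planning conversation cannot quietly slide into implementation.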
2. The Development Model (Implementer)
Once the plan is approved, a cheaper model handles implementation. This model gets limited leeway — it executes the plan, not redesigns it. The instructions are specific: follow the architecture, implement the described changes, don't introduce new patterns without asking.
Using a cheaper model here serves two purposes. First, it's cost-effective for the high-volume token generation that implementation requires. Second, constraining the implementation model reduces the risk of creative deviations from the agreed architecture.
Stavros is explicit about the leeway constraint. An implementation model with too much freedom will "improve" your architecture in ways that break the coherence of the overall design. The plan is the contract. The implementer's job is to fulfill it, not renegotiate it.
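One way to make that contract concrete is to bake the constraints into the prompt the implementer receives. The sketch below is an assumption about how such a hand-off could look, not a quote from the post:

```python
# Sketch of handing the approved plan to a cheaper implementation model
# with explicitly limited leeway. The wording is illustrative.

def implementer_prompt(approved_plan: str) -> str:
    """Wrap the approved plan in a contract that forbids redesign."""
    return (
        "You are the implementer. Follow this plan exactly.\n"
        "Rules:\n"
        "- Implement only the changes the plan describes.\n"
        "- Do not introduce new patterns or abstractions without asking.\n"
        "- If the plan is ambiguous, stop and ask; do not improvise.\n\n"
        f"APPROVED PLAN:\n{approved_plan}"
    )
```

Because the plan is embedded verbatim, any deviation in the output can be checked against a single source of truth rather than a half-remembered conversation.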
3. The Review Models (Multiple Reviewers)
After implementation, Stavros runs the code through multiple reviewer models. He specifically mentions using Codex, Gemini, and Opus for review. The diversity matters — different models catch different issues.
One model might flag performance problems. Another might catch edge cases in error handling. A third might notice inconsistencies in naming or API design. Using a single reviewer gives you one perspective. Using multiple reviewers creates overlapping coverage.
This mirrors a principle from human code review: a single reviewer, no matter how skilled, has blind spots. The same is true for models. Diversifying reviewers is cheap insurance against missing issues that any individual model overlooks.
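The fan-out itself is simple to orchestrate. In this sketch the reviewer names match the models mentioned in the post, but `review_with` is a hypothetical stub rather than a real API:

```python
# Sketch of fanning a diff out to several reviewer models and keeping
# each model's findings separate, so overlaps and disagreements stay visible.

REVIEWERS = ["codex", "gemini", "opus"]

def review_with(model: str, diff: str) -> list[str]:
    """Stub: a real version would send the diff to the named model."""
    return [f"{model}: no blocking issues found"]

def multi_review(diff: str) -> dict[str, list[str]]:
    """Collect per-reviewer findings instead of merging them into one list."""
    return {model: review_with(model, diff) for model in REVIEWERS}
```

Keeping findings grouped by reviewer preserves the signal the diversity provides: an issue flagged by two of three models deserves different attention than one flagged by a single model.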
The Human Writes the Agent Instructions
One detail that separates Stavros's approach from the "let AI do everything" crowd: he writes agent instructions by hand. He does not ask an LLM to generate its own skill file or configuration. The human defines the constraints. The models execute within them.
This is a deliberate choice with a clear rationale. When you ask an LLM to write its own instructions, it optimizes for what it thinks you want to hear, not for what actually works. The generated instructions tend to be verbose, generic, and subtly misaligned with the actual use case. Hand-written instructions are shorter, more specific, and more honest about what the model should and shouldn't do.
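For a sense of scale, a hand-written instruction file in this spirit can be only a few lines. The example below is purely illustrative, not Stavros's actual file:

```text
You implement to an approved plan. Follow it exactly.
- Python 3.12, standard library only unless the plan says otherwise.
- Match the existing naming and error-handling patterns.
- No new abstractions or dependencies without asking first.
- If the plan is ambiguous, stop and ask.
```

Note what is absent: no personality, no pep talk, no generic "write clean, well-documented code" filler. Every line is a constraint the author actually cares about.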
Where It Works Well
The multi-model workflow works best when Stavros already understands the technology stack he's working with. When he knows the frameworks, the patterns, and the expected behavior, he can evaluate the model's output against a clear mental model. He can spot when the implementation deviates from the plan. He can judge whether the reviewer's suggestions are actually improvements.
In this context, LLMs function as a productivity multiplier. The human provides the judgment and architectural vision. The models provide the speed and the mechanical implementation. The division of labor works because each side contributes what it does best.
Where It Breaks Down
Stavros is candid about the limitation: the workflow works much less well in unfamiliar territory. When you don't understand the technology deeply, you can't effectively evaluate the model's output. Bad decisions compound. The codebase accumulates technical debt that the human doesn't recognize until it's too late.
This is the real operational risk, not the individual bugs or hallucinations that people typically worry about. A single bad output is easy to fix. A series of subtly wrong architectural decisions, accepted because the human couldn't evaluate them, creates a codebase that's expensive to repair.
The compounding design mistake problem is worse than it sounds. Early decisions set patterns that later code follows. If the initial architecture is weak because the human accepted the model's suggestions without sufficient understanding, every subsequent addition reinforces those weaknesses. By the time problems become visible, the cost of correction can exceed the time saved by using LLMs in the first place.
The Shifting Role of the Developer
Stavros's workflow points to a broader trend in software development. The human role is moving from line-level coding to architecture-level oversight. The day-to-day work shifts from writing functions to defining constraints, evaluating trade-offs, and maintaining the coherence of the overall design.
This doesn't make the human less important; it makes human judgment more central. Writing a function is a bounded task with clear success criteria. Defining the right architecture for a project requires understanding requirements, anticipating future needs, and making trade-offs that no model can fully evaluate.
The developers who will thrive in this environment aren't the ones who can code the fastest. They're the ones who can define clean constraints, recognize weak abstractions early, and run a disciplined multi-model process without letting any single model dominate the design.
Practical Takeaways
If you're considering adopting a similar workflow, here are the principles that make Stavros's approach work:
Separate planning from implementation. Don't let the model start coding until the design is settled. The 30 minutes spent in planning conversation saves hours of rework.
Use different models for different roles. The best reasoning model isn't necessarily the best implementer. Match model strengths to workflow stages.
Diversify your reviewers. Multiple models catch different issues. A single reviewer, human or AI, has blind spots.
Write your own instructions. Don't ask the model to define its own constraints. The human sets the rules; the model follows them.
Stay in familiar territory. Use LLMs to amplify skills you already have. In unfamiliar domains, invest in learning before automating.
Watch for compounding errors. If early decisions feel wrong, investigate before building on top of them. Technical debt from misunderstood architecture is the most expensive kind.
The post is worth reading in full — it's dense with practical insight and refreshingly free of hype. Stavros doesn't claim LLMs will replace developers. He claims they change what developers do, and he shows exactly how he navigates that change in practice.