Where this started

A few months ago, I wrote about using Claude for Chrome as an ad-hoc QA testing assistant. The experiment was simple: give a browser agent detailed testing instructions, watch it click through a complex web application, and see if it could generalize across similar UI patterns without being explicitly told how each one worked.

It worked better than expected. The agent navigated sidebars, expanded table rows, tested filters and search functions, and even produced a nicely formatted report. More importantly, when I told it to "repeat this testing process for the remaining two tabs," it figured out what that meant and adapted to different labels and content without additional hand-holding.

But that experiment had limits. It was ad-hoc. Manual. One test at a time, with me watching the browser. Useful for exploratory testing, but not something that could run autonomously or scale across an entire application.

So I built something bigger. QA Automation Studio is a full application that takes the core idea—AI-powered test script generation and execution—and turns it into a production-ready tool. It generates UI guides from application code, creates YAML testing scripts, executes them with Playwright, produces detailed reports, and can run in batch mode for comprehensive regression testing.

Here's the interesting part: I'm not a developer. I built this entire application using Claude Code and plain English conversations.

What vibe coding actually means

The term "vibe coding" gets thrown around a lot, usually to describe the experience of working with AI coding assistants. But there's a meaningful distinction between using AI to help you write code and using AI to build an application while you focus on what the application should do.

Traditional development, even with AI assistance, still requires you to think in terms of functions, classes, data structures, and implementation details. You're the programmer. The AI is a faster way to type.

Vibe coding flips that relationship. You describe what you want in terms of outcomes and behaviors. The AI handles implementation. You review, test, and iterate—but you're operating as a product manager, not a developer.

This isn't about not understanding code. I can read Python. I understand what Playwright does. I know how Flask works. But I'm not writing functions or debugging syntax errors. I'm describing features, reviewing results, and making decisions about what to build next.

KEY DISTINCTION

Vibe coding isn't "AI writes code for me." It's "I specify the product, AI implements it, and I verify it works." The human stays in the loop as the decision-maker, not the typist.

The development environment

The entire application was built using Claude Code running in my terminal. Claude Code is Anthropic's command-line coding assistant that can create files, run commands, execute tests, and interact with your development environment directly.

But the real power came from Claude Code's agent system. Agents are specialized configurations that give Claude specific expertise and behaviors for particular tasks. I used four agents throughout development:

code-reviewer — Analyzes code for bugs, security issues, and adherence to best practices. I'd run this after major feature additions to catch problems early.

code-optimizer — Looks for performance improvements, redundant code, and opportunities to simplify. Particularly useful after getting a feature working but before considering it "done."

security-reviewer — Specifically focused on security vulnerabilities, injection risks, authentication issues, and data handling problems. Critical for an application that interacts with external AI APIs and executes browser automation.

document-writer — Generates documentation, README files, and inline comments. Used to create user guides and technical documentation as features were completed.

The workflow looked like this: describe a feature in plain English, let Claude implement it, run the appropriate agent to review the work, iterate based on findings, then move to the next feature. Multiple streams of development ran in parallel, all coordinated through natural language.

What QA Automation Studio does

Before diving into how I built it, let me explain what the application actually does. Understanding the end product helps contextualize the development decisions.

QA Automation Studio is a two-part application:

Part 1: Manual Mode — An interactive interface where QA engineers generate, review, and execute test scripts with AI assistance. This is the primary workflow for developing new tests and validating application changes.

Part 2: Batch Mode — Autonomous execution of approved test scripts in the background. This enables 100% QA coverage across all functions every time a feature or application is released.

The manual mode has five major capabilities:

1. Multiple script generation inputs

Users can feed the system information about the application to be tested in several ways:

  • Screenshot upload — Upload existing screenshots from the application
  • Built-in screenshot capture — Take screenshots directly from the app with cropping, custom naming, and metadata tagging
  • Playwright code capture — Crawl the application website and capture underlying CSS styles, components, and controls

The Playwright capture is particularly powerful. It extracts a JSON representation of all interactive elements, their selectors, states, and behaviors. This JSON gets sent to the AI, which generates an application-specific UI guide.

2. AI-generated UI guides

The UI guide is the bridge between the application's actual implementation and the test scripts. It contains:

  • Component registry with exact selectors
  • YAML step examples for each component type
  • Interaction patterns and required follow-up actions
  • State detection rules

When the application changes, the AI can compare baseline JSON to new JSON and update the UI guide with differences. It can also generate new baselines for applications not yet tested.
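To make the baseline comparison concrete, here is a minimal sketch of diffing two captures before handing the differences to the AI. It assumes each capture is a list of element records keyed by a `selector` field; that shape, and the names `index_by_selector` and `diff_captures`, are illustrative assumptions rather than the application's actual format.

```python
def index_by_selector(capture):
    """Map selector -> element record for one captured snapshot."""
    return {el["selector"]: el for el in capture}


def diff_captures(baseline, current):
    """Return added, removed, and changed selectors between two captures."""
    old, new = index_by_selector(baseline), index_by_selector(current)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(s for s in old.keys() & new.keys() if old[s] != new[s])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical sample data, not taken from a real capture.
baseline = [
    {"selector": "#save", "tag": "button", "text": "Save"},
    {"selector": "#cancel", "tag": "button", "text": "Cancel"},
]
current = [
    {"selector": "#save", "tag": "button", "text": "Save changes"},
    {"selector": "#export", "tag": "button", "text": "Export"},
]
print(diff_captures(baseline, current))
# {'added': ['#export'], 'removed': ['#cancel'], 'changed': ['#save']}
```

Sending only the three short lists to the model, rather than both full captures, keeps the update prompt small and focused on what actually changed.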

3. Script generation with review

The script generation agent uses the UI guide to create YAML testing scripts. But unlike a black-box generation process, the user can:

  • Review each step individually
  • Make AI-assisted changes to specific steps
  • Edit the YAML directly
  • Delete steps that aren't needed
  • Insert new actions using a library of available actions

Before execution, a final confirmation screen shows the complete YAML script, all available actions for easy insertion, and configuration options: target URL, step delay, page delay, browser visibility, and manual login toggle.

4. Execution with real-time feedback

The QA execution agent follows the script using Playwright. It opens a browser, pauses for user login (if manual login is enabled), then executes each step: performing the action, validating the result, and capturing a screenshot.
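The perform, validate, screenshot cycle for a single step might look roughly like the sketch below. It assumes `page` is a Playwright sync `Page` and uses only `page.click`, `page.fill`, `page.is_visible`, and `page.screenshot`. The `action_type`, `target`, and `selector_hints` fields mirror the YAML in this article; `value`, `expect_visible`, and the screenshot naming are hypothetical.

```python
import time


def run_step(page, step, index, step_delay=0.0):
    """Perform one step, check its expectation, and screenshot the outcome."""
    selector = step["selector_hints"][0]
    if step["action_type"] == "click":
        page.click(selector)
    elif step["action_type"] == "fill":
        page.fill(selector, step["value"])
    else:
        raise ValueError(f"unsupported action_type: {step['action_type']}")

    # Validation in this sketch is only "the expected element is visible".
    expected = step.get("expect_visible")
    passed = page.is_visible(expected) if expected else True

    shot = f"step_{index:03d}.png"
    page.screenshot(path=shot)
    time.sleep(step_delay)  # configurable pause between steps
    return {"step": index, "target": step["target"], "passed": passed, "screenshot": shot}
```

Because the function only depends on the methods it calls, it can be exercised with a small fake page object in unit tests before it ever touches a real browser.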

The application updates dynamically, showing each step as it completes. You can watch the test run in real time or let it execute while you do other work.

5. Reporting and iteration

When execution completes, the app generates a detailed HTML report. Each step shows:

  • The action taken
  • Expected vs. actual results
  • Pass/fail status
  • Corresponding screenshot
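A report with those four fields per step can be produced with very little code. The sketch below is one possible shape, assuming each result is a dict with `action`, `expected`, `actual`, `passed`, and `screenshot` keys (hypothetical names). Every dynamic value goes through `html.escape`, since step text and page content should not be trusted as markup.

```python
from html import escape


def render_report(title, results):
    """Build a minimal HTML report; every dynamic value is escaped."""
    rows = []
    for r in results:
        status = "PASS" if r["passed"] else "FAIL"
        rows.append(
            "<tr>"
            f"<td>{escape(r['action'])}</td>"
            f"<td>{escape(r['expected'])}</td>"
            f"<td>{escape(r['actual'])}</td>"
            f"<td class='{status.lower()}'>{status}</td>"
            f"<td><img src='{escape(r['screenshot'], quote=True)}' width='240'></td>"
            "</tr>"
        )
    passed = sum(1 for r in results if r["passed"])
    return (
        f"<html><body><h1>{escape(title)}</h1>"
        f"<p>{passed} of {len(results)} steps passed</p>"
        "<table><tr><th>Action</th><th>Expected</th><th>Actual</th>"
        "<th>Status</th><th>Screenshot</th></tr>"
        + "".join(rows)
        + "</table></body></html>"
    )
```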

Failed steps can be corrected directly from the report, and you can re-run individual steps or the entire test. If every step passes, you can generate a user acceptance testing (UAT) script: a human-readable document with estimated completion time, step counts, checkboxes, and specific instructions for manual verification.

All artifacts from any test run can be downloaded for documentation or compliance purposes.

Building the foundation

I started with a clear architectural decision: Python with Flask for the backend, HTML/JavaScript for the frontend, and Playwright for browser automation. I chose this stack because it's well documented, Claude Code handles it reliably, and it's straightforward to deploy.

The first conversation with Claude Code went something like this:

MY INSTRUCTION

"I want to build a QA automation application. It needs a web interface where users can upload screenshots, generate test scripts from those screenshots using AI, review and edit the scripts, then execute them against a target application using Playwright. Start with the basic Flask structure and a simple upload page."

Within minutes, I had a working Flask application with file upload functionality. Not production-ready, but functional. That's the vibe coding pattern: get something working, then iterate.

The next instruction was about AI integration:

MY INSTRUCTION

"Add the ability to send uploaded screenshots to an AI API and get back a YAML test script. Support Claude, ChatGPT, and Gemini as options. The user should be able to select which AI to use at the start of their workflow."

This is where the multi-model support came from. Not because I wanted to build a comparison tool, but because different users have different preferences and API access. Making it selectable was a product decision, not a technical one.
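One common way to keep the model choice a product decision is a small provider registry, so the rest of the workflow never needs to know which vendor is behind a request. The sketch below is an assumption about how that could be structured. The provider functions are stubs that return canned YAML, and names such as `PROVIDERS`, `register`, and `generate_script` are hypothetical.

```python
from typing import Callable, Dict

# Each provider is a function: (prompt, image_bytes) -> YAML text.
Provider = Callable[[str, bytes], str]
PROVIDERS: Dict[str, Provider] = {}


def register(name):
    """Decorator that adds a provider function to the registry."""
    def decorator(fn):
        PROVIDERS[name] = fn
        return fn
    return decorator


@register("claude")
def call_claude(prompt, image):
    # Stub: a real version would call the vendor's SDK here.
    return "- action_type: click\n  target: stub from claude\n"


@register("chatgpt")
def call_chatgpt(prompt, image):
    return "- action_type: click\n  target: stub from chatgpt\n"


@register("gemini")
def call_gemini(prompt, image):
    return "- action_type: click\n  target: stub from gemini\n"


def generate_script(provider_name, prompt, image):
    """Dispatch to whichever provider the user selected for this workflow."""
    try:
        provider = PROVIDERS[provider_name]
    except KeyError:
        raise ValueError(
            f"unknown provider {provider_name!r}; choose from {sorted(PROVIDERS)}"
        ) from None
    return provider(prompt, image)
```

Adding a fourth model then means writing one more decorated function, with no changes to the upload or review pages.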

The UI guide breakthrough

Early testing revealed a problem. The AI could look at screenshots and generate test scripts, but the selectors were often wrong. It would guess at CSS selectors based on visual appearance, and those guesses frequently failed when Playwright tried to execute them.

The solution came from a separate project where I'd been working on optimizing UI guides for AI-generated test scripts. The insight: if you give the AI the actual HTML structure and selectors from the application, it generates dramatically better scripts.

So I added the Playwright capture feature. Instead of just screenshots, the system could crawl the application and extract:

  • All interactive elements and their exact selectors
  • Component hierarchies and parent-child relationships
  • Available states (expanded/collapsed, active/inactive, etc.)
  • CSS classes and data attributes

This gets converted to JSON, which the AI uses to generate a UI guide. The guide includes YAML examples for every component type, so the script generation agent has concrete patterns to follow.
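A capture along these lines can be sketched with Playwright's sync API. This is an illustration under stated assumptions, not the application's code. The element list in `INTERACTIVE`, the fields collected by `EXTRACT_JS`, and the `best_selector` preference order are all hypothetical choices, and Playwright is imported lazily so the helper can be tried without a browser installed.

```python
import json

# Which elements count as "interactive" is an assumption of this sketch.
INTERACTIVE = "a, button, input, select, textarea, [role='button'], [data-component]"

# JavaScript run in the page: summarise each matched element as a plain object.
EXTRACT_JS = """
els => els.map(el => ({
  tag: el.tagName.toLowerCase(),
  text: (el.innerText || '').trim().slice(0, 80),
  id: el.id || null,
  classes: Array.from(el.classList),
  component: el.getAttribute('data-component'),
  expanded: el.getAttribute('aria-expanded'),
}))
"""


def best_selector(rec):
    """Prefer stable hooks (data-component, id) over text-based selectors."""
    if rec.get("component"):
        return f"[data-component='{rec['component']}']"
    if rec.get("id"):
        return f"#{rec['id']}"
    if rec.get("text"):
        # Note: text containing quotes would need escaping in real use.
        return f"{rec['tag']}:has-text('{rec['text']}')"
    return rec["tag"]


def capture_elements(url):
    """Open the page headlessly and return its interactive elements as JSON."""
    from playwright.sync_api import sync_playwright  # lazy import

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        records = page.eval_on_selector_all(INTERACTIVE, EXTRACT_JS)
        browser.close()
    for rec in records:
        rec["selector"] = best_selector(rec)
    return json.dumps(records, indent=2)
```

Preferring `data-component` attributes and ids reflects the problem described above: selectors guessed from visual appearance tend to break, while attributes placed in the markup on purpose tend to stay stable.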

EXAMPLE UI GUIDE ENTRY

### Primary Sidebar Navigation

SELECTOR PATTERN: [data-component="MainLayout:primary-sidebar"] a:has-text("ITEM_NAME")

YAML Pattern - Click Primary Sidebar Item:
- action_type: click
  target: Admin primary sidebar item
  selector_hints: 
    - "[data-component='MainLayout:primary-sidebar'] a:has-text('Admin')"

MANDATORY: After clicking, add mouse_move action to collapse primary sidebar and keep secondary visible.
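Rules like the MANDATORY note above can also be checked mechanically once a script has been generated. The following is a minimal sketch of such a check, assuming steps have already been parsed from YAML into dicts. The function name `missing_followups` and the idea of linting scripts against guide rules are illustrative assumptions, not a described feature of the application.

```python
SIDEBAR = "MainLayout:primary-sidebar"


def missing_followups(steps):
    """Return indexes of sidebar clicks not followed by a mouse_move step."""
    problems = []
    for i, step in enumerate(steps):
        is_sidebar_click = step.get("action_type") == "click" and any(
            SIDEBAR in hint for hint in step.get("selector_hints", [])
        )
        if not is_sidebar_click:
            continue
        nxt = steps[i + 1] if i + 1 < len(steps) else {}
        if nxt.get("action_type") != "mouse_move":
            problems.append(i)
    return problems
```

A non-empty result could be shown on the review screen so the missing step is inserted before the script is executed.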

The difference in script quality was immediate and significant. Scripts generated with the UI guide required far less manual editing than scripts generated from screenshots alone.

The agent workflow

Building QA Automation Studio wasn't a linear process. Features got added, broke other features, needed optimization, introduced security concerns, and required documentation. Managing all of this through plain English conversations could have been chaotic.

The Claude Code agents provided structure. Here's how a typical feature development cycle worked:

Step 1: Feature implementation

MY INSTRUCTION

"Add a step editor to the script review page. Users should be able to click on any step, see its details in a modal, make changes, and save. The modal should also have a 'regenerate with AI' button that takes the current step context and asks the AI for an improved version."

Claude Code implements the feature. I test it manually to verify basic functionality works.

Step 2: Code review

MY INSTRUCTION

"Run code-reviewer agent on the step editor implementation."

The agent analyzes the new code, identifies potential bugs, suggests error handling improvements, and flags any patterns that don't match the rest of the codebase.

Step 3: Security review

MY INSTRUCTION

"Run security-reviewer agent on the step editor, particularly the AI regeneration feature that sends data to external APIs."

This caught several issues early: improper input sanitization, API keys potentially exposed in client-side code, and insufficient validation of AI responses before rendering them in the UI.
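To illustrate the last of those findings, here is a hedged sketch of validating an AI-generated step before it reaches the browser. The allowed actions and keys are hypothetical stand-ins for whatever the real action library defines. The point is the pattern: check the shape against an allowlist first, then escape every value on output.

```python
from html import escape

ALLOWED_ACTIONS = {"click", "fill", "mouse_move"}  # hypothetical action library
ALLOWED_KEYS = {"action_type", "target", "selector_hints", "value"}


def validate_step(step):
    """Reject AI-generated steps that do not match the expected shape."""
    if not isinstance(step, dict):
        raise ValueError("step must be a mapping")
    unknown = set(step) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    if step.get("action_type") not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action_type: {step.get('action_type')!r}")
    hints = step.get("selector_hints", [])
    if not isinstance(hints, list) or not all(isinstance(h, str) for h in hints):
        raise ValueError("selector_hints must be a list of strings")
    return step


def render_step(step):
    """Validate, then escape every value before it is placed into HTML."""
    step = validate_step(step)
    action = escape(step["action_type"])
    target = escape(str(step.get("target", "")))
    return f"<li><b>{action}</b>: {target}</li>"
```

The API key concern is a separate fix: keys belong in server-side configuration, with the browser calling the Flask backend rather than the AI provider directly.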

Step 4: Optimization

MY INSTRUCTION

"Run code-optimizer agent on the step editor. The modal feels slow to open when there are many steps."

The optimizer identified unnecessary re-renders, suggested lazy loading for step data, and recommended caching AI responses to avoid redundant API calls.
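The caching recommendation can be sketched as a small wrapper that keys responses on a hash of the request. This is an assumption about one reasonable design, not the optimizer's actual output. `ResponseCache` and the `call_model(provider, prompt, context)` signature are hypothetical, and `context` must be JSON-serializable for the key to work.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache for AI responses, keyed by a hash of the request."""

    def __init__(self, call_model):
        self._call_model = call_model  # function(provider, prompt, context) -> str
        self._store = {}
        self.misses = 0

    def _key(self, provider, prompt, context):
        payload = json.dumps([provider, prompt, context], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, provider, prompt, context):
        key = self._key(provider, prompt, context)
        if key not in self._store:
            self.misses += 1  # only a miss triggers a real API call
            self._store[key] = self._call_model(provider, prompt, context)
        return self._store[key]
```

A "regenerate with AI" button is a case where the user explicitly wants a fresh answer, so that path would bypass the cache rather than read from it.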

Step 5: Documentation

MY INSTRUCTION

"Run document-writer agent to update the user guide with the new step editor feature."

Documentation happened as features were completed, not as an afterthought at the end of the project.

The batch processing system

Part 2 of the application—batch processing—required a different approach. The manual mode is interactive, with a human watching and making decisions. Batch mode needs to run autonomously, potentially overnight, executing approved test scripts without supervision.

The key insight was treating approved test scripts as first-class artifacts. When a test completes with 100% success in manual mode, that script becomes a candidate for batch execution. Users can:

  • Mark scripts as "approved for batch"
  • Group scripts into test suites
  • Schedule batch runs
  • Configure failure handling (stop on first failure, continue and report, retry failed steps)

The batch executor runs in headless browser mode—no visible window, just automated execution. It produces the same detailed reports as manual mode, but collected across all scripts in the batch.
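The three failure-handling options map naturally onto a small policy enum. The sketch below is a simplified illustration. It retries whole scripts rather than individual steps, and `OnFailure`, `run_batch`, `approved_for_batch`, and the `run_script` callback are hypothetical names.

```python
from enum import Enum


class OnFailure(Enum):
    STOP = "stop"          # stop on first failure
    CONTINUE = "continue"  # continue and report
    RETRY = "retry"        # retry before recording a failure


def run_batch(scripts, run_script, policy=OnFailure.CONTINUE, max_retries=1):
    """Run approved scripts in order; run_script(script) returns True on pass."""
    results = {}
    for script in scripts:
        if not script.get("approved_for_batch"):
            continue  # only scripts approved in manual mode are eligible
        passed = run_script(script)
        attempts = 1
        while not passed and policy is OnFailure.RETRY and attempts <= max_retries:
            passed = run_script(script)
            attempts += 1
        results[script["name"]] = {"passed": passed, "attempts": attempts}
        if not passed and policy is OnFailure.STOP:
            break
    return results
```

Recording the attempt count alongside the result lets the combined report distinguish a clean pass from one that only succeeded on retry, which is often a sign of a flaky step.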

THE VISION

Every time a feature is released or an application is updated, batch mode can run the complete test suite. 100% QA coverage, every time, without manual intervention. The reports tell you exactly what passed, what failed, and where to focus human attention.

What I learned about vibe coding

Building a production application through natural language conversations taught me several things about this approach:

Specificity matters more than you think

Early in the project, I'd give instructions like "add error handling." The results were inconsistent. Sometimes Claude added try/except blocks everywhere. Sometimes it added user-facing error messages. Sometimes it logged errors to the console.

Better instructions looked like: "Add error handling to the script execution flow. If a step fails, capture the error message, take a screenshot of the current browser state, mark the step as failed in the report, and continue to the next step. Don't stop execution unless the browser crashes."
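Part of why the specific version works is that it reads almost like pseudocode. A minimal sketch of the behavior it describes follows, with the browser interaction injected as callbacks. `BrowserCrashed`, `perform`, and `take_screenshot` are hypothetical names for illustration.

```python
class BrowserCrashed(Exception):
    """Hypothetical signal that the browser session is gone."""


def execute_script(steps, perform, take_screenshot):
    """Run every step; record failures and keep going unless the browser dies.

    perform(step) raises on failure; take_screenshot(index) returns a file path.
    """
    report = []
    for index, step in enumerate(steps, start=1):
        entry = {"step": index, "status": "passed", "error": None, "screenshot": None}
        try:
            perform(step)
        except BrowserCrashed as exc:
            # No browser left to screenshot, so record the failure and stop.
            entry.update(status="failed", error=str(exc))
            report.append(entry)
            break
        except Exception as exc:
            # Capture the message and the browser state, then carry on.
            entry.update(status="failed", error=str(exc))
            entry["screenshot"] = take_screenshot(index)
        report.append(entry)
    return report
```

Each clause of the instruction corresponds to a line here, which is what makes the result easy to verify against what was asked for.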

The more specific the instruction, the more predictable the result.

Testing is your responsibility

Claude Code can write tests, but deciding what to test and verifying that tests actually validate the right behavior is still a human job. I spent significant time manually testing each feature, finding edge cases, and then describing those edge cases so Claude could handle them.

This isn't a weakness of the approach—it's the appropriate division of labor. The AI handles implementation. You handle verification.

Agents provide necessary structure

Without the agent system, I would have been constantly context-switching between "implement this feature" and "review this code" and "check for security issues." The agents let me compartmentalize different concerns and address them systematically.

Running security-reviewer on every feature that touches external APIs or user input became a habit. Running code-optimizer after getting something working became a habit. These habits prevented problems from accumulating.

Version control is non-negotiable

Even though I wasn't writing code directly, I still used Git for version control. Every major feature got committed. Every agent review cycle got committed. If something broke badly, I could roll back.

Vibe coding doesn't mean abandoning software engineering practices. It means applying them at a different level of abstraction.

The risks and limitations

This approach isn't without drawbacks, and being honest about them matters.

Understanding the code you ship — I can read and roughly understand the Python code in QA Automation Studio. But there are implementation details I didn't specify and haven't deeply analyzed. If something breaks in a subtle way, debugging requires either understanding the code or describing the problem well enough for Claude to diagnose it.

Security depends on your prompts — The security-reviewer agent catches issues I describe and common patterns it's trained on. But novel security vulnerabilities in unusual code patterns might slip through. For production deployment, independent security review by someone who reads code directly is advisable.

Performance isn't automatic — Code generated this way tends to be correct but not optimal. The code-optimizer agent helps, but truly performance-critical applications might need human optimization.

API costs add up — Every conversation with Claude Code consumes API tokens. Every agent run consumes tokens. For a complex application built over weeks, this becomes a meaningful expense.

IMPORTANT

Vibe coding works well for applications where you can thoroughly test the results and iterate based on what you find. It's less suitable for safety-critical systems where you need to understand and verify every line of implementation.

Who should try this approach

Based on my experience, vibe coding with Claude Code works well for:

Technical leaders who understand systems but don't code daily — You know what you want, you can evaluate whether the result is correct, but writing the code yourself would be slow and frustrating.

QA engineers building automation tools — You understand testing deeply. You know what good test scripts look like. You need tooling that doesn't exist or costs too much.

Business analysts with technical backgrounds — You can read code, you understand data flows, but you're not a developer. You need prototypes or internal tools that IT doesn't have bandwidth to build.

Anyone building internal tools — Lower stakes than customer-facing software. Faster iteration. Easier to fix problems when you're the primary user.

The bigger picture

A year ago, building QA Automation Studio would have required either hiring developers or spending months learning web development. Instead, it took weeks of focused conversations with an AI assistant.

This isn't about replacing developers. Complex, mission-critical systems still need human engineers who understand every line of code. But the category of problems that non-developers can solve has expanded dramatically.

The original Claude for Chrome experiment was about seeing if a browser agent could handle ad-hoc QA testing. QA Automation Studio is the answer to "what if we built that into a real product?" Both represent the same underlying trend: AI making it possible to build things that were previously out of reach.

If you've been thinking about a tool you wish existed—something that would make your work better but doesn't seem worth the investment to build—consider whether vibe coding might get you there. Start with a simple version. Iterate based on what works. Use agents to maintain quality.

You might be surprised what you can build.

KEY TAKEAWAYS

  • Vibe coding means specifying products in plain English and letting AI handle implementation
  • Claude Code agents (code-reviewer, code-optimizer, security-reviewer, document-writer) provide structure for quality assurance
  • UI guides generated from application code dramatically improve AI-generated test scripts
  • The human stays in the loop as product manager and verifier, not as coder
  • This approach works best for internal tools and applications where you can thoroughly test results
  • Traditional software engineering practices (version control, testing, security review) still apply

Want to learn more? Check out Practical AI for Humans for more practical guides on using AI effectively.