Sagar Wavhal
// PROCESS_LOG

Build Log: AI Voice Assistant

From idea to production in 7 days

This is a transparent look at how I built the Gemini-powered Voice Assistant. Not a tutorial — just proof of thinking and execution.

Day 1

Problem Definition & Architecture

Human Decision
  ├─ Identified gap: voice assistants sacrifice privacy for convenience
  ├─ Decided on offline-first approach with cloud AI fallback
  └─ Chose Next.js + TypeScript for production-grade PWA support
AI Execution
  ├─ Generated initial project structure via prompting
  ├─ Suggested Web Speech API vs Whisper WASM tradeoffs
  └─ Created TypeScript interfaces for voice state management
Key Decision

Chose hybrid approach: offline speech recognition (Web Speech API) + cloud AI (Gemini) for responses. Privacy for listening, intelligence for responding.
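The voice state interfaces mentioned above might look something like this minimal sketch. A discriminated union keeps invalid combinations unrepresentable (e.g. a transcript without a "thinking" status). All names here are illustrative, not the project's actual identifiers:

```typescript
// Hypothetical voice state model as a discriminated union — each status
// carries only the data that is valid for it.
type VoiceState =
  | { status: "idle" }
  | { status: "listening"; startedAt: number }
  | { status: "thinking"; transcript: string }
  | { status: "speaking"; responseText: string }
  | { status: "error"; message: string };

// Should the UI render the microphone as active?
function isMicActive(state: VoiceState): boolean {
  return state.status === "listening";
}
```

The payoff of the union shape is that a `switch` on `status` lets TypeScript narrow the fields automatically in each UI branch.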

Day 2-3

Gemini Integration & Voice Pipeline

Human Decision
  ├─ Designed conversation state machine
  ├─ Wrote Gemini prompt engineering for assistant persona
  └─ Debugged audio context issues across browsers
AI Execution
  ├─ Generated Gemini API integration code
  ├─ Created Web Speech API hooks and utilities
  └─ Built audio visualization components
Key Decision

Implemented streaming responses from Gemini for a natural conversation flow; the AI suggested this pattern after the initial batch responses felt robotic.
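A conversation state machine like the one described above can be reduced to a pure transition function, which makes it trivial to unit-test outside the browser. State and event names here are assumptions for illustration:

```typescript
// Hypothetical conversation state machine: idle → listening → thinking → speaking → idle.
type ConvState = "idle" | "listening" | "thinking" | "speaking";
type ConvEvent = "WAKE" | "TRANSCRIPT_READY" | "RESPONSE_DONE" | "CANCEL";

function nextState(state: ConvState, event: ConvEvent): ConvState {
  if (event === "CANCEL") return "idle"; // cancel wins from any state
  switch (state) {
    case "idle":
      return event === "WAKE" ? "listening" : "idle";
    case "listening":
      return event === "TRANSCRIPT_READY" ? "thinking" : "listening";
    case "thinking":
      return event === "RESPONSE_DONE" ? "speaking" : "thinking";
    case "speaking":
      return event === "RESPONSE_DONE" ? "idle" : "speaking";
  }
}
```

Keeping the transitions pure means the UI layer only subscribes to the current state, and streaming tokens can arrive while the machine sits in "speaking".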

Day 4-5

PWA & Offline Architecture

Human Decision
  ├─ Designed service worker caching strategy
  ├─ Tested offline scenarios and fallback states
  └─ Optimized bundle for mobile performance
AI Execution
  ├─ Generated PWA manifest and service worker
  ├─ Created WebWorker setup for Whisper WASM
  └─ Built glassmorphism UI components
Key Decision

Opted for the Web Speech API as the primary recognizer (better browser support), with Whisper WASM as an experimental alternative for true offline use.
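The service worker caching strategy boils down to routing requests by type. A sketch of that routing logic as a pure helper (the paths and strategy split are assumptions, not the deployed config):

```typescript
// Hypothetical per-route caching policy for the PWA's service worker.
type Strategy = "cache-first" | "network-only" | "stale-while-revalidate";

function strategyFor(pathname: string): Strategy {
  // AI responses must never be served from cache.
  if (pathname.startsWith("/api/")) return "network-only";
  // Hashed static assets are immutable, so cache-first is safe.
  if (/\.(js|css|woff2|png|svg)$/.test(pathname)) return "cache-first";
  // HTML shell: serve cached copy instantly, refresh in the background.
  return "stale-while-revalidate";
}
```

In the real service worker, the `fetch` event handler would dispatch to the Cache Storage API based on this decision.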

Day 6-7

Polish, Test, Deploy

Human Decision
  ├─ Cross-browser testing (Safari issues resolved)
  ├─ Refined Gemini prompts for better responses
  └─ Deployed to Vercel with environment configuration
AI Execution
  ├─ Debugged Safari-specific Web Speech quirks
  ├─ Suggested bundle size optimizations
  └─ Generated README and documentation
Key Decision

Added visual feedback for voice activity after user testing showed confusion about the listening state.
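One of the Safari quirks mentioned above is concrete and well known: Safari only exposes the prefixed `webkitSpeechRecognition` constructor, not the unprefixed `SpeechRecognition`. A minimal feature-detection sketch (the function name and the injectable `w` parameter are illustrative, used here so the logic can be tested without a browser):

```typescript
// Return whichever SpeechRecognition constructor the environment provides,
// falling back to Safari's webkit-prefixed variant, or null if unsupported.
function getSpeechRecognition(w: Record<string, unknown>): unknown | null {
  return w.SpeechRecognition ?? w.webkitSpeechRecognition ?? null;
}
```

In the app this would be called with `window`, and a `null` result would drive the unsupported-browser fallback UI.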

Build Metrics

  • 7 days
  • ~65% AI-written code
  • ~2.8k lines
  • ~25 human hours
Key Insight

AI didn't replace the thinking — it amplified it. The architecture decisions, debugging strategy, and UX refinements were human. AI accelerated the execution.