How It Works
The 8-stage pipeline from speech to polished text: transcription, cleanup, formatting, AI, and paste.
When you record with VivaDicta, your speech goes through a multi-stage pipeline that turns raw audio into clean, polished text — ready to paste wherever your cursor is.
The Pipeline
- Transcription: Your speech is converted to text using your chosen model (local or cloud).
- Cleanup: Filler words ("um", "uh", "hmm") are removed, along with hallucinated artifacts like bracketed text.
- Smart Formatting: Continuous text is broken into readable paragraphs using language-aware sentence detection.
- Word Replacements: Your custom find-and-replace rules are applied, fixing names, expanding abbreviations, and standardizing terms.
- AI Processing: If enabled, the text is sent to your chosen AI provider with your selected preset for polishing, summarizing, translating, or reformatting.
- Output Cleanup: AI artifacts (thinking tags, wrapper XML) are stripped for a clean final result.
- Smart Insert: Capitalization and spacing are adjusted based on surrounding text so the result blends seamlessly at your cursor.
- Paste at Cursor: The finished text is pasted directly where your cursor is, in any app.
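The stage order above can be pictured as a chain of text-to-text functions, each one optional. The following is a minimal Python sketch of that idea only; the `run_pipeline` function, the tuple layout, and the placeholder stages are hypothetical and do not reflect VivaDicta's actual code.

```python
def run_pipeline(text, stages):
    """Run each enabled stage in order, feeding its output to the next."""
    for name, stage, enabled in stages:
        if enabled:
            text = stage(text)
    return text

# Placeholder stages (hypothetical); each real stage is far more involved.
stages = [
    ("cleanup", lambda t: t.replace("um, ", ""), True),
    ("word replacements", lambda t: t.replace("ASAP", "as soon as possible"), True),
    ("ai processing", lambda t: t.upper(), False),  # disabled, so skipped
]

print(run_pipeline("um, call me ASAP", stages))
```

Because every stage takes text and returns text, turning a stage off simply passes the previous stage's output through unchanged.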
Stage 1: Transcription
VivaDicta supports a wide range of transcription engines. Your audio is processed by whichever model you've selected — either locally on your device or via a cloud provider.
- Local: Whisper, Parakeet, Apple Speech — fully private, no internet needed.
- Cloud: Groq, Cohere, Deepgram, ElevenLabs, Gemini, Mistral, and more — fast and accurate.
Some cloud providers (like Deepgram) also apply server-side smart formatting and punctuation before the text enters the local pipeline.
Stage 2: Cleanup
Raw transcription output often contains artifacts that you don't want in your final text. VivaDicta automatically removes:
- Filler words: "um", "uh", "hmm", and variations. You can customize the filler word list or disable this in settings.
- Hallucinated text: bracketed artifacts such as [music] or (inaudible) that speech models sometimes generate.
- XML tags: some models wrap output in XML tags; these are stripped automatically.
After cleanup, VivaDicta checks if the text has meaningful content. If it's empty or just punctuation, processing stops — no empty transcription clutters your history.
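To make the cleanup rules concrete, here is a rough Python sketch using regular expressions. It is an illustration under assumptions, not VivaDicta's implementation: the `FILLERS` list, the function names, and the choice to drop all parenthesized text are hypothetical.

```python
import re

# Hypothetical default filler list; the real list is user-configurable.
FILLERS = ("um", "uh", "hmm")

def clean_transcript(text, fillers=FILLERS):
    # Drop bracketed or parenthesized artifacts such as [music] or (inaudible).
    # Assumption: any parenthesized span is treated as an artifact.
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", text)
    # Drop XML-style tags but keep the text between them.
    text = re.sub(r"</?[A-Za-z][^<>]*>", " ", text)
    # Drop filler words as whole words, plus a trailing comma if present.
    pattern = r"\b(?:" + "|".join(re.escape(f) for f in fillers) + r")\b,?"
    text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse whitespace, then remove spaces left before punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)
    return text

def has_meaningful_content(text):
    # True if at least one letter or digit remains after cleanup.
    return any(ch.isalnum() for ch in text)

print(clean_transcript("Um, I think [music] we should, uh, go."))
```

A transcript such as "[music] um..." reduces to bare punctuation, so `has_meaningful_content` returns `False` and a pipeline built this way would stop there.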
Stage 3: Smart Formatting
Speech produces a continuous stream of text. VivaDicta uses Apple's Natural Language framework to intelligently break it into paragraphs:
- Detects the dominant language automatically.
- Splits text into sentences using linguistic tokenization.
- Groups sentences into paragraphs (~50 words or 4+ sentences each).
- Short utterances ("Yes.", "OK.") don't trigger paragraph breaks.
This can be toggled on/off in settings. When disabled, text remains as a single block.
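The grouping rule can be sketched in a few lines of Python. This is only an approximation: a naive regex stands in for the Natural Language framework's sentence tokenizer, language detection is omitted, and the thresholds and the `short_words` cutoff for "short utterances" are assumptions chosen to mirror the numbers above.

```python
import re

def split_sentences(text):
    # Naive stand-in for linguistic tokenization: split after ., !, or ?.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def group_paragraphs(text, max_words=50, max_sentences=4, short_words=2):
    paragraphs, current, words, counted = [], [], 0, 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        current.append(sentence)
        words += n
        is_short = n <= short_words
        # Short utterances ("Yes.", "OK.") do not count toward the sentence limit.
        if not is_short:
            counted += 1
        # Only a full-length sentence can close a paragraph.
        if not is_short and (words >= max_words or counted >= max_sentences):
            paragraphs.append(" ".join(current))
            current, words, counted = [], 0, 0
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)

sample = ("One two three. Four five six. Yes. Seven eight nine. "
          "Ten eleven twelve. Last one here.")
print(group_paragraphs(sample))
```

In the sample, "Yes." stays attached to the first paragraph, and the break only comes after the fourth full-length sentence.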
Stage 4: Word Replacements
Your custom word replacement rules are applied next. These are find-and-replace pairs you define in settings — useful for:
- Correcting frequently misspelled names.
- Expanding abbreviations ("ASAP" → "as soon as possible").
- Standardizing terminology.
- Fixing persistent transcription errors.
Replacements support multiple languages (Latin, Cyrillic, CJK) with smart word-boundary detection. Each rule can be individually enabled/disabled. See Vocabulary & Replacements.
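One way to picture word-boundary matching is the Python sketch below. The rule shape `(find, replace, enabled)` and the function name are hypothetical. The sketch relies on Python's Unicode-aware `\w`, which covers Latin and Cyrillic; it does not handle CJK text, where words are not separated by spaces and a different boundary strategy would be needed.

```python
import re

def apply_replacements(text, rules):
    # rules: list of (find, replace, enabled) tuples; a hypothetical shape.
    for find, replace, enabled in rules:
        if not enabled:
            continue
        # Match only when the term is not flanked by other word characters,
        # so "ASAP" is replaced but "ASAPs" is left alone.
        pattern = r"(?<!\w)" + re.escape(find) + r"(?!\w)"
        # A function replacement keeps backslashes in `replace` literal.
        text = re.sub(pattern, lambda m: replace, text)
    return text

rules = [
    ("ASAP", "as soon as possible", True),
    ("cat", "dog", False),  # disabled rule, never applied
]
print(apply_replacements("Reply ASAP about the ASAPs.", rules))
```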
Stage 5: AI Processing
If AI processing is enabled, the cleaned transcription is sent to your chosen AI provider with your selected preset. The AI can:
- Fix grammar and punctuation (Regular preset).
- Reformat for a specific tone (Professional, Casual, Chat).
- Summarize or extract action points.
- Translate to another language.
- Format as code or technical content.
- Follow any custom instructions you've written.
Context Injection
On macOS, VivaDicta can inject additional context so the AI understands what you're working on:
- Clipboard content — what you last copied.
- Screen text (OCR) — text visible on your screen.
- Selected text — your text selection in the active app.
- Custom vocabulary — domain-specific terms are included in the AI prompt.
Each context source is opt-in per Viva Mode — you control exactly when and where it's used.
Stage 6: Output Cleanup
AI models sometimes include artifacts in their responses — thinking tags, wrapper XML, chain-of-thought reasoning. VivaDicta strips all of this automatically so you get clean final text.
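As an illustration of this kind of stripping, here is a short Python sketch. The tag names (`think`, `thinking`, `reasoning`) and the function name are assumptions made for the example; the exact tags VivaDicta removes are not listed here.

```python
import re

def strip_ai_artifacts(text):
    # Remove reasoning blocks together with their content.
    # The tag names here are assumed for illustration.
    text = re.sub(
        r"<(think|thinking|reasoning)>.*?</\1>",
        "",
        text,
        flags=re.DOTALL | re.IGNORECASE,
    )
    # Unwrap an outer XML-style wrapper, keeping the inner text.
    # Assumes the whole response sits inside one pair of matching tags.
    match = re.fullmatch(r"\s*<([A-Za-z_][\w-]*)>(.*)</\1>\s*", text, flags=re.DOTALL)
    if match:
        text = match.group(2)
    return text.strip()

raw = "<think>plan the reply</think>\n<output>Hello, world.</output>"
print(strip_ai_artifacts(raw))
```

Text that contains no such tags passes through unchanged apart from trimmed whitespace.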
Stage 7: Smart Insert
Before pasting, VivaDicta reads the text surrounding your cursor and adjusts the transcription to fit naturally:
- Smart capitalization — capitalizes the first word if you're starting a new sentence, or lowercases it if you're mid-sentence.
- Smart spacing — adds a leading space if the cursor is right after a word, and a trailing space so the next word doesn't run into the pasted text.
This uses macOS accessibility APIs to detect cursor context. You can toggle it on/off in settings (Smart Insert), and override it per Viva Mode.
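The two adjustments can be sketched as a pure function over the text on either side of the cursor. This Python example is a simplification under assumptions: the function name and parameters are hypothetical, reading the surrounding text through the accessibility APIs is out of scope, and the naive lowercasing would also lowercase "I" or a proper noun mid-sentence.

```python
def smart_insert(text, before, after=""):
    # before/after: the text immediately to the left and right of the cursor.
    if not text:
        return text
    stripped = before.rstrip()
    # A new sentence starts at the beginning of a field or after ., !, or ?.
    new_sentence = not stripped or stripped[-1] in ".!?"
    first = text[0].upper() if new_sentence else text[0].lower()
    result = first + text[1:]
    # Leading space if the cursor sits directly after a non-space character.
    if before and not before[-1].isspace():
        result = " " + result
    # Trailing space unless the following text already begins with whitespace
    # (an assumption; the source only says a trailing space is added).
    if not after or not after[0].isspace():
        result += " "
    return result

print(repr(smart_insert("Hello there", "We said", " today")))
```

With "We said" before the cursor and " today" after it, the inserted text is lowercased and gets a leading space but no trailing one, so the sentence reads "We said hello there today".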
Stage 8: Paste at Cursor
The finished text is copied to your clipboard and pasted at your current cursor position via a simulated keyboard shortcut. It works in any app — your text editor, email client, chat app, browser, or IDE.
Both the raw transcription and the AI-processed version are saved to your transcription history, so you can always go back.