How It Works

The 8-stage pipeline from speech to polished text: cleanup, formatting, AI, and paste.

When you record with VivaDicta, your speech goes through a multi-stage pipeline that turns raw audio into clean, polished text — ready to paste wherever your cursor is.

The Pipeline

1. Transcription

Your speech is converted to text using your chosen model (local or cloud).

2. Cleanup

Filler words ("um", "uh", "hmm") are removed, along with hallucinated artifacts like bracketed text.

3. Smart Formatting

Continuous text is broken into readable paragraphs using language-aware sentence detection.

4. Word Replacements

Your custom find-and-replace rules are applied: fixing names, expanding abbreviations, standardizing terms.

5. AI Processing

If enabled, the text is sent to your chosen AI provider with your selected preset for polishing, summarizing, translating, or reformatting.

6. Output Cleanup

AI artifacts (thinking tags, wrapper XML) are stripped for a clean final result.

7. Smart Insert

Capitalization and spacing are adjusted based on surrounding text so the result blends seamlessly at your cursor.

8. Paste at Cursor

The finished text is pasted directly where your cursor is, in any app.

Stage 1: Transcription

VivaDicta supports a wide range of transcription engines. Your audio is processed by whichever model you've selected — either locally on your device or via a cloud provider.

  • Local: Whisper, Parakeet, Apple Speech — fully private, no internet needed.
  • Cloud: Groq, Cohere, Deepgram, ElevenLabs, Gemini, Mistral, and more — fast and accurate.

Some cloud providers (like Deepgram) also apply server-side smart formatting and punctuation before the text enters the local pipeline.

Stage 2: Cleanup

Raw transcription output often contains artifacts that you don't want in your final text. VivaDicta automatically removes:

  • Filler words — "um", "uh", "hmm", and variations. You can customize the filler word list or disable this in settings.
  • Hallucinated text — bracketed artifacts like [music], (inaudible), etc. that speech models sometimes generate.
  • XML tags — some models wrap output in XML tags; these are stripped automatically.

After cleanup, VivaDicta checks if the text has meaningful content. If it's empty or just punctuation, processing stops — no empty transcription clutters your history.
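
To make the cleanup rules concrete, here is a minimal Python sketch of the same idea. It is not VivaDicta's actual code: the function names, the default filler list, and the regular expressions are assumptions chosen for illustration, and removing every parenthesized span is a deliberate simplification.

```python
import re

# Hypothetical default list; in the app the filler list is user-configurable.
FILLER_WORDS = ("um", "uh", "hmm")


def clean_transcript(text: str, fillers=FILLER_WORDS) -> str:
    """Remove bracketed artifacts, XML-style tags, and filler words."""
    # Drop bracketed or parenthesized artifacts such as [music] or (inaudible).
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", text)
    # Drop XML-style tags but keep the text between them.
    text = re.sub(r"</?[A-Za-z][^>]*>", " ", text)
    # Drop whole-word fillers, plus a comma that directly follows one.
    pattern = r"\b(?:" + "|".join(map(re.escape, fillers)) + r")\b,?"
    text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the gaps left behind and re-attach punctuation.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)
    return text.strip()


def has_meaningful_content(text: str) -> bool:
    """True if anything other than punctuation and whitespace remains."""
    return any(ch.isalnum() for ch in text)
```

In this sketch, a recording that contains only "[music] um." is reduced to a lone period, so has_meaningful_content returns False and the pipeline would stop there.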

Stage 3: Smart Formatting

Speech produces a continuous stream of text. VivaDicta uses Apple's Natural Language framework to intelligently break it into paragraphs:

  • Detects the dominant language automatically.
  • Splits text into sentences using linguistic tokenization.
  • Groups sentences into paragraphs (~50 words or 4+ sentences each).
  • Ignores short utterances ("Yes.", "OK.") when deciding where to break paragraphs.

This can be toggled on/off in settings. When disabled, text remains as a single block.
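
The grouping rule above can be sketched roughly as follows. This is an illustration only: a simple regular expression stands in for the Natural Language framework's sentence tokenizer, language detection is left out, and the two-word cutoff for a "short utterance" is an assumed value.

```python
import re

MAX_WORDS = 50             # approximate paragraph size described above
MAX_SENTENCES = 4          # sentence count that closes a paragraph
SHORT_UTTERANCE_WORDS = 2  # hypothetical cutoff for "Yes." / "OK."


def split_sentences(text: str) -> list[str]:
    """Naive stand-in for linguistic tokenization: split after ., !, or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def format_paragraphs(text: str) -> str:
    """Group sentences into paragraphs separated by blank lines."""
    paragraphs, current, words = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        current.append(sentence)
        words += n
        # A short utterance never closes a paragraph on its own.
        if n <= SHORT_UTTERANCE_WORDS:
            continue
        if words >= MAX_WORDS or len(current) >= MAX_SENTENCES:
            paragraphs.append(" ".join(current))
            current, words = [], 0
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

With these assumed thresholds, a run of one-word replies stays in a single paragraph no matter how many there are.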

Stage 4: Word Replacements

Your custom word replacement rules are applied next. These are find-and-replace pairs you define in settings — useful for:

  • Correcting frequently misspelled names.
  • Expanding abbreviations ("ASAP" → "as soon as possible").
  • Standardizing terminology.
  • Fixing persistent transcription errors.

Replacements work across multiple scripts (Latin, Cyrillic, CJK) with smart word-boundary detection. Each rule can be individually enabled/disabled. See Vocabulary & Replacements.
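
As an illustration of what word-boundary-aware replacement involves, here is a small Python sketch. The Rule type, the case-insensitive matching, and the way CJK terms are detected are all assumptions made for the example, not a description of VivaDicta's internals.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical replacement rule: find -> replace, individually toggleable."""
    find: str
    replace: str
    enabled: bool = True


def _contains_cjk(term: str) -> bool:
    # Rough check covering common CJK ideographs, kana, and Hangul syllables.
    return any(
        "\u4e00" <= ch <= "\u9fff"
        or "\u3040" <= ch <= "\u30ff"
        or "\uac00" <= ch <= "\ud7a3"
        for ch in term
    )


def apply_replacements(text: str, rules: list[Rule]) -> str:
    for rule in rules:
        if not rule.enabled or not rule.find:
            continue
        pattern = re.escape(rule.find)
        # CJK text has no spaces between words, so boundaries are only
        # enforced for scripts that separate words (Latin, Cyrillic, ...).
        if not _contains_cjk(rule.find):
            pattern = r"(?<!\w)" + pattern + r"(?!\w)"
        # A function replacement keeps backslashes in the rule text literal.
        text = re.sub(pattern, lambda m, r=rule.replace: r, text,
                      flags=re.IGNORECASE)
    return text
```

The boundary check is what keeps a rule for "ASAP" from rewriting the middle of a longer word, while a CJK term is still matched inside an unbroken run of characters.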

Stage 5: AI Processing

If AI processing is enabled, the cleaned transcription is sent to your chosen AI provider with your selected preset. The AI can:

  • Fix grammar and punctuation (Regular preset).
  • Reformat for a specific tone (Professional, Casual, Chat).
  • Summarize or extract action points.
  • Translate to another language.
  • Format as code or technical content.
  • Follow any custom instructions you've written.

Context Injection

On macOS, VivaDicta can inject additional context so the AI understands what you're working on:

  • Clipboard content — what you last copied.
  • Screen text (OCR) — text visible on your screen.
  • Selected text — your text selection in the active app.
  • Custom vocabulary — domain-specific terms are included in the AI prompt.

Each context source is opt-in per Viva Mode — you control exactly when and where it's used.
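
One way to picture per-mode opt-in is a prompt builder that only includes the sources a mode has switched on. The sketch below is purely hypothetical: the settings class, the section labels, and the prompt layout are invented for illustration and say nothing about the prompt VivaDicta actually sends.

```python
from dataclasses import dataclass

CONTEXT_SOURCES = ("clipboard", "screen_text", "selected_text", "vocabulary")


@dataclass
class ModeContextSettings:
    """Hypothetical per-mode flags; every source is off unless opted in."""
    clipboard: bool = False
    screen_text: bool = False
    selected_text: bool = False
    vocabulary: bool = False


def build_prompt(instructions: str, transcript: str,
                 settings: ModeContextSettings, sources: dict[str, str]) -> str:
    """Assemble a prompt from the preset, enabled context, and transcript."""
    sections = [instructions]
    for name in CONTEXT_SOURCES:
        value = sources.get(name, "").strip()
        # Include a source only if the mode opted in and it has content.
        if getattr(settings, name) and value:
            sections.append(f"[{name}]\n{value}")
    sections.append(f"[transcript]\n{transcript}")
    return "\n\n".join(sections)
```

In this sketch, a source that was captured but not enabled for the current mode never reaches the prompt.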

Stage 6: Output Cleanup

AI models sometimes include artifacts in their responses — thinking tags, wrapper XML, chain-of-thought reasoning. VivaDicta strips all of this automatically so you get clean final text.
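
A rough Python sketch of this kind of cleanup is shown below. The tag names and the unwrap rule are assumptions for illustration; real providers differ in which tags they emit.

```python
import re

# Hypothetical names for reasoning blocks; models vary in what they emit.
REASONING_TAGS = ("think", "thinking", "reasoning")


def clean_ai_output(text: str) -> str:
    """Strip reasoning blocks, then unwrap a single wrapper element."""
    names = "|".join(REASONING_TAGS)
    # Remove reasoning blocks together with everything inside them.
    text = re.sub(rf"<({names})>.*?</\1>", "", text,
                  flags=re.DOTALL | re.IGNORECASE)
    text = text.strip()
    # If the whole response sits inside one wrapper tag, keep only the body.
    match = re.fullmatch(r"<([A-Za-z_][\w-]*)>(.*)</\1>", text, flags=re.DOTALL)
    if match and f"<{match.group(1)}>" not in match.group(2):
        text = match.group(2).strip()
    return text
```

Reasoning blocks are deleted along with their content, whereas a wrapper tag is removed but the text inside it is kept.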

Stage 7: Smart Insert

Before pasting, VivaDicta reads the text surrounding your cursor and adjusts the transcription to fit naturally:

  • Smart capitalization — capitalizes the first word if you're starting a new sentence, or lowercases it if you're mid-sentence.
  • Smart spacing — adds a leading space if the cursor is right after a word, and a trailing space so the next word doesn't run into the pasted text.

This uses macOS accessibility APIs to detect cursor context. You can toggle it on/off in settings (Smart Insert), and override it per Viva Mode.
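
The adjustment logic can be sketched in Python as below, assuming the text before and after the cursor has already been read. The function name, the special handling of "I", and the exact spacing rules are illustrative assumptions; the accessibility lookup itself is not shown.

```python
SENTENCE_END = ".!?"


def smart_insert(text: str, before: str, after: str = "") -> str:
    """Fit `text` between the `before` and `after` cursor context."""
    text = text.strip()
    if not text:
        return text
    preceding = before.rstrip()
    # An empty field or closing punctuation means a new sentence starts here.
    starts_sentence = not preceding or preceding[-1] in SENTENCE_END
    first_word = text.split()[0]
    if starts_sentence:
        text = text[0].upper() + text[1:]
    elif first_word != "I" and not first_word.startswith("I'"):
        # Mid-sentence: lowercase the first letter, but leave "I" alone.
        text = text[0].lower() + text[1:]
    # Leading space when the cursor sits directly after a word character.
    if before and before[-1].isalnum():
        text = " " + text
    # Trailing space unless the following text already starts with one.
    if not after or not after[0].isspace():
        text += " "
    return text
```

For example, dictating "Hello there" with the cursor directly after "I said" yields " hello there " in this sketch: lowercased, with a space on each side.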

Stage 8: Paste at Cursor

The finished text is copied to your clipboard and pasted at your current cursor position via a simulated keyboard shortcut. It works in any app — your text editor, email client, chat app, browser, or IDE.

Both the raw transcription and the AI-processed version are saved to your transcription history, so you can always return to either one.