Runs local · no upload

Audio Transcription — Speech to Text

Hours of audio, zero typing — your browser does the listening.

How It Works

  1. Add an audio file

    Click the upload area or drag and drop an audio file.

  2. On-device processing

    The tool transcribes the audio in your browser and shows the result.

  3. Copy result

    Copy the transcript to your clipboard with one click.

Privacy

All processing runs directly in your browser. No audio is sent to any server.

Voice memos from a client, WhatsApp recordings from a meeting nobody minuted, podcast episodes you need quoted in a blog post. Drop the file here and the transcript appears — with timestamps for subtitles. Everything runs on your device. No cloud, no API key, no logs.

01 — How to Use

How do you use this tool?

  1. Click the upload area or drag and drop an audio file (MP3, WAV, M4A, OGG, WebM).
  2. Pick a quality tier: Fast (~152 MB, mobile-friendly), Accurate (~291 MB, default) or Precise (~968 MB, desktop).
  3. Force a language if auto-detect struggles with short clips or strong accents — otherwise leave it on auto.
  4. Click Transcribe. The model downloads into your browser cache on first use; after that, transcription works offline.
  5. Copy the transcript or download it as TXT (plain text) or SRT (with timestamps for subtitles).

What does this tool do?

This tool converts spoken audio into a plain-text transcript — without uploading anything. It uses a compact speech-recognition model compiled to WebAssembly, running directly inside your browser tab. You get the full transcript in a scrollable, editable panel that you can copy or download as TXT or SRT.

Supported input formats include MP3, WAV, M4A (AAC), OGG Vorbis, and WebM Opus — the most common formats produced by phones, voice recorders, video editors, and meeting apps.

How Does It Work?

The pipeline runs in two stages, both on your device:

  1. Decode and normalize. The Web Audio API decodes your file and resamples it to 16 kHz mono — the input format speech-recognition models expect. Stereo channels are averaged to a single mono signal.

  2. Inference. A compact transformer model converts each 30-second window of audio into text tokens, then merges the windows into a continuous transcript with timestamps. Everything runs inside your browser tab — no cloud API, no third-party service.
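The two preparation steps described above (averaging channels to mono, then cutting the signal into 30-second windows) can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function names `downmixToMono` and `splitIntoWindows` are hypothetical, and the sketch assumes the audio has already been decoded and resampled to 16 kHz.

```typescript
// Assumed constants, taken from the description above: 16 kHz input, 30-second windows.
const SAMPLE_RATE = 16000;
const WINDOW_SECONDS = 30;

/** Average all channels sample by sample into a single mono signal. */
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

interface AudioWindow {
  startSeconds: number; // offset of this window within the full recording
  samples: Float32Array;
}

/** Cut a mono signal into consecutive windows; the last one may be shorter. */
function splitIntoWindows(
  mono: Float32Array,
  sampleRate: number = SAMPLE_RATE,
  windowSeconds: number = WINDOW_SECONDS
): AudioWindow[] {
  const windowSize = sampleRate * windowSeconds;
  const windows: AudioWindow[] = [];
  for (let start = 0; start < mono.length; start += windowSize) {
    const end = Math.min(start + windowSize, mono.length);
    windows.push({
      startSeconds: start / sampleRate,
      samples: mono.subarray(start, end),
    });
  }
  return windows;
}
```

Keeping `startSeconds` with each window is what lets per-window timestamps be shifted back onto the timeline of the full recording when the pieces are merged.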

The model is downloaded once on first use and cached in your browser. After that, transcription works fully offline.

Three Quality Tiers — Which to Pick?

The choice trades download size and speed against recognition accuracy:

| Tier | Model size | Speed | Best for |
|---|---|---|---|
| Fast | ~152 MB | very fast | Mobile, short voice memos, quick notes |
| Accurate | ~291 MB | balanced | Default for meetings, interviews, podcasts |
| Precise | ~968 MB | slower | Studio recordings, lectures, accented speech |

Pick a tier in the model selector below the upload area. Each tier is cached separately, so you can switch back and forth.
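To get a feel for the download-size side of that trade-off, the first-use download time can be estimated from the approximate model sizes listed for each tier. The sketch below is a rough back-of-the-envelope helper, not part of the tool: `estimateDownloadSeconds` is a hypothetical name, and it assumes 1 MB = 1,000,000 bytes and ignores protocol overhead and server-side limits.

```typescript
// Approximate model sizes in megabytes, as listed for the three tiers.
const TIER_SIZE_MB = { fast: 152, accurate: 291, precise: 968 } as const;
type Tier = keyof typeof TIER_SIZE_MB;

/** Estimated first-use download time in seconds for a link speed in Mbit/s. */
function estimateDownloadSeconds(tier: Tier, linkMbitPerSecond: number): number {
  if (linkMbitPerSecond <= 0) {
    throw new RangeError("link speed must be positive");
  }
  const megabits = TIER_SIZE_MB[tier] * 8; // 1 byte = 8 bits
  return megabits / linkMbitPerSecond;
}
```

On a 100 Mbit/s connection this puts the Precise tier at a bit over a minute, while the same download on a slow mobile link takes far longer, which is one reason to prefer the Fast tier there.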

How does language detection work?

The model auto-detects the spoken language from the first 30 seconds of audio. If detection misfires — common with short clips or heavy accents — use the language dropdown to force a specific language before transcribing.

| Setting | When to use |
|---|---|
| Auto-detect | Monolingual recordings ≥ 30 seconds |
| Force language | Short clips, strong regional accents |
| English | Podcasts, meetings, dictation |
| German | German-language interviews, lectures |
| French/Spanish | Native-speaker recordings |

TXT or SRT — Which Export?

The download dialog offers two formats:

  • TXT: plain running text, one paragraph. Best for meeting minutes, blog drafts, research notes.
  • SRT: SubRip subtitle format with start/end timestamps per block (00:01:23,456 --> 00:01:28,910). Imports directly into YouTube, Premiere Pro, DaVinci Resolve, CapCut, VLC, and most editors that handle captions.

For social-video subtitles, download SRT and import it into your editor. The SRT file carries only text and timing; font, size and position are set by the editor or player.
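For readers who want to post-process transcripts themselves, the SRT layout is simple enough to generate by hand: a running index, a timestamp line in the `HH:MM:SS,mmm --> HH:MM:SS,mmm` form, the text, and a blank line between blocks. The following is a minimal sketch; the `Segment` shape and the function names are assumptions for illustration and do not describe the tool's internal data format.

```typescript
interface Segment {
  startSeconds: number;
  endSeconds: number;
  text: string;
}

/** Left-pad a non-negative integer with zeros to the given width. */
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

/** Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma before milliseconds). */
function formatSrtTime(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

/** Build SRT text: numbered blocks separated by a blank line. */
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => {
      const timing = `${formatSrtTime(seg.startSeconds)} --> ${formatSrtTime(seg.endSeconds)}`;
      return `${i + 1}\n${timing}\n${seg.text.trim()}\n`;
    })
    .join("\n");
}
```

Rounding to whole milliseconds before splitting into hours, minutes and seconds avoids off-by-one timestamps caused by floating-point error.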

What are common use cases?

Meeting notes. Drop a recorded Zoom or Teams call and get a rough transcript to clean up into minutes. A 1-hour meeting typically produces a 5,000–8,000-word transcript.

Podcast show notes. Transcribe an episode to pull quotes, build timestamped chapters, or generate an SEO-friendly description.

Video captions. Transcribe the dialogue, download it as SRT, and import it into your video editor as closed captions. Captions improve accessibility for deaf and hard-of-hearing viewers and keep videos understandable during silent autoplay on social platforms.

Dictation cleanup. Transcribe iPhone or Android voice memos, then edit the result as plain text.

Academic research. Qualitative researchers transcribe interview recordings without sending sensitive data to a third-party transcription service — GDPR-friendly, no DPA required.

What are the best-practice tips?

  • Quiet environment beats post-processing filters every time.
  • Mic distance 20–30 cm reduces plosives and distortion.
  • Speak deliberately — slow, clear delivery boosts recognition, especially for technical vocabulary.
  • 128 kbps MP3 is plenty for transcription. Higher bitrates don’t improve accuracy.
  • Split long recordings into 30–60-minute chunks before transcribing — more stable and gives natural break points.
  • Force the language for clips under 30 seconds or with strong accents.
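The chunking tip above comes down to simple arithmetic: pick the smallest number of chunks that keeps each one under the limit, then divide the recording evenly. The helper below is a hypothetical sketch (`planChunks` is not part of the tool) that only computes cut points; the actual splitting would still be done in an audio editor.

```typescript
interface Chunk {
  startSeconds: number;
  endSeconds: number;
}

/** Split a recording into the fewest equal-length chunks no longer than maxChunkSeconds. */
function planChunks(durationSeconds: number, maxChunkSeconds: number = 60 * 60): Chunk[] {
  if (durationSeconds <= 0 || maxChunkSeconds <= 0) return [];
  const count = Math.ceil(durationSeconds / maxChunkSeconds);
  const chunkLength = durationSeconds / count;
  const chunks: Chunk[] = [];
  for (let i = 0; i < count; i++) {
    chunks.push({
      startSeconds: i * chunkLength,
      // Pin the final boundary to the exact duration to avoid rounding drift.
      endSeconds: i === count - 1 ? durationSeconds : (i + 1) * chunkLength,
    });
  }
  return chunks;
}
```

For a 2.5-hour recording (9,000 seconds) with a 60-minute limit this yields three 50-minute chunks. The cut points are purely arithmetic, so in practice it helps to nudge each one to the nearest pause in speech.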

From the kittokit ecosystem for the full audio-to-text workflow:

  • Speech Enhancer: Remove noise, echo and background hum before transcribing. Cleaner audio generally improves word accuracy.
  • Character Counter — Count words, characters and reading time of your transcript. Handy for trimming meeting minutes into newsletter or blog length.
  • Text Diff — Compare two transcript versions, e.g. raw vs. proofread. Highlights changes word-by-word.
