Voice transcription is messy. Even the best models like Whisper faithfully reproduce every “um”, “uh”, and rambling run-on sentence. That’s correct behavior for transcription, but not what you want when texting someone.

I added a “polish mode” to my macOS speech-to-text app that optionally sends Whisper’s output through an LLM to clean it up. The interaction model: hold Fn to record, tap Ctrl anytime during recording to enable polish, release to transcribe and paste.

The Modifier Key Challenge

The obvious approach - requiring Ctrl to be held simultaneously with Fn - felt clunky in testing. You’d have to coordinate two fingers before speaking, and the physical position is awkward.

A “latch” pattern works better: pressing Ctrl anytime while Fn is held latches the polish flag. You can press Ctrl before speaking, during, or just before release. The flag resets when you start a new recording.

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

let ctrl_latched = Arc::new(AtomicBool::new(false));

// In the event tap callback. `key_pressed` is the current Fn state,
// `prev_pressed` the Fn state from the previous event; a Ctrl tap
// arrives as its own flags-changed event while Fn is still down.
if key_pressed && !prev_pressed {
    // Recording started - reset latch
    ctrl_latched.store(false, Ordering::SeqCst);
    start_recording(&state);
} else if !key_pressed && prev_pressed {
    // Recording stopped - check if Ctrl was ever pressed
    let polish = ctrl_latched.load(Ordering::SeqCst);
    stop_recording(&state, polish);
}

// Latch Ctrl if pressed anytime while Fn (and thus recording) is active
if key_pressed && ctrl_pressed {
    ctrl_latched.store(true, Ordering::SeqCst);
}

macOS exposes modifier state through CGEventFlags as a bitmask. Control is 0x40000:

const CONTROL_KEY_FLAG: u64 = 0x40000; // kCGEventFlagMaskControl

let flags = event.get_flags().bits();
let ctrl_pressed = (flags & CONTROL_KEY_FLAG) != 0;
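
The Fn key that drives recording shows up the same way. A minimal sketch, assuming the app derives key_pressed from the secondary-fn mask (the original doesn’t show this part):

// Assumption: Fn state comes from the secondary-fn bitmask
// (kCGEventFlagMaskSecondaryFn, 0x800000); not shown in the original.
const FN_KEY_FLAG: u64 = 0x800000;

let key_pressed = (flags & FN_KEY_FLAG) != 0;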

The Polish Function

The polish step is a straightforward LLM API call. I’m using Groq’s hosted llama-3.3-70b-versatile because I’m already using Groq for Whisper transcription - one API key, one vendor.

use std::time::Duration;

fn polish_text(text: &str, api_key: &str) -> Option<String> {
    let client = reqwest::blocking::Client::new();

    let body = serde_json::json!({
        "model": "llama-3.3-70b-versatile",
        "messages": [
            {
                "role": "system",
                "content": "Clean up this voice message for texting. Remove filler words (um, uh, like, you know). Fix punctuation and sentence structure. Break up run-on sentences. Keep it casual. No trailing period. Output ONLY the cleaned text - no explanations, no quotes."
            },
            {
                "role": "user",
                "content": text
            }
        ],
        "temperature": 0.2
    });

    let response = client
        .post("https://api.groq.com/openai/v1/chat/completions")
        .header("Authorization", format!("Bearer {}", api_key))
        .header("Content-Type", "application/json")
        .json(&body)
        .timeout(Duration::from_secs(30))
        .send()
        .ok()?;

    if !response.status().is_success() {
        return None;
    }

    let chat_response: ChatResponse = response.json().ok()?;
    chat_response.choices.first().map(|c| c.message.content.clone())
}

The function returns Option<String> - this matters for the fallback logic.

Parsing the Response

Groq uses the OpenAI-compatible chat completions format. The response structure:

#[derive(serde::Deserialize)]
struct ChatResponse {
    choices: Vec<ChatChoice>,
}

#[derive(serde::Deserialize)]
struct ChatChoice {
    message: ChatMessage,
}

#[derive(serde::Deserialize)]
struct ChatMessage {
    content: String,
}

Using serde to parse into typed structs catches malformed responses at parse time rather than panicking on field access later.
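
A quick illustration with a hypothetical error payload: the typed parse fails up front, while untyped Value indexing silently yields Null and defers the failure.

// Hypothetical error payload with no "choices" field.
let bad = r#"{"error": {"message": "rate limit exceeded"}}"#;

// Typed parse: fails immediately; `.ok()?` in polish_text maps it to None.
assert!(serde_json::from_str::<ChatResponse>(bad).is_err());

// Untyped access: indexing a missing key returns Null, so the problem
// only surfaces later, e.g. at an .as_str().unwrap().
let v: serde_json::Value = serde_json::from_str(bad).unwrap();
assert!(v["choices"][0]["message"]["content"].is_null());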

Prompt Engineering Lessons

The system prompt went through several iterations:

First attempt: “Clean up this transcription.”

Problem: The LLM would respond conversationally. “Sure! Here’s the cleaned up version: …”

Second attempt: “Output only the cleaned text.”

Problem: It would wrap the output in quotes: "Here's what I meant to say"

Third attempt: Added explicit prohibitions.

Output ONLY the cleaned text - no explanations, no quotes.

This worked. The key insight: LLMs default to being helpful and conversational. For tool use, you need to explicitly tell them to suppress that behavior.

Other prompt decisions:

  • “Keep it casual” - prevents the LLM from making the text overly formal
  • “No trailing period” - texting convention; a period at the end feels curt
  • “Break up run-on sentences” - spoken language naturally runs together

Low temperature (0.2) keeps output consistent. Higher temperatures occasionally produced creative reinterpretations of what I said.

Graceful Degradation

The polish step can fail: network issues, rate limits, API changes. The user still expects their transcription to paste.

let final_text = if polish {
    polish_text(text, api_key).unwrap_or_else(|| text.to_string())
} else {
    text.to_string()
};

Option::unwrap_or_else is the right pattern here. If polish fails for any reason, fall back to the raw Whisper transcription. The user gets something rather than nothing.

This is a general principle for LLM features: treat them as enhancements, not requirements. The core functionality should work without them.

Latency Considerations

Polish adds a second API call, roughly 200-400ms on Groq. For a texting use case, this is acceptable - you’re not in a real-time conversation. For live captioning or dictation into a text field, it would be too slow.

The transcription already happens in a background thread:

thread::spawn(move || {
    transcribe_and_paste(audio_data, sample_rate, &api_key, polish);
});

Both the Whisper call and the polish call happen sequentially in this thread. The UI remains responsive; the user just waits slightly longer for the paste.
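
Putting the pieces together, the worker looks roughly like this - a sketch, with transcribe_audio and paste_text as hypothetical stand-ins for the app’s actual Whisper call and paste logic, and placeholder audio types:

// Sketch of the worker thread body. transcribe_audio and paste_text
// are hypothetical stand-ins, and the audio types are placeholders.
fn transcribe_and_paste(audio_data: Vec<f32>, sample_rate: u32, api_key: &str, polish: bool) {
    // First network call: Whisper transcription.
    let Some(text) = transcribe_audio(&audio_data, sample_rate, api_key) else {
        return; // nothing to paste if transcription itself fails
    };

    // Optional second call, falling back to the raw transcript on failure.
    let final_text = if polish {
        polish_text(&text, api_key).unwrap_or_else(|| text.clone())
    } else {
        text
    };

    paste_text(&final_text);
}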

Trade-offs

When polish helps:

  • Texting, where filler words and run-ons look sloppy
  • Drafting messages you want to sound more coherent
  • Quick notes that benefit from basic cleanup

When to skip it:

  • Dictating into forms or code comments
  • When you want exact transcription (quotes, interviews)
  • Low-latency scenarios

What polish can break:

  • Proper nouns and technical terms may get “corrected”
  • The LLM might misinterpret intent on ambiguous input
  • Short inputs (“ok”, “yes”) sometimes get expanded unnecessarily - a cheap guard for this is sketched below
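
A hypothetical guard, not part of the app as described, that skips the API call when the input is too short to benefit:

// Hypothetical guard: even with polish latched, skip the API call
// for one- or two-word inputs that an LLM tends to over-expand.
fn should_polish(text: &str, polish_requested: bool) -> bool {
    polish_requested && text.split_whitespace().count() > 2
}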

The latch pattern makes this an explicit user choice. Default is raw transcription; polish is opt-in.

Conclusion

  • Latch pattern beats simultaneous press - let users enable modes at any point during an action
  • Explicit prompt constraints - tell the LLM what NOT to do (no explanations, no quotes)
  • Low temperature for tools - you want consistency, not creativity
  • Graceful fallback is mandatory - LLM features should enhance, not gate, core functionality
  • Choose your latency budget - 200-400ms is fine for async use cases, not for real-time

Related: Building an AI-Powered Changelog GitHub Action - Similar prompt engineering patterns for developer tooling.