I Built a Private Chinese Live Translator on My Mac

I work at a Chinese company. Every standup, every design review, every Friday demo — Mandarin. My ear is getting better, but at meeting speed it isn't fast enough yet, and pulling out a phone translator means I miss the next sentence to catch the last one.

So I built one. A native macOS app that listens to system audio, transcribes the Mandarin on-device, drops a pinyin line underneath, and shows the English translation — all without sending a byte to a cloud. Here's the architecture, why every "obvious" cloud-based shortcut fails for live meetings, and the engineering decisions that mattered.

我们今天先对一下产品方向

wǒ men jīn tiān xiān duì yí xià chǎn pǐn fāng xiàng

Let's align on product direction first today

这个功能下周能上线吗

zhè ge gōng néng xià zhōu néng shàng xiàn ma

Can this feature ship next week?

我们再讨论一下接口的设计

wǒ men zài tǎo lùn yí xià jiē kǒu de shè jì

Let's discuss the API design once more

Why Cloud Translators Fail at Meeting Speed

Google Translate, DeepL, Apple's Live Captions — all decent for static text or solo speech. None of them survive a real meeting. The reason is latency budget, not accuracy.

The Cloud Round-Trip Tax

  • Step 1 — Capture: chunk audio, encode, sometimes buffer 2–3 seconds for ASR confidence.
  • Step 2 — Upload: push to a remote ASR endpoint. Add network jitter, TLS handshake, queue.
  • Step 3 — Transcribe: server runs ASR, returns text. Usually fast — but not local-fast.
  • Step 4 — Translate: second cloud hop, second model, second queue.
  • Step 5 — Render: finally, English shows up on your screen. ~3–6s total in good conditions.

Three to six seconds doesn't sound like much. In a meeting full of "对,对,那个不行" ("right, right, that one won't work"), it's two thoughts behind. By the time the translation lands, the speaker has moved on and your attention is still on the previous sentence. You don't fall one beat behind; the lag compounds with every sentence.

On-device ASR + on-device translation can run the same pipeline in 200–500ms. That's the difference between following a meeting and watching one happen to you.

The 4 Numbers That Matter

  • ~300ms end-to-end latency
  • 0 cloud calls
  • 3 lines shown (Mandarin, pinyin, English)
  • 100% private

The Pipeline (4 Stages, All Local)

Every stage runs on the M-series Neural Engine or CPU. No outbound network. The trade-off is model size on disk and a one-time download — in return you get latency a cloud service cannot match.

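Stage 1: Audio Capture (System + Mic)

APIs: CoreAudio · ScreenCaptureKit

On macOS 13+, ScreenCaptureKit can capture system audio: set capturesAudio on the stream configuration and subscribe to the audio output. The mic comes from a regular audio input alongside it (on macOS 15+, ScreenCaptureKit can capture the microphone as well). That single API is the whole reason this app is possible. Before it, capturing Zoom audio meant kernel extensions or a virtual audio device such as BlackHole. Now it's one Screen Recording permission and a stream subscription.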
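To make the capture stage concrete, here is a minimal sketch of a system-audio tap. It assumes macOS 13+ and that the user has already granted Screen Recording permission. The `SystemAudioTap` class and its `onAudio` callback are hypothetical names for illustration, and the 16 kHz mono format is an assumption chosen because speech models commonly expect it, not something the app necessarily uses.

```swift
import CoreMedia
import ScreenCaptureKit

/// Hypothetical wrapper (illustration only): forwards system audio buffers to a callback.
@available(macOS 13.0, *)
final class SystemAudioTap: NSObject, SCStreamOutput {
    private let onAudio: (CMSampleBuffer) -> Void
    private let queue = DispatchQueue(label: "system-audio-tap")
    private var stream: SCStream?

    init(onAudio: @escaping (CMSampleBuffer) -> Void) {
        self.onAudio = onAudio
        super.init()
    }

    func start() async throws {
        // Enumerating shareable content typically fails without Screen Recording permission.
        let content = try await SCShareableContent.excludingDesktopWindows(
            false, onScreenWindowsOnly: true)
        guard let display = content.displays.first else { return }

        let config = SCStreamConfiguration()
        config.capturesAudio = true                // system audio, available from macOS 13
        config.excludesCurrentProcessAudio = true  // skip this app's own sounds
        config.sampleRate = 16_000                 // assumption: speech-friendly rate
        config.channelCount = 1                    // assumption: mono is enough for ASR

        // A display filter is still required even though only audio is consumed.
        let filter = SCContentFilter(display: display, excludingWindows: [])
        let stream = SCStream(filter: filter, configuration: config, delegate: nil)
        try stream.addStreamOutput(self, type: .audio, sampleHandlerQueue: queue)
        try await stream.startCapture()
        self.stream = stream
    }

    func stop() async throws {
        try await stream?.stopCapture()
        stream = nil
    }

    // SCStreamOutput: called on `queue` for every captured buffer.
    func stream(_ stream: SCStream,
                didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
                of type: SCStreamOutputType) {
        guard type == .audio else { return }
        onAudio(sampleBuffer)
    }
}
```

Calling `stop()` followed by `start()` is also the simplest way to rebuild the stream if the tap goes quiet after an audio device change.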

Stage 2: On-Device Mandarin ASR

APIs: Whisper · Speech framework

Apple's SFSpeechRecognizer works for Mandarin (zh-CN) and runs on-device on Apple Silicon when the recognizer reports supportsOnDeviceRecognition and you set requiresOnDeviceRecognition = true on the recognition request. For better accuracy, a Whisper small/medium Core ML model is the alternate path: slower, but more forgiving with accents and noisy meetings.
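A rough sketch of the on-device setup follows. It assumes speech recognition authorization has already been requested and granted. The `ASRSetupError` type and both function names are hypothetical; the Speech framework calls themselves are the standard ones.

```swift
import Speech

/// Hypothetical error type for this sketch.
enum ASRSetupError: Error {
    case recognizerUnavailable
    case onDeviceUnsupported
}

/// Builds a zh-CN recognizer and a streaming request that refuses to leave the device.
func makeMandarinRequest() throws -> (SFSpeechRecognizer, SFSpeechAudioBufferRecognitionRequest) {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "zh-CN")),
          recognizer.isAvailable else {
        throw ASRSetupError.recognizerUnavailable
    }
    // Without this check, requiresOnDeviceRecognition would simply make recognition fail.
    guard recognizer.supportsOnDeviceRecognition else {
        throw ASRSetupError.onDeviceUnsupported
    }
    let request = SFSpeechAudioBufferRecognitionRequest()
    request.requiresOnDeviceRecognition = true   // never fall back to the server
    request.shouldReportPartialResults = true    // stream partial transcripts
    return (recognizer, request)
}

/// Starts recognition and reports each transcript along with its isFinal flag.
func startRecognition(recognizer: SFSpeechRecognizer,
                      request: SFSpeechAudioBufferRecognitionRequest,
                      onText: @escaping (String, Bool) -> Void) -> SFSpeechRecognitionTask {
    recognizer.recognitionTask(with: request) { result, _ in
        guard let result else { return }
        onText(result.bestTranscription.formattedString, result.isFinal)
    }
}
```

Captured audio is fed in with `request.appendAudioSampleBuffer(_:)` for CMSampleBuffer input, or `request.append(_:)` for AVAudioPCMBuffer input.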

Stage 3: Pinyin Romanization

API: CFStringTransform

This one surprised me — it's a single line. Apple ships a Unicode transform that converts Han characters to Latin pinyin with tone marks. No model, no library, no network. kCFStringTransformMandarinLatin handles it natively.

Stage 4: Offline Translation (zh → en)

API: Apple Translation

macOS 15's Translation framework runs offline once you've downloaded the language pack. TranslationSession handles batching and streaming results. For older macOS, swap in an Argos Translate or local ONNX Marian model — same shape of pipeline.
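As a sketch of how the translation stage might be wired up in SwiftUI, assuming macOS 15 and the Translation framework: the `TranslatedLine` view is a hypothetical name, and the "zh-Hans" and "en" language identifiers are assumptions about the pair this app would use. `prepareTranslation()` asks the system to make the language pack available before the first real request.

```swift
import SwiftUI
import Translation

/// Hypothetical view: shows the English translation of one Mandarin segment.
@available(macOS 15.0, *)
struct TranslatedLine: View {
    let hanzi: String

    @State private var english = ""
    @State private var config: TranslationSession.Configuration? = .init(
        source: Locale.Language(identifier: "zh-Hans"),  // assumption: Simplified Chinese
        target: Locale.Language(identifier: "en"))

    var body: some View {
        Text(english)
            // Runs when the view appears and again whenever the configuration is invalidated.
            .translationTask(config) { session in
                do {
                    // May prompt for the language pack download if it is missing.
                    try await session.prepareTranslation()
                    let response = try await session.translate(hanzi)
                    english = response.targetText
                } catch {
                    english = ""  // sketch only: a real app would surface the error
                }
            }
            // New source text: invalidate the configuration to re-run the task.
            .onChange(of: hanzi) {
                config?.invalidate()
            }
    }
}
```

Showing a view like this once at first launch is one way to get the download prompt out of the way before a meeting starts.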

Pinyin in One Line (No, Really)

The detail I want to highlight, because nobody told me it existed: Apple gives you Mandarin → pinyin for free. Built into Foundation. Has been there for years.

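```swift
// Swift: Han → pinyin with tone marks
import Foundation

let mandarin = "我们今天先对一下产品方向"
let mutable = NSMutableString(string: mandarin) as CFMutableString

// Transforms the string in place; `false` means forward direction (Han → Latin).
CFStringTransform(mutable, nil, kCFStringTransformMandarinLatin, false)
// mutable → "wǒ men jīn tiān xiān duì yī xià chǎn pǐn fāng xiàng"
// (dictionary tones only: 一 stays yī, spoken tone sandhi is not applied)
```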

If you want pinyin without tone marks (cleaner for some readers), chain a second transform with kCFStringTransformStripDiacritics. That's the whole pinyin layer. The other 95% of the work was audio capture and UI.
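The chained transform can be wrapped in a small helper. This is a sketch; the `pinyin(for:toneMarks:)` function name and signature are made up for illustration, while both transform constants are the Core Foundation ones named above.

```swift
import Foundation

/// Hypothetical helper: Han → pinyin, optionally without tone marks.
func pinyin(for hanzi: String, toneMarks: Bool = true) -> String {
    let buffer = NSMutableString(string: hanzi)
    // First pass: Han characters → Latin pinyin with tone marks.
    CFStringTransform(buffer, nil, kCFStringTransformMandarinLatin, false)
    if !toneMarks {
        // Second pass: drop the diacritics (wǒ → wo).
        CFStringTransform(buffer, nil, kCFStringTransformStripDiacritics, false)
    }
    return buffer as String
}

print(pinyin(for: "我们"))                    // wǒ men
print(pinyin(for: "我们", toneMarks: false))  // wo men
```

The transform works character by character, so the output is one syllable per character separated by spaces.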

Decision Table — Stack Choices I Made

Every stage had a cloud option and a local option. Here's where I landed and why.

| Choice | Picked | Why |
| --- | --- | --- |
| ASR engine | Apple Speech (on-device) + Whisper fallback | Zero round-trip, free, ships with the OS. |
| Audio capture | ScreenCaptureKit (system + mic mix) | One API, no kernel extension, works with Zoom/Meet/Teams. |
| Pinyin layer | CFStringTransform | One line of Foundation. No model. |
| Translation | Apple Translation (offline pack) | Offline once downloaded. Good enough for technical meetings. |
| UI shell | SwiftUI floating window | Always-on-top panel above Zoom. NSPanel hosting a SwiftUI view. |
| State | Actor + AsyncStream | Audio buffer → ASR → translate is a natural async pipeline. |
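The "Actor + AsyncStream" choice can be sketched roughly as follows. Everything named here (`Caption`, `CaptionStore`, `captions(from:translate:store:)`) is hypothetical, and the `translate` closure stands in for whatever translation backend is plugged in. The pinyin step uses Foundation's `applyingTransform(.mandarinToLatin, reverse:)`, the Swift-native counterpart of the CFStringTransform call.

```swift
import Foundation

/// One rendered caption: the three lines shown on screen.
struct Caption: Sendable {
    let hanzi: String
    let pinyin: String
    let english: String
}

/// Actor that owns the mutable caption history, so no locks are needed.
actor CaptionStore {
    private(set) var history: [Caption] = []
    func append(_ caption: Caption) { history.append(caption) }
}

/// Turns a stream of final transcripts into a stream of captions.
func captions(from transcripts: AsyncStream<String>,
              translate: @escaping @Sendable (String) async -> String,
              store: CaptionStore) -> AsyncStream<Caption> {
    AsyncStream { continuation in
        let task = Task {
            for await hanzi in transcripts {
                let latin = hanzi.applyingTransform(.mandarinToLatin, reverse: false) ?? ""
                let english = await translate(hanzi)
                let caption = Caption(hanzi: hanzi, pinyin: latin, english: english)
                await store.append(caption)
                continuation.yield(caption)
            }
            continuation.finish()
        }
        // Stop the upstream loop when the consumer goes away.
        continuation.onTermination = { _ in task.cancel() }
    }
}
```

The UI then only has to iterate the returned stream with `for await` and render each caption as it arrives.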

5 Things That Bit Me

  1. Gotcha 01: Partial vs final ASR results. SFSpeechRecognizer emits partial transcripts that get rewritten. Don't translate every partial; debounce by ~400ms or wait for isFinal. Translating mid-sentence is what makes apps feel broken.
  2. Gotcha 02: Audio routing changes break the stream. Plug in AirPods mid-meeting and ScreenCaptureKit can drop the system tap. Listen for route changes and restart the capture session.
  3. Gotcha 03: Mandarin segmentation is hard. ASR gives you a wall of characters with no spaces. For pinyin readability you want word-level segmentation, but CFStringTransform doesn't know words. Either accept char-level pinyin or layer in a segmenter (Jieba-style).
  4. Gotcha 04: Translation pack must be pre-downloaded. First run, TranslationSession can prompt the user. Trigger the download proactively at first launch, otherwise mid-meeting download = mid-meeting outage.
  5. Gotcha 05: Battery on laptops. Continuous ASR + translation + an always-on-top SwiftUI window eats power. On a MacBook on battery, I throttle the ASR sample rate and translate every Nth final segment instead of every one.
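The first gotcha, debouncing partial results, is the one most easily shown in code. Below is a minimal sketch, not the app's actual tuning logic: the `PartialDebouncer` actor and its `onStable` callback are hypothetical names, and the 400ms default simply mirrors the figure in the list above. Partial transcripts restart a timer; a final transcript bypasses it.

```swift
import Foundation

/// Hypothetical debouncer: forwards a partial transcript only after it has
/// stayed unchanged for `quietPeriod`, and forwards final transcripts at once.
@available(macOS 13.0, *)
actor PartialDebouncer {
    private let quietPeriod: Duration
    private let onStable: @Sendable (String) async -> Void
    private var pending: Task<Void, Never>?

    init(quietPeriod: Duration = .milliseconds(400),
         onStable: @escaping @Sendable (String) async -> Void) {
        self.quietPeriod = quietPeriod
        self.onStable = onStable
    }

    func submit(_ text: String, isFinal: Bool) async {
        // A newer transcript always supersedes the one still waiting.
        pending?.cancel()
        pending = nil

        if isFinal {
            await onStable(text)
            return
        }
        pending = Task { [quietPeriod, onStable] in
            // Sleep throws on cancellation; the guard below then skips the callback.
            try? await Task.sleep(for: quietPeriod)
            guard !Task.isCancelled else { return }
            await onStable(text)
        }
    }
}
```

In use, the recognition callback would call `submit(_:isFinal:)` for every result, and `onStable` would trigger translation, so rapidly rewritten partials never reach the translator.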

Why I'm Sharing This

Two reasons. First, "build it instead of buying it" is the right move more often than people admit — especially for tools you'll touch ten times a day. The cloud translators charge per minute, leak audio to servers, and lag enough to be useless for the actual problem.

Second, the macOS stack is quietly absurd. ScreenCaptureKit, SFSpeechRecognizer, CFStringTransform, TranslationSession — four Apple APIs and you've replaced a paid SaaS. The hard part wasn't AI. It was knowing which Foundation transform to call.

Frequently Asked Questions

Does this work on Intel Macs?

The audio capture and pinyin transform do. On-device ASR with SFSpeechRecognizer requires Apple Silicon for the offline mode — on Intel, it falls back to cloud-assisted recognition, which defeats the latency point. If you're on Intel, swap in a local Whisper.cpp build instead.

Why not Whisper for everything?

Whisper is more accurate, especially with accents, but the small/medium models still cost real time on CPU. Apple's SFSpeechRecognizer is tuned for streaming and ships with the OS — for typical meeting Mandarin it's faster and quite accurate. I keep Whisper as a "noisy meeting" mode.

What about Cantonese or other Chinese variants?

SFSpeechRecognizer supports yue-CN for Cantonese. The pinyin step is Mandarin-specific (you'd want Jyutping for Cantonese — a separate transform). Translation works similarly via the Apple Translation framework if the language pack exists.

Could I open-source this?

Possibly. The core pipeline is straightforward enough that I'm considering it. The interesting code isn't the APIs — it's the debounce logic, the floating panel behavior, and the meeting-specific tuning. Watch the Medium article for updates.

Is on-device translation good enough for technical work?

For technical and product meetings — yes. The Apple Translation models handle engineering jargon surprisingly well. For literary or marketing language, cloud models are still better. For "can this ship next week" and "who owns the API change," local is fine.
