Path to expertise: On-Device Speech & Language Processing
Intermediate → Expert: On-Device Speech & Language Processing
Niche: The person who can optimize a Conformer for ARM, build WFST decoding graphs, run a 3B LLM at 30 tok/s on an automotive SoC, and ship an end-to-end on-device voice pipeline in C++.
Timeline: 8–9 months at ~2 hours/day (~36 weeks)
Philosophy: Depth over breadth. Every concept gets implemented. Theory is learned through debugging.
The Focus
Go deep (expert-level):
- Streaming ASR — CTC, transducers, WFST decoding, on-device Conformer
- On-device NLU — compact intent-slot models, hybrid cloud fallback
- On-device LLM inference — small LLM optimization, speculative decoding, KV-cache management, llama.cpp/MLC-LLM
- Edge model optimization — quantization, distillation, C++ inference, ARM profiling
Go competent (working knowledge, not expert):
- Audio signal processing — enough to debug feature pipelines
- Retrieval/ranking — enough for NLU candidate selection and on-device RAG
- LLM training/fine-tuning — enough to adapt models with LoRA, not enough to pretrain
Deferred (separate learning tracks):
- Learning-to-rank as a standalone discipline
- Research-level NAS, RLHF, alignment
Phase 1: Audio & Transformer Foundations (Weeks 1–5)
1A. Speech Signal Processing (Weeks 1–2)
Competent-level. You need to understand the input to your models, not publish DSP papers.
Sub-skills in order:
- Sampling theory, Nyquist, aliasing — why 16kHz for speech
- STFT, spectrograms, mel-scale, MFCCs — the feature extraction pipeline
- Voice Activity Detection — energy-based, then model-based
- Noise estimation & spectral subtraction basics
80/20 concepts:
- MFCCs — compact features still widely used in on-device ASR and keyword-spotting front ends
- Log-mel spectrograms — the input to nearly all modern audio models
- Windowing and hop size — understand the time-frequency resolution tradeoff
Resources:
- Speech and Language Processing — Jurafsky & Martin, Ch. 16 (web.stanford.edu/~jurafsky/slp3)
- Librosa tutorials: librosa.org/doc
- Stanford CS224S lecture notes (online)
Milestone: Python pipeline: raw WAV → mel spectrogram → MFCC → VAD segmentation. Visualize each stage. Then port MFCC extraction to C++ using only standard math.
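A minimal sketch of that pipeline in Python, assuming librosa is available and a 16 kHz mono WAV named utterance.wav; the frame sizes and the 30 dB VAD threshold are illustrative choices, not requirements:

```python
import numpy as np
import librosa

# Load at 16 kHz mono — the standard rate for speech front ends.
wav, sr = librosa.load("utterance.wav", sr=16000, mono=True)

# 25 ms windows with a 10 ms hop: the usual time-frequency tradeoff.
n_fft, hop = 400, 160

# Log-mel spectrogram: the input to most modern acoustic models.
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                     hop_length=hop, n_mels=80)
log_mel = librosa.power_to_db(mel)

# MFCCs: DCT of the log-mel energies; 13 coefficients is a typical choice.
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop)

# Crude energy-based VAD: keep frames within 30 dB of the loudest frame.
frame_db = librosa.power_to_db(mel.sum(axis=0))
speech_frames = frame_db > (frame_db.max() - 30.0)   # threshold is a tunable guess

print(log_mel.shape, mfcc.shape, speech_frames.mean())
```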
1B. Transformer & Attention Internals (Weeks 3–5)
Expert-level. This is the architecture you’ll be optimizing and deploying — for both ASR and LLMs.
Sub-skills in order:
- Self-attention — Q/K/V projections, scaled dot-product
- Multi-head attention — what each head learns
- Positional encoding — sinusoidal, learned, rotary (RoPE)
- Layer normalization — pre-norm vs post-norm
- KV-cache — THE bottleneck for on-device LLM inference
- Flash Attention — memory-efficient tiling approach
- Grouped-Query Attention (GQA) — the KV-cache compression trick used by Llama 2/3, Gemma, Phi
- Encoder-only vs encoder-decoder vs decoder-only — when to use which
80/20 concepts:
- KV-cache mechanics — dominates memory on edge devices for both ASR decoders and LLMs
- GQA — reduces KV-cache size by sharing key/value heads; critical for fitting LLMs in edge memory
- Flash Attention tiling — O(n²) compute but O(n) memory
- Attention complexity — and when linear alternatives are worth the accuracy tradeoff
Resources:
- Attention Is All You Need (Vaswani et al., 2017) — implement from scratch
- Karpathy’s “Let’s build GPT from scratch” + nanoGPT repo
- Jay Alammar’s “The Illustrated Transformer”
- Lilian Weng’s “Attention? Attention!”
- GQA paper (Ainslie et al., 2023)
Milestone: Implement a transformer decoder from scratch in PyTorch (no nn.Transformer). Implement KV-cache with both MHA and GQA. Profile: measure memory and FLOPs for each. Understand why GQA cuts KV-cache by 4-8x.
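A minimal single-step GQA decode with a growing KV-cache, as a PyTorch sketch; the dimensions (8 query heads sharing 2 KV heads) are illustrative and not tied to any particular model:

```python
import torch
import torch.nn.functional as F

d_model, n_q_heads, n_kv_heads = 512, 8, 2          # 4 query heads per KV head
head_dim = d_model // n_q_heads
Wq = torch.nn.Linear(d_model, n_q_heads * head_dim, bias=False)
Wk = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
Wv = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)

def decode_step(x, kv_cache):
    """x: (batch, 1, d_model) — one new token. kv_cache: (k, v) tensors or None."""
    b = x.size(0)
    q = Wq(x).view(b, 1, n_q_heads, head_dim).transpose(1, 2)     # (b, Hq, 1, d)
    k = Wk(x).view(b, 1, n_kv_heads, head_dim).transpose(1, 2)    # (b, Hkv, 1, d)
    v = Wv(x).view(b, 1, n_kv_heads, head_dim).transpose(1, 2)
    if kv_cache is not None:                                      # append to the cache
        k = torch.cat([kv_cache[0], k], dim=2)
        v = torch.cat([kv_cache[1], v], dim=2)
    kv_cache = (k, v)
    # Expand KV heads so each group of query heads shares one cached KV head.
    rep = n_q_heads // n_kv_heads
    k_exp = k.repeat_interleave(rep, dim=1)                       # (b, Hq, T, d)
    v_exp = v.repeat_interleave(rep, dim=1)
    att = (q @ k_exp.transpose(-2, -1)) / head_dim ** 0.5
    out = F.softmax(att, dim=-1) @ v_exp                          # (b, Hq, 1, d)
    return out.transpose(1, 2).reshape(b, 1, d_model), kv_cache

cache = None
for _ in range(16):                                               # toy decode loop
    y, cache = decode_step(torch.randn(1, 1, d_model), cache)
print(cache[0].shape)   # cache stores n_kv_heads (2), not n_q_heads (8) -> 4x smaller
```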
Phase 2: ASR Deep Dive (Weeks 6–12)
This is your core. Go deep.
2A. End-to-End ASR Architectures (Weeks 6–8)
Sub-skills in order:
- CTC loss — the math, the blank token, the conditional independence assumption
- CTC decoding — greedy, beam search, prefix beam search
- RNN-Transducer (RNN-T) — joint network, how it differs from CTC
- Conformer — convolution + attention, why it beats pure transformers for speech
- Streaming ASR — chunked attention, causal convolutions, look-ahead tradeoff
- Endpointer design — when to stop listening (energy + model-based)
80/20 concepts:
- CTC + beam search — how most on-device ASR works today
- Conformer architecture — current production SOTA, understand every block
- Streaming chunked attention — the latency vs accuracy knob you’ll tune constantly
- RNN-T vs CTC tradeoff — RNN-T is better but heavier; know when each wins
Resources:
- ESPnet toolkit — reference Conformer-CTC and Conformer-Transducer implementations
- NVIDIA NeMo ASR tutorials
- Conformer paper (Gulati et al., 2020)
- RNN-T paper (Graves, 2012) + Google’s streaming RNN-T papers
Milestone: Train a Conformer-CTC model on LibriSpeech-100h. Compare streaming (chunked) vs non-streaming WER. Profile model: FLOPs, memory, latency per utterance.
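Greedy CTC decoding is worth writing by hand once before beam search. A minimal sketch in PyTorch, assuming log-probabilities of shape (time, vocab) and blank at index 0 (the blank index is a convention of your model, not fixed):

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    """log_probs: (T, V). Collapse consecutive repeats, then drop blanks — CTC's two rules."""
    best = log_probs.argmax(dim=-1).tolist()        # best symbol per frame
    out, prev = [], None
    for sym in best:
        if sym != prev and sym != blank:            # new non-blank symbol
            out.append(sym)
        prev = sym
    return out

# Toy example: frames vote for [blank, a, a, blank, b, b] -> [a, b]
lp = torch.log_softmax(torch.tensor([
    [5., 0., 0.], [0., 5., 0.], [0., 5., 0.],
    [5., 0., 0.], [0., 0., 5.], [0., 0., 5.],
]), dim=-1)
print(ctc_greedy_decode(lp))   # [1, 2]
```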
2B. WFST-Based Decoding (Weeks 9–10)
Expert-level. Directly relevant to your OpenFST work.
Sub-skills in order:
- FST fundamentals — acceptors, transducers, weights, semirings
- Core operations — composition, determinization, minimization, epsilon removal
- The ASR decoding pipeline — H∘C∘L∘G graph construction
- On-the-fly composition — when full composition is too expensive
- Lattice generation and rescoring
- Integrating neural model scores with WFST decoding
80/20 concepts:
- H∘C∘L∘G composition — understand what each transducer does and why order matters
- Determinization — why it’s needed and when it blows up
- On-the-fly composition — critical for on-device where memory is constrained
- Lattice rescoring with LM — how to improve accuracy without rerunning the acoustic model
Resources:
- OpenFST documentation and tutorials
- Kaldi WFST recipes — the best practical WFST tutorials available
- Mohri et al., “Weighted Finite-State Transducers in Speech Recognition” (the foundational paper)
- Speech Recognition Algorithms Using Weighted Finite-State Transducers — Takaaki Hori & Atsushi Nakamura
Milestone: Build a WFST decoding graph from scratch using OpenFST: compile H, C, L, G transducers individually, compose them, decode CTC output through it. Measure WER improvement over greedy CTC decode.
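Before reaching for OpenFST, it helps to see composition itself in a few lines. A toy pure-Python sketch (explicitly not OpenFST, and without the epsilon-filter machinery a real implementation needs) of composing a lexicon-like L with a grammar-like G in the tropical semiring, where arc weights add along a path and L's output labels must match G's input labels:

```python
from collections import defaultdict

# A toy WFST: {state: [(in_label, out_label, weight, next_state), ...]},
# start state 0 and a set of final states. Tropical semiring: "times" is +.

# L-like transducer: phones -> word ("k ae t" -> "cat"), "" is epsilon.
L = {0: [("k", "", 0.0, 1)], 1: [("ae", "", 0.0, 2)], 2: [("t", "cat", 0.5, 3)]}
L_final = {3}

# G-like acceptor: unigram cost for the word "cat".
G = {0: [("cat", "cat", 1.2, 1)]}
G_final = {1}

def compose(A, A_final, B, B_final):
    """Compose A∘B: pair states, match A's output label to B's input label.
    Simplified: an epsilon output in A advances A alone (no epsilon filter)."""
    start, out, final = (0, 0), defaultdict(list), set()
    stack, seen = [start], {start}
    while stack:
        q = stack.pop()
        sa, sb = q
        if sa in A_final and sb in B_final:
            final.add(q)
        for (i, o, w, na) in A.get(sa, []):
            if o == "":                                   # epsilon output: B stays put
                nxt = (na, sb)
                out[q].append((i, "", w, nxt))
                if nxt not in seen: seen.add(nxt); stack.append(nxt)
            else:
                for (i2, o2, w2, nb) in B.get(sb, []):
                    if i2 == o:                           # label match
                        nxt = (na, nb)
                        out[q].append((i, o2, w + w2, nxt))   # tropical "times" = +
                        if nxt not in seen: seen.add(nxt); stack.append(nxt)
    return out, final

LG, LG_final = compose(L, L_final, G, G_final)
for state, arcs in LG.items():
    for arc in arcs:
        print(state, arc)
# The path k/ae/t -> "cat" now carries lexicon + LM cost: 0.5 + 1.2 = 1.7
```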
2C. Wake Word & Keyword Spotting (Weeks 11–12)
Sub-skills in order:
- Small-footprint model architectures — DS-CNN, CRNN, attention-based KWS
- Streaming detection — sliding window, threshold tuning
- False accept / false reject tradeoff — ROC curves for KWS
- Data augmentation for KWS — noise injection, room impulse response simulation
- Personalized wake words — few-shot enrollment
80/20 concepts:
- DS-CNN (depthwise separable CNN) — the go-to architecture for tiny keyword spotters
- Threshold tuning — the FA/FR tradeoff is the entire product decision
- Data augmentation — more important than model architecture for KWS robustness
Resources:
- Google Speech Commands dataset
- “Hello Edge” paper (Zhang et al., 2017) — keyword spotting on microcontrollers
- Arm ML Zoo KWS examples
Milestone: Train a keyword spotter (<100KB model) on Speech Commands. Deploy in C++ with streaming audio input. Tune the detection threshold to hit a target FA rate.
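A minimal DS-CNN sketch in PyTorch; the input shape (49 MFCC frames × 10 coefficients) follows the usual Speech Commands setup, while the channel count and number of blocks are illustrative choices:

```python
import torch
import torch.nn as nn

class DSConvBlock(nn.Module):
    """Depthwise separable conv: per-channel 3x3 conv + 1x1 pointwise channel mix."""
    def __init__(self, ch):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)
        self.pointwise = nn.Conv2d(ch, ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(ch), nn.BatchNorm2d(ch)

    def forward(self, x):
        x = torch.relu(self.bn1(self.depthwise(x)))
        return torch.relu(self.bn2(self.pointwise(x)))

class DSCNN(nn.Module):
    def __init__(self, n_classes=10, ch=64):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, kernel_size=(10, 4), stride=(2, 2), bias=False)
        self.blocks = nn.Sequential(*[DSConvBlock(ch) for _ in range(4)])
        self.head = nn.Linear(ch, n_classes)

    def forward(self, x):                  # x: (batch, 1, time=49, mfcc=10)
        x = torch.relu(self.stem(x))
        x = self.blocks(x)
        x = x.mean(dim=(2, 3))             # global average pool
        return self.head(x)

model = DSCNN()
# ~23k parameters here -> ~92KB at FP32, ~23KB at INT8: sanity-check the budget.
print(sum(p.numel() for p in model.parameters()))
print(model(torch.randn(1, 1, 49, 10)).shape)   # (1, 10) keyword scores
```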
Phase 3: On-Device NLU (Weeks 13–17)
3A. Intent & Slot Models (Weeks 13–16)
Expert-level. This is the brain after the ears.
Sub-skills in order:
- Intent classification — multi-class, confidence calibration
- Slot filling / NER — BIO sequence labeling
- Joint intent-slot models — shared encoder, why joint > separate
- Dialogue context — slot carryover, context resolution
- Model compression for NLU — DistilBERT, TinyBERT, pruned transformers
- On-device inference — ONNX export, INT8 quantization of NLU models
- Hybrid routing — deciding what runs on-device vs cloud
80/20 concepts:
- Joint intent-slot architecture (JointBERT-style) — this is production NLU
- BIO decoding edge cases — partial entities, nested entities
- Distillation for on-device — the teacher-student pipeline for NLU
- Confidence-based routing — when to fall back to cloud (or to the on-device LLM)
Resources:
- Speech and Language Processing — Jurafsky & Martin, Ch. 8 (Sequence Labeling), Ch. 15 (Dialogue)
- Hugging Face token-classification tutorials
- Snips NLU paper — on-device NLU design reference
- MASSIVE dataset (Amazon) — multilingual NLU benchmark
Milestone: Build a joint intent-slot model for a car-control domain (“set temperature to 72”, “navigate to home”). Train, distill to a smaller model, quantize to INT8, deploy in C++ via ONNX Runtime. Measure: accuracy, latency, model size at each step.
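A minimal JointBERT-style sketch with Hugging Face Transformers; the DistilBERT checkpoint and the intent/slot label counts are placeholder assumptions, and the training loss would be a plain sum of the intent and slot cross-entropies:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointIntentSlot(nn.Module):
    """Shared encoder; intent head on the [CLS] token, slot head on every token."""
    def __init__(self, encoder_name="distilbert-base-uncased",
                 n_intents=8, n_slot_tags=15):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, n_intents)
        self.slot_head = nn.Linear(hidden, n_slot_tags)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        intent_logits = self.intent_head(h[:, 0])       # first token = [CLS]
        slot_logits = self.slot_head(h)                  # per-token BIO tag scores
        return intent_logits, slot_logits

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = JointIntentSlot()
batch = tok(["set temperature to 72"], return_tensors="pt")
intent_logits, slot_logits = model(batch["input_ids"], batch["attention_mask"])
# Training: total loss = CE(intent) + CE(slot tags over non-padding tokens)
print(intent_logits.shape, slot_logits.shape)
```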
3B. Retrieval & Ranking for NLU (Week 17)
Competent-level. Supporting skill for both task-specific NLU and on-device RAG.
Sub-skills:
- Bi-encoder embeddings for intent/entity retrieval
- FAISS / HNSW for fast approximate nearest neighbor search
- Cross-encoder reranking for disambiguation
- On-device vector search — small index, constrained memory
80/20 concepts:
- Bi-encoder vs cross-encoder tradeoff — speed vs accuracy, and when to cascade
- Hard negative mining — the single biggest lever for embedding quality
- FAISS IVF-PQ — the quantized index that fits in edge memory
Resources:
- Introduction to Information Retrieval — Manning, Raghavan & Schütze, Ch. 1–6 (nlp.stanford.edu/IR-book)
- Sentence-Transformers docs and training examples
- FAISS wiki and tutorials
Milestone: Build a semantic intent matcher: encode candidate intents with a bi-encoder, index with FAISS, retrieve top-K, rerank with a cross-encoder. Measure latency on a constrained device.
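A compact sketch of the retrieve-then-rerank cascade, assuming sentence-transformers and faiss are installed; the checkpoint names and the intent catalogue are placeholders:

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer, CrossEncoder

# Placeholder intent catalogue — swap in your own domain.
intents = ["set cabin temperature", "navigate to destination",
           "play music", "call a contact"]

bi = SentenceTransformer("all-MiniLM-L6-v2")                   # bi-encoder: fast, indexable
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")      # cross-encoder: slow, accurate

# Embed the catalogue once and build a small exact inner-product index.
emb = np.asarray(bi.encode(intents, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(emb.shape[1])                        # cosine, since vectors are normalized
index.add(emb)

def match(query: str, k: int = 3) -> str:
    q = np.asarray(bi.encode([query], normalize_embeddings=True), dtype="float32")
    _, idx = index.search(q, k)                                 # retrieve top-k candidates
    candidates = [intents[i] for i in idx[0]]
    scores = ce.predict([(query, c) for c in candidates])       # rerank only the top-k
    return candidates[int(np.argmax(scores))]

print(match("make it warmer in here"))
```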
Phase 4: Edge Optimization & Deployment (Weeks 18–25)
4A. Model Optimization (Weeks 18–21)
Expert-level. This is where you make models actually run on hardware. Applies to ASR, NLU, AND LLMs.
Sub-skills in order:
- Post-training quantization (PTQ) — INT8, calibration strategies, per-channel vs per-tensor
- Quantization-aware training (QAT) — fake quantization, straight-through estimator
- INT4 and mixed-precision — which layers tolerate low precision (attention vs FFN)
- Knowledge distillation — teacher-student, feature-based, attention transfer
- Structured pruning — channel pruning, block pruning (actually speeds up inference)
- Operator fusion — Conv+BN+ReLU, attention fusion, why it matters
- Memory optimization — activation checkpointing, weight sharing, memory-mapped models
80/20 concepts:
- INT8 PTQ — always try this first, gets 2-4x speedup with minimal accuracy loss
- INT4 for LLM weights — the standard for on-device LLMs (GPTQ, AWQ, GGUF formats)
- Knowledge distillation — most reliable path to a small accurate model
- Structured pruning — the only pruning that speeds up inference without sparse hardware
- Operator fusion — the first thing every inference runtime does
Resources:
- MIT HAN Lab: TinyML and Efficient Deep Learning (Song Han) — YouTube lectures
- Efficient Deep Learning — Menghani (O’Reilly, 2022)
- TensorRT quantization guide
- ONNX Runtime optimization docs
- AWQ paper (Lin et al., 2023) — activation-aware weight quantization for LLMs
Milestone: Take your Conformer-CTC from Phase 2. Apply: INT8 PTQ → INT4 mixed-precision QAT → structured pruning → distillation into smaller architecture. Build a Pareto curve: WER vs latency vs model size vs peak memory.
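The fastest first experiment is PyTorch dynamic quantization of the Linear layers (weight-only INT8, activations quantized on the fly); static PTQ with a calibration set and QAT come after. A sketch with a stand-in model:

```python
import os
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in for your trained FP32 encoder
    nn.Linear(80, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 32),
).eval()

# Dynamic INT8: weights stored as INT8; activations quantized per batch at runtime.
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 80)
print("fp32:", model(x)[0, :3])
print("int8:", qmodel(x)[0, :3])            # close but not identical — measure WER, not just MSE

# Quantized weights are packed inside the modules, so compare serialized sizes.
torch.save(model.state_dict(), "fp32.pt")
torch.save(qmodel.state_dict(), "int8.pt")
print(os.path.getsize("fp32.pt") / 1e6, "MB ->", os.path.getsize("int8.pt") / 1e6, "MB")
```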
4B. C++ Inference & Hardware Deployment (Weeks 22–25)
Expert-level. Where the rubber meets the road.
Sub-skills in order:
- ONNX Runtime — export, optimize, execution providers, custom operators
- TensorRT — engine building, layer fusion, INT8 calibration
- ARM NEON intrinsics — SIMD for feature extraction and post-processing
- Threading — pipeline parallelism for streaming (capture → features → inference → decode)
- Memory management — arena allocators, avoiding fragmentation on embedded
- Profiling — roofline model, memory bandwidth vs compute bound analysis
- On-device model updates — differential updates, versioning
80/20 concepts:
- ONNX Runtime — most portable inference runtime, learn it thoroughly
- ARM NEON — where the real speedup lives for automotive/mobile targets
- Roofline model — tells you if you’re memory-bound or compute-bound, determines your optimization strategy
- Pipeline parallelism — overlap audio capture with inference for streaming
Resources:
- ONNX Runtime docs: onnxruntime.ai
- ARM NEON developer guides
- TensorRT developer guide
- Pete Warden’s “TinyML” book
Milestone: Full C++ inference pipeline: audio capture → MFCC extraction (NEON-optimized) → Conformer inference (ONNX Runtime, INT8) → CTC+WFST decode. Run on ARM hardware (Raspberry Pi or Android NDK). Target: <200ms end-to-end for a 3-second utterance.
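The ONNX Runtime C++ API mirrors the Python one closely, so it pays to prototype session options and measure steady-state latency in Python before the C++ port. A sketch, assuming an exported model.onnx that takes a (batch, frames, 80) log-mel tensor:

```python
import time
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 4                       # match your core budget

sess = ort.InferenceSession("model.onnx", sess_options=opts,
                            providers=["CPUExecutionProvider"])

feats = np.random.randn(1, 300, 80).astype(np.float32)   # ~3 s of frames at a 10 ms hop
input_name = sess.get_inputs()[0].name

# Warm up, then time steady-state latency.
for _ in range(3):
    sess.run(None, {input_name: feats})
t0 = time.perf_counter()
for _ in range(20):
    sess.run(None, {input_name: feats})
print("mean latency: %.1f ms" % ((time.perf_counter() - t0) / 20 * 1e3))
```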
Phase 5: On-Device LLM Inference (Weeks 26–31)
Expert-level. This is the new frontier — making language models run fast on constrained hardware.
5A. Small LLM Architectures & Adaptation (Weeks 26–28)
Sub-skills in order:
- Small LLM landscape — Phi-3/3.5, Gemma 2B, Llama 3.2 1B/3B, Qwen2 0.5B/1.5B
- Architecture choices at small scale — depth vs width, MoE for edge, shared embeddings
- Tokenization for on-device — BPE, SentencePiece, vocabulary size tradeoffs (smaller vocab = smaller embedding table = less memory)
- LoRA / QLoRA fine-tuning — adapt a base model to your domain
- Prompt engineering for constrained models — shorter prompts = fewer KV-cache entries = faster
- When to use an LLM vs task-specific model — the decision framework
80/20 concepts:
- Model selection for edge — not all 3B models are equal; GQA head count, vocab size, and context length determine memory footprint more than parameter count alone
- QLoRA — the practical way to adapt; understand that the base weights stay quantized during training
- The decision boundary — a 2MB intent classifier at 5ms beats a 2GB LLM at 500ms for “set temperature to 72”. The LLM wins for “what’s a good restaurant near my next meeting that has outdoor seating”
- Vocabulary size impact — embedding-table memory is vocab × hidden dim × bytes per weight, so going from a 32K to a 128K vocabulary costs several hundred MB at FP16 for a 1–3B-class model (double that if input and output embeddings are untied)
Resources:
- Phi-3 technical report (Microsoft, 2024)
- Llama 3 paper (Meta, 2024) — especially the efficiency sections
- Hugging Face PEFT library + QLoRA tutorials
- Sebastian Raschka’s “Build a Large Language Model From Scratch”
Milestone: Fine-tune Llama 3.2 1B with QLoRA on your car-control domain from Phase 3. Compare accuracy, latency, and memory against your task-specific intent-slot model. Document the crossover point: at what query complexity does the LLM start winning?
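A minimal QLoRA setup sketch with Hugging Face PEFT and bitsandbytes; the checkpoint name, target modules, and LoRA hyperparameters are typical starting points rather than prescriptions (Llama 3.2 weights are gated on the Hub, so any small causal LM works for practice):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-3.2-1B-Instruct"          # gated; substitute any small causal LM

bnb = BitsAndBytesConfig(                           # base weights stay 4-bit during training
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(                                  # small trainable adapters on attention
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()                  # typically well under 1% of the base

# Train with the usual Trainer/SFTTrainer loop on your intent-style prompts,
# then ship the adapter separately so it can be swapped at runtime.
```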
5B. LLM Inference Optimization for Edge (Weeks 29–31)
This is the deep expertise that’s rare and valuable.
Sub-skills in order:
- KV-cache management — pre-allocation, rolling buffer for streaming, memory-mapped KV
- Speculative decoding — draft model + verify, why it gives free speedup on edge
- Continuous batching on single-user devices — interleaving prefill and decode phases
- Weight formats — GGUF, GPTQ, AWQ, EXL2 — understand the tradeoffs (accuracy, speed, compatibility)
- llama.cpp internals — GGML tensor library, quantization kernels, Metal/NEON backends
- MLC-LLM / ExecuTorch — alternative runtimes, when each is better than llama.cpp
- Context length management — sliding window attention, StreamingLLM, when to truncate vs summarize
- Prefill vs decode optimization — prefill is compute-bound, decode is memory-bound; different optimizations for each
- On-device RAG — small vector index + LLM for grounded responses without cloud
80/20 concepts:
- Prefill is compute-bound, decode is memory-bound — this single insight determines your entire optimization strategy. Prefill benefits from parallelism and fusion. Decode benefits from quantization and memory bandwidth.
- Speculative decoding — use a tiny draft model (or n-gram) to propose tokens, verify in parallel with the big model. 2-3x speedup for free on edge where memory bandwidth is the bottleneck.
- KV-cache is the memory wall — a 3B model with 4K context and a full-MHA FP16 KV-cache uses on the order of 1.5GB just for KV (see the calculator after this list). GQA + INT8 KV quantization + a rolling window bring this down to manageable levels.
- llama.cpp GGUF format — the de facto standard for on-device LLM deployment. Understand Q4_K_M, Q5_K_M, Q8_0 quantization levels and their accuracy/speed tradeoffs.
- On-device RAG — retrieve from a local FAISS index, inject into a short prompt, generate with the local LLM. No cloud needed. Context window is precious — retrieval quality directly determines generation quality.
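A quick sanity check on the KV-cache memory wall above; the layer and head counts are illustrative of a 3B-class model, not any specific checkpoint:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """K and V per layer, each seq_len x n_kv_heads x head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Full multi-head attention, FP16, 4K context (illustrative 3B-class shapes):
mha = kv_cache_bytes(n_layers=28, n_kv_heads=24, head_dim=128, seq_len=4096)
# Same shapes with GQA (8 KV heads) and an INT8 KV-cache:
gqa_int8 = kv_cache_bytes(28, 8, 128, 4096, bytes_per_elem=1)
print(f"MHA FP16: {mha / 2**30:.2f} GiB, GQA INT8: {gqa_int8 / 2**30:.2f} GiB")
```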
Resources:
- llama.cpp repo + GGML documentation (github.com/ggerganov/llama.cpp)
- MLC-LLM docs (mlc.ai/mlc-llm)
- ExecuTorch docs (pytorch.org/executorch)
- Speculative decoding paper (Leviathan et al., 2023)
- StreamingLLM paper (Xiao et al., 2023)
- “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM paper — the concepts apply to edge too)
Milestone: Deploy Llama 3.2 1B (Q4_K_M) on ARM hardware using llama.cpp. Implement:
- Baseline: measure tokens/sec for prefill and decode separately
- Add speculative decoding with a tiny draft model — measure speedup (see the sketch after this list)
- Add INT8 KV-cache quantization — measure memory reduction vs quality
- Build on-device RAG: FAISS index of car manual chunks + LLM generation
- Target: 20+ tokens/sec decode on Raspberry Pi 5 or equivalent ARM SoC
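The accept/reject loop at the heart of speculative decoding fits in a page. A framework-agnostic sketch in plain Python, assuming you can query greedy next-token predictions from a draft and a target model; greedy acceptance is shown for clarity, whereas the published method uses a probabilistic rule that preserves the target distribution:

```python
def speculative_decode(target_greedy, draft_next, prompt, max_new=64, k=4):
    """Greedy speculative decoding.

    draft_next(tokens)    -> the draft model's next token id (cheap, sequential).
    target_greedy(tokens) -> list with the target model's greedy next-token
        prediction after every prefix of `tokens` (one batched forward pass).
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) Draft proposes k tokens (sequential decode, but on a tiny model).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Target scores all k positions in ONE forward pass (parallel, compute-bound).
        preds = target_greedy(tokens + draft)
        n_accept = 0
        for i, tok in enumerate(draft):
            if preds[len(tokens) + i - 1] == tok:   # target agrees with the draft token
                n_accept += 1
            else:
                break
        tokens += draft[:n_accept]
        # 3) At the first disagreement (or after a full accept), take the target's own token.
        tokens.append(preds[len(tokens) - 1])
    return tokens[: len(prompt) + max_new]

# Toy check with identical draft and target "models": every proposal is accepted.
f = lambda prefix: sum(prefix) % 7
target = lambda toks: [f(toks[: i + 1]) for i in range(len(toks))]
print(speculative_decode(target, f, prompt=[1, 2, 3], max_new=8, k=4))
```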
Phase 6: Full Integration (Weeks 32–36)
The Complete On-Device Voice + Language Pipeline
Build the full system in C++:
- Wake word detection — small CNN, always-on, <1MB
- Streaming ASR — chunked Conformer-CTC with WFST decoding
- Intent router — fast classifier that decides: task-specific NLU vs on-device LLM vs cloud
- On-device NLU (fast path) — distilled intent-slot model, <10ms for simple commands
- On-device LLM (complex path) — quantized small LLM for open-ended queries, with optional RAG
- Hybrid cloud fallback — confidence-based routing when on-device can’t handle it
- Graceful degradation — full offline capability for core commands
Expert concepts to integrate:
- Three-tier routing: task-specific (fast) → on-device LLM (medium) → cloud (slow but capable) — see the routing sketch after this list
- Memory budget management — ASR model + NLU model + LLM sharing a fixed pool; LLM weights can be memory-mapped and paged in/out
- Model hot-swapping — loading domain-specific NLU or LoRA adapters on demand
- Latency profiling of the full pipeline, not just individual models
- Endpointer → ASR → router → NLU/LLM pipeline with proper state management
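A toy sketch of the three-tier router in Python; the thresholds and the use of NLU confidence plus query length as routing signals are illustrative assumptions, and the production version would live inside the C++ pipeline:

```python
from dataclasses import dataclass

@dataclass
class RouteDecision:
    tier: str          # "nlu", "local_llm", or "cloud"
    reason: str

def route(nlu_confidence: float, query_len_tokens: int, online: bool,
          nlu_threshold: float = 0.85, llm_max_tokens: int = 48) -> RouteDecision:
    """Three-tier routing: fast task-specific NLU -> on-device LLM -> cloud."""
    if nlu_confidence >= nlu_threshold:
        return RouteDecision("nlu", "high-confidence known intent")
    if query_len_tokens <= llm_max_tokens or not online:
        return RouteDecision("local_llm", "open-ended but fits the on-device budget")
    return RouteDecision("cloud", "long/complex query and connectivity available")

print(route(0.97, 6, online=True))     # "set temperature to 72" -> fast NLU path
print(route(0.41, 22, online=True))    # open-ended query -> on-device LLM
print(route(0.38, 120, online=False))  # long query, no connectivity -> still on-device LLM
```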
Milestone: Demo running on a single-board computer:
- Wake-to-response <500ms for simple on-device commands (NLU fast path)
- Wake-to-first-token <1.5s for complex queries (LLM path)
- Full offline operation for core car-control commands
- Write a technical doc covering architecture decisions, tradeoffs, and profiling results
Competence Track (Lighter Touch)
Audio Beyond Speech (~1 week, during Phase 2)
- Speaker embeddings (x-vectors, ECAPA-TDNN) for personalization
- Speech enhancement via T-F masking
- Read: SpeechBrain tutorials for speaker verification
- Build: A speaker verification demo — enroll 3 speakers, verify from new audio
Daily Practice Plan
Weekdays (2 hours/day):
| Block | Duration | Activity |
|---|---|---|
| Theory | 40 min | Read textbook/paper for current sub-skill. Handwritten or markdown notes. |
| Build | 50 min | Implement what you just read. No copy-paste. Type it, debug it, understand it. |
| Reflect | 20 min | Learning log: what clicked, what’s fuzzy, one question for tomorrow. |
| Queue | 10 min | Bookmark tomorrow’s resources, write down the starting point. |
Weekend (one day, 3–4 hours):
| Block | Duration | Activity |
|---|---|---|
| Project | 2 hours | Milestone work — the bigger builds that tie concepts together. |
| Paper | 45 min | One paper per week. Structured notes: problem, method, result, limitation. |
| Share | 15 min | Blog post, SO answer, or internal tech doc. Teaching solidifies understanding. |
Progress Checkpoints
| Week | You Should Be Able To… |
|---|---|
| 5 | Explain MFCCs and transformer attention from memory. Working KV-cache with GQA implemented and profiled. |
| 12 | Train a Conformer-CTC, build an H∘C∘L∘G decoding graph, deploy a keyword spotter in C++. |
| 17 | Build and compress a joint intent-slot model. Build a semantic retrieval pipeline for NLU. |
| 25 | Quantize, prune, distill, and deploy a model in C++ on ARM. Hit latency targets. Explain roofline model. |
| 31 | Run a quantized LLM on ARM at 20+ tok/s. Implement speculative decoding. Build on-device RAG. Explain prefill vs decode optimization. |
| 36 | Run a full wake-word → ASR → router → NLU/LLM pipeline on-device. Three-tier routing working. Full offline mode for core commands. |
Deferred Learning Tracks
Revisit these when your core expertise is solid (after week 36+):
Track A: Learning to Rank
- Pointwise, pairwise, listwise losses
- LambdaMART, ColBERT, neural ranking
- Feature engineering for ranking
- Online learning to rank, click models
- Entry point: Learning to Rank for Information Retrieval — Tie-Yan Liu
Track B: LLM Alignment & Advanced NAS
- RLHF, DPO, constitutional AI
- Hardware-aware Neural Architecture Search
- Multi-objective NAS for edge targets
- Entry point: Lilian Weng’s RLHF blog + MIT HAN Lab NAS lectures
Key Principle
Depth beats breadth. An expert in on-device speech + LLM inference who can ship a full voice pipeline in C++ is far more valuable than someone who’s intermediate at all four fields. The hardware will tell you exactly what you don’t know — deploy early, profile often, and let the constraints teach you.
