mobile / May 8, 2026 / 14 min

react native tts false interrupts and synthesis latency

rms fires on speaker output. VOICE_COMMUNICATION fixes android; whisper VoiceChat handles ios. piper: xnnpack + warmup. kokoro: forward() not stream().

I was building a hands-free conversational mode for an offline React Native AI app. It works without internet, runs the model on-device, and speaks responses aloud while staying ready to listen for the next input. The interaction loop I was going for was similar to Gemini Live or ChatGPT voice mode: the assistant speaks, the user can interrupt mid-sentence, the assistant stops and processes the new input immediately.

The last piece was interruption detection. I implemented it with RMS amplitude monitoring at 16kHz from the device microphone. I set a threshold of 0.02. It worked on the first manual test. It failed on every run after that.

The assistant was interrupting itself. Every TTS sentence crossed the amplitude threshold, fired the interrupt handler, stopped playback, and started waiting for user input. The conversational loop would spin on interrupted half-sentences until I manually stopped it.

The root cause was one line. The fix was one line. Finding the root cause took most of a day. Making synthesis fast enough that the loop actually felt responsive took longer.

The conversational mode setup

The setup was a continuous loop:

1. Microphone recording starts
2. User speaks, Whisper (via whisper.rn running ggml models on-device) transcribes
3. LLM generates a response
4. TTS synthesises and plays the response sentence by sentence
5. While TTS plays, microphone monitors for user interruption
6. If interruption detected, stop TTS, go to step 2
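The loop can be read as a small state machine. The sketch below is only an illustration: the state and event names are hypothetical and do not come from the app's code. It shows why a false interrupt is so damaging: any userInterrupted event during playback hands the turn straight back to the listener.

```typescript
// Hypothetical names: the article does not show its own state types.
type LoopState = 'listening' | 'thinking' | 'speaking';

type LoopEvent =
  | 'utteranceTranscribed' // steps 1-2: Whisper produced text for the user's turn
  | 'responseReady'        // step 3: the LLM finished generating
  | 'ttsFinished'          // step 4: playback ended on its own
  | 'userInterrupted';     // steps 5-6: speech detected during playback

function nextState(state: LoopState, event: LoopEvent): LoopState {
  switch (state) {
    case 'listening':
      return event === 'utteranceTranscribed' ? 'thinking' : state;
    case 'thinking':
      return event === 'responseReady' ? 'speaking' : state;
    case 'speaking':
      // A normal finish and an interruption both hand the turn back to the user.
      return event === 'ttsFinished' || event === 'userInterrupted'
        ? 'listening'
        : state;
  }
}
```

If the interrupt detector fires on the assistant's own voice, every sentence takes the speaking to listening transition, which is exactly the spin described above.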

The app supports three TTS backends: Android's native TextToSpeech for guaranteed availability, Piper (a VITS model served through react-native-sherpa-onnx), and Kokoro (an ExecuTorch model through react-native-executorch). The interrupt detection problem affects all three equally; it lives one layer below the synthesis engine in the microphone recording path. Step 5 is where the problem was.

The microphone monitoring runs concurrently with TTS playback. Android lets an AudioRecord capture while audio is playing without any special session setup (PlayAndRecord is the iOS AVAudioSession category that serves the same purpose). The recording was running. The TTS was playing. The microphone was picking up the TTS.

Why rms detection fails during tts

RMS (root mean square) amplitude is the standard first approach for voice activity detection:

var sumSq = 0.0
for (i in 0 until read) sumSq += (buf[i] / 32768.0) * (buf[i] / 32768.0)
val rms = sqrt(sumSq / read).toFloat()

When the audio source is MediaRecorder.AudioSource.MIC, the buffer contains everything the microphone transducer picks up. On a phone held in hand or sitting on a desk, speaker audio reaches the microphone roughly 20 to 30 dB below its playback level. At normal TTS playback volume, that is still well above a 0.02 RMS threshold.
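To put the 0.02 threshold on the same decibel scale as the bleed figures, here is a minimal TypeScript sketch of the same RMS calculation plus a conversion to dB relative to full scale (dBFS). The function names are illustrative. Microphone gain differs between devices, so this only locates the threshold; it does not predict how loud the bleed will be on a given phone.

```typescript
// RMS of 16-bit PCM samples, normalised to the range 0..1.
function rmsInt16(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSq = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] / 32768;
    sumSq += s * s;
  }
  return Math.sqrt(sumSq / samples.length);
}

// Convert a normalised RMS value to dB relative to full scale (dBFS).
function rmsToDbfs(rms: number): number {
  return 20 * (Math.log(rms) / Math.LN10);
}

// The 0.02 threshold sits at about -34 dBFS.
const thresholdDbfs = rmsToDbfs(0.02);
```

A threshold that low leaves very little headroom: anything the microphone captures above roughly -34 dBFS counts as speech, whoever produced it.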

The microphone cannot distinguish your voice from the device's own speaker output. They are both pressure waves at the transducer. You can raise the threshold, but then you miss soft speech. You can add a cooldown after TTS starts, but then you miss early interruptions. There is no threshold value that solves this without hardware help.

The same failure mode affects WebRTC VAD, silero-vad, or any other voice activity detector running on raw microphone input during audio playback. TTS output is speech, so a smarter detector flags it just as an energy threshold does. The audio source is the problem, not the detection algorithm.

The fix: AudioSource.VOICE_COMMUNICATION

Android's MediaRecorder.AudioSource.VOICE_COMMUNICATION (integer value 7) is the audio source tuned for two-way voice. On devices that provide echo cancellation, it routes the microphone input through the Acoustic Echo Cancellation stage before delivering it to your app. The AEC maintains a reference signal of what the speaker is currently outputting and subtracts it from the capture, typically in the audio DSP or firmware rather than in app code.

What your AudioRecord buffer receives after switching: microphone audio with the speaker echo removed. The user's voice is present. The TTS playback audio is not.

// Before: raw mic, picks up speaker audio during TTS
val audioRecord = AudioRecord(
    MediaRecorder.AudioSource.MIC,
    SAMPLE_RATE,
    AudioFormat.CHANNEL_IN_MONO,
    AudioFormat.ENCODING_PCM_16BIT,
    bufferSize
)
 
// After: hardware AEC enabled, speaker audio cancelled before buffer delivery
val audioRecord = AudioRecord(
    MediaRecorder.AudioSource.VOICE_COMMUNICATION,
    SAMPLE_RATE,
    AudioFormat.CHANNEL_IN_MONO,
    AudioFormat.ENCODING_PCM_16BIT,
    bufferSize
)

That one-line change to the audio source constant eliminated every false interrupt in testing.

The android native module

The interruption detector lives in a Kotlin native module called AudioAECMonitorModule. Its API is a one-shot promise: start() opens an AudioRecord session, polls for speech, and resolves the promise exactly once when speech is detected. stop() tears down the session and abandons any pending promise. This is intentional: the JS layer only ever needs one notification per TTS utterance, not a stream of amplitude events.

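import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import android.media.audiofx.AcousticEchoCanceler
import android.media.audiofx.NoiseSuppressor
import android.os.Build
import com.facebook.react.bridge.Promise
import com.facebook.react.bridge.ReactApplicationContext
import com.facebook.react.bridge.ReactContextBaseJavaModule
import com.facebook.react.bridge.ReactMethod
import kotlin.math.sqrt

class AudioAECMonitorModule(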
    private val reactContext: ReactApplicationContext,
) : ReactContextBaseJavaModule(reactContext) {
 
    companion object {
        private const val NAME = "AudioAECMonitorModule"
        private const val SAMPLE_RATE = 16_000
        private const val CHANNEL = AudioFormat.CHANNEL_IN_MONO
        private const val ENCODING = AudioFormat.ENCODING_PCM_16BIT
        private const val SPEECH_THRESHOLD = 0.02f
        private const val HOLD_MS = 200L
    }
 
    @Volatile private var recorder: AudioRecord? = null
    @Volatile private var aec: AcousticEchoCanceler? = null
    @Volatile private var ns: NoiseSuppressor? = null
    @Volatile private var running = false
    private var monitorThread: Thread? = null
 
    override fun getName(): String = NAME
 
    @ReactMethod
    fun start(promise: Promise) {
        stopInternal()
        try {
            val minBuf = AudioRecord.getMinBufferSize(SAMPLE_RATE, CHANNEL, ENCODING)
            if (minBuf == AudioRecord.ERROR_BAD_VALUE || minBuf == AudioRecord.ERROR) {
                promise.reject("AEC_INIT", "AudioRecord not supported on this device")
                return
            }
            val bufSize = minBuf.coerceAtLeast(4096)
 
            val rec = AudioRecord(
                MediaRecorder.AudioSource.VOICE_COMMUNICATION,
                SAMPLE_RATE, CHANNEL, ENCODING, bufSize,
            )
            if (rec.state != AudioRecord.STATE_INITIALIZED) {
                rec.release()
                promise.reject("AEC_INIT", "AudioRecord failed to initialize")
                return
            }
 
            // Hardware AEC removes speaker echo from the mic signal.
            if (AcousticEchoCanceler.isAvailable()) {
                val canceler = AcousticEchoCanceler.create(rec.audioSessionId)
                if (canceler != null) { canceler.enabled = true; aec = canceler }
            }
 
            // Noise suppressor removes steady-state ambient noise (HVAC, fans).
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.JELLY_BEAN
                && NoiseSuppressor.isAvailable()) {
                val suppressor = NoiseSuppressor.create(rec.audioSessionId)
                if (suppressor != null) { suppressor.enabled = true; ns = suppressor }
            }
 
            recorder = rec
            running = true
            rec.startRecording()
 
            monitorThread = Thread {
                val buf = ShortArray(bufSize / 2)
                var aboveThresholdSince = -1L
 
                while (running) {
                    val read = rec.read(buf, 0, buf.size)
                    if (read < 0) break // error code, e.g. the recorder was released: exit rather than spin
                    if (read == 0) continue
 
                    var sumSq = 0.0
                    for (i in 0 until read) sumSq += (buf[i] / 32768.0) * (buf[i] / 32768.0)
                    val rms = sqrt(sumSq / read).toFloat()
 
                    val now = System.currentTimeMillis()
                    if (rms > SPEECH_THRESHOLD) {
                        if (aboveThresholdSince < 0) aboveThresholdSince = now
                        else if (now - aboveThresholdSince >= HOLD_MS) {
                            stopInternal()
                            promise.resolve(null)
                            return@Thread
                        }
                    } else {
                        aboveThresholdSince = -1L
                    }
                }
            }.also { it.isDaemon = true; it.start() }
 
        } catch (e: SecurityException) {
            promise.reject("AEC_PERMISSION", "Microphone permission denied")
        } catch (e: Exception) {
            promise.reject("AEC_ERROR", e.message ?: "Unknown error")
        }
    }
 
    @ReactMethod
    fun stop() { stopInternal() }
 
    private fun stopInternal() {
        running = false
        monitorThread?.interrupt()
        monitorThread = null
        aec?.release(); aec = null
        ns?.release(); ns = null
        try { recorder?.stop() } catch (_: Exception) {}
        recorder?.release(); recorder = null
    }
 
    override fun invalidate() { stopInternal(); super.invalidate() }
}

The hold detection uses wall-clock time rather than frame counting. aboveThresholdSince is a timestamp and the speech event fires once the signal has stayed above SPEECH_THRESHOLD for 200 consecutive milliseconds. Frame counting would produce different durations depending on the buffer size the OS allocates, which varies across devices. Wall-clock time is consistent regardless of buffer shape.
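The following TypeScript sketch isolates the hold logic so the effect of buffer size can be checked without audio hardware. HoldDetector and timeToFire are illustrative names, not part of the app. The detector mirrors the Kotlin logic: the first loud buffer records a timestamp, and a later loud buffer fires once at least 200ms have passed.

```typescript
class HoldDetector {
  private aboveSince: number | null = null;
  private readonly threshold: number;
  private readonly holdMs: number;

  constructor(threshold: number, holdMs: number) {
    this.threshold = threshold;
    this.holdMs = holdMs;
  }

  // Returns true once the level has stayed above the threshold for holdMs.
  update(rms: number, nowMs: number): boolean {
    if (rms <= this.threshold) {
      this.aboveSince = null; // any quiet buffer resets the hold
      return false;
    }
    if (this.aboveSince === null) {
      this.aboveSince = nowMs; // first loud buffer starts the clock
      return false;
    }
    return nowMs - this.aboveSince >= this.holdMs;
  }
}

// Feed continuous loud audio in buffers of `bufferMs` and report how long
// after the first loud buffer the detector fires.
function timeToFire(bufferMs: number): number {
  const detector = new HoldDetector(0.02, 200);
  const first = bufferMs;
  for (let now = first; ; now += bufferMs) {
    if (detector.update(0.1, now)) return now - first;
  }
}

const smallBuffers = timeToFire(32);  // 224
const largeBuffers = timeToFire(128); // 256
```

Both cases fire at the first buffer boundary at or after 200ms, so the hold is at least 200ms and at most one buffer longer. A frame counter tuned for 32ms buffers would instead stretch to four times the duration on a device that delivers 128ms buffers.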

NoiseSuppressor runs alongside AcousticEchoCanceler. AEC removes speaker echo. NoiseSuppressor removes steady-state ambient noise: HVAC, desk fans, road noise from a car mount. After both effects are applied, the buffer contains the microphone signal with two categories of unwanted audio removed. The 0.02 threshold sits comfortably above the residual floor in most environments.

The monitoring thread is marked as a daemon thread. The JVM will not wait for it to finish when the app process exits normally, which prevents hangs during app shutdown if stopInternal is called late in the lifecycle.

class AudioAECMonitorPackage : ReactPackage {
    override fun createNativeModules(
        reactContext: ReactApplicationContext
    ): List<NativeModule> = listOf(AudioAECMonitorModule(reactContext))
 
    override fun createViewManagers(
        reactContext: ReactApplicationContext
    ): List<ViewManager<*, *>> = emptyList()
}

// MainApplication.kt
override val reactHost: ReactHost by lazy {
    getDefaultReactHost(
        context = applicationContext,
        packageList = PackageList(this).packages.apply {
            add(NativeTTSPackage())
            add(AudioAECMonitorPackage())
        },
    )
}

iOS: aec via whisper voicechat

iOS does not have an equivalent of AudioSource.VOICE_COMMUNICATION. The AEC configuration lives in AVAudioSession mode rather than in the recording source. Setting the session mode to VoiceChat before opening the microphone instructs the OS to maintain a speaker reference and subtract it from mic input, the same firmware-level operation that VOICE_COMMUNICATION triggers on Android.

The app uses whisper.rn for on-device transcription. The transcribeRealtime API accepts an audioSessionOnStartIos configuration that sets the AVAudioSession category and mode before the session opens:

const { stop, subscribe } = await whisperContext.transcribeRealtime({
  language: 'en',
  realtimeAudioSec: 30,
  realtimeAudioSliceSec: 2, // 1s is too short on Android, causes immediate empty completion
  audioSessionOnStartIos: {
    category: 'PlayAndRecord',
    options: ['AllowBluetooth', 'DefaultToSpeaker'],
    mode: 'VoiceChat',   // hardware AEC, same mechanism Siri and FaceTime use
  },
  audioSessionOnStopIos: 'restore',
});

VoiceChat mode is what Siri, FaceTime, and ChatGPT Voice Mode use on iOS. The OS handles the playback reference subtraction before any samples reach whisper.rn's buffer or the AudioRecorder used for interrupt detection. No additional iOS-specific native module is needed.

The interrupt monitor on iOS uses react-native-audio-api's AudioRecorder directly in JavaScript. Since the AVAudioSession is already in VoiceChat mode when Whisper's transcribeRealtime is active, echo cancellation is active when the JS recorder opens the microphone. The same 0.02 threshold and 200ms wall-clock hold apply, identical logic to the Kotlin module applied in JS against an already-cancelled buffer.

The interrupt service and the loop

Both platform paths are wrapped in a single AudioLevelMonitor service. The UI layer calls start(onSpeechDetected) and stop() without knowing which platform it is on:

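// src/services/audioLevelMonitor.ts
import { NativeModules, Platform } from 'react-native';
import { AudioRecorder } from 'react-native-audio-api';

// Shape of the Kotlin module above: start() resolves once, stop() returns nothing.
interface NativeAECModule {
  start(): Promise<null>;
  stop(): void;
}

const SPEECH_THRESHOLD = 0.02;
const HOLD_MS = 200;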
 
class AudioLevelMonitor {
  private recorder: AudioRecorder | null = null;
  private aboveThresholdSince: number | null = null;
  private nativeActive = false;
 
  start(onSpeechDetected: () => void): void {
    this.stop();
    if (Platform.OS === 'android') {
      this.startNative(onSpeechDetected);
    } else {
      this.startJS(onSpeechDetected);
    }
  }
 
  private startNative(onSpeechDetected: () => void): void {
    const mod = NativeModules.AudioAECMonitorModule as NativeAECModule | undefined;
    if (!mod) {
      // Native module absent, fall back to JS monitor (dev builds without native rebuild)
      this.startJS(onSpeechDetected);
      return;
    }
    this.nativeActive = true;
    mod.start().then(() => {
      if (this.nativeActive) { this.nativeActive = false; onSpeechDetected(); }
    }).catch(() => { this.nativeActive = false; });
  }
 
  private startJS(onSpeechDetected: () => void): void {
    this.aboveThresholdSince = null;
    const rec = new AudioRecorder();
    rec.onAudioReady(
      { sampleRate: 16000, bufferLength: 512, channelCount: 1 },
      (event) => {
        const samples = event.buffer.getChannelData(0);
        let sumSq = 0;
        for (let i = 0; i < samples.length; i++) sumSq += samples[i] * samples[i];
        const rms = Math.sqrt(sumSq / samples.length);
        const now = Date.now();
        if (rms > SPEECH_THRESHOLD) {
          if (!this.aboveThresholdSince) this.aboveThresholdSince = now;
          else if (now - this.aboveThresholdSince >= HOLD_MS) {
            this.stopJS(); onSpeechDetected();
          }
        } else { this.aboveThresholdSince = null; }
      },
    );
    rec.start();
    this.recorder = rec;
  }
 
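  stop(): void {
    if (this.nativeActive) {
      this.nativeActive = false;
      (NativeModules.AudioAECMonitorModule as NativeAECModule | undefined)?.stop();
    }
    this.stopJS();
  }

  // stopJS() is called above but missing from the listing; this is an assumed
  // minimal version that stops the recorder and clears the hold state.
  private stopJS(): void {
    this.aboveThresholdSince = null;
    this.recorder?.stop();
    this.recorder = null;
  }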
}
 
export const audioLevelMonitor = new AudioLevelMonitor();

The useAgenticLoop hook wires this into the TTS lifecycle. The monitor starts when isSpeaking becomes true and stops the moment it becomes false. When speech is detected, the hook stops TTS, waits 150ms for the audio system to settle, then opens the mic for the next turn:

useEffect(() => {
  if (!isActive || !isSpeaking) {
    audioLevelMonitor.stop();
    return;
  }
  audioLevelMonitor.start(() => {
    useTTSStore.getState().stop();
    setTimeout(() => {
      if (!stateRef.current.isRecording) tryStartRecording();
    }, 150);
  });
  return () => audioLevelMonitor.stop();
}, [isActive, isSpeaking]);

After TTS finishes normally (not interrupted), an 800ms timer fires before the next recording cycle. This gap clears room reverberation that persists briefly after playback stops. Even with AEC active, the speaker physically rings for a fraction of a second after the last sample plays.

The loop also uses text-based VAD to decide when the user has finished speaking. Whisper fires partial results every 2 seconds (realtimeAudioSliceSec: 2). [BLANK_AUDIO] tokens indicate silence in that slice. After real speech has been detected in the current recording, one blank slice triggers stopRecording(). Two seconds of trailing silence is enough to assume the utterance is complete. If the recording has been all-blank since it started, two blank slices (4 seconds) trigger a discard and restart, preventing the loop from hanging on ambient noise.
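The end-of-utterance rules fit in a few lines. This is a sketch under two assumptions that the article does not confirm: each call receives the text for a single slice, and a silent slice arrives either empty or containing only [BLANK_AUDIO] markers. The class and type names are illustrative.

```typescript
type SliceDecision = 'continue' | 'stop' | 'discard';

// Assumption: silence shows up as an empty string or as "[BLANK_AUDIO]" only.
function isBlankSlice(text: string): boolean {
  return text.replace(/\[BLANK_AUDIO\]/g, '').trim().length === 0;
}

class EndOfUtteranceDetector {
  private heardSpeech = false;
  private leadingBlanks = 0;

  // Call once per 2-second slice.
  onSlice(text: string): SliceDecision {
    if (!isBlankSlice(text)) {
      this.heardSpeech = true;
      return 'continue';
    }
    // One blank slice after real speech: the utterance is complete.
    if (this.heardSpeech) return 'stop';
    // Nothing but silence so far: give up after two blank slices (4 seconds).
    this.leadingBlanks += 1;
    return this.leadingBlanks >= 2 ? 'discard' : 'continue';
  }

  // Call when a new recording starts.
  reset(): void {
    this.heardSpeech = false;
    this.leadingBlanks = 0;
  }
}
```

A 'stop' result would map to stopRecording() and a 'discard' result to dropping the recording and starting a fresh one.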

Piper synthesis latency: xnnpack and warm weights

Interrupt detection is only half the problem. If TTS synthesis is slow, the loop feels unresponsive even when interruptions work correctly. The user interrupts, the mic opens, Whisper transcribes, the LLM responds. Then there is a long silence before the first audio plays.

Piper runs as a VITS model via react-native-sherpa-onnx. The ONNX inference provider matters significantly. Three candidates exist: nnapi, coreml, and xnnpack. NNAPI causes dimension mismatches on VITS TTS models (sherpa-onnx issues #958 and #2443). CoreML triggers shape inference errors with VITS and Kokoro models (#1792). xnnpack is the optimized CPU backend and is stable on both platforms:

function getProviderCandidates(): string[] {
  // nnapi crashes on TTS with dimension mismatches (sherpa-onnx #958, #2443).
  // coreml has shape inference errors with Kokoro/VITS (sherpa-onnx #1792).
  // xnnpack is the optimized CPU backend and is stable on both platforms.
  if (Platform.OS === 'ios') return ['xnnpack', 'cpu'];
  if (Platform.OS === 'android') return ['xnnpack', 'cpu'];
  return ['cpu'];
}
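The article does not show how the candidate list is consumed. One plausible way, sketched here, is to try each provider in order and keep the first one that initializes. The init callback is a stand-in for whatever creates the TTS engine; it is not a react-native-sherpa-onnx function.

```typescript
// `init` is hypothetical: it represents engine creation for a given provider.
async function initWithFallback<T>(
  candidates: string[],
  init: (provider: string) => Promise<T>,
): Promise<{ provider: string; engine: T }> {
  let lastError: unknown = new Error('no provider candidates given');
  for (const provider of candidates) {
    try {
      return { provider, engine: await init(provider) };
    } catch (err) {
      lastError = err; // remember the failure and try the next candidate
    }
  }
  throw lastError;
}
```

With ['xnnpack', 'cpu'], a device where xnnpack fails to initialize would still end up with a working engine on the plain CPU provider.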

After the engine initializes, a warmup synthesis runs immediately before any user-facing speak() call:

private async _warmUpStreamingTts(): Promise<void> {
  const tts = this._streamTts;
  if (!tts) return;
 
  await new Promise<void>((resolve, reject) => {
    tts.generateSpeechStream(
      'Hello, how are you today?',
      { sid: this._getSidForVoice(), speed: 1.0 },
      { onEnd: () => resolve(), onError: event => reject(new Error(event.message)) },
    ).catch(reject);
  });
}

This ensures the ONNX model weights are paged into memory and xnnpack's internal buffers are allocated before the first real speak() call arrives. On 2022 midrange Android hardware (Cortex-A55/A78 class SoCs), first-audio latency after warmup is near zero. The streaming VITS decoder starts producing samples before synthesis is complete, so the model emits audio chunks almost immediately after generateSpeechStream is called, and the first chunk is scheduled at audioCtx.currentTime for immediate playback.

The AudioContext must be pinned to the model's native sample rate (22,050 Hz for the Piper voice used here):

if (!this._audioCtx || this._audioCtx.state === 'closed') {
  this._audioCtx = new AudioContext({ sampleRate });
}

react-native-audio-api on Android does not resample: it plays buffers against the native audio rate and ignores any mismatch between the AudioContext sample rate and the buffer sample rate. A 22,050 Hz buffer in a 44,100 Hz context plays at 2x speed. Pinning the context to sampleRate (from tts.getSampleRate() after initialization) prevents this. The value is cached in _sampleRate so no native round-trip is needed on each speak() call.
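The arithmetic behind the 2x figure is short enough to write down. The helper names below are illustrative.

```typescript
// Speed at which a buffer plays when its samples are consumed at the
// context's rate without resampling.
function playbackSpeed(bufferRate: number, contextRate: number): number {
  return contextRate / bufferRate;
}

// How long a buffer of `sampleCount` samples lasts at the context's rate.
function playedDurationSec(sampleCount: number, contextRate: number): number {
  return sampleCount / contextRate;
}

const mismatched = playbackSpeed(22050, 44100);          // 2
const matched = playbackSpeed(22050, 22050);             // 1
const oneSecondClip = playedDurationSec(22050, 44100);   // 0.5
```

One second of Piper audio (22,050 samples) lasts half a second in a 44,100 Hz context, which is the doubled speed and raised pitch described above.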

Short consecutive sentences are merged before synthesis. Sentences shorter than 70 characters get merged into the next one, replacing sentence-final punctuation with a comma at the join point:

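const MIN_MERGE_LEN = 70;
const mergedSentences: string[] = [];
let mergeBuf = '';
for (const s of rawSentences) {
  // Join onto the pending text, turning its final punctuation into a comma.
  mergeBuf = mergeBuf ? `${mergeBuf.replace(/[;!?,.]+$/, ',')} ${s}` : s;
  if (mergeBuf.length >= MIN_MERGE_LEN) {
    mergedSentences.push(mergeBuf);
    mergeBuf = '';
  }
}
// Flush the remainder so a short final sentence is not dropped.
if (mergeBuf) mergedSentences.push(mergeBuf);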

Merging keeps each unit long enough that its audio is still playing while the next unit is synthesised, which keeps AudioContext scheduling gap-free. When sentences are very short, synthesis time exceeds playback duration and the scheduler has to wait. Merging avoids that.

Kokoro: forward() per sentence and the drain queue

Kokoro uses react-native-executorch's ExecuTorch runtime rather than ONNX, and runs at 24,000 Hz. The useTextToSpeech hook exposes two synthesis modes: stream() (an async iterator over audio chunks) and forward() (synthesises one text input, returns a Float32Array). stream() looks like the natural choice for sentence-by-sentence synthesis, but it has a critical bug on Android.

After calling streamStop(true) to cancel an in-flight stream, the native code kills the generation thread but never signals the JS async iterator as done. The for-await loop hangs indefinitely, leaving isGenerating pinned to true. Any subsequent speak() call that checks isGenerating before issuing the next forward() either blocks forever or throws "currently generating".

forward() does not have this problem. It resets isGenerating in its own finally block regardless of how it exits. The bridge uses forward() one sentence at a time, with a poll loop to wait for any in-flight call to complete before issuing the next:

// forward()'s finally calls setIsGenerating(false) as an async React state update,
// so the re-render may not be visible yet on the next event-loop tick.
if (ttsRef.current.isGenerating) {
  const settle = Date.now() + 500;
  while (ttsRef.current.isGenerating && Date.now() < settle) {
    await new Promise<void>(r => setTimeout(r, 10));
  }
}
const raw = await ttsRef.current.forward({ text: sentence, speed });

The gap between sentences is eliminated by overlapping synthesis with playback. Completed sentence chunks are pushed onto a queue; a drainQueue loop pulls from the queue and calls playSamples() concurrently with the synthesis loop, so sentence N+1 is being generated while sentence N plays:

const drainQueue = async (): Promise<void> => {
  while (playQueueRef.current.length > 0 && isMine() && !isStoppedRef.current) {
    const samples = playQueueRef.current.shift()!;
    await playSamples(ctx, samples);
  }
};
 
// After each forward() returns:
playQueueRef.current.push(samples);
if (!drainHolder.p) {
  drainHolder.p = drainQueue().finally(() => { drainHolder.p = null; });
}

On 2022 midrange hardware, Kokoro synthesis takes 2-3 seconds per sentence. A five-sentence response plays without gaps because the synthesis time for each sentence roughly matches the playback duration of the previous one; the drain queue stays one sentence ahead.
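The overlap is easier to see in a self-contained simulation. In this sketch fakeSynthesize and fakePlay stand in for forward() and playSamples(), and the 20ms and 50ms delays are arbitrary. Synthesis is faster than playback here, whereas the real timings are roughly equal. The queue and the single in-flight drain promise follow the same pattern as the snippet above.

```typescript
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Stand-ins for forward() and playSamples(); the delays are arbitrary.
async function fakeSynthesize(i: number, log: string[]): Promise<number> {
  log.push(`synthStart${i}`);
  await sleep(20);
  return i;
}

async function fakePlay(i: number, log: string[]): Promise<void> {
  log.push(`playStart${i}`);
  await sleep(50);
  log.push(`playEnd${i}`);
}

async function speakAll(count: number): Promise<string[]> {
  const log: string[] = [];
  const queue: number[] = [];
  const holder: { p: Promise<void> | null } = { p: null };

  const drain = async (): Promise<void> => {
    try {
      while (queue.length > 0) {
        await fakePlay(queue.shift() as number, log);
      }
    } finally {
      holder.p = null; // allow the next push to start a fresh drain
    }
  };

  for (let i = 0; i < count; i++) {
    queue.push(await fakeSynthesize(i, log)); // synthesis continues during playback
    if (holder.p === null) holder.p = drain();
  }
  while (holder.p !== null) await holder.p; // wait for the last item to finish
  return log;
}
```

Running speakAll(3) and reading the log shows synthStart1 recorded before playEnd0: the second sentence is already being generated while the first is still playing, and playback order stays 0, 1, 2.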

ExecuTorch native callbacks return JSI host objects that look like Float32Array but lack a real ArrayBuffer. Calling copyToChannel on them throws "getPropertyAsObject: property 'buffer' is undefined" on Android. A copy into a standard JS-heap Float32Array is required before scheduling audio:

function toJsFloat32Array(src: Float32Array): Float32Array {
  const dst = new Float32Array(src.length);
  dst.set(src);
  return dst;
}
 
const samples = trimTrailingSilence(toJsFloat32Array(raw as unknown as Float32Array));

trimTrailingSilence walks backward from the end of each chunk and removes trailing samples below 0.001 amplitude, provided at least 1,200 samples (50ms at 24,000 Hz) would be removed. ExecuTorch's Kokoro implementation pads the tail of each forward() result with silence, and without trimming, inter-sentence gaps accumulate audibly across longer responses.
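The app's trimTrailingSilence is not shown. A minimal version consistent with that description could look like the sketch below; the parameter names and defaults are taken from the numbers in the text, and the rest is an assumption.

```typescript
// Assumed implementation: drop the silent tail only if it is long enough
// to be worth removing. Returns a view onto the same memory, not a copy.
function trimTrailingSilence(
  samples: Float32Array,
  silenceAmp: number = 0.001,
  minTrimSamples: number = 1200,
): Float32Array {
  let end = samples.length;
  while (end > 0 && Math.abs(samples[end - 1]) < silenceAmp) end--;
  const trimmed = samples.length - end;
  return trimmed >= minTrimSamples ? samples.subarray(0, end) : samples;
}

// Helper for the example: `loud` samples at 0.5 followed by `silent` zeros.
function makeChunk(loud: number, silent: number): Float32Array {
  const out = new Float32Array(loud + silent);
  for (let i = 0; i < loud; i++) out[i] = 0.5;
  return out;
}

const longTail = trimTrailingSilence(makeChunk(1000, 2000)).length; // 1000
const shortTail = trimTrailingSilence(makeChunk(1000, 500)).length; // 1500
```

The minimum-trim rule leaves short natural pauses alone and only cuts tails of 50ms or more.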

source.onEnded does not fire reliably on Android. playSamples uses a 300ms grace timeout as a fallback so the drain loop always makes progress:

const durationMs = (samples.length / 24000) * 1000;
const timeout = setTimeout(() => { cleanup(); resolve(); }, durationMs + 300);
source.onEnded = () => { clearTimeout(timeout); cleanup(); resolve(); };

The AudioContext is created at 24,000 Hz to match Kokoro's output rate, for the same reason as Piper: mismatched rates on Android produce incorrect playback speed without an error.

What to keep after the fix

After the AEC changes landed, several workarounds became dead weight. The adaptive threshold logic that tracked background noise levels was removed. The 500ms cooldown after TTS starts (meant to skip initial speaker bleed) was removed. The gain normalisation pass that compensated for mic-to-speaker distance variation on different devices was removed. All of those were patches on top of the wrong audio source.

Three things remain: the 0.02 threshold (now serving as a floor for genuine ambient noise rather than a ceiling below speaker bleed), the 200ms hold (rejecting keyboard clicks and notification sounds), and the 800ms post-TTS delay before the next recording cycle (clearing room reverberation that persists briefly after playback stops).

The conversational loop ran cleanly after all of these changes. The assistant speaks, the user can interrupt mid-sentence, playback stops once 200ms of sustained speech is detected, and the mic opens for the next turn 150ms later. Piper responds in near-zero time after the warmup fires. Kokoro's first sentence takes 2-3 seconds on 2022 midrange hardware, but the drain queue keeps subsequent sentences gap-free. Testing covered a Nothing Phone 1 running Android 15, a Pixel 6, and a mid-range Samsung.

frequently asked questions

why does rms amplitude detection trigger false interrupts during tts playback on android?

when you record with AudioSource.MIC or AudioSource.DEFAULT, the audio buffer includes everything the microphone transducer picks up, including sound from the device's own speaker. during tts playback the speaker emits audio that reaches the microphone at -20 to -30 dB relative to the original signal, which is still above any practical interrupt threshold. the microphone has no way to distinguish your voice from the device's own output. you need hardware-level cancellation, not a smarter threshold.

what does AudioSource.VOICE_COMMUNICATION do differently from AudioSource.MIC?

AudioSource.VOICE_COMMUNICATION (value 7 in MediaRecorder.AudioSource) selects the capture path tuned for two-way voice, which applies the device's Acoustic Echo Cancellation, where available, before the audio stream reaches your app. the AEC maintains a reference signal of what the speaker is currently playing and subtracts it from the microphone input. what your AudioRecord buffer receives is microphone audio with the speaker echo removed. your voice is preserved. the tts playback audio is not. this runs in the platform audio stack rather than in your app code, so it adds little or no load to your app's threads and keeps working during heavy on-device inference.

how does ios handle echo cancellation for voice interrupts?

on iOS, whisper.rn's transcribeRealtime accepts an audioSessionOnStartIos configuration. setting mode to VoiceChat instructs AVAudioSession to apply hardware AEC before samples reach the microphone buffer. since whisper already has the AVAudioSession configured for PlayAndRecord, the same session controls both playback and capture. VoiceChat mode causes the OS to subtract the playback reference from mic input before Whisper or the JS-side AudioRecorder ever sees the samples.

do you still need a threshold check after switching to VOICE_COMMUNICATION?

yes, but the threshold serves a different purpose. hardware aec removes speaker bleed, not ambient room noise. a threshold check after switching answers 'is the user actually speaking' rather than 'is something making noise'. values around 0.02 rms work well in quiet environments. add a wall-clock hold requiring the signal to stay above threshold for at least 200ms to reject brief spikes like keyboard clicks or notification sounds.

why can't you use nnapi or coreml as the onnx provider for piper?

nnapi causes dimension mismatches on VITS TTS models (sherpa-onnx issues #958 and #2443). coreml triggers shape inference errors with VITS and Kokoro models (#1792). xnnpack is the optimized CPU backend and is stable on both platforms. with 6 threads and a warmup synthesis at initialization, it delivers near-zero first-audio latency on 2022 midrange hardware.

why use forward() instead of stream() for kokoro on android?

react-native-executorch's stream() hangs indefinitely on Android after streamStop(true). the native code kills the generation thread but never signals the JS async iterator as done, leaving isGenerating pinned to true forever and blocking any subsequent speak() call. forward() resets isGenerating in its own finally block, so cancellation is safe: poll isGenerating until the in-flight sentence finishes, then start the next forward() fresh.

more writing

writing

mobile / May 8, 2026 / 14 min

react native tts false interrupts and synthesis latency

rms fires on speaker output. VOICE_COMMUNICATION fixes android; whisper VoiceChat handles ios. piper: xnnpack + warmup. kokoro: forward() not stream().

The root cause was one line. The fix was one line. Finding the root cause took most of a day. Making synthesis fast enough that the loop actually felt responsive took longer.

The conversational mode setup

The setup was a continuous loop:

Microphone recording starts
User speaks, Whisper (via whisper.rn running ggml models on-device) transcribes
LLM generates a response
TTS synthesises and plays the response sentence by sentence
While TTS plays, microphone monitors for user interruption
If interruption detected, stop TTS, go to step 2

Why rms detection fails during tts

RMS (root mean square) amplitude is the standard first approach for voice activity detection:

var sumSq = 0.0
for (i in 0 until read) sumSq += (buf[i] / 32768.0) * (buf[i] / 32768.0)
val rms = sqrt(sumSq / read).toFloat()

The fix: AudioSource.VOICE_COMMUNICATION

What your AudioRecord buffer receives after switching: microphone audio with the speaker echo removed. The user's voice is present. The TTS playback audio is not.

// Before: raw mic, picks up speaker audio during TTS
val audioRecord = AudioRecord(
    MediaRecorder.AudioSource.MIC,
    SAMPLE_RATE,
    AudioFormat.CHANNEL_IN_MONO,
    AudioFormat.ENCODING_PCM_16BIT,
    bufferSize
)
 
// After: hardware AEC enabled, speaker audio cancelled before buffer delivery
val audioRecord = AudioRecord(
    MediaRecorder.AudioSource.VOICE_COMMUNICATION,
    SAMPLE_RATE,
    AudioFormat.CHANNEL_IN_MONO,
    AudioFormat.ENCODING_PCM_16BIT,
    bufferSize
)

That one-line change to the audio source constant eliminated every false interrupt in testing.

The android native module

class AudioAECMonitorModule(
    private val reactContext: ReactApplicationContext,
) : ReactContextBaseJavaModule(reactContext) {
 
    companion object {
        private const val NAME = "AudioAECMonitorModule"
        private const val SAMPLE_RATE = 16_000
        private const val CHANNEL = AudioFormat.CHANNEL_IN_MONO
        private const val ENCODING = AudioFormat.ENCODING_PCM_16BIT
        private const val SPEECH_THRESHOLD = 0.02f
        private const val HOLD_MS = 200L
    }
 
    @Volatile private var recorder: AudioRecord? = null
    @Volatile private var aec: AcousticEchoCanceler? = null
    @Volatile private var ns: NoiseSuppressor? = null
    @Volatile private var running = false
    private var monitorThread: Thread? = null
 
    override fun getName(): String = NAME
 
    @ReactMethod
    fun start(promise: Promise) {
        stopInternal()
        try {
            val minBuf = AudioRecord.getMinBufferSize(SAMPLE_RATE, CHANNEL, ENCODING)
            if (minBuf == AudioRecord.ERROR_BAD_VALUE || minBuf == AudioRecord.ERROR) {
                promise.reject("AEC_INIT", "AudioRecord not supported on this device")
                return
            }
            val bufSize = minBuf.coerceAtLeast(4096)
 
            val rec = AudioRecord(
                MediaRecorder.AudioSource.VOICE_COMMUNICATION,
                SAMPLE_RATE, CHANNEL, ENCODING, bufSize,
            )
            if (rec.state != AudioRecord.STATE_INITIALIZED) {
                rec.release()
                promise.reject("AEC_INIT", "AudioRecord failed to initialize")
                return
            }
 
            // Hardware AEC removes speaker echo from the mic signal.
            if (AcousticEchoCanceler.isAvailable()) {
                val canceler = AcousticEchoCanceler.create(rec.audioSessionId)
                if (canceler != null) { canceler.enabled = true; aec = canceler }
            }
 
            // Noise suppressor removes steady-state ambient noise (HVAC, fans).
            if (NoiseSuppressor.isAvailable()) {
                val suppressor = NoiseSuppressor.create(rec.audioSessionId)
                if (suppressor != null) { suppressor.enabled = true; ns = suppressor }
            }
 
            recorder = rec
            running = true
            rec.startRecording()
 
            monitorThread = Thread {
                val buf = ShortArray(bufSize / 2)
                var aboveThresholdSince = -1L
 
                while (running) {
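                    val read = rec.read(buf, 0, buf.size)
                    // A negative result is an error code (for example, the recorder was
                    // released by stop()); exit instead of spinning on a dead recorder.
                    if (read < 0) break
                    if (read == 0) continue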
 
                    var sumSq = 0.0
                    for (i in 0 until read) sumSq += (buf[i] / 32768.0) * (buf[i] / 32768.0)
                    val rms = sqrt(sumSq / read).toFloat()
 
                    val now = System.currentTimeMillis()
                    if (rms > SPEECH_THRESHOLD) {
                        if (aboveThresholdSince < 0) aboveThresholdSince = now
                        else if (now - aboveThresholdSince >= HOLD_MS) {
                            stopInternal()
                            promise.resolve(null)
                            return@Thread
                        }
                    } else {
                        aboveThresholdSince = -1L
                    }
                }
            }.also { it.isDaemon = true; it.start() }
 
        } catch (e: SecurityException) {
            promise.reject("AEC_PERMISSION", "Microphone permission denied")
        } catch (e: Exception) {
            promise.reject("AEC_ERROR", e.message ?: "Unknown error")
        }
    }
 
    @ReactMethod
    fun stop() { stopInternal() }
 
    private fun stopInternal() {
        running = false
        monitorThread?.interrupt()
        monitorThread = null
        aec?.release(); aec = null
        ns?.release(); ns = null
        try { recorder?.stop() } catch (_: Exception) {}
        recorder?.release(); recorder = null
    }
 
    override fun invalidate() { stopInternal(); super.invalidate() }
}

class AudioAECMonitorPackage : ReactPackage {
    override fun createNativeModules(
        reactContext: ReactApplicationContext
    ): List<NativeModule> = listOf(AudioAECMonitorModule(reactContext))
 
    override fun createViewManagers(
        reactContext: ReactApplicationContext
    ): List<ViewManager<*, *>> = emptyList()
}

// MainApplication.kt
override val reactHost: ReactHost by lazy {
    getDefaultReactHost(
        context = applicationContext,
        packageList = PackageList(this).packages.apply {
            add(NativeTTSPackage())
            add(AudioAECMonitorPackage())
        },
    )
}

iOS: aec via whisper voicechat

const { stop, subscribe } = await whisperContext.transcribeRealtime({
  language: 'en',
  realtimeAudioSec: 30,
  realtimeAudioSliceSec: 2, // 1s is too short on Android, causes immediate empty completion
  audioSessionOnStartIos: {
    category: 'PlayAndRecord',
    options: ['AllowBluetooth', 'DefaultToSpeaker'],
    mode: 'VoiceChat',   // hardware AEC, same mechanism Siri and FaceTime use
  },
  audioSessionOnStopIos: 'restore',
});

The interrupt service and the loop

Both platform paths are wrapped in a single AudioLevelMonitor service. The UI layer calls start(onSpeechDetected) and stop() without knowing which platform it is on:

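// src/services/audioLevelMonitor.ts
import { NativeModules, Platform } from 'react-native';
// Assumption: AudioRecorder comes from react-native-audio-api; adjust to the recorder library in use.
import { AudioRecorder } from 'react-native-audio-api';
 
// Shape of the Kotlin module: start() resolves once speech is detected.
interface NativeAECModule {
  start(): Promise<void>;
  stop(): void;
}
 
const SPEECH_THRESHOLD = 0.02;
const HOLD_MS = 200;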
 
class AudioLevelMonitor {
  private recorder: AudioRecorder | null = null;
  private aboveThresholdSince: number | null = null;
  private nativeActive = false;
 
  start(onSpeechDetected: () => void): void {
    this.stop();
    if (Platform.OS === 'android') {
      this.startNative(onSpeechDetected);
    } else {
      this.startJS(onSpeechDetected);
    }
  }
 
  private startNative(onSpeechDetected: () => void): void {
    const mod = NativeModules.AudioAECMonitorModule as NativeAECModule | undefined;
    if (!mod) {
      // Native module absent, fall back to JS monitor (dev builds without native rebuild)
      this.startJS(onSpeechDetected);
      return;
    }
    this.nativeActive = true;
    mod.start().then(() => {
      if (this.nativeActive) { this.nativeActive = false; onSpeechDetected(); }
    }).catch(() => { this.nativeActive = false; });
  }
 
  private startJS(onSpeechDetected: () => void): void {
    this.aboveThresholdSince = null;
    const rec = new AudioRecorder();
    rec.onAudioReady(
      { sampleRate: 16000, bufferLength: 512, channelCount: 1 },
      (event) => {
        const samples = event.buffer.getChannelData(0);
        let sumSq = 0;
        for (let i = 0; i < samples.length; i++) sumSq += samples[i] * samples[i];
        const rms = Math.sqrt(sumSq / samples.length);
        const now = Date.now();
        if (rms > SPEECH_THRESHOLD) {
          if (!this.aboveThresholdSince) this.aboveThresholdSince = now;
          else if (now - this.aboveThresholdSince >= HOLD_MS) {
            this.stopJS(); onSpeechDetected();
          }
        } else { this.aboveThresholdSince = null; }
      },
    );
    rec.start();
    this.recorder = rec;
  }
 
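  stop(): void {
    if (this.nativeActive) {
      this.nativeActive = false;
      (NativeModules.AudioAECMonitorModule as NativeAECModule | undefined)?.stop();
    }
    this.stopJS();
  }
 
  // Tears down the JS recorder; assumes AudioRecorder exposes stop().
  private stopJS(): void {
    this.aboveThresholdSince = null;
    this.recorder?.stop();
    this.recorder = null;
  }
}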
 
export const audioLevelMonitor = new AudioLevelMonitor();

useEffect(() => {
  if (!isActive || !isSpeaking) {
    audioLevelMonitor.stop();
    return;
  }
  audioLevelMonitor.start(() => {
    useTTSStore.getState().stop();
    setTimeout(() => {
      if (!stateRef.current.isRecording) tryStartRecording();
    }, 150);
  });
  return () => audioLevelMonitor.stop();
}, [isActive, isSpeaking]);
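
The threshold-plus-hold logic shared by the Kotlin and JS paths can be isolated as a pure function, which makes it testable without a microphone. This is a minimal sketch, not the app's code: `createHoldDetector` and its `feed` method are hypothetical names, and timestamps are passed in explicitly rather than read from Date.now().

```typescript
// Hypothetical helper: pure threshold-plus-hold detector, no audio APIs involved.
const SPEECH_THRESHOLD = 0.02;
const HOLD_MS = 200;

// Root-mean-square level of one buffer of normalized samples (-1..1).
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSq = 0;
  for (let i = 0; i < samples.length; i++) sumSq += samples[i] * samples[i];
  return Math.sqrt(sumSq / samples.length);
}

function createHoldDetector(threshold = SPEECH_THRESHOLD, holdMs = HOLD_MS) {
  let aboveSince: number | null = null;
  return {
    // Returns true once the level has stayed above the threshold for holdMs.
    feed(samples: Float32Array, nowMs: number): boolean {
      if (rms(samples) <= threshold) {
        aboveSince = null; // any quiet buffer resets the hold window
        return false;
      }
      if (aboveSince === null) {
        aboveSince = nowMs; // first loud buffer only opens the window
        return false;
      }
      return nowMs - aboveSince >= holdMs;
    },
  };
}
```

The hold window is what filters out short transients such as a door slam or a click: a single loud buffer never fires, and one quiet buffer in between restarts the count.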

Piper synthesis latency: xnnpack and warm weights

function getProviderCandidates(): string[] {
  // nnapi crashes on TTS with dimension mismatches (sherpa-onnx #958, #2443).
  // coreml has shape inference errors with Kokoro/VITS (sherpa-onnx #1792).
  // xnnpack is the optimized CPU backend and is stable on both platforms.
  if (Platform.OS === 'ios') return ['xnnpack', 'cpu'];
  if (Platform.OS === 'android') return ['xnnpack', 'cpu'];
  return ['cpu'];
}
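
A candidate list like this implies a try-in-order loop at engine creation. The sketch below shows one way that could look. `initWithFallback` and the `createEngine` callback are hypothetical; the callback stands in for whatever react-native-sherpa-onnx initialization call the app wraps.

```typescript
// Hypothetical: try each provider in order and keep the first engine that initializes.
async function initWithFallback<T>(
  candidates: string[],
  createEngine: (provider: string) => Promise<T>, // stand-in for the real engine init call
): Promise<{ engine: T; provider: string }> {
  const failures: string[] = [];
  for (const provider of candidates) {
    try {
      const engine = await createEngine(provider);
      return { engine, provider };
    } catch (err) {
      // Record the failure and fall through to the next candidate.
      failures.push(`${provider}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`No provider initialized (${failures.join('; ')})`);
}
```

Returning the provider that actually succeeded makes it easy to log whether a device ended up on xnnpack or plain cpu.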

After the engine initializes, a warmup synthesis runs immediately, before any user-facing speak() call:

private async _warmUpStreamingTts(): Promise<void> {
  const tts = this._streamTts;
  if (!tts) return;
 
  await new Promise<void>((resolve, reject) => {
    tts.generateSpeechStream(
      'Hello, how are you today?',
      { sid: this._getSidForVoice(), speed: 1.0 },
      { onEnd: () => resolve(), onError: event => reject(new Error(event.message)) },
    ).catch(reject);
  });
}

The AudioContext must be pinned to the model's native sample rate (22,050 Hz for Piper):

if (!this._audioCtx || this._audioCtx.state === 'closed') {
  this._audioCtx = new AudioContext({ sampleRate });
}

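Short consecutive sentences are merged before synthesis. Sentences accumulate until the merged text reaches at least 70 characters, and sentence-final punctuation is replaced with a comma at each join point. Any remainder shorter than 70 characters is flushed at the end so the final sentence is not dropped:

const MIN_MERGE_LEN = 70;
const mergedSentences: string[] = [];
let mergeBuf = '';
for (const s of rawSentences) {
  // Join onto the pending buffer, turning its trailing punctuation into a comma.
  mergeBuf = mergeBuf ? `${mergeBuf.replace(/[;!?,.]+$/, ',')} ${s}` : s;
  if (mergeBuf.length >= MIN_MERGE_LEN) {
    mergedSentences.push(mergeBuf);
    mergeBuf = '';
  }
}
// Flush whatever is left so a short final sentence is still spoken.
if (mergeBuf) mergedSentences.push(mergeBuf);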

Kokoro: forward() per sentence and the drain queue

// forward()'s finally calls setIsGenerating(false) as an async React state update,
// so the re-render may not be visible yet on the next event-loop tick.
if (ttsRef.current.isGenerating) {
  const settle = Date.now() + 500;
  while (ttsRef.current.isGenerating && Date.now() < settle) {
    await new Promise<void>(r => setTimeout(r, 10));
  }
}
const raw = await ttsRef.current.forward({ text: sentence, speed });

const drainQueue = async (): Promise<void> => {
  while (playQueueRef.current.length > 0 && isMine() && !isStoppedRef.current) {
    const samples = playQueueRef.current.shift()!;
    await playSamples(ctx, samples);
  }
};
 
// After each forward() returns:
playQueueRef.current.push(samples);
if (!drainHolder.p) {
  drainHolder.p = drainQueue().finally(() => { drainHolder.p = null; });
}

function toJsFloat32Array(src: Float32Array): Float32Array {
  const dst = new Float32Array(src.length);
  dst.set(src);
  return dst;
}
 
const samples = trimTrailingSilence(toJsFloat32Array(raw as unknown as Float32Array));
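
trimTrailingSilence is not shown in this section. As a rough sketch of what such a helper might do, the following drops near-silent samples from the end of a buffer. The name `trimTrailingSilenceSketch`, the 0.001 amplitude floor, and the 50 ms kept tail are all assumptions, not the app's values.

```typescript
// Hypothetical stand-in for trimTrailingSilence: drop near-silent samples at the end,
// keeping a short tail so the last phoneme is not clipped.
function trimTrailingSilenceSketch(
  samples: Float32Array,
  sampleRate = 24000, // Kokoro output rate, as stated in the article
  floor = 0.001,      // assumed amplitude below which a sample counts as silence
  keepTailMs = 50,    // assumed padding kept after the last audible sample
): Float32Array {
  let last = samples.length - 1;
  while (last >= 0 && Math.abs(samples[last]) < floor) last--;
  if (last < 0) return new Float32Array(0); // the whole buffer was silence
  const tail = Math.round((keepTailMs / 1000) * sampleRate);
  const end = Math.min(samples.length, last + 1 + tail);
  return samples.slice(0, end); // slice copies, so the result owns its own buffer
}
```

Trimming matters for a per-sentence pipeline because any silence left at the end of each buffer adds up as an audible gap between sentences.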

source.onEnded does not fire reliably on Android. playSamples uses a 300ms grace timeout as a fallback so the drain loop always makes progress:

const durationMs = (samples.length / 24000) * 1000;
const timeout = setTimeout(() => { cleanup(); resolve(); }, durationMs + 300);
source.onEnded = () => { clearTimeout(timeout); cleanup(); resolve(); };
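
Those three lines sit inside a Promise. A self-contained sketch of such a wrapper follows. `EndableSource` is a hypothetical minimal interface standing in for the real audio source node; only `onEnded` and `start()` are assumed, and `playWithGraceTimeout` is an illustrative name rather than the app's playSamples.

```typescript
// Hypothetical minimal shape of an audio source; the real node has more members.
interface EndableSource {
  onEnded: (() => void) | null;
  start(): void;
}

// Resolves when playback ends, or after the expected duration plus a grace period
// if the end callback never fires.
function playWithGraceTimeout(
  source: EndableSource,
  sampleCount: number,
  sampleRate: number,
  graceMs = 300,
): Promise<'ended' | 'timeout'> {
  return new Promise(resolve => {
    let settled = false;
    const durationMs = (sampleCount / sampleRate) * 1000;
    const finish = (reason: 'ended' | 'timeout') => {
      if (settled) return; // guard against both paths firing
      settled = true;
      clearTimeout(timer);
      source.onEnded = null;
      resolve(reason);
    };
    const timer = setTimeout(() => finish('timeout'), durationMs + graceMs);
    source.onEnded = () => finish('ended');
    source.start();
  });
}
```

Returning the reason is optional, but it makes it possible to count how often the fallback path is taken on a given device.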

The AudioContext is created at 24,000 Hz to match Kokoro's output rate, for the same reason as Piper: mismatched rates on Android produce incorrect playback speed without an error.
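
To make the failure mode concrete: if samples are played at the context's rate with no resampling, playback speed scales by contextRate / modelRate. A small worked example follows. The function name is hypothetical, and the "no resampling" behavior is an assumption about the playback path, not something verified here.

```typescript
// Hypothetical helper: playback speed and duration when samples produced at modelRate
// are played as-is by a context running at contextRate (no resampling assumed).
function mismatchEffect(sampleCount: number, modelRate: number, contextRate: number) {
  const intendedMs = (sampleCount / modelRate) * 1000;
  const actualMs = (sampleCount / contextRate) * 1000;
  return { speedFactor: contextRate / modelRate, intendedMs, actualMs };
}

// One second of Kokoro audio (24,000 samples) in a 48,000 Hz context:
const kokoroAt48k = mismatchEffect(24000, 24000, 48000);
// speedFactor 2, intendedMs 1000, actualMs 500

// One second of Piper audio (22,050 samples) in a 44,100 Hz context:
const piperAt44k = mismatchEffect(22050, 22050, 44100);
// speedFactor 2, intendedMs 1000, actualMs 500
```

Under that assumption the audio plays at double speed and an octave higher, which is why the symptom is easy to hear but produces no error.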

What to keep after the fix

frequently asked questions

why does rms amplitude detection trigger false interrupts during tts playback on android?

what does AudioSource.VOICE_COMMUNICATION do differently from AudioSource.MIC?

how does ios handle echo cancellation for voice interrupts?

do you still need a threshold check after switching to VOICE_COMMUNICATION?

why can't you use nnapi or coreml as the onnx provider for piper?

why use forward() instead of stream() for kokoro on android?

ABHK®
