mobile / May 8, 2026 / 8 min
rms amplitude detection fires on the device's own tts audio. switching to AudioSource.VOICE_COMMUNICATION enables hardware aec and gives you clean mic input.
I was building a hands-free conversational mode for an offline React Native AI app. It works without internet, runs the model on-device, and speaks responses aloud while staying ready to listen for the next input. The interaction loop I was going for was similar to Gemini Live or ChatGPT voice mode: the assistant speaks, the user can interrupt mid-sentence, the assistant stops and processes the new input immediately.
The last piece was interruption detection. I implemented it with RMS amplitude monitoring at 16kHz from the device microphone. I set a threshold of 0.02. It worked on the first manual test. It failed on every run after that.
The assistant was interrupting itself. Every TTS sentence crossed the amplitude threshold, fired the interrupt handler, stopped playback, and started waiting for user input. The conversational loop would spin on interrupted half-sentences until I manually stopped it.
The root cause was one line. The fix was one line. Finding the root cause took most of a day.
The setup was a continuous loop:
Step 5 is where the problem was. The microphone monitoring runs concurrently with TTS playback. Android lets an app record with AudioRecord while audio is playing; unlike iOS, there is no PlayAndRecord session category to configure first. The recording was running. The TTS was playing. The microphone was picking up the TTS.
RMS (root mean square) amplitude is the standard first approach for voice activity detection:

    private fun calculateRms(buffer: ShortArray, readSize: Int): Float {
        var sum = 0.0
        for (i in 0 until readSize) {
            // Normalise each 16-bit sample to the range -1.0..1.0
            val sample = buffer[i] / 32768.0
            sum += sample * sample
        }
        return sqrt(sum / readSize).toFloat()
    }

When the audio source is MediaRecorder.AudioSource.MIC, the buffer contains everything the microphone transducer picks up. On a phone held in hand or sitting on a desk, speaker audio reaches the microphone roughly 20 to 30 dB below its playback level. At normal TTS playback volume, that is still well above a 0.02 RMS threshold.
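To put the 0.02 threshold on the same scale as the bleed estimate, it helps to convert between linear RMS and decibels relative to full scale (dBFS). The sketch below is illustrative only: `rmsToDbfs` and `dbfsToRms` are hypothetical helpers, not part of the module, and reading the 20 to 30 dB figure as dBFS assumes playback near full scale, which is a simplification because speaker loudness and microphone gain vary by device.

```kotlin
import kotlin.math.log10
import kotlin.math.pow

// Hypothetical helper: linear RMS (1.0 = full scale for 16-bit PCM) to dBFS.
fun rmsToDbfs(rms: Double): Double = 20.0 * log10(rms)

// Hypothetical helper: dBFS back to linear RMS.
fun dbfsToRms(dbfs: Double): Double = 10.0.pow(dbfs / 20.0)

fun main() {
    // The 0.02 threshold from the text sits at roughly -34 dBFS.
    println("threshold 0.02 in dBFS: ${rmsToDbfs(0.02)}")

    // Assumed bleed levels of -20 and -30 dBFS, expressed as linear RMS.
    // Both come out above 0.02, so both would trip the detector.
    for (bleedDb in listOf(-20.0, -30.0)) {
        val linear = dbfsToRms(bleedDb)
        println("bleed at $bleedDb dBFS -> rms $linear, above threshold: ${linear > 0.02}")
    }
}
```

Under that assumption, even the quieter end of the bleed range sits a few dB above the threshold, which is why nudging the threshold upward does not help.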
The microphone cannot distinguish your voice from the device's own speaker output. They are both pressure waves at the transducer. You can raise the threshold, but then you miss soft speech. You can add a cooldown after TTS starts, but then you miss early interruptions. There is no threshold value that solves this without hardware help.
The same failure mode affects WebRTC VAD, silero-vad, or any other voice activity detector running on raw microphone input during audio playback. TTS output is speech, so even a model-based detector will flag it. The audio source is the problem, not the detection algorithm.
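Android's MediaRecorder.AudioSource.VOICE_COMMUNICATION (integer value 7) is the microphone source tuned for voice calls. On devices that provide echo cancellation, it routes the microphone input through the Acoustic Echo Cancellation (AEC) processor before delivering it to your app. The AEC maintains a reference signal of what the speaker is currently outputting and subtracts it from the captured audio, typically in the audio DSP or firmware rather than in your process.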
What your AudioRecord buffer receives after switching: microphone audio with the speaker echo removed. The user's voice is present. The TTS playback audio is not.

    // Before: raw mic, picks up speaker audio during TTS
    val audioRecord = AudioRecord(
        MediaRecorder.AudioSource.MIC,
        SAMPLE_RATE_HZ,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT,
        bufferSize
    )

    // After: hardware AEC enabled, speaker audio cancelled before buffer delivery
    val audioRecord = AudioRecord(
        MediaRecorder.AudioSource.VOICE_COMMUNICATION,
        SAMPLE_RATE_HZ,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT,
        bufferSize
    )

That one-line change to the audio source constant eliminated every false interrupt in testing.
The interruption detector lives in a Kotlin native module exposed to React Native. It runs AudioRecord in a background thread and emits events to JS when the user speaks.

    import android.media.AudioFormat
    import android.media.AudioRecord
    import android.media.MediaRecorder
    import android.media.audiofx.AcousticEchoCanceler
    import com.facebook.react.bridge.Promise
    import com.facebook.react.bridge.ReactApplicationContext
    import com.facebook.react.bridge.ReactContextBaseJavaModule
    import com.facebook.react.bridge.ReactMethod
    import com.facebook.react.modules.core.DeviceEventManagerModule
    import kotlin.math.sqrt

    class VoiceInterruptModule(reactContext: ReactApplicationContext) :
        ReactContextBaseJavaModule(reactContext) {

        companion object {
            const val NAME = "VoiceInterruptModule"
            private const val SAMPLE_RATE_HZ = 16000
            private const val INTERRUPT_THRESHOLD = 0.015f
            private const val SUSTAINED_FRAMES_REQUIRED = 3
            private const val EVENT_INTERRUPT = "onVoiceInterrupt"
        }

        override fun getName() = NAME

        private var audioRecord: AudioRecord? = null
        private var echoCanceler: AcousticEchoCanceler? = null
        private var monitorThread: Thread? = null
        @Volatile private var isMonitoring = false

        @ReactMethod
        fun startMonitoring(promise: Promise) {
            if (isMonitoring) {
                promise.resolve(null)
                return
            }

            // getMinBufferSize returns a size in bytes
            val bufferSize = AudioRecord.getMinBufferSize(
                SAMPLE_RATE_HZ,
                AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT
            )
            val record = AudioRecord(
                MediaRecorder.AudioSource.VOICE_COMMUNICATION, // hardware AEC
                SAMPLE_RATE_HZ,
                AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT,
                bufferSize * 4
            )
            if (record.state != AudioRecord.STATE_INITIALIZED) {
                record.release()
                promise.reject("INIT_FAILED", "AudioRecord failed to initialise")
                return
            }

            // Attach the AcousticEchoCanceler effect as a fallback layer on devices
            // where VOICE_COMMUNICATION alone does not fully activate hardware AEC.
            // Hold a reference and enable it explicitly; create() may return null.
            if (AcousticEchoCanceler.isAvailable()) {
                echoCanceler = AcousticEchoCanceler.create(record.audioSessionId)
                    ?.also { it.setEnabled(true) }
            }

            audioRecord = record
            isMonitoring = true
            record.startRecording()

            monitorThread = Thread {
                val buffer = ShortArray(bufferSize)
                var sustainedFrames = 0
                while (isMonitoring) {
                    val readSize = record.read(buffer, 0, buffer.size)
                    if (readSize <= 0) continue
                    val amplitude = calculateRms(buffer, readSize)
                    if (amplitude > INTERRUPT_THRESHOLD) {
                        sustainedFrames++
                        if (sustainedFrames >= SUSTAINED_FRAMES_REQUIRED) {
                            emitInterruptEvent()
                            sustainedFrames = 0
                        }
                    } else {
                        sustainedFrames = 0
                    }
                }
            }.also { it.start() }

            promise.resolve(null)
        }

        @ReactMethod
        fun stopMonitoring(promise: Promise) {
            isMonitoring = false
            // Stop first so a blocked read() returns, then wait for the thread
            // to exit before releasing the native resources it was using.
            audioRecord?.stop()
            monitorThread?.join(500)
            monitorThread = null
            echoCanceler?.release()
            echoCanceler = null
            audioRecord?.release()
            audioRecord = null
            promise.resolve(null)
        }

        private fun calculateRms(buffer: ShortArray, readSize: Int): Float {
            var sum = 0.0
            for (i in 0 until readSize) {
                val sample = buffer[i] / 32768.0
                sum += sample * sample
            }
            return sqrt(sum / readSize).toFloat()
        }

        private fun emitInterruptEvent() {
            reactApplicationContext
                .getJSModule(DeviceEventManagerModule.RCTDeviceEventEmitter::class.java)
                .emit(EVENT_INTERRUPT, null)
        }

        // Required by NativeEventEmitter on the JS side
        @ReactMethod
        fun addListener(eventName: String) {}

        @ReactMethod
        fun removeListeners(count: Int) {}
    }

A few things worth noting in this implementation:
SUSTAINED_FRAMES_REQUIRED = 3 requires 3 consecutive frames above threshold before firing. At a 16kHz sample rate with 1024-sample reads, each frame covers 64ms, so 3 frames is about 192ms. The actual frame length follows the value getMinBufferSize returns on the device, so the window varies between devices. This rejects brief spikes from keyboard sounds or notification audio while keeping the response delay short.
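The AcousticEchoCanceler.create(record.audioSessionId) call attaches the platform's AcousticEchoCanceler effect to the recording session as a belt-and-suspenders measure. On most modern Android devices, VOICE_COMMUNICATION activates hardware AEC automatically. On older devices or heavily customised OEM audio stacks, it may not. Where the effect is available, it adds another cancellation pass at minimal CPU cost. Keep a reference to the returned effect, enable it, and release it when monitoring stops; create() can return null even when isAvailable() is true.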
The threshold of 0.015f is lower than what you'd use with a raw mic source. With hardware AEC, ambient noise is the main remaining signal, not speaker bleed, so you can afford a more sensitive threshold.
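The 192ms figure is plain arithmetic: frames × samples per frame ÷ sample rate. A minimal sketch follows, with `sustainedWindowMs` as a hypothetical helper that is not part of the module. One detail worth checking on real hardware: AudioRecord.getMinBufferSize returns a size in bytes, while the monitor loop allocates a ShortArray of that many elements, so the samples read per frame equal the returned number rather than half of it. The 1280 below is a made-up value for illustration, not a measurement.

```kotlin
// Hypothetical helper: duration in milliseconds of a run of consecutive frames.
fun sustainedWindowMs(samplesPerFrame: Int, frames: Int, sampleRateHz: Int): Double =
    frames * samplesPerFrame * 1000.0 / sampleRateHz

fun main() {
    // The case from the text: 1024-sample frames at 16kHz, 3 frames required.
    println(sustainedWindowMs(samplesPerFrame = 1024, frames = 3, sampleRateHz = 16_000)) // 192.0

    // Illustrative only: if getMinBufferSize returned 1280, each read would
    // cover 1280 samples (80ms), and the 3-frame window would grow to 240ms.
    println(sustainedWindowMs(samplesPerFrame = 1280, frames = 3, sampleRateHz = 16_000)) // 240.0
}
```

If the window needs to be the same across devices, one option is to read a fixed number of samples per frame instead of sizing the read buffer from getMinBufferSize.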
Register the module in the package class:

    class VoiceInterruptPackage : ReactPackage {
        override fun createNativeModules(
            reactContext: ReactApplicationContext
        ): List<NativeModule> = listOf(VoiceInterruptModule(reactContext))

        override fun createViewManagers(
            reactContext: ReactApplicationContext
        ): List<ViewManager<*, *>> = emptyList()
    }

Add it to MainApplication.kt:

    override fun getPackages(): List<ReactPackage> =
        PackageList(this).packages.apply {
            add(VoiceInterruptPackage())
        }

On the JS side, subscribe to the interrupt event before TTS starts and unsubscribe after:

    import { NativeEventEmitter, NativeModules } from 'react-native'

    const { VoiceInterruptModule } = NativeModules
    const emitter = new NativeEventEmitter(VoiceInterruptModule)

    async function startConversationalMode() {
      await VoiceInterruptModule.startMonitoring()
      const subscription = emitter.addListener('onVoiceInterrupt', () => {
        ttsEngine.stop()
        subscription.remove()
        VoiceInterruptModule.stopMonitoring()
        startListeningForNextTurn()
      })
    }

The RECORD_AUDIO permission must be granted before calling startMonitoring(). Declare android.permission.RECORD_AUDIO in the manifest and request it at runtime; it has been a runtime permission since Android 6.0, and it is the same permission used for any microphone access.
After switching to VOICE_COMMUNICATION, I removed the adaptive threshold logic that was trying to track background noise levels, the 500ms cooldown after TTS starts that was meant to skip the initial speaker bleed, and a gain normalisation pass that was compensating for mic-to-speaker distance variation on different devices.
All of those were patches on top of the wrong audio source. With hardware AEC active, the only thing that reaches the RMS check is genuine ambient audio. A fixed threshold and a sustained-frames check handle it correctly.
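The fixed-threshold plus sustained-frames check is small enough to pull out of the monitor loop and exercise without a microphone. The sketch below is one way to do that, assuming the same counting rules as the loop in the module; `SustainedFrameDetector` and `onFrame` are hypothetical names introduced here for illustration.

```kotlin
// Hypothetical class: the counting logic from the monitor loop, isolated so it
// can be fed synthetic amplitudes.
class SustainedFrameDetector(
    private val threshold: Float,
    private val framesRequired: Int
) {
    private var sustainedFrames = 0

    // Returns true on the frame that completes a run of `framesRequired`
    // consecutive above-threshold frames, then resets the counter.
    fun onFrame(amplitude: Float): Boolean {
        if (amplitude <= threshold) {
            sustainedFrames = 0
            return false
        }
        sustainedFrames++
        if (sustainedFrames < framesRequired) return false
        sustainedFrames = 0
        return true
    }
}

fun main() {
    val detector = SustainedFrameDetector(threshold = 0.015f, framesRequired = 3)
    // One loud spike, one quiet frame, then three loud frames in a row.
    val amplitudes = floatArrayOf(0.20f, 0.001f, 0.03f, 0.04f, 0.05f)
    val fired = amplitudes.map { detector.onFrame(it) }
    println(fired) // [false, false, false, false, true]
}
```

The isolated spike never fires because the quiet frame resets the counter; only the third consecutive loud frame does.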
The conversational loop ran cleanly after the change. The user could interrupt mid-sentence. The assistant stopped, listened, and responded. No false triggers in testing across 3 devices: a Xiaomi 23129RN51H (Android 15), a Pixel 6, and a mid-range Samsung. The one-line audio source change was the entire fix.
why does rms amplitude detection trigger false interrupts during tts playback on android?
when you record with AudioSource.MIC or AudioSource.DEFAULT, the audio buffer includes everything the microphone transducer picks up, including sound from the device's own speaker. during tts playback the speaker emits audio that reaches the microphone at -20 to -30 dB relative to the original signal, which is still above any practical interrupt threshold. the microphone has no way to distinguish your voice from the device's own output. you need hardware-level cancellation, not a smarter threshold.
what does AudioSource.VOICE_COMMUNICATION do differently from AudioSource.MIC?
AudioSource.VOICE_COMMUNICATION (value 7 in MediaRecorder.AudioSource) selects the microphone path tuned for voice calls. on devices that provide it, this runs the input through the hardware Acoustic Echo Cancellation processor before the audio stream reaches your app. the AEC maintains a reference signal of what the speaker is currently playing and subtracts it from the microphone input. what your AudioRecord buffer receives is microphone audio with the speaker echo removed. your voice is preserved. the tts playback audio is not. with hardware aec this runs in device firmware, not in your app, so it adds little or no CPU overhead to your process and keeps working during heavy on-device inference.
do you still need a threshold check after switching to VOICE_COMMUNICATION?
yes, but the threshold serves a different purpose. hardware aec removes speaker bleed, not ambient room noise. a threshold check after switching to VOICE_COMMUNICATION answers 'is the user actually speaking' rather than 'is something making noise'. you can set it lower than you would with a raw mic source. values around 0.01 to 0.02 rms work well in quiet environments. add a sustained-frames check requiring 2-3 consecutive frames above threshold to reject brief noise spikes like keyboard clicks or notification sounds.
is AcousticEchoCanceler the same as using AudioSource.VOICE_COMMUNICATION?
they overlap but are separate. AudioSource.VOICE_COMMUNICATION is an audio source selector that tells Android to route the input through the voice-call processing chain, which includes AEC where the device provides it. AcousticEchoCanceler is an AudioEffect that you attach explicitly to an AudioRecord session as an additional processing stage; how it is implemented depends on the device. in practice, AudioSource.VOICE_COMMUNICATION alone is sufficient for tts interruption detection. AcousticEchoCanceler is useful as a fallback on older devices where the hardware AEC chain may not be activated automatically by the audio source selector.