Case-Study: Conditional Voice Recorder
A research group at the University of Siegen (https://www.uni-siegen.de/), part of the Collaborative Research Center “Media of Cooperation”, contacted us for support with adapting a piece of open source software to their specific needs.
Project B06 is researching “intelligent personal assistants”, of which Amazon Alexa, Apple Siri and the Google Assistant are the most popular examples. The research focuses on the interactions that lead up to a request to the assistant and the utterances that follow it.
To evaluate these interactions it is necessary to record the audio before and after the request. To protect the privacy of the study participants, the amount of recorded audio has to be kept to a minimum. Some existing software facilitates exactly that: the Mixed Reality Laboratory at the University of Nottingham provides their Conditional Voice Recorder (CVR), which continuously listens for a wake-word and starts recording audio once it is detected (Porcheron et al. 2018). It also keeps a fixed-size ring buffer of audio preceding the wake-word, to give insight into the interactions leading up to the request to the assistant.
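The pre-roll idea can be sketched with a fixed-size deque, which behaves exactly like a ring buffer: once full, appending a new frame silently drops the oldest one. The parameters and function names below are illustrative assumptions, not the CVR's actual configuration.

```python
from collections import deque

# Illustrative parameters -- not the CVR's actual configuration.
SAMPLE_RATE = 16000        # samples per second
FRAME_SIZE = 512           # samples per audio frame
PRE_ROLL_SECONDS = 10      # how much audio to keep before the wake-word

# A deque with maxlen acts as a ring buffer: appending to a full deque
# discards the oldest element.
max_frames = (SAMPLE_RATE * PRE_ROLL_SECONDS) // FRAME_SIZE
pre_roll = deque(maxlen=max_frames)

def on_audio_frame(frame):
    """Called for every captured frame while no recording is active."""
    pre_roll.append(frame)

def on_wake_word():
    """Start a recording that includes the buffered pre-roll audio."""
    recording = list(pre_roll)   # audio preceding the wake-word
    pre_roll.clear()
    return recording
```

Because the deque discards old frames automatically, the memory footprint stays constant no matter how long the device idles between requests.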
The hardware setup is fairly straightforward: we use a Raspberry Pi 3 to record the audio from a conference microphone. The setup is placed near the intelligent assistant to catch most of the interactions; no hardware modification of the assistant itself is needed.
The original CVR software uses a proprietary wake-word engine (“Snowboy”) that did not support all of the required wake-words (“Alexa”, “OK Google” and “Hey Siri”). During the project it also turned out that the engine was discontinued and pulled from the internet. We therefore added support for a wake-word engine based on TensorFlow Lite. However, there were no pre-made models for the wake-words mentioned above, so we tried to train our own on a body of audio provided by the research group. Unfortunately, the collected audio samples proved insufficient for training: we ended up with a TensorFlow model that reliably detected all of the wake-words but also triggered on a lot of unrelated sounds. In the end, using this engine was not feasible.
In a stroke of luck, the “Porcupine” wake-word engine of the Picovoice project started providing pre-trained models for these wake-words. By porting the CVR software to this engine we were able to provide reliable and configurable detection for the research project.
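The detection loop itself is engine-agnostic: audio arrives in fixed-size frames, each frame is passed to the engine, and the engine reports which keyword (if any) it heard. The sketch below uses a stand-in engine object so it runs on its own; the real Porcupine engine exposes a similar frame-by-frame `process()` interface that returns the index of the detected keyword, or -1 for none. The stub's string-matching "detection" is purely for demonstration; a real engine operates on PCM samples.

```python
class StubEngine:
    """Stand-in for a wake-word engine such as Porcupine.

    The real engine consumes fixed-size PCM frames via process() and
    returns the index of the detected keyword, or -1 for no detection.
    """
    def __init__(self, keywords):
        self.keywords = keywords

    def process(self, frame):
        # Toy heuristic for demonstration only: "detect" a keyword if
        # its name appears in the frame as text.
        for i, kw in enumerate(self.keywords):
            if kw in frame:
                return i
        return -1

def detect(engine, frames):
    """Run the engine over a stream of frames, yielding detections."""
    for n, frame in enumerate(frames):
        idx = engine.process(frame)
        if idx >= 0:
            yield n, engine.keywords[idx]

engine = StubEngine(["alexa", "ok google", "hey siri"])
hits = list(detect(engine, ["...", "hey siri please", "...", "alexa stop"]))
# hits == [(1, "hey siri"), (3, "alexa")]
```

Swapping engines then only means replacing the object behind this interface, which is what made porting the CVR from one engine to another tractable.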
We also did some more work on the software. To minimize the impact of false triggers, we prepend the audio that triggered the recording to the actual recording, separated by clearly audible DTMF tones. That way the person evaluating the recordings can immediately judge whether the recording was triggered erroneously and skip irrelevant material. This also helps in terms of privacy: false positives no longer have to be listened to in full just to identify them as such.
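A DTMF tone is simply the sum of two sine waves at standardized frequencies, which is what makes it so easy to recognize by ear. A separator tone like the one described above could be generated as follows; the frequency pairs are the standard DTMF table, while the duration, sample rate and choice of digit are illustrative assumptions.

```python
import math

SAMPLE_RATE = 16000

# Standard DTMF (low, high) frequency pairs in Hz for a few keys.
DTMF = {
    "1": (697, 1209), "2": (697, 1336), "3": (697, 1477),
    "*": (941, 1209), "0": (941, 1336), "#": (941, 1477),
}

def dtmf_tone(key, duration=0.2, rate=SAMPLE_RATE):
    """Return float samples in [-1, 1] for one DTMF key press."""
    low, high = DTMF[key]
    n = int(duration * rate)
    return [
        0.5 * math.sin(2 * math.pi * low * i / rate)
        + 0.5 * math.sin(2 * math.pi * high * i / rate)
        for i in range(n)
    ]

marker = dtmf_tone("#")
```

Scaling each component by 0.5 keeps the summed signal within [-1, 1], so it can be written to an audio file without clipping.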
To finish up the project we also prepared an SD card image for the Raspberry Pi and customized the hardware. That way the study group could ship the setup to the study participants without requiring any special knowledge to set up the research infrastructure.
At a later stage of the project some of the microphone hardware developed connectivity faults, so we hardened the CVR software against disappearing and reconnecting microphones. In the end the research project managed to collect the necessary recordings; the data collection phase of the study was concluded successfully, and we’re excited to see the final results.
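Hardening against a vanishing device usually comes down to wrapping the capture loop so that a read error leads to reopening the microphone instead of crashing. The sketch below shows the pattern with a reconnecting frame generator; `open_mic` and `read_frame` are hypothetical placeholders for whatever audio backend is in use, not the CVR's actual code.

```python
def frames_with_reconnect(open_mic, read_frame, max_retries=3):
    """Yield audio frames, transparently reopening the mic on OSError.

    open_mic() returns a microphone handle; read_frame(mic) returns the
    next frame and may raise OSError when the device disappears.
    """
    retries = 0
    mic = open_mic()
    while True:
        try:
            yield read_frame(mic)
        except OSError:
            # Device unplugged or re-enumerated: reopen and carry on.
            retries += 1
            if retries > max_retries:
                raise
            mic = open_mic()
```

The consumer of the frames never sees the reconnect; at most it sees a gap in the audio, which is preferable to losing the whole recording session.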
For further, more detailed information, please see: