- Sphinx Based Plugin
- Whisper Speech-to-Text Unreal Engine Plugin
- REST API (Unreal Engine) – Flask
GIT : Sphinx Unreal Engine Plugin
Acoustic Model : Contains a statistical representation of the distinct sounds making up each word in the vocabulary; each sound corresponds to a phoneme
Language Model : Contains the list of words and the probability of their occurring in sequence
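The vocabulary entries the acoustic model maps onto are plain word-to-phoneme lines in a CMU-style pronunciation dictionary. The entries below are illustrative only; the actual vocab shipped inside the content/model directory of the plugin may differ.

```
; CMU-style .dic entries: WORD followed by its phoneme sequence
HELLO    HH AH L OW
PICK     P IH K
UP       AH P
CUP      K AH P
```

Each phoneme here (HH, AH, L, ...) must have a statistical representation in the acoustic model, and the word itself must appear in the language model, for recognition to work.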
Fig. (a) Folder structure inside content/model directory (b) Phonemes inside the vocab
Fig. (a) Setting Probability tolerance for Recognised phrases (b) Reading and Displaying Recognised text
Fig. Overview of Sphinx plugin Speech to Text operation
Drawbacks:
- Phonemes must be added to the vocabulary for words to be recognised
- Recognition performance degrades noticeably for phrases of two or more words
Reference : Whisper Cpp
GIT : ../blob/main/SpeechRecognition.zip
Libraries Used:
- SDL2
- Whisper (C++)
- Standard Library C++ 17
- Containers : Array, Vector, Map, Set
- Streams : fstream, iostream, sstream
- Concurrency : thread, mutex, atomic
Fig. Code snippet inside Build.cs of speech-to-text unreal engine plugin
Inside 'SpeechRecognition\Source\SpeechRecognition\Private\MySpeechWorker', the functions for recording, scaling, and filtering audio are defined.
The processed audio is passed on to the Whisper network, which outputs the transcribed text.
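The scaling step in MySpeechWorker is done in C++, but the idea can be sketched in Python: microphone samples arrive as signed 16-bit PCM and must be scaled to floats in [-1.0, 1.0] before being handed to Whisper. The DC-offset removal below is an illustrative stand-in for the plugin's filtering, not its actual filter.

```python
import struct

def pcm16_to_float(raw: bytes) -> list:
    """Scale signed 16-bit PCM bytes to floats in [-1.0, 1.0],
    the sample range whisper.cpp expects for its input buffer."""
    n = len(raw) // 2
    samples = struct.unpack("<%dh" % n, raw)
    return [s / 32768.0 for s in samples]

def remove_dc_offset(samples: list) -> list:
    """A minimal 'filter': subtract the mean so the signal is centred on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

# Two raw samples: 0 and +32767 (the maximum positive int16 value)
floats = pcm16_to_float(struct.pack("<2h", 0, 32767))
```

The scaled, filtered buffer is what the plugin passes to the Whisper network for transcription.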
Fig. Code snippet to retrieve audio buffer and invoke whisper for transcriptions
Drawbacks:
- Speed : Transcription takes more than 8 seconds, which is too slow to be reliable
- Accuracy : The obtained transcriptions do not always match the speaker's utterances
GIT (USemLog) : ../USemLog/tree/SpeechRecord
GIT (Flask Python file) : ../IAI_USEMLOG_REST_Speech/blob/master/voice.py
FLASK :
- Flask is a lightweight framework for building web applications in Python
- Used Python's PyAudio to read in audio data with the required format, sample rate, etc.
- Used routes to map URLs to the functions that handle the requests
- Listened for incoming HTTP requests and responded with the appropriate HTTP responses (transcriptions)
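The Flask side described above can be sketched as below. The route names /start and /stop and the JSON keys are assumptions for illustration; the actual routes in voice.py may differ.

```python
from flask import Flask, jsonify

app = Flask(__name__)
recording = {"active": False}

@app.route("/start", methods=["POST"])
def start():
    # In voice.py this is where the PyAudio capture thread would be launched
    recording["active"] = True
    return jsonify({"status": "recording started"})

@app.route("/stop", methods=["POST"])
def stop():
    # Here the capture stops, the .wav is saved, and Whisper is invoked;
    # the placeholder stands in for the real transcription text
    recording["active"] = False
    return jsonify({"status": "stopped", "transcription": "<whisper output>"})

# app.run(host="127.0.0.1", port=5000)  # uncomment to serve on localhost
```

Unreal's HTTP requests hit these routes, and the JSON body of the /stop response carries the transcription back to the game.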
Unreal REST API :
- Used Unreal Engine's C++ HTTP modules to issue API requests (start and stop recording)
- Used JSON parsing libraries to parse the response received from Flask
VR Motion Controller Mappings :
Fig. (a) Params in SL_LoggerManager used to map inputs (b) VR Trackpad buttons mapped as inputs to params in project_settings/inputs
Libraries and Tools Used :
- Unreal Engine (4.27, 5.1.1)
- USemLog Plugin
- C++ Libraries (STL:Containers, JSON, HTTP)
- PyCharm (Flask API)
- Python Libraries (PyAudio, Whisper, os, wave, datetime, torch, threading)
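Of the Python libraries listed, PyAudio captures the frames and the stdlib wave module writes them to disk for Whisper. A minimal sketch of the saving step, assuming typical capture parameters (voice.py's actual values may differ):

```python
import wave

# Assumed capture parameters; voice.py may use different values
CHANNELS = 1
SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16-bit PCM
RATE = 16000       # Whisper models expect 16 kHz input

def save_wav(path, frames):
    """Write captured audio frames (a list of byte strings,
    as returned by PyAudio's stream.read) to a .wav file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))

# One second of silence as stand-in frames
save_wav("utterance.wav", [b"\x00\x00" * RATE])
```

The resulting .wav file is the input handed to the whisper package for transcription.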
Fig. Overview of Speech to Text Operation in ‘RobCoG’ using FLASK API
Fig. Code snippet of variables to facilitate controller mappings
Fig. Code snippet of controller mappings to functions
Fig. Code snippet of mapped functions calls definition
Fig. Code snippet of function to send start audio signal API request
Fig. Code snippet of function to send stop audio signal API request
- Of the three approaches described above, the third one, REST API (Unreal Engine) – Flask, delivers satisfactory performance
- Steps to follow:
- Get the updated USemLog plugin, with the speech scripts, into the Unreal Engine project
- Open the project in PyCharm and install all the packages used in the voice.py script
- Run the Flask application (the voice.py script) on the localhost server
- The server status can be checked with the Postman extension for Chrome using the localhost server URL
- The Flask application now listens for API requests
- Start the RobCoG project
- Use the VR controllers to start and stop audio recordings
- Click the right controller's trackpad-down button to raise an API request to start the audio recording
- The user should then start speaking
- The running Flask application receives the request and starts recording the user's utterances
- Click the left controller's trackpad-down button to raise an API request to stop the audio recording
- Alternatively, the recording also stops when the Unreal game session ends
- Upon the stop request, the Flask application finishes recording, saves the audio as a .wav file, and invokes the Whisper package for transcription
- The transcriptions are then formatted as JSON, passed back to RobCoG, and displayed to the user.
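The JSON hand-off in the last step can be illustrated as follows. On the Unreal side the parsing is done with the engine's C++ JSON module; the Python equivalent is shown here, and the key name "transcription" is an assumption about the payload shape, not taken from voice.py.

```python
import json

# Hypothetical shape of the Flask response body
response_body = json.dumps({"transcription": "pick up the cup"})

# RobCoG parses the response and extracts the text to display to the user
payload = json.loads(response_body)
text = payload["transcription"]
```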