I would like to pitch an idea for a successor protocol to Wyoming, which I am tentatively calling Montana. Because this is a GitHub issue, I will be brief in my criticisms of the current protocol, all of which are technical in nature. I hope that does not come off as curt or rude, because I honestly love what the Open Home Foundation (and @synesthesiam in particular) has created here in terms of community and software.
Shortfalls in Wyoming
There are a number of shortfalls in Wyoming that ultimately make it unsuitable as the core protocol in a voice assistant.
First, the Speech to Text interface is unable to handle many important cases. From the README:
Speech to Text
→ transcribe event with name of model to use or language (optional)
→ audio-start (required)
→ audio-chunk (required)
Send audio chunks until silence is detected
→ audio-stop (required)
← transcript
Contains text transcription of spoken audio
The chunking interface does not allow speech to be streamed into text. I'm calling this the "Word Boundary Problem": in essence, the transcription API assumes the entire utterance is delivered, and then a full transcript is returned. You cannot break the audio up into smaller requests, because you can't be sure you're splitting on word boundaries. For example, in #5, user @sdetweil appears to have run into this problem (though I cannot be sure). In that issue, @synesthesiam mentions the possibility of chunking the transcript, but notably this doesn't solve the word boundary problem. Natural conversation is impossible without this streaming support.
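To make the shape of the problem concrete, here is a rough sketch of a client driving the flow above. The `SttClient` interface is a hypothetical stand-in, not the real Wyoming wire format; the point is simply that nothing can come back until the entire utterance has been sent.

```typescript
// Illustrative only: `SttClient` is a hypothetical stand-in for a Wyoming-style
// speech-to-text service, using the event names from the README excerpt above.
// The point is the shape of the exchange, not the real wire format.
interface SttClient {
  send(event: { type: string; data?: unknown; payload?: Uint8Array }): Promise<void>;
  receive(): Promise<{ type: string; data?: unknown }>;
}

async function transcribeUtterance(client: SttClient, chunks: Uint8Array[]): Promise<string> {
  await client.send({ type: "transcribe", data: { language: "en" } });
  await client.send({ type: "audio-start" });
  for (const chunk of chunks) {
    // Every chunk of the utterance has to be sent up front...
    await client.send({ type: "audio-chunk", payload: chunk });
  }
  await client.send({ type: "audio-stop" });
  // ...because the one and only transcript arrives after audio-stop.
  const event = await client.receive(); // expected: { type: "transcript", data: { text } }
  return (event.data as { text: string }).text;
}
```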
Compounding the above issue, the data sent does not carry any "request ID" or similar correlation token. We are left simply assuming that if we request two transcripts in quick succession, the first response back belongs to the first request. This will not always be true, and we cannot build reliable software on such assumptions.
Solution: We should be able to stream audio and receive a stream of replies through a separate data channel, all simultaneously. Separate requests should be able to be reconstituted reliably, even if sent or received out-of-order or over an unreliable network. The basis of this would be a real-time transport protocol, such as WebRTC.
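As a rough illustration of the client side of that solution, here is a sketch assuming transcripts arrive as JSON messages on a WebRTC data channel while audio flows on a media track. The message fields (requestId, "partial"/"final") are my own invention, not an existing spec.

```typescript
// A minimal sketch, assuming transcripts arrive on a WebRTC data channel and
// each reply carries the ID of the request that produced it.
type OnFinal = (text: string) => void;

function watchTranscripts(results: RTCDataChannel): (requestId: string, onFinal: OnFinal) => void {
  // Replies are matched to requests by ID, so two transcriptions fired in
  // quick succession can never be confused, even if results arrive out of order.
  const pending = new Map<string, OnFinal>();

  results.onmessage = (msg) => {
    const event = JSON.parse(msg.data) as {
      requestId: string;
      kind: "partial" | "final";
      text: string;
    };
    if (event.kind === "partial") {
      // Partial transcripts stream in while the user is still speaking.
      console.log(`[${event.requestId}] partial: ${event.text}`);
    } else {
      pending.get(event.requestId)?.(event.text);
      pending.delete(event.requestId);
    }
  };

  // Register a callback for the final transcript of a given request.
  return (requestId, onFinal) => pending.set(requestId, onFinal);
}
```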
More generally:
The API is awkward to work with (data being returned in two places -- the header and the payload -- for utterances, for example); cf. the discussion in #14. It pushes a tonne of unnecessary work onto application developers, because they basically have to Wireshark whatever they are talking to in order to know what to expect in the payload vs. data vs. header.
Solution: Simplify the protocol so that data can only appear in one place -- hopefully the most obvious one!
The current version is unfixable because it is unversioned and contains no extension points. As a result, any change to the protocol (for example, adding streaming support to Speech to Text as suggested above) will break older clients, with no way to tell them that they will break. This makes it much more difficult (perhaps impossible) to build progressively more complex software on Wyoming as a foundation.
Solution: The protocol must be explicitly versioned so that older clients can properly handle messages they know come from newer versions, and fail gracefully.
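As an illustration of what explicit versioning buys us, here is a sketch of a client handling a version announcement on a control channel; the "hello" message and its field names are hypothetical.

```typescript
// A sketch of explicit versioning: the field names and the idea of a "hello"
// message are assumptions, not an existing spec.
interface Hello {
  type: "hello";
  protocol: { major: number; minor: number }; // bump major on breaking changes
  capabilities: string[];                     // e.g. supported codecs and services
}

const SUPPORTED_MAJOR = 1;

function handleHello(raw: string): void {
  const hello = JSON.parse(raw) as Hello;
  if (hello.protocol.major !== SUPPORTED_MAJOR) {
    // Older clients can now fail gracefully instead of misparsing newer messages.
    throw new Error(
      `peer speaks protocol ${hello.protocol.major}.${hello.protocol.minor}, ` +
        `but only major version ${SUPPORTED_MAJOR} is supported`
    );
  }
  // Minor-version differences are handled by ignoring unknown fields.
}
```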
The protocol is poorly specified: it lives in a single README.md, without enough detail and without any complete examples, so it cannot really be understood without reading reference code or using Wireshark (cf. #14). This is unreasonable and limits engagement and uptake by members of this and other communities. If a goal is to be an "open standard", as claimed in the README, then more work is needed here.
Solution: A formal specification would be appreciated for anyone trying to interface with devices. Adoption of an IDL would simplify this task a lot. (This obviously can be fixed without a new protocol, but I include it here for completeness)
Pure TCP lacks security (authentication, encryption), forcing users either to live without security or to implement their own.
Solution: The protocol should be at least as secure as a standard HTTPS request, with encryption and authentication.
Sending raw PCM data over the network is wasteful and awkward when codecs like Opus exist.
Solution: Allow for the use of different formats/codecs for audio, not just raw PCM.
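For what it's worth, WebRTC already negotiates Opus as a matter of course, and the standard browser API even lets an endpoint state a codec preference. Here is a small sketch; whether Montana would pin Opus this way or simply rely on the defaults is an open question.

```typescript
// A sketch of preferring Opus on an audio transceiver using the standard
// browser API (RTCRtpTransceiver.setCodecPreferences).
function preferOpus(transceiver: RTCRtpTransceiver): void {
  const caps = RTCRtpReceiver.getCapabilities("audio");
  if (!caps) return; // capability query unsupported; fall back to the defaults
  const opusFirst = [
    ...caps.codecs.filter((c) => c.mimeType.toLowerCase() === "audio/opus"),
    ...caps.codecs.filter((c) => c.mimeType.toLowerCase() !== "audio/opus"),
  ];
  transceiver.setCodecPreferences(opusFirst);
}
```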
Let me again reiterate that I really love what Wyoming is trying to do, and hope the above criticisms will be viewed as constructive.
Montana
First, if we are going to call this "an Open Standard from the Open Home Foundation", then I must emphasize that my design cannot be final. Many "Open X Foundations" have XIPs -- "X Improvement Proposals" -- that are developed in the open with community feedback, rather like RFCs. Perhaps something similar -- a HIP -- could be done with the OHF? This proposal is mostly a sounding board: if I were designing a voice assistant protocol from scratch, how would I do it? Open standards are developed in the open!
Second, I am more than willing to cede the copyright or whatever rights needed to the Open Home Foundation for this work, if they deem it necessary.
But onto the protocol.
I believe WebRTC is ultimately a better base than stdin/stdout or TCP for this. It is an open standard, widely supported in browsers, gaining support in many languages, and it perfectly encapsulates what we want to do. It is normally used for conversations between humans -- so why not between voice assistant agents and humans?
I am not a WebRTC expert, so I'd invite someone to jump in if I got something wrong here!
WebRTC has three core types of channels: audio, video, and data. In the simplest incarnation, Montana would not use the video channel -- but it's kind of exciting to think that we have access to it if we wanted to add an avatar!
The basic idea is that we would:
Establish a WebRTC connection between server and client (this is a peer-to-peer connection, so it would not require a beefy server in the middle, just as today, but it would require a signalling server process to help connect the hosts to one another). This connection is secure in the sense that it supports encryption and authentication using the standard web mechanisms (DTLS/SRTP).
At minimum, one channel will always be established: a data channel which I will call the "control channel". This is where the server and client actually speak the protocol (a minimal connection-setup sketch follows this list). Instead of detailing the entire format of the protocol here, however, I offer some advice.
Pick Protobuf, Cap'n Proto, JSON -- at least some open standard -- as the data protocol, not a non-standard JSONL+binary format. Ideally, audio data will be sent over an audio channel. I would suggest Protocol Buffers, because we might still want to send binary data, and they are relatively efficient, inspectable (with the right tools), and expressive. There is also the benefit of being able to generate bindings in many languages -- I figure many people working on hobby projects have a favourite language that might not be Python/JS that they want to use to power their servers.
Have a handshake at the start of the protocol, exchanging information on what the server and client mutually support (codecs, etc.). This is also where the protocol version would be specified, allowing us to fall back to an older protocol for older clients if needed (or just fail gracefully!).
Use the other channels for the main data transfer. If doing a transcription service, open a separate audio and data channel pair, streaming audio through the audio channel and streaming text blobs (in some format) back through the second data channel. Do not do everything on the control channel.
Like a TCP connection, this will be stateful and remain open. Perhaps having a heartbeat on the control channel would be a good idea, as this allows us to detect network faults and adjust services based on latency. Presence is powerful.
Have IDs that tie related channels together, so that data sent on one can be associated on the other side with whatever request produced it.
Use Opus for voice. There is no reason to send enormous raw PCM data over the network.
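To tie the list above together, here is a minimal sketch of the connection setup. Everything that is not part of the standard WebRTC API -- the signalling helper, the channel labels, and the message shapes -- is an assumption on my part, not a finished design.

```typescript
// Hypothetical signalling round-trip: send our offer to the peer via a
// signalling server and get its answer back. Not part of WebRTC itself.
declare function sendOfferToPeer(offer: RTCSessionDescriptionInit): Promise<RTCSessionDescriptionInit>;

async function connect(): Promise<{ pc: RTCPeerConnection; control: RTCDataChannel }> {
  const pc = new RTCPeerConnection({ iceServers: [{ urls: "stun:stun.example.org" }] });

  // The control channel: handshake, heartbeats, and requests to open task channels.
  const control = pc.createDataChannel("control");

  control.onopen = () => {
    // Handshake first: protocol version and mutually supported codecs/services.
    control.send(JSON.stringify({ type: "hello", protocol: { major: 1, minor: 0 }, codecs: ["opus"] }));
    // A heartbeat gives us presence and a rough latency estimate.
    setInterval(() => control.send(JSON.stringify({ type: "ping", sentAt: Date.now() })), 5_000);
  };

  control.onmessage = (msg) => {
    const event = JSON.parse(msg.data);
    if (event.type === "pong") {
      console.log(`round-trip ${Date.now() - event.sentAt} ms`);
    }
  };

  // Standard offer/answer exchange via the signalling server.
  // (Trickle ICE candidate exchange also goes through signalling; omitted here.)
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const answer = await sendOfferToPeer(offer);
  await pc.setRemoteDescription(answer);

  return { pc, control };
}

// Task channels are then opened in pairs and tied together by a shared ID,
// e.g. requesting transcription: audio flows on a media track, results on a
// data channel labelled with the same session ID.
function openTranscriptionSession(pc: RTCPeerConnection, control: RTCDataChannel): RTCDataChannel {
  const sessionId = crypto.randomUUID();
  control.send(JSON.stringify({ type: "open", service: "stt", sessionId }));
  return pc.createDataChannel(`stt-results:${sessionId}`);
}
```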
I really like the idea of having events emitted by the server and client that Wyoming has, but I have reservations about how this is currently done. For example, with WebRTC there would be no need for the below:
Text to Speech
→ synthesize event with text (required)
← audio-start
← audio-chunk
One or more audio chunks
← audio-stop
The synthesize event might be similar, but the audio chunks are no longer necessary, nor are the stateful start/stop events. This massively simplifies building and using the protocol, for both implementors and users. It would look something like the following (with a code sketch after the steps):
-> request to open an audio channel for speech synthesis, and a related data channel.
-> send stream of text data (so that we can support token-by-token generation by an LLM) to be synthesized into speech over the new data channel
<- send the audio back over the audio channel simultaneously
(optional) -> close the channels using the channel ID over the control channel, and clean up
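Here is a rough sketch of that flow from the client's point of view, assuming a control channel is already open; the message shapes and channel labels are hypothetical.

```typescript
// A sketch of speech synthesis over Montana-style channels; the "open"/"close"
// messages and channel labels are assumptions.
async function synthesize(
  pc: RTCPeerConnection,
  control: RTCDataChannel,
  tokens: AsyncIterable<string>
): Promise<void> {
  const sessionId = crypto.randomUUID();

  // Ask the server to open an audio channel plus a paired text channel for this session.
  // (The server adding its audio track triggers renegotiation via signalling, omitted here.)
  control.send(JSON.stringify({ type: "open", service: "tts", sessionId }));
  const text = pc.createDataChannel(`tts-text:${sessionId}`);

  // Synthesised speech arrives as an ordinary WebRTC audio track; play it as it streams in.
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0] ?? new MediaStream([event.track]);
    void audio.play();
  };

  // Stream text as it is produced (e.g. token-by-token from an LLM); no chunk
  // bookkeeping, no audio-start/audio-stop events.
  text.onopen = async () => {
    for await (const token of tokens) {
      text.send(JSON.stringify({ sessionId, token }));
    }
    text.send(JSON.stringify({ sessionId, done: true }));
    // Optionally tear the session down over the control channel when finished.
    control.send(JSON.stringify({ type: "close", sessionId }));
  };
}
```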
While, for example, Wake Word Detection might look like this (again with a sketch after the steps):
-> request to open an audio channel for wake word detection, and a related data channel.
-> send stream of audio data over audio channel, no chunking required. WebRTC does all the heavy lifting here.
<- receive detection events over the new data channel simultaneously
(optional) -> close the channel using the channel ID over the control channel, and clean up
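And a similar sketch for wake word detection; again, the message shapes and channel labels are hypothetical.

```typescript
// A sketch of continuous wake word detection; the "open" message, detection
// event shape, and channel labels are assumptions.
async function watchForWakeWord(pc: RTCPeerConnection, control: RTCDataChannel): Promise<void> {
  const sessionId = crypto.randomUUID();
  control.send(JSON.stringify({ type: "open", service: "wake", sessionId }));

  // Microphone audio flows continuously over a media track; no manual chunking.
  // (Adding a track mid-connection triggers renegotiation, omitted here.)
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  for (const track of mic.getAudioTracks()) pc.addTrack(track, mic);

  // Detections stream back as small events on the paired data channel.
  const detections = pc.createDataChannel(`wake-detections:${sessionId}`);
  detections.onmessage = (msg) => {
    const event = JSON.parse(msg.data) as { name: string; timestamp: number };
    console.log(`wake word "${event.name}" detected at ${event.timestamp}`);
  };
}
```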
In any case, with the framework for a protocol sketched above, I think we could have something really powerful to build on in the future! Please let me know what you think. Again, thank you to the Open Home Foundation and their sponsors for creating such a vibrant community of tinkerers here.
In particular, if you feel any part of this protocol would make it unsuitable for a voice assistant, or that I haven't solved the problems outlined under the Wyoming section, please comment.
Parting Thoughts
I wish to emphasize one thing which seems true to me: the current Wyoming protocol (call it W1) and any future versions (W2, W3, etc.) will be incompatible with each other, no matter what steps are taken. W1 is not up to the task of providing the services (streaming, etc.) that a protocol in its position must provide. A new protocol will have to be specified at some point; the question is whether that new, incompatible protocol will be a Wyoming-like protocol or a completely different one (like Montana). I obviously favour the latter, but I just want to make sure we are on the same page about compatibility: W1 cannot be compatibly evolved, because of technical features of its design. It must be replaced. If you disagree with this, I would be very interested to hear how/why!
I am happy to help in the development of either Montana or W2, working with community members to ensure we have a great foundation to build on for our future voice (and possibly video!) assistants.
yes, I have the (word boundary) problem, if I don't want the wait-until-end-of-speech problem (no live feedback)
each connection/client is a unique session, so there is no intermingling of events or data (and of course the handler has to use instance variables, not globals)
but I like the idea of streaming..
as my attempts to build async callback notification don't work with the chunks.
the base code that uses this Google STT pipes the stdout of arecord frames into the engine