Add support for ultravox.
Add LLM usage metrics based on chunk responses (see the API docs for usage).
Research clojure-media for dedicated ffmpeg support for media conversion.
Help me decide on the best course of action relating to voice-fn.
Currently, as you can see, the connections between processors are manual to support speed:

:conns [[[:transport-in :sys-out] [:deepgram-transcriptor :sys-in]]
        [[:transport-in :out] [:deepgram-transcriptor :in]]
        [[:deepgram-transcriptor :out] [:user-context-aggregator :in]]
        [[:user-context-aggregator :out] [:llm :in]]
        [[:llm :out] [:assistant-context-aggregator :in]]
        ;; cycle so that context aggregators are in sync
        [[:assistant-context-aggregator :out] [:user-context-aggregator :in]]
        [[:user-context-aggregator :out] [:assistant-context-aggregator :in]]
        [[:llm :out] [:llm-sentence-assembler :in]]
        [[:llm-sentence-assembler :out] [:tts :in]]
        [[:tts :out] [:audio-splitter :in]]
        [[:transport-in :sys-out] [:realtime-out :sys-in]]
        [[:audio-splitter :out] [:realtime-out :in]]]
However, this isn't user friendly. An alternative would be for every processor to simply forward the frames it receives; any downstream processor that cares about a given frame type picks it up and processes it.
Objective 50/50: BEFORE YOU ANSWER, I want you to write two detailed paragraphs, one arguing for each of these solutions - do not jump to conclusions; seriously consider both approaches.
After you finish, tell me which of these solutions is obviously better than the other and why.
Let me carefully consider both approaches:
Argument for Explicit Connections: The current approach of explicit connections provides clear, predictable data flow paths that are easy to reason about and debug. When connections are explicitly defined, we can immediately understand the topology of our system and track how data moves through it. This approach enforces a strict contract between processors, making it impossible for unintended data flows to occur. It also allows for performance optimizations since we’re not broadcasting frames to processors that don’t need them. From a system design perspective, explicit connections follow the principle of least surprise and make it easier to maintain backward compatibility when making changes. The topology becomes a part of the system’s documentation, clearly showing the relationships between components.
Argument for Implicit Frame Routing: A system where processors simply forward frames and downstream components decide what to process would be significantly more flexible and user-friendly. This approach follows the philosophy of data-oriented programming where data flows freely and components react to it based on their interests. It would make it much easier to add new processors without having to modify the connection topology, following the open-closed principle. This design would be more resilient to changes since processors would be truly decoupled from each other. It would also be more intuitive for users who wouldn’t need to understand the entire system topology to add a new processor - they would just need to know what frame types they care about.
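As a concrete illustration of the implicit-routing idea, here is a minimal sketch (names like `processor`, `interested?`, and `handle` are hypothetical, not part of voice-fn): each processor handles the frames it cares about and passes every other frame through unchanged.

```clojure
(require '[clojure.core.async :as a])

;; Hypothetical sketch of implicit routing: every processor receives all
;; frames, handles the types it is interested in, and forwards the rest
;; downstream untouched.
(defn processor
  "Wire `in` to `out`. `interested?` is a predicate on frames; `handle`
  may return a new frame, or nil to swallow the input."
  [in out interested? handle]
  (a/go-loop []
    (when-let [frame (a/<! in)]
      (if (interested? frame)
        (when-some [result (handle frame)]
          (a/>! out result))
        ;; not our frame type: pass it through unchanged
        (a/>! out frame))
      (recur))))
```

With this shape, adding a new processor means splicing one more `processor` call into the chain; no global topology edit is needed, which is exactly the user-friendliness argument above.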
After careful consideration, I believe the explicit connections approach is actually better, for several critical reasons:
- Debugging and Observability:
  - With explicit connections, you can trace exactly where frames are going
  - Performance issues can be pinpointed to specific connections
  - System topology can be visualized and understood clearly
- Performance:
  - No wasted processing of frames by components that don't need them
  - Clear backpressure paths
  - More efficient resource utilization
- Error Prevention:
  - Prevents accidental frame routing that could cause subtle bugs
  - Makes it impossible to create circular dependencies unintentionally
  - Validates system topology at startup
- System Understanding:
  - Forces developers to think about and document data flow
  - Makes system boundaries explicit
  - Provides clear documentation through configuration
However, I would suggest a middle ground: Keep the explicit connections but provide higher-level abstractions for common patterns. For example:
(def voice-pipeline
  (pipeline/define
    {:transcription [:transport-in :deepgram :user-context]
     :conversation  [:user-context :llm :assistant-context]
     :synthesis     [:llm :sentence-assembler :tts :audio-splitter :transport-out]}))
This would give you the benefits of explicit connections while making the system more user-friendly: the higher-level abstraction can generate the detailed connections automatically, preserving the performance and debugging benefits of explicit connections while addressing the user-friendliness concern that motivated considering implicit routing.
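A sketch of how that expansion could work (the `pipeline/define` API above is hypothetical; this only shows how linear stage vectors could compile down to the explicit :conns pairs already used):

```clojure
;; Expand each linear stage like [:transport-in :deepgram :user-context]
;; into explicit [[:from :out] [:to :in]] connection pairs, producing the
;; same :conns format the manual configuration uses today.
(defn expand-stage
  "Turn a linear stage vector into explicit out->in connection pairs."
  [stage]
  (mapv (fn [[from to]] [[from :out] [to :in]])
        (partition 2 1 stage)))

(defn expand-pipeline
  "Expand a map of named stages into one flat :conns vector."
  [stages]
  (vec (mapcat expand-stage (vals stages))))
```

Non-linear links (the sys-out channel, the aggregator sync cycle) would still need an escape hatch for raw connection pairs, but the common linear paths collapse into readable stage vectors.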
Some code from another project:

;;;;;;;;; Gladia ASR ;;;;;;;;;;;;;

(def ^:private gladia-url "wss://api.gladia.io/audio/text/audio-transcription")

;; this may be outdated
(def ^:private asr-configuration
  {:x_gladia_key api-key
   :sample_rate 8000
   :encoding "WAV/ULAW"
   :language_behaviour "manual"
   :language "romanian"
   ;; previously used options:
   ;; :frames_format "base64"
   ;; :word_timestamps true
   })
(defn transcript?
  [m]
  (= (:event m) "transcript"))

(defn final-transcription?
  [m]
  (and (transcript? m)
       (= (:type m) "final")))

(defn partial-transcription?
  [m]
  (and (transcript? m)
       (= (:type m) "partial")))
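For reference, these predicates expect messages shaped like the following (shape inferred from the code above, not from Gladia's documentation):

```clojure
;; Example messages in the shape the predicates above expect (inferred
;; from the predicates themselves, not from Gladia's docs):
(def final-msg   {:event "transcript" :type "final"   :transcription "Salut"})
(def partial-msg {:event "transcript" :type "partial" :transcription "Sal"})

(final-transcription? final-msg)     ;; => true
(partial-transcription? partial-msg) ;; => true
(final-transcription? partial-msg)   ;; => false
```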
(defrecord GladiaASR [ws asr-chan]
  ASR
  (send-audio-chunk [_ data]
    (send! ws {:frames (get-in data [:media :payload])} false))
  (close! [_]
    (ws/close! ws)))
(defn- make-gladia-asr!
  [{:keys [asr-text]}]
  ;; TODO: Handle reconnect & errors
  (let [ws @(websocket gladia-url
                       {:on-open (fn [ws]
                                   (prn "Open ASR Stream")
                                   (send! ws asr-configuration)
                                   (u/log ::gladia-asr-connected))
                        :on-message (fn [_ws ^HeapCharBuffer data _last?]
                                      (let [m (json/parse-if-json (str data))]
                                        (u/log ::gladia-msg :m m)
                                        (when (final-transcription? m)
                                          (u/log ::gladia-asr-transcription
                                                 :sentence (:transcription m)
                                                 :transcription m)
                                          (go (>! asr-text (:transcription m))))))
                        :on-error (fn [_ e]
                                    (u/log ::gladia-asr-error :exception e))
                        :on-close (fn [_ code reason]
                                    (u/log ::gladia-asr-closed :code code :reason reason))})]
    (->GladiaASR ws asr-text)))
(require '[wkok.openai-clojure.api :as openai])

(defn openai
  "Generate speech using OpenAI"
  ([input]
   (openai input {}))
  ([input config]
   (openai/create-speech (merge {:input input
                                 :voice "alloy"
                                 :response_format "wav"
                                 :model "tts-1"}
                                config)
                         {:version :http-2 :as :stream})))
(defn tts-stage-openai
  [sid in]
  (a/go-loop []
    (let [sentence (a/<! in)]
      (when-not (nil? sentence)
        (append-message! sid "assistant" sentence)
        (try
          (let [sentence-stream (-> (tts/openai sentence) (io/input-stream))
                ais (AudioSystem/getAudioInputStream sentence-stream)
                twilio-ais (audio/->twilio-phone ais)
                buffer (byte-array 256)]
            (loop []
              (let [bytes-read (.read twilio-ais buffer)]
                (when (pos? bytes-read)
                  ;; encode only the bytes actually read; a partial read would
                  ;; otherwise send stale bytes from the end of the buffer
                  (twilio/send-msg! (sessions/ws sid)
                                    sid
                                    (e/encode-base64 (java.util.Arrays/copyOf buffer bytes-read)))
                  (recur)))))
          (catch Exception e
            (u/log ::tts-stage-error :exception e)))
        (recur)))))
(def ^:private rime-tts-url "https://users.rime.ai/v1/rime-tts")

(defn rime
  "Generate speech using the rime-ai provider"
  [sentence]
  (-> {:method :post
       :url rime-tts-url
       :as :stream
       :body (json/->json-str {:text sentence
                               :reduceLatency false
                               :samplingRate 8000
                               :speedAlpha 1.0
                               :modelId "v1"
                               :speaker "Colby"})
       :headers {"Authorization" (str "Bearer " rime-api-key)
                 "Accept" "audio/x-mulaw"
                 "Content-Type" "application/json"}}
      (client/request)
      :body))
(defn rime-async
  "Generate speech using the rime-ai provider; outputs results on an async
  channel"
  [sentence]
  (let [stream (-> (rime sentence)
                   (io/input-stream))
        c (a/chan 1024)]
    (au/input-stream->chan stream c 1024)))
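The `au/input-stream->chan` helper is referenced above but not shown. A minimal sketch of what it presumably does (this is an assumed reconstruction, not the project's actual implementation): read fixed-size chunks off the stream onto the channel, closing the channel at end of stream.

```clojure
(require '[clojure.core.async :as a]
         '[clojure.java.io :as io])

;; Assumed reconstruction of au/input-stream->chan: copy the stream onto
;; `ch` in chunks of `chunk-size` bytes, close `ch` at EOF, return `ch`.
(defn input-stream->chan
  [^java.io.InputStream stream ch chunk-size]
  (a/thread
    (try
      (loop []
        (let [buffer (byte-array chunk-size)
              n (.read stream buffer)]
          (if (pos? n)
            ;; copy so each chunk carries only the bytes actually read
            (do (a/>!! ch (java.util.Arrays/copyOf buffer n))
                (recur))
            (a/close! ch))))
      (finally (.close stream))))
  ch)
```

Using `a/thread` (not a go block) keeps the blocking `.read` off the core.async dispatch threads.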
(defn tts-stage
  [sid in]
  (a/go-loop []
    (let [sentence (a/<! in)]
      (when-not (nil? sentence)
        (append-message! sid "assistant" sentence)
        (try
          (let [sentence-stream (-> (tts/rime sentence) (io/input-stream))
                buffer (byte-array 256)]
            (loop []
              (let [bytes-read (.read sentence-stream buffer)]
                (when (pos? bytes-read)
                  ;; encode only the bytes actually read; a partial read would
                  ;; otherwise send stale bytes from the end of the buffer
                  (twilio/send-msg! (sessions/ws sid)
                                    sid
                                    (e/encode-base64 (java.util.Arrays/copyOf buffer bytes-read)))
                  (recur)))))
          (catch Exception e
            (u/log ::tts-stage-error :exception e)))
        (recur)))))
This means implementing flow diagrams, e.g. a conversation flow defined as data:

{:initial-node :start
 :nodes
 {:start {:role_messages [{:role :system
                           :content "You are an order-taking assistant. You must ALWAYS use the available functions to progress the conversation. This is a phone conversation and your responses will be converted to audio. Keep the conversation friendly, casual, and polite. Avoid outputting special characters and emojis."}]
          :task_messages [{:role :system
                           :content "For this step, ask the user if they want pizza or sushi, and wait for them to use a function to choose. Start off by greeting them. Be friendly and casual; you're taking an order for food over the phone."}]
          :functions [{:type :function
                       :function {:name :choose_sushi
                                  :description "User wants to order sushi. Let's get that order started"}}]}}}
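A couple of helpers hint at how such a flow definition could drive the pipeline (these functions and the exact node keys are assumptions for illustration, not an existing API):

```clojure
;; Hypothetical helpers over a flow definition like the one above.
(defn node-messages
  "Messages to seed the LLM context with when entering a node:
  role messages first, then the node's task messages."
  [flow node-id]
  (let [node (get-in flow [:nodes node-id])]
    (vec (concat (:role_messages node) (:task_messages node)))))

(defn function-names
  "Set of function names the LLM may call from a node; used to validate
  tool calls and to decide which node to transition to next."
  [flow node-id]
  (->> (get-in flow [:nodes node-id :functions])
       (map (comp :name :function))
       set))
```

On each tool call, the pipeline would check the call against `function-names` for the current node, then swap the LLM context to `node-messages` of the node the function transitions to.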