- Core Features
- Quick Start: Twilio WebSocket Example
- Supported Providers
- Key Concepts
- Adding Custom Processes
- Built With
- License
voice-fn is a Clojure framework for building real-time voice AI applications using a data-driven, functional approach. Built on top of clojure.core.async.flow, it provides a composable pipeline architecture for processing audio, text, and AI interactions, with built-in support for major AI providers.
This project's status is experimental. Expect breaking changes.
- Flow-Based Architecture: Built on core.async.flow for robust concurrent processing
- Data-First Design: Define AI pipelines as data structures for easy configuration and modification
- Streaming Architecture: Efficient real-time audio and text processing
- Extensible Processors: Simple protocol-based system for adding new processing components
- Flexible Frame System: Type-safe message passing between pipeline components
- Built-in Services: Ready-to-use integrations with major AI providers
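Because a pipeline is plain data, ordinary Clojure map functions are enough to adjust it. The sketch below uses a trimmed-down, hypothetical config (`base-config` and `with-model` are illustrative names, not part of voice-fn) to show how the LLM model could be swapped without touching the rest of the pipeline.

```clojure
;; Hypothetical, trimmed-down pipeline config, for illustration only.
(def base-config
  {:procs {:llm {:proc :placeholder-llm-process ; stands in for a real process
                 :args {:llm/model "gpt-4o-mini"}}}
   :conns []})

(defn with-model
  "Returns a copy of `config` whose :llm process uses `model`."
  [config model]
  (assoc-in config [:procs :llm :args :llm/model] model))

(def gpt4-config (with-model base-config "gpt-4"))

(get-in gpt4-config [:procs :llm :args :llm/model]) ;; => "gpt-4"
(get-in base-config [:procs :llm :args :llm/model]) ;; => "gpt-4o-mini" (original is unchanged)
```

The original map is never mutated, so several variants of one pipeline can be derived from a single base definition.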
;; Quick start: a Twilio WebSocket voice pipeline described as data.
;; Aliases assumed: a = clojure.core.async, flow = clojure.core.async.flow.
;; transport, asr, context, llm, tts and the secret helper are assumed to be
;; required from the voice-fn project.
(defn make-twilio-flow
  [in out]
  (let [encoding :ulaw
        sample-rate 8000
        sample-size-bits 8
        channels 1 ;; mono
        chunk-duration-ms 20
        llm-context {:messages [{:role "system"
                                 :content "You are a voice agent operating via phone. Be concise. The input you receive comes from a speech-to-text (transcription) system that isn't always efficient and may send unclear text. Ask for clarification when you're unsure what the person said."}]
                     ;; Each tool is a map with a :type and a :function description
                     :tools [{:type :function
                              :function
                              {:name "get_weather"
                               :description "Get the current weather of a location"
                               :parameters {:type :object
                                            :required [:town]
                                            :properties {:town {:type :string
                                                                :description "Town for which to retrieve the current weather"}}
                                            :additionalProperties false}
                               :strict true}}]}]
    ;; The flow config is a map of :procs (the processes) and :conns (the wiring)
    {:procs
     {:transport-in {:proc transport/twilio-transport-in
                     :args {:transport/in-ch in}}
      ;; Speech-to-text
      :deepgram-transcriptor {:proc asr/deepgram-processor
                              :args {:transcription/api-key (secret [:deepgram :api-key])
                                     :transcription/interim-results? true
                                     :transcription/vad-events? true
                                     :transcription/smart-format? true
                                     :transcription/model :nova-2
                                     :transcription/utterance-end-ms 1000
                                     :transcription/language :en
                                     :transcription/encoding :mulaw
                                     :transcription/sample-rate sample-rate}}
      :user-context-aggregator {:proc context/user-aggregator-process
                                :args {:llm/context llm-context}}
      :assistant-context-aggregator {:proc context/assistant-context-aggregator
                                     :args {:llm/context llm-context
                                            :debug? true
                                            :llm/registered-tools {"get_weather" {:async false
                                                                                  :tool (fn [{:keys [town]}] (str "The weather in " town " is 17 degrees celsius"))}}}}
      :llm {:proc llm/openai-llm-process
            :args {:openai/api-key (secret [:openai :new-api-sk])
                   :llm/model "gpt-4o-mini"}}
      :llm-sentence-assembler {:proc (flow/step-process #'context/sentence-assembler)}
      ;; Text-to-speech
      :tts {:proc tts/elevenlabs-tts-process
            :args {:elevenlabs/api-key (secret [:elevenlabs :api-key])
                   :elevenlabs/model-id "eleven_flash_v2_5"
                   :elevenlabs/voice-id "7sJPxFeMXAVWZloGIqg2"
                   :voice/stability 0.5
                   :voice/similarity-boost 0.8
                   :voice/use-speaker-boost? true
                   :flow/language :en
                   :audio.out/encoding encoding
                   :audio.out/sample-rate sample-rate}}
      :transport-out {:proc transport/realtime-transport-out-processor
                      :args {:transport/out-chan out}}}
     ;; Each connection is [[from-proc out-port] [to-proc in-port]]
     :conns [[[:transport-in :sys-out] [:deepgram-transcriptor :sys-in]]
             [[:transport-in :out] [:deepgram-transcriptor :in]]
             [[:deepgram-transcriptor :out] [:user-context-aggregator :in]]
             [[:user-context-aggregator :out] [:llm :in]]
             [[:llm :out] [:assistant-context-aggregator :in]]
             ;; cycle so that context aggregators are in sync
             [[:assistant-context-aggregator :out] [:user-context-aggregator :in]]
             [[:user-context-aggregator :out] [:assistant-context-aggregator :in]]
             [[:llm :out] [:llm-sentence-assembler :in]]
             [[:llm-sentence-assembler :out] [:tts :in]]
             [[:tts :out] [:transport-out :in]]
             [[:transport-in :sys-out] [:transport-out :sys-in]]]}))

;; Creates the in/out channels, builds the flow and starts it.
(defn start-flow []
  (let [in (a/chan 1024)
        out (a/chan 1024)
        flow (flow/create-flow (make-twilio-flow in out))]
    (flow/start flow)
    {:in in :out out :flow flow}))

;; Stops the flow and closes both channels.
(defn stop-flow [{:keys [flow in out]}]
  (flow/stop flow)
  (a/close! in)
  (a/close! out))
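The flow above roughly translates to this pipeline: Twilio audio in -> Deepgram transcription -> user context aggregator -> OpenAI LLM -> sentence assembler -> ElevenLabs TTS -> audio out. The assistant context aggregator also receives the LLM output and is connected in a cycle with the user context aggregator so that both stay in sync.
See the examples for more usage patterns.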
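Since `:procs` and `:conns` are plain data, a connection that points at an undefined process can be caught before the flow is created. The following is a minimal sketch, assuming the `{:procs ... :conns ...}` shape used in the example; `undefined-procs` is a hypothetical helper and not part of voice-fn or core.async.flow.

```clojure
(defn undefined-procs
  "Returns the set of process ids referenced in :conns but missing from :procs.
  Each connection has the shape [[from-proc out-port] [to-proc in-port]]."
  [{:keys [procs conns]}]
  (let [defined    (set (keys procs))
        referenced (set (for [[[from _] [to _]] conns
                              pid [from to]]
                          pid))]
    (set (remove defined referenced))))

(undefined-procs
 {:procs {:tts {} :transport-out {}}
  :conns [[[:tts :out] [:transport-out :in]]
          [[:audio-splitter :out] [:realtime-out :in]]]})
;; => #{:audio-splitter :realtime-out}
```

An empty set means every connection refers to a process that exists in `:procs`.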
- ElevenLabs (text-to-speech)
  - Models: eleven_flash_v2_5 and more.
  - Features: Real-time streaming, multiple voices, multilingual support
- Deepgram (speech-to-text)
  - Models: nova-2 and more.
  - Features: Real-time transcription, punctuation, smart formatting
- OpenAI (LLM)
  - Models: gpt-4o-mini (fastest, cheapest), gpt-4 and more
  - Features: Function calling, streaming responses
Flows are the core building block of voice-fn pipelines:
- Composed of processes connected by channels
- Processes can be:
  - Input/output handlers
  - AI service integrations
  - Data transformers
- Managed by core.async.flow for lifecycle control
Frames are the basic unit of data flow, representing typed messages like:
- Raw audio data
- :transcription/result - Transcribed text
- :llm/text-chunk - LLM response chunks
- :system/start - Control signals
Each frame has a type and optionally a schema for the data contained in it.
See frame.clj for all possible frames.
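As a rough illustration of the idea, a frame can be pictured as a map that carries a type keyword and a payload. The key names (`:frame/type`, `:frame/data`) and both helpers below are assumptions made for this sketch; frame.clj holds the actual definitions.

```clojure
(defn make-frame
  "Builds an illustrative frame: a typed message with a payload."
  [frame-type data]
  {:frame/type frame-type
   :frame/data data})

(defn frame-of-type?
  "True when `frame` carries the given type keyword."
  [frame-type frame]
  (= frame-type (:frame/type frame)))

(def transcript
  (make-frame :transcription/result "What's the weather in Paris?"))

(frame-of-type? :transcription/result transcript) ;; => true
(frame-of-type? :llm/text-chunk transcript)       ;; => false
```

Dispatching on a type keyword like this lets each process ignore the frames it does not handle.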
Processes are components that transform frames:
- Define input/output requirements
- Can maintain state
- Use core.async for async processing
- Implement the core.async.flow process interface (:describe, :init, :transform)
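;; `process-message` is a placeholder for your own message-handling function.
(defn custom-processor []
  (flow/process
   {:describe (fn [] {:ins {:in "Input channel"}
                      :outs {:out "Output channel"}})
    :init (fn [args] {:state args})
    ;; Returns [new-state {out-port [messages]}]
    :transform (fn [state in msg]
                 [state {:out [(process-message msg)]}])}))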
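Because a transform is a pure function from `[state in msg]` to `[new-state outputs]`, it can be exercised at the REPL without starting a flow. Below is a minimal sketch, where `shout-transform` is a hypothetical stand-in for the `:transform` function above.

```clojure
(require '[clojure.string :as string])

(defn shout-transform
  "Hypothetical transform: upper-cases each message and counts how many it has seen."
  [state _in msg]
  [(update state :seen (fnil inc 0))
   {:out [(string/upper-case msg)]}])

(shout-transform {} :in "hello")
;; => [{:seen 1} {:out ["HELLO"]}]
```

The returned state can be passed into the next call, which makes it straightforward to unit-test a sequence of messages.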
- core.async - Concurrent processing
- core.async.flow - Flow control
- Hato - WebSocket support
- Malli - Schema validation
Voice-fn takes heavy inspiration from pipecat. Differences:
- voice-fn uses a graph instead of a bidirectional queue for frame transport
- voice-fn has a data-centric implementation. The processors in voice-fn are pure functions in the core.async.flow transform syntax