A Clojure library for building real-time voice-enabled AI pipelines. voice-fn handles the orchestration of speech recognition, audio processing, and AI service integration with the elegance of functional programming.

shipclojure/voice-fn


voice-fn - Real-time Voice AI Pipeline Framework

Table of Contents

  1. Core Features
  2. Quick Start: Twilio WebSocket Example
  3. Supported Providers
    1. Text-to-Speech (TTS)
    2. Speech-to-Text (STT)
    3. Large Language Models (LLM)
  4. Key Concepts
    1. Flows
    2. Frames
    3. Processes
  5. Adding Custom Processes
  6. Built With
  7. License

voice-fn is a Clojure framework for building real-time voice AI applications using a data-driven, functional approach. Built on top of clojure.core.async.flow, it provides a composable pipeline architecture for processing audio, text, and AI interactions with built-in support for major AI providers.

This project is experimental; expect breaking changes.

Core Features

  • Flow-Based Architecture: Built on core.async.flow for robust concurrent processing
  • Data-First Design: Define AI pipelines as data structures for easy configuration and modification
  • Streaming Architecture: Efficient real-time audio and text processing
  • Extensible Processors: Simple protocol-based system for adding new processing components
  • Flexible Frame System: Type-safe message passing between pipeline components
  • Built-in Services: Ready-to-use integrations with major AI providers

Quick Start: Twilio WebSocket Example

(defn make-twilio-flow
  [in out]
  (let [encoding :ulaw
        sample-rate 8000
        sample-size-bits 8
        channels 1 ;; mono
        chunk-duration-ms 20
        llm-context {:messages [{:role "system"
                                 :content  "You are a voice agent operating via phone. Be concise. The input you receive comes from a speech-to-text (transcription) system that isn't always efficient and may send unclear text. Ask for clarification when you're unsure what the person said."}]
                     :tools [{:type :function
                              :function
                              {:name "get_weather"
                               :description "Get the current weather of a location"
                               :parameters {:type :object
                                            :required [:town]
                                            :properties {:town {:type :string
                                                                :description "Town for which to retrieve the current weather"}}
                                            :additionalProperties false}
                               :strict true}}]}]
    {:procs
     {:transport-in {:proc transport/twilio-transport-in
                     :args {:transport/in-ch in}}
      :deepgram-transcriptor {:proc asr/deepgram-processor
                              :args {:transcription/api-key (secret [:deepgram :api-key])
                                     :transcription/interim-results? true
                                     :transcription/vad-events? true
                                     :transcription/smart-format? true
                                     :transcription/model :nova-2
                                     :transcription/utterance-end-ms 1000
                                     :transcription/language :en
                                     :transcription/encoding :mulaw
                                     :transcription/sample-rate sample-rate}}
      :user-context-aggregator  {:proc context/user-aggregator-process
                                 :args {:llm/context llm-context}}
      :assistant-context-aggregator {:proc context/assistant-context-aggregator
                                     :args {:llm/context llm-context
                                            :debug? true
                                            :llm/registered-tools {"get_weather" {:async false
                                                                                  :tool (fn [{:keys [town]}] (str "The weather in " town " is 17 degrees celsius"))}}}}
      :llm {:proc llm/openai-llm-process
            :args {:openai/api-key (secret [:openai :new-api-sk])
                   :llm/model "gpt-4o-mini"}}

      :llm-sentence-assembler {:proc (flow/step-process #'context/sentence-assembler)}
      :tts {:proc tts/elevenlabs-tts-process
            :args {:elevenlabs/api-key (secret [:elevenlabs :api-key])
                   :elevenlabs/model-id "eleven_flash_v2_5"
                   :elevenlabs/voice-id "7sJPxFeMXAVWZloGIqg2"
                   :voice/stability 0.5
                   :voice/similarity-boost 0.8
                   :voice/use-speaker-boost? true
                   :flow/language :en
                   :audio.out/encoding encoding
                   :audio.out/sample-rate sample-rate}}
       :transport-out {:proc transport/realtime-transport-out-processor
                       :args {:transport/out-chan out}}}

     :conns [[[:transport-in :sys-out] [:deepgram-transcriptor :sys-in]]
             [[:transport-in :out] [:deepgram-transcriptor :in]]
             [[:deepgram-transcriptor :out] [:user-context-aggregator :in]]
             [[:user-context-aggregator :out] [:llm :in]]
             [[:llm :out] [:assistant-context-aggregator :in]]

             ;; cycle so that context aggregators are in sync
             [[:assistant-context-aggregator :out] [:user-context-aggregator :in]]
             [[:user-context-aggregator :out] [:assistant-context-aggregator :in]]

             [[:llm :out] [:llm-sentence-assembler :in]]
             [[:llm-sentence-assembler :out] [:tts :in]]

             [[:tts :out] [:transport-out :in]]
             [[:transport-in :sys-out] [:transport-out :sys-in]]]}))

(defn start-flow []
  (let [in (a/chan 1024)
        out (a/chan 1024)
        flow (flow/create-flow (make-twilio-flow in out))]
    (flow/start flow)
    {:in in :out out :flow flow}))

(defn stop-flow [{:keys [flow in out]}]
  (flow/stop flow)
  (a/close! in)
  (a/close! out))
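At the REPL, the lifecycle looks like the sketch below. The Twilio wiring comment is illustrative only: in a real deployment your WebSocket handler writes incoming media messages to `:in` and reads outbound audio from `:out`.

```clojure
;; Illustrative REPL session: start the pipeline, then shut it down.
(def running (start-flow))

;; ... pipe the Twilio WebSocket to (:in running) / (:out running) ...

(stop-flow running)
```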

Which roughly translates to:

Flow Diagram

See examples for more usages.

Supported Providers

Text-to-Speech (TTS)

  • ElevenLabs
    • Models: eleven_multilingual_v2, eleven_turbo_v2, eleven_flash_v2 and more.
    • Features: Real-time streaming, multiple voices, multilingual support

Speech-to-Text (STT)

  • Deepgram
    • Models: nova-2, nova-2-general, nova-2-meeting and more.
    • Features: Real-time transcription, punctuation, smart formatting

Large Language Models (LLM)

  • OpenAI
    • Models: gpt-4o-mini (fastest, cheapest), gpt-4, gpt-3.5-turbo and more
    • Features: Function calling, streaming responses

Key Concepts

Flows

The core building block of voice-fn pipelines:

  • Composed of processes connected by channels
  • Processes can be:
    • Input/output handlers
    • AI service integrations
    • Data transformers
  • Managed by core.async.flow for lifecycle control
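Concretely, a flow is described as plain data before it is instantiated: a map of named processes and a vector of `[[from-proc out-port] [to-proc in-port]]` connections. In this sketch, `source-process` and `sink-process` are hypothetical stand-ins for real processors such as those in the Quick Start:

```clojure
;; A flow definition is ordinary Clojure data: :procs names each process,
;; :conns wires output ports to input ports. The definition is quoted here
;; because the process names are placeholders.
(def flow-def
  '{:procs {:source {:proc source-process}
            :sink   {:proc sink-process}}
    :conns [[[:source :out] [:sink :in]]]})
```

Passing such a map to `flow/create-flow` turns the description into a runnable flow.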

Frames

The basic unit of data flow, representing typed messages like:

  • :audio/input-raw - Raw audio data
  • :transcription/result - Transcribed text
  • :llm/text-chunk - LLM response chunks
  • :system/start, :system/stop - Control signals

Each frame has a type and, optionally, a schema for the data it contains.

See frame.clj for all possible frames.
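As a rough sketch (the map keys here are assumptions for illustration, not the library's actual API — the real constructors and schemas live in frame.clj), a frame can be pictured as a map carrying a type and a payload:

```clojure
;; Sketch only: models a frame as a plain map with a type keyword
;; and a data payload.
(defn make-frame
  [type data]
  {:frame/type type
   :frame/data data})

(def transcript (make-frame :transcription/result "Hello there"))

(:frame/type transcript) ;; => :transcription/result
```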

Processes

Components that transform frames:

  • Define input/output requirements
  • Can maintain state
  • Use core.async for async processing
  • Implement the flow/process protocol

Adding Custom Processes

(defn custom-processor []
  (flow/process
   {:describe (fn [] {:ins {:in "Input channel"}
                      :outs {:out "Output channel"}})
    :init (fn [args] {:state args})
    :transform (fn [state _in-port msg]
                 ;; process-message is a placeholder for your own logic
                 [state {:out [(process-message msg)]}])}))
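A processor built this way is wired into a flow like any built-in one. The wiring below is a hypothetical fragment (proc names chosen for illustration), not a complete flow definition:

```clojure
;; Hypothetical wiring: place the custom processor between an input
;; transport and a downstream consumer.
{:procs {:custom {:proc (custom-processor)}}
 :conns [[[:transport-in :out] [:custom :in]]
         [[:custom :out] [:tts :in]]]}
```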

Built With

Acknowledgements

Voice-fn takes heavy inspiration from pipecat. Differences:

  • voice-fn uses a graph instead of a bidirectional queue for frame transport
  • voice-fn has a data-centric implementation: its processors are pure functions following the core.async.flow transform syntax

License

MIT
