

        <!DOCTYPE html>
            <html lang="en">
            <head>
                <meta charset="UTF-8">
                <meta name="viewport" content="width=device-width, initial-scale=1.0">
                <meta http-equiv="X-UA-Compatible" content="ie=edge">
                <title>Learning "Playable" State-Space Models from Audio</title>
                <link rel="preconnect" href="https://fonts.googleapis.com">
                <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
                <link href="https://fonts.googleapis.com/css2?family=Gowun+Batang:wght@400;700&display=swap" rel="stylesheet">
                <script src="https://cdn.jsdelivr.net/gh/JohnVinyard/web-components@v0.0.12/build/components/bundle.js"></script>
                <style>
                    body {
                        font-family: "Gowun Batang", serif;
                        margin: 5vh 5vw;
                        color: #333;
                        background-color: #f0f0f0;
                        font-size: 1.1em;
                    }
                    .back-to-top {
                        position: fixed;
                        bottom: 20px;
                        right: 20px;
                        background-color: #333;
                        color: #f0f0f0;
                        padding: 10px;
                        font-size: 0.9em;
                    }
                    img {
                        width: 100%;
                    }
                    ul {
                        list-style-type: none;
                        padding-inline-start: 20px;
                        font-size: 20px;
                    }
                    a {
                        color: #660000;
                    }
                    a:visited {
                        color: #000066;
                    }
                    caption {
                        text-decoration: underline;
                        font-size: 0.6em;
                    }
                    blockquote {
                        background-color: #d5d5d5;
                        padding: 2px 10px;
                    }
                </style>
            </head>
            <body>
                <h1>Learning "Playable" State-Space Models from Audio</h1>
<p><caption>Table of Contents</caption></p>
<ul>
<li><a href="#The Model">The Model</a><ul>
<li><a href="#Formula">Formula</a></li>
<li><a href="#The Experiment">The Experiment</a></li>
</ul>
</li>
<li><a href="#The &lt;code&gt;SSM&lt;/code&gt; Class">The <code>SSM</code> Class</a></li>
<li><a href="#The &lt;code&gt;OverfitControlPlane&lt;/code&gt; Class">The <code>OverfitControlPlane</code> Class</a></li>
<li><a href="#The Training Process">The Training Process</a><ul>
<li><a href="#Reconstruction Loss">Reconstruction Loss</a></li>
<li><a href="#Sparsity Loss">Sparsity Loss</a></li>
</ul>
</li>
<li><a href="#Examples">Examples</a><ul>
<li><a href="#Example 1">Example 1</a></li>
<li><a href="#Example 2">Example 2</a></li>
<li><a href="#Example 3">Example 3</a></li>
<li><a href="#Example 4">Example 4</a></li>
</ul>
</li>
<li><a href="#Code For Generating this Article">Code For Generating this Article</a></li>
<li><a href="#Training Code">Training Code</a></li>
<li><a href="#Conclusion">Conclusion</a><ul>
<li><a href="#Future Work">Future Work</a></li>
</ul>
</li>
<li><a href="#Cite this Article">Cite this Article</a></li>
</ul>
                
<p>This work attempts to reproduce a short segment of "natural" (i.e., produced by acoustic
instruments or physical objects in the world) audio by decomposing it into two distinct pieces:</p>
<ol>
<li>A state-space model simulating the resonances of the system</li>
<li>A sparse control signal, representing energy injected into the system</li>
</ol>
<p>The control signal can be thought of as roughly corresponding to a musical score, and the state-space model
can be thought of as the dynamics/resonances of the musical instrument and the room in which it was played.</p>
<p>It's notable that in this experiment (unlike
<a href="https://blog.cochlea.xyz/siam.html">my other recent work</a>), <strong>there is no learned "encoder"</strong>.  We simply "overfit"
parameters to a single audio sample, by minimizing a combination of <a href="#Sparsity Loss">reconstruction and sparsity losses</a>.</p>
<p>As a sneak-peek, here's a novel sound created by feeding a random, sparse control signal into
a state-space model "extracted" from an audio segment from Beethoven's "Piano Sonata No 15 in D major".</p>
<p>Feel free to <a href="#Examples">jump ahead</a> if you're curious to hear all of the audio examples first!</p>

<audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_ef5e535057bbc0eae1f0f310d42a892cb5c8a96e"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<p>First, we'll set up high-level parameters for the experiment.</p>

<code-block language="python"># the size, in samples, of the audio segment we'll overfit
n_samples = 2 ** 18

# the samplerate, in Hz, of the audio signal
samplerate = 22050

# derived: the total number of seconds of audio
n_seconds = n_samples / samplerate

# the size, in samples, of each half-lapped audio "frame"
window_size = 512

# the dimensionality of the control plane or control signal
control_plane_dim = 32

# the dimensionality of the state vector, or hidden state
state_dim = 128</code-block>
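<p>As a quick sanity check, here's a small sketch of the quantities these parameters imply.  The names 
<code>hop_size</code>, <code>n_frames</code> and <code>frames_per_second</code> are labels introduced here for 
illustration only; the hop of half a window is an assumption that follows from the frames being half-lapped.</p>

```python
# Parameters copied from the block above
n_samples = 2 ** 18
samplerate = 22050
window_size = 512

# Half-lapped frames advance by half a window each step (assumed hop size)
hop_size = window_size // 2

# Number of frames covering the segment, and the resulting control rate
n_frames = n_samples // hop_size
n_seconds = n_samples / samplerate
frames_per_second = samplerate / hop_size

print(f'{n_seconds:.2f} seconds, {n_frames} frames, {frames_per_second:.1f} frames per second')
# prints: 11.89 seconds, 1024 frames, 86.1 frames per second
```

<p>In other words, the control signal we learn later is a 32 x 1024 matrix: one 32-dimensional vector for each 
of the 1024 frames.</p>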


<a id="The Model"></a>
<h1>The Model</h1>
<p>State-space models look a lot like 
<a href="https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks">RNNs (recurrent neural networks)</a> 
in that they are auto-regressive and have a hidden/inner state vector that represents
something like the "memory" of the model.  In this example, I tend to think of the
hidden state as the stored energy of the resonant object.  A human
musician has injected energy into the system by striking, plucking, or dragging a bow and the instrument stores that 
energy and "leaks" it out in ways that are (hopefully) pleasing to the ear.</p>

<a id="Formula"></a>
<h2>Formula</h2>
<p>Formally, state-space models take the following form (in pseudocode).</p>
<p>First, we initialize the state/hidden vector:</p>
<p><code>state_vector = zeros(state_dim)</code></p>
<p>Then, we transform the input and the <em>previous hidden state</em> into a <em>new</em> hidden state.  This is where the 
"auto-regressive" or recursive nature of the model comes into view;  notice that <code>state_vector</code> is on both sides of the 
equation.  <strong>There's a feedback loop happening here</strong>, which is a hallmark of 
<a href="https://www.osar.fr/notes/waveguides/">waveguide synthesis</a> and other physical modelling synthesis.</p>
<p><code>state_vector = (state_vector * state_matrix) + (input * input_matrix)</code></p>
<p>Finally, we map the hidden state and the input into a new output:</p>
<p><code>output_vector = (state_vector * output_matrix) + (input * direct_matrix)</code></p>
<p>This process is repeated until we have no more inputs to process.  The <code>direct_matrix</code> is a mapping from
inputs directly to the output vector, rather like a "skip connection" in other neural network architectures.</p>
<p>As long as the system loses energy over time rather than gaining it (this isn't enforced explicitly; roughly, the 
eigenvalues of <code>state_matrix</code> must have magnitude below one), it's easy to see how
the exponential decay we observe in resonant objects emerges from our model.</p>
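<p>To make the decay concrete, here's a minimal NumPy sketch of the recurrence above.  The toy system is 
hypothetical and hand-built, not learned: the state matrix is a 2D rotation scaled by a factor of <code>0.9</code>, 
and the helper name <code>run_ssm</code> is introduced here just for illustration.</p>

```python
import numpy as np


def run_ssm(inputs, state_matrix, input_matrix, output_matrix, direct_matrix):
    """Apply the state-space recurrence from the pseudocode to a sequence of input vectors."""
    state_vector = np.zeros(state_matrix.shape[0])
    outputs = []
    for inp in inputs:
        state_vector = (state_vector @ state_matrix) + (inp @ input_matrix)
        outputs.append((state_vector @ output_matrix) + (inp @ direct_matrix))
    return np.stack(outputs)


# Hypothetical toy system: a rotation that shrinks the state by 10% per step
decay = 0.9
theta = 2 * np.pi * 0.05
rotation = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta), np.cos(theta)],
])
state_matrix = decay * rotation
input_matrix = np.array([[1.0, 0.0]])     # input_dim=1 -> state_dim=2
output_matrix = np.array([[1.0], [0.0]])  # state_dim=2 -> output_dim=1
direct_matrix = np.zeros((1, 1))          # no direct path in this toy example

# A single impulse of energy at the first step, then silence
inputs = np.zeros((50, 1))
inputs[0, 0] = 1.0

outputs = run_ssm(inputs, state_matrix, input_matrix, output_matrix, direct_matrix)

# The output is a sinusoid that never exceeds this exponentially decaying envelope
envelope = decay ** np.arange(len(outputs))
```

<p>A single "strike" keeps ringing long after the input has gone silent, as a sinusoid bounded by 
<code>0.9 ** t</code>.  This is the kind of behaviour we hope the learned matrices will exhibit.</p>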

<a id="The Experiment"></a>
<h2>The Experiment</h2>
<p>We'll build a <a href="https://pytorch.org/">PyTorch</a> model that will learn the four matrices described 
above, along with a sparse control signal, by "overfitting" the model to a single segment of ~12 seconds of audio drawn 
from my favorite source for acoustic musical signals, the 
<a href="https://zenodo.org/records/5120004#.Yhxr0-jMJBA">MusicNet dataset</a>.  For the final example, we'll try fitting
a different kind of "natural" acoustic signal, human speech, just for funsies!</p>
<p>Even though we're <em>overfitting</em> a single audio signal, the sparsity term serves 
as a 
<a href="https://www.reddit.com/r/learnmachinelearning/comments/w7yrog/what_regularization_does_to_a_machine_learning/">regularizer</a> 
that still forces the model to generalize in some way.  Our working theory is that the control signal must be <em>sparse</em>, 
which places certain constraints on the type of matrices the model must learn to accurately reproduce the audio.  If I
strike a piano key, the sound does not die away immediately and I do not have to continue to "drive" the sound by
continually injecting energy;  the strings and the body of the piano continue to resonate for quite some time.  </p>
<p>While it hasn't shown up in the code we've seen so far, we'll be using 
<a href="https://github.com/JohnVinyard/conjure"><code>conjure</code></a> to monitor the training process while iterating on the code, and 
eventually to generate this article once things have settled.</p>
<p>We'll start with some boring imports.</p>

<code-block language="python">from io import BytesIO
from typing import Dict, Union

import numpy as np
import torch
from torch import nn
from itertools import count

from data import get_one_audio_segment, get_audio_segment
from modules import max_norm, flattened_multiband_spectrogram
from modules.overlap_add import overlap_add
from torch.optim import Adam

from util import device, encode_audio

from conjure import (
    logger, LmdbCollection, serve_conjure, SupportedContentType, loggers,
    NumpySerializer, NumpyDeserializer, S3Collection,
    conjure_article, CitationComponent, numpy_conjure, AudioComponent, pickle_conjure, ImageComponent,
    CompositeComponent)
from torch.nn.utils.clip_grad import clip_grad_value_
from argparse import ArgumentParser
from matplotlib import pyplot as plt

remote_collection_name = 'state-space-model-demo'</code-block>


<a id="The <code>SSM</code> Class"></a>
<h1>The <code>SSM</code> Class</h1>
<p>Now, for the good stuff!  We'll define our simple State-Space Model as an 
<a href="https://pytorch.org/docs/stable/generated/torch.nn.Module.html"><code>nn.Module</code></a>-derived class with four parameters 
corresponding to each of the four matrices.</p>
<p>Note that there is a slight deviation from the canonical SSM in that we have a fifth matrix, which projects from our
"control plane" for the instrument into the dimension of a single audio frame.</p>

<code-block language="python">class SSM(nn.Module):
    """
    A state-space model-like module, with one additional matrix, used to project the control
    signal into the shape of each audio frame.

    The final output is produced by overlap-adding the windows/frames of audio into a single
    1D signal.
    """

    def __init__(self, control_plane_dim: int, input_dim: int, state_matrix_dim: int):
        super().__init__()
        self.state_matrix_dim = state_matrix_dim
        self.input_dim = input_dim
        self.control_plane_dim = control_plane_dim

        # matrix mapping control signal to audio frame dimension
        self.proj = nn.Parameter(
            torch.zeros(control_plane_dim, input_dim).uniform_(-0.01, 0.01)
        )

        # state matrix mapping previous state vector to next state vector
        self.state_matrix = nn.Parameter(
            torch.zeros(state_matrix_dim, state_matrix_dim).uniform_(-0.01, 0.01))

        # matrix mapping audio frame to hidden/state vector dimension
        self.input_matrix = nn.Parameter(
            torch.zeros(input_dim, state_matrix_dim).uniform_(-0.01, 0.01))

        # matrix mapping hidden/state vector to audio frame dimension
        self.output_matrix = nn.Parameter(
            torch.zeros(state_matrix_dim, input_dim).uniform_(-0.01, 0.01)
        )

        # skip-connection-like matrix mapping input audio frame to next
        # output audio frame
        self.direct_matrix = nn.Parameter(
            torch.zeros(input_dim, input_dim).uniform_(-0.01, 0.01)
        )

    def forward(self, control: torch.Tensor) -> torch.Tensor:
        batch, cpd, frames = control.shape
        assert cpd == self.control_plane_dim

        # (batch, control_plane_dim, frames) -> (batch, frames, control_plane_dim)
        control = control.permute(0, 2, 1)

        # project each control vector into the audio frame dimension
        proj = control @ self.proj
        assert proj.shape == (batch, frames, self.input_dim)

        results = []
        state_vec = torch.zeros(batch, self.state_matrix_dim, device=control.device)

        # run the recurrence one frame at a time
        for i in range(frames):
            inp = proj[:, i, :]
            state_vec = (state_vec @ self.state_matrix) + (inp @ self.input_matrix)
            output = (state_vec @ self.output_matrix) + (inp @ self.direct_matrix)
            results.append(output.view(batch, 1, self.input_dim))

        # (batch, frames, input_dim) -> (batch, 1, frames, input_dim)
        result = torch.cat(results, dim=1)
        result = result[:, None, :, :]

        result = overlap_add(result)
        return result[..., :frames * (self.input_dim // 2)]</code-block>
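<p>The <code>overlap_add</code> helper is imported from <code>modules.overlap_add</code> and isn't shown in this 
article.  As a rough sketch of what the last two lines of <code>forward</code> accomplish, here's a plain NumPy 
version of half-lapped overlap-add.  This is an assumption about its behaviour, not the actual implementation 
(which may, for instance, also apply a window to each frame), and the function name is made up for illustration.</p>

```python
import numpy as np


def overlap_add_half_lapped(frames: np.ndarray) -> np.ndarray:
    """
    Sum frames of shape (n_frames, window_size) into a 1D signal, with each
    frame starting half a window after the previous one.
    """
    n_frames, window_size = frames.shape
    hop_size = window_size // 2
    signal = np.zeros(hop_size * (n_frames + 1))
    for i, frame in enumerate(frames):
        start = i * hop_size
        signal[start:start + window_size] += frame
    # Keep hop_size samples per frame, mirroring the slice at the end of SSM.forward
    return signal[:n_frames * hop_size]


# Four frames of eight samples each become 4 * 4 = 16 output samples
frames = np.ones((4, 8))
signal = overlap_add_half_lapped(frames)
```

<p>With a frame size of 512 and 1024 frames, the same arithmetic gives 1024 * 256 = 2 ** 18 samples, which matches 
<code>n_samples</code>.</p>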


<a id="The <code>OverfitControlPlane</code> Class"></a>
<h1>The <code>OverfitControlPlane</code> Class</h1>
<p>This model encapsulates an <code>SSM</code> instance, and also has a parameter for the sparse "control plane" that will serve
as the input energy for our resonant model.  I think of this as a time-series of vectors that describe the different
ways that energy can be injected into the model, e.g., you might have individual dimensions representing different
keys on a piano, or strings on a cello.  </p>
<p>I don't expect the control signals learned here to be quite <em>that</em> clear-cut
and interpretable, but you might notice that the random audio samples produced using the learned models
do seem to disentangle some characteristics of the instruments being played!</p>

<code-block language="python">class OverfitControlPlane(nn.Module):
    """
    Encapsulates parameters for control signal and state-space model
    """

    def __init__(self, control_plane_dim: int, input_dim: int, state_matrix_dim: int, n_samples: int):
        super().__init__()
        self.ssm = SSM(control_plane_dim, input_dim, state_matrix_dim)
        self.n_samples = n_samples
        self.n_frames = int(n_samples / (input_dim // 2))

        self.control = nn.Parameter(
            torch.zeros(1, control_plane_dim, self.n_frames).uniform_(-0.01, 0.01))

    @property
    def control_signal_display(self) -> np.ndarray:
        return self.control_signal.data.cpu().numpy().reshape((-1, self.n_frames))

    @property
    def control_signal(self) -> torch.Tensor:
        return torch.relu(self.control)

    def random(self, p=0.001):
        """
        Produces a random, sparse control signal, emulating short, transient bursts
        of energy into the system modelled by the `SSM`
        """
        cp = torch.zeros_like(self.control, device=self.control.device).bernoulli_(p=p)
        audio = self.forward(sig=cp)
        return max_norm(audio)

    def forward(self, sig=None):
        """
        Inject energy defined by `sig` (or by the `control` parameters encapsulated by this class)
        into the system modelled by `SSM`
        """
        return self.ssm.forward(sig if sig is not None else self.control_signal)</code-block>
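<p>To get a feel for just how sparse the signal produced by <code>random</code> is, here's a small NumPy sketch 
of the same Bernoulli sampling with <code>p=0.001</code>.  It stands in for the <code>bernoulli_</code> call above 
rather than reproducing it, and the variable names are mine.</p>

```python
import numpy as np

control_plane_dim = 32
n_frames = 1024
p = 0.001  # probability that any single control element is "on"

rng = np.random.default_rng(0)

# Each element is independently 1 with probability p and 0 otherwise
random_control = (rng.random((control_plane_dim, n_frames)) < p).astype(np.float32)

expected_events = control_plane_dim * n_frames * p
actual_events = int(random_control.sum())
print(f'expected about {expected_events:.1f} events, drew {actual_events}')
```

<p>On average that's only around 33 unit-sized "strikes" spread over roughly twelve seconds and 32 control 
channels; everything else you hear in the random examples comes from the resonance of the learned model.</p>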


<a id="The Training Process"></a>
<h1>The Training Process</h1>
<p>To train the <code>OverfitControlPlane</code> model, we randomly initialize parameters for <code>SSM</code> and the learned
control signal, and minimize a loss that consists of a reconstruction term and a sparsity term via gradient
descent.  For this experiment, we're using the <a href="https://pytorch.org/docs/stable/generated/torch.optim.Adam.html"><code>Adam</code></a>
optimizer with a learning rate of <code>1e-2</code>.</p>


<a id="Reconstruction Loss"></a>
<h2>Reconstruction Loss</h2>
<p>The first loss term is a simple reconstruction loss, consisting of the l1 norm of the difference between
two multi-samplerate and multi-resolution spectrograms. </p>

<code-block language="python">def transform(x: torch.Tensor):
    """
    Decompose audio into sub-bands of varying sample rate, and compute spectrogram with
    varying time-frequency tradeoffs on each band.
    """
    return flattened_multiband_spectrogram(
        x,
        stft_spec={
            'long': (128, 64),
            'short': (64, 32),
            'xs': (16, 8),
        },
        smallest_band_size=512)


def reconstruction_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """
    Compute the l1 norm of the difference between the `recon` and `target`
    representations
    """
    fake_spec = transform(recon)
    real_spec = transform(target)
    return torch.abs(fake_spec - real_spec).sum()</code-block>
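<p>The <code>flattened_multiband_spectrogram</code> helper isn't shown in this article, so here's a simplified 
stand-in to illustrate the idea of comparing spectrograms at several time-frequency resolutions.  It is 
<em>not</em> the transform used in the experiment: it skips the decomposition into sub-bands entirely, and it 
assumes that each tuple in <code>stft_spec</code> means <code>(window_size, hop_size)</code>.</p>

```python
import numpy as np


def stft_magnitude(x: np.ndarray, window_size: int, hop_size: int) -> np.ndarray:
    """Magnitude spectrogram computed from Hann-windowed frames of a 1D signal."""
    window = np.hanning(window_size)
    n_frames = 1 + (len(x) - window_size) // hop_size
    frames = np.stack([
        x[i * hop_size:i * hop_size + window_size] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=-1))


def multi_resolution_l1(recon, target, specs=((128, 64), (64, 32), (16, 8))) -> float:
    """Sum of L1 distances between magnitude spectrograms at each resolution."""
    total = 0.0
    for window_size, hop_size in specs:
        recon_spec = stft_magnitude(recon, window_size, hop_size)
        target_spec = stft_magnitude(target, window_size, hop_size)
        total += float(np.abs(recon_spec - target_spec).sum())
    return total


# A short 440 Hz tone compared against itself and against silence
t = np.arange(2048) / 22050
target = np.sin(2 * np.pi * 440 * t)
same = multi_resolution_l1(target, target)
silent = multi_resolution_l1(np.zeros_like(target), target)
```

<p>Long windows resolve pitch well and short windows resolve onsets well, so summing over several resolutions 
penalizes errors in both.</p>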


<a id="Sparsity Loss"></a>
<h2>Sparsity Loss</h2>
<p>Ideally, we want the model to resonate, or store and "leak" energy slowly in the way that an
acoustic instrument might.  This means that the control signal is not dense and continually "driving" the instrument, 
but injecting energy infrequently in ways that encourage the natural resonances of the physical object.  </p>
<p>I'm not fully satisfied with this approach; e.g., it tends to pull away from what might be a nice, 
natural control signal for a violin or other bowed instrument.  In my mind, this might look like a sub-20 Hz sawtooth 
wave that would "drive" the string, continually catching and releasing as the bow drags across the string.</p>
<p>For now, the sparsity term <em>does</em> seem to encourage models that resonate, but my intuition is that
there is a better, more nuanced approach that could handle bowed string instruments and wind instruments, 
in addition to percussive instruments, where this approach really seems to shine.</p>

<code-block language="python">def sparsity_loss(c: torch.Tensor) -> torch.Tensor:
    """
    Compute the l1 norm of the control signal
    """
    return torch.abs(c).sum() * 1e-5


def to_numpy(x: torch.Tensor):
    return x.data.cpu().numpy()


def construct_experiment_model(state_dict: Union[None, dict] = None) -> OverfitControlPlane:
    """
    Construct a randomly initialized `OverfitControlPlane` instance, ready for training/overfitting
    """
    model = OverfitControlPlane(
        control_plane_dim=control_plane_dim,
        input_dim=window_size,
        state_matrix_dim=state_dim,
        n_samples=n_samples
    )
    model = model.to(device)

    if state_dict is not None:
        model.load_state_dict(state_dict)

    return model</code-block>
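<p>To illustrate what <code>sparsity_loss</code> rewards, here's a small NumPy sketch comparing the penalty for 
a dense control signal against a sparse one of the same shape.  Both signals are made up for illustration, and 
<code>sparsity_penalty</code> simply mirrors the weighted L1 norm above.</p>

```python
import numpy as np


def sparsity_penalty(control: np.ndarray, weight: float = 1e-5) -> float:
    """Weighted L1 norm, mirroring `sparsity_loss` above."""
    return float(np.abs(control).sum() * weight)


shape = (32, 1024)  # (control_plane_dim, n_frames)

# Dense: every control channel is "on" at every frame
dense = np.ones(shape)

# Sparse: a single unit-sized event per control channel
sparse = np.zeros(shape)
sparse[np.arange(32), np.arange(32) * 32] = 1.0

dense_penalty = sparsity_penalty(dense)    # 32768 * 1e-5
sparse_penalty = sparsity_penalty(sparse)  # 32 * 1e-5
```

<p>The dense signal costs 1024 times more than the sparse one, so the optimizer is nudged toward explaining as much 
of the audio as it can through the resonance of the model rather than through continuous input.</p>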


<a id="Examples"></a>
<h1>Examples</h1>
<p>Finally, some trained models to listen to!  Each example consists of the following:</p>
<ol>
<li>the original audio signal from the MusicNet dataset</li>
<li>the sparse control signal for the reconstruction</li>
<li>the reconstructed audio, produced using the sparse control signal and the learned state-space model</li>
<li>a novel, random audio signal produced using the learned state-space model and a random control signal</li>
</ol>
<p>Just for fun, we attempt to learn the fourth example from a speech signal from the 
<a href="https://keithito.com/LJ-Speech-Dataset/">LJ Speech Dataset</a>.</p>


<a id="Example 1"></a>
<h2>Example 1</h2>

<a id="Original Audio"></a>
<h3>Original Audio</h3>
<p>A random 11.89-second segment of audio drawn from the MusicNet dataset</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_dcdb7a5df7be5f48a1d958a1abee62f639315937"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Reconstruction"></a>
<h3>Reconstruction</h3>
<p>Reconstruction of the original audio after overfitting the model for 1500 iterations</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_6acba920c2c10488c9dd983126a45d8505d916bd"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Random Audio"></a>
<h3>Random Audio</h3>
<p>Signal produced by a random, sparse control signal after overfitting the model for 1500 iterations</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_ef5e535057bbc0eae1f0f310d42a892cb5c8a96e"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Control Signal"></a>
<h3>Control Signal</h3>
<p>Sparse control signal for the original audio after overfitting the model for 1500 iterations</p>

        <img src="https://state-space-model-demo.s3.amazonaws.com/matrix_8b12fb659733e4998f13f2b921c65753e3e94060"></img>


<a id="Example 2"></a>
<h2>Example 2</h2>

<a id="Original Audio"></a>
<h3>Original Audio</h3>
<p>A random 11.89-second segment of audio drawn from the MusicNet dataset</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_f4d91dcbdd7f3ceb1a8d71ff78a1fb82ecdbae1d"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Reconstruction"></a>
<h3>Reconstruction</h3>
<p>Reconstruction of the original audio after overfitting the model for 1500 iterations</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_f5aabdd2a9fbcb6278bff362b5efab88b8cf33e6"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Random Audio"></a>
<h3>Random Audio</h3>
<p>Signal produced by a random, sparse control signal after overfitting the model for 1500 iterations</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_5616ee4501ecb4d6f5d5a36a1e37c96d94dc4289"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Control Signal"></a>
<h3>Control Signal</h3>
<p>Sparse control signal for the original audio after overfitting the model for 1500 iterations</p>

        <img src="https://state-space-model-demo.s3.amazonaws.com/matrix_5ec6e9899427f9534cb14c5c8f32a16e3dc5680a"></img>


<a id="Example 3"></a>
<h2>Example 3</h2>

<a id="Original Audio"></a>
<h3>Original Audio</h3>
<p>A random 11.89-second segment of audio drawn from the MusicNet dataset</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_ca0ff2a269eabe178cb3cdd41fc0c7b1ba97bebb"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Reconstruction"></a>
<h3>Reconstruction</h3>
<p>Reconstruction of the original audio after overfitting the model for 1500 iterations</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_9305d7914d9088d2ebbe4c17f5f386fc0414ae36"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Random Audio"></a>
<h3>Random Audio</h3>
<p>Signal produced by a random, sparse control signal after overfitting the model for 1500 iterations</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_92fd95cfd311a931457876c103538791a1a2c42a"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Control Signal"></a>
<h3>Control Signal</h3>
<p>Sparse control signal for the original audio after overfitting the model for 1500 iterations</p>

        <img src="https://state-space-model-demo.s3.amazonaws.com/matrix_4cc0f5a2c8882861c490a894c476451cf79aeb45"></img>


<a id="Example 4"></a>
<h2>Example 4</h2>

<a id="Original Audio"></a>
<h3>Original Audio</h3>
<p>An 11.89-second segment of audio drawn from the LJ Speech Dataset</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_5cce08357354294bf1c96c188c5b54ebe1fc9204"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Reconstruction"></a>
<h3>Reconstruction</h3>
<p>Reconstruction of the original audio after overfitting the model for 1500 iterations</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_7da87195190777f9080cea934bff2ecfce4e79ab"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Random Audio"></a>
<h3>Random Audio</h3>
<p>Signal produced by a random, sparse control signal after overfitting the model for 1500 iterations</p>

        <audio-view
            src="https://state-space-model-demo.s3.amazonaws.com/audio_513da2a21c2f507c2722e2da5ddf53c610576e08"
            height="200"
            samples="512"
            scale="1"
            controls
        ></audio-view>

<a id="Control Signal"></a>
<h3>Control Signal</h3>
<p>Sparse control signal for the original audio after overfitting the model for 1500 iterations</p>

        <img src="https://state-space-model-demo.s3.amazonaws.com/matrix_de39724b888a2788242b0bdb28ba15f7dba9bde8"></img>


<a id="Code For Generating this Article"></a>
<h1>Code For Generating this Article</h1>
<p>What follows is the code used to train the model and produce the article you're reading.  It uses 
the <a href="https://github.com/JohnVinyard/conjure"><code>conjure</code></a> Python library, a tool I've been writing 
that helps to persist and display images, audio and other code artifacts that are interleaved throughout
this post.</p>

<code-block language="python">def demo_page_dict(n_iterations: int = 100) -> Dict[str, any]:
    print(f'Generating article, training models for {n_iterations} iterations')

    remote = S3Collection(
        remote_collection_name, is_public=True, cors_enabled=True)

    @numpy_conjure(remote)
    def fetch_audio(url: str, start_sample: int) -> np.ndarray:
        return get_audio_segment(
            url,
            target_samplerate=samplerate,
            start_sample=start_sample,
            duration_samples=n_samples)

    def train_model_for_segment(
            target: torch.Tensor,
            iterations: int):

        while True:
            model = construct_experiment_model()
            optim = Adam(model.parameters(), lr=1e-2)

            for iteration in range(iterations):
                optim.zero_grad()
                recon = model.forward()
                loss = reconstruction_loss(recon, target) + sparsity_loss(model.control_signal)
                non_zero = (model.control_signal > 0).sum()
                sparsity = (non_zero / model.control_signal.numel()).item()

                if torch.isnan(loss).any():
                    print(f'detected NaN at iteration {iteration}')
                    break

                loss.backward()
                clip_grad_value_(model.parameters(), 0.5)
                optim.step()
                print(iteration, loss.item(), sparsity)

            # the loop only ends early when a NaN was detected
            if iteration < iterations - 1:
                print('NaN detected, starting anew')
                continue

            return model.state_dict()

    def encode(arr: np.ndarray) -> bytes:
        return encode_audio(arr)

    def display_matrix(arr: Union[torch.Tensor, np.ndarray], cmap: str = 'gray') -> bytes:
        if arr.ndim > 2:
            raise ValueError('Only two-dimensional arrays are supported')

        if isinstance(arr, torch.Tensor):
            arr = arr.data.cpu().numpy()

        arr = arr * -1

        bio = BytesIO()
        plt.matshow(arr, cmap=cmap)
        plt.axis('off')
        plt.margins(0, 0)
        plt.savefig(bio, pad_inches=0, bbox_inches='tight')
        plt.clf()
        bio.seek(0)
        return bio.read()

    # define loggers
    audio_logger = logger(
        'audio', 'audio/wav', encode, remote)

    matrix_logger = logger(
        'matrix', 'image/png', display_matrix, remote)

    @pickle_conjure(remote)
    def train_model_for_segment_and_produce_artifacts(
            url: str,
            start_sample: int,
            n_iterations: int):

        print(f'Generating example for {url} with start sample {start_sample}')

        audio_array = fetch_audio(url, start_sample)
        audio_tensor = torch.from_numpy(audio_array).to(device).view(1, 1, n_samples)
        audio_tensor = max_norm(audio_tensor)
        state_dict = train_model_for_segment(audio_tensor, n_iterations)
        hydrated = construct_experiment_model(state_dict)

        with torch.no_grad():
            recon = hydrated.forward()
            random = hydrated.random()

        _, orig_audio = audio_logger.result_and_meta(audio_array)
        _, recon_audio = audio_logger.result_and_meta(recon)
        _, random_audio = audio_logger.result_and_meta(random)
        _, control_plane = matrix_logger.result_and_meta(hydrated.control_signal_display)

        result = dict(
            orig=orig_audio,
            recon=recon_audio,
            control_plane=control_plane,
            random=random_audio,
        )
        return result

    def train_model_and_produce_components(
            url: str,
            start_sample: int,
            n_iterations: int):

        """
        Produce artifacts/media for a single example section
        """

        result_dict = train_model_for_segment_and_produce_artifacts(
            url, start_sample, n_iterations)

        orig = AudioComponent(result_dict['orig'].public_uri, height=200, samples=512)
        recon = AudioComponent(result_dict['recon'].public_uri, height=200, samples=512)
        random = AudioComponent(result_dict['random'].public_uri, height=200, samples=512)
        control = ImageComponent(result_dict['control_plane'].public_uri, height=200)

        return dict(orig=orig, recon=recon, control=control, random=random)

    def train_model_and_produce_content_section(
            url: str,
            start_sample: int,
            n_iterations: int,
            number: int) -> CompositeComponent:

        """
        Produce a single "Examples" section for the post
        """

        component_dict = train_model_and_produce_components(url, start_sample, n_iterations)
        composite = CompositeComponent(
            header=f'## Example {number}',
            orig_header='### Original Audio',
            orig_text=f'A random {n_seconds:.2f}-second segment of audio drawn from the MusicNet dataset',
            orig_component=component_dict['orig'],
            recon_header='### Reconstruction',
            recon_text=f'Reconstruction of the original audio after overfitting the model for {n_iterations} iterations',
            recon_component=component_dict['recon'],
            random_header='### Random Audio',
            random_text=f'Signal produced by a random, sparse control signal after overfitting the model for {n_iterations} iterations',
            random_component=component_dict['random'],
            control_header='### Control Signal',
            control_text=f'Sparse control signal for the original audio after overfitting the model for {n_iterations} iterations',
            control_component=component_dict['control']
        )
        return composite

    example_1 = train_model_and_produce_content_section(
        'https://music-net.s3.amazonaws.com/2358',
        start_sample=2 ** 16,
        n_iterations=n_iterations,
        number=1
    )

    example_2 = train_model_and_produce_content_section(
        'https://music-net.s3.amazonaws.com/2296',
        start_sample=2 ** 18,
        n_iterations=n_iterations,
        number=2
    )

    example_3 = train_model_and_produce_content_section(
        'https://music-net.s3.amazonaws.com/2391',
        start_sample=2 ** 18,
        n_iterations=n_iterations,
        number=3
    )

    example_4 = train_model_and_produce_content_section(
        'https://lj-speech.s3.amazonaws.com/LJ019-0120.wav',
        start_sample=0,
        n_iterations=n_iterations,
        number=4
    )

    citation = CitationComponent(
        tag='johnvinyardstatespacemodels',
        author='Vinyard, John',
        url='https://blog.cochlea.xyz/ssm.html',
        header='State Space Modelling for Sparse Decomposition of Audio',
        year='2024'
    )

    return dict(
        example_1=example_1,
        example_2=example_2,
        example_3=example_3,
        example_4=example_4,
        citation=citation,
    )


def generate_demo_page(iterations: int = 500):
    display = demo_page_dict(n_iterations=iterations)
    conjure_article(
        __file__,
        'html',
        title='Learning "Playable" State-Space Models from Audio',
        **display)</code-block>


<a id="Training Code"></a>
<h1>Training Code</h1>
<p>As I developed this model, I used the following code to pick a random audio segment, overfit a model, and monitor
its progress during training.</p>

<code-block language="python">def train_and_monitor():

    target = get_one_audio_segment(n_samples=n_samples, samplerate=samplerate)

    collection = LmdbCollection(path='ssm')


    recon_audio, orig_audio, random_audio = loggers(

        ['recon', 'orig', 'random'],

        'audio/wav',

        encode_audio,

        collection)


    envelopes, state_space = loggers(

        ['envelopes', 'statespace'],

        SupportedContentType.Spectrogram.value,

        to_numpy,

        collection,

        serializer=NumpySerializer(),

        deserializer=NumpyDeserializer())


    orig_audio(target)


    serve_conjure([

        orig_audio,

        recon_audio,

        envelopes,

        random_audio,

        state_space

    ], port=9999, n_workers=1)


    def train(target: torch.Tensor):

        model = construct_experiment_model()


        optim = Adam(model.parameters(), lr=1e-2)


        for iteration in count():

            optim.zero_grad()

            recon = model.forward()

            recon_audio(max_norm(recon))

            loss = reconstruction_loss(recon, target) + sparsity_loss(model.control_signal)


            # Fraction of control-signal entries that are active (greater than zero)

            non_zero = (model.control_signal > 0).sum()

            sparsity = (non_zero / model.control_signal.numel()).item()


            state_space(model.ssm.state_matrix)

            envelopes(model.control_signal.view(control_plane_dim, -1))

            loss.backward()


            clip_grad_value_(model.parameters(), 0.5)


            optim.step()

            print(iteration, loss.item(), sparsity)


            with torch.no_grad():

                rnd = model.random()

                random_audio(rnd)


    train(target)</code-block>


<a id="Conclusion"></a>
<h1>Conclusion</h1>
<p>Thanks for reading this far!  </p>
<p>I'm excited about the results of this experiment, but I'm not entirely pleased
with the frame-based approach, which leads to very noticeable artifacts in the reconstructions.  It <strong>runs counter to
one of my guiding principles</strong> as I try to design a sparse, interpretable, and easy-to-manipulate audio codec:
there is no place for arbitrary, fixed-size "frames".  Ideally, we represent audio as a sparse set of events
or sound sources that are sample-rate independent, i.e., more like a function or operator and less like a rasterized
representation.</p>
<p>I'm just beginning to learn about state-space models, and I was excited to learn from Albert Gu's
excellent talk <a href="https://youtu.be/luCBXCErkCs?si=rRSQ7af3X6cZRivW&amp;t=1776">"Efficiently Modeling Long Sequences with Structured State Spaces"</a>
that state-space models, which strongly resemble
<a href="https://en.wikipedia.org/wiki/Infinite_impulse_response">IIR filters</a>, can be transformed into convolutions, their
<a href="https://en.wikipedia.org/wiki/Finite_impulse_response">FIR</a> counterpart.  I've depended heavily on convolutions
to model resonance in other recent work.</p>
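<p>As a rough illustration of that IIR-to-FIR relationship, here is a minimal NumPy sketch, assuming a toy
single-input, single-output linear system rather than the model from this post.  The names <code>ssm_recurrent</code> and
<code>ssm_kernel</code>, and the random matrices, are hypothetical.  It unrolls the recurrence directly into an impulse
response, which is the naive approach and not the efficient kernel computation described in the talk.</p>

```python
import numpy as np


def ssm_recurrent(A, B, C, u):
    """Step the recurrence: state = A @ state + B * u[t], y[t] = C @ state."""
    state = np.zeros(A.shape[0])
    out = np.zeros(len(u))
    for t, u_t in enumerate(u):
        state = A @ state + B * u_t
        out[t] = C @ state
    return out


def ssm_kernel(A, B, C, length):
    """Unroll the same recurrence into the impulse response K[k] = C @ A^k @ B."""
    kernel = np.zeros(length)
    power_times_b = B.copy()
    for k in range(length):
        kernel[k] = C @ power_times_b
        power_times_b = A @ power_times_b
    return kernel


rng = np.random.default_rng(0)
n = 4  # hypothetical state dimension

# Random state matrix, rescaled so its spectral radius is 0.9 (a decaying resonance)
A = rng.normal(size=(n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.normal(size=n)
C = rng.normal(size=n)
u = rng.normal(size=64)  # stand-in for a control signal

y_recurrent = ssm_recurrent(A, B, C, u)
# Causal convolution with the unrolled kernel, cropped to the input length
y_convolved = np.convolve(u, ssm_kernel(A, B, C, len(u)))[:len(u)]
```

<p>Because the kernel here is as long as the input, the two outputs agree up to floating-point error.  A shorter kernel
would only approximate the recurrence, with an error that depends on how quickly the state decays.</p>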
<p>I'm looking forward to following this thread and beginning to find where the two different approaches converge! </p>

<a id="Future Work"></a>
<h2>Future Work</h2>
<ol>
<li>Instead of (or in addition to) a sparsity loss, could we build in more physics-informed losses, such as conservation
   of energy, i.e., overall energy can never <em>increase</em> unless it comes from the control signal?</li>
<li>Could we use 
   <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.StateSpace.html"><code>scipy.signal.StateSpace</code></a> to
   derive a continuous-time formulation of the model?</li>
<li>How would a model like this work as an event generator in my
   <a href="https://blog.cochlea.xyz/machine-learning/2024/02/29/siam.html">sparse, interpretable audio model from other experiments</a>?</li>
<li>Could we treat an entire, multi-instrument song as a single, large state-space model, learning a compressed 
   representation of the audio <em>and</em> a "playable" instrument, all at the same time?</li>
</ol>
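<p>One possible reading of the first question, sketched under strong assumptions: if the state update is
<code>state = A @ state</code> when there is no control input, then the Euclidean energy of the state can never grow as long
as the largest singular value of <code>A</code> is at most 1.  The hypothetical <code>energy_gain_penalty</code> below
measures how far a state matrix is from satisfying that condition.  It is a NumPy illustration only, not something used
in this experiment.</p>

```python
import numpy as np


def energy_gain_penalty(state_matrix: np.ndarray) -> float:
    """Zero when a single state update cannot increase the state's energy."""
    # The largest singular value is the worst-case growth of ||state|| in one step
    largest_gain = np.linalg.norm(state_matrix, ord=2)
    return float(max(0.0, largest_gain - 1.0))


rng = np.random.default_rng(0)

# Hypothetical 8-dimensional state matrices, rescaled to a known largest singular value
raw = rng.normal(size=(8, 8))
contractive = 0.9 * raw / np.linalg.norm(raw, ord=2)
expansive = 1.5 * raw / np.linalg.norm(raw, ord=2)

# Let the contractive system ring out with no control input and record its energy
state = rng.normal(size=8)
energies = []
for _ in range(16):
    energies.append(float(state @ state))
    state = contractive @ state
```

<p>In a PyTorch training loop, a differentiable version of the same quantity could plausibly be built from
<code>torch.linalg.matrix_norm(A, ord=2)</code> and added to the loss, although I haven't tried this.</p>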

<a id="Cite this Article"></a>
<h1>Cite this Article</h1>
<p>If you'd like to cite this article, you can use the following <a href="https://bibtex.org/">BibTeX block</a>.</p>

<citation-block
            tag="johnvinyardstatespacemodels"
            author="Vinyard, John"
            url="https://blog.cochlea.xyz/ssm.html"
            header="State Space Modelling for Sparse Decomposition of Audio"
            year="2024">
        </citation-block>

<code-block language="python">if __name__ == '__main__':

    parser = ArgumentParser()

    parser.add_argument('--mode', type=str, required=True)

    parser.add_argument('--iterations', type=int, default=250)

    parser.add_argument('--prefix', type=str, required=False, default='')

    args = parser.parse_args()


    if args.mode == 'train':

        train_and_monitor()

    elif args.mode == 'demo':

        generate_demo_page(args.iterations)

    elif args.mode == 'list':

        remote = S3Collection(

            remote_collection_name, is_public=True, cors_enabled=True)

        print('Listing stored keys')

        for key in remote.iter_prefix(start_key=args.prefix):

            print(key)

    elif args.mode == 'clear':

        remote = S3Collection(

            remote_collection_name, is_public=True, cors_enabled=True)

        remote.destroy(prefix=args.prefix)

    else:

        raise ValueError('Please provide one of train, demo, list, or clear')</code-block>


                <a href="#">
                    <div class="back-to-top">
                        Back to Top
                    </div>
                </a>

            </body>
            </html>