merge adefossez changes #1

Merged · 7 commits · Jun 6, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -14,4 +14,4 @@ Session.vim
/trash
/misc
/mdx
- .mypy_cache
+ .mypy_cache
30 changes: 18 additions & 12 deletions README.md
@@ -1,21 +1,25 @@
# Demucs Music Source Separation

[![Support Ukraine](https://img.shields.io/badge/Support-Ukraine-FFD500?style=flat&labelColor=005BBB)](https://opensource.fb.com/support-ukraine)
![tests badge](https://github.com/facebookresearch/demucs/workflows/tests/badge.svg)
![linter badge](https://github.com/facebookresearch/demucs/workflows/linter/badge.svg)


+ **This is the officially maintained Demucs** now that I (Alexandre Défossez) have left Meta to join [Kyutai](https://twitter.com/kyutai_labs).
+ Note that I'm not actively working on Demucs anymore, so expect slow replies and no new features for now.



This is the 4th release of Demucs (v4), featuring Hybrid Transformer based source separation.
**For the classic Hybrid Demucs (v3):** [go to this commit][demucs_v3].
- If you are experiencing issues and want the old Demucs back, please fill an issue, and then you can get back to the v3 with
+ If you are experiencing issues and want the old Demucs back, please file an issue, and then you can get back to Demucs v3 with
`git checkout v3`. You can also go back to [Demucs v2][demucs_v2].


Demucs is a state-of-the-art music source separation model, currently capable of separating
drums, bass, and vocals from the rest of the accompaniment.
Demucs is based on a U-Net convolutional architecture inspired by [Wave-U-Net][waveunet].
The v4 version features [Hybrid Transformer Demucs][htdemucs], a hybrid spectrogram/waveform separation model using Transformers.
- It is based on [Hybrid Demucs][hybrid_paper] (also provided in this repo) with the innermost layers are
+ It is based on [Hybrid Demucs][hybrid_paper] (also provided in this repo), with the innermost layers
replaced by a cross-domain Transformer Encoder. This Transformer uses self-attention within each domain,
and cross-attention across domains.
The model achieves an SDR of 9.00 dB on the MUSDB HQ test set. Moreover, when using sparse attention
@@ -123,7 +127,7 @@ python3 -m pip install -U git+https://github.com/facebookresearch/demucs#egg=dem

Advanced OS support is provided on the following pages; **you must read the page for your OS before posting an issue**:
- **If you are using Windows:** [Windows support](docs/windows.md).
- - **If you are using MAC OS X:** [Mac OS X support](docs/mac.md).
+ - **If you are using macOS:** [macOS support](docs/mac.md).
- **If you are using Linux:** [Linux support](docs/linux.md).

### For machine learning scientists
@@ -139,7 +143,7 @@ pip install -e .

This will create a `demucs` environment with all the dependencies installed.

- You will also need to install [soundstretch/soundtouch](https://www.surina.net/soundtouch/soundstretch.html): on Mac OSX you can do `brew install sound-touch`,
+ You will also need to install [soundstretch/soundtouch](https://www.surina.net/soundtouch/soundstretch.html): on macOS you can do `brew install sound-touch`,
and on Ubuntu `sudo apt-get install soundstretch`. This is used for the
pitch/tempo augmentation.

@@ -194,16 +198,18 @@ demucs --two-stems=vocals myfile.mp3
```


- If you have a GPU, but you run out of memory, please use `--segment SEGMENT` to reduce length of each split. `SEGMENT` should be changed to a integer. Personally recommend not less than 10 (the bigger the number is, the more memory is required, but quality may increase). Create an environment variable `PYTORCH_NO_CUDA_MEMORY_CACHING=1` is also helpful. If this still cannot help, please add `-d cpu` to the command line. See the section hereafter for more details on the memory requirements for GPU acceleration.
+ If you have a GPU but run out of memory, please use `--segment SEGMENT` to reduce the length of each split. `SEGMENT` should be an integer giving the length of each segment in seconds.
+ A segment length of at least 10 is recommended (the larger the number, the more memory is required, but quality may increase). Note that the Hybrid Transformer models only support a maximum segment length of 7.8 seconds.
+ Setting the environment variable `PYTORCH_NO_CUDA_MEMORY_CACHING=1` also helps. If this still does not help, please add `-d cpu` to the command line. See the section hereafter for more details on the memory requirements for GPU acceleration.
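For illustration, a hedged sketch of the same memory-saving option driven from Python through the documented `demucs.separate.main` entry point (`myfile.mp3` is a placeholder):

```python
# A sketch, not the project's canonical example: separate with 10-second
# segments to cut GPU memory use; the arguments mirror the CLI exactly.
import demucs.separate

demucs.separate.main(["--segment", "10", "myfile.mp3"])
```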

Separated tracks are stored in the `separated/MODEL_NAME/TRACK_NAME` folder. There you will find four stereo wav files sampled at 44.1 kHz: `drums.wav`, `bass.wav`,
`other.wav`, `vocals.wav` (or `.mp3` if you used the `--mp3` option).

- All audio formats supported by `torchaudio` can be processed (i.e. wav, mp3, flac, ogg/vorbis on Linux/Mac OS X etc.). On Windows, `torchaudio` has limited support, so we rely on `ffmpeg`, which should support pretty much anything.
+ All audio formats supported by `torchaudio` can be processed (i.e. wav, mp3, flac, ogg/vorbis on Linux/macOS, etc.). On Windows, `torchaudio` has limited support, so we rely on `ffmpeg`, which should support pretty much anything.
Audio is resampled on the fly if necessary.
- The output will be a wave file encoded as int16.
+ The output will be a wav file encoded as int16.
You can save as float32 wav files with `--float32`, or 24-bit integer wav with `--int24`.
- You can pass `--mp3` to save as mp3 instead, and set the bitrate with `--mp3-bitrate` (default is 320kbps).
+ You can pass `--mp3` to save as mp3 instead, and set the bitrate (in kbps) with `--mp3-bitrate` (default is 320).

It can happen that the output needs clipping, in particular due to some separation artifacts.
Demucs will automatically rescale each output stem so as to avoid clipping. This can however break
@@ -226,8 +232,8 @@ The list of pre-trained models is:
but quality can be slightly worse.
- `SIG`: where `SIG` is a single model from the [model zoo](docs/training.md#model-zoo).

- The `--two-stems=vocals` option allows to separate vocals from the rest (e.g. karaoke mode).
- `vocals` can be changed into any source in the selected model.
+ The `--two-stems=vocals` option allows separating vocals from the rest of the accompaniment (i.e., karaoke mode).
+ `vocals` can be changed to any source in the selected model.
This will mix the files after separating the mix fully, so this won't be faster or use less memory.
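A quick sketch of that option from Python (assuming demucs is installed; any source in the model can replace `vocals`):

```python
# Sketch: karaoke-style two-stem separation into vocals and accompaniment.
import demucs.separate

demucs.separate.main(["--two-stems", "vocals", "myfile.mp3"])
```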

The `--shifts=SHIFTS` option performs multiple predictions with random shifts of the input (a.k.a. the *shift trick*) and averages them. This makes prediction `SHIFTS` times
@@ -248,7 +254,7 @@ If you do not have enough memory on your GPU, simply add `-d cpu` to the command

## Calling from another Python program

- The main function provides a `opt` parameter as a simple API. You can just pass the parsed command line as this parameter:
+ The main function provides an `opt` parameter as a simple API. You can just pass the parsed command line as this parameter:
```python
# Assume that your command is `demucs --mp3 --two-stems vocals -n mdx_extra "track with space.mp3"`
# The following code is the same as the command above:
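import demucs.separate

# The diff is truncated here; this sketch mirrors the command in the comment
# above and may differ from the exact committed text:
demucs.separate.main(["--mp3", "--two-stems", "vocals", "-n", "mdx_extra", "track with space.mp3"])
```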
2 changes: 1 addition & 1 deletion demucs/__init__.py
@@ -4,4 +4,4 @@
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

- __version__ = "4.1.0a1"
+ __version__ = "4.1.0a3"
9 changes: 5 additions & 4 deletions demucs/api.py
@@ -22,6 +22,7 @@

import subprocess

+ from . import audio_legacy
import torch as th
import torchaudio as ta

@@ -195,7 +196,7 @@ def update_parameter(
self._jobs = jobs
if not isinstance(progress, _NotProvided):
self._progress = progress
- if not isinstance(callback, _NotProvided) and (callback is None or callable(callback)):
+ if not isinstance(callback, _NotProvided):
self._callback = callback
if not isinstance(callback_arg, _NotProvided):
self._callback_arg = callback_arg
@@ -266,7 +267,7 @@ def separate_tensor(
wav = convert_audio(wav, sr, self._samplerate, self._audio_channels)
ref = wav.mean(0)
wav -= ref.mean()
- wav /= ref.std()
+ wav /= ref.std() + 1e-8
out = apply_model(
self._model,
wav[None],
@@ -284,9 +285,9 @@
)
if out is None:
raise KeyboardInterrupt
- out *= ref.std()
+ out *= ref.std() + 1e-8
out += ref.mean()
- wav *= ref.std()
+ wav *= ref.std() + 1e-8
wav += ref.mean()
return (wav, dict(zip(self._model.sources, out[0])))

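The `+ 1e-8` terms above guard the normalization against a silent reference channel, where `ref.std()` is zero. A minimal sketch of the idea, with hypothetical helper names (not the repo's API):

```python
import torch

EPS = 1e-8  # keeps the division finite on silent input

def normalize(wav: torch.Tensor):
    """Zero-mean, unit-std normalization that survives silence."""
    ref = wav.mean(0)
    mean, std = ref.mean(), ref.std() + EPS
    return (wav - mean) / std, mean, std

def denormalize(out: torch.Tensor, mean, std):
    """Exact inverse of normalize, since the same std + EPS is reused."""
    return out * std + mean
```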
53 changes: 18 additions & 35 deletions demucs/apply.py
@@ -51,7 +51,7 @@ def __init__(self, models: tp.List[Model],
assert other.samplerate == first.samplerate
assert other.audio_channels == first.audio_channels
if segment is not None:
- if not isinstance(other, HTDemucs) and segment > other.segment:
+ if not isinstance(other, HTDemucs) or segment <= other.segment:
other.segment = segment

self.audio_channels = first.audio_channels
@@ -150,7 +150,7 @@ def apply_model(model: tp.Union[BagOfModels, Model],
num_workers: int = 0, segment: tp.Optional[float] = None,
pool=None, lock=None,
callback: tp.Optional[tp.Callable[[dict], None]] = None,
- callback_arg: tp.Optional[dict] = None) -> tp.Optional[th.Tensor]:
+ callback_arg: tp.Optional[dict] = None) -> th.Tensor:
"""
Apply model to a given mixture.

@@ -197,30 +197,23 @@ def apply_model(model: tp.Union[BagOfModels, Model],
'lock': lock,
}
out: tp.Union[float, th.Tensor]
- res: tp.Union[float, th.Tensor, None]
+ res: tp.Union[float, th.Tensor]
if isinstance(model, BagOfModels):
# Special treatment for bag of model.
# We explicitely apply multiple times `apply_model` so that the random shifts
# are different for each model.
estimates: tp.Union[float, th.Tensor] = 0.
totals = [0.] * len(model.sources)
callback_arg["models"] = len(model.models)
kwargs["callback"] = (
(
lambda d, i=callback_arg["model_idx_in_bag"]: callback(
_replace_dict(d, ("model_idx_in_bag", i))
)
)
if callable(callback)
else None
)
for sub_model, model_weights in zip(model.models, model.weights):
kwargs["callback"] = ((
lambda d, i=callback_arg["model_idx_in_bag"]: callback(
_replace_dict(d, ("model_idx_in_bag", i))) if callback else None)
)
original_model_device = next(iter(sub_model.parameters())).device
sub_model.to(device)

res = apply_model(sub_model, mix, **kwargs, callback_arg=callback_arg)
- if res is None:
- return res
out = res
sub_model.to(original_model_device)
for k, inst_weight in enumerate(model_weights):
@@ -252,13 +245,10 @@ def apply_model(model: tp.Union[BagOfModels, Model],
offset = random.randint(0, max_shift)
shifted = TensorChunk(padded_mix, offset, length + max_shift - offset)
kwargs["callback"] = (
- (lambda d, i=shift_idx: callback(_replace_dict(d, ("shift_idx", i))))
- if callable(callback)
- else None
+ (lambda d, i=shift_idx: callback(_replace_dict(d, ("shift_idx", i)))
+ if callback else None)
)
res = apply_model(model, shifted, **kwargs, callback_arg=callback_arg)
- if res is None:
- return res
shifted_out = res
out += shifted_out[..., max_shift - offset:]
out /= shifts
@@ -289,17 +279,18 @@ def apply_model(model: tp.Union[BagOfModels, Model],
chunk = TensorChunk(mix, offset, segment_length)
future = pool.submit(apply_model, model, chunk, **kwargs, callback_arg=callback_arg,
callback=(lambda d, i=offset:
- callback(_replace_dict(d, ("segment_offset", i))))
- if callable(callback) else None)
+ callback(_replace_dict(d, ("segment_offset", i)))
+ if callback else None))
futures.append((future, offset))
offset += segment_length
if progress:
futures = tqdm.tqdm(futures, unit_scale=scale, ncols=120, unit='seconds')
for future, offset in futures:
- chunk_out = future.result() # type: tp.Union[None, th.Tensor]
- if chunk_out is None:
- pool.shutdown(wait=False, cancel_futures=True)
- return chunk_out
+ try:
+ chunk_out = future.result() # type: th.Tensor
+ except Exception:
+ pool.shutdown(wait=True, cancel_futures=True)
+ raise
chunk_length = chunk_out.shape[-1]
out[..., offset:offset + segment_length] += (
weight[:chunk_length] * chunk_out).to(mix.device)
@@ -320,20 +311,12 @@ def apply_model(model: tp.Union[BagOfModels, Model],
assert isinstance(mix, TensorChunk)
padded_mix = mix.padded(valid_length).to(device)
with lock:
- try:
if callback is not None:
callback(_replace_dict(callback_arg, ("state", "start"))) # type: ignore
- except KeyboardInterrupt:
- raise
- except Exception:
- pass
with th.no_grad():
out = model(padded_mix)
with lock:
- try:
if callback is not None:
callback(_replace_dict(callback_arg, ("state", "end"))) # type: ignore
- except KeyboardInterrupt:
- raise
- except Exception:
- pass
assert isinstance(out, th.Tensor)
return center_trim(out, length)
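The `state`, `shift_idx`, `segment_offset`, and `model_idx_in_bag` keys seen in this file make up the progress-callback protocol; after this change, exceptions raised by the callback propagate instead of being silently swallowed. A sketch of a consumer (hypothetical function, not part of the API):

```python
# Sketch: a callback passed as apply_model(..., callback=on_progress).
def on_progress(info: dict) -> None:
    # "start"/"end" bracket each model invocation; the offsets identify
    # which chunk, shift, and bagged model produced this notification.
    print(info.get("state"), info.get("segment_offset"),
          info.get("shift_idx"), info.get("model_idx_in_bag"))
```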
1 change: 1 addition & 0 deletions demucs/audio.py
@@ -10,6 +10,7 @@
import lameenc
import julius
import numpy as np
+ from . import audio_legacy
import torch
import torchaudio as ta
import typing as tp
17 changes: 17 additions & 0 deletions demucs/audio_legacy.py
@@ -0,0 +1,17 @@
# This file is to extend support for torchaudio 2.1

import importlib
import os
import sys
import warnings

if not "torchaudio" in sys.modules:
os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "0"
elif os.getenv("TORCHAUDIO_USE_BACKEND_DISPATCHER", default="1") == "1":
if sys.modules["torchaudio"].__version__ >= "2.1":
os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "0"
importlib.reload(sys.modules["torchaudio"])
warnings.warn(
"TORCHAUDIO_USE_BACKEND_DISPATCHER is set to 0 and torchaudio is reloaded.",
ImportWarning,
)
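The shim works by pinning `TORCHAUDIO_USE_BACKEND_DISPATCHER=0` before `torchaudio` is first imported, or by reloading `torchaudio` otherwise, which is why the other files in this diff import it ahead of `torchaudio`. A sketch of the intended import order (written against torchaudio 2.1; `myfile.wav` is a placeholder):

```python
# Sketch: the import is for its side effect only, hence the noqa.
from demucs import audio_legacy  # noqa: F401
import torchaudio

waveform, samplerate = torchaudio.load("myfile.wav")
```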
10 changes: 6 additions & 4 deletions demucs/hdemucs.py
@@ -776,16 +776,18 @@ def forward(self, mix):
# demucs issue #435 ##432
# NOTE: in this case z already is on cpu
# TODO: remove this when mps supports complex numbers
- x_is_mps = x.device.type == "mps"
- if x_is_mps:
+ x_is_mps_xpu = x.device.type in ["mps", "xpu"]
+ x_device = x.device
+ if x_is_mps_xpu:
x = x.cpu()

zout = self._mask(z, x)
x = self._ispec(zout, length)

# back to mps device
- if x_is_mps:
- x = x.to('mps')
+ if x_is_mps_xpu:
+ x = x.to(x_device)


if self.hybrid:
xt = xt.view(B, S, -1, length)
9 changes: 5 additions & 4 deletions demucs/htdemucs.py
@@ -629,8 +629,9 @@ def forward(self, mix):
# demucs issue #435 ##432
# NOTE: in this case z already is on cpu
# TODO: remove this when mps supports complex numbers
- x_is_mps = x.device.type == "mps"
- if x_is_mps:
+ x_is_mps_xpu = x.device.type in ["mps", "xpu"]
+ x_device = x.device
+ if x_is_mps_xpu:
x = x.cpu()

zout = self._mask(z, x)
@@ -643,8 +644,8 @@
x = self._ispec(zout, length)

# back to mps device
- if x_is_mps:
- x = x.to("mps")
+ if x_is_mps_xpu:
+ x = x.to(x_device)

if self.use_train_segment:
if self.training:
1 change: 1 addition & 0 deletions demucs/repitch.py
@@ -9,6 +9,7 @@
import subprocess as sp
import tempfile

+ from . import audio_legacy
import torch
import torchaudio as ta

8 changes: 7 additions & 1 deletion demucs/separate.py
@@ -41,7 +41,13 @@ def get_parser():
'Default is "{track}/{stem}.{ext}".')
parser.add_argument("-d",
"--device",
default="cuda" if th.cuda.is_available() else "cpu",
default=(
"cuda"
if th.cuda.is_available()
else "mps"
if th.backends.mps.is_available()
else "cpu"
),
help="Device to use, default is cuda if available else cpu")
parser.add_argument("--shifts",
default=1,
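The new default prefers CUDA, then Apple's MPS backend, then the CPU. A standalone sketch of the same selection logic (assumes PyTorch ≥ 1.12, where `torch.backends.mps` exists):

```python
import torch as th

def default_device() -> str:
    """Pick the best available device: cuda, then mps, then cpu."""
    if th.cuda.is_available():
        return "cuda"
    if th.backends.mps.is_available():
        return "mps"
    return "cpu"
```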
8 changes: 4 additions & 4 deletions demucs/spec.py
@@ -11,8 +11,8 @@
def spectro(x, n_fft=512, hop_length=None, pad=0):
*other, length = x.shape
x = x.reshape(-1, length)
- is_mps = x.device.type == 'mps'
- if is_mps:
+ is_mps_xpu = x.device.type in ['mps', 'xpu']
+ if is_mps_xpu:
x = x.cpu()
z = th.stft(x,
n_fft * (1 + pad),
@@ -32,8 +32,8 @@ def ispectro(z, hop_length=None, length=None, pad=0):
n_fft = 2 * freqs - 2
z = z.view(-1, freqs, frames)
win_length = n_fft // (1 + pad)
- is_mps = z.device.type == 'mps'
- if is_mps:
+ is_mps_xpu = z.device.type in ['mps', 'xpu']
+ if is_mps_xpu:
z = z.cpu()
x = th.istft(z,
n_fft,
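Complex-valued `torch.stft`/`torch.istft` are unsupported on the `mps` and `xpu` backends, so both helpers hop to the CPU and move only real-valued results back. A self-contained sketch of the pattern (not the repo's exact code):

```python
import torch

def spectro_roundtrip(x: torch.Tensor, n_fft: int = 512) -> torch.Tensor:
    """STFT -> iSTFT on CPU for backends without complex-tensor support."""
    device = x.device
    if device.type in ("mps", "xpu"):
        x = x.cpu()  # complex tensors are unsupported on these backends
    window = torch.hann_window(n_fft, device=x.device)
    z = torch.stft(x, n_fft, window=window, return_complex=True)
    y = torch.istft(z, n_fft, window=window, length=x.shape[-1])
    return y.to(device)  # real-valued audio can move back to mps/xpu
```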
1 change: 1 addition & 0 deletions demucs/train.py
@@ -15,6 +15,7 @@
import hydra
from hydra.core.global_hydra import GlobalHydra
from omegaconf import OmegaConf
+ from . import audio_legacy
import torch
from torch import nn
import torchaudio