merge adefossez changes #1

Merged · 7 commits · Jun 6, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -14,4 +14,4 @@ Session.vim
/trash
/misc
/mdx
- .mypy_cache
+ .mypy_cache
30 changes: 18 additions & 12 deletions README.md
@@ -1,21 +1,25 @@
# Demucs Music Source Separation

[![Support Ukraine](https://img.shields.io/badge/Support-Ukraine-FFD500?style=flat&labelColor=005BBB)](https://opensource.fb.com/support-ukraine)
![tests badge](https://github.com/facebookresearch/demucs/workflows/tests/badge.svg)
![linter badge](https://github.com/facebookresearch/demucs/workflows/linter/badge.svg)


+ **This is the officially maintained Demucs** now that I (Alexandre Défossez) have left Meta to join [Kyutai](https://twitter.com/kyutai_labs).
+ Note that I'm not actively working on Demucs anymore, so expect slow replies and no new features for now.



This is the 4th release of Demucs (v4), featuring Hybrid Transformer based source separation.
**For the classic Hybrid Demucs (v3):** [go to this commit][demucs_v3].
- If you are experiencing issues and want the old Demucs back, please fill an issue, and then you can get back to the v3 with
+ If you are experiencing issues and want the old Demucs back, please file an issue, and then you can get back to Demucs v3 with
`git checkout v3`. You can also go back to [Demucs v2][demucs_v2].


Demucs is a state-of-the-art music source separation model, currently capable of separating
drums, bass, and vocals from the rest of the accompaniment.
Demucs is based on a U-Net convolutional architecture inspired by [Wave-U-Net][waveunet].
The v4 version features [Hybrid Transformer Demucs][htdemucs], a hybrid spectrogram/waveform separation model using Transformers.
- It is based on [Hybrid Demucs][hybrid_paper] (also provided in this repo) with the innermost layers are
+ It is based on [Hybrid Demucs][hybrid_paper] (also provided in this repo), with the innermost layers
replaced by a cross-domain Transformer Encoder. This Transformer uses self-attention within each domain,
and cross-attention across domains.
The model achieves an SDR of 9.00 dB on the MUSDB HQ test set. Moreover, when using sparse attention
@@ -123,7 +127,7 @@ python3 -m pip install -U git+https://github.com/facebookresearch/demucs#egg=dem

Advanced OS support is provided on the following pages; **you must read the page for your OS before posting an issue**:
- **If you are using Windows:** [Windows support](docs/windows.md).
- - **If you are using MAC OS X:** [Mac OS X support](docs/mac.md).
+ - **If you are using macOS:** [macOS support](docs/mac.md).
- **If you are using Linux:** [Linux support](docs/linux.md).

### For machine learning scientists
@@ -139,7 +143,7 @@ pip install -e .

This will create a `demucs` environment with all the dependencies installed.

- You will also need to install [soundstretch/soundtouch](https://www.surina.net/soundtouch/soundstretch.html): on Mac OSX you can do `brew install sound-touch`,
+ You will also need to install [soundstretch/soundtouch](https://www.surina.net/soundtouch/soundstretch.html): on macOS you can do `brew install sound-touch`,
and on Ubuntu `sudo apt-get install soundstretch`. This is used for the
pitch/tempo augmentation.

@@ -194,16 +198,18 @@ demucs --two-stems=vocals myfile.mp3
```


- If you have a GPU, but you run out of memory, please use `--segment SEGMENT` to reduce length of each split. `SEGMENT` should be changed to a integer. Personally recommend not less than 10 (the bigger the number is, the more memory is required, but quality may increase). Create an environment variable `PYTORCH_NO_CUDA_MEMORY_CACHING=1` is also helpful. If this still cannot help, please add `-d cpu` to the command line. See the section hereafter for more details on the memory requirements for GPU acceleration.
+ If you have a GPU but run out of memory, please use `--segment SEGMENT` to reduce the length of each split. `SEGMENT` should be an integer giving the length of each segment in seconds.
+ A segment length of at least 10 is recommended (the larger the number, the more memory is required, but quality may increase). Note that the Hybrid Transformer models only support a maximum segment length of 7.8 seconds.
+ Setting the environment variable `PYTORCH_NO_CUDA_MEMORY_CACHING=1` also helps. If this still does not help, please add `-d cpu` to the command line. See the section hereafter for more details on the memory requirements for GPU acceleration.
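For illustration, a hedged sketch of the same memory-saving option driven from Python through the documented `demucs.separate.main` entry point (`myfile.mp3` is a placeholder):

```python
# A sketch, not the project's canonical example: separate with 10-second
# segments to cut GPU memory use; the arguments mirror the CLI exactly.
import demucs.separate

demucs.separate.main(["--segment", "10", "myfile.mp3"])
```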

Separated tracks are stored in the `separated/MODEL_NAME/TRACK_NAME` folder. There you will find four stereo wav files sampled at 44.1 kHz: `drums.wav`, `bass.wav`,
`other.wav`, `vocals.wav` (or `.mp3` if you used the `--mp3` option).

- All audio formats supported by `torchaudio` can be processed (i.e. wav, mp3, flac, ogg/vorbis on Linux/Mac OS X etc.). On Windows, `torchaudio` has limited support, so we rely on `ffmpeg`, which should support pretty much anything.
+ All audio formats supported by `torchaudio` can be processed (i.e. wav, mp3, flac, ogg/vorbis on Linux/macOS, etc.). On Windows, `torchaudio` has limited support, so we rely on `ffmpeg`, which should support pretty much anything.
Audio is resampled on the fly if necessary.
- The output will be a wave file encoded as int16.
+ The output will be a wav file encoded as int16.
You can save as float32 wav files with `--float32`, or 24-bit integer wav with `--int24`.
- You can pass `--mp3` to save as mp3 instead, and set the bitrate with `--mp3-bitrate` (default is 320kbps).
+ You can pass `--mp3` to save as mp3 instead, and set the bitrate (in kbps) with `--mp3-bitrate` (default is 320).

It can happen that the output needs clipping, in particular due to some separation artifacts.
Demucs will automatically rescale each output stem so as to avoid clipping. This can however break
@@ -226,8 +232,8 @@ The list of pre-trained models is:
but quality can be slightly worse.
- `SIG`: where `SIG` is a single model from the [model zoo](docs/training.md#model-zoo).

- The `--two-stems=vocals` option allows to separate vocals from the rest (e.g. karaoke mode).
- `vocals` can be changed into any source in the selected model.
+ The `--two-stems=vocals` option allows separating vocals from the rest of the accompaniment (i.e., karaoke mode).
+ `vocals` can be changed to any source in the selected model.
This will mix the files after separating the mix fully, so this won't be faster or use less memory.
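A quick sketch of that option from Python (assuming demucs is installed; any source in the model can replace `vocals`):

```python
# Sketch: karaoke-style two-stem separation into vocals and accompaniment.
import demucs.separate

demucs.separate.main(["--two-stems", "vocals", "myfile.mp3"])
```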

The `--shifts=SHIFTS` option performs multiple predictions with random shifts of the input (a.k.a. the *shift trick*) and averages them. This makes prediction `SHIFTS` times
@@ -248,7 +254,7 @@ If you do not have enough memory on your GPU, simply add `-d cpu` to the command

## Calling from another Python program

- The main function provides a `opt` parameter as a simple API. You can just pass the parsed command line as this parameter:
+ The main function provides an `opt` parameter as a simple API. You can just pass the parsed command line as this parameter:
```python
# Assume that your command is `demucs --mp3 --two-stems vocals -n mdx_extra "track with space.mp3"`
# The following code is the same as the command above:
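import demucs.separate

# The diff is truncated here; this sketch mirrors the command in the comment
# above and may differ from the exact committed text:
demucs.separate.main(["--mp3", "--two-stems", "vocals", "-n", "mdx_extra", "track with space.mp3"])
```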
2 changes: 1 addition & 1 deletion demucs/__init__.py
@@ -4,4 +4,4 @@
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

- __version__ = "4.1.0a1"
+ __version__ = "4.1.0a3"
9 changes: 5 additions & 4 deletions demucs/api.py
@@ -22,6 +22,7 @@

import subprocess

+ from . import audio_legacy
import torch as th
import torchaudio as ta

@@ -195,7 +196,7 @@ def update_parameter(
self._jobs = jobs
if not isinstance(progress, _NotProvided):
self._progress = progress
- if not isinstance(callback, _NotProvided) and (callback is None or callable(callback)):
+ if not isinstance(callback, _NotProvided):
self._callback = callback
if not isinstance(callback_arg, _NotProvided):
self._callback_arg = callback_arg
@@ -266,7 +267,7 @@ def separate_tensor(
wav = convert_audio(wav, sr, self._samplerate, self._audio_channels)
ref = wav.mean(0)
wav -= ref.mean()
- wav /= ref.std()
+ wav /= ref.std() + 1e-8
out = apply_model(
self._model,
wav[None],
@@ -284,9 +285,9 @@
)
if out is None:
raise KeyboardInterrupt
- out *= ref.std()
+ out *= ref.std() + 1e-8
out += ref.mean()
- wav *= ref.std()
+ wav *= ref.std() + 1e-8
wav += ref.mean()
return (wav, dict(zip(self._model.sources, out[0])))

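The `+ 1e-8` terms above guard the normalization against a silent reference channel, where `ref.std()` is zero. A minimal sketch of the idea, with hypothetical helper names (not the repo's API):

```python
import torch

EPS = 1e-8  # keeps the division finite on silent input

def normalize(wav: torch.Tensor):
    """Zero-mean, unit-std normalization that survives silence."""
    ref = wav.mean(0)
    mean, std = ref.mean(), ref.std() + EPS
    return (wav - mean) / std, mean, std

def denormalize(out: torch.Tensor, mean, std):
    """Exact inverse of normalize, since the same std + EPS is reused."""
    return out * std + mean
```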
53 changes: 18 additions & 35 deletions demucs/apply.py
@@ -51,7 +51,7 @@ def __init__(self, models: tp.List[Model],
assert other.samplerate == first.samplerate
assert other.audio_channels == first.audio_channels
if segment is not None:
- if not isinstance(other, HTDemucs) and segment > other.segment:
+ if not isinstance(other, HTDemucs) or segment <= other.segment:
other.segment = segment

self.audio_channels = first.audio_channels
@@ -150,7 +150,7 @@ def apply_model(model: tp.Union[BagOfModels, Model],
num_workers: int = 0, segment: tp.Optional[float] = None,
pool=None, lock=None,
callback: tp.Optional[tp.Callable[[dict], None]] = None,
- callback_arg: tp.Optional[dict] = None) -> tp.Optional[th.Tensor]:
+ callback_arg: tp.Optional[dict] = None) -> th.Tensor:
"""
Apply model to a given mixture.

@@ -197,30 +197,23 @@ def apply_model(model: tp.Union[BagOfModels, Model],
'lock': lock,
}
out: tp.Union[float, th.Tensor]
- res: tp.Union[float, th.Tensor, None]
+ res: tp.Union[float, th.Tensor]
if isinstance(model, BagOfModels):
# Special treatment for bag of model.
# We explicitely apply multiple times `apply_model` so that the random shifts
# are different for each model.
estimates: tp.Union[float, th.Tensor] = 0.
totals = [0.] * len(model.sources)
callback_arg["models"] = len(model.models)
kwargs["callback"] = (
(
lambda d, i=callback_arg["model_idx_in_bag"]: callback(
_replace_dict(d, ("model_idx_in_bag", i))
)
)
if callable(callback)
else None
)
for sub_model, model_weights in zip(model.models, model.weights):
kwargs["callback"] = ((
lambda d, i=callback_arg["model_idx_in_bag"]: callback(
_replace_dict(d, ("model_idx_in_bag", i))) if callback else None)
)
original_model_device = next(iter(sub_model.parameters())).device
sub_model.to(device)

res = apply_model(sub_model, mix, **kwargs, callback_arg=callback_arg)
- if res is None:
- return res
out = res
sub_model.to(original_model_device)
for k, inst_weight in enumerate(model_weights):
@@ -252,13 +245,10 @@ def apply_model(model: tp.Union[BagOfModels, Model],
offset = random.randint(0, max_shift)
shifted = TensorChunk(padded_mix, offset, length + max_shift - offset)
kwargs["callback"] = (
- (lambda d, i=shift_idx: callback(_replace_dict(d, ("shift_idx", i))))
- if callable(callback)
- else None
+ (lambda d, i=shift_idx: callback(_replace_dict(d, ("shift_idx", i)))
+ if callback else None)
)
res = apply_model(model, shifted, **kwargs, callback_arg=callback_arg)
- if res is None:
- return res
shifted_out = res
out += shifted_out[..., max_shift - offset:]
out /= shifts
@@ -289,17 +279,18 @@ def apply_model(model: tp.Union[BagOfModels, Model],
chunk = TensorChunk(mix, offset, segment_length)
future = pool.submit(apply_model, model, chunk, **kwargs, callback_arg=callback_arg,
callback=(lambda d, i=offset:
- callback(_replace_dict(d, ("segment_offset", i))))
- if callable(callback) else None)
+ callback(_replace_dict(d, ("segment_offset", i)))
+ if callback else None))
futures.append((future, offset))
offset += segment_length
if progress:
futures = tqdm.tqdm(futures, unit_scale=scale, ncols=120, unit='seconds')
for future, offset in futures:
- chunk_out = future.result() # type: tp.Union[None, th.Tensor]
- if chunk_out is None:
- pool.shutdown(wait=False, cancel_futures=True)
- return chunk_out
+ try:
+ chunk_out = future.result() # type: th.Tensor
+ except Exception:
+ pool.shutdown(wait=True, cancel_futures=True)
+ raise
chunk_length = chunk_out.shape[-1]
out[..., offset:offset + segment_length] += (
weight[:chunk_length] * chunk_out).to(mix.device)
@@ -320,20 +311,12 @@ def apply_model(model: tp.Union[BagOfModels, Model],
assert isinstance(mix, TensorChunk)
padded_mix = mix.padded(valid_length).to(device)
with lock:
- try:
if callback is not None:
callback(_replace_dict(callback_arg, ("state", "start"))) # type: ignore
- except KeyboardInterrupt:
- raise
- except Exception:
- pass
with th.no_grad():
out = model(padded_mix)
with lock:
- try:
if callback is not None:
callback(_replace_dict(callback_arg, ("state", "end"))) # type: ignore
- except KeyboardInterrupt:
- raise
- except Exception:
- pass
assert isinstance(out, th.Tensor)
return center_trim(out, length)
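The `state`, `shift_idx`, `segment_offset`, and `model_idx_in_bag` keys seen in this file make up the progress-callback protocol; after this change, exceptions raised by the callback propagate instead of being silently swallowed. A sketch of a consumer (hypothetical function, not part of the API):

```python
# Sketch: a callback passed as apply_model(..., callback=on_progress).
def on_progress(info: dict) -> None:
    # "start"/"end" bracket each model invocation; the offsets identify
    # which chunk, shift, and bagged model produced this notification.
    print(info.get("state"), info.get("segment_offset"),
          info.get("shift_idx"), info.get("model_idx_in_bag"))
```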
1 change: 1 addition & 0 deletions demucs/audio.py
@@ -10,6 +10,7 @@
import lameenc
import julius
import numpy as np
+ from . import audio_legacy
import torch
import torchaudio as ta
import typing as tp
17 changes: 17 additions & 0 deletions demucs/audio_legacy.py
@@ -0,0 +1,17 @@
# This file is to extend support for torchaudio 2.1

import importlib
import os
import sys
import warnings

if not "torchaudio" in sys.modules:
os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "0"
elif os.getenv("TORCHAUDIO_USE_BACKEND_DISPATCHER", default="1") == "1":
if sys.modules["torchaudio"].__version__ >= "2.1":
os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "0"
importlib.reload(sys.modules["torchaudio"])
warnings.warn(
"TORCHAUDIO_USE_BACKEND_DISPATCHER is set to 0 and torchaudio is reloaded.",
ImportWarning,
)
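The shim works by pinning `TORCHAUDIO_USE_BACKEND_DISPATCHER=0` before `torchaudio` is first imported, or by reloading `torchaudio` otherwise, which is why the other files in this diff import it ahead of `torchaudio`. A sketch of the intended import order (written against torchaudio 2.1; `myfile.wav` is a placeholder):

```python
# Sketch: the import is for its side effect only, hence the noqa.
from demucs import audio_legacy  # noqa: F401
import torchaudio

waveform, samplerate = torchaudio.load("myfile.wav")
```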
10 changes: 6 additions & 4 deletions demucs/hdemucs.py
@@ -776,16 +776,18 @@ def forward(self, mix):
# demucs issue #435 ##432
# NOTE: in this case z already is on cpu
# TODO: remove this when mps supports complex numbers
- x_is_mps = x.device.type == "mps"
- if x_is_mps:
+ x_is_mps_xpu = x.device.type in ["mps", "xpu"]
+ x_device = x.device
+ if x_is_mps_xpu:
x = x.cpu()

zout = self._mask(z, x)
x = self._ispec(zout, length)

# back to mps device
- if x_is_mps:
- x = x.to('mps')
+ if x_is_mps_xpu:
+ x = x.to(x_device)


if self.hybrid:
xt = xt.view(B, S, -1, length)
9 changes: 5 additions & 4 deletions demucs/htdemucs.py
@@ -629,8 +629,9 @@ def forward(self, mix):
# demucs issue #435 ##432
# NOTE: in this case z already is on cpu
# TODO: remove this when mps supports complex numbers
- x_is_mps = x.device.type == "mps"
- if x_is_mps:
+ x_is_mps_xpu = x.device.type in ["mps", "xpu"]
+ x_device = x.device
+ if x_is_mps_xpu:
x = x.cpu()

zout = self._mask(z, x)
@@ -643,8 +644,8 @@
x = self._ispec(zout, length)

# back to mps device
- if x_is_mps:
- x = x.to("mps")
+ if x_is_mps_xpu:
+ x = x.to(x_device)

if self.use_train_segment:
if self.training:
1 change: 1 addition & 0 deletions demucs/repitch.py
@@ -9,6 +9,7 @@
import subprocess as sp
import tempfile

+ from . import audio_legacy
import torch
import torchaudio as ta

8 changes: 7 additions & 1 deletion demucs/separate.py
@@ -41,7 +41,13 @@ def get_parser():
'Default is "{track}/{stem}.{ext}".')
parser.add_argument("-d",
"--device",
default="cuda" if th.cuda.is_available() else "cpu",
default=(
"cuda"
if th.cuda.is_available()
else "mps"
if th.backends.mps.is_available()
else "cpu"
),
help="Device to use, default is cuda if available else cpu")
parser.add_argument("--shifts",
default=1,
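The new default prefers CUDA, then Apple's MPS backend, then the CPU. A standalone sketch of the same selection logic (assumes PyTorch ≥ 1.12, where `torch.backends.mps` exists):

```python
import torch as th

def default_device() -> str:
    """Pick the best available device: cuda, then mps, then cpu."""
    if th.cuda.is_available():
        return "cuda"
    if th.backends.mps.is_available():
        return "mps"
    return "cpu"
```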
8 changes: 4 additions & 4 deletions demucs/spec.py
@@ -11,8 +11,8 @@
def spectro(x, n_fft=512, hop_length=None, pad=0):
*other, length = x.shape
x = x.reshape(-1, length)
- is_mps = x.device.type == 'mps'
- if is_mps:
+ is_mps_xpu = x.device.type in ['mps', 'xpu']
+ if is_mps_xpu:
x = x.cpu()
z = th.stft(x,
n_fft * (1 + pad),
@@ -32,8 +32,8 @@ def ispectro(z, hop_length=None, length=None, pad=0):
n_fft = 2 * freqs - 2
z = z.view(-1, freqs, frames)
win_length = n_fft // (1 + pad)
- is_mps = z.device.type == 'mps'
- if is_mps:
+ is_mps_xpu = z.device.type in ['mps', 'xpu']
+ if is_mps_xpu:
z = z.cpu()
x = th.istft(z,
n_fft,
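Complex-valued `torch.stft`/`torch.istft` are unsupported on the `mps` and `xpu` backends, so both helpers hop to the CPU and move only real-valued results back. A self-contained sketch of the pattern (not the repo's exact code):

```python
import torch

def spectro_roundtrip(x: torch.Tensor, n_fft: int = 512) -> torch.Tensor:
    """STFT -> iSTFT on CPU for backends without complex-tensor support."""
    device = x.device
    if device.type in ("mps", "xpu"):
        x = x.cpu()  # complex tensors are unsupported on these backends
    window = torch.hann_window(n_fft, device=x.device)
    z = torch.stft(x, n_fft, window=window, return_complex=True)
    y = torch.istft(z, n_fft, window=window, length=x.shape[-1])
    return y.to(device)  # real-valued audio can move back to mps/xpu
```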
1 change: 1 addition & 0 deletions demucs/train.py
@@ -15,6 +15,7 @@
import hydra
from hydra.core.global_hydra import GlobalHydra
from omegaconf import OmegaConf
+ from . import audio_legacy
import torch
from torch import nn
import torchaudio