M1: Music Generation via Diffusion Transformers πŸŽ΅πŸ”¬

Join our Discord · Subscribe on YouTube · Connect on LinkedIn · Follow on X.com

M1 is a research project exploring large-scale music generation using diffusion transformers. This repository contains the implementation of our proposed architecture combining recent advances in diffusion models, transformer architectures, and music processing.

πŸ”¬ Research Overview

We propose a novel approach to music generation that combines:

  • Diffusion-based generative modeling
  • Multi-query attention mechanisms
  • Hierarchical audio encoding
  • Text-conditional generation
  • Scalable training methodology

Key Hypotheses

  1. Diffusion transformers can capture long-range musical structure better than traditional autoregressive models
  2. Multi-query attention mechanisms can improve training efficiency without sacrificing quality (see the sketch after this list)
  3. Hierarchical audio encoding preserves both local and global musical features
  4. Text conditioning enables semantic control over generation
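
To make hypothesis 2 concrete, the following is a minimal multi-query attention sketch in PyTorch: queries keep multiple heads while a single key/value head is shared across them, which shrinks the key/value projections and cache. This is an illustrative standalone module using our base dimensions, not the exact layer inside MusicDiffusionTransformer.

import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Minimal multi-query attention: many query heads, one shared key/value head."""

    def __init__(self, dim: int = 512, heads: int = 8, dim_head: int = 64):
        super().__init__()
        self.heads = heads
        self.dim_head = dim_head
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim, heads * dim_head, bias=False)
        # A single shared key/value head is what distinguishes MQA from standard MHA.
        self.to_kv = nn.Linear(dim, 2 * dim_head, bias=False)
        self.to_out = nn.Linear(heads * dim_head, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q = self.to_q(x).view(b, n, self.heads, self.dim_head).transpose(1, 2)  # (b, h, n, d)
        k, v = self.to_kv(x).chunk(2, dim=-1)                                   # (b, n, d) each
        attn = torch.softmax(q @ k.unsqueeze(1).transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v.unsqueeze(1)                                             # (b, h, n, d)
        return self.to_out(out.transpose(1, 2).reshape(b, n, -1))

x = torch.randn(2, 128, 512)           # (batch, sequence length, dim)
print(MultiQueryAttention()(x).shape)  # torch.Size([2, 128, 512])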

πŸ—οΈ Architecture

                                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                              β”‚  Time Encoding  β”‚
                                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                      β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Audio Input  β”œβ”€β”€β–Ί mel spectrogram ──────────►               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β”‚   Diffusion    β”‚
                                              β”‚  Transformer   β”‚ ──► Generated Audio
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚     Block      β”‚
β”‚ Text Input   β”œβ”€β”€β–Ί β”‚ T5 Encoder  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Ί               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
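
The Time Encoding box in the diagram is commonly realized as a sinusoidal timestep embedding; the minimal sketch below assumes that choice (the embedding actually used in this repository may differ, and comparing variants is part of the Phase 1 study).

import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 512) -> torch.Tensor:
    """Sinusoidal embedding of diffusion timesteps, returned with shape (batch, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]             # (batch, dim // 2)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.randint(0, 1000, (4,)))      # embed 4 random timesteps
print(emb.shape)                                            # torch.Size([4, 512])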

Implementation Details

# Key architectural dimensions
MODEL_CONFIG = {
    'dim': 512,          # Base dimension
    'depth': 12,         # Number of transformer layers
    'heads': 8,          # Attention heads
    'dim_head': 64,      # Dimension per head
    'mlp_dim': 2048,     # FFN dimension
    'dropout': 0.1       # Dropout rate
}

# Audio processing parameters
AUDIO_CONFIG = {
    'sample_rate': 16000,    # Audio sample rate (Hz)
    'n_mels': 80,            # Mel filterbank channels
    'n_fft': 1024,           # FFT window size
    'hop_length': 256        # STFT hop length in samples
}
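
For reference, the parameters above describe a standard mel-spectrogram front end. A minimal sketch using torchaudio is shown below; torchaudio is an assumption for illustration, not necessarily the library used by this repository's data pipeline.

import torch
import torchaudio

# Mel front end built from AUDIO_CONFIG.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

waveform = torch.randn(1, 16000)   # one second of dummy mono audio
mel = mel_transform(waveform)      # (1, 80, frames), frames ≈ sample count / hop_length
print(mel.shape)                   # torch.Size([1, 80, 63])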

πŸ“Š Proposed Experiments

Phase 1: Architecture Validation

  • Baseline model training on synthetic data
  • Ablation studies on attention mechanisms
  • Time embedding comparison study
  • Audio encoding architecture experiments

Phase 2: Dataset Construction

We plan to build a research dataset from multiple sources:

  1. Initial Development Dataset

    • 10k Creative Commons music samples
    • Focused on single-instrument recordings
    • Clear genre categorization
  2. Scaled Dataset (Future Work)

    • Spotify API integration
    • SoundCloud API integration
    • Public domain music archives

Phase 3: Training & Evaluation

Planned training configurations:

initial_training:
  batch_size: 32
  gradient_accumulation: 4
  learning_rate: 1e-4
  warmup_steps: 1000
  max_steps: 100000
  
evaluation_metrics:
  - spectral_convergence
  - magnitude_error
  - musical_consistency
  - genre_accuracy
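
To illustrate the first two metrics, spectral convergence and (log-)magnitude error are typically computed on STFT magnitudes of reference versus generated audio. The sketch below follows that common definition and is not the project's evaluation code.

import torch

def stft_magnitude(audio: torch.Tensor, n_fft: int = 1024, hop_length: int = 256) -> torch.Tensor:
    """Magnitude spectrogram with shape (batch, n_fft // 2 + 1, frames)."""
    window = torch.hann_window(n_fft, device=audio.device)
    return torch.stft(audio, n_fft, hop_length, window=window, return_complex=True).abs()

def spectral_convergence(reference: torch.Tensor, generated: torch.Tensor) -> torch.Tensor:
    """Frobenius-norm distance between magnitude spectrograms, relative to the reference."""
    ref, gen = stft_magnitude(reference), stft_magnitude(generated)
    return torch.linalg.norm(gen - ref) / torch.linalg.norm(ref)

def log_magnitude_error(reference: torch.Tensor, generated: torch.Tensor) -> torch.Tensor:
    """Mean absolute error between log-magnitude spectrograms."""
    ref, gen = stft_magnitude(reference), stft_magnitude(generated)
    return (torch.log(ref + 1e-7) - torch.log(gen + 1e-7)).abs().mean()

reference, generated = torch.randn(1, 16000), torch.randn(1, 16000)
print(spectral_convergence(reference, generated), log_magnitude_error(reference, generated))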

πŸ› οΈ Development Setup

# Clone repository
git clone https://github.com/Agora-Lab-AI/m1.git
cd m1

# Create environment
conda create -n m1 python=3.10
conda activate m1

# Install dependencies
pip install -r requirements.txt

# Run tests
pytest tests/

Example

import torch
from m1.model import ModelConfig, AudioConfig, MusicDiffusionTransformer, DiffusionScheduler, train_step, generate_audio
from loguru import logger

# Example usage
def main():
    logger.info("Setting up model configurations")
    
    # Configure logging
    logger.add("music_diffusion.log", rotation="500 MB")
    
    # Set device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    logger.info(f"Using device: {device}")
    
    # Initialize configurations
    model_config = ModelConfig(
        dim=512,
        depth=12,
        heads=8,
        dim_head=64,
        mlp_dim=2048,
        dropout=0.1
    )
    
    audio_config = AudioConfig(
        sample_rate=16000,
        n_mels=80,
        audio_length=1024,
        hop_length=256,
        win_length=1024,
        n_fft=1024
    )
    
    # Initialize model and scheduler
    model = MusicDiffusionTransformer(model_config, audio_config).to(device)
    scheduler = DiffusionScheduler(num_inference_steps=1000)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    
    # Example forward pass
    logger.info("Preparing example forward pass")
    batch_size = 4
    example_audio = torch.randn(batch_size, audio_config.audio_length).to(device)
    example_text = {
        'input_ids': torch.randint(0, 1000, (batch_size, 50)).to(device),
        'attention_mask': torch.ones(batch_size, 50).bool().to(device)
    }
    
    # Training step
    logger.info("Executing training step")
    loss = train_step(
        model,
        scheduler,
        optimizer,
        example_audio,
        example_text,
        device
    )
    logger.info(f"Training loss: {loss:.4f}")
    generation_text = {
        'input_ids': torch.randint(0, 1000, (1, 50)).to(device),
        'attention_mask': torch.ones(1, 50).bool().to(device)
    }
    
    # Generation example
    logger.info("Generating example audio")
    generated_audio = generate_audio(
        model,
        scheduler,
        generation_text,
        device,
        audio_config.audio_length
    )
    logger.info(f"Generated audio shape: {generated_audio.shape}")

if __name__ == "__main__":
    main()
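
If the model returns a waveform of shape (batch, samples), the result can be written to disk at the end of main() with torchaudio (an assumption for illustration; if the model emits mel spectrograms instead, a vocoder step is needed first):

import torchaudio

# torchaudio.save expects a (channels, samples) tensor.
torchaudio.save(
    "generated_sample.wav",
    generated_audio[0].unsqueeze(0).cpu(),
    sample_rate=audio_config.sample_rate,
)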

πŸ“ Project Structure

m1/
β”œβ”€β”€ configs/               # Training configurations
β”œβ”€β”€ m1/
β”‚   β”œβ”€β”€ models/           # Model architectures
β”‚   β”œβ”€β”€ diffusion/        # Diffusion scheduling
β”‚   β”œβ”€β”€ data/             # Data loading/processing
β”‚   └── training/         # Training loops
β”œβ”€β”€ notebooks/            # Research notebooks
β”œβ”€β”€ scripts/              # Training scripts
└── tests/                # Unit tests

πŸ§ͺ Current Status

This is an active research project in its early stages. Current focus:

  • Implementing and testing base architecture
  • Setting up data processing pipeline
  • Designing initial experiments
  • Building evaluation framework

πŸ“š References

Key papers informing this work:

  • "Diffusion Models Beat GANs on Image Synthesis" (Dhariwal & Nichol, 2021)
  • "Structured Denoising Diffusion Models" (Sohl-Dickstein et al., 2015)
  • "High-Resolution Image Synthesis with Latent Diffusion Models" (Rombach et al., 2022)

🀝 Contributing

We welcome research collaborations! Areas where we're looking for contributions:

  • Novel architectural improvements
  • Efficient training methodologies
  • Evaluation metrics
  • Dataset curation tools

πŸ“¬ Contact

For research collaboration inquiries:

βš–οΈ License

This research code is released under the MIT License.

πŸ” Citation

If you use this code in your research, please cite:

@misc{m1music2024,
  title={M1: Experimental Music Generation via Diffusion Transformers},
  author={M1 Research Team},
  year={2024},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/Agora-Lab-AI/m1}}
}

🚧 Disclaimer

This is experimental research code:

  • Architecture and training procedures may change significantly
  • Not yet optimized for production use
  • Results and capabilities are being actively researched
  • Breaking changes should be expected

We're sharing this code to foster collaboration and advance the field of AI music generation research.
