The official repo for Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
- 💥 News 💥
- 👀 Overall Structure
- 📊 BEWO-1M Dataset
- 🏆 Usage
- 📝 Evaluation
- 📜 License
- 🤝 Contributors
[2025.01.24] The preview version of BEWO-1M has been released, together with instructions.
[2025.01.23] Our paper is accepted by ICLR 2025! See you in Singapore.
[2024.10.14] Our initial paper is now available on arXiv.
- Dataset: Data instruction for BEWO
- Inference: Inference code for BEWO (Coming soon.)
- ITD Evaluation: Evaluation code for BEWO (Coming soon.)
To better facilitate the advancement of multimodal guided spatial audio generation models, we have developed a dual-channel audio dataset named Both Ears Wide Open 1M (BEWO-1M) through rigorous simulations and GPT-assisted caption transformation.
In total, we constructed 2.8k hours of training audio with more than 1M audio-text pairs, and approximately 17 hours of validation data with 6.2k pairs.
The full BEWO-1M dataset can be found here.
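To illustrate what a pair looks like, here is a minimal loading sketch. It assumes the released data ships stereo WAV files plus a JSON metadata file; the file name and field names (`audio_path`, `caption`) are illustrative placeholders, not the dataset's actual schema.

```python
# Minimal sketch: read one BEWO-1M audio-text pair (field names are hypothetical).
import json
import soundfile as sf

with open("bewo1m_metadata.json") as f:      # hypothetical metadata file
    records = json.load(f)

record = records[0]
audio, sr = sf.read(record["audio_path"])    # shape: (num_samples, 2) for dual-channel audio
caption = record["caption"]

print(f"{caption!r}: {audio.shape[0] / sr:.1f}s stereo clip at {sr} Hz")
```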
Requires PyTorch 2.0 or later for Flash Attention support
Development for the repo is done in Python 3.8.10
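A quick environment check, assuming the Flash Attention requirement refers to the scaled dot-product attention kernel introduced in PyTorch 2.0:

```python
# Verify the interpreter and PyTorch version before running the repo.
import sys
import torch

print("Python:", sys.version.split()[0])   # developed against 3.8.10
print("PyTorch:", torch.__version__)
# PyTorch 2.0+ ships scaled_dot_product_attention, which backs Flash Attention.
assert hasattr(torch.nn.functional, "scaled_dot_product_attention"), \
    "PyTorch 2.0 or later is required"
```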
This code base is adapted from stable-audio-tools. Sincere thanks to the engineers for their great work.
Coming Soon...
To generate audio from a text prompt using our pretrained model:
- Download the pretrained model and config files from [MODEL_LINK]
- Place the model checkpoint at `/path/to/final.ckpt`
- Place the model config at `/path/to/model_config_sim.json`
- Run the following command:
python simple_generation.py --prompt "A dog is barking on the left." --device cuda:0
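As a quick sanity check on the spatial cue (not part of the repo), you can compare the channel energies of the generated clip; the output path below is a placeholder for wherever the script writes its result.

```python
# Compare left/right RMS energy of the generated stereo file.
# For "A dog is barking on the left.", the left channel should dominate.
import numpy as np
import soundfile as sf

audio, sr = sf.read("output.wav")   # placeholder path; expected shape (num_samples, 2)
left_rms = float(np.sqrt(np.mean(audio[:, 0] ** 2)))
right_rms = float(np.sqrt(np.mean(audio[:, 1] ** 2)))
print(f"left RMS = {left_rms:.4f}, right RMS = {right_rms:.4f}")
```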
To generate audio from a text prompt using our pretrained c2f model:
- Download the pretrained model and config files from [MODEL_LINK]
- Place the model checkpoint at `/path/to/final_c2f.ckpt`
- Place the model config at `/path/to/model_config_sim_c2f.json`
- Run the generation command:
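The exact c2f command is not listed yet. A plausible invocation, assuming the c2f checkpoint and config are picked up by the same simple_generation.py entry point as above (an assumption, not a documented command):

python simple_generation.py --prompt "A dog is barking on the left." --device cuda:0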
The GPT induction step generates the spatial attributes from the text prompt. We offer two models to choose from: GPT-4o and DeepSeek-V3. Since DeepSeek-V3 is much cheaper and open-source, it is a cost-effective option. A conceptual sketch of what this step does is given after the example commands below.
Using GPT induction:
python gpt_induction.py --prompt "A dog is barking on the left." --device cuda:0
python gpt_induction.py --prompt "a dog is barking and running from left to right." --device cuda:0
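Conceptually, the induction step maps a free-form caption to the discrete spatial attributes used by the manual setting described below. The sketch here only illustrates that idea with the OpenAI Python client (>=1.0); it is not the repo's actual prompt or implementation.

```python
# Illustrative only: ask an LLM to turn a caption into spatial attributes.
# Assumes the `openai` package (>=1.0) and OPENAI_API_KEY set in the environment.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Extract spatial attributes from the audio caption. Reply with JSON: "
    '{"init_direction": 1-5 (1=left, 5=right), "final_direction": 1-5, '
    '"moving": 0-3 (0=static, 3=fast)}.'
)

def induce_spatial_attributes(caption: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": caption},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(induce_spatial_attributes("a dog is barking and running from left to right."))
# e.g. {"init_direction": 1, "final_direction": 5, "moving": 1}
```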
We also provide a manual setting that lets you specify the initial direction, final direction, and moving state yourself. The direction ranges from 1 (left) to 5 (right), and the moving state ranges from 0 (not moving) to 3 (fast moving); a small batch example follows the commands below.
Using manual setting:
python gpt_induction.py --prompt "a dog is barking." --device cuda:0 --manual True --init_direction 1 --final_direction 1 --moving 0
python gpt_induction.py --prompt "a dog is barking." --device cuda:0 --manual True --init_direction 1 --final_direction 5 --moving 1
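If you want to sweep several manual configurations, a small wrapper (not part of the repo) can call the documented flags in a loop:

```python
# Run gpt_induction.py in manual mode over a few direction/movement settings.
import subprocess

CASES = [
    # (init_direction, final_direction, moving)
    (1, 1, 0),  # stationary source on the left
    (1, 5, 1),  # slowly moving from left to right
    (5, 1, 2),  # moving from right to left at a moderate speed
]

for init_d, final_d, moving in CASES:
    subprocess.run(
        [
            "python", "gpt_induction.py",
            "--prompt", "a dog is barking.",
            "--device", "cuda:0",
            "--manual", "True",
            "--init_direction", str(init_d),
            "--final_direction", str(final_d),
            "--moving", str(moving),
        ],
        check=True,
    )
```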
If you find this repo useful, please cite our paper:
@article{sun2024both,
title={Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation},
author={Sun, Peiwen and Cheng, Sitong and Li, Xiangtai and Ye, Zhen and Liu, Huadai and Zhang, Honggang and Xue, Wei and Guo, Yike},
journal={arXiv preprint arXiv:2410.10676},
year={2024}
}
Please also cite the stable-audio-tools paper if you use the code in this repo. Thanks again for their great work.
@article{evans2024stable,
title={Stable audio open},
author={Evans, Zach and Parker, Julian D and Carr, CJ and Zukowski, Zack and Taylor, Josiah and Pons, Jordi},
journal={arXiv preprint arXiv:2407.14358},
year={2024}
}