NVIDIA AI Introduces Audio-SDS: A Unified Diffusion-Based Framework for Prompt-Guided Audio Synthesis and Source Separation Without Specialized Datasets


Audio diffusion models have achieved high-quality speech, music, and Foley sound synthesis, yet they predominantly excel at sample generation rather than parameter optimization. Tasks like physically informed impact-sound generation or prompt-driven source separation require models that can adjust explicit, interpretable parameters under structural constraints. Score Distillation Sampling (SDS), which has powered text-to-3D generation and image editing by backpropagating through pretrained diffusion priors, has not yet been applied to audio. Adapting SDS to audio diffusion allows optimizing parametric audio representations without assembling large task-specific datasets, bridging modern generative models with parameterized synthesis workflows.
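
To make the SDS mechanism concrete, here is a minimal toy sketch of one SDS update on a parametric renderer. Everything here is illustrative: the `denoiser` is a placeholder stand-in for a pretrained diffusion prior, and the linear renderer `x = A @ theta` is a hypothetical simplification so the chain rule is explicit. The key SDS trick, skipping the denoiser's Jacobian and using the noise residual directly, is shown in the gradient computation.

```python
import numpy as np

def sds_residual(render, theta, denoiser, t, sigma, rng):
    """Toy Score Distillation Sampling residual.

    render:   maps parameters theta -> signal x
    denoiser: stand-in for a pretrained diffusion prior that predicts
              the noise added to a noisy signal (placeholder here)
    Returns (eps_hat - eps), the SDS residual; the caller applies
    dx/dtheta via the chain rule.
    """
    x = render(theta)
    eps = rng.standard_normal(x.shape)
    x_t = x + sigma * eps            # noise the rendered signal
    eps_hat = denoiser(x_t, t)       # prior's noise estimate
    # SDS skips the denoiser Jacobian entirely
    return eps_hat - eps

# Hypothetical linear renderer x = A @ theta, so dx/dtheta = A
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 4))
render = lambda th: A @ th
denoiser = lambda x_t, t: 0.1 * x_t  # placeholder prior, not a real model
theta = np.zeros(4)

g = A.T @ sds_residual(render, theta, denoiser, t=0.5, sigma=0.2, rng=rng)
theta = theta - 1e-2 * g             # one SDS gradient step
```

In the paper's setting the renderer would be an FM synthesizer, impact-sound simulator, or separation mask, and the denoiser a pretrained text-conditioned audio diffusion model.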

Classic audio techniques, such as frequency modulation (FM) synthesis, which uses operator-modulated oscillators to craft rich timbres, and physically grounded impact-sound simulators, provide compact, interpretable parameter spaces. Similarly, source separation has evolved from matrix factorization to neural and text-guided methods for isolating components like vocals or instruments. By integrating SDS updates with pretrained audio diffusion models, one can leverage learned generative priors to guide the optimization of FM parameters, impact-sound simulators, or separation masks directly from high-level prompts, uniting signal-processing interpretability with the flexibility of modern diffusion-based generation.
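
As a quick illustration of the kind of compact parameter space FM synthesis offers, here is a minimal two-operator FM synthesizer (carrier phase-modulated by one sine modulator). The function name and default sample rate are my own choices, not from the paper; the paper's synthesizer may use more operators and envelopes.

```python
import numpy as np

def fm_synth(freq_c, freq_m, index, dur=0.5, sr=16000):
    """Two-operator FM synthesis.

    freq_c: carrier frequency in Hz
    freq_m: modulator frequency in Hz
    index:  modulation index; larger values add more sidebands
            and a brighter, richer timbre
    """
    t = np.arange(int(dur * sr)) / sr
    modulator = np.sin(2 * np.pi * freq_m * t)
    # the modulator perturbs the carrier's phase
    return np.sin(2 * np.pi * freq_c * t + index * modulator)

# 440 Hz carrier, 220 Hz modulator: a harmonic, bell-like tone
tone = fm_synth(freq_c=440.0, freq_m=220.0, index=3.0)
```

In an Audio-SDS setup, `freq_c`, `freq_m`, and `index` would be the optimizable parameters that the distilled diffusion gradient pushes toward a text prompt.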

Researchers from NVIDIA and MIT present Audio-SDS, an extension of SDS to text-conditioned audio diffusion models. Audio-SDS leverages a single pretrained model to perform various audio tasks without requiring specialized datasets. Distilling generative priors into parametric audio representations facilitates tasks like impact-sound simulation, FM synthesis parameter calibration, and source separation. The framework combines data-driven priors with explicit parameter control, producing perceptually convincing results. Key improvements include a stable decoder-based SDS, multistep denoising, and a multiscale spectrogram approach for better high-frequency detail and realism.
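
The multiscale spectrogram idea can be sketched as a loss that compares magnitude spectrograms at several FFT sizes, so both coarse structure and fine high-frequency detail contribute to the gradient. This is a generic multi-resolution spectrogram loss under my own choices of window, hop, and FFT sizes; the paper's exact formulation may differ.

```python
import numpy as np

def spec_mag(x, n_fft, hop):
    """Magnitude spectrogram via framed FFT with a Hann window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multiscale_spec_loss(x, y, fft_sizes=(256, 512, 1024)):
    """Sum of L1 spectrogram distances across several resolutions.

    Small FFTs resolve transients; large FFTs resolve harmonics,
    so combining them captures detail at multiple time-frequency
    trade-offs.
    """
    return sum(
        np.abs(spec_mag(x, n, n // 4) - spec_mag(y, n, n // 4)).mean()
        for n in fft_sizes
    )
```

Matching an optimized render to a diffusion-denoised target in this spectrogram domain, rather than raw waveform samples, is what sharpens high-frequency detail.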

The study discusses applying SDS to audio diffusion models. Inspired by DreamFusion, SDS generates stereo audio through a rendering function, improving performance by bypassing encoder gradients and focusing instead on the decoded audio. The methodology is enhanced by three modifications: avoiding encoder instability, emphasizing spectrogram features to highlight high-frequency details, and using multi-step denoising for better stability. Applications of Audio-SDS include FM synthesizers, impact-sound synthesis, and source separation. These tasks show how SDS adapts to different audio domains without retraining, ensuring that synthesized audio aligns with textual prompts while maintaining high fidelity.
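
The multi-step denoising modification can be sketched as follows: instead of a single denoiser evaluation at one noise level, the noisy render is refined over a short schedule of decreasing noise levels before the residual is formed. The stepping rule below is a generic DDIM-style update under my own notation and is only an assumption about the mechanism, not the paper's exact algorithm; the `denoiser` is again a placeholder.

```python
import numpy as np

def multistep_denoise(x_t, denoiser, sigmas):
    """Refine a noisy sample over a decreasing noise schedule.

    x_t:      noisy signal at noise level sigmas[0]
    denoiser: predicts the noise component at a given level
              (placeholder for a pretrained diffusion prior)
    sigmas:   decreasing noise levels, ending near 0
    """
    x = x_t
    for s_hi, s_lo in zip(sigmas[:-1], sigmas[1:]):
        eps_hat = denoiser(x, s_hi)
        x0_hat = x - s_hi * eps_hat    # predicted clean signal
        x = x0_hat + s_lo * eps_hat    # re-noise to the next level
    return x

# toy check with a linear placeholder denoiser
x_t = np.ones(16)
cleaned = multistep_denoise(x_t, lambda x, s: 0.5 * x, [1.0, 0.5, 0.0])
```

Using this multi-step estimate as the target gives a smoother, lower-variance signal for the distilled gradient than a single noisy denoiser call.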

The performance of the Audio-SDS framework is demonstrated across three tasks: FM synthesis, impact synthesis, and source separation. The experiments are designed to test the framework's effectiveness using both subjective (listening tests) and objective metrics such as the CLAP score, distance to ground truth, and Signal-to-Distortion Ratio (SDR). Pretrained models, such as the Stable Audio Open checkpoint, are used for these tasks. The results show significant audio synthesis and separation improvements, with clear alignment to text prompts.
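
Of the metrics mentioned, SDR is simple enough to sketch directly. The basic definition is the ratio, in dB, of reference-signal energy to the energy of the error between the estimate and the reference; evaluation toolkits often use more elaborate variants (e.g. allowing scaling or filtering of the reference), so treat this as the plain form.

```python
import numpy as np

def sdr(ref, est, eps=1e-12):
    """Signal-to-Distortion Ratio in dB.

    Higher is better: the reference's energy divided by the
    energy of the residual (est - ref), on a log scale.
    """
    num = np.sum(ref ** 2)
    den = np.sum((est - ref) ** 2) + eps  # eps avoids division by zero
    return 10.0 * np.log10(num / den + eps)

ref = np.sin(np.linspace(0.0, 10.0, 1000))
noisy = ref + 0.1 * np.cos(np.linspace(0.0, 10.0, 1000))
```

For source separation, `ref` would be a ground-truth stem and `est` the prompt-guided estimate recovered by Audio-SDS.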

In conclusion, the study introduces Audio-SDS, a method that extends SDS to text-conditioned audio diffusion models. Using a single pretrained model, Audio-SDS enables a variety of tasks, such as simulating physically informed impact sounds, adjusting FM synthesis parameters, and performing prompt-based source separation. The approach unifies data-driven priors with user-defined representations, eliminating the need for large, domain-specific datasets. While there are challenges in model coverage, latent-encoding artifacts, and optimization sensitivity, Audio-SDS demonstrates the potential of distillation-based methods for multimodal research, particularly in audio-related tasks.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
