Stars
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Vector (and Scalar) Quantization, in Pytorch
Official Jax Implementation of MaskGIT
[ICLR2025] Halton Scheduler for Masked Generative Image Transformer
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations. It includes practical examples for both text and image modalities.
A family of state-of-the-art Transformer-based audio codecs for low-bitrate high-quality audio coding.
Official PyTorch implementation of "Paralinguistics-Aware Speech-Empowered LLMs for Natural Conversation" (NeurIPS 2024)
Automatically Update Text-to-Speech (TTS) Papers Daily using GitHub Actions (Updates Every 12 Hours)
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Official Code Implementation for 'A Simple Early Exiting Framework for Accelerated Sampling in Diffusion Models'
[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling
Evaluation Protocol for Large-Scale Zero-Shot TTS Literature
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
Inference and training library for high-quality TTS models.
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Official Implementation for "Consistency Flow Matching: Defining Straight Flows with Velocity Consistency"
NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates @ INTERSPEECH 2022
A playbook for systematically maximizing the performance of deep learning models.
Official repository of DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech, ICASSP 2023
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
A Non-Autoregressive End-to-End Text-to-Speech (text-to-wav), supporting a family of SOTA unsupervised duration modelings. This project grows with the research community, aiming to achieve the ulti…
A Non-Autoregressive Transformer based Text-to-Speech, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, …