A real-time voice conversation application powered by FastRTC that enables interactive audio communication with both local and cloud AI models. Inspired by Talk To Claude, this project transforms text-based AI interactions into natural voice conversations.
This application creates a seamless voice interface to interact with AI models. It provides:
- Real-time speech-to-text conversion using various STT models
- Text generation using local or cloud-based language models through an OpenAI-compatible API
- High-quality text-to-speech synthesis
- Interactive web interface with audio visualization
- Flexibility to use either local or cloud APIs with simple configuration changes
The application follows a modular architecture with the following components:
```
┌────────────────┐      ┌────────────────┐      ┌────────────────┐
│  Web Browser   │<────>│  FastRTC API   │<────>│   OpenAI API   │
│  (WebRTC+UI)   │      │  (Python App)  │      │(Local or Cloud)│
└────────────────┘      └────────────────┘      └────────────────┘
                                ▲
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌────────────────┐      ┌────────────────┐      ┌────────────────┐
│   STT Server   │      │   LLM Server   │      │   TTS Server   │
│(Local or Cloud)│      │(Local or Cloud)│      │(Local or Cloud)│
└────────────────┘      └────────────────┘      └────────────────┘
```
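In essence, each conversational turn makes one pass through this pipeline. The sketch below illustrates a single STT → LLM → TTS round trip using the `openai` Python package against OpenAI-compatible endpoints; the URLs, model names, and file names are placeholders, and the real `app.py` streams audio over WebRTC rather than reading and writing files:

```python
# Sketch of one STT -> LLM -> TTS round trip. Illustrative only: the real
# app streams audio over WebRTC instead of reading and writing files.
from openai import OpenAI

# Separate clients, since each stage may live on a different server.
stt = OpenAI(base_url="http://192.168.1.111:8880/v1", api_key="dummy_api_key")
llm = OpenAI(base_url="http://192.168.1.111:8880/v1", api_key="dummy_api_key")
tts = OpenAI(base_url="http://192.168.1.111:8880/v1", api_key="dummy_api_key")

# 1. Speech-to-text: transcribe the user's utterance.
with open("utterance.wav", "rb") as f:  # placeholder input file
    text = stt.audio.transcriptions.create(model="whisper-base", file=f).text

# 2. Text generation: get the assistant's reply.
reply = llm.chat.completions.create(
    model="llama-3.2-3b-instruct-q4_k_m",
    messages=[{"role": "user", "content": text}],
).choices[0].message.content

# 3. Text-to-speech: synthesize the reply as audio.
with tts.audio.speech.with_streaming_response.create(
    model="kokoro", voice="af_heart", input=reply
) as response:
    response.stream_to_file("reply.mp3")  # placeholder output file
```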
The application has been tested with the following API combinations:
Local:

- STT: LocalAI with whisper.cpp backend, FastWhisperAPI
- LLM: LocalAI with llama.cpp backend, MLC LLM
- TTS: LocalAI with Piper backend, FastKoko

Cloud:

- STT: Groq
- LLM: Groq
- TTS: Microsoft Edge TTS with openai-edge-tts
- API Flexibility: Switch between local and cloud APIs with simple .env file changes
- Real-time Voice Interaction: Speak naturally and receive AI responses as audio
- WebRTC Integration: Low-latency audio streaming with network traversal capability
- Progressive TTS Playback: Audio responses begin playing as soon as sentences are completed (see the sketch after this list)
- Responsive Audio Visualization: Visual feedback of audio input/output
- Configurable AI Models: Easily switch between different AI models
- Customizable Voice Settings: Configure voice, language and audio format
- Multiple Deployment Options: UI, API, or phone integration
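Progressive TTS playback hinges on cutting the LLM's token stream at sentence boundaries, so synthesis can start before the full reply exists. A minimal sketch of that idea (the helper name and splitting regex are illustrative, not the exact code in `app.py`):

```python
import re
from typing import Iterator

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_stream(token_stream: Iterator[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token stream completes it."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the final fragment is a complete sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush the trailing fragment

# Usage: synthesize each sentence the moment it completes, instead of
# waiting for the whole LLM response.
# for sentence in sentences_from_stream(llm_token_stream):
#     audio_chunk = synthesize(sentence)   # hypothetical TTS call
#     play(audio_chunk)                    # hypothetical playback
```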
- Python 3.8+
- A local AI instance or cloud API credentials
- FastRTC-compatible environment
- Modern web browser with WebRTC support
- Clone the repository:

  ```bash
  git clone https://github.com/limcheekin/talk-to-ai.git
  cd talk-to-ai
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables:

  ```bash
  cp .env.example .env  # Edit .env with your settings
  ```
Edit the `.env` file to configure the following settings. You can easily switch between local and cloud providers by updating these values:

- `LLM_BASE_URL`: URL of your AI instance (e.g., "http://192.168.1.111:8880/v1" for LocalAI or "https://api.groq.com/openai/v1" for Groq)
- `LLM_MODEL`: Name of the language model to use (e.g., "llama-3.2-3b-instruct-q4_k_m" for LocalAI or "llama3-8b-8192" for Groq)
- `LLM_API_KEY`: API key for your AI service (required for cloud APIs; use `dummy_api_key` for local)
- `STT_BASE_URL`: URL of your STT service (can be LocalAI or a cloud service like Groq)
- `STT_API_KEY`: API key for the STT service (required for cloud APIs)
- `STT_MODEL`: Model to use for speech recognition (e.g., "whisper-base" for LocalAI or "whisper-large-v3" for Groq)
- `STT_RESPONSE_FORMAT`: Format for STT responses (e.g., "verbose_json")
- `LANGUAGE`: Language code for speech recognition (e.g., "en")
- `TTS_BASE_URL`: URL of your TTS service (LocalAI, FastKoko, or a cloud service like Edge TTS)
- `TTS_API_KEY`: API key for the TTS service (required for cloud APIs)
- `TTS_MODEL`: TTS model to use (e.g., "kokoro", "tts-1-hd", or "tts-1")
- `TTS_VOICE`: Voice ID to use (e.g., "af_heart" for Kokoro or "en-US-AriaNeural" for Edge TTS)
- `TTS_BACKEND`: TTS backend identifier (e.g., "kokoro" or "edge-tts")
- `TTS_AUDIO_FORMAT`: Output audio format (e.g., "pcm")
- `MODE`: Deployment mode ("UI", "PHONE", or "API")
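For orientation, `settings.py` loads these variables with Pydantic. A minimal sketch of such a settings class, assuming the pydantic-settings package; the repository's actual field names and defaults may differ:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Fields are populated by name (case-insensitively) from the
    # environment and from the .env file.
    model_config = SettingsConfigDict(env_file=".env")

    llm_base_url: str
    llm_model: str
    llm_api_key: str = "dummy_api_key"

    stt_base_url: str
    stt_api_key: str = "dummy_api_key"
    stt_model: str = "whisper-base"
    stt_response_format: str = "verbose_json"
    language: str = "en"

    tts_base_url: str
    tts_api_key: str = "dummy_api_key"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    tts_backend: str = "kokoro"
    tts_audio_format: str = "pcm"

    mode: str = "UI"

settings = Settings()  # e.g., settings.llm_base_url
```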
Start the application using the provided shell script:

```bash
chmod +x run.sh
./run.sh
```

Or run it directly with Python:

```bash
python app.py
```

The application will be available at:

- UI mode: `http://localhost:7860`
- API mode: `http://localhost:7860/`
- Open the web interface in your browser
- Click the microphone icon or the "Click to Access Microphone" button, then allow microphone access when prompted
- Click the "Record" button to initialize the WebRTC connection
- Speak naturally after the connection is established
- The application will convert your speech to text, process it with the AI model, and provide an audio response
- The conversation history will be displayed in the chat window
- `app.py`: Main application server handling WebRTC connections and API endpoints
- `speech.py`: Client for speech-to-text and text-to-speech services
- `settings.py`: Configuration management using Pydantic
- `index.html`: Web interface with WebRTC client implementation
- `requirements.txt`: Python dependencies
- `run.sh`: Convenience script to run the application
- FastRTC: Handles WebRTC connections and audio streaming (see the sketch after this list)
- OpenAI API Client: Used for compatible interfaces with local APIs and cloud services
- Gradio: Provides UI components and server functionality
- Pydantic: Configuration and settings management
- WebRTC: Browser-based real-time communication
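For orientation, FastRTC's core abstraction is a `Stream` wrapped around a reply handler. A minimal audio echo sketch, assuming the current `fastrtc` API and with the handler reduced to an echo in place of the real STT → LLM → TTS pipeline:

```python
import numpy as np
from fastrtc import ReplyOnPause, Stream

def respond(audio: tuple[int, np.ndarray]):
    # Called with (sample_rate, samples) once the caller pauses.
    # The real app would run STT -> LLM -> TTS here; this just echoes.
    yield audio

stream = Stream(handler=ReplyOnPause(respond), modality="audio", mode="send-receive")

# Launching roughly mirrors the MODE setting:
stream.ui.launch()                    # MODE=UI: Gradio web interface
# stream.fastphone()                  # MODE=PHONE: temporary phone number
# app = FastAPI(); stream.mount(app)  # MODE=API: mount on a FastAPI app
```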
Simply update the `.env` file with the appropriate URLs and API keys:

```bash
LLM_BASE_URL="http://192.168.1.111:8880/v1"
LLM_MODEL="llama-3.2-3b-instruct-q4_k_m"
LLM_API_KEY="sk-1" # dummy API key required by the openai package

STT_BASE_URL="http://192.168.1.111:8880/v1" # or your FastWhisperAPI instance
STT_MODEL="whisper-base" # or "small.en"

TTS_BASE_URL="http://192.168.1.111:8880/v1" # or your FastKoko instance
TTS_MODEL="en-us-ryan-high.onnx" # or "kokoro"
TTS_VOICE="en-us-ryan-high.onnx" # or "af_heart"
TTS_BACKEND="piper" # or "kokoro"
```
LLM_BASE_URL="https://api.groq.com/openai/v1"
LLM_MODEL="llama3-8b-8192"
LLM_API_KEY="your-groq-api-key"
STT_BASE_URL="https://api.groq.com/openai/v1"
STT_MODEL="whisper-large-v3"
STT_API_KEY="your-groq-api-key"
TTS_BASE_URL="https://your-edge-tts-server/v1"
TTS_MODEL="tts-1-hd"
TTS_VOICE="en-US-AriaNeural"
Modify the TTS settings in `.env` to change voice characteristics:

```bash
TTS_VOICE="different_voice" # Voice ID depends on your TTS provider
```

The web interface can be customized by editing the `index.html` file. The interface uses standard HTML, CSS, and JavaScript.
- Ensure your AI services (local or cloud) are accessible
- Check that the URLs provided in `.env` are correct
- Verify API keys are valid for cloud services
- Verify that WebRTC is supported in your browser
- If behind a firewall, ensure WebRTC traffic is allowed
- Check microphone permissions in your browser
- Ensure audio output is enabled and volume is up
- Try a different browser if issues persist
- For local APIs, verify the models are properly loaded
- Local APIs: Ensure sufficient system resources for running models
- Cloud APIs: Check API quotas and rate limits
- Verify API endpoint formatting is correct for your chosen provider (see the connectivity check below)
- STT and TTS processing can be resource-intensive for local setups
- Smaller models may provide faster responses at the cost of quality
- Consider adjusting the concurrent user limit based on your server capacity
- Cloud APIs typically offer better performance but at a cost
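A quick way to verify that an OpenAI-compatible endpoint is reachable and that its key is accepted is to list its models. The base URL and key below are placeholders for the values in your `.env`:

```python
from openai import OpenAI

# Placeholders: substitute the base URL and key from your .env file.
client = OpenAI(base_url="http://192.168.1.111:8880/v1", api_key="dummy_api_key")

try:
    models = client.models.list()
    print("Endpoint OK, models:", [m.id for m in models.data])
except Exception as exc:
    print("Endpoint check failed:", exc)
```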
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch: `git checkout -b feature-name`
- Commit your changes: `git commit -m "Add feature description"`
- Push to the branch: `git push origin feature-name`
- Open a pull request.
This project is open source and available under the MIT License.
This project builds upon and integrates numerous open-source projects and commercial APIs:
- FastRTC by Hugging Face - For real-time communication capabilities
- WebRTC - For browser-based real-time communication protocol
- Gradio - For UI components and server functionality
- Pydantic - For configuration and validation
- Talk To Claude - The original project that inspired this adaptation
- LocalAI - For local deployment of AI models
- whisper.cpp - Backend for speech-to-text
- llama.cpp - Backend for text generation
- Piper - Backend for text-to-speech
- FastWhisperAPI - For efficient speech recognition
- MLC LLM - For local deployment of language models
- FastKoko - For local text-to-speech synthesis
- Kokoro - The underlying TTS model for FastKoko
- Groq - For cloud LLM and STT services
- Groq's Speech-to-Text API
- Groq's Text Chat API
- Microsoft Edge TTS - For cloud text-to-speech
- openai-edge-tts - For OpenAI-compatible interface to Edge TTS
- Please refer to `requirements.txt` for the complete list of Python dependencies.
We extend our sincere appreciation to all the developers and organizations that have made their work available for integration into this project.