Bolna Voice AI is built on a modular architecture that combines multiple AI components into seamless conversational agents. Understanding these core concepts will help you build more effective voice AI applications. Bolna helps you create AI agents that can be instructed to perform tasks using a modular pipeline:

Input medium

The channel through which users interact with your agent:
  • Voice conversations: Microphone or phone call
  • Text conversations: Keyboard input via chat interfaces
  • Visual conversations: Image inputs (Coming soon)

ASR (Automatic Speech Recognition)

The transcriber component converts spoken input into text that the LLM can process. Bolna supports multiple ASR providers, including Deepgram, Azure, AssemblyAI, and Sarvam.
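
To make this concrete, a transcriber stage might be configured along the lines of the sketch below. The key names (`provider`, `language`, `stream`) are illustrative assumptions, not necessarily Bolna's exact schema.

```python
# Hypothetical transcriber (ASR) settings -- key names are illustrative,
# not necessarily Bolna's exact configuration schema.
transcriber_config = {
    "provider": "deepgram",  # or "azure", "assemblyai", "sarvam"
    "language": "en",        # language the caller is expected to speak
    "stream": True,          # emit partial transcripts for lower latency
}
```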

LLM (Large Language Model)

The LLM component processes the transcribed input and generates appropriate responses. It’s the “brain” of your agent that understands context and makes decisions. Bolna integrates with OpenAI, Azure OpenAI, Anthropic, and other providers.
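
A corresponding LLM stage might look like the following sketch; again, the key names and values are assumptions for illustration:

```python
# Hypothetical LLM settings -- key names and values are illustrative.
llm_config = {
    "provider": "openai",  # or "azure_openai", "anthropic", ...
    "model": "gpt-4o",     # any chat-capable model your provider offers
    "temperature": 0.2,    # lower values keep responses focused and consistent
    "system_prompt": "You are a polite support agent for Acme Corp.",
}
```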

TTS (Text-to-Speech) / Synthesizer

The voice synthesizer converts the LLM’s text response into natural-sounding speech. Choose from providers such as ElevenLabs, Azure, and Cartesia.
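
A synthesizer stage, sketched with assumed key names:

```python
# Hypothetical synthesizer (TTS) settings -- key names are illustrative.
synthesizer_config = {
    "provider": "elevenlabs",  # or "azure", "cartesia", ...
    "voice": "Rachel",         # a provider-specific voice identifier
    "stream": True,            # play audio chunks as they are generated
}
```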

Output component

Delivers the agent’s response back to the user through the appropriate medium (voice, text, or visual).
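
Conceptually, an agent definition bundles these stages together with an input and output medium. The sketch below shows the overall shape under assumed key names; it is not Bolna's exact payload.

```python
# Conceptual agent definition assembling the full pipeline.
# All keys and values are assumptions for illustration only.
agent_config = {
    "agent_name": "support-agent",
    "input": {"medium": "telephony"},                  # voice via a phone call
    "transcriber": {"provider": "deepgram"},           # speech -> text (ASR)
    "llm": {"provider": "openai", "model": "gpt-4o"},  # text -> response text
    "synthesizer": {"provider": "elevenlabs"},         # response text -> speech (TTS)
    "output": {"medium": "telephony"},                 # audio back to the caller
}
```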

What tasks can agents perform?

Bolna lets you instruct your agent to execute tasks both during and after conversations:

Real-time tasks

Tasks the agent executes while the conversation is in progress.

Post-conversation tasks

Tasks that run once the conversation has ended.

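As a rough sketch of how the two task types might be expressed side by side (task types and keys are hypothetical, not Bolna's schema):

```python
# Hypothetical task list -- task types and keys are illustrative.
tasks = [
    {
        "type": "conversation",           # real-time: runs while the call is live
        "tools": ["check_order_status"],  # hypothetical function the agent may invoke
    },
    {
        "type": "summarization",          # post-conversation: runs after the call ends
        "deliver_to": "webhook",          # e.g. push the summary to your backend
    },
]
```
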
Next steps

Ready to build your first agent? Start with the agent setup guide or explore provider integrations to configure your components.