Jetson Generative AI – LlamaSpeak

Vision wasn’t the only modality transformed by Transformers. **LlamaSpeak** brings the power of large language models to spoken conversations on your Jetson device: it streams speech to text (ASR), generates intelligent responses with an LLM, and streams synthesized speech (TTS) back out in real time.

In this article you’ll learn how to run NanoLLM’s WebChat agent (nicknamed LlamaSpeak) on Jetson using NVIDIA TensorRT/MLC and Riva Speech Skills.

Requirements

| Hardware / Software | Notes |
| --- | --- |
| Jetson AI Kit / Dev Kit | AGX Orin / Orin NX recommended for best latency |
| JetPack 6 (L4T r36.x) | Needed for the latest pre-built containers |
| USB microphone & speakers / headset | Confirm with `arecord -l` |
| Riva Speech Skills 2.15+ | Provides the low-latency ASR engine |
| Hugging Face token | Needed for the gated Meta-Llama weights |

Obtaining Your Hugging Face Token

To download the gated Llama checkpoints you’ll need a personal access token (PAT) from Hugging Face:

1. Create / sign into your account at https://huggingface.co
2. Click your avatar ▸ Settings ▸ Access Tokens
3. Press New Token, select the "Read" scope, name it (e.g., `jetson-llamaspeak`), and click Generate
4. Copy the token string that starts with `hf_` — you’ll export it in Step 7 below:
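You can stash the token in an environment variable right away. The variable name `HUGGINGFACE_TOKEN` follows the jetson-containers convention; the token value below is a placeholder:

```shell
# Placeholder token -- replace with your own hf_ value from the steps above.
# HUGGINGFACE_TOKEN is the variable name the jetson-containers tooling
# conventionally reads; confirm against your version's docs.
export HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```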

Step-by-Step Setup

1.  Clone the repository

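Assuming the repository in question is dusty-nv/jetson-containers (the project that hosts the pre-built NanoLLM containers), the clone looks like:

```shell
# Clone the jetson-containers project (assumed source of the NanoLLM/WebChat containers)
git clone https://github.com/dusty-nv/jetson-containers
```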

2.  Enter the repo

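Change into the freshly cloned directory (the name follows from the clone above):

```shell
cd jetson-containers
```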

3.  Update APT & install pip3

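A typical APT refresh plus pip install on JetPack's Ubuntu base:

```shell
# Refresh package lists, then install Python's package manager
sudo apt update
sudo apt install -y python3-pip
```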

4. Install helper Python packages

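The helper packages are the Python dependencies of the repo's own tooling (e.g., the `autotag` script used later). The `requirements.txt` path is an assumption based on the jetson-containers layout:

```shell
# Run from inside the jetson-containers directory
pip3 install -r requirements.txt
```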

5.  Verify audio devices

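ALSA's standard listing commands show whether your USB audio hardware is visible:

```shell
# List capture (microphone) devices
arecord -l
# List playback (speaker/headset) devices
aplay -l
```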

If your microphone doesn’t appear, check USB connections and reboot.

6.  Start the Riva server

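The exact commands depend on how you obtained Riva. A typical sequence, assuming you downloaded the arm64 Riva Quick Start scripts from the NGC catalog (the directory name varies with the Riva version and is hypothetical here):

```shell
cd riva_quickstart_arm64_v2.15.1   # hypothetical version/path -- match your download
bash riva_init.sh    # one-time setup: downloads and optimizes the speech models
bash riva_start.sh   # starts the Riva server container
```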

Accept the license prompt and wait until the log shows State = READY.

7.  Launch LlamaSpeak

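A typical invocation, following the NanoLLM `web_chat` examples; the flags, ASR/TTS backends, and model name shown here are assumptions to adapt to your NanoLLM version:

```shell
# Launch NanoLLM's WebChat agent (a.k.a. LlamaSpeak) inside the container.
# HUGGINGFACE_TOKEN must hold the hf_ token from Step 7 of the token section.
jetson-containers run --env HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN \
  $(autotag nano_llm) \
  python3 -m nano_llm.agents.web_chat --api=mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --asr=riva --tts=piper
```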

The first run downloads the model weights (~9 GB for the 8-bit build, ~4 GB for 4-bit) and the container layers.

8.  Open the Web UI

When the console prints `WebChat serving at https://0.0.0.0:8050`:

  • On-device: open a browser on the Jetson and visit `https://localhost:8050`

  • Remote: replace `<jetson-ip>` with your board’s address and visit `https://<jetson-ip>:8050`

LLM Chat Interface

9.  Talk to your Jetson

Grant microphone access when prompted. Try interrupting mid-reply; LlamaSpeak will pause TTS and listen.

10.  (Advanced) Enable multimodality

Launch LlamaSpeak with a vision-language model (VLM) and drag an image into the chat:

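This mirrors the Step 7 launch command but swaps in a vision-language model. VILA-7b is the checkpoint mentioned in the demos below; the exact Hugging Face repo name is an assumption:

```shell
jetson-containers run --env HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN \
  $(autotag nano_llm) \
  python3 -m nano_llm.agents.web_chat --api=mlc \
    --model Efficient-Large-Model/VILA-7b \
    --asr=riva --tts=piper
```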


See the NanoVLM docs for more supported checkpoints.

Multimodal Chat Interface

LlamaSpeak in Action

Below are demonstration videos showing LlamaSpeak’s capabilities in both text-only and multimodal scenarios.

Text-Only LLM Chat

Experience natural voice conversations with LlamaSpeak using the Meta-Llama-3-8B-Instruct model. The system provides real-time speech recognition, intelligent responses, and natural text-to-speech output.

LLM Demo Video:

Multimodal Vision Chat

See how LlamaSpeak analyzes images while maintaining voice conversation using the VILA-7b vision-language model. Upload images by dragging them into the chat interface.

Multimodal Demo Video:

Troubleshooting

How to Resolve the NGC Registry 401 Unauthorized Error

After signing up and logging in to the NGC Catalog website, generate an API key from the “Setup” section. Then run `docker login nvcr.io` in the terminal. At the login prompt, enter `$oauthtoken` as the username and the generated API key as the password. If the message “Login Succeeded” appears, authentication has completed successfully.
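In command form, the login flow described above looks like this (the API key is of course a placeholder):

```shell
docker login nvcr.io
# Username: $oauthtoken        <- literal string, including the dollar sign
# Password: <your NGC API key>
```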

“model-repo” Error When Running Riva Manually Without riva_start.sh

The tutorial only invokes `riva_start.sh`, which handles model-repository setup internally. However, when running Riva manually with `docker run`, we encountered a "model-repo" error due to missing model-repository configuration.

To resolve this, we explicitly defined the model path using both an environment variable and a bind mount:

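A sketch of the manual `docker run` described above; the image tag, ports, and host path are placeholders to substitute with your Riva version and model directory:

```shell
docker run -d --runtime nvidia \
  --env RIVA_MODEL_REPO=/data/models \
  -v /path/to/riva/models:/data/models \
  -p 50051:50051 \
  nvcr.io/nvidia/riva/riva-speech:<version>
```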

Setting the `RIVA_MODEL_REPO` environment variable and bind-mounting the local models directory to the container path `/data/models` resolved the error.

How to Prevent Port Conflicts (“port is already in use” Error)

The default port (e.g., 8050) may already be occupied by another service, producing a “port is already in use” error. To avoid this, specify custom ports when launching LlamaSpeak, using flags such as `--web-port 8443` and `--ws-port 9443`. This lets the web UI run without interfering with other applications.

How to Resolve the Microphone Not Detected Issue

When connecting via HTTPS, the browser may not automatically prompt for microphone or speaker permissions. In that case, bypass the security warning by selecting “Advanced” > “Proceed”, then, after the page loads, grant microphone and speaker permissions manually by clicking the padlock icon in the address bar.

How to Fix the Riva Logs Getting Stuck in an Infinite “Waiting” Loop

If the Riva logs show an endless “waiting” loop, follow these steps:

  • Wait until the status changes to READY before proceeding.

  • If the loop persists, it may indicate insufficient VRAM. In that case, try selecting a smaller model to reduce memory requirements.

This approach helps resolve the “waiting” hang and ensures Riva initializes properly.


| Issue | Fix |
| --- | --- |
| Mic not detected | Use a powered USB sound card; reconfirm with `arecord -l` |
| Stuck on “Waiting for Riva” | Ensure the Riva container is running and its ports are exposed |
| Out-of-memory errors | Use a 4-bit quantized checkpoint or a smaller 7B model (e.g., Mistral-7B) |

For more information about NanoLLM and advanced configurations, visit the NanoLLM GitHub repository.