Jetson Generative AI – Live LLaVA

Vision-Language Models reach new heights when applied to live video streams. Live LLaVA demonstrates real-time multimodal AI that can see, understand, and describe what’s happening in your camera feed, continuously and entirely on your Jetson device.
In this article you’ll learn how to run Live LLaVA with optimized vision-language models like LLaVA and VILA, featuring hardware-accelerated video processing and real-time inference capabilities.

 

Requirements

 

| Hardware / Software | Notes |
| --- | --- |
| Jetson AGX Orin (64GB) | Recommended for best performance |
| Jetson AGX Orin (32GB) | Good performance for most use cases |
| Jetson Orin NX (16GB) | Solid performance |
| Jetson Orin Nano (8GB) | Minimum requirement – use smaller models |
| JetPack 6 (L4T r36.x) | Required for latest optimizations |
| USB camera or CSI camera | For live video input |
| NVMe SSD (highly recommended) | For storage speed and space |
| 22GB for `nano_llm` container | Container image storage |
| >10GB for models | Vision-language model storage |
Note: Follow the NanoVLM tutorial first to familiarize yourself with vision/language models, and see Agent Studio for an interactive pipeline editor.

Supported Models

The following vision-language models are optimized for Live LLaVA:

LLaVA Models:
`liuhaotian/llava-v1.5-7b`
`liuhaotian/llava-v1.5-13b`
`liuhaotian/llava-v1.6-vicuna-7b`
`liuhaotian/llava-v1.6-vicuna-13b`
VILA Models:
`Efficient-Large-Model/VILA-2.7b`
`Efficient-Large-Model/VILA-7b`
`Efficient-Large-Model/VILA-13b`
`Efficient-Large-Model/VILA1.5-3b`
`Efficient-Large-Model/Llama-3-VILA1.5-8B`
`Efficient-Large-Model/VILA1.5-13b`
Jetson Orin Nano Compatible Models:
VILA-2.7b
VILA1.5-3b
VILA-7b
Llava-7b
Obsidian-3B

Step-by-Step Setup

1. Verify Camera Connection

Check that your camera is properly connected and detected:

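One way to verify, assuming a USB camera attached to the host (the `v4l2-ctl` tool ships with the `v4l-utils` package):

```shell
# List the video device nodes the kernel has created
ls /dev/video*

# Show attached cameras and the pixel formats/resolutions they support
sudo apt-get install -y v4l-utils
v4l2-ctl --list-devices
v4l2-ctl --device /dev/video0 --list-formats-ext
```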

2. Launch Live LLaVA

Start the VideoQuery agent with your camera:

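A typical launch command, modeled on the NanoLLM examples (the model and flag values here are illustrative and may differ between container versions):

```shell
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output
```

The WebRTC output stream is then viewable through the browser UI described in the next step.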

3. Access the Web Interface

Navigate your browser to:
https://<jetson-ip>:8050
⚠️ Chrome Recommended: For best WebRTC performance, use Chrome with `chrome://flags#enable-webrtc-hide-local-ips-with-mdns` disabled.

4. Configure Prompts

In the web interface, you can:

Set custom prompts for continuous analysis
Adjust inference frequency for real-time performance
Monitor live video feed with AI descriptions

[Demo video: Live LLaVA face detection]

Real-time Object Detection

Live LLaVA can continuously analyze your video feed, detecting and describing objects, people, and activities in real-time:

[Demo video: Live LLaVA object detection]

Custom Prompting

You can customize the analysis with specific prompts:

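As a sketch, a prompt can also be supplied on the command line via `--prompt` (verify that your NanoLLM version supports this flag; the prompt text is just an example):

```shell
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output \
    --prompt 'How many people are in the frame?'
```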

Pre-recorded Video Analysis

Process existing video files instead of live camera feeds:
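For example, mounting a host directory into the container and pointing `--video-input` at a file (the paths here are placeholders):

```shell
jetson-containers run \
  -v /path/to/videos:/mount/videos \
  $(autotag nano_llm) \
    python3 -m nano_llm.agents.video_query --api=mlc \
      --model Efficient-Large-Model/VILA1.5-3b \
      --max-new-tokens 32 \
      --video-input /mount/videos/input.mp4 \
      --video-output /mount/videos/output.mp4
```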

Supported Formats

Input Formats:
MP4, MKV, AVI, FLV (with H.264/H.265 encoding)
Live network streams (RTP, RTSP, WebRTC)
USB/CSI cameras
Output Formats:
Video files (MP4, AVI, etc.)
Network streams (WebRTC, RTSP)
Display output

NanoDB Integration

Enable reverse-image search and database tagging by integrating with NanoDB:

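For instance, assuming you have already built a NanoDB index (e.g. the COCO index from the NanoDB tutorial at `/data/nanodb/coco/2017`), add the `--nanodb` flag to the launch command:

```shell
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output \
    --nanodb /data/nanodb/coco/2017
```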

This enables:

Reverse-image search against your database
One-shot recognition tasks via web UI
Automatic tagging of incoming images

Video VILA – Multi-frame Analysis

VILA-1.5 models can analyze multiple frames simultaneously for temporal understanding:
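A sketch of a multi-frame query, following the NanoLLM vision examples (the video path and prompt are placeholders; check the flags against your container version):

```shell
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.vision.video \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-images 8 \
    --max-new-tokens 48 \
    --video-input /data/my_video.mp4 \
    --prompt 'What changes occurred in the video?'
```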

Troubleshooting

How to fix freezing issues while loading the model?

The documentation uses the old awq4 quantization; use the `--quantization q4f16_1` parameter instead.
The 13B model eventually freezes on the Jetson AGX Orin 32GB because it runs out of tokens; if speed is needed, we recommend using VILA-7B or VILA-2.7B instead.

How to fix the issue of the camera not being detected?

To make a USB camera accessible inside the container, add the parameter `--device /dev/video0` when running the container. This maps the host’s camera device into the container, allowing applications inside to access the video stream as if they were running natively on the host system.
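For example (extra arguments to `jetson-containers run` are passed through to `docker run`; the model and output values are illustrative):

```shell
# Map the host camera into the container, then use it as the video input
jetson-containers run --device /dev/video0 $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output
```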

How to avoid color distortion issues with the Logitech C505e using the MJPEG codec?

To prevent color distortion problems on the Logitech C505e camera, we recommend using the `--video-input-codec mjpeg` parameter. This forces the camera to use the MJPEG codec, which is better supported and helps maintain accurate color reproduction.

Resolution Limitation

For stable FPS performance, use the parameters `--video-input-width 1280` and `--video-input-height 720`. These settings limit the video resolution to 1280×720, helping maintain smoother and more consistent frame rates.
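Combining the codec and resolution workarounds above into one launch command (illustrative values; adjust the model and output for your setup):

```shell
jetson-containers run --device /dev/video0 $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --video-input /dev/video0 \
    --video-input-codec mjpeg \
    --video-input-width 1280 \
    --video-input-height 720 \
    --video-output webrtc://@:8554/output
```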

 

| Issue | Fix |
| --- | --- |
| Camera not detected | Check the USB connection; verify with `ls /dev/video*` |
| WebRTC not working | Use Chrome; disable the WebRTC local IP hiding flag |
| Out of memory errors | Use a smaller model (VILA1.5-3b); reduce the context length |
| Low frame rate | Reduce `--max-new-tokens`, use a smaller model, check the camera resolution |
| Video codec errors | Verify the input format is H.264/H.265; check the `jetson_utils` installation |

For more information about Live LLaVA and advanced configurations, visit the NanoLLM GitHub repository.