
How to Run Llama.cpp Server on Jetson AGX Thor?

Llama.cpp Server on Jetson AGX Thor: Unlocking Edge AI with Large Language Models

Llama.cpp Server is a lightweight, high-performance runtime for large language models (LLMs), designed to run efficiently on both CPU and GPU. Written in C/C++, it avoids unnecessary overhead and applies deep hardware-level optimizations. By supporting the GGUF model format, it enables quantization, drastically reducing memory requirements with only a modest accuracy trade-off. Through its REST API, Llama.cpp Server can be integrated seamlessly into applications, letting developers bring advanced LLM capabilities directly to devices without relying on the cloud.

When deployed on NVIDIA Jetson AGX Thor, the advantages become even more compelling:

  • GPU acceleration with CUDA ensures that the Thor’s compute power is fully utilized, bringing real-time inference to the edge.
  • Optimized for edge AI use cases such as robotics, autonomous systems, and industrial automation, it provides ultra-low latency decision-making.
  • Resource efficiency via quantization makes it possible to run models from 7B up to 13B parameters within the limited memory budgets typical of embedded devices.

By combining Llama.cpp Server with Jetson AGX Thor, organizations gain a powerful platform for on-device AI that is private, fast, and cost-effective. No data needs to leave the device, latency is minimized, and the system remains fully adaptable to both prototyping and production scenarios. Supported by an open-source ecosystem, this pairing represents a breakthrough for deploying large language models securely and efficiently at the edge.

Featured Product
NVIDIA Jetson AGX Thor Developer Kit
A next-generation developer kit with Blackwell architecture for robotics and edge AI.

Requirements

  • JetPack 7 (Learn more about JetPack)
  • CUDA 13
  • At least 10 GB of free disk space (for the Llama.cpp Server image only; the models require additional space)
  • A stable and fast internet connection

How to Use Llama.cpp Server?

First, download the Llama.cpp Server container image:

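A representative way to pull and start the image is sketched below; the registry, image name, and tag are placeholders for the JetPack 7 / CUDA 13 build you intend to use.

    # Pull the Llama.cpp Server container image (placeholder name and tag).
    sudo docker pull <registry>/llama-cpp-server:<tag>

    # Later steps assume the container runs with GPU access and a model
    # directory mounted at /data/models.
    sudo docker run -it --rm --runtime nvidia -v /data/models:/data/models \
        <registry>/llama-cpp-server:<tag>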

Then, download the model from Hugging Face. If the model requires access, log in with your token by running:

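A minimal sketch with huggingface-cli (it ships with the huggingface_hub Python package), assuming the Qwen3-4B-Instruct-2507 model used in the rest of this guide:

    # Log in with your Hugging Face access token (needed only for gated models).
    huggingface-cli login --token <YOUR_HF_TOKEN>

    # Download the model into the local Hugging Face cache.
    huggingface-cli download Qwen/Qwen3-4B-Instruct-2507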

Then, install the required Python dependencies with the following command:

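For example, assuming a local clone of the llama.cpp repository, which ships a pinned requirements file for its conversion script:

    # Install the Python packages needed by the HF-to-GGUF converter.
    pip3 install --upgrade pip
    pip3 install -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt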

The following command set downloads the NVIDIA NVPL local repository package, installs it, adds its signing key to the system, and then installs the NVPL library via apt-get.

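NVPL provides NVIDIA's optimized CPU math libraries (including BLAS) for Arm, which llama.cpp's CPU backend can take advantage of. The placeholders below stand in for the exact repository package name and version listed on NVIDIA's NVPL download page:

    # Download and install the NVPL local repository package (placeholder names).
    wget https://developer.download.nvidia.com/compute/nvpl/<version>/local_installers/<nvpl-local-repo>.deb
    sudo dpkg -i <nvpl-local-repo>.deb

    # Add the repository signing key, then install NVPL via apt-get.
    sudo cp /var/<nvpl-local-repo>/*-keyring.gpg /usr/share/keyrings/
    sudo apt-get update
    sudo apt-get -y install nvpl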

The following command takes the Qwen3-4B-Instruct-2507 model downloaded from Hugging Face (from its snapshot folder in the local Hugging Face cache) and uses the convert_hf_to_gguf.py tool to convert the Hugging Face weights (safetensors/PyTorch) into GGUF format, saving the output as /data/models/Qwen3-4B-Instruct-2507-f16.gguf.

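A sketch of the conversion step, assuming the model sits in the default Hugging Face cache and llama.cpp is cloned locally (the snapshot hash is specific to your download):

    python3 llama.cpp/convert_hf_to_gguf.py \
        ~/.cache/huggingface/hub/models--Qwen--Qwen3-4B-Instruct-2507/snapshots/<snapshot-hash> \
        --outfile /data/models/Qwen3-4B-Instruct-2507-f16.gguf \
        --outtype f16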

The following command takes the FP16 GGUF model (Qwen3-4B-Instruct-2507-f16.gguf) and runs it through llama-quantize to produce a quantized version (Qwen3-4B-Instruct-2507-q4_k_m.gguf) using the q4_k_m quantization method.

  • Input file: /data/models/Qwen3-4B-Instruct-2507-f16.gguf (the FP16 model converted from Hugging Face).
  • Output file: /data/models/Qwen3-4B-Instruct-2507-q4_k_m.gguf (smaller, quantized model).
  • Quantization type: q4_k_m → a 4-bit quantization scheme optimized for speed and memory efficiency.
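
For example, assuming the llama.cpp binaries are available on the PATH inside the container:

    # Quantize the FP16 GGUF model down to 4-bit (Q4_K_M).
    llama-quantize /data/models/Qwen3-4B-Instruct-2507-f16.gguf \
        /data/models/Qwen3-4B-Instruct-2507-q4_k_m.gguf Q4_K_M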

The following command launches the llama.cpp server so the quantized model can be served via an HTTP API.

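One representative invocation (the host, port, context size, and GPU-offload values below are example settings, not fixed requirements):

    # Serve the quantized model over HTTP; -ngl 99 offloads all layers to the GPU.
    llama-server -m /data/models/Qwen3-4B-Instruct-2507-q4_k_m.gguf \
        --host 0.0.0.0 --port 8080 -ngl 99 -c 4096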

And that's it! You can start chatting.
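Once the server is running, you can talk to it through its OpenAI-compatible chat endpoint, for example with curl (port 8080 matches the launch example above):

    curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "messages": [
                {"role": "user", "content": "Hello from Jetson AGX Thor!"}
              ]
            }'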
