How to Run MLC LLM on Jetson AGX Thor?

What is MLC LLM?

MLC LLM (Machine Learning Compilation for Large Language Models) is an open-source project designed to make large language models (LLMs) run efficiently across different hardware platforms. Its main goal is to optimize performance and reduce energy consumption, enabling AI applications to run not only in the cloud but also on edge devices.

NVIDIA’s next-generation Jetson AGX Thor platform delivers powerful computing capabilities for robotics, autonomous systems, and AI-driven applications. By leveraging MLC LLM on Jetson AGX Thor, large language models can be optimized to run in real time, supporting tasks such as natural language processing, decision-making, and human-like interaction with higher efficiency.

In short, MLC LLM on Jetson AGX Thor acts as a bridge that brings high-performance large language model capabilities to edge devices.

Requirements

  • JetPack 7 (Learn more about JetPack)
  • CUDA 13
  • At least 25 GB of free disk space (for the MLC LLM image only; the models require additional space)
  • A stable and fast internet connection

How to use MLC LLM?

First, pull and run the MLC LLM Docker image on your Jetson:

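The exact image name and tag come from the registry linked below; as a hedged sketch, assuming a hypothetical image reference of the form ghcr.io/&lt;org&gt;/&lt;image&gt;:&lt;tag&gt;, pulling and starting the container on the Jetson looks roughly like this:

```bash
# Pull the MLC LLM container image (replace ghcr.io/<org>/<image>:<tag>
# with the actual image listed on the GitHub Container Registry).
docker pull ghcr.io/<org>/<image>:<tag>

# Start an interactive container with GPU access via the NVIDIA runtime.
# Port 8000 is forwarded for the serving step later; the volume mount is
# optional and only keeps downloaded models outside the container.
docker run --runtime nvidia -it --rm \
  -p 8000:8000 \
  -v "$HOME/models:/models" \
  ghcr.io/<org>/<image>:<tag>
```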

If you’d like to explore the available images or replace them with newer ones, you can visit the GitHub Container Registry.

Once inside the container, find the model you want to download from Hugging Face.
Use the hf download command inside the container to download the model.

For example:

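A minimal sketch, using an illustrative model name (swap in whichever Hugging Face model you actually want to run):

```bash
# Download the model into the Hugging Face cache inside the container.
# The files end up under:
#   ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B-Instruct/snapshots/<hash>/
hf download Qwen/Qwen2.5-1.5B-Instruct
```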

In the next step, point the command below at the folder where you downloaded the model and run it.
This command converts the model’s original Hugging Face weights (in safetensors format) into the optimized MLC LLM format. During conversion, the weights are quantized (e.g., to q4bf16_1), which reduces memory usage and improves runtime efficiency on the GPU without heavily sacrificing accuracy.

In short, mlc_llm convert_weight takes the raw model checkpoint and transforms it into a format that can be directly executed by the MLC runtime on your target device (e.g., Jetson AGX Thor with CUDA).

⚠️ Warning: In the command, replace the placeholder in the snapshots/ path with the actual folder name you see inside the snapshots directory (e.g., aeb13307a71acd8fe81861d94ad54ab689df…). This folder contains the real model files such as config.json, tokenizer.json, and model.safetensors, which are required for the mlc_llm convert_weight command to work.

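A sketch of the conversion step under the assumptions above (the model name, snapshot hash, and output path are placeholders; the quantization mode follows the q4bf16_1 example from this guide):

```bash
# Convert the Hugging Face safetensors weights into the MLC LLM format.
# Replace <hash> with the real folder name inside the snapshots directory.
mlc_llm convert_weight \
  ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B-Instruct/snapshots/<hash>/ \
  --quantization q4bf16_1 \
  -o ./dist/Qwen2.5-1.5B-Instruct-q4bf16_1-MLC
```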

In the next step, gen_config generates the configuration files needed to run the converted model in MLC. It defines the conversation template (e.g., the Qwen format), context length, batch size, and other runtime parameters. In short, it makes the weight-converted model fully executable in the MLC runtime.

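A sketch of the gen_config step with the same placeholder paths; the conversation template name is an assumption and should match your model family:

```bash
# Generate mlc-chat-config.json and copy the tokenizer files into the
# same output directory that convert_weight produced.
mlc_llm gen_config \
  ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B-Instruct/snapshots/<hash>/ \
  --quantization q4bf16_1 \
  --conv-template qwen2 \
  -o ./dist/Qwen2.5-1.5B-Instruct-q4bf16_1-MLC
```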

⚠️ Note: The “Not found” messages for files like tokenizer.model or added_tokens.json are not errors. These files are optional and not required by all models. As long as tokenizer.json, vocab.json, and merges.txt are found and copied, the model configuration is complete and ready to run.

Now that the configuration is ready, we can move on to the compilation step. In this stage, the model is compiled into a CUDA-optimized shared library (.so file), which enables fast execution on the GPU.

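A sketch of the compile step, again with placeholder paths; --device cuda targets the Jetson’s GPU:

```bash
# Compile the configured model into a CUDA-optimized shared library (.so).
mlc_llm compile \
  ./dist/Qwen2.5-1.5B-Instruct-q4bf16_1-MLC/mlc-chat-config.json \
  --device cuda \
  -o ./dist/libs/Qwen2.5-1.5B-Instruct-q4bf16_1-cuda.so
```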

With the compilation complete, the final step is to serve the model so it can handle inference requests. The mlc_llm serve command launches an HTTP server that exposes the model as an API endpoint, making it accessible for testing or integration into applications.

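A sketch of the serve step using the paths from the previous commands; the host and port are assumptions (8000 is the default port):

```bash
# Serve the compiled model over an OpenAI-compatible HTTP API.
mlc_llm serve ./dist/Qwen2.5-1.5B-Instruct-q4bf16_1-MLC \
  --model-lib ./dist/libs/Qwen2.5-1.5B-Instruct-q4bf16_1-cuda.so \
  --host 0.0.0.0 \
  --port 8000
```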

If the server starts and begins listening on the configured port without errors, the model has been successfully compiled and is now being served.

You can test it with the following curl request:

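A sketch of a test request against the OpenAI-compatible chat completions endpoint; the model field should match the model path passed to mlc_llm serve:

```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "./dist/Qwen2.5-1.5B-Instruct-q4bf16_1-MLC",
    "messages": [
      {"role": "user", "content": "Hello! Summarize what MLC LLM does in one sentence."}
    ]
  }'
```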

Which Jetson should I choose for my LLM model?

Below, you can find the RAM requirements of the most popular LLMs along with Jetson recommendations that meet the minimum specifications to run them. You can choose the one that best fits your needs.

| Model | Parameters | Quantization | Required RAM (GB) | Recommended Minimum Jetson |
| --- | --- | --- | --- | --- |
| deepseek-ai DeepSeek-R1 Base | 684B | Dynamic-1.58-bit | 162.11 | Not supported (exceeds 128 GB) |
| deepseek-ai DeepSeek-R1 Distill-Qwen-1.5B | 1.5B | Q4_K_M | 0.90 | Jetson Orin Nano 4 GB, Jetson Nano 4 GB |
| deepseek-ai DeepSeek-R1 Distill-Qwen-7B | 7B | Q5_K_M | 5.25 | Jetson Orin Nano 8 GB, Jetson Orin NX 8 GB, Jetson Xavier NX 8 GB |
| mistralai Mixtral 8x22B-Instruct-v0.1 | 22B | Q4_K_M | 13.20 | Jetson Orin NX 16 GB, Jetson AGX Orin 32 GB, Jetson AGX Xavier 32 GB |
| mistralai Mathstral 7B-v0.1 | 7B | Q5_K_M | 5.25 | Jetson Orin Nano 8 GB, Jetson Orin NX 8 GB, Jetson Xavier NX 8 GB |
| google gemma-3 12b-it | 12B | Q4_K_M | 7.20 | Jetson Orin NX 8 GB, Jetson Orin Nano 8 GB, Jetson Xavier NX 8 GB |
| meta-llama Llama-3.1 70B-Instruct | 70B | Q5_K_M | 52.50 | Jetson AGX Orin 64 GB, Jetson AGX Xavier 64 GB, Jetson AGX Thor (T5000) 128 GB |