llama.cpp Multi-Model Server Architecture: ASUS Zenbook UM3504DA

variety4me@lemmy.zip · 12 天前

I have multiple models configured, but only one model is loaded at any given time

llama-server --port 8080 --host 0.0.0.0 \
  --models-preset /mnt/data/ai/models.ini \
  --models-max 1 \
  --tools all

look at models-max, since my hardware is potato, i dont want to load more than a model at a time, but if your hardware allows, you can load more than one.

from llama-server web interface, i can switch models based on what i need to get done (no need for llama swap).

hope this helps!

variety4me@lemmy.zip · 12 天前

Nice write-up. Thanks for also including all the numbers. If I might ask: What is the thermal/throttling behaviour you mention? Is it still within the laptop’s thermal budget? Or does it reach throttling territory when doing inference on a long context window?

Thanks! Nowhere near throttling territory. The thermal ceiling is 95°C, and my sustained load is 72-78°C. There’s a 17-23°C safety margin.

The laptop never throttles. But the fans also don’t spin up properly during GPU-heavy inference unless I force it.

Why? ASUS’s embedded controller monitors CPU temperature and CPU package power for fan curves. It largely ignores GPU temperature.

variety4me@lemmy.zip · 12 天前

llama.cpp Multi-Model Server Architecture: ASUS Zenbook UM3504DA

variety4me@lemmy.zip · 26 天前

Memory:
  System RAM: total: 32 GiB available: 31.19 GiB used: 8.76 GiB (28.1%)
  Array-1: capacity: 64 GiB slots: 4 modules: 2 EC: None
  Device-1: ChannelA-DIMM0 type: DDR4 size: 16 GiB speed: spec: 2667 MT/s
    actual: 2666 MT/s
  Device-2: ChannelA-DIMM1 type: no module installed
  Device-3: ChannelB-DIMM0 type: DDR4 size: 16 GiB speed: spec: 2667 MT/s
    actual: 2666 MT/s
  Device-4: ChannelB-DIMM1 type: no module installed

variety4me@lemmy.zip · 26 天前

would you laugh at me if I ran gemma-4-26b on a 4 core Xeon, with 32GB RAM, no GPU?

variety4me@lemmy.zip · 1 个月前

I have a cpu only build of llama.cpp on a 32GB LPDDR5, 6400 MT/s. The laptop has an AMD Radeon 680M, but that is used for wayland, browser and GPU accelerated terminals.

Running llama-server, this is the performance: gemma-4 26B - 10.53 t/s

here is my llama-server command:

llama-server --port 8080 --host 0.0.0.0 --models-preset /mnt/data/ai/models.ini --ctx-size 8192 -ngl 0 --mlock --no-mmap

and here is the model.ini file

version = 1

[*]
; Global defaults - CPU-optimized
seed = -1
top-p = 0.95
top-k = 20
min-p = 0.05
presence-penalty = 0.0
repeat-penalty = 1.1
models-max = 2
jinja = true
batch-size = 256
ubatch-size = 128
threads = 8
threads-batch = 4
cpu-range = 0-7
cpu-strict = 1
kv-offload = false
poll = 25
poll-batch = 50
cpu-moe = true

[gemma-4-26b]
model = /mnt/data/models/daily/google_gemma-4-26B-A4B-it-Q4_K_M.gguf
temperature = 0.65
reasoning-budget = 384
repeat-penalty = 1.05