

Nice write-up. Thanks for also including all the numbers. If I might ask: What is the thermal/throttling behaviour you mention? Is it still within the laptop’s thermal budget? Or does it reach throttling territory when doing inference on a long context window?
Thanks! Nowhere near throttling territory. The thermal ceiling is 95°C, and my sustained load is 72-78°C. There’s a 17-23°C safety margin.
The laptop never throttles. But the fans also don’t spin up properly during GPU-heavy inference unless I force it.
Why? ASUS’s embedded controller monitors CPU temperature and CPU package power for fan curves. It largely ignores GPU temperature.





I have multiple models configured, but only one model is loaded at any given time
llama-server --port 8080 --host 0.0.0.0 \ --models-preset /mnt/data/ai/models.ini \ --models-max 1 \ --tools alllook at models-max, since my hardware is potato, i dont want to load more than a model at a time, but if your hardware allows, you can load more than one.
from llama-server web interface, i can switch models based on what i need to get done (no need for llama swap).
hope this helps!