• 6 Posts
  • 17 Comments
Joined 1 year ago
Cake day: June 6, 2025


  • Nice write-up. Thanks for also including all the numbers. If I might ask: What is the thermal/throttling behaviour you mention? Is it still within the laptop’s thermal budget? Or does it reach throttling territory when doing inference on a long context window?

    Thanks! Nowhere near throttling territory. The thermal ceiling is 95°C, and my sustained load is 72-78°C. There’s a 17-23°C safety margin.

    The laptop never throttles. But the fans also don’t spin up properly during GPU-heavy inference unless I force them.

    Why? ASUS’s embedded controller monitors CPU temperature and CPU package power for fan curves. It largely ignores GPU temperature.
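The margin quoted above is just the ceiling minus each end of the sustained range. A minimal sketch of that arithmetic, using only the numbers from this comment (the function name is made up for illustration):

```python
# Figures taken from the comment: 95°C ceiling, 72-78°C sustained load.
THERMAL_CEILING_C = 95
SUSTAINED_LOAD_C = (72, 78)


def safety_margin(ceiling_c, load_range_c):
    """Return (smallest, largest) headroom in °C below the thermal ceiling."""
    coolest, hottest = load_range_c
    # The hottest sustained reading gives the smallest headroom.
    return ceiling_c - hottest, ceiling_c - coolest


if __name__ == "__main__":
    low, high = safety_margin(THERMAL_CEILING_C, SUSTAINED_LOAD_C)
    print(f"Headroom: {low}-{high}°C")
```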





  • I have a CPU-only build of llama.cpp on a laptop with 32 GB of LPDDR5 at 6400 MT/s. The laptop has an AMD Radeon 680M, but that is reserved for Wayland, the browser, and GPU-accelerated terminals.

    With llama-server, this is the performance I get: gemma-4 26B - 10.53 t/s

    Here is my llama-server command:

    llama-server --port 8080 --host 0.0.0.0 --models-preset /mnt/data/ai/models.ini --ctx-size 8192 -ngl 0 --mlock --no-mmap

    And here is the models.ini file:

    version = 1
    
    [*]
    ; Global defaults - CPU-optimized
    seed = -1
    top-p = 0.95
    top-k = 20
    min-p = 0.05
    presence-penalty = 0.0
    repeat-penalty = 1.1
    models-max = 2
    jinja = true
    batch-size = 256
    ubatch-size = 128
    threads = 8
    threads-batch = 4
    cpu-range = 0-7
    cpu-strict = 1
    kv-offload = false
    poll = 25
    poll-batch = 50
    cpu-moe = true
    
    [gemma-4-26b]
    model = /mnt/data/models/daily/google_gemma-4-26B-A4B-it-Q4_K_M.gguf
    temperature = 0.65
    reasoning-budget = 384
    repeat-penalty = 1.05
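
To see which values a model section ends up with, here is a rough Python sketch that treats `[*]` as a defaults section and lets the per-model section override it. This is an assumption about how the preset is meant to be read (the `; Global defaults` comment suggests it), not a description of llama-server's actual parser. The `version = 1` line is left out because Python's `configparser` rejects keys that appear before the first section header, and only a few of the keys above are copied in. The function name `effective_settings` is made up for illustration.

```python
import configparser

# A trimmed copy of the preset above (only a few keys, no "version" line).
PRESET = """
[*]
; Global defaults - CPU-optimized
top-k = 20
repeat-penalty = 1.1
threads = 8

[gemma-4-26b]
temperature = 0.65
reasoning-budget = 384
repeat-penalty = 1.05
"""


def effective_settings(ini_text, model):
    """Merge the [*] defaults with one model section; the model section wins."""
    # default_section="*" makes configparser treat [*] as the fallback
    # for every other section. interpolation=None keeps values literal.
    parser = configparser.ConfigParser(default_section="*", interpolation=None)
    parser.read_string(ini_text)
    return dict(parser[model])


if __name__ == "__main__":
    for key, value in sorted(effective_settings(PRESET, "gemma-4-26b").items()):
        print(f"{key} = {value}")
```

Under that assumption, gemma-4-26b runs with `repeat-penalty = 1.05` (the per-model value) while still inheriting `threads = 8` and `top-k = 20` from the global block.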