Sandboxing local models on macOS
The recently released Qwen3.5 family advances the capabilities of open-weights multi-modal models: rumors are theyβre exceptionally good for their size. The 35B-A3B and 27B variants are small enough to run on 64GB of unified memory (especially if quantized). So, I decided to give them a try.
# Sandboxing
Being wary of prompt injection and the probabilistic nature of LLMs, I usually
confine AI agents to ephemeral VMs, where they can wreak havoc in YOLO mode
(--dangerously-skip-permissions). With hosted models, this is easy: the VM
just needs internet access and remains isolated from the host. To run local
models on the VM instead, weβd need to let it access the GPU and install
compatible drivers. My only device with 64GB of RAM is a MacBook, and I run
Linux VMs on it, so Iβd need to find Linux drivers compatible with Apple
hardware (if they exist).
A pragmatic approach is to:
- Sandbox model inference (since parsing GGUF models has led to code execution vulnerabilities);
- either sandbox the client as well, for quick explorations; or
- keep the agents in a VM and forward their API access to the local model.
The go-to sandbox on macOS is sandbox-exec. Apple has marked it as
deprecated years ago, but it continues to work because it powers App
Sandbox, the sandboxing mechanism used by the Apple Store. Itβs a weird
beast, configurable through arcane profiles written in a Scheme dialect (which
got a lot easier to write with LLMs). Now that the big AI companies are using
it to sandbox their tools1, it is seeing a renaissance. I have
been using it for a while to sandbox lazyvim when running outside a VM, so
that it cannot reach the internet or make changes outside ~/work.
Here, weβll take a similar approach to sandbox model inference down to its
minimal capabilities (essentially, GPU and driver access) and then sandbox the
agents to run offline and within the well-defined boundaries of a workspace.
Importantly, all sandbox profiles have default deny, so we can allow exactly
what we need and deny the rest.
# Sandboxing profiles
If you want to jump to the code, youβll find everything at the
sandboxed-ai GitHub repository.
# Server
Weβll use llama-server to do model inference, which also provides
OpenAI-compatible REST APIs for the clients. The
sandbox file I came up
with through trial and error only allows access to the executable, the model
weights, the GPU and its drivers, and some cache directories. In addition,
llama-server can bind to port 8080 to serve its APIs, but cannot reach the
network (outbound).
The result is a simple
sandbox.sh script (no external dependencies)
that calls sandbox-exec and spins up the server. All additional arguments are
forwarded to llama-server. Because the sandbox prevents network access, the
script does some special handling of the --model parameter: if a model
doesnβt exist locally, it will fetch it (outside the sandbox) through curl.
./sandbox.sh llama-server --model unsloth/Qwen3.5-9B-GGUF:Qwen3.5-9B-Q8_0.gguf
We can now run Qwen3.5. Here, I am using the unsloth quantized model and including their recommended sampling parameters:
$ ./sandbox.sh llama-server \
--model unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL.gguf \
--ctx-size 16384 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.00 \
--chat-template-kwargs '{"enable_thinking":false}'
Starting sandboxed llama-server:
binary: /nix/store/l4xdm13zilm71n1jad95rpzk49h57is5-llama-cpp-metalkit-0.0.0/bin/llama-server
model: /Users/aldur/Work/local-opencode/.opencode/models/unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
alias: Qwen3.5-35B-A3B-GGUF
port: 8080
extra: --ctx-size 16384 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00 --chat-template-kwargs {"enable_thinking":false}
...
# Local clients
With the model ready, we can now chat with it. To do that, I have also prepared sandbox profiles for:
simonw/llm, which I use to run quick queries;opencode, which I use interactively.
$ ./sandbox.sh llm
Hello there!
^D
Hello! How can I help you today?
The sandbox.sh script automatically takes care of the configuration files
that both tools require to interface with llama-server (youβll find an
opencode.json in the script directory).
The sandbox profiles are quite hardened: they donβt allow network outbound,
restrict all writes to a single workspace directory (-w), and whitelist only
selected executables (restricting allowed tools). Cache files are written
alongside the script. Because of the sandbox restrictions, a few things will
break: opencode web, for instance, because it requires remote access to load
the frontend assets. On the other hand, the sandbox guarantees both integrity
and confidentiality on bare macOS, guaranteeing that the computation remains
local and preventing opencode from leaking your prompts2.
# In a QEMU VM
When I need to do anything more than quick checks/chats, I just spin up an
ephemeral QEMU VM. My
qemu-vm script uses SLiRP user network3,
so llama-server from the host is available at 10.0.2.2:8080:
qemu-vm -d $(pwd) -- -ephemeral
Starting VM...
...
qemu-nixos login: aldur (automatic login)
[I] aldur@qemu-nixos ~> curl http://10.0.2.2:8080/health
{"status":"ok"}
[I] aldur@qemu-nixos ~> cat opencode.json
{
"$schema": "https://opencode.ai/config.json",
"model": "llama/Qwen3.5-35B-A3B-GGUF",
"provider": {
"llama": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama.cpp (local)",
"options": {
"baseURL": "http://10.0.2.2:8080/v1",
"apiKey": "dummy"
},
"models": {
"Qwen3.5-35B-A3B-GGUF": {
"name": "Qwen3.5-35B-A3B-GGUF",
"tool_call": true
}
}
}
},
"autoupdate": false
}
[I] aldur@qemu-nixos ~> nix run github:nixos/nixpkgs#opencode
β
ββββ ββββ ββββ ββββ ββββ ββββ ββββ ββββ
β β β β ββββ β β β β β β β ββββ
ββββ ββββ ββββ ββββ ββββ ββββ ββββ ββββ
Session Greeting and quick check-in
Hello Qwen! How are you doing today?
I'm doing well, thanks for asking! How can I help you today?
# Test drives
On a MacBook Pro with an M3 Max CPU and 64GB of RAM, Qwen3.5-35B-A3B-Q8_0 is
quick enough that any prompt feels snappy and interactive, even with thinking
enabled. When using opencode, it churns for a few seconds through the initial
big prompt (on a cold cache) and is then able to use tools and read files
quickly enough. Although the quality of results is lower than frontier hosted
models, itβs a big step forward: this relatively small model can make small,
interactive changes. Plus, the OCR capabilities of the whole family are
impressive.
I also gave Qwen3.5-27B-Q8_0.gguf a shot, which trades inference speed for
better accuracy. To test drive it, I stuck to the default parameters, fed it a
draft of this blog post, then asked for edits4.
Hereβs the transcript of the session, the server logs and the model
info. Producing the output required more than 10 minutes of computation and
used about 15% of battery charge, with fans spinning and GPU at 100%
utilization. The task was relatively easy, and the results are satisfying: I
integrated almost all of its suggestions.
Thank you for reading and βtil next time! π
# Footnotes
-
With the important caveat that often the agent can disable the sandbox at runtime.Β ↩
-
In the default configuration, the
small_modelconfiguration parameter uploads prompts to OpenCodeβs servers to generate session titles.Β ↩ -
SLiRP user network isnβt ideal for isolation because it doesnβt firewall the VM, which appears to macOS as a process that can access
localhostsockets. The proper solution is to use TAP bridges, but weβll leave that to another post.Β ↩ -
As a reminder, all writing on this blog (typos and weird sentences included) is mine. Writing is thinking and I donβt see the point in having the LLM think for me.Β ↩