Building Colony: Why Gleam + OTP?
Why we chose Gleam and the OTP framework over Go, Rust, or Node.js for Colony's orchestration layer.
We’re building Colony on Gleam. Most developers have never heard of it. And when they do, they ask: why not Go? Why not Rust? Why not just use Node like everyone else?
Here’s the thing: we didn’t pick Gleam because it’s exotic. We picked it because it’s built on the Erlang VM, and the Erlang VM solves the exact problem we have.
The Problem: Concurrent, Crash-Prone Workloads
Colony manages hundreds of isolated development environments. Each colony is a complete workspace: network namespace, file system, running services, AI agents, preview windows. These colonies need to start, stop, crash, and recover without affecting each other.
This is a concurrency problem. And not the easy kind where you spin up a few goroutines and call it a day. This is the kind where:
- One colony’s crash shouldn’t take down the entire system
- Agents will absolutely crash (trust me on this)
- Services need to communicate over WebSockets and Unix sockets
- Everything needs to scale horizontally without rewriting the core
Sound familiar? It’s the telecom problem. The same problem Erlang solved 30 years ago when it was built to run phone switches that can’t go down.
Why Go Would’ve Hurt
Go is pragmatic. Great standard library. Goroutines are lightweight. We almost picked it.
But then you start writing the error handling. Goroutines crash. How do you recover them? You write your own supervision logic. How do you prevent crash loops? More custom code. How do you distribute work across multiple machines later? Even more infrastructure.
Go makes concurrency easy. It doesn’t make fault tolerance automatic.
Why Rust Was Overkill
Rust gives you fearless concurrency and zero-cost abstractions. We’re actually using Rust for Stem (our TUI client). But for the orchestration layer? Fighting the borrow checker to model actors would’ve been painful.
Rust’s async story is still maturing. The ecosystem is fragmented (Tokio vs async-std). And honestly, we’d spend more time satisfying the compiler than building features.
We don’t need zero-cost abstractions here. We need proven fault tolerance.
Enter OTP: 30 Years of Not Crashing
OTP is Erlang’s framework for building distributed, fault-tolerant systems. It runs WhatsApp’s messaging infrastructure. It runs telecom switches. It runs systems that can’t go down.
What you get:
Supervision trees. When an actor crashes, its supervisor restarts it. Automatically. No manual error handling. No crash loops. It just works.
Actor model concurrency. Every colony is an OTP process. They communicate via messages. They share nothing. One colony’s corrupted state can’t poison another.
Hot code reloading. Deploy updates without stopping running colonies. When users have long-running AI tasks, downtime isn’t an option.
Observability baked in. Process registry, telemetry hooks, error logging — all first-class primitives. Not third-party libraries you have to integrate.
This isn’t theory. This is production-proven infrastructure.
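For flavor, here is roughly what a supervision tree for colonies looks like in Gleam. This is a hedged sketch, not our production code: the `gleam/otp/supervisor` API has changed between gleam_otp releases, and `colony_actor.start` is a hypothetical starter function for the per-colony actor, so treat the names and signatures as illustrative.

```gleam
import gleam/list
import gleam/otp/supervisor

// Hypothetical: one supervised worker per colony. If a colony's actor
// crashes, the supervisor restarts just that worker; its siblings are
// untouched.
pub fn start_colonies(ids: List(String)) {
  supervisor.start(fn(children) {
    list.fold(ids, children, fn(children, id) {
      // colony_actor.start is a stand-in for whatever starts one
      // colony actor and returns Result(Subject(msg), StartError).
      supervisor.add(children, supervisor.worker(fn(_) { colony_actor.start(id) }))
    })
  })
}
```

The point is the shape, not the API details: restart policy lives in the tree, not scattered through application code.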
So Why Gleam Instead of Elixir?
If OTP is the answer, why not Elixir? Bigger ecosystem. Better docs. Phoenix framework.
Because Gleam has strong static typing. And that matters.
Look at this state machine from Colony’s codebase:
```gleam
pub type ColonyState {
  Provisioning(id: String, namespace: String)
  Starting(id: String, namespace: String, services: List(Service))
  Running(id: String, namespace: String, services: List(Service))
  Stopping(id: String)
  Failed(id: String, error: Error)
}

pub fn handle_start(state: ColonyState) -> Result(ColonyState, String) {
  case state {
    Provisioning(id, ns) -> {
      // The type system guarantees we can't start a colony
      // that's already running
      Ok(Starting(id, ns, []))
    }
    _ -> Error("Cannot start colony in current state")
  }
}
```
This state machine is enforced by the compiler. You cannot transition from Running to Provisioning. The code won’t compile. In Elixir, you’d model this with atoms and pattern matching, and nothing stops an invalid transition until it fails at runtime.
When you’re debugging why a colony won’t start at 2 AM, you want compile-time guarantees.
Smaller runtime overhead. Gleam compiles to Erlang bytecode, but without macros or runtime metaprogramming. Predictable performance. Smaller memory footprint.
Better error messages. Gleam’s compiler errors are clear. When you screw up, it tells you exactly what’s wrong. Not a cryptic pattern match failure at runtime.
The Actor-Per-Colony Architecture
Every colony is a GenServer. Here’s the pattern:
```gleam
pub fn init(id: String) -> Result(State, Error) {
  // Each colony gets its own isolated state
  Ok(State(
    id: id,
    namespace: None,
    services: [],
    subscribers: [],
  ))
}

pub fn handle_call(message: Message, state: State) -> Reply {
  case message {
    Start -> {
      // Spawn services, update state, notify subscribers
      case start_services(state) {
        Ok(new_state) -> {
          notify_subscribers(new_state.subscribers, ColonyStarted)
          Reply(Ok(Nil), new_state)
        }
        Error(e) -> Reply(Error(e), state)
      }
    }
    Stop -> todo // ...
  }
}
```
What this gives us:
- Isolation. One colony crashes. Others keep running.
- Parallelism. Thousands of colonies processing messages simultaneously.
- Backpressure. GenServer mailboxes naturally queue when overloaded.
No manual thread pools. No goroutine leaks. It just works.
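For completeness, here is how a caller talks to one of these actors. Another hedged sketch: in real gleam_otp code a call-style message carries the subject the actor replies on (which the pattern above elides), and `process.call`'s argument order differs between gleam_erlang versions.

```gleam
import gleam/erlang/process.{type Subject}

// A call-style message carries the subject the actor replies on.
pub type Message {
  Start(reply_with: Subject(Result(Nil, String)))
  Stop
}

pub fn start_colony(colony: Subject(Message)) -> Result(Nil, String) {
  // Sends Start into the colony's mailbox and blocks (up to 5s) for
  // the reply. A flooded mailbox makes callers wait: that's the
  // backpressure.
  process.call(colony, Start, 5000)
}
```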
Real Performance Numbers
On an M2 MacBook Pro:
- Cold start: 562 Gleam tests in ~3 seconds
- Colony spawn: <50ms to create namespace and initialize actor
- Memory per colony: ~2-5 MB (mostly the network namespace)
- Concurrent colonies: 100+ running simultaneously without degradation
The Erlang VM scheduler is incredibly efficient. It distributes work across cores automatically. We don’t manage thread pools.
The Tradeoffs
Gleam isn’t perfect.
Small ecosystem. We’ve written FFI bindings to Erlang libraries. Our Protobuf integration uses gpb via Erlang FFI because there’s no pure Gleam library yet.
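That FFI is lightweight: a single attribute binds a Gleam function head to an Erlang function. A sketch of the shape, where the module and function names are hypothetical stand-ins, not our actual gpb bindings:

```gleam
import gleam/dynamic.{type Dynamic}
import gleam/erlang/atom.{type Atom}

// @external binds this Gleam function head to an Erlang function;
// the Erlang side ("colony_pb" here, hypothetically gpb-generated)
// does the actual decoding.
@external(erlang, "colony_pb", "decode_msg")
pub fn decode_colony_msg(bytes: BitArray, msg_name: Atom) -> Dynamic
```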
Learning curve. Functional programming and OTP are unfamiliar to most developers. Onboarding takes time.
Tooling. Editor support is improving but not as polished as Go or Rust. We’ve built custom scripts for codegen.
But here’s the thing: we’re not optimizing for ecosystem size or familiarity. We’re optimizing for not waking up at 3 AM because the orchestration layer crashed.
Would We Do It Again?
Yes.
The supervision model alone has saved us countless hours. When a colony fails to start (because an AI agent corrupted its config, or the network namespace hit a kernel limit), the supervisor catches it, logs it, and keeps the rest of the system running.
And when we scale horizontally? OTP’s distribution primitives (clustering, distributed process registry) will let us build multi-datacenter deployments without rewriting anything.
If you’re building a system that needs to manage hundreds of concurrent, stateful, crash-prone workloads, take a hard look at the Erlang VM. And if you want type safety on top of that proven foundation, Gleam is the answer.
Want to see the actor-per-colony architecture in action? Join the waitlist for early access to Colony.