Colony Lifecycle Management
Full lifecycle from spawn to stop — states, transitions, supervision trees, and service orchestration in Colony
Every colony follows a well-defined lifecycle. Understanding these states helps you monitor progress and debug issues.
Lifecycle States
spawning → provisioning → building → running → stopping → stopped
    ↓           ↓            ↓          ↓
 [error]     [error]      [error]   [manual]
| State | Description | Duration |
|---|---|---|
| spawning | Creating network namespace and Jujutsu workspace | <100ms |
| provisioning | Running environment setup, installing deps | 5-30s |
| building | Running build commands (npm build, cargo build) | 10-60s |
| running | Services are live, agent is working | Minutes-hours |
| stopping | Graceful shutdown, cleaning up resources | <5s |
| stopped | Colony is inactive, resources freed | — |
| error | Unrecoverable failure at any stage | — |
State Transitions
Transitions happen automatically based on success or failure:
// Successful path
Spawning -> Provisioning -> Building -> Running
// Error paths
Spawning [error] -> Error
Provisioning [error] -> Error (with rollback)
Building [error] -> Error
Running -> Stopping -> Stopped (manual or on completion)
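The transition table is small enough to express as a pure function. A minimal Erlang sketch (the event atoms ok, error, and stop are illustrative, not Colony's actual API):
-module(lifecycle).
-export([next_state/2]).

%% Legal transitions, derived from the table above.
%% Any stage can fail; only running accepts a stop command.
next_state(spawning,     ok)    -> provisioning;
next_state(provisioning, ok)    -> building;
next_state(building,     ok)    -> running;
next_state(running,      stop)  -> stopping;
next_state(stopping,     ok)    -> stopped;
next_state(_Stage,       error) -> error.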
State changes are broadcast via WebSocket to Bloom. The UI updates instantly when a colony transitions.
Spawning Stage
Spawning creates the isolated environment.
What Happens
- Generate colony ID — unique identifier (e.g., colony-alpha-7f3d)
- Create network namespace — ip netns add ns-{colony_id}
- Set up veth pair — virtual ethernet connecting namespace to host
- Configure networking — assign IP, set up routing, enable NAT
- Create Jujutsu workspace — jj workspace add {colony_id}
- Initialize database — SQLite instance at db/{colony_id}.db
- Create log buffer — ETS ring buffer for log storage
Error Handling
If spawning fails (namespace already exists, for example), we roll back everything:
# Rollback actions
ip netns del ns-{colony_id}
rm -rf workspaces/{colony_id}
rm -f db/{colony_id}.db
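The forward steps and the rollback pair naturally in a try/catch. A hedged Erlang sketch (run/1 and its any-output-means-failure convention are illustrative simplifications, not Colony's actual helpers):
%% Sketch: create resources in order; undo everything on any failure.
spawn_colony(Id) ->
    try
        run("ip netns add ns-" ++ Id),
        run("jj workspace add workspaces/" ++ Id),
        ok
    catch
        _Class:Reason ->
            rollback(Id),
            {error, Reason}
    end.

rollback(Id) ->
    os:cmd("ip netns del ns-" ++ Id),
    os:cmd("rm -rf workspaces/" ++ Id),
    os:cmd("rm -f db/" ++ Id ++ ".db").

%% Simplification: treat any command output as failure.
run(Cmd) ->
    [] = os:cmd(Cmd ++ " 2>&1").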
Provisioning Stage
Provisioning sets up the environment based on colony.toml.
Configuration Parsing
[environment]
node_version = "20"
packages = ["git", "curl", "jq"]
[services]
# Services defined here
Actions
- Parse colony.toml — load configuration from workspace
- Install language runtime — nvm, rustup, pyenv based on config
- Install system packages — apt/dnf based on distro
- Run setup scripts — custom provisioning commands
- Install dependencies — npm install, cargo fetch, pip install
Provisioning runs inside the namespace via ip netns exec. All installed tools are namespace-local.
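For example, a single provisioning step might be wrapped like this (in_namespace/2 is an illustrative helper, not part of Colony):
%% Every provisioning command is prefixed with "ip netns exec",
%% so installed tools never leak onto the host.
in_namespace(ColonyId, Cmd) ->
    os:cmd("ip netns exec ns-" ++ ColonyId ++ " " ++ Cmd).

%% e.g. in_namespace("colony-alpha-7f3d", "npm install")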
Rollback on Failure
If provisioning fails (npm install error, for example), we restore the previous state:
- Restore workspace to pre-provision commit
- Clear installed packages (if possible)
- Transition to error state with logs attached
Building Stage
Building compiles or bundles your application.
Build Commands
[build]
command = "npm run build"
timeout = 300 # 5 minutes
We execute the build command and stream output to the log buffer. Exit code 0? Transition to running. Non-zero? Transition to error.
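A sketch of that loop in Erlang, where log_buffer:append/1 stands in for the real log sink (note the timeout here resets on every output chunk, a simplification):
%% Run the build under a deadline and map the exit code to a state.
run_build(Cmd, TimeoutMs) ->
    Port = open_port({spawn, Cmd}, [exit_status, stderr_to_stdout, binary]),
    wait_for_exit(Port, TimeoutMs).

wait_for_exit(Port, TimeoutMs) ->
    receive
        {Port, {data, Chunk}} ->
            log_buffer:append(Chunk),          %% stream to the log buffer
            wait_for_exit(Port, TimeoutMs);
        {Port, {exit_status, 0}} ->
            {ok, running};                     %% exit code 0: transition to running
        {Port, {exit_status, Code}} ->
            {error, {build_failed, Code}}      %% non-zero: transition to error
    after TimeoutMs ->
        {error, build_timeout}
    end.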
Parallel Builds
Multiple colonies can build at the same time. They’re isolated, so there are no conflicts over temp files or ports.
Running Stage
This is where agents do their work and services are live.
Service Spawning
For each service in colony.toml:
[[services]]
name = "web"
command = "npm start"
port = 4001
We:
- Spawn dedicated owner process — separate from colony actor
- Open Erlang port — open_port({spawn, "npm start"}, [...])
- Capture OS PID — via erlang:port_info(Port, os_pid)
- Stream output — stdout/stderr to log buffer
- Register Caddy route — non-blocking HTTP call to Caddy API
Why Dedicated Owner Process?
The colony actor needs to stay responsive to handle API requests (get state, stream logs, etc.). If we did synchronous HTTP calls or blocking I/O inside the actor, it’d become unresponsive.
Solution: spawn (not spawn_link) a separate process that owns the port. If the service crashes, only that process dies. The colony actor survives and reports the error.
// Simplified version: open_port, get_os_pid, register_caddy_route,
// and stream_output_loop are project helpers, not stdlib functions
pub fn spawn_service(service: Service) -> Result(Pid, Error) {
  // Spawn a separate process (not linked) to own the port
  let owner_pid =
    process.spawn(fn() {
      let port = open_port(service.command)
      // Capture the OS PID so the service can be killed on stop
      let os_pid = get_os_pid(port)
      // Register route (non-blocking, separate task)
      task.async(fn() { register_caddy_route(service) })
      // Stream stdout/stderr to the log buffer until the port exits
      stream_output_loop(port, os_pid)
    })
  Ok(owner_pid)
}
Service Health Monitoring
The owner process monitors the port for exit signals:
receive
    {Port, {exit_status, 0}} ->
        % Service exited gracefully
        notify_actor(service_stopped);
    {Port, {exit_status, Code}} ->
        % Service crashed
        notify_actor({service_crashed, Code})
end
Crashed services don’t crash the colony. The actor gets the error and can restart the service or transition to error state.
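A sketch of the actor's side, with an illustrative restart budget (restart_service/2 and transition/2 are hypothetical helpers):
%% Give a crashed service a few retries, then mark the colony failed.
handle_info({service_crashed, Name, _Code}, State = #{restarts := R}) ->
    case maps:get(Name, R, 0) of
        N when N < 3 -> {noreply, restart_service(Name, State)};
        _            -> {noreply, transition(error, State)}
    end.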
Stopping Stage
Stopping performs graceful shutdown.
Actions
- Kill services — os:cmd("kill {os_pid}") for each service
- Wait for port exit — timeout 5s, force kill if needed
- Deregister Caddy routes — remove reverse proxy rules
- Close log buffer — flush logs to disk (optional)
- Keep namespace alive — don’t destroy (allows restart)
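The kill-then-wait step, sketched from the owner process (which is the process that receives the port's exit message); the 5s grace period matches the list above:
%% SIGTERM first; escalate to SIGKILL if the port hasn't exited in 5s.
stop_service(Port, OsPid) ->
    os:cmd("kill " ++ integer_to_list(OsPid)),
    receive
        {Port, {exit_status, _}} ->
            ok                       %% port closes itself on exit_status
    after 5000 ->
        os:cmd("kill -9 " ++ integer_to_list(OsPid)),
        catch port_close(Port),
        ok
    end.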
OS PID Cleanup
Erlang's port_close(Port) doesn't kill the underlying OS process, so we explicitly kill by PID:
% Get OS PID from port
{os_pid, OsPid} = erlang:port_info(Port, os_pid),
% Kill the process
os:cmd(io_lib:format("kill ~p", [OsPid])),
% Close the port
port_close(Port).
This ensures services are fully terminated, not orphaned.
OTP Supervision Tree
The colony manager uses an OTP supervisor with one_for_one strategy:
ColonyManager (Supervisor)
├─ ColonyActor[colony-alpha] (GenServer)
│ └─ ServiceOwner[web] (Pid)
│ └─ ServiceOwner[api] (Pid)
├─ ColonyActor[colony-beta] (GenServer)
└─ ColonyActor[colony-gamma] (GenServer)
If a colony actor crashes (unhandled exception), the supervisor restarts only that actor. Other colonies keep running.
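Colonies are added dynamically, so each one is started as a child of the running supervisor. A sketch (module and supervisor names are illustrative):
%% transient: restart on crash, but not after a normal stop.
start_colony_actor(Id) ->
    ChildSpec = #{id      => {colony_actor, Id},
                  start   => {colony_actor, start_link, [Id]},
                  restart => transient},
    supervisor:start_child(colony_manager, ChildSpec).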
Fault Isolation
// If this crashes...
pub fn handle_call(msg: Msg, _state: State) {
  case msg {
    GetState -> {
      let invalid_state = crash_here() // Oops!
      Response(invalid_state)
    }
  }
}
// ...only this colony actor restarts.
// Other colonies are unaffected.
This is OTP’s superpower: automatic fault recovery with surgical precision.
State Diagram (ASCII Art)
┌─────────────┐
│  spawning   │
└──────┬──────┘
       │
    success
       │
┌──────▼──────────┐
│  provisioning   │──── error ───┐
└──────┬──────────┘              │
       │                         │
    success                      │
       │                         │
┌──────▼──────┐                  │
│  building   │──── error ───────┤
└──────┬──────┘                  │
       │                         │
    success                      │
       │                         │
┌──────▼──────┐                  │
│   running   │                  │
└──────┬──────┘                  │
       │                         │
 stop command                    │
       │                         │
┌──────▼──────┐                  │
│  stopping   │                  │
└──────┬──────┘                  │
       │                         │
┌──────▼──────┐           ┌──────▼──────┐
│   stopped   │           │    error    │
└─────────────┘           └─────────────┘
Caddy Route Registration
When a service starts, we register a reverse proxy rule with Caddy:
{
"handle": [{
"handler": "reverse_proxy",
"upstreams": [{
"dial": "10.200.1.2:4001"
}]
}]
}
Now the service is accessible at http://web-4001.colony.local/ from the host and Bloom.
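The registration call itself is a plain POST to Caddy's admin API (default port 2019); POSTing to an array path appends the object. A sketch using OTP's httpc (requires the inets application to be started):
%% Append the route object shown above to the colony server's routes.
register_route(Json) ->
    Url = "http://localhost:2019/config/apps/http/servers/colony/routes",
    httpc:request(post, {Url, [], "application/json", Json}, [], []).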
Deregistration on Stop
When the colony stops, we delete the Caddy route:
DELETE /config/apps/http/servers/colony/routes/route-colony-alpha-web
No stale routes pointing to dead services.
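The matching call on stop, again via httpc (the URL layout follows the DELETE path above):
deregister_route(RouteId) ->
    Url = "http://localhost:2019/config/apps/http/servers/colony/routes/"
          ++ RouteId,
    httpc:request(delete, {Url, []}, [], []).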
Next Steps
- Parallel Agents — How multiple colonies coexist
- Live Preview — Watch the lifecycle in real-time
- Architecture Overview — Technical deep dive