Colony Lifecycle Management
Full lifecycle from spawn to stop — states, transitions, supervision trees, and service orchestration in Colony
Every colony follows a well-defined lifecycle. Understanding these states helps you monitor progress and debug issues.
Lifecycle States
spawning → provisioning → building → running → stopping → stopped
    ↓           ↓            ↓          ↓
 [error]     [error]      [error]   [manual]
| State | Description | Duration |
|---|---|---|
| spawning | Creating network namespace and Jujutsu workspace | <100ms |
| provisioning | Running environment setup, installing deps | 5-30s |
| building | Running build commands (npm build, cargo build) | 10-60s |
| running | Services are live, agent is working | Minutes-hours |
| stopping | Graceful shutdown, cleaning up resources | <5s |
| stopped | Colony is inactive, resources freed | — |
| error | Unrecoverable failure at any stage | — |
State Transitions
Transitions happen automatically based on success or failure:
// Successful path
Spawning -> Provisioning -> Building -> Running
// Error paths
Spawning [error] -> Error
Provisioning [error] -> Error (with rollback)
Building [error] -> Error
Running -> Stopping -> Stopped (manual or on completion)
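The transition table is small enough to express as a pure function. A minimal Erlang sketch (the event atoms ok, error, and stop are illustrative, not Colony's actual API):
-module(lifecycle).
-export([next_state/2]).

%% Legal transitions, derived from the table above.
%% Any stage can fail; only running accepts a stop command.
next_state(spawning,     ok)    -> provisioning;
next_state(provisioning, ok)    -> building;
next_state(building,     ok)    -> running;
next_state(running,      stop)  -> stopping;
next_state(stopping,     ok)    -> stopped;
next_state(_Stage,       error) -> error.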
State changes are broadcast via WebSocket to Bloom. The UI updates instantly when a colony transitions.
Spawning Stage
Spawning creates the isolated environment.
What Happens
- Generate colony ID — unique identifier (e.g., colony-alpha-7f3d)
- Create network namespace — ip netns add ns-{colony_id}
- Set up veth pair — virtual ethernet connecting namespace to host
- Configure networking — assign IP, set up routing, enable NAT
- Create Jujutsu workspace — jj workspace add {colony_id}
- Initialize database — SQLite instance at db/{colony_id}.db
- Create log buffer — ETS ring buffer for log storage
Error Handling
If spawning fails (namespace already exists, for example), we roll back everything:
# Rollback actions
ip netns del ns-{colony_id}
rm -rf workspaces/{colony_id}
rm -f db/{colony_id}.db
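The forward steps and the rollback pair naturally in a try/catch. A hedged Erlang sketch (run/1 and its any-output-means-failure convention are illustrative simplifications, not Colony's actual helpers):
%% Sketch: create resources in order; undo everything on any failure.
spawn_colony(Id) ->
    try
        run("ip netns add ns-" ++ Id),
        run("jj workspace add workspaces/" ++ Id),
        ok
    catch
        _Class:Reason ->
            rollback(Id),
            {error, Reason}
    end.

rollback(Id) ->
    os:cmd("ip netns del ns-" ++ Id),
    os:cmd("rm -rf workspaces/" ++ Id),
    os:cmd("rm -f db/" ++ Id ++ ".db").

%% Simplification: treat any command output as failure.
run(Cmd) ->
    [] = os:cmd(Cmd ++ " 2>&1").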
Provisioning Stage
Provisioning sets up the environment based on colony.toml.
Configuration Parsing
[environment]
node_version = "20"
packages = ["git", "curl", "jq"]
[services]
# Services defined here
Actions
- Parse colony.toml — load configuration from workspace
- Install language runtime — nvm, rustup, pyenv based on config
- Install system packages — apt/dnf based on distro
- Run setup scripts — custom provisioning commands
- Install dependencies — npm install, cargo fetch, pip install
Provisioning runs inside the namespace via ip netns exec. All installed tools are namespace-local.
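For example, a single provisioning step might be wrapped like this (in_namespace/2 is an illustrative helper, not part of Colony):
%% Every provisioning command is prefixed with "ip netns exec",
%% so installed tools never leak onto the host.
in_namespace(ColonyId, Cmd) ->
    os:cmd("ip netns exec ns-" ++ ColonyId ++ " " ++ Cmd).

%% e.g. in_namespace("colony-alpha-7f3d", "npm install")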
Rollback on Failure
If provisioning fails (npm install error, for example), we restore the previous state:
- Restore workspace to pre-provision commit
- Clear installed packages (if possible)
- Transition to error state with logs attached
Building Stage
Building compiles or bundles your application.
Build Commands
[build]
command = "npm run build"
timeout = 300 # 5 minutes
We execute the build command and stream output to the log buffer. Exit code 0? Transition to running. Non-zero? Transition to error.
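A sketch of that loop in Erlang, where log_buffer:append/1 stands in for the real log sink (note the timeout here resets on every output chunk, a simplification):
%% Run the build under a deadline and map the exit code to a state.
run_build(Cmd, TimeoutMs) ->
    Port = open_port({spawn, Cmd}, [exit_status, stderr_to_stdout, binary]),
    wait_for_exit(Port, TimeoutMs).

wait_for_exit(Port, TimeoutMs) ->
    receive
        {Port, {data, Chunk}} ->
            log_buffer:append(Chunk),          %% stream to the log buffer
            wait_for_exit(Port, TimeoutMs);
        {Port, {exit_status, 0}} ->
            {ok, running};                     %% exit code 0: transition to running
        {Port, {exit_status, Code}} ->
            {error, {build_failed, Code}}      %% non-zero: transition to error
    after TimeoutMs ->
        {error, build_timeout}
    end.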
Parallel Builds
Multiple colonies can build at the same time. They’re isolated, so there are no conflicts over temp files or ports.
Running Stage
This is where agents do their work and services are live.
Service Spawning
For each service in colony.toml:
[[services]]
name = "web"
command = "npm start"
port = 4001
We:
- Spawn dedicated owner process — separate from colony actor
- Open Erlang port — open_port({spawn, "npm start"}, [...])
- Capture OS PID — via erlang:port_info(Port, os_pid)
- Stream output — stdout/stderr to log buffer
- Register Caddy route — non-blocking HTTP call to Caddy API
Why Dedicated Owner Process?
The colony actor needs to stay responsive to handle API requests (get state, stream logs, etc.). If we did synchronous HTTP calls or blocking I/O inside the actor, it’d become unresponsive.
Solution: spawn (not spawn_link) a separate process that owns the port. If the service crashes, only that process dies. The colony actor survives and reports the error.
// Simplified version: open_port, get_os_pid, register_caddy_route,
// and stream_output_loop are project helpers, not stdlib functions
pub fn spawn_service(service: Service) -> Result(Pid, Error) {
  // Spawn a separate process (not linked) to own the port
  let owner_pid =
    process.spawn(fn() {
      let port = open_port(service.command)
      // Capture the OS PID so the service can be killed on stop
      let os_pid = get_os_pid(port)
      // Register route (non-blocking, separate task)
      task.async(fn() { register_caddy_route(service) })
      // Stream stdout/stderr to the log buffer until the port exits
      stream_output_loop(port, os_pid)
    })
  Ok(owner_pid)
}
Service Health Monitoring
The owner process monitors the port for exit signals:
receive
    {Port, {exit_status, 0}} ->
        % Service exited gracefully
        notify_actor(service_stopped);
    {Port, {exit_status, Code}} ->
        % Service crashed
        notify_actor({service_crashed, Code})
end
Crashed services don’t crash the colony. The actor gets the error and can restart the service or transition to error state.
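A sketch of the actor's side, with an illustrative restart budget (restart_service/2 and transition/2 are hypothetical helpers):
%% Give a crashed service a few retries, then mark the colony failed.
handle_info({service_crashed, Name, _Code}, State = #{restarts := R}) ->
    case maps:get(Name, R, 0) of
        N when N < 3 -> {noreply, restart_service(Name, State)};
        _            -> {noreply, transition(error, State)}
    end.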
Stopping Stage
Stopping performs graceful shutdown.
Actions
- Kill services — os:cmd("kill {os_pid}") for each service
- Wait for port exit — timeout 5s, force kill if needed
- Deregister Caddy routes — remove reverse proxy rules
- Close log buffer — flush logs to disk (optional)
- Keep namespace alive — don’t destroy (allows restart)
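The kill-then-wait step, sketched from the owner process (which is the process that receives the port's exit message); the 5s grace period matches the list above:
%% SIGTERM first; escalate to SIGKILL if the port hasn't exited in 5s.
stop_service(Port, OsPid) ->
    os:cmd("kill " ++ integer_to_list(OsPid)),
    receive
        {Port, {exit_status, _}} ->
            ok                       %% port closes itself on exit_status
    after 5000 ->
        os:cmd("kill -9 " ++ integer_to_list(OsPid)),
        catch port_close(Port),
        ok
    end.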
OS PID Cleanup
Erlang's port_close(Port) doesn't kill the underlying OS process, so we explicitly kill by PID:
% Get OS PID from port
{os_pid, OsPid} = erlang:port_info(Port, os_pid),
% Kill the process
os:cmd(io_lib:format("kill ~p", [OsPid])),
% Close the port
port_close(Port).
This ensures services are fully terminated, not orphaned.
OTP Supervision Tree
The colony manager uses an OTP supervisor with one_for_one strategy:
ColonyManager (Supervisor)
├─ ColonyActor[colony-alpha] (GenServer)
│ └─ ServiceOwner[web] (Pid)
│ └─ ServiceOwner[api] (Pid)
├─ ColonyActor[colony-beta] (GenServer)
└─ ColonyActor[colony-gamma] (GenServer)
If a colony actor crashes (unhandled exception), the supervisor restarts only that actor. Other colonies keep running.
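Colonies are added dynamically, so each one is started as a child of the running supervisor. A sketch (module and supervisor names are illustrative):
%% transient: restart on crash, but not after a normal stop.
start_colony_actor(Id) ->
    ChildSpec = #{id      => {colony_actor, Id},
                  start   => {colony_actor, start_link, [Id]},
                  restart => transient},
    supervisor:start_child(colony_manager, ChildSpec).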
Fault Isolation
// If this crashes...
pub fn handle_call(msg: Msg, _state: State) {
  case msg {
    GetState -> {
      let invalid_state = crash_here() // Oops!
      Response(invalid_state)
    }
  }
}
// ...only this colony actor restarts.
// Other colonies are unaffected.
This is OTP’s superpower: automatic fault recovery with surgical precision.
State Diagram (ASCII Art)
┌─────────────┐
│  spawning   │
└──────┬──────┘
       │
    success
       │
┌──────▼──────────┐
│  provisioning   │──── error ───┐
└──────┬──────────┘              │
       │                         │
    success                      │
       │                         │
┌──────▼──────┐                  │
│  building   │──── error ───────┤
└──────┬──────┘                  │
       │                         │
    success                      │
       │                         │
┌──────▼──────┐                  │
│   running   │                  │
└──────┬──────┘                  │
       │                         │
 stop command                    │
       │                         │
┌──────▼──────┐                  │
│  stopping   │                  │
└──────┬──────┘                  │
       │                         │
┌──────▼──────┐           ┌──────▼──────┐
│   stopped   │           │    error    │
└─────────────┘           └─────────────┘
Caddy Route Registration
When a service starts, we register a reverse proxy rule with Caddy:
{
"handle": [{
"handler": "reverse_proxy",
"upstreams": [{
"dial": "10.200.1.2:4001"
}]
}]
}
Now the service is accessible at http://web-4001.colony.local/ from the host and Bloom.
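The registration call itself is a plain POST to Caddy's admin API (default port 2019); POSTing to an array path appends the object. A sketch using OTP's httpc (requires the inets application to be started):
%% Append the route object shown above to the colony server's routes.
register_route(Json) ->
    Url = "http://localhost:2019/config/apps/http/servers/colony/routes",
    httpc:request(post, {Url, [], "application/json", Json}, [], []).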
Deregistration on Stop
When the colony stops, we delete the Caddy route:
DELETE /config/apps/http/servers/colony/routes/route-colony-alpha-web
No stale routes pointing to dead services.
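The matching call on stop, again via httpc (the URL layout follows the DELETE path above):
deregister_route(RouteId) ->
    Url = "http://localhost:2019/config/apps/http/servers/colony/routes/"
          ++ RouteId,
    httpc:request(delete, {Url, []}, [], []).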
Next Steps
- Parallel Agents — How multiple colonies coexist
- Live Preview — Watch the lifecycle in real-time
- Architecture Overview — Technical deep dive