Chapter 9: Monitoring and Maintenance
TL;DR: Anvil exposes a `/ready` endpoint for health checks and is designed to be monitored with standard tools like Prometheus. Background workers handle tasks like garbage collection and shard repair.
Running a distributed system in production requires robust monitoring and a clear understanding of its maintenance processes. Anvil is designed to be observable and resilient.
9.1. Health Checks and Readiness Probes
Each Anvil node exposes two HTTP endpoints for health monitoring:
- `/` (Health Check): This is a simple endpoint that returns `200 OK` as long as the Anvil server process is running. It can be used for basic liveness probes.
- `/ready` (Readiness Check): This is a more comprehensive check that should be used to determine if a node is ready to accept traffic. It returns `200 OK` only if:
  - The node can successfully connect to its databases (global and regional).
  - The node is part of a cluster and has discovered at least one peer (itself included).
In an orchestrator like Kubernetes, you should use the `/ready` endpoint for your readiness probes to ensure traffic is only routed to fully initialized nodes.
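To make the distinction concrete, here is a minimal Go sketch of what the two handlers might look like. The helper functions `checkDatabases` and `peerCount` are hypothetical stand-ins for Anvil's internal checks, not its actual implementation.

```go
package main

import (
	"net/http"
)

// checkDatabases and peerCount are hypothetical stand-ins; real
// implementations would ping the global and regional databases and
// consult the cluster membership list.
func checkDatabases() error { return nil }
func peerCount() int        { return 1 }

func main() {
	// Liveness: returns 200 OK whenever the process is running.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: returns 200 OK only if the databases are reachable
	// and at least one peer (including this node) has been discovered.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if err := checkDatabases(); err != nil || peerCount() < 1 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", nil)
}
```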
9.2. Key Metrics for Monitoring (Prometheus)
Anvil is designed to expose metrics in a Prometheus-compatible format. While the specific metrics will evolve, you should monitor the following key areas of the system:
- API Latency and Error Rates:
  - Latency for S3 and gRPC API calls (`PutObject`, `GetObject`).
  - Rate of `4xx` and `5xx` errors.
- Cluster Membership:
  - Number of active peers in the cluster.
  - Rate of peer churn (nodes joining or leaving).
- Storage Utilization:
  - Total storage capacity and usage across the cluster.
  - Storage usage per tenant and per bucket.
- Task Queue:
  - Number of pending tasks in the queue.
  - Rate of failed tasks.
- Shard Health:
  - Number of missing or corrupted shards.
  - Rate of shard repair and rebalancing operations.
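As a rough illustration of how such metrics can be exposed in a Prometheus-compatible format, the sketch below uses the standard `prometheus/client_golang` library. The metric names (`anvil_api_request_duration_seconds`, `anvil_task_queue_pending`) are invented for this example and are not Anvil's actual metric names.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names below are illustrative only; Anvil's real metric names
// and labels may differ.
var (
	// Histogram of API call latency, labeled by method and status class.
	apiLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name: "anvil_api_request_duration_seconds",
		Help: "Latency of S3 and gRPC API calls.",
	}, []string{"method", "status"})

	// Gauge tracking the number of pending tasks in the task queue.
	pendingTasks = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "anvil_task_queue_pending",
		Help: "Number of pending tasks in the task queue.",
	})
)

func main() {
	// Record a sample observation so the metrics are non-empty.
	apiLatency.WithLabelValues("PutObject", "2xx").Observe(0.042)
	pendingTasks.Set(3)

	// Expose all registered metrics at /metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}
```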
9.3. Backup and Recovery Strategy
Anvil's durability model is designed to withstand node failures, but a comprehensive backup strategy must also account for database failure.
- Database Backup: Your primary backup responsibility is the PostgreSQL databases. Use standard Postgres tools such as `pg_dump` or continuous archiving (PITR) to back up both the global database and all regional databases (see the sketch after this list).
- Data Recovery: In the event of a catastrophic failure where enough nodes are lost that erasure-code reconstruction is no longer possible, you would:
  - Restore the PostgreSQL databases from your backup.
  - Restore the object data itself from your off-site backups (if you have them).
  - Launch a new Anvil cluster connected to the restored databases.
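A minimal sketch of the database-backup step is shown below; it simply shells out to `pg_dump` for each database. The database names (`anvil_global`, `anvil_region_us_east`) are placeholders, and for production you would typically also configure continuous archiving (PITR) rather than relying on ad-hoc dumps alone.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Placeholder database names; replace with your actual global and
	// regional Anvil databases.
	databases := []string{"anvil_global", "anvil_region_us_east"}

	for _, db := range databases {
		out := db + ".dump"
		// -Fc writes a compressed, custom-format archive that can be
		// restored selectively with pg_restore.
		cmd := exec.Command("pg_dump", "-Fc", "-f", out, db)
		if err := cmd.Run(); err != nil {
			fmt.Printf("backup of %s failed: %v\n", db, err)
			continue
		}
		fmt.Printf("wrote %s\n", out)
	}
}
```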
9.4. The Task Queue and Background Workers
Anvil uses a task queue within the global database to manage asynchronous, long-running, or deferrable operations. This ensures that the main API remains fast and responsive.
Each Anvil node runs a background worker that polls this queue for pending tasks.
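Anvil's exact schema and claiming logic are internal, but a common way to implement this kind of poller against Postgres is to claim one pending row at a time with `FOR UPDATE SKIP LOCKED`, as in the hypothetical sketch below. The `tasks` table and its columns are assumptions for illustration only.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

// claimTask claims one pending task and marks it as running so that other
// workers skip it. The tasks table and its columns are hypothetical.
func claimTask(db *sql.DB) (id int64, kind string, err error) {
	err = db.QueryRow(`
		UPDATE tasks SET status = 'running'
		WHERE id = (
			SELECT id FROM tasks
			WHERE status = 'pending'
			ORDER BY id
			LIMIT 1
			FOR UPDATE SKIP LOCKED
		)
		RETURNING id, kind`).Scan(&id, &kind)
	return id, kind, err
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/anvil_global?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	for {
		id, kind, err := claimTask(db)
		if err == sql.ErrNoRows {
			time.Sleep(5 * time.Second) // queue is empty; poll again later
			continue
		}
		if err != nil {
			log.Printf("polling failed: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}
		log.Printf("executing task %d (%s)", id, kind)
		// ... dispatch to the appropriate handler, then mark the task done ...
	}
}
```

`SKIP LOCKED` lets multiple workers poll the same table concurrently without blocking each other or claiming the same task twice.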
Key Tasks Handled by the Worker:
- Garbage Collection: When a user deletes an object or a bucket, it is initially "soft-deleted" (marked as deleted in the database), and a `DeleteObject` or `DeleteBucket` task is enqueued. The background worker picks up this task and performs the actual physical deletion of the object shards from the storage nodes (see the sketch after this list).
- Shard Repair: The worker will eventually be responsible for periodically scanning for missing or corrupted shards and enqueuing tasks to reconstruct them from the remaining erasure-coded data.
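To make the garbage-collection flow concrete, the hypothetical sketch below soft-deletes an object and enqueues a `DeleteObject` task in a single transaction, so the API call can return immediately while physical deletion happens later. The `objects` and `tasks` tables are assumptions, not Anvil's actual schema.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

// softDeleteObject marks the object as deleted and enqueues a DeleteObject
// task in one transaction. Table and column names are hypothetical.
func softDeleteObject(db *sql.DB, bucket, key string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction commits

	// Soft-delete: the object disappears from listings immediately.
	if _, err := tx.Exec(
		`UPDATE objects SET deleted_at = now() WHERE bucket = $1 AND key = $2`,
		bucket, key); err != nil {
		return err
	}

	// Enqueue the physical deletion for the background worker.
	if _, err := tx.Exec(
		`INSERT INTO tasks (kind, payload, status) VALUES ('DeleteObject', $1, 'pending')`,
		bucket+"/"+key); err != nil {
		return err
	}

	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/anvil_global?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if err := softDeleteObject(db, "photos", "cat.jpg"); err != nil {
		log.Fatal(err)
	}
}
```

Enqueuing inside the same transaction ensures an object is never soft-deleted without a corresponding cleanup task.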
Monitoring the health and depth of the task queue is a critical part of operating Anvil at scale.