Infrastructure Best Practice Guide
Created by Validatrium.com
- 🖥️ Infrastructure
- 🔐 Key Management
- 📊 Monitoring & Metrics
- 🔄 Node Updates
- 🆘 Failover & Disaster Recovery
- 🔒 Security
- 💡 Additional Recommendations
| Component | Specification |
|---|---|
| CPU | 8+ cores |
| RAM | 16–32 GB |
| Disk | SSD NVMe, 100+ GB |
- Holds signing keys
- Participates in consensus
- Not exposed to the public network
- Exposed to the internet
- Handles external RPC/API connections
- Connects to the validator node
- Does not sign blocks
- Fully synchronized with the network
- Stores inactive keys (to prevent double-signing)
- Can be promoted to validator in case of failover
- Validator node is isolated in a private network
- Only sentry nodes can connect to the validator
- Firewall configuration (Open only essential ports (e.g., RPC, gRPC, Prometheus)
- Store keys in
keystore/and manage viawallet.yaml - Store key backups offline only (e.g., YubiKey, HSM, or encrypted backup)
- Minimize the number of key copies
Maintain at least 3 encrypted backups:
| Location | Method |
|---|---|
| Local | encrypted disk or partition |
| Remote | e.g., S3, GCS |
| Offline | e.g., USB drive stored in a safe |
Encrypt backups using:
ansible-vaultgpgage
- Always back up the slashing protection DB along with keys
- Essential when migrating validator keys to another node, to make sure no double-signing event during recovery or migration
- Enable the Prometheus metrics endpoint on port
9153/metrics - Collect and visualize metrics via Grafana
- Set up alerts for:
- Missed blocks
- Peer count
- Disk usage
- High CPU/RAM
- Time drift
- Always stay up to date with the latest stable binary version
- Before updating:
- Backup
config.yaml,wallet.yaml, andkeystore/
- Backup
- After updating:
- Run
genlayernode doctor - Verify node health via
/health
- Run
- Immediately stop the old validator to prevent double-signing
- On the backup node:
- Restore validator keys and slashing protection DB
- Start the validator node
- Validate health using
genlayernode doctor
- Restore the most recent backup of
genlayer.db - If no backup exists — resync from genesis
Use Ansible and/or Terraform for:
- Rapid node deployment
- Infrastructure as Code (IaC) recovery
- Never store private keys on public servers
- Run validator under a dedicated system user
- Centralize logs via:
- Loki, ELK, or other logging stack
- Access only via VPN or SSH Bastion
- Regularly audit:
- File permissions
- Network exposure
- Binary integrity
- Use watchdog tools (e.g., systemd, supervisord) to auto-restart the node if it crashes
- Monitor for:
- Block production delays
- Peer disconnections
- Time synchronization issues
- Maintain a changelog of all updates and configuration changes