Safety Verification Guide
This document verifies that every NightOps operation is safe and reversible, and that full functionality returns when resources are turned back on.
Safety Principles
NightOps follows these principles for all operations:
- No Data Loss - All data is preserved during stop/start cycles
- No Configuration Loss - All settings, network configs, and security rules remain intact
- Idempotent Operations - Operations can be safely retried
- State Preservation - Previous state is stored for accurate restoration
- No Deletion - Resources are never deleted, only stopped/scaled down
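
The idempotency and state-preservation principles can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `StateStore`; the class and method names are illustrative only and are not NightOps' actual storage layer, which would need to persist state durably.

```python
class StateStore:
    """Hypothetical store for pre-shutdown state (illustrative, in-memory only)."""

    def __init__(self):
        self._states = {}

    def save_once(self, resource_id, state):
        """Record state only if none is stored yet.

        A retried stop then cannot overwrite the original values with the
        already-scaled-down ones. Returns True if the state was recorded.
        """
        if resource_id in self._states:
            return False
        self._states[resource_id] = dict(state)
        return True

    def pop(self, resource_id):
        """Return and remove the stored state, or None if nothing was saved."""
        return self._states.pop(resource_id, None)


store = StateStore()
store.save_once("ecs/api", {"desiredCount": 3})  # first stop: recorded
store.save_once("ecs/api", {"desiredCount": 0})  # retried stop: ignored
original = store.pop("ecs/api")                  # the values to restore
```

Saving only on the first call is what makes a retried stop safe: the second call sees the already-stopped resource, and would otherwise record a count of zero as the "original" state.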
AWS Services
EC2 Instances ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | StopInstances / StartInstances | |
| Data Preserved | ✅ Yes | EBS volumes retain all data |
| Config Preserved | ✅ Yes | Security groups, IAM role, network interfaces |
| Network | ⚠️ Partial | Private IP preserved; Public IP changes unless Elastic IP |
| Billing Stops | ✅ Yes | Compute charges stop; EBS storage still charged |
| Restore Time | ~30-60 seconds | |
What changes on restart:
- Public IP address (unless Elastic IP assigned)
- Instance store volumes are wiped (use EBS for persistence)
What stays the same:
- All EBS data
- Private IP address
- Security groups
- IAM instance profile
- Tags and metadata
RDS Databases ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | StopDBInstance / StartDBInstance | |
| Data Preserved | ✅ Yes | All database data retained |
| Config Preserved | ✅ Yes | Parameter groups, security groups, options |
| Network | ✅ Yes | Endpoint DNS name unchanged |
| Billing Stops | ✅ Yes | Compute stops; Storage still charged |
| Restore Time | ~5-10 minutes | |
Important: AWS automatically restarts an RDS instance once it has been stopped for 7 consecutive days. NightOps re-stops the instance if that restart falls outside scheduled hours.
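
The re-stop decision could look roughly like the sketch below. The schedule (weekdays, 08:00 to 20:00) and both function names are assumptions made for illustration; only the `available` status string comes from the RDS API.

```python
from datetime import datetime, time


def in_active_window(now, start=time(8, 0), end=time(20, 0), weekdays=range(0, 5)):
    """True if `now` falls inside the (hypothetical) scheduled working hours."""
    return now.weekday() in weekdays and start <= now.time() < end


def should_restop(db_status, now):
    """Re-stop an instance found running outside the active window,
    for example after the automatic restart that follows 7 days stopped."""
    return db_status == "available" and not in_active_window(now)


# A Saturday at noon: the instance is running but should not be.
weekend_check = should_restop("available", datetime(2024, 1, 6, 12, 0))
```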
What stays the same:
- All data
- Endpoint DNS name
- Connection strings
- Automated backup configuration
- Parameter groups
ECS Services ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | UpdateService (desiredCount: 0 / restore) | |
| Data Preserved | N/A | Stateless containers; use EFS/EBS for state |
| Config Preserved | ✅ Yes | Task definition, service config, ALB attachment |
| Network | ⚠️ New IPs | New tasks get new IPs on scale-up |
| Billing Stops | ✅ Yes | No running tasks = no compute charges |
| Restore Time | ~1-3 minutes | |
State preservation: NightOps stores the original desiredCount before scaling to zero.
What stays the same:
- Task definition
- Service configuration
- Load balancer target group attachments
- Service discovery registration
- Auto-scaling policies (paused)
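
As a rough sketch of the scale-down and restore cycle for an ECS service, assuming an `ecs` object that behaves like a boto3 ECS client (`describe_services`, `update_service`) and a plain dict standing in for NightOps' state store. The function names and the `FakeEcs` stand-in are hypothetical and exist only so the example runs without AWS access.

```python
def scale_ecs_to_zero(ecs, saved, cluster, service):
    """Remember the current desiredCount, then scale the service to 0."""
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    if svc["desiredCount"] == 0:
        return  # already scaled down: keep the previously saved count
    saved[(cluster, service)] = svc["desiredCount"]
    ecs.update_service(cluster=cluster, service=service, desiredCount=0)


def restore_ecs(ecs, saved, cluster, service):
    """Scale back to the saved desiredCount; return it, or None if none was saved."""
    count = saved.pop((cluster, service), None)
    if count is not None:
        ecs.update_service(cluster=cluster, service=service, desiredCount=count)
    return count


class FakeEcs:
    """Tiny stand-in for the client (not part of any real library)."""

    def __init__(self, desired):
        self.desired = desired

    def describe_services(self, cluster, services):
        return {"services": [{"serviceName": services[0], "desiredCount": self.desired}]}

    def update_service(self, cluster, service, desiredCount):
        self.desired = desiredCount


ecs, saved = FakeEcs(desired=4), {}
scale_ecs_to_zero(ecs, saved, "staging", "api")
scale_ecs_to_zero(ecs, saved, "staging", "api")  # retry leaves the saved 4 intact
restored = restore_ecs(ecs, saved, "staging", "api")
```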
EKS Node Groups ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | UpdateNodegroupConfig (desiredSize: 0 / restore) | |
| Data Preserved | ⚠️ Requires PVC | Use PersistentVolumeClaims for stateful data |
| Config Preserved | ✅ Yes | Node group config, labels, taints |
| Network | ⚠️ New IPs | New nodes get new IPs |
| Billing Stops | ⚠️ Partial | Node compute stops; Control plane (~$72/mo) continues |
| Restore Time | ~3-5 minutes | |
State preservation: NightOps stores the original desiredSize and minSize.
What stays the same:
- Cluster configuration
- Node group settings
- Launch template
- Kubernetes labels and taints
Redshift Clusters ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | PauseCluster / ResumeCluster | |
| Data Preserved | ✅ Yes | All data retained |
| Config Preserved | ✅ Yes | Cluster configuration unchanged |
| Network | ✅ Yes | Endpoint DNS name unchanged |
| Billing Stops | ✅ Yes | Compute stops; cluster storage and snapshots still charged |
| Restore Time | ~30-60 seconds | |
What stays the same:
- All data
- Endpoint DNS name
- Connection strings
- Parameter groups
- IAM roles
Auto Scaling Groups ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | UpdateAutoScalingGroup (capacity: 0 / restore) | |
| Data Preserved | N/A | Use EBS or external storage for state |
| Config Preserved | ✅ Yes | Launch template, scaling policies |
| Network | ⚠️ New IPs | New instances get new IPs |
| Billing Stops | ✅ Yes | No instances = no compute charges |
| Restore Time | ~2-5 minutes | |
State preservation: NightOps stores desiredCapacity, minSize, and maxSize.
What stays the same:
- Launch template
- Scaling policies
- Target group attachments
- Health check configuration
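
The three stored values matter because an Auto Scaling group's desired capacity must stay between its minimum and maximum size, so scaling to zero also means lowering the minimum to zero. The sketch below only builds the request arguments; the function names are hypothetical, and the dictionary keys follow the boto3 `describe_auto_scaling_groups` and `update_auto_scaling_group` parameter names.

```python
def plan_asg_shutdown(group):
    """Given one entry from describe_auto_scaling_groups()["AutoScalingGroups"],
    return (state_to_save, kwargs_for_update_auto_scaling_group)."""
    saved = {key: group[key] for key in ("MinSize", "MaxSize", "DesiredCapacity")}
    update = {
        "AutoScalingGroupName": group["AutoScalingGroupName"],
        "MinSize": 0,          # must drop to 0 before DesiredCapacity can
        "DesiredCapacity": 0,
    }
    return saved, update


def plan_asg_restore(name, saved):
    """Build the kwargs that put the group back to its saved sizes."""
    return {"AutoScalingGroupName": name, **saved}


group = {"AutoScalingGroupName": "web", "MinSize": 2, "MaxSize": 6, "DesiredCapacity": 3}
saved, shutdown_kwargs = plan_asg_shutdown(group)
restore_kwargs = plan_asg_restore("web", saved)
```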
GCP Services
Compute Engine ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | instances.stop / instances.start | |
| Data Preserved | ✅ Yes | Persistent disks retain all data |
| Config Preserved | ✅ Yes | Network, service account, metadata |
| Network | ⚠️ Partial | Internal IP preserved; External IP changes unless static |
| Billing Stops | ✅ Yes | Compute stops; Disk storage still charged |
| Restore Time | ~30-60 seconds | |
What changes on restart:
- Ephemeral external IP (use static IP if needed)
- Local SSD data is lost (use Persistent Disk)
What stays the same:
- All persistent disk data
- Internal IP address
- Network configuration
- Service account
- Labels and metadata
Cloud SQL ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | Set activationPolicy: NEVER / ALWAYS | |
| Data Preserved | ✅ Yes | All database data retained |
| Config Preserved | ✅ Yes | All settings unchanged |
| Network | ✅ Yes | Connection name unchanged |
| Billing Stops | ✅ Yes | Compute stops; Storage still charged |
| Restore Time | ~2-5 minutes | |
Note: This changes the instance's activation policy; nothing is deleted. The instance is stopped but continues to exist.
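
For illustration, the request body for such a change might be built as follows. The helper name is hypothetical; the `settings.activationPolicy` field and its `ALWAYS` and `NEVER` values are those of the Cloud SQL Admin API instance resource.

```python
def activation_policy_patch(running):
    """Body for a Cloud SQL Admin API instances.patch call.

    Only the activation policy is included, so no other setting is touched.
    """
    return {"settings": {"activationPolicy": "ALWAYS" if running else "NEVER"}}


stop_body = activation_policy_patch(running=False)
start_body = activation_policy_patch(running=True)
```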
What stays the same:
- All data
- Connection name
- Private IP
- Authorized networks
- Backup configuration
Cloud Run ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | Scale min/max instances to 0 / restore | |
| Data Preserved | N/A | Stateless; use Cloud Storage/Firestore for state |
| Config Preserved | ✅ Yes | Container config, env vars, secrets |
| Network | ✅ Yes | Service URL unchanged |
| Billing Stops | ✅ Yes | No instances = no charges |
| Restore Time | Instant (cold start on first request) | |
State preservation: NightOps stores original minInstances and maxInstances.
What stays the same:
- Service URL
- Container configuration
- Environment variables
- Secrets
- VPC connector
GKE Node Pools ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | setNodePoolSize(0) / restore | |
| Data Preserved | ⚠️ Requires PVC | Use PersistentVolumeClaims for stateful data |
| Config Preserved | ✅ Yes | Node pool config preserved |
| Network | ⚠️ New IPs | New nodes get new IPs |
| Billing Stops | ⚠️ Partial | Node compute stops; Control plane continues |
| Restore Time | ~2-5 minutes | |
State preservation: NightOps stores original node count.
What stays the same:
- Node pool configuration
- Machine type
- Node labels and taints
- Autoscaling configuration
Azure Services
Virtual Machines ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | Deallocate / Start | |
| Data Preserved | ✅ Yes | Managed disks retain all data |
| Config Preserved | ✅ Yes | Network, identity, extensions |
| Network | ⚠️ Partial | Static private IP preserved; Dynamic public IP released |
| Billing Stops | ✅ Yes | Compute stops; Disk storage still charged |
| Restore Time | ~1-2 minutes | |
Important: Use Deallocate, not Stop. A VM that is stopped but not deallocated keeps its hardware allocation and continues to incur compute charges.
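
One way to detect a VM left in the stopped-but-allocated state is to inspect the power state code in its instance view. The sketch below assumes the status list is available as plain dicts with a `code` key; the function names are hypothetical, while the `PowerState/stopped` and `PowerState/deallocated` codes are the ones Azure reports.

```python
def power_state(statuses):
    """Extract the PowerState code from a VM instance view's status list."""
    for status in statuses:
        code = status.get("code", "")
        if code.startswith("PowerState/"):
            return code
    return None


def needs_deallocate(statuses):
    """True when the VM is stopped but still allocated, and so still billed."""
    return power_state(statuses) == "PowerState/stopped"


stopped_vm = [{"code": "ProvisioningState/succeeded"}, {"code": "PowerState/stopped"}]
deallocated_vm = [{"code": "ProvisioningState/succeeded"}, {"code": "PowerState/deallocated"}]
flag_stopped = needs_deallocate(stopped_vm)
flag_deallocated = needs_deallocate(deallocated_vm)
```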
What changes on restart:
- Dynamic public IP (use static if needed)
- Temporary disk data is lost (the D: drive on Windows VMs by default; usually /mnt or /mnt/resource on Linux)
What stays the same:
- All managed disk data
- Static private IP
- Network interface
- Managed identity
- Extensions
Azure SQL Database ✅ SAFE (Serverless Only)
| Aspect | Status | Details |
|---|---|---|
| Operation | Pause / Resume | |
| Data Preserved | ✅ Yes | All data retained |
| Config Preserved | ✅ Yes | All settings unchanged |
| Network | ✅ Yes | Connection string unchanged |
| Billing Stops | ✅ Yes | Compute stops; Storage still charged |
| Restore Time | ~1-2 minutes | |
Note: Only Serverless tier supports pause/resume. Provisioned tiers cannot be paused.
What stays the same:
- All data
- Connection strings
- Firewall rules
- Auditing configuration
AKS Node Pools ✅ SAFE (User Pools Only)
| Aspect | Status | Details |
|---|---|---|
| Operation | Set count: 0 / restore | |
| Data Preserved | ⚠️ Requires PVC | Use PersistentVolumeClaims for stateful data |
| Config Preserved | ✅ Yes | Node pool config preserved |
| Network | ⚠️ New IPs | New nodes get new IPs |
| Billing Stops | ⚠️ Partial | Node compute stops; System pool continues |
| Restore Time | ~3-5 minutes | |
Note: System node pools cannot scale to 0. Only user node pools can be fully scaled down.
State preservation: NightOps stores original count and minCount.
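
The user-pool restriction and the stored values can be sketched together. This assumes each node pool is available as a dict with `name`, `mode`, `count`, and `minCount` keys, mirroring AKS agent pool properties; the exact shape and the function name are assumptions of this example.

```python
def plan_aks_scale_down(pools):
    """Split pools into state to save (user pools) and names to skip (system pools)."""
    saved, skipped = {}, []
    for pool in pools:
        if pool.get("mode") != "User":
            skipped.append(pool["name"])  # system pools cannot scale to 0
            continue
        saved[pool["name"]] = {
            "count": pool["count"],
            "minCount": pool.get("minCount"),  # None when autoscaling is off
        }
    return saved, skipped


pools = [
    {"name": "system", "mode": "System", "count": 2, "minCount": 1},
    {"name": "workers", "mode": "User", "count": 5, "minCount": 3},
    {"name": "batch", "mode": "User", "count": 1},
]
saved_pools, skipped_pools = plan_aks_scale_down(pools)
```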
VM Scale Sets ✅ SAFE
| Aspect | Status | Details |
|---|---|---|
| Operation | Set capacity: 0 / restore | |
| Data Preserved | N/A | Use managed disks or external storage |
| Config Preserved | ✅ Yes | Scale set config preserved |
| Network | ⚠️ New IPs | New instances get new IPs |
| Billing Stops | ✅ Yes | No instances = no charges |
| Restore Time | ~2-5 minutes | |
State preservation: NightOps stores original capacity.
Services NOT Safe for Automation
These services should NOT be managed automatically:
| Service | Provider | Reason |
|---|---|---|
| ElastiCache | AWS | No stop; delete loses endpoint |
| OpenSearch | AWS | No stop; requires snapshot restore |
| NAT Gateway | AWS | Delete changes IP; breaks routes |
| Load Balancers | AWS | Delete changes DNS |
| Memorystore | GCP | No stop; delete loses data |
| Cloud Spanner | GCP | Can't scale to 0 |
| Azure Cache for Redis | Azure | No pause; delete loses data |
| Cosmos DB | Azure | No pause; minimum throughput cost |
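
A scheduler could enforce this table with a simple deny list checked before any operation. The sketch below is one possible shape; the lowercase service identifiers and the function name are hypothetical, and the reasons are copied from the table.

```python
UNSAFE_SERVICES = {
    ("aws", "elasticache"): "No stop; delete loses endpoint",
    ("aws", "opensearch"): "No stop; requires snapshot restore",
    ("aws", "nat-gateway"): "Delete changes IP; breaks routes",
    ("aws", "load-balancer"): "Delete changes DNS",
    ("gcp", "memorystore"): "No stop; delete loses data",
    ("gcp", "cloud-spanner"): "Can't scale to 0",
    ("azure", "redis-cache"): "No pause; delete loses data",
    ("azure", "cosmos-db"): "No pause; minimum throughput cost",
}


def check_automatable(provider, service):
    """Raise ValueError if the service is on the deny list; otherwise return None."""
    reason = UNSAFE_SERVICES.get((provider.lower(), service.lower()))
    if reason is not None:
        raise ValueError(f"{provider}/{service} must not be automated: {reason}")


check_automatable("aws", "ec2")  # passes silently
try:
    check_automatable("AWS", "ElastiCache")
    blocked = False
except ValueError:
    blocked = True
```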
Pre-Flight Checklist
Before enabling NightOps for a staging environment:
- All stateful services use persistent storage (EBS, Persistent Disk, Managed Disks)
- Kubernetes workloads use PersistentVolumeClaims for data
- Static IPs assigned if external IP stability required
- Applications handle cold starts gracefully
- Health checks configured with appropriate timeouts
- CI/CD pipelines don't run during scheduled downtime
- Monitoring alerts adjusted for expected downtime
Recovery Procedures
If a resource fails to restore properly:
General Steps
- Check NightOps logs for error details
- Verify resource state in cloud console
- Check for quota/capacity issues
- Manually start/scale if needed
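
The general steps imply a bounded wait before falling back to manual intervention. A minimal polling helper might look like this; the function name and parameters are assumptions, and `is_ready` stands for whatever provider-specific status check applies.

```python
import time


def wait_for_restore(is_ready, timeout_s, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `is_ready()` until it returns True or `timeout_s` elapses.

    Returns True on success and False on timeout, at which point the
    resource needs a manual check in the cloud console.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Simulated resource that becomes ready on the third check.
checks = iter([False, False, True])
came_back = wait_for_restore(lambda: next(checks), timeout_s=60, sleep=lambda s: None)
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real delays.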
Service-Specific
RDS won't start:
- Check if it auto-started (7-day limit)
- Verify storage capacity
- Check for pending maintenance
EKS pods not scheduling:
- Verify the node group scaled up
- Check pod resource requests vs node capacity
- Review pending PVCs
Cloud Run cold start issues:
- Increase min instances temporarily
- Check container startup time
- Review memory allocation