Safety Verification Guide

This document verifies that all NightOps operations are safe and reversible, and that full functionality is restored when resources are turned back on.

Safety Principles

NightOps follows these principles for all operations:

  1. No Data Loss - All data is preserved during stop/start cycles
  2. No Configuration Loss - All settings, network configs, and security rules remain intact
  3. Idempotent Operations - Operations can be safely retried
  4. State Preservation - Previous state is stored for accurate restoration
  5. No Deletion - Resources are never deleted, only stopped/scaled down
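Principles 3 and 4 interact: if a scale-down is retried, the retry must not overwrite the saved original state with the already-zeroed values. The following is a minimal sketch of that idea, not the NightOps implementation. `StateStore`, `scale_down`, `restore`, and the `desired` field are hypothetical names used only for illustration, and the store is kept in memory for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class StateStore:
    """In-memory stand-in for wherever previous state is kept (hypothetical)."""
    saved: dict = field(default_factory=dict)

    def save_once(self, resource_id: str, state: dict) -> None:
        # Record state only the first time, so a retried scale-down
        # cannot replace the real original with zeroed values.
        self.saved.setdefault(resource_id, dict(state))

    def pop(self, resource_id: str):
        return self.saved.pop(resource_id, None)


def scale_down(store: StateStore, resource_id: str, current: dict) -> dict:
    """Return the state after scale-down; calling it again is a no-op."""
    if current["desired"] == 0:
        return current  # already scaled down, nothing to save
    store.save_once(resource_id, current)
    return {**current, "desired": 0}


def restore(store: StateStore, resource_id: str, current: dict) -> dict:
    """Return the saved state if one exists, otherwise leave things as they are."""
    saved = store.pop(resource_id)
    return current if saved is None else dict(saved)
```

Calling `scale_down` twice and then `restore` once yields the original `desired` value, and a second `restore` leaves the resource unchanged.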

AWS Services

EC2 Instances ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | StopInstances / StartInstances |
| Data Preserved | ✅ Yes | EBS volumes retain all data |
| Config Preserved | ✅ Yes | Security groups, IAM role, network interfaces |
| Network | ⚠️ Partial | Private IP preserved; Public IP changes unless Elastic IP |
| Billing Stops | ✅ Yes | Compute charges stop; EBS storage still charged |
| Restore Time | | ~30-60 seconds |

What changes on restart:

  • Public IP address (unless Elastic IP assigned)
  • Instance store volumes are wiped (use EBS for persistence)

What stays the same:

  • All EBS data
  • Private IP address
  • Security groups
  • IAM instance profile
  • Tags and metadata

RDS Databases ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | StopDBInstance / StartDBInstance |
| Data Preserved | ✅ Yes | All database data retained |
| Config Preserved | ✅ Yes | Parameter groups, security groups, options |
| Network | ✅ Yes | Endpoint DNS name unchanged |
| Billing Stops | ✅ Yes | Compute stops; Storage still charged |
| Restore Time | | ~5-10 minutes |

Important: AWS automatically restarts a stopped RDS instance after 7 days. NightOps stops it again if the restart falls outside scheduled hours.
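The re-stop decision can be sketched as a small pure function. This is an illustration under assumptions, not NightOps code: `in_active_hours` and `should_restop` are hypothetical names, the schedule is modelled as a single daily window, and `"available"` is the RDS instance status for a running database.

```python
from datetime import time


def in_active_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside the scheduled window [start, end)."""
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end


def should_restop(db_status: str, now: time, start: time, end: time) -> bool:
    """A running instance outside scheduled hours should be stopped again."""
    return db_status == "available" and not in_active_hours(now, start, end)
```

With a window of 08:00 to 20:00, an instance that AWS restarted at 02:00 would be stopped again, while one already in the `"stopped"` state would be left alone.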

What stays the same:

  • All data
  • Endpoint DNS name
  • Connection strings
  • Automated backup configuration
  • Parameter groups

ECS Services ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | UpdateService (desiredCount: 0 / restore) |
| Data Preserved | N/A | Stateless containers; use EFS/EBS for state |
| Config Preserved | ✅ Yes | Task definition, service config, ALB attachment |
| Network | ⚠️ New IPs | New tasks get new IPs on scale-up |
| Billing Stops | ✅ Yes | No running tasks = no compute charges |
| Restore Time | | ~1-3 minutes |

State preservation: NightOps stores the original desiredCount before scaling to zero.

What stays the same:

  • Task definition
  • Service configuration
  • Load balancer target group attachments
  • Service discovery registration
  • Auto-scaling policies (paused)

EKS Node Groups ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | UpdateNodegroupConfig (desiredSize: 0 / restore) |
| Data Preserved | ⚠️ Requires PVC | Use PersistentVolumeClaims for stateful data |
| Config Preserved | ✅ Yes | Node group config, labels, taints |
| Network | ⚠️ New IPs | New nodes get new IPs |
| Billing Stops | ⚠️ Partial | Node compute stops; Control plane (~$72/mo) continues |
| Restore Time | | ~3-5 minutes |

State preservation: NightOps stores the original desiredSize and minSize.

What stays the same:

  • Cluster configuration
  • Node group settings
  • Launch template
  • Kubernetes labels and taints

Redshift Clusters ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | PauseCluster / ResumeCluster |
| Data Preserved | ✅ Yes | All data retained |
| Config Preserved | ✅ Yes | Cluster configuration unchanged |
| Network | ✅ Yes | Endpoint DNS name unchanged |
| Billing Stops | ✅ Yes | Compute stops; Snapshot storage charged |
| Restore Time | | ~30-60 seconds |

What stays the same:

  • All data
  • Endpoint DNS name
  • Connection strings
  • Parameter groups
  • IAM roles

Auto Scaling Groups ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | UpdateAutoScalingGroup (capacity: 0 / restore) |
| Data Preserved | N/A | Use EBS or external storage for state |
| Config Preserved | ✅ Yes | Launch template, scaling policies |
| Network | ⚠️ New IPs | New instances get new IPs |
| Billing Stops | ✅ Yes | No instances = no compute charges |
| Restore Time | | ~2-5 minutes |

State preservation: NightOps stores desiredCapacity, minSize, and maxSize.
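As a rough sketch of why all three values are stored, the functions below build the parameters for an UpdateAutoScalingGroup call in each direction. `build_scale_down` and `build_restore` are hypothetical helpers. Whether NightOps also lowers MaxSize is not stated here, so this sketch assumes it is left unchanged. MinSize must be set to 0 together with DesiredCapacity, because desired capacity cannot be lower than the group minimum.

```python
def build_scale_down(group: dict) -> tuple:
    """Return (saved_state, request_params) for scaling a group to zero."""
    saved = {
        "MinSize": group["MinSize"],
        "MaxSize": group["MaxSize"],
        "DesiredCapacity": group["DesiredCapacity"],
    }
    params = {
        "AutoScalingGroupName": group["AutoScalingGroupName"],
        "MinSize": 0,                  # lowered so DesiredCapacity=0 is valid
        "MaxSize": group["MaxSize"],   # assumption: left unchanged
        "DesiredCapacity": 0,
    }
    return saved, params


def build_restore(name: str, saved: dict) -> dict:
    """Return request parameters that put the saved sizes back."""
    return {"AutoScalingGroupName": name, **saved}
```

The returned dictionaries use the parameter names of the UpdateAutoScalingGroup API, so they could be passed as keyword arguments to an SDK client call.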

What stays the same:

  • Launch template
  • Scaling policies
  • Target group attachments
  • Health check configuration

GCP Services

Compute Engine ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | instances.stop / instances.start |
| Data Preserved | ✅ Yes | Persistent disks retain all data |
| Config Preserved | ✅ Yes | Network, service account, metadata |
| Network | ⚠️ Partial | Internal IP preserved; External IP changes unless static |
| Billing Stops | ✅ Yes | Compute stops; Disk storage still charged |
| Restore Time | | ~30-60 seconds |

What changes on restart:

  • Ephemeral external IP (use static IP if needed)
  • Local SSD data is lost (use Persistent Disk)

What stays the same:

  • All persistent disk data
  • Internal IP address
  • Network configuration
  • Service account
  • Labels and metadata

Cloud SQL ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | Set activationPolicy: NEVER / ALWAYS |
| Data Preserved | ✅ Yes | All database data retained |
| Config Preserved | ✅ Yes | All settings unchanged |
| Network | ✅ Yes | Connection name unchanged |
| Billing Stops | ✅ Yes | Compute stops; Storage still charged |
| Restore Time | | ~2-5 minutes |

Note: This changes the activation policy; nothing is deleted. The instance is deactivated but still exists.

What stays the same:

  • All data
  • Connection name
  • Private IP
  • Authorized networks
  • Backup configuration

Cloud Run ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | Scale min/max instances to 0 / restore |
| Data Preserved | N/A | Stateless; use Cloud Storage/Firestore for state |
| Config Preserved | ✅ Yes | Container config, env vars, secrets |
| Network | ✅ Yes | Service URL unchanged |
| Billing Stops | ✅ Yes | No instances = no charges |
| Restore Time | | Instant (cold start on first request) |

State preservation: NightOps stores original minInstances and maxInstances.

What stays the same:

  • Service URL
  • Container configuration
  • Environment variables
  • Secrets
  • VPC connector

GKE Node Pools ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | setNodePoolSize(0) / restore |
| Data Preserved | ⚠️ Requires PVC | Use PersistentVolumeClaims for stateful data |
| Config Preserved | ✅ Yes | Node pool config preserved |
| Network | ⚠️ New IPs | New nodes get new IPs |
| Billing Stops | ⚠️ Partial | Node compute stops; Control plane continues |
| Restore Time | | ~2-5 minutes |

State preservation: NightOps stores original node count.

What stays the same:

  • Node pool configuration
  • Machine type
  • Node labels and taints
  • Autoscaling configuration

Azure Services

Virtual Machines ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | Deallocate / Start |
| Data Preserved | ✅ Yes | Managed disks retain all data |
| Config Preserved | ✅ Yes | Network, identity, extensions |
| Network | ⚠️ Partial | Static private IP preserved; Dynamic public IP released |
| Billing Stops | ✅ Yes | Compute stops; Disk storage still charged |
| Restore Time | | ~1-2 minutes |

Important: Use Deallocate, NOT Stop. Stop keeps the VM allocated and still charges.
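The difference shows up in the power state that Azure reports for a VM: `PowerState/stopped` means the VM is shut down but still allocated, while `PowerState/deallocated` means compute has been released. A minimal check, assuming the caller has already collected the status codes from the VM's instance view (`compute_charges_stopped` is a hypothetical helper name):

```python
def compute_charges_stopped(status_codes: list) -> bool:
    """True only when the VM is deallocated, not merely stopped."""
    return "PowerState/deallocated" in status_codes
```

A verification step could run this after the nightly shutdown and flag any VM that is stopped but still allocated.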

What changes on restart:

  • Dynamic public IP (use static if needed)
  • Temp disk (D: drive) data is lost

What stays the same:

  • All managed disk data
  • Static private IP
  • Network interface
  • Managed identity
  • Extensions

Azure SQL Database ✅ SAFE (Serverless Only)

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | Pause / Resume |
| Data Preserved | ✅ Yes | All data retained |
| Config Preserved | ✅ Yes | All settings unchanged |
| Network | ✅ Yes | Connection string unchanged |
| Billing Stops | ✅ Yes | Compute stops; Storage still charged |
| Restore Time | | ~1-2 minutes |

Note: Only Serverless tier supports pause/resume. Provisioned tiers cannot be paused.

What stays the same:

  • All data
  • Connection strings
  • Firewall rules
  • Auditing configuration

AKS Node Pools ✅ SAFE (User Pools Only)

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | Set count: 0 / restore |
| Data Preserved | ⚠️ Requires PVC | Use PersistentVolumeClaims for stateful data |
| Config Preserved | ✅ Yes | Node pool config preserved |
| Network | ⚠️ New IPs | New nodes get new IPs |
| Billing Stops | ⚠️ Partial | Node compute stops; System pool continues |
| Restore Time | | ~3-5 minutes |

Note: System node pools cannot scale to 0. Only user node pools can be fully scaled down.

State preservation: NightOps stores original count and minCount.
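The user-pool restriction can be expressed as a filter over the cluster's node pools. This is a sketch under assumptions: each pool is represented as a plain dictionary with `name` and `mode` keys, where `mode` is `"System"` or `"User"` as AKS reports it, and `pools_safe_to_zero` is a hypothetical helper. Pools with no recorded mode are treated as system pools to stay on the safe side.

```python
def pools_safe_to_zero(pools: list) -> list:
    """Return the names of node pools that may be scaled to 0."""
    return [
        pool["name"]
        for pool in pools
        # Missing mode defaults to "System", so the pool is skipped.
        if pool.get("mode", "System") == "User"
    ]
```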


VM Scale Sets ✅ SAFE

| Aspect | Status | Details |
| --- | --- | --- |
| Operation | | Set capacity: 0 / restore |
| Data Preserved | N/A | Use managed disks or external storage |
| Config Preserved | ✅ Yes | Scale set config preserved |
| Network | ⚠️ New IPs | New instances get new IPs |
| Billing Stops | ✅ Yes | No instances = no charges |
| Restore Time | | ~2-5 minutes |

State preservation: NightOps stores original capacity.


Services NOT Safe for Automation

These services should NOT be managed automatically:

| Service | Provider | Reason |
| --- | --- | --- |
| ElastiCache | AWS | No stop; delete loses endpoint |
| OpenSearch | AWS | No stop; requires snapshot restore |
| NAT Gateway | AWS | Delete changes IP; breaks routes |
| Load Balancers | AWS | Delete changes DNS |
| Memorystore | GCP | No stop; delete loses data |
| Cloud Spanner | GCP | Can't scale to 0 |
| Azure Cache for Redis | Azure | No pause; delete loses data |
| Cosmos DB | Azure | No pause; minimum throughput cost |
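One way to enforce this list is a guard that rejects unsupported targets before any schedule is applied. The sketch below is illustrative only: the `(provider, service)` keys, `UNSUPPORTED`, and `rejected_targets` are hypothetical names, not NightOps configuration.

```python
# Services from the table above, keyed by (provider, service) in lowercase.
UNSUPPORTED = {
    ("aws", "elasticache"),
    ("aws", "opensearch"),
    ("aws", "nat gateway"),
    ("aws", "load balancers"),
    ("gcp", "memorystore"),
    ("gcp", "cloud spanner"),
    ("azure", "azure cache for redis"),
    ("azure", "cosmos db"),
}


def rejected_targets(targets: list) -> list:
    """Return the (provider, service) pairs that must not be automated."""
    return [
        (provider, service)
        for provider, service in targets
        if (provider.lower(), service.lower()) in UNSUPPORTED
    ]
```

A non-empty result would mean the configuration should be refused rather than partially applied.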

Pre-Flight Checklist

Before enabling NightOps for a staging environment:

  • All stateful services use persistent storage (EBS, Persistent Disk, Managed Disks)
  • Kubernetes workloads use PersistentVolumeClaims for data
  • Static IPs assigned if external IP stability required
  • Applications handle cold starts gracefully
  • Health checks configured with appropriate timeouts
  • CI/CD pipelines don't run during scheduled downtime
  • Monitoring alerts adjusted for expected downtime

Recovery Procedures

If a resource fails to restore properly:

General Steps

  1. Check NightOps logs for error details
  2. Verify resource state in cloud console
  3. Check for quota/capacity issues
  4. Manually start/scale if needed
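Step 2 is often a matter of waiting: the restore times listed above range from seconds to several minutes. A minimal polling sketch with exponential backoff follows. `wait_until_restored` and its parameters are hypothetical, and `get_state` stands in for whatever call returns the resource's current state.

```python
import time
from typing import Callable


def wait_until_restored(
    get_state: Callable[[], str],
    target: str,
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the resource reaches `target`; False means manual action is needed."""
    for attempt in range(attempts):
        if get_state() == target:
            return True
        if attempt < attempts - 1:
            # Delays double each round: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return False
```

If the function returns False, continue with steps 3 and 4: check quotas and capacity, then start or scale the resource manually.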

Service-Specific

RDS won't start:

  • Check if it auto-started (7-day limit)
  • Verify storage capacity
  • Check for pending maintenance

EKS pods not scheduling:

  • Verify the node group has scaled up
  • Check pod resource requests vs node capacity
  • Review pending PVCs

Cloud Run cold start issues:

  • Increase min instances temporarily
  • Check container startup time
  • Review memory allocation