Safety Verification Guide

This document verifies that every NightOps operation is safe and reversible, and that full functionality is restored when resources are turned back on.

Safety Principles

NightOps follows these principles for all operations:

No Data Loss - All data is preserved during stop/start cycles
No Configuration Loss - All settings, network configs, and security rules remain intact
Idempotent Operations - Operations can be safely retried
State Preservation - Previous state is stored for accurate restoration
No Deletion - Resources are never deleted, only stopped/scaled down
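
The interaction between idempotent retries and state preservation can be sketched as follows. This is a minimal illustration, not NightOps's actual implementation: the `StateStore` class and its method names are hypothetical, and an in-memory dict stands in for whatever storage NightOps really uses. The point is that a retried stop must not overwrite the originally saved state with the already-stopped state.

```python
from typing import Dict, Optional


class StateStore:
    """Hypothetical in-memory stand-in for wherever pre-stop state is kept."""

    def __init__(self) -> None:
        self._saved: Dict[str, dict] = {}

    def save_once(self, resource_id: str, state: dict) -> bool:
        """Record state only if nothing is stored yet; return True if written."""
        if resource_id in self._saved:
            # A retry sees the already-stopped resource; keep the original state.
            return False
        self._saved[resource_id] = dict(state)
        return True

    def pop(self, resource_id: str) -> Optional[dict]:
        """Return and clear the saved state, or None if nothing was saved."""
        return self._saved.pop(resource_id, None)


store = StateStore()
store.save_once("svc-a", {"desiredCount": 3})  # first stop attempt
store.save_once("svc-a", {"desiredCount": 0})  # retried stop, value ignored
restored = store.pop("svc-a")
```

Clearing the entry on restore means the next stop cycle records a fresh value.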

AWS Services

EC2 Instances ✅ SAFE

Aspect	Status	Details
Operation	`StopInstances` / `StartInstances`
Data Preserved	✅ Yes	EBS volumes retain all data
Config Preserved	✅ Yes	Security groups, IAM role, network interfaces
Network	⚠️ Partial	Private IP preserved; Public IP changes unless Elastic IP
Billing Stops	✅ Yes	Compute charges stop; EBS storage still charged
Restore Time	~30-60 seconds

What changes on restart:

Public IP address (unless Elastic IP assigned)
Instance store volumes are wiped (use EBS for persistence)

What stays the same:

All EBS data
Private IP address
Security groups
IAM instance profile
Tags and metadata
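
One way to find instances whose public IP would change is to inspect the network interface associations before stopping. The sketch below assumes the input has the shape of a single instance entry from an EC2 `DescribeInstances` response, and relies on the assumption that auto-assigned public IPs are reported with `IpOwnerId` set to `amazon` while Elastic IPs carry the owning account ID. The function name is hypothetical.

```python
from typing import List


def ephemeral_public_ips(instance: dict) -> List[str]:
    """Return public IPs on this instance that are expected to change on restart.

    `instance` is assumed to look like one instance entry from DescribeInstances.
    """
    changing = []
    for nic in instance.get("NetworkInterfaces", []):
        assoc = nic.get("Association") or {}
        # Assumption: "amazon" marks an auto-assigned address, not an Elastic IP.
        if assoc.get("PublicIp") and assoc.get("IpOwnerId") == "amazon":
            changing.append(assoc["PublicIp"])
    return changing
```

An empty result suggests the instance either has no public IP or uses only Elastic IPs.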

RDS Databases ✅ SAFE

Aspect	Status	Details
Operation	`StopDBInstance` / `StartDBInstance`
Data Preserved	✅ Yes	All database data retained
Config Preserved	✅ Yes	Parameter groups, security groups, options
Network	✅ Yes	Endpoint DNS name unchanged
Billing Stops	✅ Yes	Compute stops; Storage still charged
Restore Time	~5-10 minutes

Important: AWS auto-restarts stopped RDS instances after 7 days. NightOps will re-stop them if outside scheduled hours.
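
The re-stop decision might look like the sketch below. The schedule representation (a daily active window given by two `time` values) and both function names are assumptions for illustration; NightOps's real scheduler may differ. Only an instance whose status is `available` is considered, since an instance that is still starting cannot be stopped yet.

```python
from datetime import datetime, time


def in_active_window(now: datetime, start: time, end: time) -> bool:
    """True if `now` falls inside the daily window [start, end)."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # window wraps past midnight


def should_restop(db_status: str, now: datetime, start: time, end: time) -> bool:
    """Re-stop an instance only if it is available outside the active window."""
    return db_status == "available" and not in_active_window(now, start, end)
```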

What stays the same:

All data
Endpoint DNS name
Connection strings
Automated backup configuration
Parameter groups

ECS Services ✅ SAFE

Aspect	Status	Details
Operation	`UpdateService` (desiredCount: 0 / restore)
Data Preserved	N/A	Stateless containers; use EFS/EBS for state
Config Preserved	✅ Yes	Task definition, service config, ALB attachment
Network	⚠️ New IPs	New tasks get new IPs on scale-up
Billing Stops	✅ Yes	No running tasks = no compute charges
Restore Time	~1-3 minutes

State preservation: NightOps stores the original desiredCount before scaling to zero.
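
A rough sketch of that flow is shown below, assuming a boto3-style ECS client is passed in as `ecs`. The function names, the `saved` dict, and the `FakeEcs` stand-in are hypothetical and exist only so the flow can be tried without AWS access; `describe_services` and `update_service` are the only client calls used.

```python
def scale_ecs_to_zero(ecs, cluster: str, service: str, saved: dict) -> int:
    """Record desiredCount, then scale to 0. Returns the count to restore later."""
    key = f"{cluster}/{service}"
    described = ecs.describe_services(cluster=cluster, services=[service])
    current = described["services"][0]["desiredCount"]
    saved.setdefault(key, current)  # a retry keeps the first recorded value
    if current != 0:
        ecs.update_service(cluster=cluster, service=service, desiredCount=0)
    return saved[key]


def restore_ecs(ecs, cluster: str, service: str, saved: dict):
    """Scale back to the recorded count, if one was recorded."""
    count = saved.pop(f"{cluster}/{service}", None)
    if count is not None:
        ecs.update_service(cluster=cluster, service=service, desiredCount=count)
    return count


class FakeEcs:
    """Offline stand-in exposing only the two calls used above."""

    def __init__(self, desired: int) -> None:
        self.desired = desired

    def describe_services(self, cluster, services):
        return {"services": [{"desiredCount": self.desired}]}

    def update_service(self, cluster, service, desiredCount):
        self.desired = desiredCount


ecs = FakeEcs(desired=4)
saved = {}
scale_ecs_to_zero(ecs, "staging", "api", saved)
scale_ecs_to_zero(ecs, "staging", "api", saved)  # retry leaves the saved 4 intact
count_after_stop = ecs.desired
restored_count = restore_ecs(ecs, "staging", "api", saved)
```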

What stays the same:

Task definition
Service configuration
Load balancer target group attachments
Service discovery registration
Auto-scaling policies (paused)

EKS Node Groups ✅ SAFE

Aspect	Status	Details
Operation	`UpdateNodegroupConfig` (desiredSize: 0 / restore)
Data Preserved	⚠️ Requires PVC	Use PersistentVolumeClaims for stateful data
Config Preserved	✅ Yes	Node group config, labels, taints
Network	⚠️ New IPs	New nodes get new IPs
Billing Stops	⚠️ Partial	Node compute stops; Control plane (~$72/mo) continues
Restore Time	~3-5 minutes

State preservation: NightOps stores the original desiredSize and minSize.
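
The reason `minSize` is stored alongside `desiredSize` is that a node group's desired size cannot drop below its minimum, so both have to be lowered together. The sketch below builds the `scalingConfig` payloads for the stop and restore calls; the helper names are hypothetical, and the dict keys follow the EKS `scalingConfig` fields.

```python
from typing import Tuple


def eks_stop_plan(scaling_config: dict) -> Tuple[dict, dict]:
    """Return (saved_state, stop_config) for a node group's current scalingConfig."""
    saved = {
        "desiredSize": scaling_config["desiredSize"],
        "minSize": scaling_config["minSize"],
    }
    # maxSize is passed through unchanged; only the lower bounds move to 0.
    stop = {"minSize": 0, "desiredSize": 0, "maxSize": scaling_config["maxSize"]}
    return saved, stop


def eks_restore_config(saved: dict, max_size: int) -> dict:
    """Rebuild the scalingConfig to send when scaling the node group back up."""
    return {
        "minSize": saved["minSize"],
        "desiredSize": saved["desiredSize"],
        "maxSize": max_size,
    }
```

Either payload would be passed as the `scalingConfig` argument of `UpdateNodegroupConfig`.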

What stays the same:

Cluster configuration
Node group settings
Launch template
Kubernetes labels and taints

Redshift Clusters ✅ SAFE

Aspect	Status	Details
Operation	`PauseCluster` / `ResumeCluster`
Data Preserved	✅ Yes	All data retained
Config Preserved	✅ Yes	Cluster configuration unchanged
Network	✅ Yes	Endpoint DNS name unchanged
Billing Stops	✅ Yes	Compute stops; Snapshot storage charged
Restore Time	~30-60 seconds

What stays the same:

All data
Endpoint DNS name
Connection strings
Parameter groups
IAM roles

Auto Scaling Groups ✅ SAFE

Aspect	Status	Details
Operation	`UpdateAutoScalingGroup` (capacity: 0 / restore)
Data Preserved	N/A	Use EBS or external storage for state
Config Preserved	✅ Yes	Launch template, scaling policies
Network	⚠️ New IPs	New instances get new IPs
Billing Stops	✅ Yes	No instances = no compute charges
Restore Time	~2-5 minutes

State preservation: NightOps stores desiredCapacity, minSize, and maxSize.

What stays the same:

Launch template
Scaling policies
Target group attachments
Health check configuration

GCP Services

Compute Engine ✅ SAFE

Aspect	Status	Details
Operation	`instances.stop` / `instances.start`
Data Preserved	✅ Yes	Persistent disks retain all data
Config Preserved	✅ Yes	Network, service account, metadata
Network	⚠️ Partial	Internal IP preserved; External IP changes unless static
Billing Stops	✅ Yes	Compute stops; Disk storage still charged
Restore Time	~30-60 seconds

What changes on restart:

Ephemeral external IP (use static IP if needed)
Local SSD data is lost (use Persistent Disk)

What stays the same:

All persistent disk data
Internal IP address
Network configuration
Service account
Labels and metadata

Cloud SQL ✅ SAFE

Aspect	Status	Details
Operation	Set `activationPolicy: NEVER` / `ALWAYS`
Data Preserved	✅ Yes	All database data retained
Config Preserved	✅ Yes	All settings unchanged
Network	✅ Yes	Connection name unchanged
Billing Stops	✅ Yes	Compute stops; Storage still charged
Restore Time	~2-5 minutes

Note: This changes the activation policy; nothing is deleted. The instance is deactivated but still exists.
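
As an illustration, the request body for toggling the activation policy through the Cloud SQL Admin API's `instances.patch` method could be built like this. The helper name is hypothetical; only the `settings.activationPolicy` field and its `ALWAYS` / `NEVER` values come from the operation described above.

```python
def cloud_sql_patch_body(running: bool) -> dict:
    """Build a patch body that activates (ALWAYS) or deactivates (NEVER) an instance."""
    return {"settings": {"activationPolicy": "ALWAYS" if running else "NEVER"}}
```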

What stays the same:

All data
Connection name
Private IP
Authorized networks
Backup configuration

Cloud Run ✅ SAFE

Aspect	Status	Details
Operation	Scale min/max instances to 0 / restore
Data Preserved	N/A	Stateless; use Cloud Storage/Firestore for state
Config Preserved	✅ Yes	Container config, env vars, secrets
Network	✅ Yes	Service URL unchanged
Billing Stops	✅ Yes	No instances = no charges
Restore Time	Instant (cold start on first request)

State preservation: NightOps stores original minInstances and maxInstances.

What stays the same:

Service URL
Container configuration
Environment variables
Secrets
VPC connector

GKE Node Pools ✅ SAFE

Aspect	Status	Details
Operation	`setNodePoolSize(0)` / restore
Data Preserved	⚠️ Requires PVC	Use PersistentVolumeClaims for stateful data
Config Preserved	✅ Yes	Node pool config preserved
Network	⚠️ New IPs	New nodes get new IPs
Billing Stops	⚠️ Partial	Node compute stops; Control plane continues
Restore Time	~2-5 minutes

State preservation: NightOps stores original node count.

What stays the same:

Node pool configuration
Machine type
Node labels and taints
Autoscaling configuration

Azure Services

Virtual Machines ✅ SAFE

Aspect	Status	Details
Operation	`Deallocate` / `Start`
Data Preserved	✅ Yes	Managed disks retain all data
Config Preserved	✅ Yes	Network, identity, extensions
Network	⚠️ Partial	Static private IP preserved; Dynamic public IP released
Billing Stops	✅ Yes	Compute stops; Disk storage still charged
Restore Time	~1-2 minutes

Important: Use Deallocate, NOT Stop. Stop keeps the VM allocated and still charges.
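
The difference shows up in the VM's instance view, where a stopped VM reports `PowerState/stopped` and a deallocated one reports `PowerState/deallocated`. A minimal check, assuming the caller has already extracted the list of status codes from the instance view (the function name is hypothetical):

```python
from typing import Iterable


def compute_billing_stopped(status_codes: Iterable[str]) -> bool:
    """True only when the VM's power state is deallocated, not merely stopped."""
    power_states = [c for c in status_codes if c.startswith("PowerState/")]
    return power_states == ["PowerState/deallocated"]
```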

What changes on restart:

Dynamic public IP (use static if needed)
Temp disk (D: drive) data is lost

What stays the same:

All managed disk data
Static private IP
Network interface
Managed identity
Extensions

Azure SQL Database ✅ SAFE (Serverless Only)

Aspect	Status	Details
Operation	`Pause` / `Resume`
Data Preserved	✅ Yes	All data retained
Config Preserved	✅ Yes	All settings unchanged
Network	✅ Yes	Connection string unchanged
Billing Stops	✅ Yes	Compute stops; Storage still charged
Restore Time	~1-2 minutes

Note: Only Serverless tier supports pause/resume. Provisioned tiers cannot be paused.

What stays the same:

All data
Connection strings
Firewall rules
Auditing configuration

AKS Node Pools ✅ SAFE (User Pools Only)

Aspect	Status	Details
Operation	Set `count: 0` / restore
Data Preserved	⚠️ Requires PVC	Use PersistentVolumeClaims for stateful data
Config Preserved	✅ Yes	Node pool config preserved
Network	⚠️ New IPs	New nodes get new IPs
Billing Stops	⚠️ Partial	Node compute stops; System pool continues
Restore Time	~3-5 minutes

Note: System node pools cannot scale to 0. Only user node pools can be fully scaled down.

State preservation: NightOps stores original count and minCount.
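
Selecting only user pools and capturing their state could be sketched as below. The input is assumed to be a list of dicts with `name`, `mode`, `count`, and `minCount` keys mirroring AKS agent pool properties; the function name and the returned structure are hypothetical.

```python
from typing import Dict, List


def aks_pools_to_scale_down(pools: List[dict]) -> Dict[str, dict]:
    """Return the state to save for each user-mode pool; system pools are skipped."""
    plan = {}
    for pool in pools:
        if pool.get("mode") != "User":
            continue  # system pools cannot scale to 0
        plan[pool["name"]] = {
            "count": pool["count"],
            "minCount": pool.get("minCount"),  # None when autoscaling is off
        }
    return plan
```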

VM Scale Sets ✅ SAFE

Aspect	Status	Details
Operation	Set `capacity: 0` / restore
Data Preserved	N/A	Use managed disks or external storage
Config Preserved	✅ Yes	Scale set config preserved
Network	⚠️ New IPs	New instances get new IPs
Billing Stops	✅ Yes	No instances = no charges
Restore Time	~2-5 minutes

State preservation: NightOps stores original capacity.

Services NOT Safe for Automation

These services should NOT be managed automatically:

Service	Provider	Reason
ElastiCache	AWS	No stop; delete loses endpoint
OpenSearch	AWS	No stop; requires snapshot restore
NAT Gateway	AWS	Delete changes IP; breaks routes
Load Balancers	AWS	Delete changes DNS
Memorystore	GCP	No stop; delete loses data
Cloud Spanner	GCP	Can't scale to 0
Azure Cache for Redis	Azure	No pause; delete loses data
Cosmos DB	Azure	No pause; minimum throughput cost
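
A scheduler can enforce this list with a simple guard before acting on a resource. The sketch below copies the table into a lookup; the `UNSAFE_SERVICES` name, the lowercase key format, and the `unsafe_reason` function are illustrative assumptions.

```python
from typing import Optional

# (provider, service) -> reason, copied from the table above.
UNSAFE_SERVICES = {
    ("aws", "elasticache"): "No stop; delete loses endpoint",
    ("aws", "opensearch"): "No stop; requires snapshot restore",
    ("aws", "nat gateway"): "Delete changes IP; breaks routes",
    ("aws", "load balancers"): "Delete changes DNS",
    ("gcp", "memorystore"): "No stop; delete loses data",
    ("gcp", "cloud spanner"): "Can't scale to 0",
    ("azure", "azure cache for redis"): "No pause; delete loses data",
    ("azure", "cosmos db"): "No pause; minimum throughput cost",
}


def unsafe_reason(provider: str, service: str) -> Optional[str]:
    """Return why a service must not be automated, or None if it is not listed."""
    return UNSAFE_SERVICES.get((provider.strip().lower(), service.strip().lower()))
```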

Pre-Flight Checklist

Before enabling NightOps for a staging environment:

All stateful services use persistent storage (EBS, Persistent Disk, Managed Disks)
Kubernetes workloads use PersistentVolumeClaims for data
Static IPs assigned if external IP stability required
Applications handle cold starts gracefully
Health checks configured with appropriate timeouts
CI/CD pipelines don't run during scheduled downtime
Monitoring alerts adjusted for expected downtime

Recovery Procedures

If a resource fails to restore properly:

General Steps

Check NightOps logs for error details
Verify resource state in cloud console
Check for quota/capacity issues
Manually start/scale if needed
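
When verifying state after a manual start, a bounded polling loop avoids both giving up too early and waiting forever. The helper below is a generic sketch, not part of NightOps: `get_state` is any caller-supplied function returning the resource's current state string, and the timeout should be chosen from the restore times listed for each service.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    target: str,
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until get_state() returns target; False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_state() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` parameter is injectable so the loop can be exercised without real delays.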

Service-Specific

RDS won't start:

Check if it auto-started (7-day limit)
Verify storage capacity
Check for pending maintenance

EKS pods not scheduling:

Verify the node group scaled up
Check pod resource requests vs node capacity
Review pending PVCs

Cloud Run cold start issues:

Increase min instances temporarily
Check container startup time
Review memory allocation
