Troubleshooting

This guide covers common issues, debugging techniques, and reference information for diagnosing problems with the Aerospike CE Kubernetes Operator.

Symptom-Based Diagnosis

| Symptom | Check Command | Likely Cause | Resolution |
| --- | --- | --- | --- |
| Phase = Error | kubectl get asc <name> -o jsonpath='{.status.lastReconcileError}' | Invalid config, image pull failure, insufficient resources | Fix based on error message and re-apply |
| Phase = WaitingForMigration | kubectl exec <pod> -- asinfo -v 'statistics' \| grep migrate | Data migration in progress | Wait for completion (automatic) |
| Stuck at InProgress | kubectl get pvc -n <ns> -l aerospike.io/cr-name=<name> | PVC Pending, ImagePull failure, scheduling failure | Check StorageClass, image, and resources |
| CircuitBreakerActive event | kubectl get asc <name> -o jsonpath='{.status.failedReconcileCount}' | 10+ consecutive failures | Check lastReconcileError, fix root cause |
| Pod CrashLoopBackOff | kubectl logs <pod> -c aerospike-server --previous | Config parsing error, out of memory | Check server logs and fix config |
| Webhook rejects CR | Check kubectl apply error message | CE constraint violation | See Validation Error Patterns |
| dynamicConfigStatus=Failed | kubectl get asc <name> -o jsonpath='{.status.pods}' \| jq '.[].dynamicConfigStatus' | Non-dynamic parameter change | Set enableDynamicConfigUpdate: false to trigger rolling restart |
| ReadinessGateBlocking | kubectl get pod <pod> -o jsonpath='{.status.conditions}' \| jq '.[]' | Readiness gate not satisfied | Check pod Aerospike status and migration state |

Common Issues and Solutions

Cluster Stuck in InProgress

The cluster phase stays at InProgress and does not transition to Completed.

Possible causes:

  1. PVC not binding — The StorageClass does not exist or has no available capacity.
  2. Image pull failure — The container image is not accessible from the cluster.
  3. Scheduling failure — No nodes match the pod's scheduling constraints.
  4. Circuit breaker active — The operator is backing off after repeated failures.

Steps to diagnose:

# Check cluster phase and reason
kubectl -n aerospike get asc <name> -o jsonpath='{.status.phase}{"\t"}{.status.phaseReason}'

# Check for pending PVCs
kubectl -n aerospike get pvc -l aerospike.io/cr-name=<name>

# Check pod events for scheduling or pull errors
kubectl -n aerospike describe pod <pod-name>

# Check operator logs
kubectl -n aerospike-operator logs -l control-plane=controller-manager | tail -50

# Check if circuit breaker is active
kubectl -n aerospike get asc <name> \
-o jsonpath='{.status.failedReconcileCount}{"\t"}{.status.lastReconcileError}'

Pods Not Ready (CrashLoopBackOff)

Aerospike pods start but crash repeatedly.

Possible causes:

  1. Invalid aerospikeConfig — Namespace names, storage paths, or parameters are incorrect.
  2. Insufficient memory — The container memory limit is too low for the configured namespaces.
  3. Storage path mismatch — Configured file paths do not match mounted volumes.

Steps to diagnose:

# Check current crash logs
kubectl -n aerospike logs <pod-name> -c aerospike-server

# Check previous crash logs
kubectl -n aerospike logs <pod-name> -c aerospike-server --previous

# Check pod status details
kubectl -n aerospike describe pod <pod-name>

Scaling Failures

Scale-up or scale-down does not complete.

Scale-up issues:

  • Verify spec.size does not exceed 8 (CE limit).
  • Check that replication-factor does not exceed the new cluster size.
  • Ensure sufficient cluster resources (CPU, memory, storage).

Scale-down issues:

  • The operator waits for data migrations to complete before removing pods.
  • Check for ScaleDownDeferred events indicating migration is blocking scale-down.

# Check for scale-down deferral events
kubectl get events --field-selector reason=ScaleDownDeferred -n aerospike

# Check migration status (asinfo prints semicolon-separated pairs on one line, so split them before grepping)
kubectl -n aerospike exec <pod-name> -c aerospike-server -- asinfo -v 'statistics' | tr ';' '\n' | grep migrate_partitions_remaining
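The migration check above can be wrapped in a small filter so the remaining partition count is easy to read or to use in a wait loop. This is a minimal sketch: the function name migrations_remaining is hypothetical, and it assumes the statistics output is a list of semicolon-separated key=value pairs.

```shell
# Hypothetical helper: read asinfo 'statistics' output on stdin and print
# the value of migrate_partitions_remaining (prints nothing if the key is absent).
migrations_remaining() {
  tr ';' '\n' | awk -F= '$1 == "migrate_partitions_remaining" { print $2 }'
}
```

To use it, pipe the asinfo command shown above into migrations_remaining. A value of 0 suggests migrations have finished on that node.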

ACL Sync Failures

ACL roles or users fail to synchronize with the Aerospike cluster.

Possible causes:

  1. Missing Secret — The referenced Kubernetes Secret does not exist.
  2. Missing password key — The Secret does not contain a password key.
  3. No admin user — No user has both sys-admin and user-admin roles.

Steps to diagnose:

# Check ACL sync events
kubectl get events --field-selector reason=ACLSyncError -n aerospike

# Verify the Secret exists and contains the password key
kubectl -n aerospike get secret <secret-name> -o jsonpath='{.data.password}' | base64 -d

# Check operator logs for ACL errors
kubectl -n aerospike-operator logs -l control-plane=controller-manager | grep -i acl

Dynamic Config Failures

Configuration changes trigger a restart instead of applying dynamically.

Possible causes:

  1. enableDynamicConfigUpdate not set — Dynamic updates are off by default.
  2. Static parameter changed — Parameters like replication-factor, storage-engine type, and name always require a restart.
  3. Invalid characters — Parameter values containing ; or : are rejected by pre-flight validation.
  4. Partial failure with rollback — If one change in a batch fails, the operator rolls back all applied changes across all pods and falls back to a cold restart.
  5. ConfigDegraded state — If rollback itself fails, the cluster enters ConfigDegraded phase. The operator will attempt cold restart recovery.
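The invalid-character rule in cause 3 can be checked locally before applying a change. The sketch below is an illustration only: has_rejected_chars is a hypothetical name, and the operator's actual pre-flight validation may cover more cases than the two characters listed here.

```shell
# Hypothetical pre-check: succeed (exit 0) if a parameter value contains
# ';' or ':', the characters this guide says pre-flight validation rejects.
has_rejected_chars() {
  case "$1" in
    *\;*|*:*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

For example, has_rejected_chars "10.0.0.1:3000" succeeds, which signals that the value would not be applied dynamically.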

Steps to diagnose:

# Check per-pod dynamic config status
kubectl -n aerospike get asc <name> -o jsonpath='{.status.pods}' | \
jq '.[] | {name: .podName, dynamicConfig: .dynamicConfigStatus}'

# Check per-change details for a specific pod
kubectl -n aerospike get asc <name> -o jsonpath='{.status.pods.<pod-name>.dynamicConfigChanges}'

# Check for dynamic config events
kubectl get events --field-selector reason=DynamicConfigApplied -n aerospike
kubectl get events --field-selector reason=DynamicConfigDegraded -n aerospike
kubectl get events --field-selector reason=DynamicConfigRollback -n aerospike

# Check operator logs for 2PC and rollback activity
kubectl -n aerospike-operator logs -l control-plane=controller-manager | grep -i "rollback\|dynamic config\|2PC\|ConfigDegraded"

| Status | Meaning |
| --- | --- |
| Applied | Dynamic config applied successfully at runtime |
| Failed | Dynamic update failed; a rolling restart will be triggered |
| Pending | Waiting for the operator to apply the change |
| (empty) | No dynamic config change was attempted |

If the cluster is in ConfigDegraded phase, check the DynamicConfigDegraded condition for details about which pods have inconsistent configuration.

Circuit Breaker and Recovery

The operator includes a built-in circuit breaker to prevent excessive retries on persistently failing clusters.

How It Works

After 10 consecutive reconciliation failures, the operator enters a backoff state with exponential delays:

| Consecutive Failures | Backoff Delay |
| --- | --- |
| 1 | 2 seconds |
| 2 | 4 seconds |
| 3 | 8 seconds |
| 5 | 32 seconds |
| 8+ | ~4.3 minutes (capped at 256 seconds) |

While the circuit breaker is active, a CircuitBreakerActive warning event is emitted with the failure count and last error.
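The delays in the table are consistent with a doubling schedule of 2^n seconds capped at 256 seconds. The sketch below reproduces that schedule so you can estimate how long the next retry may take. The formula is inferred from the table rather than taken from the operator source, and backoff_seconds is a hypothetical name.

```shell
# Hypothetical helper: estimate the backoff delay in seconds for a given
# consecutive-failure count, assuming delay = 2^n capped at 256 seconds.
backoff_seconds() {
  local n="$1" delay=1 i=0
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 256 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  echo "$delay"
}
```

Passing the current failedReconcileCount value to backoff_seconds gives a rough upper bound on the wait before the next attempt.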

info

Permanent validation errors (e.g., invalid Aerospike config structure, missing ACL secrets, invalid privilege codes) immediately activate the circuit breaker by setting failedReconcileCount to the maximum threshold. A PermanentError event is emitted, and the ReconcileHealthy status condition is set to False with reason PermanentError. These errors will never self-heal — fix the spec to recover.

Checking Circuit Breaker Status

# Check for active circuit breaker events
kubectl get events --field-selector reason=CircuitBreakerActive -n aerospike

# Check failure count and last error
kubectl -n aerospike get asc <name> \
-o jsonpath='{.status.failedReconcileCount}{"\t"}{.status.lastReconcileError}'
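For scripted checks, the failure count can be compared against the threshold of 10 described in How It Works. This is a sketch under that assumption: circuit_breaker_state is a hypothetical helper, and an empty failedReconcileCount is treated as 0.

```shell
# Hypothetical helper: report whether the circuit breaker is likely active,
# assuming it trips at 10 consecutive failures as described in this guide.
circuit_breaker_state() {
  local name="$1" ns="$2" count
  count=$(kubectl -n "$ns" get asc "$name" \
    -o jsonpath='{.status.failedReconcileCount}')
  count=${count:-0}   # an unset status field is treated as zero failures
  if [ "$count" -ge 10 ]; then
    echo "active ($count failures)"
  else
    echo "inactive ($count failures)"
  fi
}
```

Call it as circuit_breaker_state <name> aerospike. The CircuitBreakerActive event remains the authoritative signal.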

Resetting the Circuit Breaker

The circuit breaker resets automatically after a successful reconciliation. To trigger recovery:

  1. Fix the root cause: check lastReconcileError and resolve the underlying issue.
  2. Re-apply the corrected spec: kubectl apply -f <fixed-cr.yaml>
  3. Verify the reset: look for a CircuitBreakerReset event.

kubectl get events --field-selector reason=CircuitBreakerReset -n aerospike

Manual reset via annotation:

If you need to force an immediate reconcile (e.g., after fixing an external dependency), annotate the CR:

kubectl -n aerospike annotate asc <name> acko.io/force-reconcile=true --overwrite

The operator will clear the annotation after performing the reconciliation. This works even when the circuit breaker is active.
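Because the operator clears the annotation once it has reconciled, the annotation's disappearance can serve as a completion signal. The following sketch sets the annotation and polls until it is gone. The function name force_reconcile, the 2-second poll interval, and the default of 30 attempts are all assumptions chosen for illustration.

```shell
# Hypothetical helper: set the force-reconcile annotation, then poll until
# the operator clears it or the attempt budget runs out.
force_reconcile() {
  local name="$1" ns="$2" tries="${3:-30}" value
  kubectl -n "$ns" annotate asc "$name" acko.io/force-reconcile=true \
    --overwrite >/dev/null || return 1
  while [ "$tries" -gt 0 ]; do
    # Dots inside the annotation key are escaped for kubectl's jsonpath
    value=$(kubectl -n "$ns" get asc "$name" \
      -o jsonpath='{.metadata.annotations.acko\.io/force-reconcile}')
    if [ -z "$value" ]; then
      echo "annotation cleared"
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  echo "annotation still present"
  return 1
}
```

A cleared annotation only shows that a reconcile ran, not that it succeeded, so follow up by checking failedReconcileCount and lastReconcileError.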

Reset via Aerospike Cluster Manager UI:

When the integrated UI is enabled (default; requires ui.web.enabled=true), the Reconciliation Health dashboard shows a Reset Circuit Breaker button that triggers the annotation-based reset with a single click.

Debugging Commands

Cluster Status

# List all clusters with phase
kubectl get asc -n <ns>

# Check specific cluster phase and reason
kubectl get asc <name> -o jsonpath='{.status.phase}'
kubectl get asc <name> -o jsonpath='{.status.phaseReason}'

# Check conditions
kubectl get asc <name> -o jsonpath='{.status.conditions}' | jq .

# Check circuit breaker state
kubectl get asc <name> -o jsonpath='{.status.failedReconcileCount}'
kubectl get asc <name> -o jsonpath='{.status.lastReconcileError}'

Pod Status

# All pod statuses for a cluster
kubectl get asc <name> -o jsonpath='{.status.pods}' | jq .

# Ready pod count
kubectl get asc <name> -o jsonpath='{.status.size}'

# Pods pending restart
kubectl get asc <name> -o jsonpath='{.status.pendingRestartPods}'

# Template sync status
kubectl get asc <name> -o jsonpath='{.status.templateSnapshot.synced}'

Events

# Events for a specific cluster (sorted by time)
kubectl get events -n <ns> --field-selector involvedObject.name=<name> --sort-by='.lastTimestamp'

# Watch events in real time
kubectl get events -n <ns> -w

# Filter by specific event reason
kubectl get events --field-selector reason=CircuitBreakerActive -n <ns>
kubectl get events --field-selector reason=ACLSyncError -n <ns>
kubectl get events --field-selector reason=RestartFailed -n <ns>
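When you do not yet know which reason to filter on, a tally of Warning events by reason can point to the dominant failure. This is a sketch: warning_reasons is a hypothetical name, and it relies on the standard type field selector for Events.

```shell
# Hypothetical helper: count Warning events per reason in a namespace,
# most frequent first.
warning_reasons() {
  local ns="$1"
  kubectl get events -n "$ns" --field-selector type=Warning \
    -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' |
    sort | uniq -c | sort -rn
}
```

Run warning_reasons <ns>, then look up the top reasons in the Event Reference tables.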

Logs

# Operator logs
kubectl -n aerospike-operator logs -l control-plane=controller-manager -f

# Aerospike server logs (current)
kubectl -n <ns> logs <pod> -c aerospike-server -f

# Aerospike server logs (previous crash)
kubectl -n <ns> logs <pod> -c aerospike-server --previous

Validation Error Patterns

The webhook validates CE constraints when creating or updating an AerospikeCluster. Below are common validation errors and how to fix them.

Size and Image Errors

| Error Message | Cause | Fix |
| --- | --- | --- |
| spec.size N exceeds CE maximum of 8 | Cluster size exceeds CE limit | Set spec.size to 8 or fewer |
| spec.image must not be empty | No image specified and no templateRef | Set spec.image to a valid CE image |
| spec.image "..." is an Enterprise Edition image | Using an EE image tag | Use a Community Edition image (e.g., aerospike:ce-8.1.1.1) |

Aerospike Config Errors

| Error Message | Cause | Fix |
| --- | --- | --- |
| must not contain 'xdr' section | XDR is Enterprise-only | Remove the xdr section from aerospikeConfig |
| must not contain 'tls' section | TLS is Enterprise-only | Remove the tls section from aerospikeConfig |
| namespaces count N exceeds CE maximum of 2 | More than 2 namespaces | Reduce to 2 or fewer namespaces |
| heartbeat.mode must be 'mesh' | Non-mesh heartbeat mode | Set network.heartbeat.mode to mesh |

Enterprise-Only Namespace Keys

The following keys are not allowed in CE namespace configuration:

compression, compression-level, durable-delete, fast-restart, index-type, sindex-type, rack-id, strong-consistency, tomb-raider-eligible-age, tomb-raider-period

Error format: namespace[N] "name": 'key' is not allowed (reason)

ACL Validation Errors

| Error Message | Cause | Fix |
| --- | --- | --- |
| must have at least one user with both 'sys-admin' and 'user-admin' roles | No admin user defined | Assign both roles to at least one user |
| user "name" must have a secretName for password | Missing password Secret reference | Add secretName to the user spec |
| duplicate user name "name" | Duplicate user names | Use unique names for each user |
| user "name" references undefined role "role" | Custom role not declared | Add the role to aerospikeAccessControl.roles or use a built-in role |

Valid privilege codes: read, write, read-write, read-write-udf, sys-admin, user-admin, data-admin, truncate

Privilege format: "<code>" / "<code>.<namespace>" / "<code>.<namespace>.<set>"
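The privilege rules above can be checked before submitting a CR. The sketch below accepts a string only if its code is in the valid list and it has at most three non-empty dot-separated parts. The name is_valid_privilege is hypothetical, and the check does not verify whether a given code actually supports namespace or set scoping.

```shell
# Hypothetical pre-check for a privilege string of the form
# <code>, <code>.<namespace>, or <code>.<namespace>.<set>.
is_valid_privilege() {
  local priv="$1" code
  code=${priv%%.*}   # text before the first dot
  case "$code" in
    read|write|read-write|read-write-udf|sys-admin|user-admin|data-admin|truncate) ;;
    *) return 1 ;;
  esac
  case "$priv" in
    *..*|*.)  return 1 ;;   # empty namespace or set segment
    *.*.*.*)  return 1 ;;   # more than three segments
  esac
  return 0
}
```

For example, is_valid_privilege "read-write.test.demo" succeeds, while is_valid_privilege "admin" fails.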

Rack Config Validation Errors

| Error Message | Cause | Fix |
| --- | --- | --- |
| rack ID must be > 0 | Rack ID is 0 or negative | Use rack IDs starting from 1 |
| duplicate rack ID N | Same ID used in multiple racks | Use unique rack IDs |
| duplicate rackLabel "label" | Same label on multiple racks | Use unique rack labels |
| rackConfig rack IDs cannot be changed | Attempting to change rack IDs on update | Rack IDs are immutable after creation |

Storage Validation Errors

| Error Message | Cause | Fix |
| --- | --- | --- |
| duplicate volume name "name" | Same volume name used twice | Use unique volume names |
| exactly one volume source must be specified | Zero or multiple sources for a volume | Specify exactly one source (persistentVolume, emptyDir, etc.) |
| persistentVolume.size must not be empty | Missing PV size | Set a valid size (e.g., 10Gi) |
| aerospike.path must be an absolute path | Relative path in volume mount | Use an absolute path (e.g., /opt/aerospike/data) |
| subPath and subPathExpr are mutually exclusive | Both set on the same mount | Use only one of subPath or subPathExpr |

Namespace Validation Errors

| Error Message | Cause | Fix |
| --- | --- | --- |
| replication-factor must be between 1 and 4 | RF out of range | Set to a value between 1 and 4 |
| replication-factor N exceeds cluster size M | RF larger than node count | Lower RF or increase spec.size |
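The size and replication-factor rules can be combined into a quick local check before scaling. This sketch mirrors the error messages listed in this guide. The function name check_sizing is hypothetical, and the order in which the webhook evaluates these rules is an assumption.

```shell
# Hypothetical pre-check: validate spec.size and replication-factor against
# the CE limits described in this guide. Prints "ok" or the expected error.
check_sizing() {
  local size="$1" rf="$2"
  if [ "$size" -gt 8 ]; then
    echo "spec.size $size exceeds CE maximum of 8"; return 1
  fi
  if [ "$rf" -lt 1 ] || [ "$rf" -gt 4 ]; then
    echo "replication-factor must be between 1 and 4"; return 1
  fi
  if [ "$rf" -gt "$size" ]; then
    echo "replication-factor $rf exceeds cluster size $size"; return 1
  fi
  echo "ok"
}
```

For example, check_sizing 2 3 reports that the replication factor exceeds the cluster size, which is the situation to avoid when scaling down.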

PVC Not Binding

PersistentVolumeClaims remain in Pending state.

# Check PVC status
kubectl -n aerospike get pvc -l aerospike.io/cr-name=<name>

# Check PVC events for details
kubectl -n aerospike describe pvc <pvc-name>

# Verify StorageClass exists
kubectl get sc

Common causes:

  • StorageClass does not exist or is misconfigured.
  • No PersistentVolumes available (for static provisioning).
  • Insufficient storage capacity in the provisioner.
  • Volume topology constraints prevent binding on the scheduled node.

Cascade Delete Behavior

When cascadeDelete: true is set on a volume (or via global volume policy), PVCs are automatically deleted when:

  • The AerospikeCluster CR is deleted.
  • Pods are scaled down (after the pods have fully terminated).

PVC cleanup during scale-down:

  • The operator waits for all scaled-down pods to terminate before deleting PVCs.
  • If pods are stuck in Terminating, PVC cleanup is deferred to the next reconciliation.
  • Check PVCCleanedUp and PVCCleanupFailed events for status.

# Check for stuck terminating pods
kubectl -n aerospike get pods | grep Terminating

# Check PVC cleanup events
kubectl get events --field-selector reason=PVCCleanedUp -n aerospike
kubectl get events --field-selector reason=PVCCleanupFailed -n aerospike

warning

PVCs without cascadeDelete: true are always preserved, even after CR deletion. You must delete them manually if no longer needed.
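Manual cleanup can be previewed before anything is removed. The sketch below selects PVCs by the aerospike.io/cr-name label used elsewhere in this guide and defaults to a client-side dry run. The function name delete_cluster_pvcs and the "confirm" argument are hypothetical, and the sketch assumes every PVC of the cluster carries that label.

```shell
# Hypothetical helper: preview (default) or perform deletion of all PVCs
# labelled for a given cluster. Pass "confirm" as the third argument to delete.
delete_cluster_pvcs() {
  local name="$1" ns="$2" mode="${3:-preview}"
  if [ "$mode" = "confirm" ]; then
    kubectl -n "$ns" delete pvc -l "aerospike.io/cr-name=$name"
  else
    kubectl -n "$ns" delete pvc -l "aerospike.io/cr-name=$name" --dry-run=client
  fi
}
```

Deleting a PVC may destroy the underlying data depending on the StorageClass reclaim policy, so review the preview output first.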

Local Storage Issues

When using local storage classes with deleteLocalStorageOnRestart: true:

  • PVCs backed by local storage are deleted before pod deletion during cold restarts.
  • This forces re-provisioning on the new node.
  • If deleteLocalStorageOnRestart is not set, local PVCs persist and may block scheduling if the pod moves to a different node.

# Check for local PVC delete failures
kubectl get events --field-selector reason=LocalPVCDeleteFailed -n aerospike

Pod Connectivity

If pods cannot connect to each other or clients cannot reach the cluster:

# Check pod IPs and readiness
kubectl -n aerospike get pods -o wide

# Verify the headless service
kubectl -n aerospike get svc

# Check Aerospike cluster mesh status
kubectl -n aerospike exec <pod-name> -c aerospike-server -- asinfo -v 'statistics' | tr ';' '\n' | grep cluster_size

# Check network endpoints in cluster status
kubectl -n aerospike get asc <name> -o jsonpath='{.status.pods}' | \
jq '.[] | {pod: .podName, ip: .podIP, endpoints: .accessEndpoints}'

Mesh Heartbeat Issues

The CE operator requires heartbeat.mode to be mesh. If nodes cannot form a cluster:

  1. Verify mesh mode — Ensure aerospikeConfig.network.heartbeat.mode is set to mesh.
  2. Check mesh addresses — The operator auto-configures mesh seed addresses via the headless service.
  3. DNS resolution: verify that pods can resolve the headless service DNS name.

# Check DNS resolution from within a pod
kubectl -n aerospike exec <pod-name> -c aerospike-server -- nslookup <headless-svc-name>

# Check Aerospike network info
kubectl -n aerospike exec <pod-name> -c aerospike-server -- asinfo -v 'mesh'

Host Network Issues

When using hostNetwork: true:

  • multiPodPerHost should be false to avoid port conflicts.
  • dnsPolicy should be ClusterFirstWithHostNet for proper DNS resolution.
  • The operator sets these defaults automatically, but mismatches trigger validation warnings.

Event Reference

The operator emits Kubernetes Events for significant lifecycle transitions. Use these events to monitor cluster activity.

Rolling Restart Events

| Reason | Type | Description |
| --- | --- | --- |
| RollingRestartStarted | Normal | Rolling restart loop began |
| RollingRestartCompleted | Normal | All targeted pods restarted |
| PodWarmRestarted | Normal | SIGUSR1 sent for config reload |
| PodColdRestarted | Normal | Pod deleted and recreated |
| RestartFailed | Warning | Pod restart failed |

Configuration Events

| Reason | Type | Description |
| --- | --- | --- |
| ConfigMapCreated | Normal | Rack ConfigMap created |
| ConfigMapUpdated | Normal | ConfigMap content updated |
| DynamicConfigApplied | Normal | Runtime config change applied |
| DynamicConfigStatusFailed | Warning | Dynamic config change failed |
| DynamicConfigDegraded | Warning | Cluster entered ConfigDegraded phase |
| DynamicConfigRollback | Normal/Warning | Rollback result after failed batch apply |

StatefulSet and Rack Events

| Reason | Type | Description |
| --- | --- | --- |
| StatefulSetCreated | Normal | Rack StatefulSet created |
| StatefulSetUpdated | Normal | StatefulSet spec updated |
| RackScaled | Normal | Rack pod count changed |
| ScaleDownDeferred | Warning | Scale-down blocked by data migration |

ACL Events

| Reason | Type | Description |
| --- | --- | --- |
| ACLSyncStarted | Normal | ACL synchronization began |
| ACLSyncCompleted | Normal | ACL sync completed successfully |
| ACLSyncError | Warning | ACL sync encountered an error |

Storage Events

| Reason | Type | Description |
| --- | --- | --- |
| PVCCleanedUp | Normal | Orphaned PVCs deleted after scale-down |
| PVCCleanupFailed | Warning | Failed to delete orphaned PVCs |
| LocalPVCDeleteFailed | Warning | Local PVC deletion failed before cold restart |

Template Events

| Reason | Type | Description |
| --- | --- | --- |
| TemplateApplied | Normal | ClusterTemplate spec applied |
| TemplateDrifted | Warning | Cluster spec drifted from template |
| TemplateResolutionError | Warning | Failed to resolve a ClusterTemplate |

Infrastructure Events

| Reason | Type | Description |
| --- | --- | --- |
| PDBCreated | Normal | PodDisruptionBudget created |
| PDBUpdated | Normal | PodDisruptionBudget updated |
| ServiceCreated | Normal | Headless service created |
| ServiceUpdated | Normal | Headless service updated |

Lifecycle Events

| Reason | Type | Description |
| --- | --- | --- |
| ClusterDeletionStarted | Normal | Cluster teardown began |
| FinalizerRemoved | Normal | Finalizer removed, object will be deleted |
| ReadinessGateSatisfied | Normal | Pod readiness gate satisfied |
| ReadinessGateBlocking | Warning | Rolling restart blocked by readiness gate |

Circuit Breaker Events

| Reason | Type | Description |
| --- | --- | --- |
| CircuitBreakerActive | Warning | Reconciliation backed off after consecutive failures |
| CircuitBreakerReset | Normal | Circuit breaker reset after success |
| PermanentError | Warning | Permanent validation error detected, automatic retries halted |

Other Events

| Reason | Type | Description |
| --- | --- | --- |
| ValidationWarning | Warning | Non-blocking validation warning |
| ReconcileError | Warning | Reconciliation encountered an error |
| Operation | Normal | On-demand operation processed |

Quiesce Events

| Reason | Type | Description |
| --- | --- | --- |
| NodeQuiesceStarted | Normal | Node quiesce started |
| NodeQuiesced | Normal | Node quiesce completed |
| NodeQuiesceFailed | Warning | Node quiesce failed |

Operator Resource Sizing

The operator itself is lightweight, but incorrect sizing can cause slow reconciliation or OOM kills.

| Clusters Managed | CPU Request | CPU Limit | Memory Request | Memory Limit |
| --- | --- | --- | --- | --- |
| 1-5 | 100m | 500m | 128Mi | 256Mi |
| 5-20 | 250m | 1000m | 256Mi | 512Mi |
| 20+ | 500m | 2000m | 512Mi | 1Gi |

Aerospike Pod Resource Guidelines

Memory limits must accommodate the sum of all namespace data-size values plus ~30% overhead for primary index, buffers, and internal structures:

Minimum Memory = Σ(namespace data-size) × 1.3

For example, a cluster with two 2 GB in-memory namespaces needs a memory limit of at least (2 GB + 2 GB) × 1.3 = 5.2 GB.
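The formula can be evaluated with integer shell arithmetic. The sketch below takes each namespace's data-size in MiB and returns the minimum limit in MiB, rounded up. The function name min_memory_mib and the choice of MiB as the unit are assumptions for illustration.

```shell
# Hypothetical helper: sum namespace data-size values (in MiB) and add the
# ~30% overhead from the formula above, rounding up to a whole MiB.
min_memory_mib() {
  local total=0 size
  for size in "$@"; do
    total=$((total + size))
  done
  echo $(( (total * 13 + 9) / 10 ))
}
```

For two 2048 MiB namespaces, min_memory_mib 2048 2048 prints 5325, which is about 5.2 GiB.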

tip

The Aerospike Cluster Manager UI automatically calculates and adjusts memory limits based on your namespace configuration during cluster creation.

Kubernetes Compatibility

| Operator Version | Kubernetes Versions | Controller-Runtime | CRD API Version |
| --- | --- | --- | --- |
| v0.1.x | v1.26 - v1.35 | v0.23.x | acko.io/v1alpha1 |

note

The operator uses standard Kubernetes APIs and should work with any CNCF-certified distribution (EKS, GKE, AKS, OpenShift, k3s, etc.).