Troubleshooting Guide¶
Comprehensive troubleshooting guide for common issues in the RCIIS DevOps platform.
General Troubleshooting Approach¶
Diagnostic Methodology¶
- Identify symptoms: Gather error messages and logs
- Isolate the problem: Narrow down the scope
- Check recent changes: Review recent deployments or configurations
- Verify dependencies: Ensure all required services are running
- Apply fixes: Implement solutions systematically
- Verify resolution: Confirm the issue is resolved
Essential Commands¶
# Check cluster status
kubectl get nodes
kubectl get pods --all-namespaces
kubectl top nodes
kubectl top pods --all-namespaces
# Check specific namespace
kubectl get all -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
# Check events
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
Application Issues¶
Pod Not Starting¶
Symptoms:
- Pod stuck in Pending, CrashLoopBackOff, or ImagePullBackOff state
- Application not responding to health checks
Diagnosis:
# Check pod status and events
kubectl describe pod <pod-name> -n <namespace>
# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
# Check resource availability
kubectl describe node <node-name>
kubectl top nodes
Common Causes and Solutions:
- Insufficient Resources (see the sketch after this list)
- Image Pull Issues
# Check image name and registry access
kubectl describe pod <pod-name> -n <namespace>
# Solution: Verify image exists and credentials are correct
kubectl create secret docker-registry harbor-registry \
  --docker-server=harbor.devops.africa \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>
- Configuration Issues
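For the insufficient-resources and configuration cases, a minimal sketch of the usual checks and fix; the request values and all names are placeholders to adapt to your workload:
# See which resources are exhausted on the candidate nodes
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
# Lower the pod's requests (or raise node capacity) so the scheduler can place it
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"requests":{"cpu":"250m","memory":"256Mi"}}}]}}}}'
# For configuration issues, confirm that every referenced ConfigMap and Secret exists
kubectl get configmap,secret -n <namespace>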
Service Connection Issues¶
Symptoms:
- Services unable to communicate
- DNS resolution failures
- Connection timeouts
Diagnosis:
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>
# Test DNS resolution
kubectl run debug --image=busybox -i --tty --rm -- /bin/sh
# Inside pod:
nslookup <service-name>.<namespace>.svc.cluster.local
wget -qO- http://<service-name>.<namespace>:8080/health
# Check network policies
kubectl get networkpolicy -n <namespace>
Common Solutions:
- Service Selector Mismatch (see the sketch after this list)
- Network Policy Blocking
- Port Configuration
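For a selector mismatch, a quick way to compare what the Service selects with the labels the pods actually carry; if the endpoints list is empty, nothing matches:
# Show the Service's selector
kubectl get service <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
# List pod labels and confirm at least one pod matches the selector
kubectl get pods -n <namespace> --show-labels
# An empty endpoints list confirms the mismatch
kubectl get endpoints <service-name> -n <namespace>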
Infrastructure Issues¶
ArgoCD Sync Failures¶
Symptoms:
- Applications stuck in OutOfSync state
- Sync operations failing
- Resource conflicts
Diagnosis:
# Check application status
argocd app list
argocd app get <app-name>
# Check sync history
argocd app history <app-name>
# Check resource differences
argocd app diff <app-name>
Common Solutions:
- Resource Conflicts (see the sketch after this list)
- RBAC Issues
- Repository Access
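A hedged sketch of common remediations; the flags are standard argocd CLI options, but the repository URL and credentials are placeholders for your own setup:
# Force a sync and replace resources that conflict with live state
argocd app sync <app-name> --force --replace
# Re-register the repository if ArgoCD can no longer reach it
argocd repo add https://<git-server>/<org>/<repo>.git --username <username> --password <token>
# Refresh the application so ArgoCD re-reads the target revision
argocd app get <app-name> --refresh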
Certificate Issues¶
Symptoms:
- SSL/TLS connection failures
- Certificate not found errors
- Expired certificate warnings
Diagnosis:
# Check certificate status
kubectl get certificate -A
kubectl describe certificate <cert-name> -n <namespace>
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Check certificate details
kubectl get secret <cert-secret> -o yaml -n <namespace>
Common Solutions:
- Certificate Not Issued (see the sketch after this list)
- ClusterIssuer Issues
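If a certificate never becomes Ready, a minimal check-and-retrigger sequence; deleting the TLS secret prompts cert-manager to reissue it, and the issuer name is a placeholder:
# Confirm the ClusterIssuer is ready
kubectl get clusterissuer
kubectl describe clusterissuer <issuer-name>
# Inspect pending ACME orders and challenges for failure reasons
kubectl get orders,challenges -A
# Delete the certificate's secret to force reissuance
kubectl delete secret <cert-secret> -n <namespace>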
Storage Issues¶
Symptoms:
- Pods stuck in Pending with volume mount errors
- Database connection failures
- File system errors
Diagnosis:
# Check persistent volumes and claims
kubectl get pv,pvc -A
# Check storage class
kubectl get storageclass
# Check volume mount issues
kubectl describe pod <pod-name> -n <namespace>
Common Solutions:
- Volume Not Available (see the sketch after this list)
- Permission Issues
# Fix volume permissions
kubectl exec <pod-name> -n <namespace> -- chown -R 1001:1001 /data
# Use init container for permission fix
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"initContainers":[{"name":"fix-permissions","image":"busybox","command":["chown","-R","1001:1001","/data"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}}}'
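For the volume-not-available case, a short sketch for confirming whether the claim is bound and whether the storage class can actually provision it; all names are placeholders:
# Check whether the claim is Bound or still Pending, and why
kubectl describe pvc <pvc-name> -n <namespace>
# Confirm the requested storage class exists and has a provisioner
kubectl get storageclass <storage-class> -o yaml
# Look for provisioning errors in recent events
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>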
Database Issues¶
SQL Server Connection Problems¶
Symptoms:
- Application unable to connect to database
- Login failures
- Connection timeout errors
Diagnosis:
# Check SQL Server pod status
kubectl get pods -l app=mssql -n database
# Check SQL Server logs
kubectl logs <mssql-pod> -n database
# Test connection
kubectl exec -it <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password>
Common Solutions:
- Connection String Issues (see the sketch after this list)
- Authentication Issues
# Reset SA password
kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <old-password> -Q "ALTER LOGIN sa WITH PASSWORD='<new-password>'"
# Update secret with new password
kubectl patch secret <db-secret> -p '{"data":{"password":"<base64-encoded-password>"}}' -n <namespace>
- Database Not Ready
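For connection-string and readiness problems, a minimal sketch; it assumes the connection string is stored in a Kubernetes secret under a key named connection-string (illustrative) and that SQL Server listens on the standard port 1433:
# Inspect the connection string the application actually receives
kubectl get secret <db-secret> -n <namespace> -o jsonpath='{.data.connection-string}' | base64 -d
# Confirm the databases are online before pointing the application at them
kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT name, state_desc FROM sys.databases"
# Verify the database service resolves and the port answers from an app pod (requires nc in the image)
kubectl exec <app-pod> -n <namespace> -- nc -zv <mssql-service>.database.svc.cluster.local 1433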
Message Queue Issues¶
Kafka Connection Problems¶
Symptoms:
- Producers unable to send messages
- Consumers not receiving messages
- Broker connection failures
Diagnosis:
# Check Kafka cluster status
kubectl get kafka kafka-cluster -n kafka
# Check Kafka pods
kubectl get pods -l strimzi.io/cluster=kafka-cluster -n kafka
# Check topic status
kubectl get kafkatopic -n kafka
# Test producer/consumer
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic
Common Solutions:
- Broker Not Ready (see the sketch after this list)
- Topic Issues
# List topics
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
# Create missing topic (the strimzi.io/cluster label is required for the topic operator to reconcile it)
kubectl apply -f - <<EOF
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: <topic-name>
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-cluster
spec:
  partitions: 3
  replicas: 2
EOF
- Consumer Group Issues
# Check consumer group status
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
# Reset consumer group offset
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group <group-id> --reset-offsets --to-earliest --topic <topic-name> --execute
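For the broker-not-ready case, a hedged sketch of the usual Strimzi checks; the operator deployment name and namespace below are common defaults and may differ in your install:
# Inspect the Kafka custom resource's status conditions
kubectl get kafka kafka-cluster -n kafka -o jsonpath='{.status.conditions}'
# Check the Strimzi cluster operator logs for reconciliation errors
kubectl logs deployment/strimzi-cluster-operator -n kafka --tail=100
# Describe a broker pod that is not Running
kubectl describe pod kafka-cluster-kafka-0 -n kafka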
Security Issues¶
SOPS Decryption Failures¶
Symptoms:
- Secrets not decrypted in pods
- KSOPS plugin errors
- Age key issues
Diagnosis:
# Test SOPS decryption manually
sops --decrypt apps/rciis/secrets/staging/nucleus/appsettings.yaml
# Check Age key
echo $SOPS_AGE_KEY_FILE
cat $SOPS_AGE_KEY_FILE
# Check KSOPS plugin
kustomize build --enable-alpha-plugins apps/rciis/nucleus/staging/
Common Solutions:
- Missing Age Key (see the sketch after this list)
- KSOPS Plugin Issues
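For a missing Age key, a minimal sketch of restoring the key both locally and for the ArgoCD repo-server; the secret name sops-age-key and the keys.txt path are illustrative and depend on how KSOPS was wired into your repo-server:
# Point SOPS at the Age key locally and confirm decryption works
export SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt
sops --decrypt apps/rciis/secrets/staging/nucleus/appsettings.yaml
# Recreate the key as a secret for the ArgoCD repo-server
kubectl create secret generic sops-age-key \
  --from-file=keys.txt=$SOPS_AGE_KEY_FILE \
  -n argocd
# Restart the repo-server so it picks up the mounted key
kubectl rollout restart deployment argocd-repo-server -n argocd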
RBAC Permission Issues¶
Symptoms:
- Access denied errors
- ServiceAccount permission failures
- Unauthorized API calls
Diagnosis:
# Check current permissions
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<serviceaccount>
# Check role bindings
kubectl get rolebinding,clusterrolebinding -A | grep <serviceaccount>
# Check role definitions
kubectl describe role <role-name> -n <namespace>
Common Solutions:
- Missing Permissions (see the sketch after this list)
- ClusterRole Issues
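For missing permissions, a minimal sketch that grants a ServiceAccount read access to pods in one namespace; the role name, resources, and verbs are illustrative and should mirror whatever the failing API call needs:
# Grant the ServiceAccount read access to pods (illustrative role)
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n <namespace>
kubectl create rolebinding pod-reader-binding --role=pod-reader \
  --serviceaccount=<namespace>:<serviceaccount> -n <namespace>
# Re-check the permission after the binding is in place
kubectl auth can-i list pods --as=system:serviceaccount:<namespace>:<serviceaccount> -n <namespace>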
Performance Issues¶
High Resource Usage¶
Symptoms:
- Pods being OOMKilled
- High CPU usage
- Slow response times
Diagnosis:
# Check resource usage
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
# Check resource limits
kubectl describe pod <pod-name> -n <namespace>
# Check node resources
kubectl describe node <node-name>
Common Solutions:
- Memory Issues
# Increase memory limits
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
# Enable horizontal pod autoscaling
kubectl autoscale deployment <deployment> --cpu-percent=70 --min=2 --max=10 -n <namespace>
- CPU Issues (see the sketch after this list)
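For the CPU case, a sketch mirroring the memory fix above; the request and limit values are placeholders to tune against observed usage:
# Identify which containers are actually consuming the CPU
kubectl top pods -n <namespace> --containers --sort-by=cpu
# Raise CPU requests and limits on the affected container
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"requests":{"cpu":"500m"},"limits":{"cpu":"1"}}}]}}}}'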
Network Performance¶
Symptoms:
- Slow network communications
- High latency between services
- Packet loss
Diagnosis:
# Test network connectivity
kubectl exec <pod-a> -n <namespace> -- ping <pod-b-ip>
kubectl exec <pod-a> -n <namespace> -- iperf3 -c <service-name> -p 5201
# Check Cilium status
cilium status
cilium connectivity test
# Monitor network traffic
kubectl exec <pod-name> -n <namespace> -- tcpdump -i eth0
Solutions:
# Restart Cilium agents
kubectl delete pods -l k8s-app=cilium -n kube-system
# Check CNI configuration
kubectl describe node <node-name>
# Optimize network policies
kubectl get networkpolicy -A
Emergency Procedures¶
Complete Service Outage¶
- Immediate Response
- Rollback Procedures (see the sketch after this list)
- Communication
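For the rollback step, a hedged sketch using the rollback paths available in this stack; the history ID comes from argocd app history, and the plain kubectl path is a fallback if ArgoCD itself is unavailable:
# Roll the application back to a previously synced revision
argocd app history <app-name>
argocd app rollback <app-name> <history-id>
# Fallback: roll back the deployment directly
kubectl rollout undo deployment/<deployment> -n <namespace>
kubectl rollout status deployment/<deployment> -n <namespace>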
Data Recovery¶
- Database Recovery
# Stop application
kubectl scale deployment <app-deployment> --replicas=0 -n <namespace>
# Restore from backup
kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "RESTORE DATABASE [DB] FROM DISK = '/backup/latest.bak' WITH REPLACE"
# Verify restoration
kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT COUNT(*) FROM [Table]"
# Restart application
kubectl scale deployment <app-deployment> --replicas=2 -n <namespace>
Monitoring and Alerting¶
Setting Up Alerts¶
# Critical alert rules
groups:
  - name: critical.rules
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.pod }}"
Log Analysis¶
# Aggregate error logs
kubectl logs -l app=nucleus -n nucleus --tail=1000 | grep ERROR
# Export logs for analysis
kubectl logs <pod-name> -n <namespace> --since=1h > /tmp/pod-logs.txt
# Search for specific patterns
kubectl logs -l app=nucleus -n nucleus | grep -E "(Exception|Error|Failed)"
For specific component troubleshooting, refer to the individual service documentation.