Operations Troubleshooting¶
Operational troubleshooting procedures for the RCIIS DevOps platform, focusing on deployment, infrastructure, and service issues.
Overview¶
This guide provides systematic troubleshooting approaches for operational issues, including deployment failures, service disruptions, and infrastructure problems.
Troubleshooting Methodology¶
Standard Operating Procedure¶
- Assess Impact: Determine severity and affected systems
- Gather Information: Collect logs, metrics, and status information
- Identify Root Cause: Systematic elimination of potential causes
- Implement Fix: Apply corrective measures
- Verify Resolution: Confirm issue is resolved
- Document: Record findings and preventive measures
Escalation Process¶
- Level 1: Automated alerts and monitoring
- Level 2: On-call engineer response
- Level 3: Team lead and subject matter experts
- Level 4: Management and external vendors
Deployment Troubleshooting¶
ArgoCD Sync Failures¶
Symptoms:
- Applications stuck in "OutOfSync" state
- Sync operations failing or timing out
- Resource conflicts preventing deployment
Diagnosis:
# Check application status
argocd app list | grep -v Synced
argocd app get <app-name>
# View sync history
argocd app history <app-name>
# Check resource differences
argocd app diff <app-name>
# Check ArgoCD application controller logs (a StatefulSet in current ArgoCD releases)
kubectl logs -n argocd statefulset/argocd-application-controller --tail=100
Common Solutions:
- Resource Conflicts: fields owned by Git are being mutated by another controller or a manual change; re-sync with replace/prune semantics or add an ignoreDifferences entry to the Application (see the sketch below).
- Permission Issues: the Application's AppProject or the ArgoCD service account does not allow the destination namespace or resource kind.
- Repository Access: missing or expired repository credentials prevent ArgoCD from fetching manifests.
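Which case applies is usually visible in the conditions reported by argocd app get. A hedged sketch of commands that often help, using standard ArgoCD CLI flags (app, project, and repository names are placeholders):
# Resource conflicts: terminate a stuck operation, then re-sync with replace and prune
argocd app terminate-op <app-name>
argocd app sync <app-name> --force --prune
# Permission issues: confirm the AppProject allows the destination and resource kinds
argocd proj get <project-name>
# Repository access: verify the credentials for the manifest source
argocd repo list
argocd repo get <repo-url>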
Helm Deployment Issues¶
Chart Installation Failures:
# Check Helm release status
helm list -A --all
helm status <release-name> -n <namespace>
# View release history
helm history <release-name> -n <namespace>
# Check for conflicts (server-side validation without installing)
helm install <release-name> <chart> -n <namespace> --values values.yaml --dry-run
# Debug template rendering
helm template <release-name> <chart> --values values.yaml --debug
Resolution Steps:
# Rollback failed release
helm rollback <release-name> <revision> -n <namespace>
# Uninstall and reinstall
helm uninstall <release-name> -n <namespace>
helm install <release-name> <chart> -n <namespace> --values values.yaml
# Force upgrade
helm upgrade <release-name> <chart> -n <namespace> --values values.yaml --force
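If a release is stuck in a pending state (for example after an interrupted upgrade), the commands above may refuse to run. A hedged sketch, assuming Helm 3's secret-based release storage; release, namespace, and revision are placeholders:
# Find releases stuck in pending-install/pending-upgrade
helm list -A --pending
# Inspect the per-revision release secrets Helm 3 keeps
kubectl get secrets -n <namespace> -l owner=helm,name=<release-name>
# Remove only the stuck revision's secret, then retry the upgrade
kubectl delete secret sh.helm.release.v1.<release-name>.v<revision> -n <namespace>
helm upgrade <release-name> <chart> -n <namespace> --values values.yaml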
Kustomize Build Failures¶
SOPS/KSOPS Issues:
# Test Kustomize build
kustomize build --enable-alpha-plugins --enable-exec <path>
# Check KSOPS plugin
which ksops
ls -l $XDG_CONFIG_HOME/kustomize/plugin/viaduct.ai/v1/ksops/ksops
# Verify Age key
echo $SOPS_AGE_KEY_FILE
sops --decrypt <secret-file>
# Test decryption manually
export SOPS_AGE_KEY_FILE=~/.age/key.txt
sops --decrypt apps/rciis/secrets/staging/nucleus/appsettings.yaml
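If decryption fails, a common cause is that the Age key does not match the recipients the repository encrypts to. A hedged check, assuming the key lives at ~/.age/key.txt as above and .sops.yaml sits at the repository root:
# Print the public key (recipient) derived from the private key
age-keygen -y ~/.age/key.txt
# Compare against the recipients listed in .sops.yaml
grep -A 3 age .sops.yaml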
Service Troubleshooting¶
Pod Startup Issues¶
CrashLoopBackOff:
# Check pod status and restarts
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
# Check logs from previous container
kubectl logs <pod-name> -n <namespace> --previous
# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
# Check liveness/readiness probes
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Liveness
Resolution Strategies:
# Increase resource limits
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"1Gi","cpu":"500m"}}}]}}}}'
# Adjust probe timings
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","livenessProbe":{"initialDelaySeconds":60}}]}}}}'
# Debug with different image
kubectl set image deployment/<deployment> <container>=busybox -n <namespace>
kubectl exec -it deployment/<deployment> -n <namespace> -- /bin/sh
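Overwriting the deployment image with busybox changes the rollout and the default shell may exit immediately; on clusters with ephemeral containers enabled, kubectl debug is less disruptive. A hedged sketch with placeholder names:
# Attach a throwaway debug container to the failing pod without restarting it
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container> -- /bin/sh
# Or copy the pod with a different command for offline inspection
kubectl debug <pod-name> -n <namespace> --copy-to=<pod-name>-debug --container=<container> -- sleep 3600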
Database Connection Issues¶
SQL Server Connectivity:
# Check SQL Server pod status
kubectl get pods -l app=mssql -n database
# Test connectivity from application pod (nc is more commonly available in images than telnet)
kubectl exec deployment/nucleus -n nucleus -- nc -zv mssql-service 1433
# Check connection string
kubectl get secret nucleus-database -n nucleus -o jsonpath='{.data.connection-string}' | base64 -d
# Test database query
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT 1"
Common Database Issues:
# Database not ready
kubectl logs mssql-0 -n database | grep "SQL Server is now ready"
# Connection pool exhaustion
kubectl exec deployment/nucleus -n nucleus -- netstat -an | grep 1433 | wc -l
# Lock issues
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT * FROM sys.dm_tran_locks"
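When locks are suspected, listing which sessions are blocked and by whom is usually more actionable than the raw lock table; a hedged query against standard SQL Server DMVs:
# Show blocked sessions and their blockers
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT session_id, blocking_session_id, wait_type, wait_time FROM sys.dm_exec_requests WHERE blocking_session_id <> 0"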
Message Queue Issues¶
Kafka Connectivity Problems:
# Check Kafka cluster status
kubectl get kafka kafka-cluster -n kafka
# Check broker pods
kubectl get pods -l strimzi.io/cluster=kafka-cluster -n kafka
# Test producer
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic
# Check consumer lag
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
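To confirm messages actually flow end to end, a test consumer can be run against the same topic; a hedged sketch using the console tools shipped in the broker image:
# Consume a few messages from the test topic
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning --max-messages 5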
Kafka Troubleshooting:
# Check topic status
kubectl get kafkatopic -n kafka
kubectl describe kafkatopic <topic-name> -n kafka
# Check user permissions
kubectl get kafkauser -n kafka
kubectl describe kafkauser <user-name> -n kafka
# View broker logs
kubectl logs kafka-cluster-kafka-0 -n kafka | tail -100
Network Troubleshooting¶
Ingress and Load Balancer Issues¶
Service Unavailable (503) Errors:
# Check ingress controller status
kubectl get pods -n ingress-nginx
kubectl logs deployment/ingress-nginx-controller -n ingress-nginx --tail=100
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>
# Check backend pod health
kubectl get pods -l app=<app-label> -n <namespace>
kubectl exec <pod-name> -n <namespace> -- curl localhost:8080/health
DNS Resolution Issues:
# Test DNS from pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs deployment/coredns -n kube-system
# Test external DNS
kubectl exec -it <pod-name> -n <namespace> -- nslookup google.com
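If lookups fail only inside pods, the pod's resolver configuration and the cluster DNS service address are worth comparing; a minimal sketch:
# Check the nameserver and search domains injected into the pod
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/resolv.conf
# Confirm it matches the cluster DNS service ClusterIP
kubectl get svc kube-dns -n kube-system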
Network Policy Debugging¶
Connection Blocked by Policy:
# Check network policies
kubectl get networkpolicy -A
kubectl describe networkpolicy <policy-name> -n <namespace>
# Test connectivity
kubectl exec <source-pod> -n <source-namespace> -- nc -zv <target-service> <port>
# Monitor Cilium policy drops (if using Cilium; the agent DaemonSet is typically "cilium" in kube-system)
kubectl -n kube-system exec ds/cilium -- cilium monitor --type policy-verdict
# Temporarily disable network policies (last resort; the allow-all policy sketch below is easier to revert)
kubectl delete networkpolicy --all -n <namespace>
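Deleting every policy loses the original definitions; a temporary allow-all policy in the affected namespace is easier to roll back. A hedged sketch (the policy name is a placeholder; remove it once the blocking rule is identified):
# Apply a permissive policy that matches all pods in the namespace
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: temp-allow-all
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
  ingress:
  - {}
  egress:
  - {}
EOF
# Remove it when finished
kubectl delete networkpolicy temp-allow-all -n <namespace>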
Storage Troubleshooting¶
Persistent Volume Issues¶
PVC Pending State:
# Check PVC status
kubectl get pvc -A
kubectl describe pvc <pvc-name> -n <namespace>
# Check storage class
kubectl get storageclass
kubectl describe storageclass <storage-class>
# Check available PVs
kubectl get pv
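A Pending PVC almost always carries an explanatory event; checking events and the provisioner behind the storage class narrows the cause quickly. A minimal sketch with placeholder names:
# Events usually state why binding or provisioning failed
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>
# Identify which provisioner the storage class delegates to
kubectl get storageclass <storage-class> -o jsonpath='{.provisioner}'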
Volume Mount Failures:
# Check pod events
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
# Check volume permissions
kubectl exec <pod-name> -n <namespace> -- ls -la /mount/path
# Check node disk space
kubectl describe node <node-name> | grep -A 5 Capacity
Security Troubleshooting¶
Certificate Issues¶
TLS Certificate Problems:
# Check certificate status
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>
# Check cert-manager logs
kubectl logs deployment/cert-manager -n cert-manager --tail=100
# Test certificate
openssl s_client -connect <domain>:443 -servername <domain> < /dev/null
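When the external endpoint is unreachable, the same check can be run against the certificate stored in the Secret that cert-manager issues; a hedged sketch (the secret name is a placeholder, typically the Certificate's spec.secretName):
# Decode the issued certificate and show its subject and validity window
kubectl get secret <tls-secret-name> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -dates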
RBAC Permission Denied:
# Check current permissions
kubectl auth can-i <verb> <resource> --as=<user> -n <namespace>
# Check role bindings
kubectl get rolebinding,clusterrolebinding -A | grep <user-or-group>
# Describe role
kubectl describe role <role-name> -n <namespace>
kubectl describe clusterrole <clusterrole-name>
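When a workload rather than a user is denied, listing everything its service account may do is often faster than checking bindings one by one; a minimal sketch with placeholder names:
# Enumerate the effective permissions of a service account in a namespace
kubectl auth can-i --list --as=system:serviceaccount:<namespace>:<serviceaccount> -n <namespace>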
Performance Troubleshooting¶
High Resource Usage¶
Memory Issues:
# Check memory usage
kubectl top pods -A --sort-by=memory
kubectl top nodes
# Check memory limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
# Check for memory leaks
kubectl exec <pod-name> -n <namespace> -- ps aux --sort=-%mem
CPU Issues:
# Check CPU usage
kubectl top pods -A --sort-by=cpu
# Check for CPU throttling (cgroup v1 path; on cgroup v2 nodes use /sys/fs/cgroup/cpu.stat)
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu/cpu.stat
# Monitor process CPU usage (batch mode works without a TTY)
kubectl exec <pod-name> -n <namespace> -- top -b -n 1 -p <pid>
Slow Response Times¶
Application Performance:
# Check application metrics
curl http://<pod-ip>:8080/metrics | grep http_request_duration
# Monitor database queries
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT * FROM sys.dm_exec_query_stats"
# Check network latency
kubectl exec <pod-name> -n <namespace> -- ping <target-service>
Emergency Procedures¶
Service Outage Response¶
Immediate Actions:
# Check overall cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
# Scale up critical services
kubectl scale deployment <critical-service> --replicas=5 -n <namespace>
# Check ingress controller
kubectl get pods -n ingress-nginx
kubectl logs deployment/ingress-nginx-controller -n ingress-nginx --tail=50
Communication:
# Post status update
curl -X POST -H 'Content-type: application/json' -d '{"text":"🚨 SERVICE OUTAGE: Investigating connectivity issues"}' $SLACK_WEBHOOK
# Update status page
# (Update external status page if available)
# Notify stakeholders
# Send email/SMS to key stakeholders
Data Recovery¶
Database Recovery:
# Stop application to prevent data corruption
kubectl scale deployment nucleus --replicas=0 -n nucleus
# Check backup status
kubectl get cronjob -n database
kubectl get job -l app=database-backup -n database
# Restore from latest backup
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "RESTORE DATABASE [NucleusDB] FROM DISK = '/backup/latest.bak' WITH REPLACE"
# Restart application
kubectl scale deployment nucleus --replicas=2 -n nucleus
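After the restore, and ideally before scaling the application back up, confirming the database is ONLINE avoids restarting into a failing dependency; a hedged check against the standard catalog view:
# Verify the database state after the restore
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT name, state_desc FROM sys.databases"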
Monitoring and Alerting¶
Alert Triage¶
Critical Alert Response:
# Check alert details
kubectl get prometheusrule -A
kubectl describe prometheusrule <rule-name> -n <namespace>
# Check AlertManager
kubectl get pods -n monitoring | grep alertmanager
kubectl logs alertmanager-0 -n monitoring
# Silence alerts temporarily
amtool silence add alertname="<alert-name>" --duration=1h --comment="Investigating issue"
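amtool needs to reach the Alertmanager API; when it is not exposed externally, a port-forward works. A hedged sketch, assuming a prometheus-operator install that creates the alertmanager-operated service:
# Forward the Alertmanager API locally
kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring &
# List firing alerts and existing silences
amtool --alertmanager.url=http://localhost:9093 alert query
amtool --alertmanager.url=http://localhost:9093 silence query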
Log Analysis¶
Centralized Logging:
# Search application logs
kubectl logs -l app=nucleus -n nucleus --tail=1000 | grep ERROR
# Export logs for analysis
kubectl logs deployment/nucleus -n nucleus --since=1h > /tmp/nucleus-logs.txt
# Fetch logs from all matching containers (kubectl logs does not support --all-namespaces)
kubectl logs --all-containers=true --selector app=nucleus -n nucleus
Preventive Measures¶
Health Monitoring¶
Proactive Monitoring:
# Regular health checks
curl -f https://nucleus-staging.devops.africa/health
kubectl get componentstatuses
# Resource monitoring
kubectl top nodes
kubectl top pods -A
# Certificate expiry monitoring
kubectl get certificates -A -o custom-columns=NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter
Maintenance Windows¶
Planned Maintenance:
# Drain nodes for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Update system components
helm upgrade <release> <chart> --values values.yaml
# Uncordon nodes
kubectl uncordon <node-name>
# Verify cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
Documentation and Runbooks¶
Incident Documentation¶
Post-Incident Review:
1. Timeline: Detailed timeline of events
2. Root Cause: Root cause analysis and findings
3. Impact: Assessment of impact and affected services
4. Resolution: Steps taken to resolve the issue
5. Prevention: Measures to prevent recurrence
Runbook Maintenance¶
Regular Updates:
- Update procedures based on lessons learned
- Test runbooks during maintenance windows
- Keep contact information current
- Review and update escalation procedures
For specific component troubleshooting, refer to the detailed troubleshooting guide and individual service documentation.