Troubleshooting Guide¶
Comprehensive troubleshooting guide for common issues in the RCIIS DevOps platform.
General Troubleshooting Approach¶
Diagnostic Methodology¶
- Identify symptoms: Gather error messages and logs
- Isolate the problem: Narrow down the scope
- Check recent changes: Review recent deployments or configurations
- Verify dependencies: Ensure all required services are running
- Apply fixes: Implement solutions systematically
- Verify resolution: Confirm the issue is resolved
Essential Commands¶
# Check cluster status
kubectl get nodes
kubectl get pods --all-namespaces
kubectl top nodes
kubectl top pods --all-namespaces
# Check specific namespace
kubectl get all -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
# Check events
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
Application Issues¶
Pod Not Starting¶
Symptoms:
- Pod stuck in Pending, CrashLoopBackOff, or ImagePullBackOff state
- Application not responding to health checks
Diagnosis:
# Check pod status and events
kubectl describe pod <pod-name> -n <namespace>
# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
# Check resource availability
kubectl describe node <node-name>
kubectl top nodes
Common Causes and Solutions:
- Insufficient Resources (see the sketch after this list)
- Image Pull Issues
# Check image name and registry access
kubectl describe pod <pod-name> -n <namespace>
# Solution: Verify image exists and credentials are correct
kubectl create secret docker-registry harbor-registry \
  --docker-server=harbor.devops.africa \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>
- Configuration Issues
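For the insufficient-resources and configuration cases, a minimal sketch of the usual checks and fix; the request values and all names are placeholders to adapt to your workload:
# See which resources are exhausted on the candidate nodes
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
# Lower the pod's requests (or raise node capacity) so the scheduler can place it
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"requests":{"cpu":"250m","memory":"256Mi"}}}]}}}}'
# For configuration issues, confirm that every referenced ConfigMap and Secret exists
kubectl get configmap,secret -n <namespace>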
Service Connection Issues¶
Symptoms:
- Services unable to communicate
- DNS resolution failures
- Connection timeouts
Diagnosis:
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>
# Test DNS resolution
kubectl run debug --image=busybox -i --tty --rm -- /bin/sh
# Inside pod:
nslookup <service-name>.<namespace>.svc.cluster.local
wget -qO- http://<service-name>.<namespace>:8080/health
# Check network policies
kubectl get networkpolicy -n <namespace>
Common Solutions:
- Service Selector Mismatch (see the sketch after this list)
- Network Policy Blocking
- Port Configuration
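For a selector mismatch, a quick way to compare what the Service selects with the labels the pods actually carry; if the endpoints list is empty, nothing matches:
# Show the Service's selector
kubectl get service <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
# List pod labels and confirm at least one pod matches the selector
kubectl get pods -n <namespace> --show-labels
# An empty endpoints list confirms the mismatch
kubectl get endpoints <service-name> -n <namespace>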
Infrastructure Issues¶
ArgoCD Sync Failures¶
Symptoms:
- Applications stuck in OutOfSync state
- Sync operations failing
- Resource conflicts
Diagnosis:
# Check application status
argocd app list
argocd app get <app-name>
# Check sync history
argocd app history <app-name>
# Check resource differences
argocd app diff <app-name>
Common Solutions:
- Resource Conflicts (see the sketch after this list)
- RBAC Issues
- Repository Access
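A hedged sketch of common remediations; the flags are standard argocd CLI options, but the repository URL and credentials are placeholders for your own setup:
# Force a sync and replace resources that conflict with live state
argocd app sync <app-name> --force --replace
# Re-register the repository if ArgoCD can no longer reach it
argocd repo add https://<git-server>/<org>/<repo>.git --username <username> --password <token>
# Refresh the application so ArgoCD re-reads the target revision
argocd app get <app-name> --refresh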
Certificate Issues¶
Symptoms:
- SSL/TLS connection failures
- Certificate not found errors
- Expired certificate warnings
Diagnosis:
# Check certificate status
kubectl get certificate -A
kubectl describe certificate <cert-name> -n <namespace>
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Check certificate details
kubectl get secret <cert-secret> -o yaml -n <namespace>
Common Solutions:
- Certificate Not Issued (see the sketch after this list)
- ClusterIssuer Issues
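If a certificate never becomes Ready, a minimal check-and-retrigger sequence; deleting the TLS secret prompts cert-manager to reissue it, and the issuer name is a placeholder:
# Confirm the ClusterIssuer is ready
kubectl get clusterissuer
kubectl describe clusterissuer <issuer-name>
# Inspect pending ACME orders and challenges for failure reasons
kubectl get orders,challenges -A
# Delete the certificate's secret to force reissuance
kubectl delete secret <cert-secret> -n <namespace>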
Storage Issues¶
Symptoms:
- Pods stuck in Pending with volume mount errors
- Database connection failures
- File system errors
Diagnosis:
# Check persistent volumes and claims
kubectl get pv,pvc -A
# Check storage class
kubectl get storageclass
# Check volume mount issues
kubectl describe pod <pod-name> -n <namespace>
Common Solutions:
- Volume Not Available (see the sketch after this list)
- Permission Issues
# Fix volume permissions
kubectl exec <pod-name> -n <namespace> -- chown -R 1001:1001 /data
# Use init container for permission fix
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"initContainers":[{"name":"fix-permissions","image":"busybox","command":["chown","-R","1001:1001","/data"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}}}'
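For the volume-not-available case, a short sketch for confirming whether the claim is bound and whether the storage class can actually provision it; all names are placeholders:
# Check whether the claim is Bound or still Pending, and why
kubectl describe pvc <pvc-name> -n <namespace>
# Confirm the requested storage class exists and has a provisioner
kubectl get storageclass <storage-class> -o yaml
# Look for provisioning errors in recent events
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>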
Database Issues¶
SQL Server Connection Problems¶
Symptoms:
- Application unable to connect to database
- Login failures
- Connection timeout errors
Diagnosis:
# Check SQL Server pod status
kubectl get pods -l app=mssql -n database
# Check SQL Server logs
kubectl logs <mssql-pod> -n database
# Test connection
kubectl exec -it <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password>
Common Solutions:
- Connection String Issues (see the sketch after this list)
- Authentication Issues
# Reset SA password
kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <old-password> -Q "ALTER LOGIN sa WITH PASSWORD='<new-password>'"
# Update secret with new password
kubectl patch secret <db-secret> -p '{"data":{"password":"<base64-encoded-password>"}}' -n <namespace>
- Database Not Ready
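For connection-string and readiness problems, a minimal sketch; it assumes the connection string is stored in a Kubernetes secret under a key named connection-string (illustrative) and that SQL Server listens on the standard port 1433:
# Inspect the connection string the application actually receives
kubectl get secret <db-secret> -n <namespace> -o jsonpath='{.data.connection-string}' | base64 -d
# Confirm the databases are online before pointing the application at them
kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT name, state_desc FROM sys.databases"
# Verify the database service resolves and the port answers from an app pod (requires nc in the image)
kubectl exec <app-pod> -n <namespace> -- nc -zv <mssql-service>.database.svc.cluster.local 1433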
Message Queue Issues¶
Kafka Connection Problems¶
Symptoms:
- Producers unable to send messages
- Consumers not receiving messages
- Broker connection failures
Diagnosis:
# Check Kafka cluster status
kubectl get kafka kafka-cluster -n kafka
# Check Kafka pods
kubectl get pods -l strimzi.io/cluster=kafka-cluster -n kafka
# Check topic status
kubectl get kafkatopic -n kafka
# Test producer/consumer
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic
Common Solutions:
- Broker Not Ready (see the sketch after this list)
- Topic Issues
# List topics
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
# Create missing topic (the strimzi.io/cluster label is required for the topic operator to reconcile it)
kubectl apply -f - <<EOF
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: <topic-name>
  namespace: kafka
  labels:
    strimzi.io/cluster: kafka-cluster
spec:
  partitions: 3
  replicas: 2
EOF
- Consumer Group Issues
# Check consumer group status
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
# Reset consumer group offset
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group <group-id> --reset-offsets --to-earliest --topic <topic-name> --execute
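For the broker-not-ready case, a hedged sketch of the usual Strimzi checks; the operator deployment name and namespace below are common defaults and may differ in your install:
# Inspect the Kafka custom resource's status conditions
kubectl get kafka kafka-cluster -n kafka -o jsonpath='{.status.conditions}'
# Check the Strimzi cluster operator logs for reconciliation errors
kubectl logs deployment/strimzi-cluster-operator -n kafka --tail=100
# Describe a broker pod that is not Running
kubectl describe pod kafka-cluster-kafka-0 -n kafka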
Security Issues¶
SOPS Decryption Failures¶
Symptoms:
- Secrets not decrypted in pods
- KSOPS plugin errors
- Age key issues
Diagnosis:
# Test SOPS decryption manually
sops --decrypt apps/rciis/secrets/staging/nucleus/appsettings.yaml
# Check Age key
echo $SOPS_AGE_KEY_FILE
cat $SOPS_AGE_KEY_FILE
# Check KSOPS plugin
kustomize build --enable-alpha-plugins apps/rciis/nucleus/staging/
Common Solutions:
- Missing Age Key (see the sketch after this list)
- KSOPS Plugin Issues
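For a missing Age key, a minimal sketch of restoring the key both locally and for the ArgoCD repo-server; the secret name sops-age-key and the keys.txt path are illustrative and depend on how KSOPS was wired into your repo-server:
# Point SOPS at the Age key locally and confirm decryption works
export SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt
sops --decrypt apps/rciis/secrets/staging/nucleus/appsettings.yaml
# Recreate the key as a secret for the ArgoCD repo-server
kubectl create secret generic sops-age-key \
  --from-file=keys.txt=$SOPS_AGE_KEY_FILE \
  -n argocd
# Restart the repo-server so it picks up the mounted key
kubectl rollout restart deployment argocd-repo-server -n argocd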
RBAC Permission Issues¶
Symptoms:
- Access denied errors
- ServiceAccount permission failures
- Unauthorized API calls
Diagnosis:
# Check current permissions
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<serviceaccount>
# Check role bindings
kubectl get rolebinding,clusterrolebinding -A | grep <serviceaccount>
# Check role definitions
kubectl describe role <role-name> -n <namespace>
Common Solutions:
- Missing Permissions (see the sketch after this list)
- ClusterRole Issues
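For missing permissions, a minimal sketch that grants a ServiceAccount read access to pods in one namespace; the role name, resources, and verbs are illustrative and should mirror whatever the failing API call needs:
# Grant the ServiceAccount read access to pods (illustrative role)
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n <namespace>
kubectl create rolebinding pod-reader-binding --role=pod-reader \
  --serviceaccount=<namespace>:<serviceaccount> -n <namespace>
# Re-check the permission after the binding is in place
kubectl auth can-i list pods --as=system:serviceaccount:<namespace>:<serviceaccount> -n <namespace>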
Performance Issues¶
High Resource Usage¶
Symptoms:
- Pods being OOMKilled
- High CPU usage
- Slow response times
Diagnosis:
# Check resource usage
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
# Check resource limits
kubectl describe pod <pod-name> -n <namespace>
# Check node resources
kubectl describe node <node-name>
Common Solutions:
- Memory Issues
# Increase memory limits
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
# Enable horizontal pod autoscaling
kubectl autoscale deployment <deployment> --cpu-percent=70 --min=2 --max=10 -n <namespace>
- CPU Issues (see the sketch after this list)
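For the CPU case, a sketch mirroring the memory fix above; the request and limit values are placeholders to tune against observed usage:
# Identify which containers are actually consuming the CPU
kubectl top pods -n <namespace> --containers --sort-by=cpu
# Raise CPU requests and limits on the affected container
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"requests":{"cpu":"500m"},"limits":{"cpu":"1"}}}]}}}}'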
Network Performance¶
Symptoms:
- Slow network communications
- High latency between services
- Packet loss
Diagnosis:
# Test network connectivity
kubectl exec <pod-a> -n <namespace> -- ping <pod-b-ip>
kubectl exec <pod-a> -n <namespace> -- iperf3 -c <service-name> -p 5201
# Check Cilium status
cilium status
cilium connectivity test
# Monitor network traffic
kubectl exec <pod-name> -n <namespace> -- tcpdump -i eth0
Solutions:
# Restart Cilium agents
kubectl delete pods -l k8s-app=cilium -n kube-system
# Check CNI configuration
kubectl describe node <node-name>
# Optimize network policies
kubectl get networkpolicy -A
Emergency Procedures¶
Complete Service Outage¶
- Immediate Response
- Rollback Procedures (see the sketch after this list)
- Communication
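For the rollback step, a hedged sketch using the rollback paths available in this stack; the history ID comes from argocd app history, and the plain kubectl path is a fallback if ArgoCD itself is unavailable:
# Roll the application back to a previously synced revision
argocd app history <app-name>
argocd app rollback <app-name> <history-id>
# Fallback: roll back the deployment directly
kubectl rollout undo deployment/<deployment> -n <namespace>
kubectl rollout status deployment/<deployment> -n <namespace>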
Data Recovery¶
- Database Recovery
# Stop application
kubectl scale deployment <app-deployment> --replicas=0 -n <namespace>
# Restore from backup
kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "RESTORE DATABASE [DB] FROM DISK = '/backup/latest.bak' WITH REPLACE"
# Verify restoration
kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT COUNT(*) FROM [Table]"
# Restart application
kubectl scale deployment <app-deployment> --replicas=2 -n <namespace>
Monitoring and Alerting¶
Setting Up Alerts¶
# Critical alert rules
groups:
  - name: critical.rules
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.pod }}"
Log Analysis¶
# Aggregate error logs
kubectl logs -l app=nucleus -n nucleus --tail=1000 | grep ERROR
# Export logs for analysis
kubectl logs <pod-name> -n <namespace> --since=1h > /tmp/pod-logs.txt
# Search for specific patterns
kubectl logs -l app=nucleus -n nucleus | grep -E "(Exception|Error|Failed)"
For specific component troubleshooting, refer to the individual service documentation.