Kubernetes Clusters

Kubernetes cluster configurations and management for the RCIIS DevOps platform.

Cluster Types

Local Development Clusters

  • Technology: Kind (Kubernetes in Docker)
  • Purpose: Developer workstations and local testing
  • Configuration: Single or multi-node clusters
  • Networking: Cilium or Calico CNI options

Testing Clusters (Proxmox/Talos)

  • Technology: Talos Linux on Proxmox VE
  • Purpose: Automated testing and CI/CD integration
  • Configuration: 3 control plane + 3 worker nodes
  • Networking: Cilium CNI with kube-proxy replacement
  • Provisioning: OpenTofu/Terraform IaC (see the sketch after this list)
  • Documentation: See Proxmox Talos Guide
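
The Proxmox VMs can be provisioned declaratively; a minimal OpenTofu sketch, assuming the community Telmate Proxmox provider (node, template, and resource names are illustrative; the authoritative code lives with the guide above):

terraform {
  required_providers {
    proxmox = {
      source  = "Telmate/proxmox"
      version = "~> 2.9"
    }
  }
}

# Three Talos control plane VMs cloned from a prepared template
resource "proxmox_vm_qemu" "talos_controlplane" {
  count       = 3
  name        = "talos-cp-${count.index}"
  target_node = "pve"              # hypothetical Proxmox node
  clone       = "talos-template"   # hypothetical Talos VM template
  cores       = 2
  memory      = 4096
}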

Staging Clusters

  • Purpose: Pre-production testing and validation
  • Configuration: Production-like setup and scale
  • Data: Sanitized production data copies
  • Access: Controlled access for testing teams (see the RBAC sketch below)
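
Controlled access is typically enforced with Kubernetes RBAC; a minimal sketch (namespace and group names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: testing-team-edit
  namespace: staging
subjects:
- kind: Group
  name: testing-team              # hypothetical IdP group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                      # built-in aggregate ClusterRole
  apiGroup: rbac.authorization.k8s.io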

Production Clusters (Planned)

  • Purpose: Live production workloads
  • Configuration: High availability and resilience
  • Security: Enhanced security controls and monitoring
  • Compliance: Regulatory and audit requirements

Cluster Configuration

Kind Cluster Setup

Basic Configuration:

# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: rciis-local
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
- role: worker
- role: worker

Multi-Node Setup:

# Create multi-node cluster
kind create cluster --config kind-config.yaml --name rciis-local

# Verify cluster
kubectl cluster-info --context kind-rciis-local
kubectl get nodes

CNI Configuration

Cilium Installation (option names below match older 1.x charts; verify them against the chart version you install):

# Install Cilium
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set nodeinit.enabled=true \
  --set kubeProxyReplacement=partial \
  --set hostServices.enabled=false \
  --set externalIPs.enabled=true \
  --set nodePort.enabled=true \
  --set hostPort.enabled=true \
  --set image.pullPolicy=IfNotPresent \
  --set ipam.mode=kubernetes
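
Verification (assumes the Cilium CLI is installed; otherwise check the agent pods directly):

# Wait until Cilium reports ready
cilium status --wait

# Or inspect the agent pods
kubectl get pods -n kube-system -l k8s-app=cilium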

Calico Alternative:

# Install Calico (legacy manifest URL; newer releases publish manifests
# in the projectcalico/calico GitHub repository)
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# Verify installation
kubectl get pods -n kube-system -l k8s-app=calico-node

Cluster Management

Node Management

Adding Nodes:

# Kind cannot add nodes to a running cluster; add node entries to
# kind-config.yaml and recreate the cluster (see Cluster Updates below)
kind get clusters
docker ps | grep rciis-local

# For cloud clusters, use the cluster autoscaler
kubectl apply -f cluster-autoscaler.yaml

Node Maintenance:

# Drain node for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Uncordon node after maintenance
kubectl uncordon <node-name>

# Remove node from cluster
kubectl delete node <node-name>

Resource Management

Resource Quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: development
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "10"
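
Apply and verify the quota (file name is illustrative):

kubectl apply -f compute-quota.yaml
kubectl describe resourcequota compute-quota -n development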

Limit Ranges:

apiVersion: v1
kind: LimitRange
metadata:
  name: pod-limit-range
  namespace: development
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    type: Container

Storage Configuration

Storage Classes

Local Development:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: rancher.io/local-path
parameters:
  nodePath: /opt/local-path-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

Production Storage:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain

Persistent Volumes

Static Provisioning:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: database-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual   # use a dedicated class name for statically provisioned volumes
  hostPath:
    path: /mnt/data/database
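
A claim binds to this volume when the class, access mode, and requested size match; a minimal sketch:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: manual
  resources:
    requests:
      storage: 20Gi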

Network Configuration

Service Mesh Preparation

Istio Integration (Future):

# Install Istio (istioctl installs the control plane directly)
istioctl install --set values.defaultRevision=default

# Enable sidecar injection
kubectl label namespace nucleus istio-injection=enabled

Linkerd Integration (Alternative):

# Install Linkerd
linkerd install | kubectl apply -f -

# Inject proxy
kubectl get deployment nucleus -o yaml | linkerd inject - | kubectl apply -f -
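
Linkerd ships its own validation command:

# Validate the control plane and data plane
linkerd check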

Network Policies

Default Deny:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: nucleus
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
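
With everything denied by default, required flows need explicit allow rules; for example, DNS egress to kube-system (selecting on the standard kubernetes.io/metadata.name namespace label):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: nucleus
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53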

Security Configuration

Pod Security Standards

Restricted Policy:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
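
Existing workloads can be audited against the policy before enforcement with a server-side dry run:

# Warn about pods that would violate the restricted profile
kubectl label --dry-run=server --overwrite ns production \
  pod-security.kubernetes.io/enforce=restricted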

Security Contexts

Container Security:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        runAsGroup: 1001
        fsGroup: 1001
      containers:
      - name: app
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL

Monitoring and Maintenance

Cluster Health Monitoring

Health Checks:

# Check cluster components (componentstatuses is deprecated but still informative)
kubectl get componentstatuses

# Check node health
kubectl describe nodes

# Check system pods
kubectl get pods -n kube-system

# Check cluster events
kubectl get events --sort-by=.metadata.creationTimestamp

Maintenance Procedures

Cluster Updates:

# Update Kind cluster
kind delete cluster --name rciis-local
kind create cluster --config kind-config.yaml --name rciis-local

# Update Kubernetes version (kubeadm clusters; run on a control plane node)
kubeadm upgrade plan
kubeadm upgrade apply v1.28.0

Backup Procedures:

# Export namespaced objects ('get all' omits ConfigMaps, Secrets, and CRDs)
kubectl get all --all-namespaces -o yaml > cluster-backup.yaml

# Backup etcd (self-managed/kubeadm clusters; managed services handle this)
kubectl exec -n kube-system etcd-master -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /tmp/etcd-backup.db
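
The snapshot can be sanity-checked in place (etcdctl snapshot status still works, though newer etcd versions prefer etcdutl):

# Verify the snapshot contents
kubectl exec -n kube-system etcd-master -- etcdctl \
  --write-out=table snapshot status /tmp/etcd-backup.db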

Troubleshooting

Common Issues

Node Not Ready:

# Check node status
kubectl describe node <node-name>

# Check kubelet logs (run on the affected node)
journalctl -u kubelet

# Restart kubelet (run on the affected node)
sudo systemctl restart kubelet

Pod Scheduling Issues:

# Check scheduler logs (the scheduler runs as a static pod on kubeadm clusters)
kubectl logs -n kube-system -l component=kube-scheduler

# Check resource availability
kubectl describe node <node-name>

# Check taints and tolerations
kubectl describe node <node-name> | grep Taints

Diagnostic Commands

# Cluster information
kubectl cluster-info
kubectl version

# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Network connectivity
kubectl exec -it <pod-name> -- ping <target-ip>
kubectl exec -it <pod-name> -- nslookup kubernetes.default
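
Minimal images often lack ping and nslookup; an ephemeral debug container is an alternative (nicolaka/netshoot is a community image with networking tools, shown here as an example):

# Attach an ephemeral debug container to a running pod
kubectl debug -it <pod-name> --image=nicolaka/netshoot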

Best Practices

Cluster Design

  1. Separation of Concerns: Separate clusters for different environments
  2. Resource Planning: Adequate capacity planning and monitoring
  3. Security First: Apply security controls from the start
  4. Backup Strategy: Regular backups and disaster recovery testing

Operational Excellence

  1. Monitoring: Comprehensive monitoring and alerting
  2. Automation: Automated deployment and scaling
  3. Documentation: Maintain current operational documentation
  4. Training: Regular team training and knowledge sharing

For specific cluster configurations, refer to the setup scripts in the /scripts directory.