fix(aula-08): prevent volume stalling with CSI tolerations and PDB

- Add hcloud-csi-values.yaml with tolerations for node failures
- Configure 2 replicas of the CSI controller for HA
- Create statefulset-pdb.yaml to protect StatefulSets during drain
- Document troubleshooting of stuck volumes in the README
ArgoCD Setup
2026-01-23 18:45:00 -03:00
parent 9f96e97205
commit 2480c82944
4 changed files with 74 additions and 3 deletions


@@ -202,7 +202,47 @@ aula-08/
├── install-nginx-ingress.sh      # Installs NGINX Ingress with LB
├── install-metrics-server.sh     # Installs Metrics Server (kubectl top, HPA)
├── nginx-ingress-values.yaml     # NGINX Ingress configuration
├── talos-patches/                # Talos configuration patches
│   ├── control-plane.yaml
│   └── worker.yaml
├── hcloud-csi-values.yaml        # CSI driver configuration
└── statefulset-pdb.yaml          # PDB to protect StatefulSets
```
## Troubleshooting: Volume Stuck
If a pod stays `Pending` while waiting for a volume:
### 1. Check the VolumeAttachment
```bash
kubectl get volumeattachments
kubectl describe volumeattachment <name>
```
### 2. If the source node no longer exists
```bash
# Delete the orphaned VolumeAttachment (safe because the node no longer exists)
kubectl delete volumeattachment <name>
```
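To find candidates for step 2 without comparing lists by hand, the check can be sketched as a small shell function. This is a minimal sketch, not part of the repository: the function name `list_orphaned_attachments` is hypothetical, and it assumes `kubectl` is already pointed at the affected cluster. It only prints names; deletion stays a manual decision.

```bash
#!/usr/bin/env bash
# Sketch: print VolumeAttachments whose node is no longer registered.
# list_orphaned_attachments is a hypothetical helper name.

list_orphaned_attachments() {
  local nodes
  # Names of all nodes known to the API server, one per line
  nodes="$(kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')" || return 1
  # Refuse to continue on an empty node list, otherwise every
  # attachment would be reported as orphaned
  [ -n "$nodes" ] || { echo "could not list nodes" >&2; return 1; }

  # Each input line: <attachment-name> <node-name>
  kubectl get volumeattachments \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' |
  while read -r name node; do
    [ -n "$name" ] || continue
    # Report the attachment if its node is not in the node list
    if ! printf '%s\n' "$nodes" | grep -qxF -- "$node"; then
      printf '%s\n' "$name"
    fi
  done
}
```

After sourcing the function, run `list_orphaned_attachments`, review each name with `kubectl describe volumeattachment <name>`, and only then delete it.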
### 3. If the node exists but the pod died
```bash
# Wait: Kubernetes releases the volume automatically
# Default timeout: 6 minutes
```
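Instead of polling by hand, the wait in step 3 can be delegated to `kubectl wait --for=delete`, which blocks until the object is gone or the timeout expires. The wrapper below is a sketch under assumptions: `wait_for_release` is a hypothetical name, and the 420-second default is an arbitrary value chosen to sit slightly above the 6-minute timeout mentioned above.

```bash
#!/usr/bin/env bash
# Sketch: block until a VolumeAttachment is deleted, or give up after a timeout.
# wait_for_release is a hypothetical helper; 420s is an assumed default.

wait_for_release() {
  local attachment="$1"
  local timeout="${2:-420}"
  # Returns non-zero if the object still exists when the timeout expires
  kubectl wait --for=delete "volumeattachment/${attachment}" \
    --timeout="${timeout}s"
}
```

For example, `wait_for_release <name>` uses the default timeout and `wait_for_release <name> 60` waits one minute. A non-zero exit code suggests the attachment needs the manual checks from the other steps.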
### 4. Check on the Hetzner side
```bash
hcloud volume list
# If a volume shows as attached to a server that no longer exists, open a support ticket
```
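The comparison in step 4 can also be scripted. The following is a rough sketch, not a tested tool: `list_dangling_volumes` is a hypothetical name, and it assumes that the `server` column of `hcloud volume list` prints the server name (or `-` when the volume is detached). If your CLI version prints something else there, adjust the columns accordingly.

```bash
#!/usr/bin/env bash
# Sketch: print volumes attached to a server that `hcloud server list` does not return.
# Assumption: the `server` column holds the server name, or "-" when detached.

list_dangling_volumes() {
  local servers
  # Names of all servers in the current hcloud context, one per line
  servers="$(hcloud server list -o noheader -o columns=name)" || return 1

  # Each input line: <volume-name> <server-name-or-dash>
  hcloud volume list -o noheader -o columns=name,server |
  while read -r volume server; do
    # Skip blank lines and detached volumes
    [ -n "$volume" ] || continue
    [ -n "$server" ] && [ "$server" != "-" ] || continue
    # Report the volume if its server is not in the server list
    if ! printf '%s\n' "$servers" | grep -qxF -- "$server"; then
      printf '%s %s\n' "$volume" "$server"
    fi
  done
}
```

Each output line has the form `<volume> <server>`. Any line printed is a candidate for the support ticket described above.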
### Block Storage Limitations
- Hetzner volumes are **RWO** (ReadWriteOnce): single-attach by design
- They can stay stuck for up to 6 minutes (the Kubernetes timeout)
- If a node dies abruptly, recovery may have to be manual (delete the VolumeAttachment)