aula-12: Victoria Metrics + Grafana via GitOps

Complete observability stack:
- Victoria Metrics (vmsingle, vmagent, vmalert)
- Grafana with built-in dashboards
- Custom alerts (PVC, pods, nodes, deployments)
- pvc-autoresizer for automatic volume expansion
- Documented PromQL queries

Installed via ArgoCD, following the GitOps pattern from aula-11.
ArgoCD Setup
2026-01-08 17:11:28 -03:00
parent e75b245c3b
commit 4b92838ac3
9 changed files with 1939 additions and 0 deletions

@@ -0,0 +1,323 @@
# Useful PromQL Queries
Ready-to-use queries for Grafana or for querying the Victoria Metrics API directly.
## How to use
### Via Grafana
1. Open Grafana → Explore
2. Select the "VictoriaMetrics" datasource
3. Paste the query into the editor
### Via API
```bash
# Forward the vmsingle service port to localhost
kubectl port-forward -n monitoring svc/vmsingle-vm-victoria-metrics-k8s-stack 8429:8429
# Run an instant query (in another terminal)
curl "http://localhost:8429/api/v1/query?query=up"
```
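Queries that contain braces, quotes, or spaces must be URL-encoded before they go into the query string. A minimal Python sketch, assuming the port-forward above is running and that the endpoint returns the standard Prometheus instant-query JSON shape. The helper names `build_query_url`, `parse_vector`, and `run_instant_query` are hypothetical, not part of any library:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8429"  # assumption: the port-forward shown above


def build_query_url(base_url, promql):
    """Return the instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload):
    """Turn a Prometheus-style instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def run_instant_query(promql, base_url=BASE_URL):
    """Execute the query over HTTP (needs the port-forward to be active)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_vector(json.load(resp))
```

With the port-forward active, `run_instant_query('up{job="kubelet"}')` would return one `(labels, value)` pair per matching series; the `job` value here is only an example.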
---
## Storage / PVC
### PVC usage (%)
```promql
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
```
### PVCs above 80% usage
```promql
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.8
```
### Available space per PVC (bytes)
```promql
kubelet_volume_stats_available_bytes
```
### Available space per PVC (GiB)
```promql
kubelet_volume_stats_available_bytes / 1024 / 1024 / 1024
```
### Free inodes (%)
```promql
kubelet_volume_stats_inodes_free / kubelet_volume_stats_inodes * 100
```
### PVCs predicted to fill up within 24h
```promql
predict_linear(kubelet_volume_stats_available_bytes[6h], 24 * 3600) < 0
```
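`predict_linear` fits a straight line through the samples in the lookback window and extrapolates it forward. The Python sketch below follows the simple linear regression that Prometheus documents for this function; the VictoriaMetrics implementation may differ in detail, and the sample data is invented for illustration:

```python
def predict_linear(samples, eval_time, horizon_seconds):
    """Least-squares extrapolation of (timestamp, value) samples.

    samples: list of (unix_seconds, value) pairs inside the lookback window.
    eval_time: query evaluation time in unix seconds.
    horizon_seconds: how far ahead to predict, e.g. 24 * 3600.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    # Shift timestamps so the intercept is the fitted value at eval_time.
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x * sum_x
    if denom == 0:
        raise ValueError("all samples share the same timestamp")
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return intercept + slope * horizon_seconds


# Hypothetical PVC losing 1 GiB of free space per hour, with 10 GiB free now.
GIB = 1024 ** 3
now = 6 * 3600
samples = [(h * 3600, (16 - h) * GIB) for h in range(7)]  # hourly samples over 6h
will_fill = predict_linear(samples, now, 24 * 3600) < 0
```

With 10 GiB free and a loss of 1 GiB per hour, the fitted line crosses zero well before the 24h horizon, so `will_fill` is true and the PVC would show up in the query result.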
---
## CPU
### CPU per pod (cores)
```promql
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
```
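`container_cpu_usage_seconds_total` is a counter of CPU-seconds consumed, so its per-second rate reads directly as cores in use. A simplified Python sketch of that calculation, which handles counter resets but ignores the window-boundary extrapolation that Prometheus applies. `simple_rate` is a hypothetical helper and the samples are made up:

```python
def simple_rate(samples):
    """Per-second increase of a counter from (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example a container
    restart): the counter is assumed to have restarted from zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# 150 CPU-seconds consumed over 300 wall-clock seconds: half a core.
cpu_seconds = [(0, 1000.0), (60, 1030.0), (120, 1060.0),
               (180, 1090.0), (240, 1120.0), (300, 1150.0)]
cores = simple_rate(cpu_seconds)
```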
### CPU per namespace (cores)
```promql
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
```
### CPU per node (%)
```promql
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
### Top 10 pods by CPU
```promql
topk(10, sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace))
```
### CPU usage vs. request
```promql
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
/
sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod, namespace)
```
---
## Memory
### Memory per pod (bytes)
```promql
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
```
### Memory per namespace (GiB)
```promql
sum(container_memory_working_set_bytes{container!=""}) by (namespace) / 1024 / 1024 / 1024
```
### Available memory per node (%)
```promql
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```
### Top 10 pods by memory
```promql
topk(10, sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace))
```
### Memory usage vs. limit
```promql
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
/
sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, namespace)
```
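The division only yields a result for pods present on both sides: after aggregation, series are paired by their `(pod, namespace)` labels, and pods without a memory limit drop out of the result. A Python sketch of that matching, with made-up pods and values (`divide_matching` is a hypothetical helper and assumes nonzero limits):

```python
MIB = 1024 ** 2


def divide_matching(usage, limits):
    """Divide two aggregated vectors, keeping only label sets found in both.

    Both arguments map (namespace, pod) -> value, mimicking the output of
    `sum(...) by (pod, namespace)`. Limits are assumed to be nonzero.
    """
    return {key: usage[key] / limits[key] for key in usage if key in limits}


usage = {
    ("gitlab", "web-0"): 512 * MIB,
    ("gitlab", "worker-0"): 256 * MIB,  # no memory limit configured
}
limits = {("gitlab", "web-0"): 1024 * MIB}

ratios = divide_matching(usage, limits)  # only web-0 survives the match
```

A missing pod in the output therefore means "no limit set", not "zero usage".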
---
## Pods and Containers
### Pods restarted in the last hour
```promql
sum(increase(kube_pod_container_status_restarts_total[1h])) by (pod, namespace) > 0
```
### Pods not Ready
```promql
kube_pod_status_ready{condition="false"} == 1
```
### Pods in CrashLoopBackOff
```promql
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
```
### Pending pods
```promql
kube_pod_status_phase{phase="Pending"} == 1
```
### OOMKilled containers
```promql
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```
### Total pods per namespace
```promql
sum(kube_pod_info) by (namespace)
```
### Pods per node
```promql
sum(kube_pod_info) by (node)
```
---
## Deployments
### Deployments with unavailable replicas
```promql
kube_deployment_status_replicas_unavailable > 0
```
### Deployments not up to date
```promql
kube_deployment_status_observed_generation != kube_deployment_metadata_generation
```
### Ratio of available replicas
```promql
kube_deployment_status_replicas_available / kube_deployment_spec_replicas
```
---
## Network
### Bytes received per pod (rate)
```promql
sum(rate(container_network_receive_bytes_total[5m])) by (pod, namespace)
```
### Bytes sent per pod (rate)
```promql
sum(rate(container_network_transmit_bytes_total[5m])) by (pod, namespace)
```
### Receive errors per network interface
```promql
sum(rate(node_network_receive_errs_total[5m])) by (instance, device)
```
### Established TCP connections
```promql
node_netstat_Tcp_CurrEstab
```
---
## Nodes
### Nodes not Ready
```promql
kube_node_status_condition{condition="Ready",status="true"} == 0
```
### Memory pressure
```promql
kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
```
### Disk pressure
```promql
kube_node_status_condition{condition="DiskPressure",status="true"} == 1
```
### Available disk per node (%)
```promql
(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100
```
### Load average (1 min)
```promql
node_load1
```
---
## Cluster Overview
### Total Running pods
```promql
sum(kube_pod_status_phase{phase="Running"})
```
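`kube_pod_status_phase` exposes one series per pod and phase, with value 1 for the pod's current phase and 0 for the others, so counting series and summing values give different answers. A Python sketch with hypothetical pods:

```python
# Each entry mimics one series: (pod, phase) -> value.
series = {
    ("web-0", "Running"): 1,
    ("web-0", "Pending"): 0,
    ("job-1", "Running"): 0,
    ("job-1", "Pending"): 1,
}


def select_phase(all_series, phase):
    """Mimic the selector {phase="..."}: keep matching series regardless of value."""
    return {key: value for key, value in all_series.items() if key[1] == phase}


running = select_phase(series, "Running")
counted = len(running)          # like count(...): every matching series, even those at 0
summed = sum(running.values())  # like sum(...): only pods actually in that phase
```

Here `counted` is 2 (every pod has a `Running` series) while `summed` is 1, the number of pods that are actually running.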
### Total namespaces
```promql
count(kube_namespace_created)
```
### Total deployments
```promql
count(kube_deployment_created)
```
### Total PVCs
```promql
count(kube_persistentvolumeclaim_info)
```
### Cluster age (days)
```promql
(time() - min(kube_namespace_created{namespace="kube-system"})) / 86400
```
---
## Victoria Metrics
### Active time series (per job)
```promql
count by (job) ({__name__!=""})
```
### Ingestion rate
```promql
sum(rate(vm_rows_inserted_total[5m]))
```
### Victoria Metrics disk usage
```promql
vm_data_size_bytes
```
### Queries per second
```promql
sum(rate(vm_http_requests_total{path="/api/v1/query"}[5m]))
```
---
## Tips
### Filter by namespace
```promql
# Add {namespace="my-namespace"} to any selector
sum(container_memory_working_set_bytes{namespace="gitlab"}) by (pod)
```
### Exclude system namespaces
```promql
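# Add the negative matcher to any selector, for example:
sum(container_memory_working_set_bytes{namespace!~"kube-system|argocd|monitoring|gitlab"}) by (namespace)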
```
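Regex matchers in PromQL are fully anchored, so the pattern excludes only those exact namespace names and leaves, say, `argocd-test` in the result. A Python sketch of the same filter using `re.fullmatch`; the namespace list is made up:

```python
import re

SYSTEM_NAMESPACES = re.compile(r"kube-system|argocd|monitoring|gitlab")


def exclude_system(namespaces):
    """Mimic namespace!~"...": drop names that match the whole pattern."""
    return [ns for ns in namespaces if not SYSTEM_NAMESPACES.fullmatch(ns)]


kept = exclude_system(["kube-system", "argocd", "argocd-test", "shop", "monitoring"])
```

To exclude every namespace that starts with a prefix, the pattern has to say so explicitly, for example `argocd.*`.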
### Aggregate by label
```promql
# Assumes kube-state-metrics is configured to expose pod labels (--metric-labels-allowlist)
sum by (label_app) (kube_pod_labels)
```
### Sort results
```promql
sort_desc(sum(container_memory_working_set_bytes) by (namespace))
```
### Top N
```promql
topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (pod))
```
### Value in the past (offset)
```promql
# Value from 1 hour ago
container_memory_working_set_bytes offset 1h
```
---
## References
- [PromQL Cheat Sheet](https://promlabs.com/promql-cheat-sheet/)
- [Victoria Metrics MetricsQL](https://docs.victoriametrics.com/metricsql/)
- [Grafana Dashboards](https://grafana.com/grafana/dashboards/)