Complete observability stack:

- Victoria Metrics (vmsingle, vmagent, vmalert)
- Grafana with built-in dashboards
- Custom alerts (PVC, pods, nodes, deployments)
- pvc-autoresizer for automatic volume expansion
- Documented PromQL queries

Installed via ArgoCD, following the GitOps pattern from aula-11.

# Useful PromQL Queries

Ready-to-use queries for Grafana or for calling the Victoria Metrics API directly.

## How to use

### Via Grafana

1. Open Grafana → Explore
2. Select the "VictoriaMetrics" datasource
3. Paste the query into the editor

### Via API

```bash
# Port-forward (leave it running in a separate terminal)
kubectl port-forward -n monitoring svc/vmsingle-vm-victoria-metrics-k8s-stack 8429:8429

# Query
curl "http://localhost:8429/api/v1/query?query=up"
```

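Most queries in this document contain braces, quotes, and brackets, which must be URL-encoded before they go into the `query` parameter. The following is a minimal Python sketch, not part of the original stack, that builds the instant-query URL and parses a response in the Prometheus-compatible JSON format. The helper names `build_query_url` and `parse_instant_vector` and the sample response are hypothetical, and the base URL assumes the port-forward above is active.

```python
import json
from urllib.parse import urlencode

# Assumption: the port-forward from the section above is listening here.
BASE_URL = "http://localhost:8429"


def build_query_url(query: str, base_url: str = BASE_URL) -> str:
    """Return the instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"


def parse_instant_vector(body: str) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result item carries its labels and a [timestamp, "value"] pair.
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


# Hypothetical response body, shaped like a Prometheus-compatible API reply.
sample_body = (
    '{"status":"success","data":{"resultType":"vector","result":'
    '[{"metric":{"job":"kubelet"},"value":[1700000000,"1"]}]}}'
)

print(build_query_url('up{job="kubelet"}'))
print(parse_instant_vector(sample_body))
```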

---

## Storage / PVC

### PVC usage (%)

```promql
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
```

### PVCs above 80%

```promql
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.8
```

### Available space per PVC (bytes)

```promql
kubelet_volume_stats_available_bytes
```

### Available space per PVC (GiB)

```promql
kubelet_volume_stats_available_bytes / 1024 / 1024 / 1024
```

### Free inodes (%)

```promql
kubelet_volume_stats_inodes_free / kubelet_volume_stats_inodes * 100
```

### PVCs predicted to fill up within 24h

```promql
predict_linear(kubelet_volume_stats_available_bytes[6h], 24 * 3600) < 0
```

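To make the forecast query above less opaque: `predict_linear` fits a straight line through the samples in the range window (6h here) and extrapolates it the given number of seconds ahead (24h here). The Python sketch below illustrates that idea with an ordinary least-squares fit. It is an illustration, not the Victoria Metrics implementation, and the sample data describes a hypothetical PVC that loses 1 GiB of free space per hour.

```python
def predict_linear(samples, eval_time, horizon):
    """Least-squares linear extrapolation, in the spirit of PromQL's predict_linear().

    samples   -- list of (timestamp_seconds, value) pairs inside the range window
    eval_time -- timestamp at which the query is evaluated
    horizon   -- seconds into the future to predict
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope * (eval_time + horizon) + intercept


# Hypothetical data: 20 GiB free at the start, 1 GiB less every hour, for 6 hours.
GIB = 1024 ** 3
samples = [(h * 3600, (20 - h) * GIB) for h in range(7)]
predicted = predict_linear(samples, eval_time=6 * 3600, horizon=24 * 3600)

# A negative prediction means the volume is expected to run out within the horizon,
# which is the condition the `< 0` comparison in the query selects.
print(predicted < 0)
```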

---

## CPU

### CPU per pod (cores)

```promql
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
```

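The CPU queries rely on `rate()` over a counter of consumed CPU seconds: the per-second increase of that counter is the number of cores in use. The Python sketch below shows the core idea under simplifying assumptions. It handles counter resets but skips the extrapolation to the window boundaries that real `rate()` implementations perform, so treat it as an illustration only. The function name `simple_rate` and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second average increase of a counter over the window.

    samples -- list of (timestamp_seconds, counter_value), oldest first.
    A drop in the value is treated as a counter reset (e.g. a container restart).
    Simplification: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical container that burns 30 CPU-seconds every 60 s, i.e. half a core.
steady = [(t, 10.0 + 0.5 * t) for t in range(0, 301, 60)]
print(simple_rate(steady))

# Hypothetical restart between the second and third sample: the counter drops to 10.
restarted = [(0, 100.0), (60, 130.0), (120, 10.0)]
print(simple_rate(restarted))
```

Summing such per-container rates `by (pod, namespace)` is what yields the per-pod figure in cores.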
### CPU per namespace (cores)

```promql
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
```

### CPU per node (%)

```promql
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

### Top 10 pods by CPU

```promql
topk(10, sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace))
```

### CPU usage vs. request

```promql
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
/
sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod, namespace)
```


---

## Memory

### Memory per pod (bytes)

```promql
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
```

### Memory per namespace (GiB)

```promql
sum(container_memory_working_set_bytes{container!=""}) by (namespace) / 1024 / 1024 / 1024
```

### Available memory per node (%)

```promql
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```

### Top 10 pods by memory

```promql
topk(10, sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace))
```

### Memory usage vs. limit

```promql
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
/
sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, namespace)
```


---

## Pods and Containers

### Pods restarted in the last hour

```promql
sum(increase(kube_pod_container_status_restarts_total[1h])) by (pod, namespace) > 0
```

### Pods not Ready

```promql
kube_pod_status_ready{condition="false"} == 1
```

### Pods in CrashLoopBackOff

```promql
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
```

### Pending pods

```promql
kube_pod_status_phase{phase="Pending"} == 1
```

### OOMKilled containers

```promql
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```

### Total pods per namespace

```promql
sum(kube_pod_info) by (namespace)
```

### Pods per node

```promql
sum(kube_pod_info) by (node)
```


---

## Deployments

### Deployments with unavailable replicas

```promql
kube_deployment_status_replicas_unavailable > 0
```

### Deployments not yet updated

```promql
kube_deployment_status_observed_generation != kube_deployment_metadata_generation
```

### Ratio of available replicas

```promql
kube_deployment_status_replicas_available / kube_deployment_spec_replicas
```


---

## Network

### Bytes received per pod (rate)

```promql
sum(rate(container_network_receive_bytes_total[5m])) by (pod, namespace)
```

### Bytes sent per pod (rate)

```promql
sum(rate(container_network_transmit_bytes_total[5m])) by (pod, namespace)
```

### Network receive errors per interface

```promql
sum(rate(node_network_receive_errs_total[5m])) by (instance, device)
```

### Established TCP connections

```promql
node_netstat_Tcp_CurrEstab
```


---

## Nodes

### Nodes not Ready

```promql
kube_node_status_condition{condition="Ready",status="true"} == 0
```

### Memory pressure

```promql
kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
```

### Disk pressure

```promql
kube_node_status_condition{condition="DiskPressure",status="true"} == 1
```

### Available disk per node (%)

```promql
(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100
```

### Load average (1 min)

```promql
node_load1
```


---

## Cluster Overview

### Total Running pods

```promql
sum(kube_pod_status_phase{phase="Running"})
```

### Total namespaces

```promql
count(kube_namespace_created)
```

### Total deployments

```promql
count(kube_deployment_created)
```

### Total PVCs

```promql
count(kube_persistentvolumeclaim_info)
```

### Cluster age (days)

```promql
(time() - min(kube_namespace_created{namespace="kube-system"})) / 86400
```


---

## Victoria Metrics

### Metrics being collected (per job)

```promql
count by (job) ({__name__!=""})
```

### Ingestion rate

```promql
sum(rate(vm_rows_inserted_total[5m]))
```

### VM disk usage

```promql
vm_data_size_bytes
```

### Queries per second

```promql
sum(rate(vm_http_requests_total{path="/api/v1/query"}[5m]))
```


---

## Tips

### Filter by namespace

```promql
# Add {namespace="my-namespace"} to any query
sum(container_memory_working_set_bytes{namespace="gitlab", container!=""}) by (pod)
```

### Exclude system namespaces

```promql
# Selector fragment: append it to a metric name, e.g. kube_pod_info{namespace!~"..."}
{namespace!~"kube-system|argocd|monitoring|gitlab"}
```

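One detail of the exclusion selector above is easy to miss: PromQL regex matchers are fully anchored, so the pattern must match the entire label value. A namespace such as `gitlab-runner` is therefore not excluded by the `gitlab` alternative. The Python sketch below mimics that behaviour with `re.fullmatch`. PromQL uses RE2 syntax, which differs from Python's `re` in some features but not for a plain alternation like this one. The namespace list is hypothetical.

```python
import re

# Pattern copied from the selector above. re.fullmatch mimics PromQL's
# fully anchored regex matching (the whole label value must match).
SYSTEM_NAMESPACES = re.compile(r"kube-system|argocd|monitoring|gitlab")


def keep_namespace(namespace: str) -> bool:
    """Mimic namespace!~"...": keep the series when the regex does NOT match."""
    return SYSTEM_NAMESPACES.fullmatch(namespace) is None


# Hypothetical namespace names, for illustration only.
namespaces = ["kube-system", "gitlab", "gitlab-runner", "default", "monitoring"]
kept = [ns for ns in namespaces if keep_namespace(ns)]
print(kept)
```

To exclude every namespace that starts with `gitlab`, the pattern would need an explicit wildcard, for example `gitlab.*`.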
### Aggregate by label

```promql
# label_app is carried by kube_pod_labels (not kube_pod_info) and is only present
# when kube-state-metrics is configured to export the app label
sum by (label_app) (kube_pod_labels)
```

### Sort results

```promql
sort_desc(sum(container_memory_working_set_bytes{container!=""}) by (namespace))
```

### Top N

```promql
topk(5, sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod))
```

### Value in the past (offset)

```promql
# Value from 1 hour ago
container_memory_working_set_bytes offset 1h
```


---

## References

- [PromQL Cheat Sheet](https://promlabs.com/promql-cheat-sheet/)
- [Victoria Metrics MetricsQL](https://docs.victoriametrics.com/metricsql/)
- [Grafana Dashboards](https://grafana.com/grafana/dashboards/)