Prometheus Query Examples
- Infra Nodes
- Number of containers by cluster and namespace without CPU limits
- Count of pods per cluster and namespace
- CPU Usage per namespace
- CPU Usage per selected pod by Namespace
- CPU overcommit
- Kafka Disk Space Available
- Network by workload: pod, namespace, interface
- CPU Usage in OpenShift
- Find Pods killed by OOM
- Find Highest PVC Utilization
Infra Nodes
sort_desc(sum by (cpu,id,pod_name,container_name) (rate(container_cpu_usage_seconds_total{type="infra"}[5m])))
Number of containers by cluster and namespace without CPU limits
count by (namespace)(sum by (namespace,pod,container)(kube_pod_container_info{container!=""}) unless sum by (namespace,pod,container)(kube_pod_container_resource_limits{resource="cpu"}))
Count of pods per cluster and namespace
sum by (namespace) (kube_pod_info)
CPU Usage per namespace
sort_desc(sum by (namespace) (rate(container_cpu_usage_seconds_total[5m])))
CPU Usage per selected pod by Namespace
sum by (pod) (rate(container_cpu_usage_seconds_total{container!="",pod=~"service-label-generator.+",namespace=~"ecs-am-ramp-webapps-prd"}[1m]))
CPU overcommit
Having CPU limits above the capacity of the cluster is a scenario you need to avoid; otherwise, you'll end up with CPU throttling issues. You can detect CPU overcommit with the following query.
sum(kube_pod_container_resource_limits{resource="cpu"}) - sum(kube_node_status_capacity_cpu_cores)
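The same check can also be expressed as a ratio rather than a difference. This is a sketch reusing the two kube-state-metrics metrics above; a result greater than 1 means CPU limits are overcommitted:
# CPU limit commitment ratio (assumes the same metrics as the query above)
sum(kube_pod_container_resource_limits{resource="cpu"}) / sum(kube_node_status_capacity_cpu_cores)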
Kafka Disk Space Available
# data-kafka-copy-kafka-*
kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-kafka-copy-k.*"}
# data-kafka-copy-zookeeper-*
kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-kafka-copy-z.*"}
# data-kafka-kafka-*
kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-kafka-k.*"}
# data-kafka-zookeeper-*
kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-kafka-z.*"}
Network by workload: pod, namespace, interface
(sum(irate(container_network_receive_bytes_total{namespace='ecs-am-ramp-webapps-prd',pod=~'api-label.+'}[1m]))
by (pod, namespace, interface)) +
on(namespace,pod,interface) group_left(network_name) ( pod_network_name_info )
CPU Usage in OpenShift
container_cpu_usage_seconds_total is a counter, i.e. it increases over time. This isn't very informative for determining CPU usage at a particular time. Prometheus provides the rate() function, which returns the average per-second increase rate over counters. For example, the following query returns the average per-second increase of per-container container_cpu_usage_seconds_total metrics over the last 5 minutes (see the 5m lookbehind window in square brackets):
rate(container_cpu_usage_seconds_total[5m])
This is basically the average number of CPU cores used during the last
5 minutes. Just multiply it by 100 in order to get CPU usage in %. Note
that the resulting value may exceed 100% if the container uses more than
a single CPU core during the last 5 minutes.
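For instance, per-container CPU usage expressed as a percentage of one core:
rate(container_cpu_usage_seconds_total[5m]) * 100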
The rate(container_cpu_usage_seconds_total[5m]) usually returns
a TON of time series with many long labels in production Kubernetes,
so it is better to use the following queries:
The average number of CPU cores used during the last 5 minutes per each pod:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
The average number of CPU cores used during the last 5 minutes per each node:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
The average number of CPU cores used during the last 5 minutes per each namespace:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
The container!="" filter removes superfluous metrics related to cgroups hierarchy - see this answer for more details. Source
Find Pods killed by OOM
"kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} >=1"
Find Highest PVC Utilization
The following query identifies the top 20 PVCs with the highest utilization percentages in namespaces that contain "ecs-am".
topk(20,
(sum by (namespace, persistentvolumeclaim)
(kubelet_volume_stats_used_bytes{namespace=~".*ecs-am.*"}) /
sum by (namespace, persistentvolumeclaim)
(kubelet_volume_stats_capacity_bytes{namespace=~".*ecs-am.*"}))
*100)
Explanation of Each Component:
- topk(20, ...): This function retrieves the top 20 entries based on the value calculated in the inner expression. It sorts the results in descending order and selects the highest values.
- Inner Expression: This part of the query calculates the utilization percentage of Persistent Volume Claims (PVCs) in namespaces matching the regex .*ecs-am.*.
- sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace=~".*ecs-am.*"}): The kubelet_volume_stats_used_bytes metric represents the amount of storage currently used by the PVCs. The sum by (namespace, persistentvolumeclaim) aggregates the used bytes for each PVC, grouped by both the namespace and the PVC name. The namespace=~".*ecs-am.*" filter selects only those PVCs that are in namespaces containing "ecs-am".
- sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace=~".*ecs-am.*"}): The kubelet_volume_stats_capacity_bytes metric represents the total capacity allocated for each PVC. Similar to the previous sum, this aggregates the total capacity for each PVC, also grouped by namespace and PVC name, with the same namespace filter.
- Division, Multiplication by 100: The division of the two sums calculates the ratio of used storage to total storage (capacity) for each PVC. This gives a value between 0 and 1, representing how much of the allocated storage is currently being used. The result of the division is multiplied by 100 to convert the ratio into a percentage, i.e. the utilization percentage of each PVC.
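The same ratio also works as a simple threshold check. As a sketch reusing the metrics above (the 80% threshold and the namespace filter are arbitrary and should be adjusted), the following lists PVCs that are more than 80% full:
# PVCs above 80% utilization (example threshold)
(sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace=~".*ecs-am.*"})
/ sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace=~".*ecs-am.*"}))
* 100 > 80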