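A word up front: Prometheus + Grafana is now the first choice for monitoring. Many large companies, such as RBM, 360, and NetEase, use this stack.
1. What is Prometheus?
Prometheus is a monitoring system originally built at SoundCloud, an online audio streaming company, where it was developed by a former Google engineer after joining the company. It has been a community open-source project since 2012 and has a very active developer and user community. To emphasize its open-source nature and independent governance, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project after Kubernetes. The project has grown quickly, and its adoption rose along with that of Kubernetes.
https://prometheus.io Official website
https://github.com/prometheus GitHub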
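Prometheus components and architecture
Let's walk through the official architecture diagram.
The left side is where metrics are collected, that is, what gets monitored. Targets fall into two types. Most are long-running jobs, such as web services, which run continuously and expose metrics. The others are short-lived jobs, such as cron jobs, which exit once their work is done; these push their metrics to the Pushgateway, which exists specifically to collect metrics from short-lived jobs.
The middle is Prometheus itself, which has a built-in TSDB (time series database). Prometheus can handle both collection and display on its own, but its built-in UI is fairly crude, so the open-source Grafana is normally used for display. Once the monitored targets expose their metrics, Prometheus actively scrapes them and stores them in its TSDB. The Web UI, Grafana, and API clients then query this data with PromQL, which plays a role similar to SQL for MySQL.
The top middle is service discovery. With many targets, listing each one by hand is impractical, so newly added nodes, or whole batches of nodes, need to be discovered and added to monitoring automatically. Prometheus has built-in Kubernetes service discovery: it connects to the Kubernetes API, finds the applications and pods you have deployed, and exposes them as targets. This built-in support is why Prometheus and Kubernetes work so well together.
The top right is alerting, which is implemented by a separate component, Alertmanager. When a threshold is crossed, Prometheus sends the alert to Alertmanager, which processes it and forwards it to the receivers, for example by email, WeChat Work, or DingTalk. The overall architecture is made up of these five parts.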
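Summary:
• Prometheus Server: scrapes metrics, stores time series data, and provides a query interface
• Client Library: client libraries that can be integrated into applications written in many languages. For example, a website developed in Java can integrate the Java client to expose its own metrics, although many business metrics still have to be written by the developers.
• Push Gateway: short-term storage of metric data, mainly for short-lived jobs
• Exporters: collect monitoring metrics from existing third-party services and expose them as metrics, acting as a collection-side agent
• Alertmanager: alerting
• Web UI: a simple web console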
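Data model
Prometheus stores all data as time series; samples with the same metric name and the same set of labels belong to the same time series.
Each time series is uniquely identified by its metric name and a set of key-value pairs (also called labels). These labels are also what you query and filter on when writing PromQL.
Time series format: <metric name>{<label name>=<label value>, ...}
Example: api_http_requests_total{method="POST", handler="/messages"}
Here api_http_requests_total is the metric name, and the labels in braces record the request method (POST, GET, and so on) and the requested resource (such as /messages or an API path). You can attach further labels, such as the request protocol or other HTTP header fields, so anything you want to monitor can be marked this way.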
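The identity rule described above can be sketched in a few lines of Python. This is only an illustration: series_id is a hypothetical helper, not part of any Prometheus client library. Two samples belong to the same time series only when the metric name and the full label set match, regardless of the order in which the labels are written.

```python
def series_id(name, labels):
    """Render a metric name and label dict in Prometheus selector notation.

    Labels are sorted so the same label set always yields the same string.
    """
    pairs = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


post = series_id("api_http_requests_total", {"method": "POST", "handler": "/messages"})
same = series_id("api_http_requests_total", {"handler": "/messages", "method": "POST"})
get = series_id("api_http_requests_total", {"method": "GET", "handler": "/messages"})

print(post)
print(post == same)  # same name and labels: one series
print(post == get)   # a different label value: a separate series
```

Changing any label value, here method, produces a separate series, which is why labels with many distinct values multiply the number of series Prometheus has to store.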
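Jobs and instances
Instance: a target that can be scraped is called an instance. Anyone who has used Zabbix knows the monitored side as a host or monitored endpoint; in Prometheus it is called an instance.
Job: a collection of instances with the same purpose is called a job. It is a logical grouping of monitored targets: for example, if a web service runs on 3 machines, all 3 are listed under one job.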
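2. Kubernetes monitoring metrics
Monitoring Kubernetes itself
• Node resource utilization: production environments typically have tens to hundreds of nodes to monitor
• Number of nodes: once a node is monitored it counts as an instance, so the node count comes for free. How many projects a node can run also has to be assessed: you need to know the overall resource utilization and its current values so that you can estimate, for example, how much more resource another project would need.
• Number of pods (per node): likewise, how many pods run on each node. By default a node can run up to 110 pods, but in most cases it runs far fewer. For example, on a machine with 128 GB of memory and a 32-core CPU, with 2 GB allocated to each Java application, you can run about 50-60 pods. A typical machine runs a few dozen pods and very rarely more than 100.
• Resource object status: statistics on the state of resources such as pods, services, deployments, and jobs.
Pod monitoring
• Number of pods (per project): how many pods a project runs and roughly what its utilization is, so you can assess how many resources the project as a whole and each pod occupy.
• Container resource utilization: how much CPU and memory each container consumes
• Application: metrics specific to the application itself. These are usually hard for operations to obtain, so developers need to expose them before monitoring can begin. Client libraries exist for many languages; developers have to do some work to integrate one, expose the metrics of interest, and bring them into monitoring. If developers do not cooperate, this is hard for operations to do alone, unless you write your own client program in shell or Python that obtains the internal state from outside, which is easy if the application provides an API.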
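To make "exposing application metrics" concrete, here is a minimal Python sketch of what the body of a /metrics response looks like in the Prometheus text format. In practice one of the official client libraries would generate this; render_metrics and the metric names below are hypothetical and only illustrate the shape of the output.

```python
def render_metrics(samples):
    """Render (name, labels, value) tuples in the Prometheus text exposition format."""
    lines = []
    for name, labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical business metrics for an order service.
body = render_metrics([
    ("orders_created_total", {"channel": "web"}, 42),
    ("orders_in_progress", {}, 3),
])
print(body, end="")
```

An application that serves text like this over HTTP can be scraped by Prometheus like any other target.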
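Architecture for monitoring Kubernetes with Prometheus
To monitor node resources, deploy node_exporter, the collector for Linux hosts. Once it is running, it collects the current node's CPU, memory, network I/O, and so on.
To monitor containers, Kubernetes provides the built-in cAdvisor collector, which gathers metrics for pods and containers. It needs no separate deployment; you only need to know how to access cAdvisor.
To monitor Kubernetes resource objects, deploy the kube-state-metrics service. It periodically fetches these metrics from the API and makes them available to Prometheus. Alerts are sent to receivers through Alertmanager, and Grafana provides the visualization.
Service discovery:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
3. Deploying Prometheus + Grafana on Kubernetes
The YAML in this article may have lost its formatting, which can cause problems when deploying.
It is best to pull from my code repository. Send me your public key before cloning, otherwise the clone will fail: git clone git@gitee.com:zhaocheng172/prometheus.git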
[root@k8s-master prometheus-k8s]# ls
alertmanager-configmap.yaml OWNERS
alertmanager-deployment.yaml prometheus-configmap.yaml
alertmanager-pvc.yaml prometheus-rbac.yaml
alertmanager-service.yaml prometheus-rules.yaml
grafana.yaml prometheus-service.yaml
kube-state-metrics-deployment.yaml prometheus-statefulset-static-pv.yaml
kube-state-metrics-rbac.yaml prometheus-statefulset.yaml
kube-state-metrics-service.yaml README.md
node_exporter.sh
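First create the RBAC resources, because the main Prometheus service references them.
Prometheus connects to the Kubernetes API and reads many metrics from it.
The bound cluster role grants read-only permissions: it can view but not modify.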
[root@k8s-master prometheus-k8s]# cat prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: kube-system
[root@k8s-master prometheus-k8s]# kubectl create -f prometheus-rbac.yaml
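Now create the ConfigMap.
rule_files:
- /etc/config/rules/*.rules
This is the directory the alerting rules are written to; the ConfigMap is mounted into Prometheus so that the main process reads the configuration.
scrape_configs:
- job_name: prometheus
  static_configs:
  - targets:
    - localhost:9090
The entries below configure the monitored targets. job_name is the grouping; this job monitors Prometheus itself. There is also a job for the nodes: we will run node_exporter on each node, so change these targets to the nodes you want to monitor.
scrape_interval: 30s is the collection interval, that is, how often data is scraped.
There is also the name of the alerting service:
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:80"]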
[root@k8s-master prometheus-k8s]# kubectl create -f prometheus-configmap.yaml
[root@k8s-master prometheus-k8s]# cat prometheus-configmap.yaml
# Prometheus configuration format https://prometheus.io/docs/prometheus/latest/configuration/configuration/
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090
    - job_name: kubernetes-nodes
      scrape_interval: 30s
      static_configs:
      - targets:
        - 192.168.30.22:9100
        - 192.168.30.23:9100
    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:80"]
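Next configure the rules, which define the alerts. There are two groups: general rules, which apply to all instances and alert when an instance (the agent on a monitored target) goes down, and node rules, which monitor each node's CPU, memory, and disk utilization. Alert thresholds in Prometheus are written in PromQL: the expression queries a value and compares it, and when the comparison holds (evaluates to true) the alert is triggered and pushed to Alertmanager, which handles the notification. For example: expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80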
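The arithmetic behind the memory rule can be checked with a short Python sketch. The function names and the sample node values are assumptions chosen for illustration; only the formula and the threshold of 80 come from the rule itself.

```python
def memory_usage_percent(mem_free, cached, buffers, mem_total):
    """Mirror the rule expression: 100 - (free + cached + buffers) / total * 100."""
    return 100 - (mem_free + cached + buffers) / mem_total * 100


def should_alert(usage_percent, threshold=80):
    """The rule's condition is true only when usage is strictly above the threshold."""
    return usage_percent > threshold


GIB = 1024 ** 3
# Hypothetical node: 4 GiB total, 0.25 GiB free, 0.25 GiB cached, no buffers.
usage = memory_usage_percent(0.25 * GIB, 0.25 * GIB, 0, 4 * GIB)
print(usage, should_alert(usage))
```

Cached and buffer memory are counted as available because the kernel can reclaim them, so the rule measures memory that applications actually hold.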
[root@k8s-master prometheus-k8s]# kubectl create -f prometheus-rules.yaml
[root@k8s-master prometheus-k8s]# cat prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: usage of partition {{ $labels.mountpoint }} is too high"
          description: "{{ $labels.instance }}: usage of partition {{ $labels.mountpoint }} is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: memory usage is too high"
          description: "{{ $labels.instance }}: memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: CPU usage is too high"
          description: "{{ $labels.instance }}: CPU usage is above 60% (current value: {{ $value }})"
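The NodeCPUUsage expression is less obvious than the other two, so here is a simplified Python model of it. It assumes that irate() takes the per-second rate between the last two samples in the window, and it ignores counter resets, which real PromQL handles; the sample values are made up.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples, like PromQL irate()."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_by_cpu):
    """100 minus the average idle rate across a node's CPUs, as a percentage."""
    rates = [irate(samples) for samples in idle_by_cpu.values()]
    return 100 - sum(rates) / len(rates) * 100


# Hypothetical idle counters (seconds) for a dual-core node, sampled 30 s apart.
idle = {
    "cpu0": [(0, 1000.0), (30, 1015.0)],  # idle half of the time
    "cpu1": [(0, 2000.0), (30, 2007.5)],  # idle a quarter of the time
}
usage = cpu_usage_percent(idle)
print(usage, usage > 60)
```

Because node_cpu_seconds_total{mode="idle"} counts idle seconds, its rate is the idle fraction of each core, and subtracting the average from 100 gives the busy percentage per instance.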
Next, deploy the StatefulSet.
[root@k8s-master prometheus-k8s]# cat prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    k8s-app: prometheus
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v2.2.1
spec:
  serviceName: "prometheus"
  replicas: 1
  podManagementPolicy: "Parallel"
  updateStrategy:
    type: "RollingUpdate"
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: prometheus
      initContainers:
        - name: "init-chown-data"
          image: "busybox:latest"
          imagePullPolicy: "IfNotPresent"
          command: ["chown", "-R", "65534:65534", "/data"]
          volumeMounts:
            - name: prometheus-data
              mountPath: /data
              subPath: ""
      containers:
        - name: prometheus-server-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9090/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
        - name: prometheus-server
          image: "prom/prometheus:v2.2.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/prometheus.yml
            - --storage.tsdb.path=/data
            - --web.console.libraries=/etc/prometheus/console_libraries
            - --web.console.templates=/etc/prometheus/consoles
            - --web.enable-lifecycle
          ports:
            - containerPort: 9090
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          # based on 10 running nodes with 30 pods each
          resources:
            limits:
              cpu: 200m
              memory: 1000Mi
            requests:
              cpu: 200m
              memory: 1000Mi
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
              subPath: ""
            - name: prometheus-rules
              mountPath: /etc/config/rules
      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
  volumeClaimTemplates:
    - metadata:
        name: prometheus-data
      spec:
        storageClassName: managed-nfs-storage
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "16Gi"
Because I had already set up dynamic PVC provisioning on NFS for network storage, that step is not shown here; see my earlier blog post. The pod has now been created:
[root@k8s-master prometheus-k8s]# kubectl get pod -n kube-system
NAME                                 READY   STATUS    RESTARTS   AGE
coredns-bccdc95cf-kqxwv              1/1     Running   3          2d4h
coredns-bccdc95cf-nwkbp              1/1     Running   3          2d4h
etcd-k8s-master                      1/1     Running   2          2d4h
kube-apiserver-k8s-master            1/1     Running   2          2d4h
kube-controller-manager-k8s-master   1/1     Running   5          2d4h
kube-flannel-ds-amd64-dc5z9          1/1     Running   1          2d4h
kube-flannel-ds-amd64-jm2jz          1/1     Running   1          2d4h
kube-flannel-ds-amd64-z6tt2          1/1     Running   1          2d4h
kube-proxy-9ltx7                     1/1     Running   2          2d4h
kube-proxy-lnzrj                     1/1     Running   1          2d4h
kube-proxy-v7dqm                     1/1     Running   1          2d4h
kube-scheduler-k8s-master            1/1     Running   5          2d4h
prometheus-0                         2/2     Running   0          3m3s
Next look at the Service. We use the NodePort type with port 9090; you could also expose it through an Ingress.
[root@k8s-master prometheus-k8s]# cat prometheus-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  type: NodePort
  ports:
    - name: http
      port: 9090
      protocol: TCP
      targetPort: 9090
  selector:
    k8s-app: prometheus
After creating the Service you can access Prometheus on the randomly assigned node port, 32276 here; Prometheus has been deployed successfully.
[root@k8s-master prometheus-k8s]# kubectl create -f prometheus-service.yaml
[root@k8s-master prometheus-k8s]# kubectl get svc -n kube-system
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns     ClusterIP   10.1.0.10    <none>        53/UDP,53/TCP,9153/TCP   2d4h
prometheus   NodePort    10.1.58.1    <none>        9090:32276/TCP           22s
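The UI is very plain and has few features, so it hardly meets enterprise UI requirements; we only use it for debugging here. The box at the top is where you write PromQL expressions to query data, much like SQL for MySQL. You can debug under the Status menu. In the config file shown there, we added alert thresholds, added the node_exporter targets, and specified the Alertmanager address. Under Rules we defined two groups, general rules and node rules, which mainly monitor three things: memory, disk, and CPU.
You can check CPU utilization here, but Grafana is normally used for display.
5. Deploying Grafana on Kubernetes
This also uses a StatefulSet with an automatically provisioned PV; the node port is defined as 30007.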
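The same PromQL you type into the UI can also be sent to the Prometheus HTTP API at /api/v1/query. The Python sketch below only builds the request URL and does not contact a server; the helper name and the node address 192.168.30.22 are assumptions, while 32276 is the NodePort shown above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical node address; adjust to a node of your own cluster.
url = instant_query_url("http://192.168.30.22:32276", "up == 0")
print(url)
# Decoding the query string gives back the original expression.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Fetching this URL with any HTTP client returns a JSON document with the matching series, which is handy for checking a query outside the browser.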
[root@k8s-master prometheus-k8s]# cat grafana.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grafana
  namespace: kube-system
spec:
  serviceName: "grafana"
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana
          ports:
            - containerPort: 3000
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 256Mi
          volumeMounts:
            - name: grafana-data
              mountPath: /var/lib/grafana
              subPath: grafana
      securityContext:
        fsGroup: 472
        runAsUser: 472
  volumeClaimTemplates:
    - metadata:
        name: grafana-data
      spec:
        storageClassName: managed-nfs-storage
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 3000
      nodePort: 30007
  selector:
    app: grafana
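The default username and password are both admin.
First add Prometheus as a data source: add a data source and select Prometheus.
Enter a URL; this can be the address you use to access the Prometheus UI or the address of the Service.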
[root@k8s-master prometheus-k8s]# kubectl get svc -n kube-system
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE
grafana      NodePort    10.1.246.143   <none>        80:30007/TCP             11m
kube-dns     ClusterIP   10.1.0.10      <none>        53/UDP,53/TCP,9153/TCP   2d5h
prometheus   NodePort    10.1.58.1      <none>        9090:32276/TCP           40m
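The data source list now shows one entry.
6. Monitoring pods, nodes, and resource objects in a Kubernetes cluster
• Pod
The kubelet on each node uses the metrics endpoint provided by cAdvisor to obtain performance metrics for all pods and containers on that node.
The kubelet exposes two endpoint addresses:
http://NodeIP:10255/metrics/cadvisor read-only (plain HTTP)
https://NodeIP:10250/metrics/cadvisor the kubelet API; with the right authorization it allows any operation
You can check this on a node. Port 10250 is used for authenticated access to the kubelet API and for serving cAdvisor metrics. Prometheus started collecting cAdvisor data as soon as we deployed it, because the Prometheus configuration file already defines how to scrape it.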
[root@k8s-node1 ~]# netstat -antp |grep 10250
tcp6 0 0 :::10250 :::* LISTEN 107557/kubelet
tcp6 0 0 192.168.30.22:10250 192.168.30.23:58692 ESTABLISHED 107557/kubelet
tcp6 0 0 192.168.30.22:10250 192.168.30.23:46555 ESTABLISHED 107557/kubelet
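• Node
Use the node_exporter collector to gather node resource utilization.
https://github.com/prometheus/node_exporter
Documentation: https://prometheus.io/docs/guides/node-exporter/
• Resource objects
kube-state-metrics collects state information for the various resource objects in Kubernetes.
https://github.com/kubernetes/kube-state-metrics
Now import a dashboard template that can display pod data; a template presents the data much more intuitively.
7. Visualizing Prometheus monitoring data with Grafana
Recommended templates: these come from the Grafana dashboard sharing site, where people upload templates they have written. You can write your own and upload it for others to use. Each template has an ID, listed below; with the ID you can use the template. As long as the backend can run the template's PromQL and there is data, it will be displayed.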
Grafana.com
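• Cluster resource monitoring: 3119
• Resource state monitoring: 6417
• Node monitoring: 9276
Now use template 3119 to display the cluster's resources: open the add-template menu and select Dashboard.
Choose Import.
Enter 3119; it automatically recognizes the template's name.
Because the data is already there, you can immediately see the resources of the whole cluster.
The chart below shows network I/O, one series for receive and one for transmit.
The chart below shows cluster memory usage.
The machine has 4 GB, of which only 3.84 GB is recognized and 2.26 GB is in use; the CPU is dual-core, with 0.11 in use. The panel on the right is the cluster filesystem, but it shows nothing. We can look at how its PromQL is written and test that query in the Prometheus UI to see whether it returns data; usually the cause is that nothing matches.
Here is how to fix it.
Take the query and compare it against the data, removing conditions one at a time until data appears. Here the query matches on the node name. Because the template was uploaded by someone else, we have to adapt the matching to our own environment. Adjust the PromQL in Grafana accordingly, and the panel now gets data.
We may also want other monitoring templates. You can search for them on the official Grafana site, though some may not work as-is and need modification. For example, searching for k8s turns up one that monitors an etcd cluster.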
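Node
Use the node_exporter collector to gather node resource utilization.
https://github.com/prometheus/node_exporter
Documentation: https://prometheus.io/docs/guides/node-exporter/
node_exporter is currently not deployed as a pod, because that way disk usage is not displayed. The official project provides a StatefulSet-based deployment, which cannot show disk usage; instead you can run it as a daemon directly on each node. That deployment is simple: install the binary and start one instance on the host.
Look at the script: it uses the systemd collector to filter which services' states are monitored, so if one of these daemons dies, Prometheus will pick that up as well. These are the relevant flags: --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service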
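To see which units the whitelist pattern selects, you can try it in Python. This sketch assumes that node_exporter matches the pattern against the whole unit name, and it uses Python's re module rather than Go's regexp package; for a pattern this simple the two behave the same. The unit names are examples.

```python
import re

# The pattern passed to --collector.systemd.unit-whitelist in the script.
WHITELIST = r"(docker|kubelet|kube-proxy|flanneld).service"


def is_whitelisted(unit, pattern=WHITELIST):
    """Return True when the whole unit name matches the pattern."""
    return re.fullmatch(pattern, unit) is not None


for unit in ["docker.service", "kubelet.service", "sshd.service", "docker.socket"]:
    print(unit, is_whitelisted(unit))
```

Note that the dot before service is unescaped, so it matches any single character; writing \.service would be stricter, although it makes no practical difference for real unit names.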
[root@k8s-node1 ~]# bash node_exporter.sh
#!/bin/bash
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter
cat <<EOF >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=https://prometheus.io
[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
Prometheus actively pulls metrics from its targets, rather than passively waiting for the monitored side to push data.
We use template 9276 here; import it first.
[root@k8s-node1 ~]# ps -ef |grep node_ex
root 5275 1 0 21:59 ? 00:00:03 /usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service
root 7393 81364 0 22:15 pts/1 00:00:00 grep --color=auto node_ex
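Select Nodes; here you can see the resource state of the two nodes.
Fetching network bandwidth fails, so test the PromQL. Usually the cause is the network interface name: some hosts use eth0, others ens32 or ens33. Write the one that matches your machines.
Click Save.
Now the data appears.
Monitoring Kubernetes resource objects
This is implemented with kube-state-metrics, covering resource types such as pod, deployment, and service.
This component is developed by the Kubernetes project. It obtains the state of Kubernetes resources through the API and exposes it as metrics for collection, for example how many replicas there are and what state they are currently in.
The manifests are all on GitHub; you only need to replace the foreign image source with a domestic one, or with mine, since I have uploaded the images to Docker Hub.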
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus
Create the RBAC authorization rules
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups: [""]
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs: ["list", "watch"]
  - apiGroups: ["extensions"]
    resources:
      - daemonsets
      - deployments
      - replicasets
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - statefulsets
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - cronjobs
      - jobs
    verbs: ["list", "watch"]
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics-resizer
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups: [""]
    resources:
      - pods
    verbs: ["get"]
  - apiGroups: ["extensions"]
    resources:
      - deployments
    resourceNames: ["kube-state-metrics"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: kube-system
Create the Deployment
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v1.3.0
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
      version: v1.3.0
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
        version: v1.3.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: zhaocheng172/kube-state-metrics:v1.3.0
          ports:
            - name: http-metrics
              containerPort: 8080
            - name: telemetry
              containerPort: 8081
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 5
        - name: addon-resizer
          image: zhaocheng172/addon-resizer:1.8.3
          resources:
            limits:
              cpu: 100m
              memory: 30Mi
            requests:
              cpu: 100m
              memory: 30Mi
          env:
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: MY_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
          command:
            - /pod_nanny
            - --config-dir=/etc/config
            - --container=kube-state-metrics
            - --cpu=100m
            - --extra-cpu=1m
            - --memory=100Mi
            - --extra-memory=2Mi
            - --threshold=5
            - --deployment=kube-state-metrics
      volumes:
        - name: config-volume
          configMap:
            name: kube-state-metrics-config
---
# Config map for resource configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-config
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
Create the exposed ports; a Service is used here
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "kube-state-metrics"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
    - name: http-metrics
      port: 8080
      targetPort: http-metrics
      protocol: TCP
    - name: telemetry
      port: 8081
      targetPort: telemetry
      protocol: TCP
  selector:
    k8s-app: kube-state-metrics
Once deployed, import the template to see the data.
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system
NAME READY STATUS RESTARTS AGE
pod/coredns-bccdc95cf-kqxwv 1/1 Running 3 2d9h
pod/coredns-bccdc95cf-nwkbp 1/1 Running 3 2d9h
pod/etcd-k8s-master 1/1 Running 2 2d9h
pod/grafana-0 1/1 Running 0 4h60m
pod/kube-apiserver-k8s-master 1/1 Running 2 2d9h
pod/kube-controller-manager-k8s-master 1/1 Running 5 2d9h
pod/kube-flannel-ds-amd64-dc5z9 1/1 Running 1 2d9h
pod/kube-flannel-ds-amd64-jm2jz 1/1 Running 1 2d9h
pod/kube-flannel-ds-amd64-z6tt2 1/1 Running 1 2d9h
pod/kube-proxy-9ltx7 1/1 Running 2 2d9h
pod/kube-proxy-lnzrj 1/1 Running 1 2d9h
pod/kube-proxy-v7dqm 1/1 Running 1 2d9h
pod/kube-scheduler-k8s-master 1/1 Running 5 2d9h
pod/kube-state-metrics-6474469878-6kpxv 1/2 Running 0 4s
pod/kube-state-metrics-854b85d88-zl777 2/2 Running 0 35s
pod/prometheus-0 2/2 Running 0 5h40m
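Following the same steps as before, import template 6417.
The data is now displayed. It comes from the targets, which Prometheus discovered automatically. Discovery is driven by an annotation, in this case on the Service:
  annotations:
    prometheus.io/scrape: 'true'
The annotation declares which deployed applications Prometheus may discover automatically. With this rule in place, Prometheus automatically monitors everything carrying the annotation, so applications you deploy yourself that provide metrics are discovered automatically as well.
The disk panel needs one change because the metric names were updated: add the bytes suffix.
The panel below shows pod capacity, the maximum number of pods that can be created, which is limited by the kubelet. The cluster can hold 330 pods in total (the default of 110 per node across 3 nodes), of which 24 are allocated.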
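How the annotations turn into a scrape address is defined by the relabel rules of the kubernetes-service-endpoints job in prometheus.yml. The following Python sketch emulates two of those rules, the keep on prometheus.io/scrape and the address rewrite from prometheus.io/port. The discover function and the pod IP are hypothetical; the regular expression is the one from the configuration.

```python
import re

# Same pattern as the __address__ rewrite rule in prometheus.yml.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def discover(address, annotations):
    """Emulate the keep and address-rewrite steps of kubernetes-service-endpoints.

    Returns the address to scrape, or None when the service is not annotated.
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None  # the "keep" action drops targets without the annotation
    # Prometheus joins the source label values with ";" before matching.
    joined = f"{address};{annotations.get('prometheus.io/port', '')}"
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        return address  # no port annotation: the address is left unchanged
    return f"{match.group(1)}:{match.group(2)}"


annotated = {"prometheus.io/scrape": "true"}
with_port = {"prometheus.io/scrape": "true", "prometheus.io/port": "8081"}
print(discover("10.244.1.5:8080", annotated))
print(discover("10.244.1.5:8080", with_port))
print(discover("10.244.1.5:8080", {}))
```

The kube-state-metrics Service only sets prometheus.io/scrape, so its endpoints are scraped on the ports they already expose.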
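Summary:
With this monitoring in place, you have a basic view of resource usage across the Kubernetes cluster.
8. Alerting rules and alert notifications
Deploying Alertmanager on Kubernetes
A word up front: alerting on Kubernetes uses Alertmanager. First define rules with monitoring thresholds, for example alert when node memory reaches 60%. If a metric collected by Prometheus matches a rule, that is, the expression evaluates to true, Prometheus pushes the alert to Alertmanager. After a series of processing steps, Alertmanager delivers it to the receivers via webhook, email, DingTalk, or WeChat Work. We use email as the example here. WeChat Work requires registering company information such as a business license, and a webhook requires integrating with a third-party system and calling its interface to pass values, whereas email is supported out of the box. Prometheus does not natively support DingTalk; to use it you need a third-party component that converts the data, because the payload Prometheus sends does not match what DingTalk accepts, so a program in between has to convert it. Open-source implementations are available.
That is the basic flow; the rules themselves are all defined in Prometheus.
The ConfigMap below defines who sends the alert email and who receives it.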
[root@k8s-master prometheus-k8s]# vim alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'baojingtongzhi@163.com'
      smtp_auth_username: 'baojingtongzhi@163.com'
      smtp_auth_password: 'liang123'
    receivers:
    - name: default-receiver
      email_configs:
      - to: "17733661341@163.com"
    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m
[root@k8s-master prometheus-k8s]# cat alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.14.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.14.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/alertmanager.yml
            - --storage.path=/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: storage-volume
              mountPath: "/data"
              subPath: ""
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager
Look at the PVC; it also uses our managed-nfs-storage dynamic provisioner.
[root@k8s-master prometheus-k8s]# cat alertmanager-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
The Service type here is ClusterIP.
[root@k8s-master prometheus-k8s]# cat alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
  selector:
    k8s-app: alertmanager
  type: "ClusterIP"
Then create all the resources.
[root@k8s-master prometheus-k8s]# kubectl create -f alertmanager-configmap.yaml
[root@k8s-master prometheus-k8s]# kubectl create -f alertmanager-deployment.yaml
[root@k8s-master prometheus-k8s]# kubectl create -f alertmanager-pvc.yaml
[root@k8s-master prometheus-k8s]# kubectl create -f alertmanager-service.yaml
[root@k8s-master prometheus-k8s]# kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
alertmanager-5d75d5688f-xw2qg 2/2 Running 0 66s
coredns-bccdc95cf-kqxwv 1/1 Running 2 6d
coredns-bccdc95cf-nwkbp 1/1 Running 2 6d
etcd-k8s-master 1/1 Running 1 6d
grafana-0 1/1 Running 0 14h
kube-apiserver-k8s-master 1/1 Running 1 6d
kube-controller-manager-k8s-master 1/1 Running 2 6d
kube-flannel-ds-amd64-dc5z9 1/1 Running 1 5d23h
kube-flannel-ds-amd64-jm2jz 1/1 Running 1 5d23h
kube-flannel-ds-amd64-z6tt2 1/1 Running 1 6d
kube-proxy-9ltx7 1/1 Running 2 6d
kube-proxy-lnzrj 1/1 Running 1 5d23h
kube-proxy-v7dqm 1/1 Running 1 5d23h
kube-scheduler-k8s-master 1/1 Running 2 6d
kube-state-metrics-6474469878-lkphv 2/2 Running 0 98m
prometheus-0 2/2 Running 0 15h
You can then see the alert rules we configured in the Prometheus UI.
Now test alerting by modifying the Prometheus rules.
Set the node disk rule to alert at > 20.
[root@k8s-master prometheus-k8s]# vim prometheus-rules.yaml
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 20
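Recreate the pod; it restarts automatically. Checking Prometheus shows the change has taken effect. In production you would normally call the API to send a reload signal instead of recreating the pod as I did here; other articles online cover this.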
[root@k8s-master prometheus-k8s]# kubectl delete pod prometheus-0 -n kube-system
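As an alternative to deleting the pod, Prometheus can be asked to reload its configuration over HTTP. The StatefulSet above starts Prometheus with --web.enable-lifecycle, which enables the /-/reload endpoint that the configmap-reload sidecar also calls. The Python sketch below only builds the request and does not send it; the helper name and the node address are assumptions.

```python
from urllib.request import Request


def reload_request(base_url):
    """Build (but do not send) the POST request that asks Prometheus to reload."""
    return Request(f"{base_url.rstrip('/')}/-/reload", data=b"", method="POST")


# Hypothetical node address; 32276 is the NodePort assigned earlier.
req = reload_request("http://192.168.30.22:32276")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

The updated ConfigMap has to be applied to the cluster, and the new files have to appear inside the pod, before a reload picks up the changed rules.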
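Look at Alerts: the rule changes color and will turn red after a while. Alertmanager has fairly complex processing logic, involving silencing (alert suppression), grouping, and a waiting period before confirmation, so an alert is not sent the instant it is triggered.
Pink means the alert has already been pushed to Alertmanager; only in this state is the notification sent.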
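The color changes on the Alerts page follow the rule's for: 1m clause. The Python sketch below is a simplified model of that clause on the Prometheus side only: an alert is pending from the first evaluation at which its expression is true, and becomes firing once it has stayed true for the for duration. The 30-second evaluation interval and the function name are assumptions, and Alertmanager's own grouping delays (group_wait, group_interval) come on top of this.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each (timestamp, condition) evaluation."""
    states, active_since = [], None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # the condition cleared: start over
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp  # first evaluation at which the rule is true
        elapsed = timestamp - active_since
        states.append("firing" if elapsed >= for_seconds else "pending")
    return states


# Hypothetical evaluations every 30 s; the disk rule is true from t=30 to t=90.
evals = [(0, False), (30, True), (60, True), (90, True), (120, False)]
states = alert_states(evals, for_seconds=60)
print(states)
```

Only alerts in the firing state are sent to Alertmanager, which is why a short spike that clears within the for window never produces a notification.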