公司網(wǎng)絡(luò)更改重啟服務(wù)器后,發(fā)現(xiàn)Prometheus監(jiān)控中node節(jié)點(diǎn)三個(gè)掛掉了,實(shí)際上節(jié)點(diǎn)服務(wù)器是正常的,但是監(jiān)控的node_exporter請(qǐng)求http://IP:9100/metrics超過10秒沒有獲取返回?cái)?shù)據(jù)則認(rèn)為服務(wù)掛掉。
成都創(chuàng)新互聯(lián)是一家網(wǎng)站設(shè)計(jì)公司,集創(chuàng)意、互聯(lián)網(wǎng)應(yīng)用、軟件技術(shù)為一體的創(chuàng)意網(wǎng)站建設(shè)服務(wù)商,主營(yíng)產(chǎn)品:成都響應(yīng)式網(wǎng)站建設(shè)、品牌網(wǎng)站制作、營(yíng)銷型網(wǎng)站建設(shè)。我們專注企業(yè)品牌在網(wǎng)站中的整體樹立,網(wǎng)絡(luò)互動(dòng)的體驗(yàn),以及在手機(jī)等移動(dòng)端的優(yōu)質(zhì)呈現(xiàn)。網(wǎng)站制作、成都網(wǎng)站制作、移動(dòng)互聯(lián)產(chǎn)品、網(wǎng)絡(luò)運(yùn)營(yíng)、VI設(shè)計(jì)、云產(chǎn)品.運(yùn)維為核心業(yè)務(wù)。為用戶提供一站式解決方案,我們深知市場(chǎng)的競(jìng)爭(zhēng)激烈,認(rèn)真對(duì)待每位客戶,為客戶提供賞析悅目的作品,網(wǎng)站的價(jià)值服務(wù)。到各個(gè)節(jié)點(diǎn)服務(wù)器用curl命令檢測(cè)多久返回?cái)?shù)據(jù)
curl -o /dev/null -s -w '%{time_connect}:%{time_starttransfer}:%{time_total}\n' 'http://NodeIP:9100/metrics'
time_connect :連接時(shí)間,從開始到TCP三次握手完成時(shí)間,這里面包括DNS解析的時(shí)候,如果想求連接時(shí)間,需要減去上面的解析時(shí)間;
time_starttransfer :開始傳輸時(shí)間,從發(fā)起請(qǐng)求開始,到服務(wù)器返回第一個(gè)字段的時(shí)間;
time_total :總時(shí)間;
可以看到time_connect時(shí)間是很快的,排除網(wǎng)絡(luò)問題。除了167返回時(shí)間正常其他幾個(gè)節(jié)點(diǎn)都是有問題的。
查看pod日志
ts=2023-01-11T06:11:08.189Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 163:9100" msg="->167:40743: write: broken pipe"
ts=2023-01-11T06:11:08.189Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 163:9100" msg="->.167:40743: write: broken pipe"
可以看到有大量的這種日志
參考資料: https://asktug.com/t/topic/153284/36 和 https://blog.csdn.net/lyf0327/article/details/99971590
有些說可以調(diào)大scrape_timeout這個(gè)時(shí)間,但是這個(gè)不是解決問題的根本
也有說調(diào)大node_exporter的yaml中的limit內(nèi)存和CPU,但是我這邊的node_exporter是沒有配置request和limit資源的,也不是這個(gè)問題,最后參考別人的配置文件
apiVersion: apps/v1
kind: DaemonSet
metadata:
annotations:
deprecated.daemonset.template.generation: "5"
prometheus.io/scrape: "true"
generation: 5
labels:
app: node-exporter
name: node-exporter
namespace: monitoring
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
app: node-exporter
template:
metadata:
creationTimestamp: null
labels:
app: node-exporter
name: node-exporter
spec:
containers:
- args:
- --web.listen-address=$(HOSTIP):9100
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --path.rootfs=/host # 修改了這個(gè)原來是/host/root
- --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
- --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
- --log.level=debug
env:
- name: HOSTIP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
image: prom/node-exporter:latest
imagePullPolicy: IfNotPresent
name: node-exporter
ports:
- containerPort: 9100
hostPort: 9100
protocol: TCP
resources: {}
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /host/proc
name: proc
- mountPath: /host/sys
name: sys
- mountPath: /host # 修改了這個(gè)原來是/rootfs
name: rootfs
dnsPolicy: ClusterFirst
hostIPC: true
hostNetwork: true
hostPID: true
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
volumes:
- hostPath:
path: /proc
name: proc
- hostPath:
path: /sys
name: sys
- hostPath:
path: /
name: rootfs
updateStrategy:
rollingUpdate:
maxSurge: 0
maxUnavailable: 1
type: RollingUpdate
觀察了一天
根據(jù)上述修改后仍有問題,其中一個(gè)報(bào)權(quán)限不足
apiVersion: apps/v1
kind: DaemonSet
metadata:
annotations:
deprecated.daemonset.template.generation: "5"
prometheus.io/scrape: "true"
generation: 5
labels:
app: node-exporter
name: node-exporter
namespace: monitoring
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
app: node-exporter
template:
metadata:
creationTimestamp: null
labels:
app: node-exporter
name: node-exporter
spec:
containers:
- args:
- --web.listen-address=$(HOSTIP):9100
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --path.rootfs=/host
- --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
- --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
- --log.level=debug
env:
- name: HOSTIP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
image: prom/node-exporter:latest
imagePullPolicy: IfNotPresent
name: node-exporter
ports:
- containerPort: 9100
hostPort: 9100
protocol: TCP
resources: {}
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /host/proc
name: proc
- mountPath: /host/sys
name: sys
- mountPath: /host
name: rootfs
dnsPolicy: ClusterFirst
hostIPC: true
hostNetwork: true
hostPID: true
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
runAsUser: 0 # 添加該選項(xiàng) 默認(rèn)是nobody用戶 65534的uid
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
volumes:
- hostPath:
path: /proc
name: proc
- hostPath:
path: /sys
name: sys
- hostPath:
path: /
name: rootfs
updateStrategy:
rollingUpdate:
maxSurge: 0
maxUnavailable: 1
type: RollingUpdate
再觀察一天最新的問題有重新出現(xiàn)
報(bào)大量的error encoding and sending metric family: write tcp 10.37.0.163:9100" msg="->10.37.0.167:50266: write: broken pip
查了很多資料最后一個(gè)資料比較有用
地址:https://github.com/prometheus/node_exporter/issues/2500
我這邊的節(jié)點(diǎn)3個(gè)內(nèi)核是5.4.0-132,三個(gè)都是有問題的,而更高版本的內(nèi)核5.15.0是沒問題的
最終結(jié)論是內(nèi)核版本問題;
臨時(shí)解決方案是在node_exporter中添加環(huán)境變量GOMAXPROCS=1
永久解決是升級(jí)內(nèi)核
你是否還在尋找穩(wěn)定的海外服務(wù)器提供商?創(chuàng)新互聯(lián)www.cdcxhl.cn海外機(jī)房具備T級(jí)流量清洗系統(tǒng)配攻擊溯源,準(zhǔn)確流量調(diào)度確保服務(wù)器高可用性,企業(yè)級(jí)服務(wù)器適合批量采購,新人活動(dòng)首月15元起,快前往官網(wǎng)查看詳情吧