關(guān)于k8s中的node-創(chuàng)新互聯(lián)

公司網(wǎng)絡(luò)更改重啟服務(wù)器后，發(fā)現(xiàn)Prometheus監(jiān)控中node節(jié)點(diǎn)三個(gè)掛掉了，實(shí)際上節(jié)點(diǎn)服務(wù)器是正常的，但是監(jiān)控的node_exporter請(qǐng)求http://IP:9100/metrics超過10秒沒有獲取返回?cái)?shù)據(jù)則認(rèn)為服務(wù)掛掉。

成都創(chuàng)新互聯(lián)是一家網(wǎng)站設(shè)計(jì)公司，集創(chuàng)意、互聯(lián)網(wǎng)應(yīng)用、軟件技術(shù)為一體的創(chuàng)意網(wǎng)站建設(shè)服務(wù)商，主營(yíng)產(chǎn)品：成都響應(yīng)式網(wǎng)站建設(shè)、品牌網(wǎng)站制作、營(yíng)銷型網(wǎng)站建設(shè)。我們專注企業(yè)品牌在網(wǎng)站中的整體樹立，網(wǎng)絡(luò)互動(dòng)的體驗(yàn)，以及在手機(jī)等移動(dòng)端的優(yōu)質(zhì)呈現(xiàn)。網(wǎng)站制作、成都網(wǎng)站制作、移動(dòng)互聯(lián)產(chǎn)品、網(wǎng)絡(luò)運(yùn)營(yíng)、VI設(shè)計(jì)、云產(chǎn)品.運(yùn)維為核心業(yè)務(wù)。為用戶提供一站式解決方案，我們深知市場(chǎng)的競(jìng)爭(zhēng)激烈，認(rèn)真對(duì)待每位客戶，為客戶提供賞析悅目的作品，網(wǎng)站的價(jià)值服務(wù)。

到各個(gè)節(jié)點(diǎn)服務(wù)器用curl命令檢測(cè)多久返回?cái)?shù)據(jù)

curl -o /dev/null -s -w '%{time_connect}:%{time_starttransfer}:%{time_total}\n' 'http://NodeIP:9100/metrics'

time_connect ：連接時(shí)間，從開始到TCP三次握手完成時(shí)間，這里面包括DNS解析的時(shí)候，如果想求連接時(shí)間，需要減去上面的解析時(shí)間；
time_starttransfer ：開始傳輸時(shí)間，從發(fā)起請(qǐng)求開始，到服務(wù)器返回第一個(gè)字段的時(shí)間；
time_total ：總時(shí)間；

可以看到time_connect時(shí)間是很快的，排除網(wǎng)絡(luò)問題。除了167返回時(shí)間正常其他幾個(gè)節(jié)點(diǎn)都是有問題的。

查看pod日志

ts=2023-01-11T06:11:08.189Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 163:9100" msg="->167:40743: write: broken pipe"
ts=2023-01-11T06:11:08.189Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 163:9100" msg="->.167:40743: write: broken pipe"

可以看到有大量的這種日志

參考資料： https://asktug.com/t/topic/153284/36 和 https://blog.csdn.net/lyf0327/article/details/99971590

有些說可以調(diào)大scrape_timeout這個(gè)時(shí)間，但是這個(gè)不是解決問題的根本

也有說調(diào)大node_exporter的yaml中的limit內(nèi)存和CPU,但是我這邊的node_exporter是沒有配置request和limit資源的，也不是這個(gè)問題，最后參考別人的配置文件

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "5"
    prometheus.io/scrape: "true"
  generation: 5
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
      - args:
        - --web.listen-address=$(HOSTIP):9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host   # 修改了這個(gè)原來是/host/root
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        - --log.level=debug
        env:
        - name: HOSTIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: prom/node-exporter:latest
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
        - mountPath: /host/sys
          name: sys
        - mountPath: /host     # 修改了這個(gè)原來是/rootfs
          name: rootfs
      dnsPolicy: ClusterFirst
      hostIPC: true
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
        name: proc
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /    
        name: rootfs
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

觀察了一天

根據(jù)上述修改后仍有問題，其中一個(gè)報(bào)權(quán)限不足

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "5"
    prometheus.io/scrape: "true"
  generation: 5
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
      - args:
        - --web.listen-address=$(HOSTIP):9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        - --log.level=debug
        env:
        - name: HOSTIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: prom/node-exporter:latest
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
        - mountPath: /host/sys
          name: sys
        - mountPath: /host
          name: rootfs
      dnsPolicy: ClusterFirst
      hostIPC: true
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: 
        runAsUser: 0   # 添加該選項(xiàng) 默認(rèn)是nobody用戶 65534的uid
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
        name: proc
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: rootfs
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

再觀察一天最新的問題有重新出現(xiàn)

報(bào)大量的error encoding and sending metric family: write tcp 10.37.0.163:9100" msg="->10.37.0.167:50266: write: broken pip

查了很多資料最后一個(gè)資料比較有用

地址：https://github.com/prometheus/node_exporter/issues/2500

我這邊的節(jié)點(diǎn)3個(gè)內(nèi)核是5.4.0-132，三個(gè)都是有問題的，而更高版本的內(nèi)核5.15.0是沒問題的
最終結(jié)論是內(nèi)核版本問題；
臨時(shí)解決方案是在node_exporter中添加環(huán)境變量GOMAXPROCS=1
永久解決是升級(jí)內(nèi)核

你是否還在尋找穩(wěn)定的海外服務(wù)器提供商？創(chuàng)新互聯(lián)www.cdcxhl.cn海外機(jī)房具備T級(jí)流量清洗系統(tǒng)配攻擊溯源，準(zhǔn)確流量調(diào)度確保服務(wù)器高可用性，企業(yè)級(jí)服務(wù)器適合批量采購，新人活動(dòng)首月15元起，快前往官網(wǎng)查看詳情吧

網(wǎng)站標(biāo)題：關(guān)于k8s中的node-創(chuàng)新互聯(lián)
分享鏈接：http://weahome.cn/article/gdpdg.html

真实的国产乱ⅩXXX66竹夫人,五月香六月婷婷激情综合,亚洲日本VA一区二区三区,亚洲精品一区二区三区麻豆

關(guān)于k8s中的node-創(chuàng)新互聯(lián)

其他資訊

網(wǎng)站制作

企業(yè)服務(wù)

網(wǎng)站建設(shè)

服務(wù)器托管