[Nightingale Monitoring] Managing Kubernetes Component Metrics

Author: 乔克 2023-05-11 07:08:07

This article covers only the monitoring of Kubernetes itself, and only how to do that within the Nightingale (夜莺) stack.

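Before You Begin

Kubernetes is both a simple and a complex system. It is simple in that its overall architecture is clean and clear: a standard Master-Slave model.

[Figure: Kubernetes architecture diagram]

It is complex in that neither the Master nor the Slave is a single process; each is made up of several components: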

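Master components

apiserver: the API entry point, responsible for authentication, authorization, access control, and API registration and discovery.
scheduler: responsible for scheduling resources.
controller-manager: maintains the cluster state.

Slave components

kubelet: maintains the container lifecycle and manages CSI and CNI.
kube-proxy: responsible for service discovery and load balancing.
container runtime (docker, containerd, and so on): image management, running containers, and CRI management.

Database component

Etcd: stores the cluster state and communicates with the apiserver.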

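Keeping track of the internal state of a system this complex at all times is hard, because there is so much ground to cover:

Operating system level: Kubernetes is deployed on top of an operating system, so OS-level monitoring is essential.
Kubernetes itself: Kubernetes has quite a few components, and their health determines the stability of the whole cluster.
Applications on Kubernetes: Kubernetes provides the runtime environment for applications. A company's application systems are deployed in the cluster, and their stability bears on the business.
Other underlying infrastructure, such as the network, data center, and racks.

There is a lot to monitor, and there are many SLIs. This article, however, covers only the monitoring of Kubernetes itself, and only how to do that within the Nightingale stack.

For Kubernetes itself, the main job is to monitor its system components, which are covered one by one below.

Note: installing Nightingale is not covered here. If you are unsure how, see the article "[Nightingale Monitoring] Getting to Know Nightingale" (《【夜莺监控】初识夜莺》); this walkthrough uses the installation method from that article.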

KubeApiServer

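ApiServer is the core of the Kubernetes architecture. It is the entry point for all APIs and ties all the system components together.

To make ApiServer easier to monitor and manage, its designers expose a series of metrics from it. Once the cluster is deployed, a service named kubernetes is created by default in the default namespace; this is the ApiServer's address.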

# kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   309d

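You can view the metrics with curl -s -k -H "Authorization: Bearer $token" https://10.96.0.1:443/metrics (443 is the service port listed above). Here $token is obtained by creating a ServiceAccount in the cluster and granting it the appropriate permissions.

So, to monitor ApiServer and collect its metrics, you first need authorization. To that end, prepare the credentials first.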
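To give a rough idea of what the /metrics endpoint returns, the following Python sketch parses a simplified subset of the Prometheus text exposition format into (name, labels, value) tuples. The function name parse_metrics and the sample text are assumptions made for illustration; they are not part of Nightingale, Categraf, or Prometheus, and the sketch assumes lines carry no optional timestamp.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# Matches one label pair such as code="200" (escaped quotes inside values allowed).
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse a simplified subset of the Prometheus text format (hypothetical helper)."""
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the last field is the value.
        head, _, value = line.rpartition(" ")
        name, _, label_part = head.partition("{")
        labels = dict(_LABEL.findall(label_part))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    example = (
        "# TYPE apiserver_request_total counter\n"
        'apiserver_request_total{code="200",verb="GET"} 1027\n'
        "up 1\n"
    )
    for sample in parse_metrics(example):
        print(sample)
```

A parser like this is only useful for quick inspection; in the setup described in this article, Prometheus Agent does the scraping and parsing.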

Create the namespace

kubectl create ns flashcat

Create the authentication and authorization resources

Create a file named 0-apiserver-auth.yaml with the following content:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: categraf
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - nodes/stats
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - extensions
      - networking.k8s.io
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: categraf
  namespace: flashcat
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: categraf
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: categraf
subjects:
  - kind: ServiceAccount
    name: categraf
    namespace: flashcat

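This mainly grants categraf permission to query the relevant resources so that it can fetch these components' metrics.

Metric collection

There are many ways to collect metrics. Collecting through automatic discovery is recommended, so that scaling or modifying components does not require adjusting the monitoring setup again.

Nightingale can ingest metrics from Prometheus Agent, and Prometheus is very good at service discovery, so Prometheus Agent is used here to collect the ApiServer metrics.

(1) Create the Prometheus configuration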

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-agent-conf
  labels:
    name: prometheus-agent-conf
  namespace: flashcat
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'apiserver'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
    remote_write:
    - url: 'http://192.168.205.143:17000/prometheus/v1/write'

The configuration above uses endpoints-based discovery to find the service named kubernetes in the default namespace whose port is named https, and then sends the scraped metrics to the Nightingale server at http://192.168.205.143:17000/prometheus/v1/write (adjust this address for your environment).
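The keep rule in relabel_configs can be pictured as follows: the values of the source labels are joined with ";" and the target is kept only if the whole joined string matches the regex. The Python sketch below mimics that behavior under simplifying assumptions; relabel_keep and the sample targets are hypothetical names for illustration, and Python's re module stands in for the RE2 engine that Prometheus uses.

```python
import re
from typing import Dict, List, Sequence


def relabel_keep(
    targets: Sequence[Dict[str, str]],
    source_labels: Sequence[str],
    regex: str,
    separator: str = ";",
) -> List[Dict[str, str]]:
    """Keep only targets whose joined source label values fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # Missing labels count as empty strings before joining.
        joined = separator.join(labels.get(name, "") for name in source_labels)
        # The match is anchored at both ends, so partial matches do not count.
        if pattern.fullmatch(joined):
            kept.append(labels)
    return kept


SOURCE_LABELS = [
    "__meta_kubernetes_namespace",
    "__meta_kubernetes_service_name",
    "__meta_kubernetes_endpoint_port_name",
]

if __name__ == "__main__":
    discovered = [
        {
            "__meta_kubernetes_namespace": "default",
            "__meta_kubernetes_service_name": "kubernetes",
            "__meta_kubernetes_endpoint_port_name": "https",
        },
        {
            "__meta_kubernetes_namespace": "kube-system",
            "__meta_kubernetes_service_name": "kube-dns",
            "__meta_kubernetes_endpoint_port_name": "metrics",
        },
    ]
    print(relabel_keep(discovered, SOURCE_LABELS, "default;kubernetes;https"))
```

This is why every job in the configuration can share the same endpoints discovery and differ only in the regex.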

(2) Deploy Prometheus Agent

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-agent
  namespace: flashcat
  labels:
    app: prometheus-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-agent
  template:
    metadata:
      labels:
        app: prometheus-agent
    spec:
      serviceAccountName: categraf
      containers:
        - name: prometheus
          image: prom/prometheus
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--web.enable-lifecycle"
            - "--enable-feature=agent"
          ports:
            - containerPort: 9090
          resources:
            requests:
              cpu: 500m
              memory: 500M
            limits:
              cpu: 1
              memory: 1Gi
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-agent-conf
        - name: prometheus-storage-volume
          emptyDir: {}

The --enable-feature=agent flag starts Prometheus in agent mode.

Then deploy all of the YAML files above to Kubernetes and check that Prometheus Agent is running.

# kubectl get po -n flashcat
NAME                                READY   STATUS    RESTARTS   AGE
prometheus-agent-78c8ccc4f5-g25st   1/1     Running   0          92s

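You can then view the corresponding metrics in the Nightingale UI.

With the metric data in hand, the next step is to put it to use, for example by building dashboards and handling alerts.

For example, Nightingale's Categraf provides an ApiServer dashboard (https://github.com/flashcatcloud/categraf/blob/main/k8s/apiserver-dash.json) that you can import.

Whether you are building dashboards or alerts, though, you first need a clear understanding of the ApiServer metrics.

A brief summary follows.

Metric overview

The metrics below come from the official Alibaba Cloud ACK documentation. I find it fairly complete and detailed, so part of it is reproduced here. To learn more, see the official website.

Metric list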

Metric	Type	Description
apiserver_request_duration_seconds_bucket	Histogram	Latency of requests from APIServer clients to APIServer, as a distribution across the different kinds of request. Request dimensions include Verb, Group, Version, Resource, Subresource, Scope, Component, and Client. Histogram bucket thresholds: {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 60}, in seconds.
apiserver_request_total	Counter	Count of the different kinds of request to APIServer. Request dimensions include Verb, Group, Version, Resource, Scope, Component, HTTP contentType, HTTP code, and Client.
apiserver_request_no_resourceversion_list_total	Counter	Count of LIST requests to APIServer whose parameters do not set ResourceVersion. Request dimensions include Group, Version, Resource, Scope, and Client. Used to assess quorum-read LIST requests: to find out whether there are too many of them and which clients send them, so that client request behavior can be optimized.
apiserver_current_inflight_requests	Gauge	Number of requests APIServer is currently handling, of two kinds: ReadOnly and Mutating.
apiserver_dropped_requests_total	Counter	Number of requests dropped by rate limiting. The HTTP response is 429 'Try again later'.
apiserver_admission_controller_admission_duration_seconds_bucket	Histogram	Processing latency of admission controllers. Labels include the admission controller name, operation (CREATE, UPDATE, CONNECT, and so on), API resource, operation type (validate or admit), and whether the request was rejected (true or false). Bucket thresholds: {0.005, 0.025, 0.1, 0.5, 2.5}, in seconds.
apiserver_admission_webhook_admission_duration_seconds_bucket	Histogram	Processing latency of admission webhooks. Labels include the admission controller name, operation (CREATE, UPDATE, CONNECT, and so on), API resource, operation type (validate or admit), and whether the request was rejected (true or false). Bucket thresholds: {0.005, 0.025, 0.1, 0.5, 2.5}, in seconds.
apiserver_admission_webhook_admission_duration_seconds_count	Counter	Count of requests processed by admission webhooks. Labels include the admission controller name, operation (CREATE, UPDATE, CONNECT, and so on), API resource, operation type (validate or admit), and whether the request was rejected (true or false).
cpu_utilization_core	Gauge	CPU usage, in cores.
cpu_utilization_ratio	Gauge	CPU utilization = CPU usage / CPU resource limit, as a percentage.
memory_utilization_byte	Gauge	Memory usage, in bytes.
memory_utilization_ratio	Gauge	Memory utilization = memory usage / memory resource limit, as a percentage.
up	Gauge	Service availability. 1: the service is available. 0: the service is unavailable.

Key metrics

Name	PromQL	Description
API QPS	sum(irate(apiserver_request_total[$interval]))	Total APIServer QPS.
Read request success rate	sum(irate(apiserver_request_total{code=~"20.*",verb=~"GET|LIST"}[$interval]))/sum(irate(apiserver_request_total{verb=~"GET|LIST"}[$interval]))	APIServer read request success rate.
Write request success rate	sum(irate(apiserver_request_total{code=~"20.*",verb!~"GET|LIST|WATCH|CONNECT"}[$interval]))/sum(irate(apiserver_request_total{verb!~"GET|LIST|WATCH|CONNECT"}[$interval]))	APIServer write request success rate.
In-flight read requests	sum(apiserver_current_inflight_requests{requestKind="readOnly"})	Number of read requests APIServer is currently handling.
In-flight write requests	sum(apiserver_current_inflight_requests{requestKind="mutating"})	Number of write requests APIServer is currently handling.
Request throttling rate	sum(irate(apiserver_dropped_requests_total[$interval]))	Rate of requests dropped by rate limiting.
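The QPS and success-rate expressions above are built on irate(), which takes the per-second rate from the last two samples of a counter in the window. The Python sketch below reproduces that arithmetic, including the usual treatment of a counter reset, and then divides a "successful" rate by a "total" rate. The function names and sample values are invented for illustration and are not taken from a real cluster.

```python
from typing import List, Optional, Tuple

# One counter sample: (unix timestamp in seconds, counter value).
Point = Tuple[float, float]


def irate(points: List[Point]) -> Optional[float]:
    """Per-second rate from the last two samples, oldest sample first."""
    if len(points) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = points[-2], points[-1]
    if t_last <= t_prev:
        return None
    # A drop in value means the counter was reset, so count from zero.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


def success_ratio(ok: List[Point], total: List[Point]) -> Optional[float]:
    """Successful request rate divided by total request rate."""
    ok_rate, total_rate = irate(ok), irate(total)
    if ok_rate is None or not total_rate:
        return None
    return ok_rate / total_rate


if __name__ == "__main__":
    # Hypothetical samples taken 15 seconds apart (the scrape interval above).
    total_reads = [(0.0, 100.0), (15.0, 130.0)]
    ok_reads = [(0.0, 90.0), (15.0, 117.0)]
    print("read QPS:", irate(total_reads))
    print("read success ratio:", success_ratio(ok_reads, total_reads))
```

Because irate() only looks at the last two samples, it reacts quickly but is noisy; rate() over a longer window is usually the better choice for alerting.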

Resource metrics

Name	PromQL	Description
Memory usage	memory_utilization_byte{container="kube-apiserver"}	APIServer memory usage, in bytes.
CPU usage	cpu_utilization_core{container="kube-apiserver"}*1000	APIServer CPU usage, in millicores.
Memory utilization	memory_utilization_ratio{container="kube-apiserver"}	APIServer memory utilization, as a percentage.
CPU utilization	cpu_utilization_ratio{container="kube-apiserver"}	APIServer CPU utilization, as a percentage.
Number of resource objects	max by(resource)(apiserver_storage_objects); max by(resource)(etcd_object_counts)	Number of resources managed by Kubernetes. The metric name may differ between versions.

QPS and latency

Name	PromQL	Description
QPS by Verb	sum(irate(apiserver_request_total{verb=~"$verb"}[$interval]))by(verb)	Request QPS per unit of time (1s), by Verb.
QPS by Verb+Resource	sum(irate(apiserver_request_total{verb=~"$verb",resource=~"$resource"}[$interval]))by(verb,resource)	Request QPS per unit of time (1s), by Verb and Resource.
Request latency by Verb	histogram_quantile($quantile, sum(irate(apiserver_request_duration_seconds_bucket{verb=~"$verb"}[$interval])) by (le,verb))	Request latency, by Verb.
Request latency by Verb+Resource	histogram_quantile($quantile, sum(irate(apiserver_request_duration_seconds_bucket{verb=~"$verb",resource=~"$resource"}[$interval])) by (le,verb,resource))	Request latency, by Verb and Resource.
Non-2xx read request QPS	sum(irate(apiserver_request_total{verb=~"GET|LIST",resource=~"$resource",code!~"2.*"}[$interval])) by (verb,resource,code)	QPS of read requests that returned a non-2xx code.
Non-2xx write request QPS	sum(irate(apiserver_request_total{verb!~"GET|LIST|WATCH",verb=~"$verb",resource=~"$resource",code!~"2.*"}[$interval])) by (verb,resource,code)	QPS of write requests that returned a non-2xx code.
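The latency rows rely on histogram_quantile(), which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below follows that idea for the common case of positive bucket bounds and 0 < q <= 1. It is a simplified illustration rather than the Prometheus implementation, and the bucket values in it are made up.

```python
import math
from typing import List, Optional, Tuple

# One bucket: (upper bound "le", cumulative count up to that bound).
Bucket = Tuple[float, float]


def histogram_quantile(q: float, buckets: List[Bucket]) -> Optional[float]:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    ordered = sorted(buckets)
    # The highest bucket must be the +Inf bucket, which holds the total count.
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        return None
    total = ordered[-1][1]
    if total == 0:
        return None
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    if idx == len(ordered) - 1:
        # The quantile falls in the +Inf bucket: report the last finite bound.
        return ordered[-2][0]
    upper, count = ordered[idx]
    # Assumption: the first bucket starts at 0, as is typical for latencies.
    lower, below = (0.0, 0.0) if idx == 0 else ordered[idx - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    # Hypothetical cumulative counts for buckets of 0.1s, 0.5s, 1s and +Inf.
    latency = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
    print("p75 latency estimate:", histogram_quantile(0.75, latency))
```

The estimate can never be more precise than the bucket layout allows, which is why the bucket thresholds listed earlier matter when reading latency panels.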

KubeControllerManager

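ControllerManager is another important Kubernetes component. It is responsible for resource control across the whole cluster and contains many controllers, such as NodeController and JobController.

The monitoring approach for ControllerManager is the same as for ApiServer: use Prometheus Agent to collect the metrics.

Metric collection

ControllerManager exposes its metrics through the /metrics endpoint on port 10257. Accessing this endpoint also requires the appropriate permissions, but they were already created when collecting the ApiServer metrics, so nothing new needs to be created here.

(1) Add the Prometheus configuration

In the existing Prometheus scrape configuration, add a new job for ControllerManager, as follows: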

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-agent-conf
  labels:
    name: prometheus-agent-conf
  namespace: flashcat
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'apiserver'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'controller-manager'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;kube-controller-manager;https-metrics

    remote_write:
    - url: 'http://192.168.205.143:17000/prometheus/v1/write'

My cluster has no corresponding endpoints, so a Service needs to be created, as follows:

apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
spec:
  clusterIP: None
  ports:
    - name: https-metrics
      port: 10257
      protocol: TCP
      targetPort: 10257
  selector:
    component: kube-controller-manager
  sessionAffinity: None
  type: ClusterIP

Apply the YAML resources to Kubernetes, then reload Prometheus with curl -X POST "http://:9090/-/reload", using the Prometheus Agent's address as the host.

At this point the ControllerManager metrics still cannot be scraped; ControllerManager's bind-address must be changed to 0.0.0.0.

After that, the metrics can be viewed in the Nightingale UI.

You can then import the dashboard from https://github.com/flashcatcloud/categraf/blob/main/k8s/cm-dash.json.

Metric overview

Metric list

Metric	Type	Description
workqueue_adds_total	Counter	Number of Adds events handled by the workqueue.
workqueue_depth	Gauge	Current depth of the workqueue.
workqueue_queue_duration_seconds_bucket	Histogram	How long tasks stay in the workqueue.
memory_utilization_byte	Gauge	Memory usage, in bytes.
memory_utilization_ratio	Gauge	Memory utilization = memory usage / memory resource limit, as a percentage.
cpu_utilization_core	Gauge	CPU usage, in cores.
cpu_utilization_ratio	Gauge	CPU utilization = CPU usage / CPU resource limit, as a percentage.
rest_client_requests_total	Counter	Number of HTTP requests, by status code, method, and host.
rest_client_request_duration_seconds_bucket	Histogram	HTTP request latency, by verb and URL.

Queue metrics

Name	PromQL	Description
Workqueue enqueue rate	sum(rate(workqueue_adds_total{job="controller-manager"}[$interval])) by (name)	None
Workqueue depth	sum(workqueue_depth{job="controller-manager"}) by (name)	None
Workqueue processing latency	histogram_quantile($quantile, sum(rate(workqueue_queue_duration_seconds_bucket{job="controller-manager"}[5m])) by (name, le))	None

Resource metrics

Name	PromQL	Description
Memory usage	memory_utilization_byte{container="kube-controller-manager"}	Memory usage, in bytes.
CPU usage	cpu_utilization_core{container="kube-controller-manager"}*1000	CPU usage, in millicores.
Memory utilization	memory_utilization_ratio{container="kube-controller-manager"}	Memory utilization, as a percentage.
CPU utilization	cpu_utilization_ratio{container="kube-controller-manager"}	CPU utilization, as a percentage.

QPS and latency

Name	PromQL	Description
Kube API request QPS	sum(rate(rest_client_requests_total{job="controller-manager",code=~"2.."}[$interval])) by (method,code); sum(rate(rest_client_requests_total{job="controller-manager",code=~"3.."}[$interval])) by (method,code); sum(rate(rest_client_requests_total{job="controller-manager",code=~"4.."}[$interval])) by (method,code); sum(rate(rest_client_requests_total{job="controller-manager",code=~"5.."}[$interval])) by (method,code)	HTTP requests sent to kube-apiserver, by method and return code.
Kube API request latency	histogram_quantile($quantile, sum(rate(rest_client_request_duration_seconds_bucket{job="controller-manager"}[$interval])) by (verb,url,le))	Latency of HTTP requests sent to kube-apiserver, by verb and request URL.

KubeScheduler

Scheduler listens on port 10259. Its metrics are again collected with Prometheus Agent.

Metric collection

(1) Edit the Prometheus configuration file

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-agent-conf
  labels:
    name: prometheus-agent-conf
  namespace: flashcat
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'apiserver'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'controller-manager'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;kube-controller-manager;https-metrics
      - job_name: 'scheduler'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;kube-scheduler;https

    remote_write:
    - url: 'http://192.168.205.143:17000/prometheus/v1/write'

Then configure a Service for Scheduler.

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  clusterIP: None
  ports:
    - name: https
      port: 10259
      protocol: TCP
      targetPort: 10259
  selector:
    component: kube-scheduler
  sessionAffinity: None
  type: ClusterIP

Apply the YAML resources to Kubernetes, then reload Prometheus with curl -X POST "http://:9090/-/reload", using the Prometheus Agent's address as the host.

At this point the Scheduler metrics still cannot be scraped; Scheduler's bind-address must be changed to 0.0.0.0.

Once that change is made, the metrics can be viewed in the Nightingale UI.

Import the dashboard (https://github.com/flashcatcloud/categraf/blob/main/k8s/scheduler-dash.json).

Metric overview

Metric list

Metric	Type	Description
scheduler_scheduler_cache_size	Gauge	Number of Nodes, Pods, and AssumedPods in the scheduler cache.
scheduler_pending_pods	Gauge	Number of pending Pods. Queue types: unschedulable (Pods that cannot be scheduled), backoff (Pods in backoffQ), active (Pods in activeQ).
scheduler_pod_scheduling_attempts_bucket	Histogram	Number of attempts the scheduler needed to schedule a Pod successfully. Bucket thresholds: 1, 2, 4, 8, 16.
memory_utilization_byte	Gauge	Memory usage, in bytes.
memory_utilization_ratio	Gauge	Memory utilization = memory usage / memory resource limit, as a percentage.
cpu_utilization_core	Gauge	CPU usage, in cores.
cpu_utilization_ratio	Gauge	CPU utilization = CPU usage / CPU resource limit, as a percentage.
rest_client_requests_total	Counter	Number of HTTP requests, by status code, method, and host.
rest_client_request_duration_seconds_bucket	Histogram	HTTP request latency, by verb and URL.

Basic metrics

Name	PromQL	Description
Scheduler cluster statistics	scheduler_scheduler_cache_size{job="scheduler",type="nodes"}; scheduler_scheduler_cache_size{job="scheduler",type="pods"}; scheduler_scheduler_cache_size{job="scheduler",type="assumed_pods"}	Number of Nodes, Pods, and AssumedPods in the scheduler cache.
Scheduler Pending Pods	scheduler_pending_pods{job="scheduler"}	Number of pending Pods. Queue types: unschedulable (Pods that cannot be scheduled), backoff (Pods in backoffQ), active (Pods in activeQ).
Scheduler attempts to schedule Pods successfully	histogram_quantile($quantile, sum(rate(scheduler_pod_scheduling_attempts_bucket{job="scheduler"}[$interval])) by (pod, le))	Number of attempts the scheduler made to schedule a Pod. Bucket thresholds: 1, 2, 4, 8, 16.

Resource metrics

Name	PromQL	Description
Memory usage	memory_utilization_byte{container="kube-scheduler"}	Memory usage, in bytes.
CPU usage	cpu_utilization_core{container="kube-scheduler"}*1000	CPU usage, in millicores.
Memory utilization	memory_utilization_ratio{container="kube-scheduler"}	Memory utilization, as a percentage.
CPU utilization	cpu_utilization_ratio{container="kube-scheduler"}	CPU utilization, as a percentage.

QPS and latency

Name	PromQL	Description
Kube API request QPS	sum(rate(rest_client_requests_total{job="scheduler",code=~"2.."}[$interval])) by (method,code); sum(rate(rest_client_requests_total{job="scheduler",code=~"3.."}[$interval])) by (method,code); sum(rate(rest_client_requests_total{job="scheduler",code=~"4.."}[$interval])) by (method,code); sum(rate(rest_client_requests_total{job="scheduler",code=~"5.."}[$interval])) by (method,code)	HTTP requests the scheduler sends to kube-apiserver, by method and return code.
Kube API request latency	histogram_quantile($quantile, sum(rate(rest_client_request_duration_seconds_bucket{job="scheduler"}[$interval])) by (verb,url,le))	Latency of HTTP requests the scheduler sends to kube-apiserver, by verb and request URL.

Etcd

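Etcd is the storage center of Kubernetes; all resource information is stored in it. It exposes monitoring metrics on port 2381.

Metric collection

Because Etcd here is deployed into the Kubernetes cluster as a static Pod, Prometheus Agent is again used to collect its metrics.

(1) Configure the Prometheus scrape configuration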

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-agent-conf
  labels:
    name: prometheus-agent-conf
  namespace: flashcat
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'apiserver'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'controller-manager'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;kube-controller-manager;https-metrics
      - job_name: 'scheduler'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;kube-scheduler;https
      - job_name: 'etcd'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: http
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;etcd;http
    remote_write:
    - url: 'http://192.168.205.143:17000/prometheus/v1/write'

Then add a Service for Etcd.

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: etcd
  labels:
    k8s-app: etcd
spec:
  selector:
    component: etcd
  type: ClusterIP
  clusterIP: None
  ports:
    - name: http
      port: 2381
      targetPort: 2381
      protocol: TCP

Deploy the YAML files and restart Prometheus. If no metrics come through, change Etcd's listen-metrics-urls setting so that it listens on 0.0.0.0 (for example, http://0.0.0.0:2381).

Import the dashboard (https://github.com/flashcatcloud/categraf/blob/main/k8s/etcd-dash.json).

Metric overview

Metric list

Metric	Type	Description
cpu_utilization_core	Gauge	CPU usage, in cores.
cpu_utilization_ratio	Gauge	CPU utilization = CPU usage / CPU resource limit, as a percentage.
etcd_server_has_leader	Gauge	Whether the etcd member has a leader. 1: it has a leader. 0: it has no leader.
etcd_server_is_leader	Gauge	Whether the etcd member is the leader. 1: yes. 0: no.
etcd_server_leader_changes_seen_total	Counter	Number of leader changes the etcd member has seen over a period of time.
etcd_mvcc_db_total_size_in_bytes	Gauge	Total size of the etcd member's db.
etcd_mvcc_db_total_size_in_use_in_bytes	Gauge	Size of the etcd member's db that is actually in use.
etcd_disk_backend_commit_duration_seconds_bucket	Histogram	etcd backend commit latency. Buckets: [0.001 0.002 0.004 0.008 0.016 0.032 0.064 0.128 0.256 0.512 1.024 2.048 4.096 8.192].
etcd_debugging_mvcc_keys_total	Gauge	Total number of etcd keys.
etcd_server_proposals_committed_total	Gauge	Total number of raft proposals committed.
etcd_server_proposals_applied_total	Gauge	Total number of raft proposals applied.
etcd_server_proposals_pending	Gauge	Number of raft proposals waiting in the queue.
etcd_server_proposals_failed_total	Counter	Number of raft proposals that failed.
memory_utilization_byte	Gauge	Memory usage, in bytes.
memory_utilization_ratio	Gauge	Memory utilization = memory usage / memory resource limit, as a percentage.

Basic metrics

Name	PromQL	Description
etcd liveness	etcd_server_has_leader; etcd_server_is_leader == 1	Whether the etcd members are alive; the normal value is 3 (one per member in a three-member cluster). Whether an etcd member is the leader; normally exactly one member must be the leader.
Leader changes in the past day	changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d])	Number of leader changes in the etcd cluster over the past day.
Memory usage	memory_utilization_byte{container="etcd"}	Memory usage, in bytes.
CPU usage	cpu_utilization_core{container="etcd"}*1000	CPU usage, in millicores.
Memory utilization	memory_utilization_ratio{container="etcd"}	Memory utilization, as a percentage.
CPU utilization	cpu_utilization_ratio{container="etcd"}	CPU utilization, as a percentage.
Disk size	etcd_mvcc_db_total_size_in_bytes; etcd_mvcc_db_total_size_in_use_in_bytes	Total size of the etcd backend db. Size of the etcd backend db that is actually in use.
Total kv count	etcd_debugging_mvcc_keys_total	Total number of kv pairs in the etcd cluster.
Backend commit latency	histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))	db commit latency.
Raft proposals	rate(etcd_server_proposals_failed_total{job="etcd"}[1m]); etcd_server_proposals_pending{job="etcd"}; etcd_server_proposals_committed_total{job="etcd"} - etcd_server_proposals_applied_total{job="etcd"}	Raft proposal failure rate (over a 1-minute window). Number of pending raft proposals. Difference between committed and applied proposals.
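The checks in the table can be combined into a single health rule: every member reports a leader, exactly one member is the leader, and the gap between committed and applied proposals stays small. The Python sketch below applies that rule to one reading per member. The function name check_etcd, the max_lag threshold of 1000, and the sample values are assumptions chosen for illustration, not recommendations from etcd or Nightingale.

```python
from typing import Dict, List

# Metric readings for one member, keyed by metric name.
Reading = Dict[str, float]


def check_etcd(members: Dict[str, Reading], max_lag: float = 1000.0) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []

    # Every member should report that it sees a leader.
    no_leader = sorted(
        name for name, r in members.items() if r["etcd_server_has_leader"] != 1
    )
    if no_leader:
        problems.append("members without a leader: " + ", ".join(no_leader))

    # Exactly one member should be the leader.
    leaders = [name for name, r in members.items() if r["etcd_server_is_leader"] == 1]
    if len(leaders) != 1:
        problems.append(f"expected exactly 1 leader, found {len(leaders)}")

    # The commit-apply gap should stay below the (assumed) threshold.
    for name in sorted(members):
        r = members[name]
        lag = (
            r["etcd_server_proposals_committed_total"]
            - r["etcd_server_proposals_applied_total"]
        )
        if lag > max_lag:
            problems.append(f"{name}: commit-apply lag {lag:.0f} exceeds {max_lag:.0f}")
    return problems


if __name__ == "__main__":
    # Hypothetical readings for a three-member cluster.
    cluster = {
        "etcd-0": {"etcd_server_has_leader": 1, "etcd_server_is_leader": 1,
                   "etcd_server_proposals_committed_total": 5000,
                   "etcd_server_proposals_applied_total": 5000},
        "etcd-1": {"etcd_server_has_leader": 1, "etcd_server_is_leader": 0,
                   "etcd_server_proposals_committed_total": 5000,
                   "etcd_server_proposals_applied_total": 4998},
        "etcd-2": {"etcd_server_has_leader": 1, "etcd_server_is_leader": 0,
                   "etcd_server_proposals_committed_total": 5000,
                   "etcd_server_proposals_applied_total": 4999},
    }
    print(check_etcd(cluster) or "all checks passed")
```

In practice the same conditions would be written as alert rules in Nightingale; the sketch only spells out the logic behind them.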

kubelet

kubelet is the main component on worker nodes. It listens on two ports: 10248 and 10250. 10248 is the health check port, and 10250 is the default system port, which exposes metrics through its /metrics endpoint.

Metric collection

Prometheus Agent is again used to collect the kubelet metrics.

(1) Modify the Prometheus configuration file

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-agent-conf
  labels:
    name: prometheus-agent-conf
  namespace: flashcat
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'apiserver'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'controller-manager'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __m                    
