This article analyzes the cluster-proportional-autoscaler source code and explains how it can be used to relieve KubeDNS performance bottlenecks.
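cluster-proportional-autoscaler is a Kubernetes incubator project that scales a target in a given namespace up or down in proportion to cluster size. Only ReplicationController (RC), ReplicaSet (RS), and Deployment targets are supported; StatefulSet is not supported yet. It currently provides two autoscale modes, linear and ladder, and the controller interface is clean enough that adding a custom mode is straightforward.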
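The working mechanism of cluster-proportional-autoscaler is simple. At a fixed interval (set by --poll-period-seconds, default 10s) it repeats the following steps:
Count the SchedulableNodes and SchedulableCores in the cluster;
Fetch the latest ConfigMap data from the apiserver;
Parse the ConfigMap parameters according to the configured autoscale mode;
Compute the new expected replica count according to the configured autoscale mode;
If it differs from the target's current replica count, call the Scale interface to trigger scaling.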
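cluster-proportional-autoscaler has the following 6 flags:
--namespace: the namespace of the object to autoscale;
--target: the object to autoscale; only deployment/replicationcontroller/replicaset are supported, case-insensitive;
--configmap: the name of a ConfigMap created in advance that stores the mode to use and its parameters; concrete examples follow later;
--default-params: if the ConfigMap named by --configmap does not exist or is deleted later, this value is used to create a new ConfigMap; setting it is recommended;
--poll-period-seconds: the polling period, default 10s;
--version: print the version and exit.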
pkg/autoscaler/autoscaler_server.go:82
func (s *AutoScaler) pollAPIServer() {
	// Query the apiserver for the cluster status --- number of nodes and cores
	clusterStatus, err := s.k8sClient.GetClusterStatus()
	if err != nil {
		glog.Errorf("Error while getting cluster status: %v", err)
		return
	}
	glog.V(4).Infof("Total nodes %5d, schedulable nodes: %5d", clusterStatus.TotalNodes, clusterStatus.SchedulableNodes)
	glog.V(4).Infof("Total cores %5d, schedulable cores: %5d", clusterStatus.TotalCores, clusterStatus.SchedulableCores)

	// Sync autoscaler ConfigMap with apiserver
	configMap, err := s.syncConfigWithServer()
	if err != nil || configMap == nil {
		glog.Errorf("Error syncing configMap with apiserver: %v", err)
		return
	}

	// Only sync updated ConfigMap or before controller is set.
	if s.controller == nil || configMap.ObjectMeta.ResourceVersion != s.controller.GetParamsVersion() {
		// Ensure corresponding controller type and scaling params.
		s.controller, err = plugin.EnsureController(s.controller, configMap)
		if err != nil || s.controller == nil {
			glog.Errorf("Error ensuring controller: %v", err)
			return
		}
	}

	// Query the controller for the expected replicas number
	expReplicas, err := s.controller.GetExpectedReplicas(clusterStatus)
	if err != nil {
		glog.Errorf("Error calculating expected replicas number: %v", err)
		return
	}
	glog.V(4).Infof("Expected replica count: %3d", expReplicas)

	// Update resource target with expected replicas.
	_, err = s.k8sClient.UpdateReplicas(expReplicas)
	if err != nil {
		glog.Errorf("Update failure: %s", err)
	}
}
GetClusterStatus counts the SchedulableNodes and SchedulableCores in the cluster; these are the inputs for computing the new expected replica count.
pkg/autoscaler/k8sclient/k8sclient.go:142
func (k *k8sClient) GetClusterStatus() (clusterStatus *ClusterStatus, err error) {
	opt := metav1.ListOptions{Watch: false}
	nodes, err := k.clientset.CoreV1().Nodes().List(opt)
	if err != nil || nodes == nil {
		return nil, err
	}
	clusterStatus = &ClusterStatus{}
	clusterStatus.TotalNodes = int32(len(nodes.Items))
	var tc resource.Quantity
	var sc resource.Quantity
	for _, node := range nodes.Items {
		tc.Add(node.Status.Capacity[apiv1.ResourceCPU])
		if !node.Spec.Unschedulable {
			clusterStatus.SchedulableNodes++
			sc.Add(node.Status.Capacity[apiv1.ResourceCPU])
		}
	}

	tcInt64, tcOk := tc.AsInt64()
	scInt64, scOk := sc.AsInt64()
	if !tcOk || !scOk {
		return nil, fmt.Errorf("unable to compute integer values of schedulable cores in the cluster")
	}
	clusterStatus.TotalCores = int32(tcInt64)
	clusterStatus.SchedulableCores = int32(scInt64)
	k.clusterStatus = clusterStatus
	return clusterStatus, nil
}
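When counting nodes, Unschedulable nodes are excluded.
When counting cores, the Capacity of Unschedulable nodes is excluded as well.
Note that the core count is based on each node's Capacity, not its Allocatable.
In my view, Allocatable would be a better choice than Capacity.
The difference shows up in large clusters. For example, if each node's Allocatable is 1 core and 4 GB less than its Capacity, then in a 2,000-node cluster the totals differ by 2,000 cores and 8,000 GB, which can change the target's replica count considerably.
Some readers may ask: how do Node Allocatable and Capacity differ?
Capacity is the full amount of resources the node's hardware provides, such as installed memory and CPU cores; it is determined by the hardware.
Allocatable is Capacity minus the kube-reserved and system-reserved resources configured through kubelet flags; it is what Kubernetes can actually allocate to applications.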
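To make the Capacity versus Allocatable gap concrete, here is a small arithmetic sketch. The node size (8 cores of Capacity, 7 Allocatable) and the coresPerReplica value of 64 are assumptions chosen for illustration, not values from the project; replicasFromCores is a hypothetical helper that only applies the ceil(cores / coresPerReplica) rule used by the linear mode.

```go
package main

import (
	"fmt"
	"math"
)

// replicasFromCores applies ceil(cores / coresPerReplica).
// Hypothetical helper for illustration; min/max clamping is left out.
func replicasFromCores(cores int, coresPerReplica float64) int {
	return int(math.Ceil(float64(cores) / coresPerReplica))
}

func main() {
	const (
		nodes              = 2000
		capacityPerNode    = 8    // assumed cores of Capacity per node
		allocatablePerNode = 7    // assumed: 1 core reserved per node
		coresPerReplica    = 64.0 // assumed scaling parameter
	)

	byCapacity := replicasFromCores(nodes*capacityPerNode, coresPerReplica)
	byAllocatable := replicasFromCores(nodes*allocatablePerNode, coresPerReplica)

	fmt.Println("replicas from Capacity:   ", byCapacity)
	fmt.Println("replicas from Allocatable:", byAllocatable)
	fmt.Println("difference:               ", byCapacity-byAllocatable)
}
```

Under these assumed numbers the two inputs yield 250 and 219 replicas respectively, so the choice of input changes the result by roughly 12%.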
syncConfigWithServer fetches the latest ConfigMap data from the apiserver. Note that it does not watch the ConfigMap; it issues a get every --poll-period-seconds (default 10s), so by default a ConfigMap change can take up to 10s to be picked up.
pkg/autoscaler/autoscaler_server.go:124
func (s *AutoScaler) syncConfigWithServer() (*apiv1.ConfigMap, error) {
	// Fetch autoscaler ConfigMap data from apiserver
	configMap, err := s.k8sClient.FetchConfigMap(s.k8sClient.GetNamespace(), s.configMapName)
	if err == nil {
		return configMap, nil
	}
	if s.defaultParams == nil {
		return nil, err
	}
	glog.V(0).Infof("ConfigMap not found: %v, will create one with default params", err)
	configMap, err = s.k8sClient.CreateConfigMap(s.k8sClient.GetNamespace(), s.configMapName, s.defaultParams)
	if err != nil {
		return nil, err
	}
	return configMap, nil
}
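If the ConfigMap named by --configmap already exists in the cluster, the latest version is fetched from the apiserver and returned;
If the ConfigMap named by --configmap does not exist in the cluster, a ConfigMap is created from the contents of --default-params and returned;
If the ConfigMap named by --configmap does not exist in the cluster and --default-params is not set either, nil and an error are returned, and the current poll ends in failure. Keep this in mind when deploying!
Setting --default-params is strongly recommended. The ConfigMap named by --configmap may be deleted by an administrator or user, intentionally or not. Without --default-params, every pollAPIServer run then stops at this point, which means the target is no longer autoscaled, and the real problem is that you may not even know the cluster is in this state.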
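The three cases above can be sketched with an in-memory stand-in for the apiserver. Everything here (the store type, syncConfig, errNotFound) is hypothetical and exists only to show the fallback order; it is not the project's client code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical error for a missing ConfigMap.
var errNotFound = errors.New("configmap not found")

// store is a hypothetical in-memory stand-in for the apiserver:
// ConfigMap name -> serialized parameters.
type store map[string]string

// syncConfig returns the stored config if present, otherwise creates it
// from defaults, and fails only when no defaults were supplied.
func syncConfig(s store, name string, defaults *string) (string, error) {
	if cfg, ok := s[name]; ok {
		return cfg, nil // case 1: ConfigMap exists
	}
	if defaults == nil {
		return "", errNotFound // case 3: nothing to fall back to
	}
	s[name] = *defaults // case 2: recreate from the default params
	return *defaults, nil
}

func main() {
	s := store{}
	defaults := `{"linear":{"nodesPerReplica":10,"min":1}}`

	_, err := syncConfig(s, "kube-dns-autoscaler", nil)
	fmt.Println("missing, no defaults:", err)

	cfg, err := syncConfig(s, "kube-dns-autoscaler", &defaults)
	fmt.Println("missing, with defaults:", cfg, err)

	// Simulate an accidental deletion: the next poll recreates the entry.
	delete(s, "kube-dns-autoscaler")
	cfg, err = syncConfig(s, "kube-dns-autoscaler", &defaults)
	fmt.Println("deleted, with defaults:", cfg, err)
}
```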
EnsureController creates the Controller that matches the controller type configured in the ConfigMap and parses its parameters.
pkg/autoscaler/controller/plugin/plugin.go:32
// EnsureController ensures controller type and scaling params
func EnsureController(cont controller.Controller, configMap *apiv1.ConfigMap) (controller.Controller, error) {
	// Expect only one entry, which uses the name of control mode as the key
	if len(configMap.Data) != 1 {
		return nil, fmt.Errorf("invalid configMap format, expected only one entry, got: %v", configMap.Data)
	}
	for mode := range configMap.Data {
		// No need to reset controller if control pattern doesn't change
		if cont != nil && mode == cont.GetControllerType() {
			break
		}
		switch mode {
		case laddercontroller.ControllerType:
			cont = laddercontroller.NewLadderController()
		case linearcontroller.ControllerType:
			cont = linearcontroller.NewLinearController()
		default:
			return nil, fmt.Errorf("not a supported control mode: %v", mode)
		}
		glog.V(1).Infof("Set control mode to %v", mode)
	}

	// Sync config with controller
	if err := cont.SyncConfig(configMap); err != nil {
		return nil, fmt.Errorf("Error syncing configMap with controller: %v", err)
	}
	return cont, nil
}
Check that the ConfigMap data contains exactly one entry; if not, the ConfigMap is invalid and the flow ends;
Check that the controller type is either linear or ladder and call the matching constructor to create the Controller; otherwise return an error;
linear --> NewLinearController
ladder --> NewLadderController
Call the Controller's SyncConfig to parse the parameters in the ConfigMap data and store them, together with the ConfigMap's ResourceVersion, in the Controller object.
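The validation rules above (exactly one entry, whose key is the mode name) can be sketched without any Kubernetes types. pickMode is a hypothetical helper, and the mode names "linear" and "ladder" are taken from the ConfigMap keys used in this article's examples.

```go
package main

import "fmt"

// pickMode returns the single control mode stored as the key of data.
// Hypothetical helper mirroring only the validation described above.
func pickMode(data map[string]string) (string, error) {
	if len(data) != 1 {
		return "", fmt.Errorf("expected exactly one entry, got %d", len(data))
	}
	for mode := range data {
		switch mode {
		case "linear", "ladder":
			return mode, nil
		default:
			return "", fmt.Errorf("not a supported control mode: %v", mode)
		}
	}
	// Not reachable because len(data) == 1; required by the compiler.
	return "", fmt.Errorf("empty data")
}

func main() {
	mode, err := pickMode(map[string]string{"linear": `{"nodesPerReplica": 10}`})
	fmt.Println(mode, err)

	_, err = pickMode(map[string]string{"linear": "{}", "ladder": "{}"})
	fmt.Println(err)

	_, err = pickMode(map[string]string{"cubic": "{}"})
	fmt.Println(err)
}
```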
The linear and ladder Controllers each implement their own GetExpectedReplicas method, which computes the replica count expected for the cluster status just observed. See the Linear Controller and Ladder Controller parts below for details.
UpdateReplicas takes the expected replica count computed by GetExpectedReplicas and calls the Scale interface of the target (rc/rs/deploy); Scale then carries out the actual scale-down or scale-up.
pkg/autoscaler/k8sclient/k8sclient.go:172
func (k *k8sClient) UpdateReplicas(expReplicas int32) (prevRelicas int32, err error) {
	scale, err := k.clientset.Extensions().Scales(k.target.namespace).Get(k.target.kind, k.target.name)
	if err != nil {
		return 0, err
	}
	prevRelicas = scale.Spec.Replicas
	if expReplicas != prevRelicas {
		glog.V(0).Infof("Cluster status: SchedulableNodes[%v], SchedulableCores[%v]", k.clusterStatus.SchedulableNodes, k.clusterStatus.SchedulableCores)
		glog.V(0).Infof("Replicas are not as expected : updating replicas from %d to %d", prevRelicas, expReplicas)
		scale.Spec.Replicas = expReplicas
		_, err = k.clientset.Extensions().Scales(k.target.namespace).Update(k.target.kind, scale)
		if err != nil {
			return 0, err
		}
	}
	return prevRelicas, nil
}
The following is a code analysis of the Linear Controller and Ladder Controller implementations.
First, the parameters of the linear Controller:
pkg/autoscaler/controller/linearcontroller/linear_controller.go:50
type linearParams struct {
	CoresPerReplica           float64 `json:"coresPerReplica"`
	NodesPerReplica           float64 `json:"nodesPerReplica"`
	Min                       int     `json:"min"`
	Max                       int     `json:"max"`
	PreventSinglePointFailure bool    `json:"preventSinglePointFailure"`
}
When writing the ConfigMap, use the following as a reference:
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-autoscaler
  namespace: default
data:
  linear: |-
    {
      "coresPerReplica": 2,
      "nodesPerReplica": 1,
      "preventSinglePointFailure": true,
      "min": 1,
      "max": 100
    }
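Most parameters are self-explanatory; the one worth discussing is PreventSinglePointFailure. As the name suggests, it guards against a single point of failure. It is a bool that the code does not initialize explicitly, so it defaults to false. You can set "preventSinglePointFailure": true in the ConfigMap data or in --default-params. When it is true and schedulableNodes > 1, the target is guaranteed at least 2 replicas, which prevents the target from becoming a single point of failure.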
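The default-to-false behavior follows from how Go's encoding/json handles omitted fields. The sketch below reuses the linearParams struct shown earlier; parseLinear is a hypothetical helper, and how the project itself decodes the ConfigMap is not shown in this article, so treat this as an illustration of the JSON mapping only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// linearParams matches the struct quoted from linear_controller.go.
type linearParams struct {
	CoresPerReplica           float64 `json:"coresPerReplica"`
	NodesPerReplica           float64 `json:"nodesPerReplica"`
	Min                       int     `json:"min"`
	Max                       int     `json:"max"`
	PreventSinglePointFailure bool    `json:"preventSinglePointFailure"`
}

// parseLinear is a hypothetical helper that decodes the JSON stored
// under the "linear" key of the ConfigMap.
func parseLinear(raw string) (linearParams, error) {
	var p linearParams
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

func main() {
	// Fields that are omitted keep their zero value: false, 0, 0.0.
	p, err := parseLinear(`{"nodesPerReplica": 10, "min": 1}`)
	fmt.Printf("%+v %v\n", p, err)

	q, err := parseLinear(`{"nodesPerReplica": 10, "min": 1, "preventSinglePointFailure": true}`)
	fmt.Printf("%+v %v\n", q, err)
}
```

Note that an omitted max decodes to 0, which the linear controller code treats as "no upper bound".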
pkg/autoscaler/controller/linearcontroller/linear_controller.go:101
func (c *LinearController) GetExpectedReplicas(status *k8sclient.ClusterStatus) (int32, error) {
	// Get the expected replicas for the currently schedulable nodes and cores
	expReplicas := int32(c.getExpectedReplicasFromParams(int(status.SchedulableNodes), int(status.SchedulableCores)))
	return expReplicas, nil
}

func (c *LinearController) getExpectedReplicasFromParams(schedulableNodes, schedulableCores int) int {
	replicasFromCore := c.getExpectedReplicasFromParam(schedulableCores, c.params.CoresPerReplica)
	replicasFromNode := c.getExpectedReplicasFromParam(schedulableNodes, c.params.NodesPerReplica)
	// Prevent single point of failure by having at least 2 replicas when
	// there are more than one node.
	if c.params.PreventSinglePointFailure && schedulableNodes > 1 && replicasFromNode < 2 {
		replicasFromNode = 2
	}

	// Returns the results which yields the most replicas
	if replicasFromCore > replicasFromNode {
		return replicasFromCore
	}
	return replicasFromNode
}

func (c *LinearController) getExpectedReplicasFromParam(schedulableResources int, resourcesPerReplica float64) int {
	if resourcesPerReplica == 0 {
		return 1
	}
	res := math.Ceil(float64(schedulableResources) / resourcesPerReplica)
	if c.params.Max != 0 {
		res = math.Min(float64(c.params.Max), res)
	}
	return int(math.Max(float64(c.params.Min), res))
}
Compute replicasFromCore from schedulableCores and the CoresPerReplica value in the ConfigMap:
replicasFromCore = ceil( schedulableCores * 1/CoresPerReplica )
Compute replicasFromNode from schedulableNodes and the NodesPerReplica value in the ConfigMap:
replicasFromNode = ceil( schedulableNodes * 1/NodesPerReplica )
If min or max is configured in the ConfigMap, each value is clamped to the range between min and max (max is applied only when it is non-zero):
replicas = min(replicas, max)
replicas = max(replicas, min)
If preventSinglePointFailure is true and schedulableNodes > 1, the single-point-of-failure rule described earlier applies and replicasFromNode must be at least 2:
replicasFromNode = max(2, replicasFromNode)
Return the larger of replicasFromNode and replicasFromCore as the expected replica count.
In summary, the linear controller computes replicas as follows (when preventSinglePointFailure is enabled, the adjustment above is applied after the clamp):
replicas = max( ceil( cores * 1/coresPerReplica ) , ceil( nodes * 1/nodesPerReplica ) )
replicas = min(replicas, max)
replicas = max(replicas, min)
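As a worked example of the linear formula, the sketch below restates the calculation as free functions so it runs without the project's packages. The function names fromParam and expectedReplicas are hypothetical; the parameter values in main come from the nginx-autoscaler ConfigMap shown earlier, and the cluster sizes are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// linearParams carries the same fields as the controller's params struct.
type linearParams struct {
	CoresPerReplica           float64
	NodesPerReplica           float64
	Min, Max                  int
	PreventSinglePointFailure bool
}

// fromParam computes ceil(resources / perReplica), clamped to [Min, Max].
// A zero perReplica means the dimension is unused and yields 1.
func fromParam(resources int, perReplica float64, p linearParams) int {
	if perReplica == 0 {
		return 1
	}
	res := math.Ceil(float64(resources) / perReplica)
	if p.Max != 0 {
		res = math.Min(float64(p.Max), res)
	}
	return int(math.Max(float64(p.Min), res))
}

// expectedReplicas returns the larger of the core-based and node-based
// results, after the single-point-of-failure adjustment.
func expectedReplicas(nodes, cores int, p linearParams) int {
	fromCore := fromParam(cores, p.CoresPerReplica, p)
	fromNode := fromParam(nodes, p.NodesPerReplica, p)
	if p.PreventSinglePointFailure && nodes > 1 && fromNode < 2 {
		fromNode = 2
	}
	if fromCore > fromNode {
		return fromCore
	}
	return fromNode
}

func main() {
	p := linearParams{CoresPerReplica: 2, NodesPerReplica: 1, Min: 1, Max: 100,
		PreventSinglePointFailure: true}

	fmt.Println(expectedReplicas(4, 13, p))     // ceil(13/2)=7 beats ceil(4/1)=4, prints 7
	fmt.Println(expectedReplicas(500, 4000, p)) // both results are capped by max, prints 100
}
```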
Next is the parameter structure of the ladder Controller:
pkg/autoscaler/controller/laddercontroller/ladder_controller.go:66
type paramEntry [2]int
type paramEntries []paramEntry
type ladderParams struct {
	CoresToReplicas paramEntries `json:"coresToReplicas"`
	NodesToReplicas paramEntries `json:"nodesToReplicas"`
}
When writing the ConfigMap, use the following as a reference:
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-autoscaler
  namespace: default
data:
  ladder: |-
    {
      "coresToReplicas":
      [
        [ 1, 1 ],
        [ 3, 3 ],
        [ 256, 4 ],
        [ 512, 5 ],
        [ 1024, 7 ]
      ],
      "nodesToReplicas":
      [
        [ 1, 1 ],
        [ 2, 2 ],
        [ 100, 5 ],
        [ 200, 12 ]
      ]
    }
Below is the ladder Controller's method for computing the expected replica count.
func (c *LadderController) GetExpectedReplicas(status *k8sclient.ClusterStatus) (int32, error) {
	// Get the expected replicas for the currently schedulable nodes and cores
	expReplicas := int32(c.getExpectedReplicasFromParams(int(status.SchedulableNodes), int(status.SchedulableCores)))
	return expReplicas, nil
}

func (c *LadderController) getExpectedReplicasFromParams(schedulableNodes, schedulableCores int) int {
	replicasFromCore := getExpectedReplicasFromEntries(schedulableCores, c.params.CoresToReplicas)
	replicasFromNode := getExpectedReplicasFromEntries(schedulableNodes, c.params.NodesToReplicas)

	// Returns the results which yields the most replicas
	if replicasFromCore > replicasFromNode {
		return replicasFromCore
	}
	return replicasFromNode
}

func getExpectedReplicasFromEntries(schedulableResources int, entries []paramEntry) int {
	if len(entries) == 0 {
		return 1
	}
	// Binary search for the corresponding replicas number
	pos := sort.Search(
		len(entries),
		func(i int) bool {
			return schedulableResources < entries[i][0]
		})
	if pos > 0 {
		pos = pos - 1
	}
	return entries[pos][1]
}
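Find the range in CoresToReplicas that schedulableCores falls into and take the replica count preset for that range.
Find the range in NodesToReplicas that schedulableNodes falls into and take the replica count preset for that range.
Return the larger of the two as the expected replica count.
Notes:
The ladder mode has no setting for preventing a single point of failure, so users must account for this themselves when writing the ConfigMap;
In ladder mode, if NodesToReplicas or CoresToReplicas is not configured or is empty, the corresponding replica count is set to 1;
For example, with the ConfigMap shown above, if the cluster has schedulableCores=400 (expected replicas: 4) and schedulableNodes=120 (expected replicas: 5), the final expected replica count is 5.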
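The 400-core / 120-node example can be reproduced with a standalone copy of the lookup. The names entry and lookup are hypothetical stand-ins for paramEntry and getExpectedReplicasFromEntries; the tables are the ones from the ladder ConfigMap above.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a {threshold, replicas} pair, like paramEntry in the project.
type entry [2]int

// lookup returns the replica count of the last entry whose threshold is
// less than or equal to resources. Entries must be sorted by threshold.
func lookup(resources int, entries []entry) int {
	if len(entries) == 0 {
		return 1
	}
	// sort.Search finds the first entry whose threshold exceeds resources.
	pos := sort.Search(len(entries), func(i int) bool {
		return resources < entries[i][0]
	})
	if pos > 0 {
		pos--
	}
	return entries[pos][1]
}

func main() {
	cores := []entry{{1, 1}, {3, 3}, {256, 4}, {512, 5}, {1024, 7}}
	nodes := []entry{{1, 1}, {2, 2}, {100, 5}, {200, 12}}

	fromCore := lookup(400, cores) // 256 <= 400 < 512, so 4
	fromNode := lookup(120, nodes) // 100 <= 120 < 200, so 5

	expected := fromCore
	if fromNode > expected {
		expected = fromNode
	}
	fmt.Println(fromCore, fromNode, expected) // prints: 4 5 5
}
```

Because the lookup relies on binary search, the entries in each table need to be listed in ascending order of their first element.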
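The following YAML creates the kube-dns-autoscaler Deployment and ConfigMap. kube-dns-autoscaler recomputes the replica count once per poll period (--poll-period-seconds is not set in this manifest, so the default applies) and may trigger a scaling operation.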
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: |
    {
      "nodesPerReplica": 10,
      "min": 1,
      "max": 50,
      "preventSinglePointFailure": true
    }
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        k8s-app: kube-dns-autoscaler
    spec:
      imagePullSecrets:
      - name: harborsecret
      containers:
      - name: autoscaler
        image: registry.vivo.xyz:4443/bigdata_release/cluster_proportional_autoscaler_amd64:1.0.0
        resources:
          requests:
            cpu: "50m"
            memory: "100Mi"
        command:
        - /cluster-proportional-autoscaler
        - --namespace=kube-system
        - --configmap=kube-dns-autoscaler
        - --target=Deployment/kube-dns
        - --default-params={"linear":{"nodesPerReplica":10,"min":1}}
        - --logtostderr=true
        - --v=2
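The cluster-proportional-autoscaler code is simple and its mechanism is straightforward. We want to use it to scale KubeDNS dynamically with cluster size, to address large-scale DNS resolution performance problems in our TensorFlow on Kubernetes project.
It currently autoscales only on SchedulableNodes and SchedulableCores. In AI workloads, cluster resources are often squeezed to the limit and the number of Services and Pods a cluster carries fluctuates widely, so we may later develop a controller that autoscales kube-dns based on the number of Services.
I am also considering moving the KubeDNS deployment off the AI training servers: training often drives server CPU usage above 95%, and if KubeDNS runs on the same server its performance will inevitably suffer.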