In this article, we show how to quickly deploy an elastically scalable GPU cluster, so that you can easily manage and scale GPU resources on AWS. The cluster uses Amazon EKS (Elastic Kubernetes Service) as the container orchestration platform, combined with AWS services and tools such as Karpenter, the AWS Load Balancer Controller, and the Amazon EBS CSI Driver, along with open-source tools such as Prometheus, Grafana, and KEDA.
Prerequisites
Before you begin, make sure your tools and environment are ready: an AWS account with credentials configured, plus the AWS CLI, eksctl, kubectl, and Helm installed (these are the tools used throughout this walkthrough).
Steps
1. Set environment variables
First, run the following commands to set the required environment variables. They are only needed during the installation process:
export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="0.35.0"
export K8S_VERSION="1.28"
export AWS_PARTITION="aws"
export CLUSTER_NAME="<cluster name>"
export AWS_DEFAULT_REGION="<region>"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export TEMPOUT="$(mktemp)"
2. Create the Karpenter IAM role and resources
Run the following commands to create the IAM role and other resources used by Karpenter, and to allow the use of Spot instances:
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TEMPOUT}" \
&& aws cloudformation deploy \
  --stack-name "Karpenter-${CLUSTER_NAME}" \
  --template-file "${TEMPOUT}" \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides "ClusterName=${CLUSTER_NAME}"

aws iam create-service-linked-role --aws-service-name spot.amazonaws.com || true
3. Create an EBS snapshot
Clone the bottlerocket-images-cache repository and build an EBS snapshot so that container images load quickly on new nodes:
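The original did not include the exact commands; a sketch based on the aws-samples/bottlerocket-images-cache repository follows. The snapshot.sh flags and the image list are assumptions, so check the repository's README for the exact usage:

git clone https://github.com/aws-samples/bottlerocket-images-cache.git
cd bottlerocket-images-cache
# Launch a temporary Bottlerocket instance, pre-pull the listed images,
# and snapshot its data volume (flag names are assumptions -- see the README)
./snapshot.sh -r ${AWS_DEFAULT_REGION} <image1>,<image2>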
When it completes, it returns an EBS snapshot ID in a format like snap-123456abcdef; save that ID for later use.
4. Deploy the EKS cluster
Now, create an EKS cluster configuration file, cluster.yaml:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_DEFAULT_REGION}
  version: "${K8S_VERSION}"
  tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
iam:
  withOIDC: true
  podIdentityAssociations:
  - namespace: "${KARPENTER_NAMESPACE}"
    serviceAccountName: karpenter
    roleName: ${CLUSTER_NAME}-karpenter
    permissionPolicyARNs:
    - arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME}
iamIdentityMappings:
- arn: "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
  username: system:node:{{EC2PrivateDNSName}}
  groups:
  - system:bootstrappers
  - system:nodes
managedNodeGroups: # Some controllers need to run on CPU nodes; we recommend at least 2 nodes, spread across 2 AZs
- instanceType: m6i.2xlarge
  amiFamily: AmazonLinux2
  name: ${CLUSTER_NAME}-ng
  desiredCapacity: 3
  privateNetworking: true
addons:
- name: eks-pod-identity-agent
Run the following command to create the cluster:
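The command itself was not included in the original. Because cluster.yaml references the environment variables set earlier, one approach (our suggestion, not from the original) is to substitute them with envsubst before handing the file to eksctl:

envsubst < cluster.yaml | eksctl create cluster -f -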
5. Deploy cluster components
Amazon EBS CSI Driver
Run the following commands to install the EBS CSI Driver:
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster ${CLUSTER_NAME} \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

eksctl create addon --name aws-ebs-csi-driver --cluster ${CLUSTER_NAME} --service-account-role-arn arn:aws:iam::${AWS_ACCOUNT_ID}:role/AmazonEKS_EBS_CSI_DriverRole --force
After the installation finishes, run the following command to create a StorageClass and set it as the default:
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  csi.storage.k8s.io/fstype: xfs
  type: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
EOF
AWS Load Balancer Controller
First, create the IAM role used by the AWS LB Controller:
curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.0/docs/install/iam_policy.json

aws iam create-policy \
  --policy-name AWSLoadBalancerControllerIAMPolicy \
  --policy-document file://iam-policy.json

eksctl create iamserviceaccount \
  --cluster=${CLUSTER_NAME} \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::${AWS_ACCOUNT_ID}:policy/AWSLoadBalancerControllerIAMPolicy \
  --override-existing-serviceaccounts \
  --region ${AWS_DEFAULT_REGION} \
  --approve
Install the AWS LB Controller:
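The install command was omitted from the original. The standard Helm installation from the eks-charts repository, reusing the service account created above, looks like this (any chart values beyond these are left at their defaults):

helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=${CLUSTER_NAME} \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller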
Karpenter
Karpenter's main purpose is to provide automatic scaling of the Kubernetes cluster itself (node elasticity), improving resource utilization and lowering running costs.
Specifically, it can:
1. Automatically add or remove Kubernetes nodes, dynamically adjusting cluster size to actual demand.
2. Mix nodes of different instance types, making full use of the available instance variety.
3. Combine with AWS Spot Instances and Reserved Instances to further reduce running costs.
4. React quickly to changes in cluster load, scaling in time to optimize resource utilization.
5. Integrate with AWS services, using their features and metrics for automated management.
In short, through automatic scaling and cost optimization, Karpenter simplifies operating Kubernetes clusters on AWS while improving resource efficiency and elasticity.
Install Karpenter:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
After installation completes, create the NodePool and NodeClass that new nodes will use by running the following command:
cat <<EOF > nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    metadata:
      labels: # Replace with the labels you want on the nodes, applied in Kubernetes
        node-type: gpu
    spec:
      taints: # Replace with the taints you want on the nodes; remove if not needed
      - key: nvidia.com/gpu
        effect: NoSchedule
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"] # Allow and prefer Spot instances; remove if you don't want Spot
      - key: node.kubernetes.io/instance-type # There are relatively few GPU instance types, so they can be pinned; alternatively use karpenter.k8s.aws/instance-family and let Karpenter choose
        operator: In
        values:
        - g5.xlarge
        - g5.2xlarge
      nodeClassRef:
        name: gpu # Must match the EC2NodeClass below
  limits:
    nvidia.com/gpu: 2 # Cap the maximum capacity, counted in GPUs
  disruption:
    consolidationPolicy: WhenEmpty # Only reclaim a node when it is empty
    consolidateAfter: 30s # Reclaim a node 30 seconds after its Pods scale down
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu
spec:
  amiFamily: Bottlerocket # Use the Bottlerocket OS
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: "${CLUSTER_NAME}"
      kubernetes.io/role/internal-elb: "1" # Select private subnets
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: "${CLUSTER_NAME}"
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 10Gi
      volumeType: gp3
      deleteOnTermination: true
      iops: 3000
      throughput: 125
  - deviceName: /dev/xvdb
    ebs:
      volumeSize: 80Gi
      volumeType: gp3
      deleteOnTermination: true
      iops: 3000
      throughput: 125
      # snapshotID: <EBS snapshot ID created above>
EOF
Create the resources:
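Because the heredoc above already expanded the environment variables into nodepool.yaml, applying the file directly is enough:

kubectl apply -f nodepool.yaml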
6. Test the cluster
You can test with the sample workload below (the nodes must expose GPU resources, e.g. via the NVIDIA device plugin, which the Bottlerocket NVIDIA AMI variants ship with):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
When the Pod runs successfully and its logs look like the following, the GPU instance was created successfully:
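The log excerpt was missing from the original; when the sample can see a GPU, the vectoradd container typically prints output along these lines:

kubectl logs gpu-pod
# Expected output (approximate):
#   [Vector addition of 50000 elements]
#   Copy input data from the host memory to the CUDA device
#   CUDA kernel launch with 196 blocks of 256 threads
#   Copy output data from the CUDA device to the host memory
#   Test PASSED
#   Done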
7. Deploy monitoring components
Prometheus
The kube-prometheus chart (installed below from the Bitnami OCI registry) can quickly deploy the Prometheus components into the cluster:
kubectl create namespace monitoring

helm install prometheus oci://registry-1.docker.io/bitnamicharts/kube-prometheus -n monitoring \
  --set global.storageClass=ebs-sc \
  --set prometheus.persistence.enabled=true \
  --set prometheus.persistence.size=50Gi \
  --set prometheus.service.type=LoadBalancer \
  --set 'prometheus.service.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-scheme=internet-facing' \
  --set 'prometheus.service.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-nlb-target-type=ip' \
  --set prometheus.service.loadBalancerClass=service.k8s.aws/nlb \
  --set blackboxExporter.enabled=false \
  --set alertmanager.persistence.enabled=true \
  --set alertmanager.service.type=LoadBalancer \
  --set alertmanager.service.loadBalancerClass=service.k8s.aws/nlb \
  --set 'alertmanager.service.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-scheme=internet-facing' \
  --set 'alertmanager.service.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-nlb-target-type=ip'
Grafana
Install Grafana:
helm install grafana oci://registry-1.docker.io/bitnamicharts/grafana -n monitoring \
  --set global.storageClass=ebs-sc \
  --set admin.password=password \
  --set service.type=LoadBalancer \
  --set service.loadBalancerClass=service.k8s.aws/nlb \
  --set 'service.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-scheme=internet-facing' \
  --set 'service.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-nlb-target-type=ip'
After installation, you can retrieve the URLs of Prometheus, Alertmanager, and Grafana with the following command.
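The exact command was not given in the original; since the services above are of type LoadBalancer, listing them shows the external hostnames:

kubectl get svc -n monitoring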
Grafana's default username and password are admin / password (as set via admin.password above).
metrics-server
Install metrics-server with the following command:
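The command was missing from the original; the standard manifest-based installation from the metrics-server project is:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml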
To also collect metrics from the GPU nodes, modify the prometheus-node-exporter DaemonSet and add the following at the same level as terminationGracePeriodSeconds:
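The snippet itself was missing from the original. Given that the GPU NodePool above taints its nodes with nvidia.com/gpu, a plausible reconstruction (an assumption on our part) is a toleration at the Pod spec level, the same level as terminationGracePeriodSeconds, so the exporter Pods can be scheduled onto the GPU nodes:

tolerations:
- key: nvidia.com/gpu   # assumed: matches the taint set by the GPU NodePool above
  operator: Exists
  effect: NoSchedule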
8. Install KEDA
KEDA is a component that can scale workloads based on many kinds of metrics (Pod autoscaling). Run the following commands to install it:
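The original omitted the command; the standard Helm installation of KEDA is:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace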
You can configure scaling against a variety of metrics; a sample configuration that scales on a Prometheus query looks like this:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaled-object-deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {Deployment name}
  pollingInterval: 30
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
      threshold: '20'
      activationThreshold: '1'
Create the SQS scaling role, which lets SQS messages trigger Pod scaling. Save the following policy document as iam-policy.json, then create the policy and the service account:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": [
        "arn:aws:sqs:us-east-1:123456789012:test-light-ai-input",
        "arn:aws:sqs:us-east-1:123456789012:test-light-ai-output"
      ]
    }
  ]
}

aws iam create-policy --policy-name keda-sqs-policy --policy-document file://iam-policy.json
export AWS_REGION=us-east-1
eksctl create iamserviceaccount --cluster=pro-image-ai --name=keda-sqs-role --namespace=default --attach-policy-arn=arn:aws:iam::123456789012:policy/keda-sqs-policy --approve
A sample configuration that scales on an AWS SQS queue looks like this:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaled-object-deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: image-ai-on-eks
  pollingInterval: 1
  cooldownPeriod: 60
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: keda-sqs-role
    metadata:
      awsRegion: us-east-1
      identityOwner: operator
      queueLength: "1"
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/test-light-ai-input
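Note that authenticationRef above points at a KEDA TriggerAuthentication object, which the original does not show. A minimal sketch consistent with identityOwner: operator (an assumption on our part; depending on your KEDA version you may instead need to attach the IAM role to the keda-operator service account) would be:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-sqs-role   # assumed: must match the authenticationRef name above
  namespace: default
spec:
  podIdentity:
    provider: aws-eks   # use the EKS IAM-roles-for-service-accounts identity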
9. Deploy the workload
Note: running multiple worker processes or threads may cause GPU OOM.
Create the service role
Create an IAM role for the containers to use:
cat > iam-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage"
      ],
      "Resource": "*"
    }
  ]
}
EOF

# Create the policy and note its ARN for the next command (this step is implied by the original but was not shown)
aws iam create-policy --policy-name <policyName> --policy-document file://iam-policy.json

export AWS_REGION=us-east-1
eksctl create iamserviceaccount --cluster=<clusterName> --name=<serviceAccountName> --namespace=<serviceAccountNamespace> --attach-policy-arn=<policyARN> --override-existing-serviceaccounts --approve
Serving over HTTP
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: image-ai-on-eks
  name: image-ai-on-eks
  namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: image-ai-on-eks
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: image-ai-on-eks
    spec:
      containers:
      - image: 930700710668.dkr.ecr.us-east-1.amazonaws.com/ai/iamge:pro
        imagePullPolicy: IfNotPresent
        name: image-ai
        command: ["gunicorn", "-w", "1", "-b", "0.0.0.0:8080", "app_lamp:app", "--log-level", "debug"]
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: "1"
      serviceAccount: imageaioneks
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: image-ai-on-eks
  name: image-ai-on-eks
  namespace: default
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/name: image-ai-on-eks
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: image-ai-on-eks
  namespace: default
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/group.name: ai
    alb.ingress.kubernetes.io/scheme: internet-facing
  labels:
    app.kubernetes.io/name: image-ai-on-eks
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /*
        pathType: ImplementationSpecific
        backend:
          service:
            name: image-ai-on-eks
            port:
              number: 8080
Serving over HTTPS
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: image-ai-on-eks
  name: image-ai-on-eks
  namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: image-ai-on-eks
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: image-ai-on-eks
    spec:
      containers:
      - image: 930700710668.dkr.ecr.us-east-1.amazonaws.com/ai/iamge:pro
        imagePullPolicy: IfNotPresent
        name: image-ai
        command: ["gunicorn", "-w", "1", "-b", "0.0.0.0:8080", "app_lamp:app", "--log-level", "debug"]
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: "1"
      serviceAccount: imageaioneks
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: image-ai-on-eks
  name: image-ai-on-eks
  namespace: default
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/name: image-ai-on-eks
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: image-ai-on-eks
  namespace: default
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/group.name: ai
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:930700710668:certificate/67154e58-cc28-4a66-8397-f510424395c0
  labels:
    app.kubernetes.io/name: image-ai-on-eks
spec:
  ingressClassName: alb
  rules:
  - host: pro-image-ai.my.com
    http:
      paths:
      - path: /*
        pathType: ImplementationSpecific
        backend:
          service:
            name: image-ai-on-eks
            port:
              number: 8080
  tls:
  - hosts:
    - pro-image-ai.my.com
    secretName: image-ai-tls
Serving based on SQS
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: test-light-ai
  name: test-light-ai
  namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: test-light-ai
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: test-light-ai
    spec:
      containers:
      - image: 930700710668.dkr.ecr.us-east-1.amazonaws.com/ai/light:test
        imagePullPolicy: IfNotPresent
        name: image-ai
        command: ["python3", "run_app.py"]
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: "1"
      serviceAccount: imageaioneks
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
Conclusion
With the steps above, you have deployed an elastically scalable GPU cluster, complete with monitoring components and autoscaling. This gives your workloads better performance and reliability while cutting management cost and effort. We hope this article has been helpful!