
Step-by-Step Guide to Building Kubeflow: A Machine Learning Platform on Kubernetes


Overview

Kubeflow is a toolkit built on top of Kubernetes for developing, training, tuning, deploying, and managing machine learning workloads. It integrates open-source projects from many areas of machine learning, such as Jupyter, TF Serving, Katib, Fairing, and Argo, and it covers the different stages of an ML workflow: data preprocessing, model training, model serving, and service management.

I. Preparing the Base Environment

Kubernetes version: v1.20.5

Docker version: v19.03.15

kfctl version: v1.2.0-0-gbc038f9

kustomize version: v4.1.3

I am not certain that Kubeflow 1.2.0 is fully compatible with Kubernetes v1.20.5; this is only a test setup.

For version compatibility, see the Kubeflow documentation: /docs/distributions/kfctl/overview#minimum-system-requirements

1. Install kfctl

kfctl is the control plane used to deploy and manage Kubeflow. The main deployment mode is to use kfctl as a CLI and apply a KFDef configuration, tailored to your Kubernetes flavor, to deploy and manage Kubeflow.

wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -xvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
chmod 755 kfctl
cp kfctl /usr/bin
kfctl version

2. Install kustomize

Kustomize is a configuration management solution that uses layering to preserve the base settings of applications and components: declarative YAML artifacts (called patches) selectively override the defaults without modifying the original files.

Download page: https://github.com/kubernetes-sigs/kustomize/releases

wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv4.1.3/kustomize_v4.1.3_linux_amd64.tar.gz
tar -xzvf kustomize_v4.1.3_linux_amd64.tar.gz
chmod 755 kustomize
mv kustomize /usr/bin/
kustomize version
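
To make the base/overlay idea concrete, here is a minimal, self-contained sketch; the kustomize-demo directory, the demo-app deployment, and the nginx image are made up for illustration. The overlay changes the replica count without touching the base file.

# Create a base and a "prod" overlay.
mkdir -p kustomize-demo/base kustomize-demo/overlays/prod

cat > kustomize-demo/base/deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: nginx:1.19
EOF

cat > kustomize-demo/base/kustomization.yaml <<'EOF'
resources:
- deployment.yaml
EOF

# The overlay only declares what differs from the base.
cat > kustomize-demo/overlays/prod/replica-patch.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3
EOF

cat > kustomize-demo/overlays/prod/kustomization.yaml <<'EOF'
resources:
- ../../base
patchesStrategicMerge:
- replica-patch.yaml
EOF

# Render the overlay; the base YAML itself is never modified.
kustomize build kustomize-demo/overlays/prod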

II. Deployment with Internet Access

If your servers can reach the internet, you can run the installation directly.

This test deployment uses machines in Alibaba Cloud's US West 1 (Silicon Valley) region.

1. Create a working directory for Kubeflow

mkdir /apps/kubeflow
cd /apps/kubeflow

2. Configure a StorageClass

# cat storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-nas
mountOptions:
- nolock,tcp,noresvport
- vers=3
parameters:
  volumeAs: subpath
  server: "*********.us-west-1.:/nasroot1/"   # Alibaba Cloud NAS is used here
  archiveOnDelete: "false"
provisioner: nasplugin.
reclaimPolicy: Retain

3. Set it as the default StorageClass

# kubectl get sc
NAME           PROVISIONER   RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
alicloud-nas   nasplugin.    Retain          Immediate           false                  24h

# Setting the annotation to "false" disables it as the default
# kubectl patch storageclass alicloud-nas -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# kubectl get sc
NAME                     PROVISIONER   RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
alicloud-nas (default)   nasplugin.    Retain          Immediate           false                  24h

4. Install and deploy

wget https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml
kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml

Once all pods have been created, check each of them.

Make sure all of the following pods are in the Running state.

# kubectl get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-7c75b559c4-c2hhj              1/1     Running   0          23h
cert-manager-cainjector-7f964fd7b5-mxbjl   1/1     Running   0          23h
cert-manager-webhook-566dd99d6-6vvzv       1/1     Running   2          23h

# kubectl get pods -n istio-system
NAME                                                          READY   STATUS      RESTARTS   AGE
cluster-local-gateway-5898bc5c74-822c9                        1/1     Running     0          23h
cluster-local-gateway-5898bc5c74-b5tmr                        1/1     Running     0          23h
cluster-local-gateway-5898bc5c74-fpswf                        1/1     Running     0          23h
istio-citadel-6dffd79d7-4scx7                                 1/1     Running     0          23h
istio-galley-77cb9b44dc-6l4lm                                 1/1     Running     0          23h
istio-ingressgateway-7bb77f89b8-psqcm                         1/1     Running     0          23h
istio-nodeagent-5qsmg                                         1/1     Running     0          23h
istio-nodeagent-ccc8j                                         1/1     Running     0          23h
istio-nodeagent-gqrsl                                         1/1     Running     0          23h
istio-pilot-67d94fc954-vl2sx                                  2/2     Running     0          23h
istio-policy-546596d4b4-6ct59                                 2/2     Running     1          23h
istio-security-post-install-release-1.3-latest-daily-qbrf6    0/1     Completed   0          23h
istio-sidecar-injector-796b6454d9-lv8dg                       1/1     Running     0          23h
istio-telemetry-58f9cd4bf5-8cjj5                              2/2     Running     1          23h
prometheus-7c6d764c48-s29kn                                   1/1     Running     0          23h

# kubectl get pods -n knative-serving
NAME                                 READY   STATUS    RESTARTS   AGE
activator-6c87fcbbb6-f4cs2           1/1     Running   0          23h
autoscaler-847b9f89dc-5jvml          1/1     Running   0          23h
controller-55f67c9ddb-67vvc          1/1     Running   0          23h
istio-webhook-db664df87-jn72n        1/1     Running   0          23h
networking-istio-76f8cc7796-9jr2j    1/1     Running   0          23h
webhook-6bff77594b-2r2gx             1/1     Running   0          23h

# kubectl get pods -n kubeflow
NAME                                                      READY   STATUS    RESTARTS   AGE
admission-webhook-bootstrap-stateful-set-0                1/1     Running   4          23h
admission-webhook-deployment-5cd7dc96f5-fw7d4             1/1     Running   2          23h
application-controller-stateful-set-0                     1/1     Running   0          23h
argo-ui-65df8c7c84-qwtc8                                  1/1     Running   0          23h
cache-deployer-deployment-5f4979f45-2xqbf                 2/2     Running   2          23h
cache-server-7859fd67f5-hplhm                             2/2     Running   0          23h
centraldashboard-67767584dc-j9ffz                         1/1     Running   0          23h
jupyter-web-app-deployment-8486d5ffff-hmbz4               1/1     Running   0          23h
katib-controller-7fcc95676b-rn98v                         1/1     Running   1          23h
katib-db-manager-85db457c64-jx97j                         1/1     Running   0          23h
katib-mysql-6c7f7fb869-bt87c                              1/1     Running   0          23h
katib-ui-65dc4cf6f5-nhmsg                                 1/1     Running   0          23h
kfserving-controller-manager-0                            2/2     Running   0          23h
kubeflow-pipelines-profile-controller-797fb44db9-rqzmg    1/1     Running   0          23h
metacontroller-0                                          1/1     Running   0          23h
metadata-db-6dd978c5b-zzntn                               1/1     Running   0          23h
metadata-envoy-deployment-67bd5954c-zvpf4                 1/1     Running   0          23h
metadata-grpc-deployment-577c67c96f-zjt7w                 1/1     Running   3          23h
metadata-writer-756dbdd478-dm4j4                          2/2     Running   0          23h
minio-54d995c97b-4rm2d                                    1/1     Running   0          23h
ml-pipeline-7c56db5db9-fprrw                              2/2     Running   1          23h
ml-pipeline-persistenceagent-d984c9585-vrd4g              2/2     Running   0          23h
ml-pipeline-scheduledworkflow-5ccf4c9fcc-9qkrq            2/2     Running   0          23h
ml-pipeline-ui-7ddcd74489-95dvl                           2/2     Running   0          23h
ml-pipeline-viewer-crd-56c68f6c85-tgxc2                   2/2     Running   1          23h
ml-pipeline-visualizationserver-5b9bd8f6bf-4zvwt          2/2     Running   0          23h
mpi-operator-d5bfb8489-gkp5w                              1/1     Running   0          23h
mxnet-operator-7576d697d6-qx7rg                           1/1     Running   0          23h
mysql-74f8f99bc8-f42zn                                    2/2     Running   0          23h
notebook-controller-deployment-5bb6bdbd6d-rclvr           1/1     Running   0          23h
profiles-deployment-56bc5d7dcb-2nqxj                      2/2     Running   0          23h
pytorch-operator-847c8d55d8-z89wh                         1/1     Running   0          23h
seldon-controller-manager-6bf8b45656-b7p7g                1/1     Running   0          23h
spark-operatorsparkoperator-fdfbfd99-9k46b                1/1     Running   0          23h
spartakus-volunteer-558f8bfd47-hskwf                      1/1     Running   0          23h
tf-job-operator-58477797f8-wzdcr                          1/1     Running   0          23h
workflow-controller-64fd7cffc5-zs6wx                      1/1     Running   0          23h

5. Access the Kubeflow UI

kubectl get svc/istio-ingressgateway -n istio-system
NAME                   TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                                                                                      AGE
istio-ingressgateway   NodePort   12.80.127.69   <none>        15020:32661/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP,15029:30345/TCP,15030:32221/TCP,15031:31392/TCP,15032:31191/TCP,15443:32136/TCP   5h14m

Forward it to your local machine for testing:

kubectl port-forward svc/istio-ingressgateway 80 -n istio-system

Then open http://localhost locally.

III. Offline Deployment of Kubeflow

If your machines cannot reach the internet, things will not go as smoothly as above; for example, pods get stuck in ImagePullBackOff.


You would then have to inspect each pod to find out which image it uses, and pull those images on a machine that can reach the internet (the one-liner below lists every container image referenced in the cluster).
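
As a shortcut, a single kubectl query can print all images referenced by the pods in the cluster. This is only a convenience sketch; it does not cover init containers or pods that never got scheduled:

kubectl get pods -A -o jsonpath="{range .items[*]}{range .spec.containers[*]}{.image}{'\n'}{end}{end}" | sort -u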

The rough steps are:

prepare the required images -> pull them and push them to an internal image registry -> modify the image addresses in the manifests-1.2.0 project -> repackage manifests-1.2.0 as v1.2.0.tar.gz -> deploy.

1. Prepare the required images

quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-webhook:v0.11.0
gcr.io/istio-release/citadel:release-1.3-latest-daily
gcr.io/istio-release/proxyv2:release-1.3-latest-daily
gcr.io/istio-release/node-agent-k8s:release-1.3-latest-daily
gcr.io/istio-release/pilot:release-1.3-latest-daily
gcr.io/istio-release/mixer:release-1.3-latest-daily
gcr.io/istio-release/kubectl:release-1.3-latest-daily
quay.io/jetstack/cert-manager-controller:v0.11.0
gcr.io/istio-release/galley:release-1.3-latest-daily
gcr.io/istio-release/sidecar_injector:release-1.3-latest-daily
gcr.io/istio-release/proxy_init:release-1.3-latest-daily
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.4.1
python:3.7
metacontroller/metacontroller:v0.3.0
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/istio-release/proxy_init:release-1.3-latest-daily
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.4.1
python:3.7
metacontroller/metacontroller:v0.3.0
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
gcr.io/ml-pipeline/persistenceagent:1.0.4
gcr.io/ml-pipeline/scheduledworkflow:1.0.4
gcr.io/ml-pipeline/frontend:1.0.4
mpioperator/mpi-operator:latest
kubeflow/mxnet-operator:v1.0.0-2025
gcr.io/ml-pipeline/metadata-writer:1.0.4
gcr.io/ml-pipeline/visualization-server:1.0.4
gcr.io/kubeflow-images-public/notebook-controller:vmaster-g6eb007d0
gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g518f9c76
docker.io/seldonio/seldon-core-operator:1.4.0
gcr.io/kubeflow-images-public/tf_operator:vmaster-gda226016
gcr.io/kubeflow-images-public/admission-webhook:v0520-v0-139-gcee39dbc-dirty-0d8f4c
gcr.io/ml-pipeline/cache-server:1.0.4
mysql:8.0.3
gcr.io/ml-pipeline/minio:RELEASE.-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:5.6
gcr.io/kubeflow-images-public/metadata:v0.1.11
gcr.io/kubeflow-images-public/profile-controller:vmaster-ga49f658f
gcr.io/kubeflow-images-public/kfam:vmaster-g9f3bfd00
gcr.io/google_containers/spartakus-amd64:v1.1.0
argoproj/workflow-controller:v2.3.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0

--- The following images have no tag after being pulled and must be tagged manually; it is recommended to pull them individually by hand ---
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:ffa3d72ee6c2eeb2357999248191a643405288061b7080381f22875cb703e929
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:f89fd23889c3e0ca3d8e42c9b189dc2f93aa5b3a91c64e8aab75e952a210eeb3
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:b86ac8ecc6b2688a0e0b9cb68298220a752125d0a048b8edf2cf42403224393c
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:e6b142c0f82e0e0b8cb670c11eb4eef6ded827f98761bbf4bea7bdb777b80092
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:75c7918ca887622e7242ec1965f87036db1dc462464810b72735a8e64111f6f7
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:7e6df0fda229a13219bbc90ff72a10434a0c64cd7fe13dc534b914247d1087f4

Panicking at the sight of so many images? Don't worry, you won't have to download them one by one.

We will write a shell script to automate these repetitive steps. It is best to run it on a relatively clean machine, ideally with all local images removed first, because the save step below filters on docker images: if the machine is not clean, your pre-existing local images will be saved along with the Kubeflow ones.

It is recommended to run this on a machine with more than 100 GB of free disk space.

2. Create pull_images.sh

# vim pull_images.sh
#!/bin/bash
G=`tput setaf 2`
C=`tput setaf 6`
Y=`tput setaf 3`
Q=`tput sgr0`

echo -e "${C}\n\nImage download script:${Q}"
echo -e "${C}pull_images.sh reads the images listed in images.txt, pulls them, and saves them into images.tar.gz\n\n${Q}"

# Clean up existing local images
#echo "${C}start: cleaning images${Q}"
#for rm_image in $(cat images.txt)
#do
#  docker rmi $aliNexus$rm_image
#done
#echo -e "${C}end: cleanup finished\n\n${Q}"

# Create the output directory
mkdir images

# pull
echo "${C}start: pulling images...${Q}"
for pull_image in $(cat images.txt)
do
  echo "${Y}pulling $pull_image ...${Q}"
  fileName=${pull_image//:/_}
  docker pull $pull_image
done
echo "${C}end: image pull finished...${Q}"

# save the images
IMAGES_LIST=($(docker images | sed '1d' | awk '{print $1}'))
IMAGES_NM_LIST=($(docker images | sed '1d' | awk '{print $1"-"$2}' | awk -F/ '{print $NF}'))
IMAGES_NUM=${#IMAGES_LIST[*]}
echo "Image list....."
docker images
#docker images | sed '1d' | awk '{print $1}'
for ((i=0; i<$IMAGES_NUM; i++))
do
  echo "saving ${IMAGES_LIST[$i]} image..."
  docker save "${IMAGES_LIST[$i]}" -o ./images/"${IMAGES_NM_LIST[$i]}".tar.gz
done
ls images
echo -e "${C}end: save finished\n\n${Q}"

# package the images
#tag_date=$(date "+%Y%m%d%H%M")
echo "${C}start: packaging images: images.tar.gz${Q}"
tar -czvf images.tar.gz images
echo -e "${C}end: packaging finished\n\n${Q}"

# Upload the package to OSS; if you have no OSS, replace this with any repository reachable from your intranet
#echo "${C}start: uploading images.tar.gz to OSS${Q}"
#ossutil64 cp images.tar.gz oss://aicloud-deploy/kubeflow-images/
#echo -e "${C}end: image package uploaded\n\n${Q}"

# Clean up images
read -p "${C}Clean up local images (Y/N, default N)?: ${Q}" is_clean
if [ -z "${is_clean}" ]; then
  is_clean="N"
fi
if [ "${is_clean}" == "Y" ]; then
  rm -rf images/*
  rm -rf images.tar.gz
  for clean_image in $(cat images.txt)
  do
    docker rmi $clean_image
  done
  echo -e "${C}cleanup finished~\n\n${Q}"
fi
echo -e "${C}done~\n\n${Q}"
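
One caveat of the script above is that its save step walks the entire local docker images list. If you cannot start from a clean machine, a variant that saves only what is listed in images.txt avoids packaging unrelated local images. This is just a sketch, assuming images.txt (created in the next step) contains one repo:tag or repo@digest reference per line:

#!/bin/bash
# Save only the images named in images.txt, ignoring any other local images.
mkdir -p images
while read -r image; do
  [ -z "$image" ] && continue                 # skip blank lines
  fileName=$(echo "$image" | tr '/:@' '___')  # e.g. gcr.io_ml-pipeline_frontend_1.0.4
  echo "saving $image ..."
  docker save "$image" -o "./images/${fileName}.tar"
done < images.txt
tar -czvf images.tar.gz images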

3. Edit images.txt, the list of images to download

# vim images.txt
quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-webhook:v0.11.0
gcr.io/istio-release/citadel:release-1.3-latest-daily
gcr.io/istio-release/proxyv2:release-1.3-latest-daily
gcr.io/istio-release/node-agent-k8s:release-1.3-latest-daily
gcr.io/istio-release/pilot:release-1.3-latest-daily
gcr.io/istio-release/mixer:release-1.3-latest-daily
gcr.io/istio-release/kubectl:release-1.3-latest-daily
quay.io/jetstack/cert-manager-controller:v0.11.0
gcr.io/istio-release/galley:release-1.3-latest-daily
gcr.io/istio-release/sidecar_injector:release-1.3-latest-daily
gcr.io/istio-release/proxy_init:release-1.3-latest-daily
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.4.1
python:3.7
metacontroller/metacontroller:v0.3.0
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/istio-release/proxy_init:release-1.3-latest-daily
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.4.1
python:3.7
metacontroller/metacontroller:v0.3.0
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
gcr.io/ml-pipeline/persistenceagent:1.0.4
gcr.io/ml-pipeline/scheduledworkflow:1.0.4
gcr.io/ml-pipeline/frontend:1.0.4
mpioperator/mpi-operator:latest
kubeflow/mxnet-operator:v1.0.0-2025
gcr.io/ml-pipeline/metadata-writer:1.0.4
gcr.io/ml-pipeline/visualization-server:1.0.4
gcr.io/kubeflow-images-public/notebook-controller:vmaster-g6eb007d0
gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g518f9c76
docker.io/seldonio/seldon-core-operator:1.4.0
gcr.io/kubeflow-images-public/tf_operator:vmaster-gda226016
gcr.io/kubeflow-images-public/admission-webhook:v0520-v0-139-gcee39dbc-dirty-0d8f4c
gcr.io/ml-pipeline/cache-server:1.0.4
mysql:8.0.3
gcr.io/ml-pipeline/minio:RELEASE.-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:5.6
gcr.io/kubeflow-images-public/metadata:v0.1.11
gcr.io/kubeflow-images-public/profile-controller:vmaster-ga49f658f
gcr.io/kubeflow-images-public/kfam:vmaster-g9f3bfd00
gcr.io/google_containers/spartakus-amd64:v1.1.0
argoproj/workflow-controller:v2.3.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0

4. Run the script

sh pull_images.sh

The following images are best pulled manually, one by one, and then tagged. Alternatively, you can modify the script above and delete everything except the pull section.

docker pull gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:ffa3d72ee6c2eeb2357999248191a643405288061b7080381f22875cb703e929
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:f89fd23889c3e0ca3d8e42c9b189dc2f93aa5b3a91c64e8aab75e952a210eeb3
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:b86ac8ecc6b2688a0e0b9cb68298220a752125d0a048b8edf2cf42403224393c
docker pull gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:e6b142c0f82e0e0b8cb670c11eb4eef6ded827f98761bbf4bea7bdb777b80092
docker pull gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:75c7918ca887622e7242ec1965f87036db1dc462464810b72735a8e64111f6f7
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:7e6df0fda229a13219bbc90ff72a10434a0c64cd7fe13dc534b914247d1087f4

5. Tag the images

# docker tag <image ID> <internal registry image>
docker tag 3208baba46fc aicloud-/library/serving/cmd/activator:v1.2.0
docker tag 4578f31842ab aicloud-/library/serving/cmd/autoscaler:v1.2.0
docker tag d1b481df9ac3 aicloud-/library/serving/cmd/webhook:v1.2.0
docker tag 9f8e41e19efb aicloud-/library/serving/cmd/controller:v1.2.0
docker tag 6749b4c87ac8 aicloud-/library/net-istio/cmd/webhook:v1.2.0
docker tag ba7fa40d9f88 aicloud-/library/net-istio/cmd/controller:v1.2.0
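
Images pulled by digest show up locally with a <none> tag, so the image IDs used above have to be read from the local image list. One way to find them (a small sketch) is:

# List the knative images together with their digests and image IDs.
docker images --digests | grep knative-releases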

6. Save the images

# docker save <image> -o <archive name>
docker save aicloud-/library/serving/cmd/activator:v1.2.0 -o activator-v1.2.0.tar.gz
docker save aicloud-/library/serving/cmd/autoscaler:v1.2.0 -o autoscaler-v1.2.0.tar.gz
....

Then push the image packages to an internal, self-hosted registry such as Harbor; that is what I do here. You could also load the packages directly onto the deployment servers, but you would have to do that on every node, otherwise a pod that restarts and gets scheduled to another node will fail to pull the image again.

If the machine where you pulled the images cannot reach the internal registry, you will also need to download the image packages locally and then upload them to the internal Harbor server. If your registry lives on Alibaba Cloud, this is even easier.

Below is the image push script; note that it also expects an images.txt file.

7. Create push_images.sh

vim push_images.sh
#!/bin/bash
G=`tput setaf 2`
C=`tput setaf 6`
Y=`tput setaf 3`
Q=`tput sgr0`

echo -e "${C}\n\nImage upload script:${Q}"
echo -e "${C}push_images.sh reads the image names in images.txt and pushes the images in images.tar.gz to the internal registry\n\n${Q}"

# Get the internal registry address
read -p "${C}Internal registry address (default aicloud-/library): ${Q}" nexusAddr
if [ -z "${nexusAddr}" ]; then
  nexusAddr="aicloud-/library"
fi
if [[ ${nexusAddr} =~ /$ ]]; then
  echo
else
  nexusAddr="${nexusAddr}/"
fi

tar -xzf images.tar.gz
cd images

# load
echo "${C}start: loading images${Q}"
for image_name in $(ls ./)
do
  echo -e "${Y}loading $image_name ...${Q}"
  docker load < ${image_name}
done
echo -e "${C}end: load finished...\n\n${Q}"

# push the images
echo "${C}start: pushing images to harbor...${Q}"
IMAGES_LIST=($(docker images | sed '1d' | awk '{print $1":"$2}'))
for push_image in $(docker images | sed '1d' | awk '{print $1":"$2}')
do
  echo -e "${Y}pushing $push_image ...${Q}"
  docker tag $push_image $nexusAddr/$push_image
  docker push $nexusAddr/$push_image
  echo "Image $nexusAddr/$push_image pushed..."
done
echo -e "${C}end: all images pushed\n\n${Q}"

8. Modify the image addresses in the Kubeflow manifests

First, download the whole project locally.

Project releases: https://github.com/kubeflow/manifests/releases

Download the v1.2.0 archive locally. The images referenced inside need to be changed, because most of them come from registries that are not reachable from the offline environment.

https://github.com/kubeflow/manifests/archive/v1.2.0.tar.gz

I open the project in IntelliJ IDEA here, which makes global search, replace, and editing convenient.

Extract the archive first to get the manifests-1.2.0 directory, then open it as a project in IDEA.

Use the shortcut Alt+Shift+R (or the corresponding menu action), then replace the original image addresses with the images you tagged and pushed to your internal registry.

Every image downloaded above needs to be replaced, which is quite a bit of work (a scripted alternative is sketched below).
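
If you would rather script the replacement than click through an IDE, something along these lines rewrites one image reference across the whole manifests tree. It is only a sketch: the OLD/NEW values are examples, the exact NEW path depends on how your images were named when pushed to the internal registry, and you would repeat or loop it for every entry in images.txt.

#!/bin/bash
# Replace one image reference in every YAML file under manifests-1.2.0.
OLD="gcr.io/ml-pipeline/frontend:1.0.4"
NEW="aicloud-/library/gcr.io/ml-pipeline/frontend:1.0.4"

grep -rl --include='*.yaml' --include='*.yml' "$OLD" manifests-1.2.0 | while read -r f; do
  sed -i "s|$OLD|$NEW|g" "$f"
done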

9. Package the project

Package and compress the project, then upload it to a repository that the deployment servers can reach with wget. I use Nexus here.

To keep the same packaging layout as the original archive, I first build a zip on my local Windows machine, upload it to a Linux server, unzip it there, and then repackage it as a tar archive.

rz manifests-1.2.0.zip    # upload the zip
mkdir manifests-1.2.0
mv manifests-1.2.0.zip manifests-1.2.0/
cd manifests-1.2.0/
unzip manifests-1.2.0.zip
rm -rf manifests-1.2.0.zip
cd ..
tar -czvf v1.2.0.tar.gz manifests-1.2.0/
curl -u test:*********** --upload-file ./v1.2.0.tar.gz /repository/public-ftp/kubernetes/package/kubeflow/manifests-1.2.0/

10. Create a working directory for Kubeflow

mkdir /apps/kubeflow
cd /apps/kubeflow

11. Create a StorageClass

# cat StorageClass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: nfs-client-provisioner   # name of the NFS provisioner controller running in our internal network; Alibaba Cloud NAS also works, as long as it is reachable
reclaimPolicy: Retain
parameters:
  archiveOnDelete: "true"

Then make it the default StorageClass:

# kubectl get sc
NAME         PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-client   nfs-client-provisioner   Retain          Immediate           false                  21h

# Setting the annotation to "false" disables it as the default
# kubectl patch storageclass nfs-client -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# kubectl get sc
NAME                   PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-client (default)   nfs-client-provisioner   Retain          Immediate           false                  21h

12. Edit the configuration file

vim kfctl_k8s_istio.v1.2.0.yaml
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  namespace: kubeflow
spec:
  applications:
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: namespaces/base
    name: namespaces
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: application/v3
    name: application
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/istio-1-3-1-stack
    name: istio-stack
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cluster-local-gateway-1-3-1
    name: cluster-local-gateway
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: istio/istio/base
    name: istio
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cert-manager-crds
    name: cert-manager-crds
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cert-manager-kube-system-resources
    name: cert-manager-kube-system-resources
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cert-manager
    name: cert-manager
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/add-anonymous-user-filter
    name: add-anonymous-user-filter
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: metacontroller/base
    name: metacontroller
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: admission-webhook/bootstrap/overlays/application
    name: bootstrap
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/spark-operator
    name: spark-operator
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes
    name: kubeflow-apps
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: knative/installs/generic
    name: knative
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: kfserving/installs/generic
    name: kfserving
  # Spartakus is a separate application so that kfctl can remove it
  # to disable usage reporting
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/spartakus
    name: spartakus
  repos:
  - name: manifests
    # Note: this must point to the repackaged project in which the image paths have already been replaced
    # uri: /kubeflow/manifests/archive/v1.2.0.tar.gz
    uri: http://aicloud-/repository/public-ftp/kubernetes/package/kubeflow/manifests-1.2.0/v1.2.0.tar.gz
  version: v1.2-branch

Log in to the deployment server and download the package you just uploaded:

wget http://aicloud-/repository/public-ftp/kubernetes/package/kubeflow/manifests-1.2.0/v1.2.0.tar.gz
tar -xzvf v1.2.0.tar.gz
cp kfctl_k8s_istio.v1.2.0.yaml ./manifests-1.2.0
cd manifests-1.2.0

13. Deploy

kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml

Check all namespaces:

# kubectl get pods -n cert-manager
# kubectl get pods -n istio-system
# kubectl get pods -n knative-serving
# kubectl get pods -n kubeflow

14. Access via a browser

Access the Kubeflow UI:

kubectl get svc/istio-ingressgateway -n istio-system
NAME                   TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                                                                                      AGE
istio-ingressgateway   NodePort   12.80.127.69   <none>        15020:32661/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP,15029:30345/TCP,15030:32221/TCP,15031:31392/TCP,15032:31191/TCP,15443:32136/TCP   5h14m

Since the service type is NodePort, it can be opened directly in a browser at:

<node IP>:31380

http://10.18.3.228:31380
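
If the NodePort assigned in your cluster is different, you can look it up instead of reading it off the service table; a small sketch:

# Print the NodePort that maps to the gateway's port 80.
kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}'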

15. Test

Create a Notebook Server in the UI.

Then check the pods again; three new pods will appear, and if they are all Running, everything is working.

# kubectl get pods -A
NAMESPACE   NAME                                               READY   STATUS    RESTARTS   AGE
anonymous   ml-pipeline-ui-artifact-ccf49557c-s5jk9            2/2     Running   0          4m48s
anonymous   ml-pipeline-visualizationserver-866f48bf7b-pfr4l   2/2     Running   0          4m48s
anonymous   test-0                                             2/2     Running   0          2m13s

IV. Deleting Kubeflow

kfctl delete -V -f kfctl_k8s_istio.v1.2.0.yaml

V. Troubleshooting

1. Startup hangs at cert-manager

application.app.k8s.io/cert-manager configured
WARN[0161] Encountered error applying application cert-manager: (kubeflow.error): Code 500 with message: Apply.Run: error when creating "/tmp/kout044650944": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request  filename="kustomize/kustomize.go:284"
WARN[0161] Will retry in 26 seconds.  filename="kustomize/kustomize.go:285"

Solution

First, check the pods:

# kubectl get pods -n cert-manager
NAME                                       READY   STATUS             RESTARTS   AGE
cert-manager-7c75b559c4-xmsp6              1/1     Running            0          3m46s
cert-manager-cainjector-7f964fd7b5-fnsg7   1/1     Running            0          3m46s
cert-manager-webhook-566dd99d6-fnchp       0/1     ImagePullBackOff   0          3m46s

# kubectl describe pod cert-manager-webhook-566dd99d6-fnchp -n cert-manager
Events:
  Type     Reason       Age                    From               Message
  ----     ------       ----                   ----               -------
  Normal   Scheduled    4m57s                  default-scheduler  Successfully assigned cert-manager/cert-manager-webhook-566dd99d6-fnchp to node6
  Warning  FailedMount  4m26s (x7 over 4m58s)  kubelet            MountVolume.SetUp failed for volume "certs": secret "cert-manager-webhook-tls" not found
  Warning  Failed       3m53s                  kubelet            Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown desc = Error response from daemon: Get https://quay.io/v2/: dial tcp 54.197.99.84:443: connect: connection refused
  Warning  Failed       3m37s                  kubelet            Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown desc = Error response from daemon: Get https://quay.io/v2/: dial tcp 54.156.10.58:443: connect: connection refused
  Normal   Pulling      3m9s (x3 over 3m53s)   kubelet            Pulling image "quay.io/jetstack/cert-manager-webhook:v0.11.0"
  Warning  Failed       3m9s (x3 over 3m53s)   kubelet            Error: ErrImagePull
  Warning  Failed       3m9s                   kubelet            Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown desc = Error response from daemon: Get https://quay.io/v2/: dial tcp 52.4.104.248:443: connect: connection refused
  Normal   BackOff      2m58s (x4 over 3m52s)  kubelet            Back-off pulling image "quay.io/jetstack/cert-manager-webhook:v0.11.0"
  Warning  Failed       2m46s (x5 over 3m52s)  kubelet            Error: ImagePullBackOff

As you can see, the problem is that the image cannot be pulled.

Once the image address issue is resolved, delete this pod so it gets recreated:

kubectl delete pod cert-manager-webhook-566dd99d6-fnchp -n cert-manager

2. After deleting Kubeflow, PVCs are stuck in Terminating

# kubectl get pvc -n kubeflow
NAME             STATUS        VOLUME                                     CAPACITY   ACCESSMODES   STORAGECLASS   AGE
metadata-mysql   Terminating   pvc-4fe5c5f2-a187-4200-95c3-33de0c01f781   10Gi       RWO           nfs-client     23h
minio-pvc        Terminating   pvc-cd2dc964-a448-4c68-b0bb-5bc2183e5203   20Gi       RWO           nfs-client     23h
mysql-pv-claim   Terminating   pvc-514407db-00bd-4767-8043-a31b1a70e47f   20Gi       RWO           nfs-client     23h

Solution

# kubectl patch pvc metadata-mysql -p '{"metadata":{"finalizers":null}}' -n kubeflow
persistentvolumeclaim/metadata-mysql patched
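
To clear all of the stuck PVCs at once rather than one at a time, a loop along these lines should work. It is only a sketch: it strips the finalizers from every PVC in the kubeflow namespace, so only run it when you really intend to delete them all.

for pvc in $(kubectl get pvc -n kubeflow -o jsonpath='{.items[*].metadata.name}'); do
  kubectl patch pvc "$pvc" -n kubeflow -p '{"metadata":{"finalizers":null}}'
done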

After Kubeflow is deleted, the PVs are not removed automatically:

# kubectl get pv
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS     CLAIM                      STORAGECLASS   REASON   AGE
jenkins-home                               60Gi       RWO           Retain          Bound      infrastructure/jenkins     jenkins-home            9d
pvc-0860e679-dd0b-48fc-8326-8a4c993410e6   20Gi       RWO           Retain          Released   kubeflow/minio-pvc         nfs-client              16m
pvc-13e06aac-f688-4d89-a467-93e5c6d6ecf6   20Gi       RWO           Retain          Released   kubeflow/mysql-pv-claim    nfs-client              16m
pvc-3e495907-53c4-468e-9aad-426c2f3e0851   10Gi       RWO           Retain          Released   kubeflow/katib-mysql       nfs-client              16m
pvc-3f59b851-0429-4e75-929b-33c05f8af66f   20Gi       RWO           Retain          Released   kubeflow/mysql-pv-claim    nfs-client              7h42m
pvc-5da0ac9b-c1c4-4aa1-b9ff-128174fe152c   10Gi       RWO           Retain          Released   kubeflow/metadata-mysql    nfs-client              7h42m
pvc-749f2098-8ba2-469c-8d78-f5889e24a9d4   5Gi        RWO           Retain          Released   anonymous/workspace-test   nfs-client              7h35m
pvc-94e61c9f-0b9c-4589-9e33-efb885c84233   20Gi       RWO           Retain          Released   kubeflow/minio-pvc         nfs-client              7h42m
pvc-a291c901-f2be-4994-b0d4-d83341879c3b   10Gi       RWO           Retain          Released   kubeflow/metadata-mysql    nfs-client              16m
pvc-a657f4c5-abce-47b4-8474-4ee4e60826b9   10Gi       RWO           Retain          Released   kubeflow/katib-mysql       nfs-client              7h42m

They need to be deleted manually:

# kubectl delete pv pvc-0860e679-dd0b-48fc-8326-8a4c993410e6
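
To remove every Released PV in one go, a sketch like the following can be used; review the output of kubectl get pv first, since deleting PVs is irreversible:

# Delete all PVs whose STATUS column (field 5) is "Released".
kubectl get pv --no-headers | awk '$5 == "Released" {print $1}' | xargs -r kubectl delete pv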

If pods such as katib-db, katib-mysql, or metadata-grpc-deployment end up Pending or fail to initialize, the most likely cause is that their persistent volumes were not mounted. Use kubectl describe on the pod to see the exact error.

Check whether the PVs and PVCs are bound:

# kubectl get pvc -A
# kubectl get pv

= End of article =

You are welcome to follow the WeChat official account: 持续交付实践指南

Previous posts:

Exposing services from a Kubernetes cluster with Traefik

Going cloud native in seven hours: what we went through
