Kubeflow pitfalls: troubleshooting notes
Kubeflow UI: Jupyter permission error
User None is not authorized to list … for namespace: anonymous
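I tried every fix suggested in the official issues and none of them worked. One issue mentions that the source code supports a dev mode that disables the authorization check.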
User None is not authorized to list … for namespace: anonymous · Issue #4731 · kubeflow/kubeflow
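Jupyter web app source code
# Edit the Jupyter kustomize manifest, add the dev-mode parameters (highlighted in red in the original post), then restart Jupyter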
#.cache/manifests/manifests-0.7-branch/jupyter/jupyter-web-app/base/deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment
spec:
  replicas: 1
  template:
    spec:
      containers:
      - env:
        - name: ROK_SECRET_NAME
          valueFrom:
            configMapKeyRef:
              name: parameters
              key: ROK_SECRET_NAME
        - name: UI
          valueFrom:
            configMapKeyRef:
              name: parameters
              key: UI
        - name: USERID_HEADER
          value: $(userid-header)
        - name: USERID_PREFIX
          value: $(userid-prefix)
        - name: FLASK_ENV
          value: development
        image: gcr.io/kubeflow-images-public/jupyter-web-app:v0.5.0
        imagePullPolicy: $(policy)
        command: ["python3", "main.py"]
        args: ["--dev"]
        name: jupyter-web-app
        ports:
        - containerPort: 5000
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
      serviceAccountName: service-account
      volumes:
      - configMap:
          name: config
        name: config-volume
# Restart
kustomize build | kubectl delete -f -
kustomize build | kubectl apply -f -
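The delete-then-apply restart is used more than once in these notes, so it can be wrapped in a small helper. This is only a sketch: the function name `recreate_component` and its directory argument are hypothetical, and `--ignore-not-found` is an addition of mine so that the delete step does not fail when the resources do not exist yet.

```shell
# Re-create a component from its kustomize directory.
# Usage: recreate_component <kustomize-dir>
recreate_component() {
  local dir="$1"
  # Delete the currently deployed resources; missing ones are skipped.
  kustomize build "$dir" | kubectl delete --ignore-not-found -f -
  # Apply the edited manifests again.
  kustomize build "$dir" | kubectl apply -f -
}
```

For example, `recreate_component .` run from the directory that holds the component's kustomization file does the same thing as the two commands above.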
Pipelines UI page error
Error: mysql_query failed: errno: 2006, error: MySQL server has gone away. Code: 13

Following the official issue, restarting the grpc-metadata pod resolved the problem.
Explanation of the cause:
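One way to restart a pod is to delete it and let its Deployment create a replacement. The sketch below deletes every pod whose name matches a pattern. The helper name `restart_pods_matching` is hypothetical, and the pod name used in the usage note is an assumption that should be checked with `kubectl get pods` first.

```shell
# Delete all pods in a namespace whose name contains a pattern;
# the owning Deployment then starts fresh replacements.
# Usage: restart_pods_matching <namespace> <name-pattern>
restart_pods_matching() {
  local ns="$1" pattern="$2"
  # "-o name" prints one "pod/<name>" entry per line.
  kubectl -n "$ns" get pods -o name | grep -- "$pattern" | while read -r pod; do
    kubectl -n "$ns" delete "$pod"
  done
}
```

Assuming the pod lives in the `kubeflow` namespace and its name contains `metadata-grpc`, the call would be `restart_pods_matching kubeflow metadata-grpc`.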

mysql_query failed: errno: 2006, error: MySQL server has gone away · Issue #198 · kubeflow/metadata
Cannot connect to the notebook server
Sorry, /notebook is not a valid page #5010

Troubleshooting: the notebook is reachable through a direct port-forward:
kubectl port-forward svc/kenwood-test -n anonymous 8080:80 --address 10.10.62.180

According to the official issue, the notebook-controller deployment hard-codes its parameters so that USE_ISTIO is not enabled.
Sorry, /notebook is not a valid page · Issue #5010 · kubeflow/kubeflow
Modify the notebook-controller parameters
#.cache/manifests/manifests-0.7-branch/jupyter/notebook-controller/base/deployment.yaml
# Change the USE_ISTIO value to true
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment
spec:
  template:
    spec:
      containers:
      - name: manager
        image: gcr.io/kubeflow-images-public/notebook-controller:v20190614-v0-160-g386f2749-e3b0c4
        command:
        - /manager
        env:
        - name: USE_ISTIO
          value: "true"
        - name: POD_LABELS
          value: $(POD_LABELS)
        imagePullPolicy: IfNotPresent
        livenessProbe:
          httpGet:
            path: /metrics
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
      serviceAccountName: service-account
Restart the notebook-controller
kustomize build | kubectl delete -f -
kustomize build | kubectl apply -f -
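Image issues
- Some images use the pull policy Always and need to be changed to IfNotPresent.
- Some images are referenced by sha256 digest and need to be changed to a tag.
- gcr images cannot be pulled directly, so I used a GitHub Action to sync them to Docker Hub. You can fork my project and adapt it.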
SHA Digest used in knative-install · Issue #1521 · kubeflow/manifests
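The first two items can be partly automated. The sketch below rewrites the pull policy in place and lists the images that are still pinned by digest, so they can then be switched to tags by hand. The function names are hypothetical, `sed -i` is used in its GNU form (no backup suffix), and the directory to scan is whatever manifests checkout you are working in.

```shell
# Rewrite "imagePullPolicy: Always" to IfNotPresent in every YAML file under a directory.
# Assumes GNU sed; BSD/macOS sed needs a suffix argument after -i.
patch_pull_policy() {
  local dir="$1"
  find "$dir" -type f \( -name '*.yaml' -o -name '*.yml' \) \
    -exec sed -i 's/imagePullPolicy: Always/imagePullPolicy: IfNotPresent/' {} +
}

# Print each distinct image reference that is pinned by sha256 digest.
list_digest_images() {
  local dir="$1"
  grep -rhoE --include='*.yaml' --include='*.yml' \
    '[A-Za-z0-9./_-]+@sha256:[a-f0-9]{64}' "$dir" | sort -u
}
```

Running `list_digest_images` on the manifests directory before and after editing gives a quick check that no digest references remain.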
Summary
- Curse the GFW: pulling images wasted a lot of time
- Every fix came from solutions posted in GitHub issues
KFServing model deployment
Complete Kubeflow Tutorial: Developing ML Models, Distributed Training, and Deploying Services
KFServing is built on Knative and Istio, so two versions of a model can be deployed side by side for a canary deployment or an A/B test.
In Kubeflow v0.7, KNative 0.8 and Istio 1.1.6 are installed by default as part of the Kubeflow installation. (The versions cannot be mixed.)
From Kubeflow 1.0 onwards, KNative 0.11.1 and Istio 1.1.6 are installed by default.
Change the Knative image tags
gcr.io/knative-releases/knative.dev/serving/cmd/activator:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler-hpa:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/networking/istio:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/queue:v0.8.0
# Images referenced in the KFServing configmap that need to be synced
mcr.microsoft.com/onnxruntime/server:v0.5.0
gcr.io/kfserving/sklearnserver:0.2.0
gcr.io/kfserving/xgbserver:0.2.0
gcr.io/kfserving/pytorchserver:0.2.2
nvcr.io/nvidia/tensorrtserver:19.05-py3
gcr.io/kfserving/alibi-explainer:0.2.2
gcr.io/kfserving/storage-initializer:0.2.2
gcr.io/kfserving/logger:0.2.2
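The core of such a sync is a pull, retag, push cycle per image. This is a minimal sketch, not the author's GitHub Action: the function names, the Docker Hub user argument, and the naming scheme (drop the registry host, join the remaining path with dashes) are all assumptions made for illustration.

```shell
# Map a source image to a Docker Hub name: drop the registry host
# and flatten the remaining path with dashes so names stay unique.
mirror_name() {
  local src="$1" user="$2"
  local rest="${src#*/}"   # everything after the first "/"
  printf '%s/%s\n' "$user" "$(printf '%s' "$rest" | tr '/' '-')"
}

# Pull one image, retag it for Docker Hub, and push it.
# Requires a prior "docker login" for the target user.
mirror_image() {
  local src="$1" user="$2" dst
  dst=$(mirror_name "$src" "$user")
  docker pull "$src" && docker tag "$src" "$dst" && docker push "$dst"
}
```

For example, `mirror_image gcr.io/kfserving/logger:0.2.2 myuser` would push `myuser/kfserving-logger:0.2.2`. The manifests and the configmap then have to point at the mirrored names.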