Kubeflow pitfalls: troubleshooting notes

Kubeflow UI: Jupyter permission error

User None is not authorized to list … for namespace: anonymous

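I tried every fix suggested in the official issues and none of them worked. One of the issues mentions that the source can be switched to dev mode, which removes the authorization check.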

User None is not authorized to list … for namespace: anonymous · Issue #4731 · kubeflow/kubeflow

Jupyter web app source:

kubeflow/kubeflow

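# Edit the Jupyter kustomize manifest: add the dev-mode parameters (the FLASK_ENV variable and the command/args lines), then restart Jupyter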
#.cache/manifests/manifests-0.7-branch/jupyter/jupyter-web-app/base/deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment
spec:
  replicas: 1
  template:
    spec:
      containers:
      - env:
        - name: ROK_SECRET_NAME
          valueFrom:
            configMapKeyRef:
              name: parameters
              key: ROK_SECRET_NAME
        - name: UI
          valueFrom:
            configMapKeyRef:
              name: parameters
              key: UI
        - name: USERID_HEADER
          value: $(userid-header)
        - name: USERID_PREFIX
          value: $(userid-prefix)
        - name: FLASK_ENV
          value: development
        image: gcr.io/kubeflow-images-public/jupyter-web-app:v0.5.0
        imagePullPolicy: $(policy)
        command: ["python3", "main.py"]
        args: ["--dev"]
        name: jupyter-web-app
        ports:
        - containerPort: 5000
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
      serviceAccountName: service-account
      volumes:
      - configMap:
          name: config
        name: config-volume

# Restart
kustomize build | kubectl delete -f -
kustomize build | kubectl apply -f -
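
Before deleting and re-applying, it can help to confirm that the rendered manifest really carries the dev-mode settings. The following is a minimal sketch, assuming the dev-mode changes are the FLASK_ENV variable and the --dev argument shown in the manifest above; the function name has_dev_mode is hypothetical.

```shell
# Succeeds only if the manifest on stdin carries both dev-mode settings.
has_dev_mode() {
  manifest=$(cat)
  # The container must be started with the --dev argument ...
  printf '%s\n' "$manifest" | grep -q -- '--dev' || return 1
  # ... and FLASK_ENV must be followed by "value: development".
  printf '%s\n' "$manifest" | grep -A1 'name: FLASK_ENV' | grep -q 'value: development'
}
```

A possible use is `kustomize build | has_dev_mode && echo "dev mode is set"`, run from the same directory as the restart commands.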

Pipelines UI page error

Error: mysql_query failed: errno: 2006, error: MySQL server has gone away. Code: 13

Following the official issue, restarting the grpc-metadata pod fixed the problem.

Error: mysql_query failed: errno: 2006, error: MySQL server has gone away. Code: 13 · Issue #4604 · kubeflow/kubeflow

The cause is explained in:

mysql_query failed: errno: 2006, error: MySQL server has gone away · Issue #198 · kubeflow/metadata

Cannot connect to the notebook server

Sorry, /notebook is not a valid page #5010

Troubleshooting: the notebook is reachable when accessed directly through port-forward.

kubectl port-forward svc/kenwood-test -n anonymous 8080:80 --address 10.10.62.180

According to the official issue, the notebook-controller deployment hard-codes its parameters so that USE_ISTIO is not enabled.

Sorry, /notebook is not a valid page · Issue #5010 · kubeflow/kubeflow

Change the notebook-controller parameter

#.cache/manifests/manifests-0.7-branch/jupyter/notebook-controller/base/deployment.yaml
# Change the USE_ISTIO value to "true"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment
spec:
  template:
    spec:
      containers:
      - name: manager
        image: gcr.io/kubeflow-images-public/notebook-controller:v20190614-v0-160-g386f2749-e3b0c4
        command:
          - /manager
        env:
          - name: USE_ISTIO
            value: "true"
          - name: POD_LABELS
            value: $(POD_LABELS)
        imagePullPolicy: IfNotPresent
        livenessProbe:
          httpGet:
            path: /metrics
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
      serviceAccountName: service-account

Restart the notebook-controller

kustomize build | kubectl delete -f -
kustomize build | kubectl apply -f -

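Image issues

  1. Some images use the pull policy Always; it needs to be changed to IfNotPresent.

  2. Some images are referenced by sha256 digest; these need to be changed to tags.

  3. gcr.io images cannot be pulled directly, so I used a GitHub Action to sync them to Docker Hub.

    You can fork my project and adapt it: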

kenwoodjw/sync_gcr

SHA Digest used in knative-install · Issue #1521 · kubeflow/manifests
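
The pull-policy change from item 1 can be applied mechanically instead of editing each manifest by hand. This is a minimal sketch, assuming the policy appears unquoted on its own line as `imagePullPolicy: Always`; the function name relax_pull_policy is hypothetical, and kustomize variables such as `$(policy)` are deliberately left untouched.

```shell
# Rewrite "imagePullPolicy: Always" to IfNotPresent on stdin.
# Every other line, including other policy values, passes through unchanged.
relax_pull_policy() {
  sed 's/\(imagePullPolicy:[[:space:]]*\)Always[[:space:]]*$/\1IfNotPresent/'
}
```

One way to use it is as a filter on the rendered output, for example `kustomize build | relax_pull_policy | kubectl apply -f -`.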

Summary

  • The GFW (Great Firewall) is the real culprit: pulling images wasted a lot of time
  • Every fix came from the GitHub issues

KFServing model deployment

Complete Kubeflow tutorial: developing ML models, distributed training, and serving (完整 Kubeflow 使用教學 – 開發 ML 模型、進行分散式訓練與部署服務)

KFServing is built on Knative and Istio, so two versions of a model can be deployed side by side for canary deployment and A/B testing.

With Kubeflow v0.7, KNative 0.8 and Istio 1.1.6 are installed by default as part of the Kubeflow installation (the versions cannot be mixed).

From Kubeflow 1.0 onwards, KNative 0.11.1 and Istio 1.1.6 are installed by default.

kubeflow/kfserving

Change the Knative image tags

gcr.io/knative-releases/knative.dev/serving/cmd/activator:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler-hpa:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/networking/istio:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.8.0
gcr.io/knative-releases/knative.dev/serving/cmd/queue:v0.8.0
# Also sync the images referenced in the KFServing configmap
mcr.microsoft.com/onnxruntime/server:v0.5.0
gcr.io/kfserving/sklearnserver:0.2.0
gcr.io/kfserving/xgbserver:0.2.0
gcr.io/kfserving/pytorchserver:0.2.2
nvcr.io/nvidia/tensorrtserver:19.05-py3
gcr.io/kfserving/alibi-explainer:0.2.2
gcr.io/kfserving/storage-initializer:0.2.2
gcr.io/kfserving/logger:0.2.2
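
Mirroring the images listed above comes down to a pull, tag, and push per image. The sketch below only prints those commands rather than running them. The naming scheme (drop the registry host, flatten "/" to "-") and the account name `example` are assumptions for illustration and are not necessarily what kenwoodjw/sync_gcr does.

```shell
# Hypothetical Docker Hub account; override by setting MIRROR_USER.
MIRROR_USER=${MIRROR_USER:-example}

# Map a source image to a mirror name: drop the registry host, flatten "/" to "-".
mirror_name() {
  path=${1#*/}    # strip everything up to and including the first "/"
  printf '%s/%s\n' "$MIRROR_USER" "$(printf '%s' "$path" | tr '/' '-')"
}

# Print (do not run) the docker commands that would mirror one image.
plan_sync() {
  target=$(mirror_name "$1")
  printf 'docker pull %s\n' "$1"
  printf 'docker tag %s %s\n' "$1" "$target"
  printf 'docker push %s\n' "$target"
}
```

Calling `plan_sync gcr.io/kfserving/sklearnserver:0.2.0` prints the three commands for that image; once the plan looks right, the output can be piped to `sh` to run it.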