katib-db-manager fails update-status due to health check on workload container #654

Closed
Sponge-Bas opened this issue Aug 4, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@Sponge-Bas

In test run https://solutions.qa.canonical.com/testruns/4c1c6e4a-c895-4c2b-88d9-16d2b109d511/, which deploys ckf 1.7/stable on ck8s 1.24 (focal) on AWS, the installation fails with the following juju status:

App                        Version                Status   Scale  Charm                    Channel         Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b  active       1  admission-webhook        1.7/stable      205  10.152.183.176  no       
argo-controller            res:oci-image@669ebd5  active       1  argo-controller          3.3/stable      236                  no       
argo-server                res:oci-image@576d038  active       1  argo-server              3.3/stable      185                  no       
dex-auth                                          active       1  dex-auth                 2.31/stable     224  10.152.183.56   no       
istio-ingressgateway                              active       1  istio-gateway            1.16/stable     551  10.152.183.221  no       
istio-pilot                                       active       1  istio-pilot              1.16/stable     551  10.152.183.141  no       
jupyter-controller         res:oci-image@1167186  active       1  jupyter-controller       1.7/stable      607                  no       
jupyter-ui                                        active       1  jupyter-ui               1.7/stable      534  10.152.183.4    no       
katib-controller           res:oci-image@111495a  active       1  katib-controller         0.15/stable     282  10.152.183.77   no       
katib-db                                          waiting      1  mysql-k8s                8.0/stable       75  10.152.183.54   no       installing agent
katib-db-manager                                  waiting      1  katib-db-manager         0.15/stable     253  10.152.183.219  no       installing agent
katib-ui                                          active       1  katib-ui                 0.15/stable     267  10.152.183.180  no       
kfp-api                                           waiting      1  kfp-api                  2.0/stable      540  10.152.183.49   no       installing agent
kfp-db                                            waiting      1  mysql-k8s                8.0/stable       75  10.152.183.155  no       installing agent
kfp-persistence                                   waiting      1  kfp-persistence          2.0/stable      500                  no       Waiting for kfp-api relation data
kfp-profile-controller     res:oci-image@b26a126  active       1  kfp-profile-controller   2.0/stable      478  10.152.183.233  no       
kfp-schedwf                res:oci-image@68cce0a  active       1  kfp-schedwf              2.0/stable      515                  no       
kfp-ui                                            waiting      1  kfp-ui                   2.0/stable      504                  no       Waiting for kfp-api relation data
kfp-viewer                 res:oci-image@c0f065d  active       1  kfp-viewer               2.0/stable      517                  no       
kfp-viz                    res:oci-image@3de6f3c  active       1  kfp-viz                  2.0/stable      476  10.152.183.243  no       
knative-eventing                                  active       1  knative-eventing         1.8/stable      224  10.152.183.150  no       
knative-operator                                  active       1  knative-operator         1.8/stable      199  10.152.183.157  no       
knative-serving                                   active       1  knative-serving          1.8/stable      224  10.152.183.175  no       
kserve-controller                                 active       1  kserve-controller        0.10/stable     267  10.152.183.90   no       
kubeflow-dashboard                                active       1  kubeflow-dashboard       1.7/stable      307  10.152.183.85   no       
kubeflow-profiles                                 active       1  kubeflow-profiles        1.7/stable      269  10.152.183.200  no       
kubeflow-roles                                    active       1  kubeflow-roles           1.7/stable      113  10.152.183.93   no       
kubeflow-volumes           res:oci-image@d261609  active       1  kubeflow-volumes         1.7/stable      178  10.152.183.165  no       
metacontroller-operator                           active       1  metacontroller-operator  2.0/stable      117  10.152.183.116  no       
minio                      res:oci-image@1755999  active       1  minio                    ckf-1.7/stable  186  10.152.183.241  no       
oidc-gatekeeper            res:oci-image@6b720b8  active       1  oidc-gatekeeper          ckf-1.7/stable  176  10.152.183.159  no       
seldon-controller-manager                         active       1  seldon-core              1.15/stable     457  10.152.183.227  no       
tensorboard-controller     res:oci-image@c52f7c2  active       1  tensorboard-controller   1.7/stable      156  10.152.183.178  no       
tensorboards-web-app       res:oci-image@929f55b  active       1  tensorboards-web-app     1.7/stable      158  10.152.183.101  no       
training-operator                                 active       1  training-operator        1.6/stable      215  10.152.183.181  no       

Unit                          Workload  Agent  Address          Ports              Message
admission-webhook/0*          active    idle   192.168.68.201   4443/TCP           
argo-controller/0*            active    idle   192.168.68.233                      
argo-server/0*                active    idle   192.168.68.204   2746/TCP           
dex-auth/0*                   active    idle   192.168.68.200                      
istio-ingressgateway/0*       active    idle   192.168.68.202                      
istio-pilot/0*                active    idle   192.168.68.203                      
jupyter-controller/0*         active    idle   192.168.184.70                      
jupyter-ui/0*                 active    idle   192.168.184.69                      
katib-controller/0*           active    idle   192.168.68.210   443/TCP,8080/TCP   
katib-db-manager/0*           error     idle   192.168.68.206                      hook failed: "update-status"
katib-db/0*                   blocked   idle   192.168.184.72                      Unable to configure instance
katib-ui/0*                   active    idle   192.168.229.203                     
kfp-api/0*                    waiting   idle   192.168.68.207                      Waiting for relational-db data
kfp-db/0*                     blocked   idle   192.168.229.204                     Unable to configure instance
kfp-persistence/0*            waiting   idle                                       Waiting for kfp-api relation data
kfp-profile-controller/0*     active    idle   192.168.229.216  80/TCP             
kfp-schedwf/0*                active    idle   192.168.229.213                     
kfp-ui/0*                     waiting   idle                                       Waiting for kfp-api relation data
kfp-viewer/0*                 active    idle   192.168.68.226                      
kfp-viz/0*                    active    idle   192.168.229.214  8888/TCP           
knative-eventing/0*           active    idle   192.168.68.208                      
knative-operator/0*           active    idle   192.168.68.214                      
knative-serving/0*            active    idle   192.168.68.209                      
kserve-controller/0*          active    idle   192.168.184.73                      
kubeflow-dashboard/0*         active    idle   192.168.68.213                      
kubeflow-profiles/0*          active    idle   192.168.68.216                      
kubeflow-roles/0*             active    idle   192.168.68.211                      
kubeflow-volumes/0*           active    idle   192.168.68.232   5000/TCP           
metacontroller-operator/0*    active    idle   192.168.68.212                      
minio/0*                      active    idle   192.168.184.83   9000/TCP,9001/TCP  
oidc-gatekeeper/0*            active    idle   192.168.68.234   8080/TCP           
seldon-controller-manager/0*  active    idle   192.168.184.74                      
tensorboard-controller/0*     active    idle   192.168.184.84   9443/TCP           
tensorboards-web-app/0*       active    idle   192.168.184.82   5000/TCP           
training-operator/0*          active    idle   192.168.229.205        

Looking at the pod logs (which can be downloaded here), it looks like a health check failed to run:

2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Error in sys.excepthook:
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Traceback (most recent call last):
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     self.emit(record)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/log.py", line 41, in emit
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     self.model_backend.juju_log(record.levelname, self.format(record))
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     return fmt.format(record)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     record.exc_text = self.formatException(record.exc_info)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     traceback.print_exception(ei[0], ei[1], tb, None, sio)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     for line in TracebackException(
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/traceback.py", line 617, in format
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     yield from self.format_exception_only()
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     stype = smod + '.' + stype
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Original exception was:
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Traceback (most recent call last):
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 366, in _refresh_status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     check = self._get_check_status()
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 360, in _get_check_status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     return self.container.get_check("katib-db-manager-up").status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 1980, in get_check
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     raise ModelError(f'check {check_name!r} not found')
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status ops.model.ModelError: check 'katib-db-manager-up' not found
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status The above exception was the direct cause of the following exception:
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Traceback (most recent call last):
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 430, in <module>
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     main(KatibDBManagerOperator)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/main.py", line 441, in main
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     _emit_charm_event(charm, dispatcher.event_name)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     event_to_emit.emit(*args, **kwargs)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/framework.py", line 354, in emit
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     framework._emit(event)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/framework.py", line 830, in _emit
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     self._reemit(event_path)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/framework.py", line 919, in _reemit
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     custom_handler(event)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 381, in _on_update_status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     self._refresh_status()
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 368, in _refresh_status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     raise GenericCharmRuntimeError(
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status <unknown>GenericCharmRuntimeError: Failed to run health check on workload container

More logs and configs can be found here: https://oil-jenkins.canonical.com/artifacts/4c1c6e4a-c895-4c2b-88d9-16d2b109d511/index.html
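For reference, here is a minimal sketch of the failing code path, reconstructed from the traceback above. The `_on_update_status`/`_refresh_status`/`_get_check_status` names and the `katib-db-manager-up` check name come from the log; the container name and the fallback handling of `ModelError` are assumptions, not the charm's actual source:

```python
from ops.charm import CharmBase
from ops.main import main
from ops.model import ActiveStatus, MaintenanceStatus, ModelError
from ops.pebble import CheckStatus


class KatibDBManagerOperator(CharmBase):
    """Sketch of the update-status path that errors above; not the real charm code."""

    def __init__(self, *args):
        super().__init__(*args)
        # "katib-db-manager" as the container name is an assumption.
        self.container = self.unit.get_container("katib-db-manager")
        self.framework.observe(self.on.update_status, self._on_update_status)

    def _on_update_status(self, _event):
        self._refresh_status()

    def _get_check_status(self):
        # This is the call that raises in the log above: ops raises
        # ModelError("check 'katib-db-manager-up' not found") when the check
        # is missing from the container's Pebble plan.
        return self.container.get_check("katib-db-manager-up").status

    def _refresh_status(self):
        try:
            check = self._get_check_status()
        except ModelError:
            # Hypothetical fallback: instead of wrapping this into
            # GenericCharmRuntimeError (which fails the hook, as seen above),
            # report a non-error status and retry on the next update-status.
            self.unit.status = MaintenanceStatus("waiting for workload health check")
            return
        if check == CheckStatus.UP:
            self.unit.status = ActiveStatus()
        else:
            self.unit.status = MaintenanceStatus("workload health check is failing")


if __name__ == "__main__":
    main(KatibDBManagerOperator)
```

Presumably the check never got installed because the workload could not start (katib-db is blocked on "Unable to configure instance"), so update-status keeps erroring until a Pebble layer containing the check is applied.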

@orfeas-k
Contributor

orfeas-k commented Aug 9, 2023

This looks like exactly the same issue as #631. Since I just responded there, I'll copy my response over here too:

There is a known issue with mysql-k8s-operator that has been fixed but not yet published to the 8.0/stable channel (you can view the published revisions here). Could you please confirm that deploying 1.7/edge (which uses the mysql-k8s edge channel) actually solves this issue for you?

orfeas-k added the bug (Something isn't working) label Aug 9, 2023
@orfeas-k
Contributor

Note that we're pushing for this to be released to 8.0/stable, so using edge won't be needed.

@NohaIhab
Contributor

The fix in mysql-k8s-operator was released to 8.0/stable, so this can be closed now.
