Clouddriver in Spinnaker 1.6.0 is more fragile


#1

In our Spinnaker setup we use many Kubernetes accounts, and sometimes some of them become outdated. This wasn’t a problem with Spinnaker 1.5.4 (Clouddriver 1.0.4-20180110144440), since the Clouddriver health endpoint returned OK and Clouddriver was up and running (even though some accounts had outdated keys):

bash-4.3# curl http://localhost:7002/health
{"status":"UP"}

After upgrading to Spinnaker 1.6.0 (Clouddriver 2.0.0-20180221152902) we noticed that an error in even a single Kubernetes account causes the health endpoint to fail:

bash-4.4# curl -m 10 http://localhost:7002/health
{"error":"Internal Server Error","exception":"com.netflix.spinnaker.clouddriver.kubernetes.v1.deploy.exception.KubernetesOperationException","message":"Get Namespace kuba-test for account spinnaker-bolcom-stg-kuba-test-f65 failed: Unauthorized! Token may have expired! Please log-in again. Unauthorized","status":500,"timestamp":1521210238133}

This in turn makes Kubernetes think that Clouddriver is not healthy, and Spinnaker stops functioning.
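To make the difference between the two responses concrete, here is a minimal Python sketch of how a probe script might interpret them. The function name `summarize_health` is made up for illustration, and the rule "HTTP 200 plus `"status":"UP"` means healthy" is an assumption based only on the two responses shown above, not on Clouddriver's documented contract.

```python
import json
from typing import Tuple


def summarize_health(status_code: int, body: str) -> Tuple[bool, str]:
    """Return (healthy, detail) for a /health response.

    Assumption: healthy means HTTP 200 with {"status": "UP"}, as in the
    1.5.4 output above. Anything else is treated as unhealthy.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, f"HTTP {status_code}: unparseable body"
    if not isinstance(payload, dict):
        return False, f"HTTP {status_code}: unexpected body"
    if status_code == 200 and payload.get("status") == "UP":
        return True, "UP"
    # The 1.6.0 error body carries the failing account in "message".
    detail = payload.get("message") or payload.get("error") or "unknown error"
    return False, f"HTTP {status_code}: {detail}"


# Bodies shaped like the two responses in this post (the second is shortened).
print(summarize_health(200, '{"status":"UP"}'))
print(summarize_health(
    500,
    '{"error":"Internal Server Error","message":"Unauthorized","status":500}',
))
```

A wrapper like this only surfaces which account is failing; it does not change what Clouddriver itself reports to the Kubernetes probe.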

The questions are:

  1. Is this change of behavior intended?
  2. If so, is there a way to override it?

#2

I don’t think this change was introduced intentionally - it seems like a bug. @ethanfrogers do you know if any changes were made around account health in the past few months?


#3

@lwander @wheleph I don’t think there have been any changes around account health in the last few months. I’m pretty sure I saw this back in September as well.

We did bump the V1 Provider client library in 1.6.X. I wonder if the errors themselves are related to that? We have been running that change in production since it was made, though.

But the fact that one unhealthy account takes down the service is definitely a bug, and I think it has been around for a little while. I just can’t find any issues about it at the moment.


#4

I’ll also add that the behavior is not intended. Clouddriver should be able to handle partial unavailability (network blip, etc) of Kubernetes.
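As a rough illustration of what "handling partial unavailability" could mean, here is a Python sketch that runs each account's check in isolation so one failure cannot abort the rest. This is not Clouddriver code (Clouddriver is a JVM service); `check_accounts` and the two sample checks are hypothetical names used only to show the pattern.

```python
from typing import Callable, Dict, Optional


def check_accounts(checks: Dict[str, Callable[[], None]]) -> Dict[str, Optional[str]]:
    """Run every account check; map account name to an error string or None.

    Each check is wrapped in its own try/except, so an exception from one
    account (expired token, network blip) is recorded instead of propagated.
    """
    results: Dict[str, Optional[str]] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = None
        except Exception as exc:  # deliberately broad: isolate any failure
            results[name] = str(exc)
    return results


def healthy_account() -> None:
    """Hypothetical check that succeeds."""


def expired_account() -> None:
    """Hypothetical check that fails like the account in this thread."""
    raise RuntimeError("Unauthorized! Token may have expired!")


results = check_accounts({"prod": healthy_account, "kuba-test": expired_account})
print(results)
```

The point of the design is that the caller gets a full per-account picture and can decide separately what overall status to report.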


#5

I thought I had a bug out for this problem, but I’m having trouble finding it.

I haven’t seen the problem @wheleph is describing (I’m currently working on deploying 1.6), but in previous versions I’ve seen that with a bad kubeconfig (in my case, a missing kubeconfig file), Clouddriver’s /health would return OK while the single broken account resulted in 500s for e.g. /credentials. Deck dealt with this poorly and basically wouldn’t render anything.

IMO the semantics of /health are too vague with multiple accounts. When would you want it to start failing? When 1/100 accounts are broken? 50/100?
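One way to make that question explicit is to turn the cut-off into a parameter. The Python sketch below is purely illustrative: `aggregate_health` and `max_failed_fraction` are invented names, the default of 0.5 is arbitrary, and nothing here reflects an existing Clouddriver setting.

```python
from typing import Dict


def aggregate_health(account_ok: Dict[str, bool], max_failed_fraction: float = 0.5) -> str:
    """Report "UP" unless more than max_failed_fraction of accounts are broken.

    Assumption: with no accounts configured there is nothing to fail,
    so the service is reported as "UP".
    """
    if not account_ok:
        return "UP"
    failed = sum(1 for ok in account_ok.values() if not ok)
    return "UP" if failed / len(account_ok) <= max_failed_fraction else "DOWN"


# 1 broken account out of 100.
one_broken = {f"account-{i}": i != 0 for i in range(100)}
print(aggregate_health(one_broken, max_failed_fraction=0.0))  # strictest policy
print(aggregate_health(one_broken, max_failed_fraction=0.5))  # tolerant policy
```

Setting the fraction to 0.0 reproduces the "any broken account fails the service" behavior described in this thread, while 1.0 never fails on account errors at all.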

My solution was going to be to shard clouddriver by purpose (account, readonly, cacher), but that seems like a ton of work.


#6

I see the same problem when following the setup guide for Kubernetes V2 (manifest based) with Google as the storage provider (our Kubernetes infrastructure is also the one managed by Google).

The question is: how can one debug it? (I’m really eager to try the new manifest-based solution.)
(For reference: the Helm-based setup from Google’s site works fine, so my local/remote setup is probably OK in general. Moving to hal with all those pages of setup, I probably missed something.)


#7

All of a sudden we found a similar error with Spinnaker 1.5.4.

The difference is that when there are outdated credentials, Clouddriver starts but the Deck UI breaks (the list of applications is empty). The log is full of messages like:

spin-clouddriver-v031-bdcgb spin-clouddriver Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://35.204.150.142/api/v1/namespaces/ws-mzeijen/services. Message: Unauthorized! Token may have expired! Please log-in again. Unauthorized.

After we removed the faulty account and redeployed Spinnaker the issue was gone.
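For anyone trying to work out which account is the faulty one, a small Python sketch like the following could tally the Unauthorized errors per API server and namespace. The regular expressions assume the exact message shape of the fabric8 log line quoted above, and `unauthorized_targets` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple
from urllib.parse import urlparse

# Assumption: log lines look like the fabric8 message quoted in this post.
FAILURE = re.compile(r"Failure executing: \w+ at: (https?://\S+)\. Message: Unauthorized")
NAMESPACE = re.compile(r"/namespaces/([^/]+)")


def unauthorized_targets(lines: Iterable[str]) -> Counter:
    """Count Unauthorized failures keyed by (API server host, namespace)."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match is None:
            continue
        url = match.group(1)
        ns_match = NAMESPACE.search(url)
        namespace: Optional[str] = ns_match.group(1) if ns_match else None
        key: Tuple[str, Optional[str]] = (urlparse(url).netloc, namespace)
        counts[key] += 1
    return counts


sample = (
    "spin-clouddriver-v031-bdcgb spin-clouddriver Caused by: "
    "io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: "
    "GET at: https://35.204.150.142/api/v1/namespaces/ws-mzeijen/services. "
    "Message: Unauthorized! Token may have expired! Please log-in again. Unauthorized."
)
for (host, namespace), count in unauthorized_targets([sample]).items():
    print(host, namespace, count)
```

Mapping the API server host back to a Spinnaker account name still has to be done by hand against the kubeconfig.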


#8

Debugging my installation: yes, it was also an issue with credentials.

That said, I think the installation instructions could be clearer.

What happened

Looking at the secrets created/used, I see that the installation just copied ~/.kube/config. However, it had my personal credentials there, not service account credentials. Worse still, the actual credentials are not shown in the file, i.e.

expiry-key: '{.credential.token_expiry}'
token-key: '{.credential.access_token}'

So using that file was not going to work.

What I did: I followed this thread, got the token for the service, base64-decoded it, and put it in the config as a “spinnaker-context”, making sure that the current context is set to that newly created context ("kubectl config set-context spinnaker-context"). Now Clouddriver is up.
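Roughly, the resulting config has the shape sketched below in Python. Only the context name "spinnaker-context" comes from what I did; the cluster name, user name, server URL and the helper `build_kubeconfig` are placeholders invented for illustration. The token stored in a Kubernetes Secret is base64-encoded, which is why it is decoded before being placed in the user entry.

```python
import base64
import json
from typing import Any, Dict


def build_kubeconfig(server: str, ca_data_b64: str, token_b64: str,
                     context_name: str = "spinnaker-context") -> Dict[str, Any]:
    """Build a kubeconfig structure that authenticates with a bearer token.

    cluster_name and user_name are hypothetical labels; any consistent
    names work as long as the context refers to them.
    """
    token = base64.b64decode(token_b64).decode("utf-8")
    cluster_name = "spinnaker-cluster"       # placeholder
    user_name = "spinnaker-service-account"  # placeholder
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": cluster_name,
            # The CA bundle stays base64-encoded in this field.
            "cluster": {"server": server, "certificate-authority-data": ca_data_b64},
        }],
        "users": [{"name": user_name, "user": {"token": token}}],
        "contexts": [{
            "name": context_name,
            "context": {"cluster": cluster_name, "user": user_name},
        }],
        "current-context": context_name,
    }


# Fake inputs for illustration; real values come from the service account's Secret.
fake_token_b64 = base64.b64encode(b"header.payload.signature").decode("ascii")
config = build_kubeconfig("https://kubernetes.example.invalid", "Q0E=", fake_token_b64)
print(json.dumps(config, indent=2))
```

Unlike the copied personal config, a file like this contains the credential itself rather than a reference to a local auth helper, so it keeps working inside the Clouddriver pod.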

Hope that helps

However, I still can’t use the service.

  1. hal deploy connect fails with “! ERROR Error encountered running script. See above output for more details.” :) and the above output shows no errors.
  2. If I just open ports 9000 to Deck and 8084 to Gate, then localhost:9000/ simply shows me the Spinnaker UI with no apps. Clicking on apps or creating apps simply shows me a spinning circle or its equivalent.
  3. As a debug measure: curl http://localhost:8084/applications gives me a list of apps (yet somehow that is not enough for the UI).

#9

Well, this thread seems to solve my installation problems. Basically: kubectl apply -f https://spinnaker.io/downloads/kubernetes/quick-install.yml


#10

I observe the same issue in Spinnaker 1.6.1.

Submitted the issue: https://github.com/spinnaker/spinnaker/issues/2683