Clouddriver caching rate limits


#1

Hi all,

I’m experiencing strange issues with Spinnaker 1.10.5 on GKE 1.10.11-gke.1.

I’ve got the HA setup described at https://www.spinnaker.io/reference/halyard/high-availability/ running, with 4 caching clouddrivers that occasionally have problems starting up. The startup logs look like this:

Checking permissions on configured kinds for account some-k8s-v2-serviceaccount... [apiService, clusterRole, clusterRoleBinding, configMap, controllerRevision, customResourceDefinition, cronJob, daemonSet, deployment, event, horizontalpodautoscaler, ingress, job, mutatingWebhookConfiguration, namespace, networkPolicy, persistentVolume, persistentVolumeClaim, pod, podPreset, podSecurityPolicy, podDisruptionBudget, replicaSet, role, roleBinding, secret, service, serviceAccount, statefulSet, storageClass, validatingWebhookConfiguration, none]
08:00:51.154  INFO 1 --- [           main] c.n.s.c.k.v.s.KubernetesV2Credentials    : Checking if apiService is readable...
08:02:35.278  INFO 1 --- [           main] c.n.s.c.k.v.s.KubernetesV2Credentials    : Kind 'apiService' will not be cached in account 'some-k8s-v2-serviceaccount' for reason: 'Job took too long to complete'
08:02:35.278  INFO 1 --- [           main] c.n.s.c.k.v.s.KubernetesV2Credentials    : Checking if clusterRole is readable...
08:04:17.998  INFO 1 --- [           main] c.n.s.c.k.v.s.KubernetesV2Credentials    : Kind 'clusterRole' will not be cached in account 'some-k8s-v2-serviceaccount' for reason: 'Job took too long to complete'
08:04:17.998  INFO 1 --- [           main] c.n.s.c.k.v.s.KubernetesV2Credentials    : Checking if clusterRoleBinding is readable...
08:06:00.714  INFO 1 --- [           main] c.n.s.c.k.v.s.KubernetesV2Credentials    : Kind 'clusterRoleBinding' will not be cached in account 'some-k8s-v2-serviceaccount' for reason: 'Job took too long to complete'
08:06:00.714  INFO 1 --- [           main] c.n.s.c.k.v.s.KubernetesV2Credentials    : Checking if configMap is readable...
08:07:43.430  INFO 1 --- [           main] c.n.s.c.k.v.s.KubernetesV2Credentials    : Kind 'configMap' will not be cached in account 'some-k8s-v2-serviceaccount' for reason: 'Job took too long to complete'

Sometimes the pods come up without any problems, and sometimes they end up slowly working through all of our roughly 200 Kubernetes provider accounts like this. Since the timeout appears to be almost 2 minutes per kind check (see the timestamps above), the clouddriver pod never becomes ready.

I haven’t been able to verify this, but I suspect we might be hitting GKE API rate limits, although in that case I would expect to see a 4xx error rather than a timeout. Has anyone seen this kind of behaviour before? I haven’t managed to confirm that the rate limits are being exceeded, as I don’t know exactly what to search for in the Stackdriver logs (it’s like looking for a needle in a haystack).

Does anyone have good tips on what to try? Is there a way to throttle clouddriver’s requests to the Kubernetes API?
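
One idea I’m considering, if I understand the Halyard docs correctly, is cutting down the number of kinds each account checks and caches, so that startup makes fewer API calls. Roughly something like this (using the same account as in the logs above; the kinds to omit are just an example and would need to be ones we never deploy):

    # Tell clouddriver to skip checking/caching kinds we never deploy (hypothetical list)
    hal config provider kubernetes account edit some-k8s-v2-serviceaccount \
        --omit-kinds podPreset,controllerRevision,event

    # Re-deploy so the caching clouddrivers pick up the change
    hal deploy apply

I have no idea yet whether that actually helps with rate limiting, or only shortens the startup checks.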

Thanks in advance for helping out


#2

Just in case anyone is wondering about this: it turned out to be a connectivity issue. There were some misconfigured nodes in the cluster, and whenever clouddriver got scheduled on one of the faulty nodes it would fail to start up.

If anyone else runs into this problem, try connecting to the Kubernetes master from inside the clouddriver pod to verify connectivity.
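
For reference, a quick check along these lines is enough to tell the two cases apart; the pod name and namespace are illustrative, and this assumes curl (or something similar) is available in the image. Even a 401/403 response proves the network path works, whereas a hang or timeout points at connectivity:

    # Exec into one of the caching clouddriver pods (pod name and namespace are examples)
    # <master-endpoint> is the API server address from the account's kubeconfig
    kubectl -n spinnaker exec spin-clouddriver-caching-<pod-id> -- \
        curl -sk --max-time 10 https://<master-endpoint>/healthz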