How to set up Ingress on GKE


#1

I’ve got Spinnaker deploying to Google Kubernetes Engine, but it fails sporadically with timeout messages.

I think I’ve tracked the problem down to my setup of the Ingress, more specifically to manually setting spec.type to NodePort on the spin-gate and spin-deck services (through the GKE UI). That seems to cause my nodes to run out of memory.
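
For context, the manual change I made is roughly equivalent to the Service spec below (spin-deck shown; spin-gate is the same on its own port). The ports and selector labels are from my install and may not match yours exactly:

# Rough sketch of the change I applied via the GKE UI (not an exact dump).
apiVersion: v1
kind: Service
metadata:
  name: spin-deck
  namespace: spinnaker
spec:
  type: NodePort            # changed from ClusterIP so a GKE Ingress can reach it
  selector:
    app: spin
    cluster: spin-deck
  ports:
    - port: 9000            # Deck's default port on my install
      targetPort: 9000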

While debugging I downgraded to 1.5.4 (it had been 1.6.0) because I thought the version might be my issue:

Halyard Setup:
  virtual_machine: Google Compute Engine n1-standard-1 (1 vCPU, 3.75 GB memory)
  version: 0.43.0-180317140630
  config:
    version: 1.5.4 | 1.6.0
    type: Distributed


#2

This seems to be a Kubernetes/GKE issue – what version of Kubernetes are you running?


#3

It’s 1.8.8, the current GKE default I believe.

Thanks,
Chris


#4

Have you seen this behavior with any other NodePort services you run in this cluster, or is this unique to Spinnaker?


#5

Spinnaker is the only thing in this cluster right now.

To be clear, I only see the issue during a hal deploy apply.


#6

Ah, does it happen on a fresh cluster with hal deploy apply, or only after the NodePort has been changed?


#7

Only after the NodePort has been set.


#8

Hi, Chris - this seems to be an issue with GKE, rather than Spinnaker. Can you please submit a GKE support ticket for this?


#9

@yuriatgoogle: Done

I will update this issue with the results of the support ticket.

Thanks,
Chris


#10

So Google support pointed out to me that the Spinnaker ReplicaSets don’t have resource requests or limits defined on them (spec.resources, I think?).

I know that is configurable in Spinnaker, but it doesn’t seem to have a reasonable default value attached to it.
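
For anyone else hitting this, what support was pointing at is roughly the following on each container spec. The service name, image, and numbers here are purely illustrative, not recommendations:

# Illustrative sketch only - the actual values each Spinnaker service needs will differ.
spec:
  containers:
    - name: clouddriver
      image: gcr.io/spinnaker-marketplace/clouddriver   # placeholder image reference
      resources:
        requests:
          memory: 1Gi
          cpu: 500m
        limits:
          memory: 2Gi
          cpu: "1"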


#11

@ewiseblatt do we have reasonable default resource limits on Kubernetes?

Also, this seems unique to GKE 1.8.8 + NodePorts; we haven’t heard of anything like this before.


#12

@lwander, @jacobkiefer

We don’t set anything to my knowledge – I certainly don’t. I presumed that responsibility would sit in Halyard and be conveyed through each service’s hal config. I think Jacob was doing some work baselining needs during the quota analysis work, and Travis at one point may have tweaked constraints on his internal long-running deployment so that Orca and Clouddriver would deploy on different nodes.

I do think that an n1-standard-1 [*] is too small as a node, mostly because of the limited RAM. While it doesn’t seem like you’d need a lot of RAM, we’re running a lot of independent JVMs, and Java (especially with Spring) carries a lot of RAM overhead. Clouddriver and Orca use a lot of RAM as well.

We use a pool of n1-highmem-2 nodes in our validation process and deploy vanilla Spinnaker via Halyard, though Halyard itself is deployed to a VM outside the k8s cluster for ease of troubleshooting catastrophic builds. I am guessing you’d need at least two nodes (redundancy aside) – we use more because we’re doing other things. When we test VM deployments, we use an n1-standard-4 or equivalent on other platforms, and that includes Redis, Halyard, and anything else needed. Those VMs have some RAM headroom, so I’m guessing two nodes for the suite of microservices plus Halyard would be a sufficient starting point, even with some additional k8s overhead.

At the moment we are using 1.9.x, though we ran 1.8.x at one point without noticing an issue. We’ve run most versions over the past two years without noticing any compatibility issues.

[*] for the sake of discussion:
n1-standard-1 = [1 core, 3.75 GB]
n1-highmem-2 = [2 cores, 13 GB]
n1-standard-4 = [4 cores, 15 GB]


#13

Ok, thanks.

It looks like my nodes may not have been big enough. I started with 2 x custom (2 vCPUs, 4 GB memory), then moved to 4 x custom (2 vCPUs, 4 GB memory), and things went more smoothly.

I’ll switch to 2 x (n1-highmem-2) and see how that goes.

Thanks,
Chris


#14

Ok, that went much better. Initial results aren’t showing any issues.

I’ll get some pipelines configured and running and see if it continues to work well.

My summary is:

On a person’s first install, they have to put a lot of resources in the node pool. If they don’t, they run into very confusing issues that don’t always appear to be related to resources. For example:

  1. hal deploy apply times out.
  2. Logs indicate 503 server errors.
  3. Previously deployed pods don’t get destroyed.
  4. New versions of pods don’t get created.

These problems are further exacerbated by two facts:

  1. There doesn’t seem to be any high-level guide to the distributed deployment with enough detail for someone new to Spinnaker to understand what they should expect to see.
  2. There doesn’t seem to be any documentation or enforcement of minimum resource needs per service.

My suggestions for improving this experience would be:

  1. Add options to hal config deploy for the distributed type that would:
    1. Allow creation of an Ingress through Halyard (just a thought: maybe point at a YAML file with the definition inside; see the sketch after this list).
    2. Allow the definition of resource requests (and maybe limits) through Halyard, with defaults that are guaranteed to work as long as the nodes can satisfy those requests.
  2. Detailed high-level documentation of how the distributed deployment type works under the hood. Using Spinnaker pipeline definitions to illustrate this seems like a good idea.
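
To make suggestion 1.1 concrete, the kind of definition I have in mind is just a plain Ingress routing to the deck and gate services, something like the sketch below. The hostname, namespace, paths, and ports are placeholders from my own setup:

# Hypothetical example of the Ingress I'd want Halyard to be able to create.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: spinnaker-ingress
  namespace: spinnaker
spec:
  rules:
    - host: spinnaker.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: spin-deck     # must be type NodePort for the GKE ingress controller
              servicePort: 9000
          - path: /gate
            backend:
              serviceName: spin-gate
              servicePort: 8084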

I’m going to continue working on getting Spinnaker to deploy several applications; I’ll update this if I have any more ideas.

I’ll go ahead and create some feature requests for the above suggestions (if they don’t already exist). Should they just go to GitHub under spinnaker/spinnaker?


#15

I’ve created this issue here: