GKE Autopilot - a Serverless game changer?

Google Autopilot

Google released GKE Autopilot at the end of February 2021. The main difference from standard GKE is that Autopilot takes all cluster and node management out of the hands of your team, further blurring the line between traditional Serverless and Kubernetes. Could this be a game changer? Let's have a look.

AWS, with Lambda, is the clear leader in the Serverless space. But adopting it comes at a price: the tooling and practices are very AWS-specific, so vendor lock-in is a fact. Taking a step back, it also means adopting slightly different tools for infrastructure and automation. The one area where AWS Serverless is at a disadvantage is ecosystem: Kubernetes has a wider range of tools available, more well-known patterns, and plays nicely with most existing libraries and frameworks for most languages.

When we first started working with AWS Serverless and many edges of the offering were still rough, one spontaneous comment was "I wish we could just have 'Serverless Kubernetes', with zero management of the underlying infrastructure".

Well, it looks like GKE Autopilot is a delayed answer to that prayer. With some caveats.

How does it work?

GKE Autopilot is an automated, opinionated, secure-by-default configuration of GKE. Nodes and node pools are completely managed by Google. Security updates and maintenance are likewise automated. Networking is relatively secure by default, especially if you configure a private cluster (which you should).

It is, in essence, Kubernetes, but you only have to worry about configuring and running your applications and workloads, not what they run on. In terms of provisioning resources, all you have to do is set up your resource requests, limits and auto-scaling rules, and Autopilot will do the rest. You only pay for the resources (CPU, RAM, storage) your pods request while they are running, not for the nodes of the actual underlying cluster.
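To make "set up your resource requests, limits and auto-scaling rules" concrete, here is a minimal sketch that builds a Deployment and a HorizontalPodAutoscaler as plain Python dicts and prints them as JSON, which kubectl accepts alongside YAML. The helper names, the workload name, the image and the numbers are all hypothetical placeholders, not anything Autopilot prescribes.

```python
import json


def deployment_manifest(name, image, cpu, memory, replicas=1):
    """Build a minimal apps/v1 Deployment with explicit resource requests."""
    resources = {"cpu": cpu, "memory": memory}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Requests and limits are set to the same values
                        # here purely to keep the sketch simple.
                        "resources": {
                            "requests": dict(resources),
                            "limits": dict(resources),
                        },
                    }]
                },
            },
        },
    }


def autoscaler_manifest(name, min_replicas, max_replicas, cpu_percent):
    """Build an autoscaling/v1 HorizontalPodAutoscaler for the Deployment."""
    return {
        "apiVersion": "autoscaling/v1",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "targetCPUUtilizationPercentage": cpu_percent,
        },
    }


if __name__ == "__main__":
    # Hypothetical workload: name, image and sizes are placeholders.
    deployment = deployment_manifest(
        "hello-api", "example.invalid/hello-api:1.0", cpu="500m", memory="1Gi"
    )
    autoscaler = autoscaler_manifest("hello-api", 1, 5, cpu_percent=70)
    print(json.dumps(deployment, indent=2))
    print(json.dumps(autoscaler, indent=2))
```

Nothing in these manifests mentions nodes or node pools; the requests are the only sizing information you supply.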

Is there a catch?

Short answer: Yes. It's not quite Serverless in the sense of zero-cost for zero load or seamless auto-scaling.

For instance, if you have a pod running, you will be paying for its resources even if it is completely idle and unused.
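A back-of-the-envelope sketch of what that means: because billing follows the pod's resources rather than its traffic, the estimate below takes no load parameter at all. The function name and the unit prices are hypothetical placeholders, not Google's actual rates; substitute the current figures from the GKE pricing page for your region.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months


def monthly_pod_cost(vcpu, memory_gib, replicas,
                     price_per_vcpu_hour, price_per_gib_hour,
                     hours=HOURS_PER_MONTH):
    """Estimate the monthly cost of pods that run continuously.

    There is deliberately no traffic or utilisation argument: an idle
    pod costs the same as a busy one with the same resources.
    """
    hourly_per_pod = vcpu * price_per_vcpu_hour + memory_gib * price_per_gib_hour
    return hourly_per_pod * replicas * hours


if __name__ == "__main__":
    # Placeholder prices for illustration only, NOT real GCP rates.
    cost = monthly_pod_cost(
        vcpu=0.5, memory_gib=2, replicas=2,
        price_per_vcpu_hour=0.05, price_per_gib_hour=0.005,
    )
    print(f"Estimated monthly cost: {cost:.2f}")
```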

One issue we observed is that autoscaling is not exactly instant, unlike AWS Lambda or (almost) Google Cloud Run. Lambda cold-starts on a slow JVM process will seem positively snappy compared to Autopilot scaling. Deploying a new workload or scaling one up can in many cases take several minutes. Why? Without knowing the internals of Autopilot, but having observed the Kubernetes event logs, I can make an educated guess: while the management of nodes is none of your concern, requests for additional resources still require Autopilot to provision more nodes for your cluster under the covers. And provisioning additional nodes will certainly be slower than simply spinning up new processes on idle machines.

Privileged pods are also disallowed for security reasons. This is likely a good choice, but if you rely on them, it can be a spanner in the works. Concourse, for instance, is a tool that currently will not work by default.

Because you have no direct access to nodes, you cannot inject "host ports" into pods. This means tools like Consul will currently not work. By extension, it also means most of the HashiCorp tools that rely on Consul won't work either. I would not be surprised if HashiCorp has a fix or solution out for this shortly, though; they are usually impressively on the ball with these things.
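If you want a quick first pass over existing workloads before trying them on Autopilot, the two restrictions described so far (privileged containers and host ports) are easy to spot in a pod spec. The following is a minimal sketch, assuming the pod spec has already been loaded into a Python dict; the function name is hypothetical, and the check covers only these two cases, not the full set of rules Autopilot enforces.

```python
def autopilot_blockers(pod_spec):
    """Return a list of reasons a pod spec may be rejected.

    Only checks the two restrictions discussed in this article:
    privileged containers and hostPort usage.
    """
    problems = []
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext") or {}
        if security.get("privileged"):
            problems.append(f"{name}: privileged container")
        for port in container.get("ports", []):
            if port.get("hostPort"):
                problems.append(f"{name}: hostPort {port['hostPort']}")
    return problems


if __name__ == "__main__":
    # Hypothetical pod spec for illustration.
    spec = {
        "containers": [
            {"name": "agent", "image": "example.invalid/agent:1.0",
             "ports": [{"containerPort": 8301, "hostPort": 8301}]},
            {"name": "worker", "image": "example.invalid/worker:1.0",
             "securityContext": {"privileged": True}},
        ]
    }
    for problem in autopilot_blockers(spec):
        print(problem)
```

For a Deployment, the dict to pass in is the one under spec.template.spec.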

Another limitation we observed, again likely a good choice but potentially annoying for some, is that SSHing into pods is no longer allowed. Disallowing it is good practice, but I know many organisations take shortcuts that will now be cut off.

Should you use it?

If you are currently running Kubernetes, and none of the above limitations are a problem, I'd say "probably yes". At scale, your GCP bill might be higher, but you have to weigh this against the cost in labour of running a non-Autopilot cluster. I suspect this will swing the equation in favour of Autopilot.

If you are considering Serverless, I'd call it a strong "maybe": We are more and more moving towards the idea that Kubernetes is the ideal "control plane for the cloud". Together with emerging tools such as Crossplane and ArgoCD, this looks like a vision that could very soon be realised. For us, Crossplane support for Cloud Run (or any other CRD + Operator for Cloud Run) is the missing piece of the jigsaw to make this a very good proposition.

The way it could work is to have GKE Autopilot act as your control plane and API Gateway (with something like Ambassador), routing traffic into Cloud Run containers for resources that require elastic scaling, and into Autopilot pods where load is more predictable.

In summary, we think Autopilot is an exciting addition to Google Cloud's offering, swinging some of the balance of power and momentum in the emerging Serverless space back towards Google from AWS. AWS still has the lead in terms of breadth and depth of offering, but at the cost of strong vendor lock-in. Google Autopilot + Cloud Run offer a more standards-based, open option that also has the benefit of higher developer familiarity.

With Autopilot, we now consider GCP a worthy challenger to AWS in the fight for Serverless mindshare.