GKE Installer

Fury Kubernetes Installer - Managed Services - GKE - oss project.

GKE vs Compute Instances (managed vs self-managed)

Before continuing, you should understand the benefits and drawbacks of creating a GKE cluster instead of running your own Kubernetes control plane on Google Cloud Compute instances.

Price

GKE currently costs $0.10 per hour for an HA control plane.

An n1-standard-2 compute instance currently costs $0.095 per hour. An HA cluster with 3 x n1-standard-2 instances will cost: $0.095 x 3 instances = $0.285 per hour.

GKE is cheaper in most scenarios.

You can reduce the control-plane cost by running these instances with committed use discounts, but you have to pay upfront for months of usage.

This cost analysis was done in May 2020; all prices were taken from the official Google Cloud Platform pricing lists.

Management

GKE is a fully managed service provided by GCP, meaning that you don’t need to worry about backups, recovery, availability, scalability or certificates… even authentication to the cluster is managed by Google.

You’ll have to set up all of these features yourself if you choose to host your own control plane. On the other hand, a self-managed setup lets you customize things that GKE doesn’t expose: audit logs, Kubernetes API server feature flags, your own authentication provider and other platform services.

So, if you need to set up a non-default cluster, you should consider going with the self-managed cluster. Otherwise, GKE is a good option.

Day two operations

As mentioned before, GKE is responsible for making the Kubernetes control plane fully operational with a monthly uptime percentage of at least 99.95%.

source: https://cloud.google.com/kubernetes-engine/sla

On the other hand, in a self-managed setup you have to take care of backups, disaster recovery strategies, the HA setup, certificate rotation, and control-plane and worker node updates.

Requirements

As mentioned in the common requirements, the operator responsible for creating a GKE cluster must have connectivity from their machine (bastion host, laptop with a configured VPN…) to the network where the cluster will be placed.

The machine used to create the cluster should have the following installed:

  • OS tooling such as git, ssh, curl and unzip.
  • terraform version 0.15.4.
  • the latest gcloud CLI version.

Cloud requirements

This installer has mainly three requirements:

  • Dedicated VPC.
  • Enough permissions to create all resources surrounding the GKE cluster.
  • If your workloads need internet connectivity, you should understand how connectivity works in a GKE private cluster: Using Cloud NAT with GKE Cluster (a minimal sketch follows this list).
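
As a reference, outbound connectivity for a private GKE cluster is usually provided by a Cloud NAT attached to a Cloud Router on the cluster’s VPC. The following is only a minimal Terraform sketch with example names and region, not something this installer creates for you:

resource "google_compute_router" "nat_router" {
  name    = "gke-nat-router" # example name
  network = "gke-vpc"        # the VPC that will host the cluster
  region  = "europe-west1"   # example region
}

resource "google_compute_router_nat" "nat" {
  name                               = "gke-nat" # example name
  router                             = google_compute_router.nat_router.name
  region                             = google_compute_router.nat_router.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}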

Gather all input values

Before starting to use this installer, you should know the values of the following input variables:

  • cluster_name: Unique cluster name.
  • cluster_version: GKE version to use. Example: 1.20.9-gke.700. Take a look at the available GKE Kubernetes versions (a sketch for listing them follows this list).
  • network: Name of the network where the cluster will be created.
  • subnetworks: List of three subnetwork names:
    • index 0: The subnetwork to host the cluster.
    • index 1: The name of the secondary subnet IP range to use for pods.
    • index 2: The name of the secondary subnet IP range to use for services.
    All subnetworks must belong to network.
  • ssh_public_key: Cluster administrator public ssh key. Used to access cluster nodes with the operator_ssh_user.
  • dmz_cidr_range: Network CIDR range from which the cluster’s control plane will be accessible.
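
If you prefer to discover the available versions from Terraform itself rather than from the GCP console, a minimal sketch using the google provider’s google_container_engine_versions data source (the location and version prefix below are example values):

data "google_container_engine_versions" "available" {
  location       = "europe-west1" # example: use the region where the cluster will live
  version_prefix = "1.20."        # example: restrict the results to one minor version
}

output "valid_master_versions" {
  value = data.google_container_engine_versions.available.valid_master_versions
}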

Specific GKE input variables

Defining a common shared interface across cloud providers is a difficult task, as there are many differences between them.

We analyze every new requirement to see whether it fits in the common cloud installer interface or whether it is specific to a single cloud provider.

This is why this installer has the provider-specific input variables listed below:

  • gke_add_additional_firewall_rules: Create additional firewall rules. Type: bool. Default: true. Optional.
  • gke_add_cluster_firewall_rules: Create additional firewall rules (upstream GKE module). Type: bool. Default: false. Optional.
  • gke_disable_default_snat: Whether to disable the default SNAT to support the private use of public IP addresses. Type: bool. Default: false. Optional.
  • gke_master_ipv4_cidr_block: The IP range in CIDR notation to use for the hosted master network. Type: string. Default: "10.0.0.0/28". Optional.
  • gke_network_project_id: The project ID of the shared VPC’s host (for shared VPC support). Type: string. Default: "". Optional.

These variables are all optional, so you may not need to set any of them.
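
If you do need one of them, pass it alongside the other arguments of the module block shown in the next section. A minimal sketch, assuming a custom control-plane CIDR and a shared VPC host project (both values below are examples):

module "my-cluster" {
  source = "github.com/sighupio/fury-gke-installer//modules/gke?ref=v1.8.0"

  # ... the required arguments (cluster_version, cluster_name, network, ...) as in main.tf below ...

  gke_master_ipv4_cidr_block = "10.1.0.0/28"                # example non-default CIDR
  gke_network_project_id     = "my-shared-vpc-host-project" # example shared VPC host project
}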

Getting started

Make sure to complete all the prerequisites before continuing, including cloud credentials, VPN/bastion/network configuration and gathering all the required input values.

Create a new directory to save all terraform files:

$ mkdir /home/operator/sighup/my-cluster-at-gke
$ cd /home/operator/sighup/my-cluster-at-gke

Create the following files:

main.tf

variable "cluster_name" {}
variable "cluster_version" {}
variable "network" {}
variable "subnetworks" { type = list }
variable "dmz_cidr_range" {}
variable "ssh_public_key" {}
variable "node_pools" { type = list }
variable "tags" { type = map }

module "my-cluster" {
  source = "github.com/sighupio/fury-gke-installer//modules/gke?ref=v1.8.0"

  cluster_version = var.cluster_version
  cluster_name    = var.cluster_name
  network         = var.network
  subnetworks     = var.subnetworks
  ssh_public_key  = var.ssh_public_key
  dmz_cidr_range  = var.dmz_cidr_range
  node_pools      = var.node_pools
  tags            = var.tags
}

data "google_client_config" "current" {}

output "kube_config" {
  sensitive = true
  value     = <<EOT
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ${module.my-cluster.cluster_certificate_authority}
    server: ${module.my-cluster.cluster_endpoint}
  name: gke
contexts:
- context:
    cluster: gke
    user: gke
  name: gke
current-context: gke
kind: Config
preferences: {}
users:
- name: gke
  user:
    token: ${data.google_client_config.current.access_token}
EOT
}
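
Note that main.tf does not declare a provider block: credentials and defaults can also come from the environment (for example GOOGLE_APPLICATION_CREDENTIALS, GOOGLE_PROJECT and GOOGLE_REGION). If you prefer to declare them explicitly, a minimal sketch with example values (the upstream module also uses the google-beta provider):

provider "google" {
  project = "my-gcp-project" # example project ID
  region  = "europe-west1"   # example region
}

provider "google-beta" {
  project = "my-gcp-project"
  region  = "europe-west1"
}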

Create my-cluster.tfvars including your environment values:

cluster_name    = "my-cluster"
cluster_version = "1.20.9-gke.700"
network         = "gke-vpc"
subnetworks     = ["gke-subnet", "gke-subnet-pod", "gke-subnet-svc"]
ssh_public_key  = "ssh-rsa example"
dmz_cidr_range  = "10.0.0.0/16"
tags            = {}
node_pools = [
  {
    name : "node-pool-1"
    version : null # To use the cluster_version
    min_size : 1
    max_size : 1
    instance_type : "n1-standard-1"
    volume_size : 100
    subnetworks : []
    additional_firewall_rules: []
    labels : {
      "sighup.io/role" : "app"
      "sighup.io/fury-release" : "v1.3.0"
    }
    taints : []
    tags : {}
    max_pods : null # Default
  },
  {
    name : "node-pool-2"
    version : "1.20.9-gke.700"
    min_size : 1
    max_size : 1
    instance_type : "n1-standard-2"
    volume_size : 50
    subnetworks : []
    additional_firewall_rules: []
    labels : {}
    taints : [
      "sighup.io/role=app:NoSchedule"
    ]
    tags : {}
    max_pods : 100 # Override the default
  }
]

With these two files, the installer is ready to create everything needed to set up a GKE cluster with two different node pools (if you don’t modify the node_pools variable example value) using Kubernetes 1.20.

$ ls -lrt
total 16
-rw-r--r--  1 sighup  staff  1171 27 Apr 16:35 my-cluster.tfvars
-rw-r--r--  1 sighup  staff  1128 27 Apr 16:36 main.tf
$ terraform init
Initializing modules...
Downloading github.com/sighupio/fury-gke-installer?ref=v1.8.0 for my-cluster...
- my-cluster in .terraform/modules/my-cluster/modules/gke
Downloading terraform-google-modules/kubernetes-engine/google 14.3.0 for my-cluster.gke...
- my-cluster.gke in .terraform/modules/my-cluster.gke/modules/beta-private-cluster
Downloading terraform-google-modules/gcloud/google 2.0.3 for my-cluster.gke.gcloud_delete_default_kube_dns_configmap...
- my-cluster.gke.gcloud_delete_default_kube_dns_configmap in .terraform/modules/my-cluster.gke.gcloud_delete_default_kube_dns_configmap/modules/kubectl-wrapper
- my-cluster.gke.gcloud_delete_default_kube_dns_configmap.gcloud_kubectl in .terraform/modules/my-cluster.gke.gcloud_delete_default_kube_dns_configmap

Initializing the backend...

Initializing provider plugins...
- Finding hashicorp/google-beta versions matching ">= 3.49.0, 3.55.0, < 4.0.0"...
- Finding hashicorp/kubernetes versions matching "~> 1.10, != 1.11.0, 1.13.3"...
- Finding hashicorp/null versions matching "3.0.0"...
- Finding hashicorp/random versions matching "3.0.1"...
- Finding hashicorp/google versions matching "3.55.0"...
- Finding hashicorp/external versions matching "2.0.0"...
- Installing hashicorp/null v3.0.0...
- Installed hashicorp/null v3.0.0 (signed by HashiCorp)
- Installing hashicorp/random v3.0.1...
- Installed hashicorp/random v3.0.1 (signed by HashiCorp)
- Installing hashicorp/google v3.55.0...
- Installed hashicorp/google v3.55.0 (signed by HashiCorp)
- Installing hashicorp/external v2.0.0...
- Installed hashicorp/external v2.0.0 (signed by HashiCorp)
- Installing hashicorp/google-beta v3.55.0...
- Installed hashicorp/google-beta v3.55.0 (signed by HashiCorp)
- Installing hashicorp/kubernetes v1.13.3...
- Installed hashicorp/kubernetes v1.13.3 (signed by HashiCorp)

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
$ terraform plan --var-file my-cluster.tfvars --out my-cluster.plan
<TRUNCATED OUTPUT>
Plan: 17 to add, 0 to change, 0 to destroy.

------------------------------------------------------------------------

This plan was saved to: my-cluster.plan

To perform exactly these actions, run the following command to apply:
    terraform apply "my-cluster.plan"

Review the plan carefully before applying anything. It should create 17 resources.

$ terraform apply my-cluster.plan
<TRUNCATED OUTPUT>
Apply complete! Resources: 17 added, 0 changed, 0 destroyed.

Outputs:

kube_config = <sensitive>

To get your kubeconfig file, run the following commands:

Note that the kubeconfig embeds a short-lived access token, so kubectl access will expire after some time; re-run the terraform output command below to generate a fresh file.

$ terraform output --raw kube_config > kube.config
$ kubectl cluster-info --kubeconfig kube.config
Kubernetes control plane is running at https://10.0.0.2
calico-typha is running at https://10.0.0.2/api/v1/namespaces/kube-system/services/calico-typha:calico-typha/proxy
KubeDNS is running at https://10.0.0.2/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://10.0.0.2/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
$ kubectl get nodes --kubeconfig kube.config
NAME                                    STATUS   ROLES    AGE    VERSION
gke-furyctl-node-pool-1-8a911b8f-4hmx   Ready    <none>   3m4s   v1.20.9-gke.700
gke-furyctl-node-pool-2-e4b1026e-0sdq   Ready    <none>   87s    v1.20.9-gke.700

GKE number of nodes

By default GKE deploys the same number of nodes in every zone of the region to provide HA. This means that if you specify just 1 node (min and max) in a node_pool, as in the example, you will end up with 3 nodes in that node pool (assuming the region has 3 availability zones).

Update control plane

To update the control plane, just set cluster_version to the next available version:

$ diff my-cluster.tfvars my-cluster-updated.tfvars
2c2
< cluster_version = "1.19.9-gke.700"
---
> cluster_version = "1.20.9-gke.700"

After modifying cluster_version, execute:

$ terraform plan --var-file my-cluster-updated.tfvars --out my-cluster.plan
<TRUNCATED OUTPUT>
Plan: 0 to add, 2 to change, 0 to destroy.

Please read the plan output carefully. Once you understand the changes, apply it:

$ terraform apply my-cluster.plan
<TRUNCATED OUTPUT>
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

It can take up to 25-30 minutes.

After updating the control-plane you end up with:

  • The GKE control plane updated from Kubernetes 1.19 to 1.20.
  • node-pool-1 updated to 1.20 as well (its version is null, so it follows cluster_version).
  • node-pool-2 remains on Kubernetes 1.19 (it is pinned to a specific version).

Update node pools

To update a node pool, just set the node pool’s version attribute to the same version as the control plane:

If you set version to null, you don’t need to do anything else: node pools with a null version are updated alongside the control-plane update procedure.

$ diff my-cluster.tfvars my-cluster-updated.tfvars
26c26
<     version : "1.19.9-gke.700"
---
>     version : "1.20.9-gke.700"

After that, run:

$ terraform plan --var-file my-cluster-updated.tfvars --out my-cluster.plan
<TRUNCATED OUTPUT>
Plan: 0 to add, 1 to change, 0 to destroy.

Review the plan before applying anything:

$ terraform apply my-cluster.plan
<TRUNCATED OUTPUT>
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

The node pool update takes less than 10 minutes.

Lift and Shift node pool update

You can apply another node pool update strategy called lift and shift: create a new node pool with the updated version, move all the workloads to the new nodes, then remove the old node pool or scale it down to 0 instances. A sketch of the tfvars change follows.
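
A minimal sketch of the node_pools change, assuming a hypothetical replacement pool named node-pool-2-new and keeping the other attributes from the example tfvars (whether an existing pool can be scaled to 0 instances or must simply be removed from the list depends on your setup):

node_pools = [
  {
    # old pool, pinned to the previous version; scale it down (or delete this entry) once drained
    name : "node-pool-2"
    version : "1.19.9-gke.700"
    min_size : 0
    max_size : 0
    instance_type : "n1-standard-2"
    volume_size : 50
    subnetworks : []
    additional_firewall_rules : []
    labels : {}
    taints : [
      "sighup.io/role=app:NoSchedule"
    ]
    tags : {}
    max_pods : 100
  },
  {
    # hypothetical replacement pool running the new version
    name : "node-pool-2-new"
    version : "1.20.9-gke.700"
    min_size : 1
    max_size : 1
    instance_type : "n1-standard-2"
    volume_size : 50
    subnetworks : []
    additional_firewall_rules : []
    labels : {}
    taints : [
      "sighup.io/role=app:NoSchedule"
    ]
    tags : {}
    max_pods : 100
  }
]

Workloads can then be moved with the usual kubectl cordon and kubectl drain commands on the old nodes before the old pool is removed.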

Tear down the environment

If you no longer need the cluster, go to the terraform directory where you created it (cd /home/operator/sighup/my-cluster-at-gke) and type:

$ terraform destroy --var-file my-cluster.tfvars
<TRUNCATED OUTPUT>
Plan: 0 to add, 0 to change, 11 to destroy.

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

Type yes and press enter to confirm the destruction. It will take around 15 minutes.