I was brought in by a startup to set up their core infrastructure so that it worked as required and could be automated for safe, efficient provisioning and deployment. The key requirement was making RabbitMQ accept only secure, certificate-based connections – the AMQPS protocol rather than AMQP – for security and compliance reasons. This needed to run within a Kubernetes cluster, using StatefulSets for storage and shared state, for ease of scaling and deployment, and for general flexibility. It also had to be built on GCP (Google Cloud Platform), which the startup already used and didn’t want to move away from at this stage, so GKE (Google Kubernetes Engine) was the choice for the Kubernetes cluster.
Getting certificates for use with RabbitMQ within Kubernetes required the setup of cert-manager for certificate management, which in turn needed ingress-nginx to allow incoming connections for Let’s Encrypt verification so that certificates could be issued.
I successfully solved the problems and fulfilled the requirements. It’s still a “work in progress” to some extent: some of the config is a little “rough and ready” and could be improved with more modularisation and better use of variables and secrets, and while the initial cluster provisioning is fully automated with Terraform, the rest is currently only semi-automated. So there is room for further improvement.
All the code and documentation is available in my GitHub repository. Below I will explain the whole process from start to finish.
Provision the Kubernetes cluster in GKE
Requirements
Local setup
- gcloud SDK installed and initialised (instructions here).
- kubectl installed (instructions here).
- Terraform installed (instructions here).
If you have multiple gcloud SDK projects/configurations set up, you must remember to switch to the correct configuration in the gcloud SDK before doing anything else, otherwise catastrophe could ensue. (Replace “test” with the name of your desired configuration.):
gcloud config configurations activate test
To check details of the current configuration:
gcloud config list
GCS buckets
A GCS (Google Cloud Storage) bucket needs to exist for remote Terraform state/lock management. Bucket names are globally unique rather than scoped to a project, so it’s best to identify the bucket accordingly. The bucket is currently named iac-state, but this should be changed to a more meaningful name to avoid confusion between different projects used by the same account.
When creating the GCS bucket, location type can be single region to save costs (europe-west2 in this case, but change that if needed), storage class should be standard, public access should be prevented, access control can be uniform, and object versioning should be switched on (default values should be fine).
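If you prefer the command line to the console, here is a sketch of roughly equivalent gsutil commands (adjust the bucket name and region as needed):
# Single-region Standard-class bucket with uniform access control
gsutil mb -c standard -l europe-west2 -b on gs://iac-state
# Prevent public access and enable object versioning
gsutil pap set enforced gs://iac-state
gsutil versioning set on gs://iac-state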
Terraform details
The Terraform files can be found in the provision-cluster directory in this project on my GitHub repo.
terraform.tfvars is where you set basic variables:
# Set this to correct Project ID
project_id = "000000000000"
# Change the zone as preferred
zone = "europe-west2-a"
versions.tf contains the Terraform provider setup and also the details of the GCS bucket backend for state management and locking:
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "3.52.0"
    }
  }

  required_version = ">= 0.14"

  backend "gcs" {
    # Change this to the bucket used to maintain state/lock
    bucket = "iac-state"
    prefix = "provision-cluster"
  }
}
vpc.tf sets up the VPC and other general parameters:
variable "project_id" {
description = "project id"
}
variable "zone" {
description = "zone"
}
provider "google" {
project = var.project_id
}
resource "google_compute_network" "vpc" {
name = "test-vpc"
}
cluster.tf is where most of the exciting stuff happens, i.e. the actual cluster and node pool creation:
# Only one node for now to save costs - increase as needed
variable "num_nodes" {
  default     = 1
  description = "number of nodes"
}

# Create cluster and remove default node pool
resource "google_container_cluster" "test" {
  name                     = "test-cluster"
  location                 = var.zone
  remove_default_node_pool = true
  initial_node_count       = 1
  network                  = google_compute_network.vpc.name
}

# Create node pool for cluster
resource "google_container_node_pool" "test_nodes" {
  name       = "test-nodes"
  location   = var.zone
  cluster    = google_container_cluster.test.name
  node_count = var.num_nodes

  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]

    labels = {
      env = "test"
    }

    # e2-standard-2 seems to be the minimum required by the RabbitMQ cluster
    # - change as needed
    machine_type = "e2-standard-2"
    tags         = ["test-node"]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}
outputs.tf defines output variables for use by later commands and processes:
output "zone" {
value = var.zone
description = "Zone"
}
output "project_id" {
value = var.project_id
description = "Project ID"
}
output "kubernetes_cluster_name" {
value = google_container_cluster.test.name
description = "Cluster Name"
}
output "kubernetes_cluster_host" {
value = google_container_cluster.test.endpoint
description = "Cluster Host"
}
Usage
N.B. Be very careful when applying or destroying Terraform configuration as these commands have the potential to break things on a massive scale if you make a mistake. Always check that you are using the correct GCP project before you begin (this should have been done above during the gcloud SDK initialisation).
Terraform state and locking are shared remotely via a GCS bucket, which should prevent more than one person from making changes at any given time, and should also ensure everyone is always working with the current state rather than a potentially out-of-date (and therefore dangerous) local copy.
Initialise Terraform:
terraform init
See what Terraform will do if you apply the current configuration:
terraform plan
Apply the current configuration:
terraform apply
Destroy the current configuration:
terraform destroy
Configure kubectl
Run this command to configure kubectl with access credentials. This is needed before you can run the kubectl commands for setting up cert-manager, RabbitMQ, etc.:
gcloud container clusters get-credentials $(terraform output -raw kubernetes_cluster_name) --zone $(terraform output -raw zone)
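As a quick sanity check that kubectl is now pointing at the right cluster (with the node pool above you should see a single node):
kubectl cluster-info
kubectl get nodes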
Deploy Kubernetes Dashboard
If you also need to deploy the Kubernetes Dashboard, perform the following procedure.
Deploy the Kubernetes Dashboard and create a proxy server to access the Dashboard:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-beta8/aio/deploy/recommended.yaml
kubectl proxy
This will keep running until stopped with CTRL-C, so open a new terminal tab/window and create the ClusterRoleBinding resource:
kubectl apply -f https://raw.githubusercontent.com/hashicorp/learn-terraform-provision-gke-cluster/master/kubernetes-dashboard-admin.rbac.yaml
Then create a token to log in to the Dashboard as an admin user:
ADMIN_USER_TOKEN_NAME=$(kubectl -n kube-system get secret | grep admin-user-token | cut -d' ' -f1)
ADMIN_USER_TOKEN_VALUE=$(kubectl -n kube-system get secret "$ADMIN_USER_TOKEN_NAME" -o jsonpath='{.data.token}' | base64 --decode)
echo "$ADMIN_USER_TOKEN_VALUE"
Open the Kubernetes Dashboard in your browser via the proxy at http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/, choose to log in with a token, then copy/paste the output from the above commands.
Set up cert-manager with ingress-nginx
All the files for cert-manager/ingress-nginx setup can be found in the cert-manager directory in this project on my GitHub repo.
Requirements
- User needs Kubernetes Engine Admin permissions in IAM.
- GKE cluster set up, and kubectl installed and configured, as described above.
- Domain set up in Cloud DNS.
ingress-nginx
Deploy the ingress-nginx controller, using the deployment manifest provided by Kubernetes:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/cloud/deploy.yaml
I wrote a Bash script dns.sh to update the DNS for the ingress, which you’ll need to modify as needed:
#!/bin/bash
# Change these as needed
# and also in the dns.yml.master file
dns_zone=example-com
dns_name=mq.example.com
file=dns.yml
old_data=$(dig +short -t a $dns_name)
new_data=$(kubectl get svc --namespace=ingress-nginx | grep LoadBalancer | awk '{print $4}')
[ -f $file ] && rm -f $file
cp -f ${file}.master $file
sed -i "s/OLD_DATA/${old_data}/" $file
sed -i "s/NEW_DATA/${new_data}/" $file
gcloud dns record-sets transaction execute --zone=${dns_zone} --transaction-file=dns.yml
[ $? -eq 0 ] && echo "IP updated from $old_data to $new_data"
The DNS script references a YAML file dns.yml.master which you’ll also need to modify for your own DNS setup:
---
additions:
- kind: dns#resourceRecordSet
  name: mq.example.com.
  rrdatas:
  - NEW_DATA
  ttl: 60
  type: A
deletions:
- kind: dns#resourceRecordSet
  name: mq.example.com.
  rrdatas:
  - OLD_DATA
  ttl: 60
  type: A
Then update the DNS:
./dns.sh
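To confirm that the record now points at the ingress controller’s external IP (the service name below is the one created by the ingress-nginx cloud manifest), compare these two values once the change has propagated:
dig +short -t a mq.example.com
kubectl get svc -n ingress-nginx ingress-nginx-controller -o jsonpath='{.status.loadBalancer.ingress[0].ip}'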
cert-manager
Deploy the cert-manager controller, using the deployment manifest provided by Jetstack:
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.6.1/cert-manager.yaml
This will take some time, so if you initially get errors with the following commands, wait a little while and try again.
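One way to check progress is to list the cert-manager pods, or to block until its Deployments report as available (adjust the timeout to taste):
kubectl -n cert-manager get pods
kubectl -n cert-manager wait --for=condition=Available deployment --all --timeout=180s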
issuer.yml is a manifest file to create the Let’s Encrypt certificate issuer in cert-manager:
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Change this
    email: admin@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - selector: {}
      http01:
        ingress:
          class: nginx
Create the issuer:
kubectl create -f issuer.yml
(There is also issuer-staging.yml for using the Let’s Encrypt staging API instead of prod, if needed.)
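To confirm that the issuer has registered successfully with the ACME server (the Ready condition should become True):
kubectl get issuer letsencrypt-prod
kubectl describe issuer letsencrypt-prod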
ingress.yml is a manifest file for the ingress, which requests the certificate, uses ingress-nginx so Let’s Encrypt can verify via incoming HTTP connection, then issues the certificate:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: nginx-ingress-cert-manager
  annotations:
    #cert-manager.io/issuer: "letsencrypt-staging"
    cert-manager.io/issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    # Change this
    - mq.example.com
    secretName: rabbitmq-tls
  rules:
  # Change this
  - host: mq.example.com
    http:
      paths:
      - backend:
          serviceName: nginx-ingress-backend
          servicePort: 80
So, deploy ingress and request certificate:
kubectl apply -f ingress.yml
Checking the certificate
Get certificate request status/details:
kubectl describe certificaterequest $(kubectl describe certificate rabbitmq-tls | grep -i request | awk -F '"' '{print $2}')
Get certificate details:
kubectl describe certificate rabbitmq-tls
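You can also inspect the certificate stored in the Secret itself, which is handy for checking the expiry date (assuming openssl is available locally):
kubectl get secret rabbitmq-tls -o jsonpath='{.data.tls\.crt}' | base64 --decode | openssl x509 -noout -subject -dates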
Renewing the certificate
An email should be received at the address specified in issuer.yml when it is time to renew the certificate. It may also be wise to set a timed reminder. If unsure, the certificate status can be checked with the commands above.
The simplest option for renewing is to delete the existing certificate and create a new one. This is likely to create a gap in the RabbitMQ service, so downtime should be scheduled if appropriate.
Firstly, update the DNS to point to the nginx ingress (oddly, this doesn’t always appear to be necessary when renewing, but it does always appear to be necessary when creating the initial certificate; I don’t yet have an explanation for the inconsistency, so I think it’s best to always do it to be on the safe side):
./dns.sh
Then delete and recreate the ingress, certificate and Secret:
kubectl delete -f ingress.yml
kubectl delete secret rabbitmq-tls
kubectl apply -f ingress.yml
Then cd to the rabbitmq-cluster folder and run the command there to revert the DNS back to the RabbitMQ service:
cd ../rabbitmq-cluster
./dns.sh
It appears that RabbitMQ simply uses the updated certificate from the Secret without needing to be restarted/reloaded, though this has not yet been extensively tested. Some errors will likely appear in the RabbitMQ and app logs while the certificate is being updated.
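One way to confirm which certificate RabbitMQ is actually serving after a renewal is to inspect the TLS handshake from outside the cluster (again assuming openssl is available; change the hostname as needed):
openssl s_client -connect mq.example.com:5671 -servername mq.example.com </dev/null 2>/dev/null | openssl x509 -noout -subject -dates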
Deletion
If you need to delete cert-manager and ingress-nginx:
kubectl delete -f ingress.yml
kubectl delete secret rabbitmq-tls
kubectl delete -f issuer.yml
kubectl delete -f https://github.com/jetstack/cert-manager/releases/download/v1.6.1/cert-manager.yaml
kubectl delete -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/cloud/deploy.yaml
Set up the RabbitMQ cluster
All the files for RabbitMQ setup can be found in the rabbitmq-cluster directory in this project on my GitHub repo.
Requirements
- User needs Kubernetes Engine Admin permissions in IAM.
- GKE cluster set up, and kubectl installed and configured, as described above.
- cert-manager setup complete, as described above.
- Domain set up in Cloud DNS.
Deployment
Deploy the RabbitMQ Cluster Operator, using the deployment manifest provided by RabbitMQ:
kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
(More details on installing the RabbitMQ Cluster Operator, if needed.)
Set up your RabbitMQ definitions.json file as needed. There’s one included here which gives an example of setting up a user and a vhost for app access:
{
  "users": [
    {
      "name": "username",
      "password": "password",
      "tags": "administrator",
      "limits": {}
    }
  ],
  "vhosts": [
    {
      "name": "virtualhost"
    }
  ],
  "permissions": [
    {
      "user": "username",
      "vhost": "virtualhost",
      "configure": ".*",
      "write": ".*",
      "read": ".*"
    }
  ]
}
Create a ConfigMap from the RabbitMQ definitions, to set up the user and vhost etc.:
kubectl create configmap definitions --from-file=definitions.json
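To double-check that the ConfigMap contains what you expect before deploying the cluster:
kubectl describe configmap definitions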
cluster.yml is a manifest file for deploying the Workloads and Services with the certificate previously created via cert-manager, and with the ConfigMap mounted to import definitions:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-cluster
spec:
  replicas: 1
  service:
    type: LoadBalancer
  tls:
    secretName: rabbitmq-tls
    disableNonTLSListeners: true
  # Volume for importing RabbitMQ config definitions
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
            - name: rabbitmq
              volumeMounts:
              - mountPath: /definitions/
                name: definitions
            volumes:
            - name: definitions
              configMap:
                name: definitions
  rabbitmq:
    additionalConfig: |
      load_definitions = /definitions/definitions.json
So let’s apply the cluster manifest:
kubectl apply -f cluster.yml
(Further instructions for deploying/using the cluster etc., if needed.)
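To check that the RabbitMQ cluster has come up and that its LoadBalancer service has been assigned an external IP:
kubectl get rabbitmqclusters
kubectl get pods | grep rabbitmq-cluster
kubectl get svc rabbitmq-cluster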
Update DNS
I wrote a Bash script dns.sh to point the DNS at the RabbitMQ service’s external IP, which you’ll need to modify as needed:
#!/bin/bash
# Change these as needed
# and also in the dns.yml.master file
dns_zone=example-com
dns_name=mq.example.com
file=dns.yml
old_data=$(dig +short -t a $dns_name)
new_data=$(kubectl get svc --namespace=default | egrep "^rabbitmq-cluster[ ]+LoadBalancer" | awk '{print $4}')
[ -f $file ] && rm -f $file
cp -f ${file}.master $file
sed -i "s/OLD_DATA/${old_data}/" $file
sed -i "s/NEW_DATA/${new_data}/" $file
gcloud dns record-sets transaction execute --zone=${dns_zone} --transaction-file=dns.yml
[ $? -eq 0 ] && echo "IP updated from $old_data to $new_data"
The DNS script references a YAML file dns.yml.master which you’ll also need to modify for your own DNS setup:
---
additions:
- kind: dns#resourceRecordSet
  name: mq.example.com.
  rrdatas:
  - NEW_DATA
  ttl: 60
  type: A
deletions:
- kind: dns#resourceRecordSet
  name: mq.example.com.
  rrdatas:
  - OLD_DATA
  ttl: 60
  type: A
Update the DNS with the new external IP (if this initially fails it probably means you need to wait longer for the cluster to come up):
./dns.sh
Test that RabbitMQ connections are working (optional)
Install and run PerfTest (change username, password, hostname and virtualhost as needed in the URI):
kubectl run perf-test --image=pivotalrabbitmq/perf-test -- --uri "amqps://username:password@mq.example.com:5671/virtualhost"
Check logs to see if connections are successful (won’t work until PerfTest is fully deployed):
kubectl logs -f perf-test
Delete PerfTest:
kubectl delete pod perf-test
Check logs
Get the logs from the RabbitMQ cluster pod(s):
for pod in $(kubectl get pods | egrep "^rabbitmq-cluster-server" | awk '{print $1}') ; do kubectl logs $pod ; done
Deletion
If you need to delete RabbitMQ:
kubectl delete -f cluster.yml
kubectl delete configmap definitions
kubectl delete -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
Deploy app(s)
All the files for app setup can be found in the deploy-apps directory in this project on my GitHub repo.
Requirements
- GKE cluster set up, and kubectl installed and configured, as described above.
- cert-manager setup complete, as described above.
- RabbitMQ setup complete, as described above.
You will also, of course, need at least one app which talks to RabbitMQ. The app should be containerised, and the container image stored in Artifact Registry. If you’re unsure how to achieve this, refer to “Containerizing an app with Cloud Build” in this document, which should point you in the right direction for containerising your app using Docker (ignore the parts about creating the cluster and everything after that). It’s assumed here that there is a containerised app in Artifact Registry called my-app, so just change the names and settings for your app as needed.
Permissions
In order for GKE to pull the container images, a service account needs to exist with an associated Secret in Kubernetes, with the account given permission to access Artifact Registry via the Secret. I got the following script, permissions.sh, from Stack Overflow and made some additions, primarily to delete any existing keys before creating a new one.
(It’s simplest to assume we don’t have access to any previously-created private keys, since they cannot be re-downloaded, so we create a new key here to ensure we can create the Secret used to pull container images for the apps; and if we’re deploying apps on a new cluster with a new key, there’s no point keeping old keys on the service account.)
If you’ve been through this process before, the service account should already exist (unless it has been manually deleted or a new GCP project is in use), so the first part of the script – where it tries to create the service account – will likely produce an error. This is expected and nothing to worry about; the rest of the script should run correctly:
#!/bin/bash
# Change these as needed
project=myproject
repo=myrepo
location=europe-west2
# Service Account and Kubernetes Secret name
account=artifact-registry
# Email address of the Service Account
email=${account}@${project}.iam.gserviceaccount.com
# Create Service Account
gcloud iam service-accounts create ${account} \
--display-name="Read Artifact Registry" \
--description="Used by GKE to read Artifact Registry repos" \
--project=${project}
# Delete existing user keys
for key_id in $(gcloud iam service-accounts keys list --iam-account $email --managed-by=user | grep -v KEY_ID | awk '{print $1}') ; do
gcloud iam service-accounts keys delete $key_id --iam-account $email
done
# Move old key file out of the way if it exists
[ -f ${account}.json ] && mv -f ${account}.json ${account}.json.old
# Create new Service Account key
gcloud iam service-accounts keys create ${PWD}/${account}.json \
--iam-account=${email} \
--project=${project}
# Grant the Service Account the Artifact Registry reader role
gcloud projects add-iam-policy-binding ${project} \
--member=serviceAccount:${email} \
--role=roles/artifactregistry.reader
# Create a Kubernetes Secret representing the Service Account
kubectl create secret docker-registry ${account} \
--docker-server=https://${location}-docker.pkg.dev \
--docker-username=_json_key \
--docker-password="$(cat ${PWD}/${account}.json)" \
--docker-email=${email} \
--namespace=default
Run the script:
./permissions.sh
This will produce a file artifact-registry.json containing the private key of the service account; the file is therefore created with very restricted permissions and is listed in .gitignore so it’s not pushed to GitHub. It can be shared securely with other users if needed. If artifact-registry.json existed prior to running this script, it will have been moved to artifact-registry.json.old in case it’s needed for reference.
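To verify that the Kubernetes Secret was created and is available to be referenced as an imagePullSecret:
kubectl get secret artifact-registry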
To get a description of the service account and a list of its public keys, run the following (changing “project” to the name of your project):
gcloud iam service-accounts describe artifact-registry@project.iam.gserviceaccount.com
gcloud iam service-accounts keys list --iam-account artifact-registry@project.iam.gserviceaccount.com --managed-by=user
Deployment
my-app.yml is the manifest for deploying the app:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      imagePullSecrets:
      - name: artifact-registry
      containers:
      - name: my-app
        image: europe-west2-docker.pkg.dev/myproject/myrepo/my-app:latest
        imagePullPolicy: Always
Deploy your app:
kubectl apply -f my-app.yml
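To wait for the rollout to complete and check that the image was pulled successfully:
kubectl rollout status deployment/my-app
kubectl get pods | grep my-app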
Logs
To get the logs for the app pods in order to ensure they’re running correctly (change the egrep regex string as needed):
for pod in $(kubectl get pods | egrep '^my-app' | awk '{print $1}') ; do echo ; echo "=================================" ; echo $pod ; echo "=================================" ; kubectl logs $pod ; echo ; done
Deletion
kubectl delete -f my-app.yml
Final thoughts
I hope this helps if you’re looking to better understand how to set up RabbitMQ securely within Kubernetes/GKE using cert-manager and ingress-nginx. Any feedback would be very welcome. At some point I intend to revisit this to improve its use of variables and secrets, and ideally to bring more of it into Terraform for better automation.
If you’re looking for further help with GCP, Kubernetes, Docker, RabbitMQ, nginx, clustering and containerisation, automation, or any other infrastructure, DevOps or SysAdmin issues, don’t hesitate to get in touch to discuss the services I provide.