I was brought in by a startup to set up their core infrastructure so that it worked as required and could be automated for safe, efficient provisioning and deployment. The key requirement was making RabbitMQ accept only secure, certificate-based connections – the AMQPS protocol rather than AMQP – for security and compliance reasons. This needed to run within a Kubernetes cluster, using StatefulSets for storage and shared state, for ease of scaling and deployment, and for general flexibility. It also had to be built on GCP (Google Cloud Platform), which the startup already used and didn’t want to move away from at this stage, so GKE (Google Kubernetes Engine) was the choice for the Kubernetes cluster.
Getting certificates for use with RabbitMQ within Kubernetes required the setup of cert-manager for certificate management, which in turn needed ingress-nginx to allow incoming connections for Let’s Encrypt verification so that certificates could be issued.
I successfully solved the problems and fulfilled the requirements. It’s still a “work in progress” to some extent: some of the config is a little “rough and ready” and could be improved with more modularisation and better use of variables and secrets, and while the initial cluster provisioning is fully automated with Terraform, the rest is currently only semi-automated. So there is room for further improvement.
All the code and documentation is available in my GitHub repository. Below I will explain the whole process from start to finish.
Provision the Kubernetes cluster in GKE
Requirements
Local setup
- gcloud SDK installed and initialised (instructions here).
- kubectl installed (instructions here).
- Terraform installed (instructions here).
If you have multiple gcloud SDK projects/configurations set up, you must remember to switch to the correct configuration in the gcloud SDK before doing anything else, otherwise catastrophe could ensue. (Replace “test” with the name of your desired configuration.):
gcloud config configurations activate test
To check details of the current configuration:
gcloud config list
GCS buckets
A GCS (Google Cloud Storage) bucket needs to exist for remote Terraform state/lock management. Bucket names are globally unique rather than scoped to a project, so it’s best to identify the bucket accordingly. The bucket is currently named iac-state, but this should be changed to a more meaningful name to avoid confusion between different projects used by the same account.
When creating the GCS bucket, location type can be single region to save costs (europe-west2 in this case, but change that if needed), storage class should be standard, public access should be prevented, access control can be uniform, and object versioning should be switched on (default values should be fine).
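If you prefer the command line to the console, here is a sketch of roughly equivalent gsutil commands (adjust the bucket name and region as needed):
# Single-region Standard-class bucket with uniform access control
gsutil mb -c standard -l europe-west2 -b on gs://iac-state
# Prevent public access and enable object versioning
gsutil pap set enforced gs://iac-state
gsutil versioning set on gs://iac-state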
Terraform details
The Terraform files can be found in the provision-cluster directory in this project on my GitHub repo.
terraform.tfvars is where you set basic variables:
# Set this to correct Project ID
project_id = "000000000000"
# Change the zone as preferred
zone = "europe-west2-a"
versions.tf contains the Terraform provider setup and also the details of the GCS bucket backend for state management and locking:
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "3.52.0"
    }
  }

  required_version = ">= 0.14"

  backend "gcs" {
    # Change this to the bucket used to maintain state/lock
    bucket = "iac-state"
    prefix = "provision-cluster"
  }
}
vpc.tf sets up the VPC and other general parameters:
variable "project_id" {
description = "project id"
}
variable "zone" {
description = "zone"
}
provider "google" {
project = var.project_id
}
resource "google_compute_network" "vpc" {
name = "test-vpc"
}
cluster.tf is where most of the exciting stuff happens, i.e. the actual cluster and node pool creation:
# Only one node for now to save costs - increase as needed
variable "num_nodes" {
  default     = 1
  description = "number of nodes"
}

# Create cluster and remove default node pool
resource "google_container_cluster" "test" {
  name                     = "test-cluster"
  location                 = var.zone
  remove_default_node_pool = true
  initial_node_count       = 1
  network                  = google_compute_network.vpc.name
}

# Create node pool for cluster
resource "google_container_node_pool" "test_nodes" {
  name       = "test-nodes"
  location   = var.zone
  cluster    = google_container_cluster.test.name
  node_count = var.num_nodes

  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]

    labels = {
      env = "test"
    }

    # e2-standard-2 seems to be the minimum required by the RabbitMQ cluster
    # - change as needed
    machine_type = "e2-standard-2"
    tags         = ["test-node"]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}
outputs.tf defines output variables for use by later commands and processes:
output "zone" {
value = var.zone
description = "Zone"
}
output "project_id" {
value = var.project_id
description = "Project ID"
}
output "kubernetes_cluster_name" {
value = google_container_cluster.test.name
description = "Cluster Name"
}
output "kubernetes_cluster_host" {
value = google_container_cluster.test.endpoint
description = "Cluster Host"
}
Usage
N.B. Be very careful when applying or destroying Terraform configuration as these commands have the potential to break things on a massive scale if you make a mistake. Always check that you are using the correct GCP project before you begin (this should have been done above during the gcloud SDK initialisation).
Terraform state and locking are shared remotely via a GCS bucket, which should prevent more than one person from making changes at any given time, and should also ensure everyone is always working with the current state rather than a potentially out-of-date (and therefore dangerous) local copy.
Initialise Terraform:
terraform init
See what Terraform will do if you apply the current configuration:
terraform plan
Apply the current configuration:
terraform apply
Destroy the current configuration:
terraform destroy
Configure kubectl
Run this command to configure kubectl with access credentials. This is needed before you can run the kubectl commands for setting up cert-manager, RabbitMQ, etc.:
gcloud container clusters get-credentials $(terraform output -raw kubernetes_cluster_name) --zone $(terraform output -raw zone)
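As a quick sanity check that kubectl is now pointing at the right cluster (with the node pool above you should see a single node):
kubectl cluster-info
kubectl get nodes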
Deploy Kubernetes Dashboard
If you also need to deploy the Kubernetes Dashboard, perform the following procedure.
Deploy the Kubernetes Dashboard and create a proxy server to access the Dashboard:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-beta8/aio/deploy/recommended.yaml
kubectl proxy
This will keep running until stopped with CTRL-C, so open a new terminal tab/window and create the ClusterRoleBinding resource:
kubectl apply -f https://raw.githubusercontent.com/hashicorp/learn-terraform-provision-gke-cluster/master/kubernetes-dashboard-admin.rbac.yaml
Then create a token to log in to the Dashboard as an admin user:
ADMIN_USER_TOKEN_NAME=$(kubectl -n kube-system get secret | grep admin-user-token | cut -d' ' -f1)
ADMIN_USER_TOKEN_VALUE=$(kubectl -n kube-system get secret "$ADMIN_USER_TOKEN_NAME" -o jsonpath='{.data.token}' | base64 --decode)
echo "$ADMIN_USER_TOKEN_VALUE"
Open the Kubernetes Dashboard in your browser via the proxy at http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/, choose to log in with a token, then copy/paste the output from the above commands.
Set up cert-manager with ingress-nginx
All the files for cert-manager/ingress-nginx setup can be found in the cert-manager directory in this project on my GitHub repo.
Requirements
- User needs Kubernetes Engine Admin permissions in IAM.
- GKE cluster set up, and kubectl installed and configured, as described above.
- Domain set up in Cloud DNS.
ingress-nginx
Deploy the ingress-nginx controller, using the deployment manifest provided by Kubernetes:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/cloud/deploy.yaml
I wrote a Bash script dns.sh to update the DNS for the ingress, which you’ll need to modify as needed:
#!/bin/bash
# Change these as needed
# and also in the dns.yml.master file
dns_zone=example-com
dns_name=mq.example.com
file=dns.yml
old_data=$(dig +short -t a $dns_name)
new_data=$(kubectl get svc --namespace=ingress-nginx | grep LoadBalancer | awk '{print $4}')
[ -f $file ] && rm -f $file
cp -f ${file}.master $file
sed -i "s/OLD_DATA/${old_data}/" $file
sed -i "s/NEW_DATA/${new_data}/" $file
gcloud dns record-sets transaction execute --zone=${dns_zone} --transaction-file=dns.yml
[ $? -eq 0 ] && echo "IP updated from $old_data to $new_data"
The DNS script references a YAML file dns.yml.master which you’ll also need to modify for your own DNS setup:
---
additions:
- kind: dns#resourceRecordSet
  name: mq.example.com.
  rrdatas:
  - NEW_DATA
  ttl: 60
  type: A
deletions:
- kind: dns#resourceRecordSet
  name: mq.example.com.
  rrdatas:
  - OLD_DATA
  ttl: 60
  type: A
Then update the DNS:
./dns.sh
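To confirm that the record now points at the ingress controller’s external IP (the service name below is the one created by the ingress-nginx cloud manifest), compare these two values once the change has propagated:
dig +short -t a mq.example.com
kubectl get svc -n ingress-nginx ingress-nginx-controller -o jsonpath='{.status.loadBalancer.ingress[0].ip}'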
cert-manager
Deploy the cert-manager controller, using the deployment manifest provided by Jetstack:
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.6.1/cert-manager.yaml
This will take some time, so if you initially get errors with the following commands, wait a little while and try again.
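One way to check progress is to list the cert-manager pods, or to block until its Deployments report as available (adjust the timeout to taste):
kubectl -n cert-manager get pods
kubectl -n cert-manager wait --for=condition=Available deployment --all --timeout=180s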
issuer.yml is a manifest file to create the Let’s Encrypt certificate issuer in cert-manager:
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Change this
    email: admin@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - selector: {}
      http01:
        ingress:
          class: nginx
Create the issuer:
kubectl create -f issuer.yml
(There is also issuer-staging.yml for using the Let’s Encrypt staging API instead of prod, if needed.)
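To confirm that the issuer has registered successfully with the ACME server (the Ready condition should become True):
kubectl get issuer letsencrypt-prod
kubectl describe issuer letsencrypt-prod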
ingress.yml is a manifest file for the ingress, which requests the certificate, uses ingress-nginx so Let’s Encrypt can verify via incoming HTTP connection, then issues the certificate:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: nginx-ingress-cert-manager
  annotations:
    #cert-manager.io/issuer: "letsencrypt-staging"
    cert-manager.io/issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    # Change this
    - mq.example.com
    secretName: rabbitmq-tls
  rules:
  # Change this
  - host: mq.example.com
    http:
      paths:
      - backend:
          serviceName: nginx-ingress-backend
          servicePort: 80
So, deploy ingress and request certificate:
kubectl apply -f ingress.yml
Checking the certificate
Get certificate request status/details:
kubectl describe certificaterequest $(kubectl describe certificate rabbitmq-tls | grep -i request | awk -F '"' '{print $2}')
Get certificate details:
kubectl describe certificate rabbitmq-tls
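You can also inspect the certificate stored in the Secret itself, which is handy for checking the expiry date (assuming openssl is available locally):
kubectl get secret rabbitmq-tls -o jsonpath='{.data.tls\.crt}' | base64 --decode | openssl x509 -noout -subject -dates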
Renewing the certificate
An email should be received at the address specified in issuer.yml when it is time to renew the certificate. It may also be wise to set a timed reminder. If unsure, the certificate status can be checked with the commands above.
The simplest option for renewing is to delete the existing certificate and create a new one. This is likely to create a gap in the RabbitMQ service, so downtime should be scheduled if appropriate.
Firstly, update the DNS to point to the nginx ingress (oddly, this doesn’t always appear to be necessary when renewing, but it does always appear to be necessary when creating the initial certificate; I don’t yet have an explanation for the inconsistency, so I think it’s best to always do it to be on the safe side):
./dns.sh
Then delete and recreate the ingress, certificate and Secret:
kubectl delete -f ingress.yml
kubectl delete secret rabbitmq-tls
kubectl apply -f ingress.yml
Then cd to the rabbitmq-cluster folder and run the command there to revert the DNS back to the RabbitMQ service:
cd ../rabbitmq-cluster
./dns.sh
It appears that RabbitMQ simply uses the updated certificate from the Secret without needing to be restarted/reloaded, though this has not yet been extensively tested. Some errors will likely appear in the RabbitMQ and app logs while the certificate is being updated.
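One way to confirm which certificate RabbitMQ is actually serving after a renewal is to inspect the TLS handshake from outside the cluster (again assuming openssl is available; change the hostname as needed):
openssl s_client -connect mq.example.com:5671 -servername mq.example.com </dev/null 2>/dev/null | openssl x509 -noout -subject -dates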
Deletion
If you need to delete cert-manager and ingress-nginx:
kubectl delete -f ingress.yml
kubectl delete secret rabbitmq-tls
kubectl delete -f issuer.yml
kubectl delete -f https://github.com/jetstack/cert-manager/releases/download/v1.6.1/cert-manager.yaml
kubectl delete -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/cloud/deploy.yaml
Set up the RabbitMQ cluster
All the files for RabbitMQ setup can be found in the rabbitmq-cluster directory in this project on my GitHub repo.
Requirements
- User needs Kubernetes Engine Admin permissions in IAM.
- GKE cluster set up, and kubectl installed and configured, as described above.
- cert-manager setup complete, as described above.
- Domain set up in Cloud DNS.
Deployment
Deploy the RabbitMQ Cluster Operator, using the deployment manifest provided by RabbitMQ:
kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
(More details on installing the RabbitMQ Cluster Operator, if needed.)
Set up your RabbitMQ definitions.json file as needed. There’s one included here which gives an example of setting up a user and a vhost for app access:
{
  "users": [
    {
      "name": "username",
      "password": "password",
      "tags": "administrator",
      "limits": {}
    }
  ],
  "vhosts": [
    {
      "name": "virtualhost"
    }
  ],
  "permissions": [
    {
      "user": "username",
      "vhost": "virtualhost",
      "configure": ".*",
      "write": ".*",
      "read": ".*"
    }
  ]
}
Create a ConfigMap from the RabbitMQ definitions, to set up the user and vhost etc.:
kubectl create configmap definitions --from-file=definitions.json
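To double-check that the ConfigMap contains what you expect before deploying the cluster:
kubectl describe configmap definitions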
cluster.yml is a manifest file for deploying the Workloads and Services with the certificate previously created via cert-manager, and with the ConfigMap mounted to import definitions:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-cluster
spec:
  replicas: 1
  service:
    type: LoadBalancer
  tls:
    secretName: rabbitmq-tls
    disableNonTLSListeners: true
  # Volume for importing RabbitMQ config definitions
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
            - name: rabbitmq
              volumeMounts:
              - mountPath: /definitions/
                name: definitions
            volumes:
            - name: definitions
              configMap:
                name: definitions
  rabbitmq:
    additionalConfig: |
      load_definitions = /definitions/definitions.json
So let’s apply the cluster manifest:
kubectl apply -f cluster.yml
(Further instructions for deploying/using the cluster etc., if needed.)
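To check that the RabbitMQ cluster has come up and that its LoadBalancer service has been assigned an external IP:
kubectl get rabbitmqclusters
kubectl get pods | grep rabbitmq-cluster
kubectl get svc rabbitmq-cluster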
Update DNS
I wrote a Bash script dns.sh to point the DNS at the RabbitMQ service’s external IP, which you’ll need to modify as needed:
#!/bin/bash
# Change these as needed
# and also in the dns.yml.master file
dns_zone=example-com
dns_name=mq.example.com
file=dns.yml
old_data=$(dig +short -t a $dns_name)
new_data=$(kubectl get svc --namespace=default | egrep "^rabbitmq-cluster[ ]+LoadBalancer" | awk '{print $4}')
[ -f $file ] && rm -f $file
cp -f ${file}.master $file
sed -i "s/OLD_DATA/${old_data}/" $file
sed -i "s/NEW_DATA/${new_data}/" $file
gcloud dns record-sets transaction execute --zone=${dns_zone} --transaction-file=dns.yml
[ $? -eq 0 ] && echo "IP updated from $old_data to $new_data"
The DNS script references a YAML file dns.yml.master which you’ll also need to modify for your own DNS setup:
---
additions:
- kind: dns#resourceRecordSet
  name: mq.example.com.
  rrdatas:
  - NEW_DATA
  ttl: 60
  type: A
deletions:
- kind: dns#resourceRecordSet
  name: mq.example.com.
  rrdatas:
  - OLD_DATA
  ttl: 60
  type: A
Update the DNS with the new external IP (if this initially fails it probably means you need to wait longer for the cluster to come up):
./dns.sh
Test that RabbitMQ connections are working (optional)
Install and run PerfTest (change username, password, hostname and virtualhost as needed in the URI):
kubectl run perf-test --image=pivotalrabbitmq/perf-test -- --uri "amqps://username:password@mq.example.com:5671/virtualhost"
Check logs to see if connections are successful (won’t work until PerfTest is fully deployed):
kubectl logs -f perf-test
Delete PerfTest:
kubectl delete pod perf-test
Check logs
Get the logs from the RabbitMQ cluster pod(s):
for pod in $(kubectl get pods | egrep "^rabbitmq-cluster-server" | awk '{print $1}') ; do kubectl logs $pod ; done
Deletion
If you need to delete RabbitMQ:
kubectl delete -f cluster.yml
kubectl delete configmap definitions
kubectl delete -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
Deploy app(s)
All the files for app setup can be found in the deploy-apps directory in this project on my GitHub repo.
Requirements
- GKE cluster set up, and kubectl installed and configured, as described above.
- cert-manager setup complete, as described above.
- RabbitMQ setup complete, as described above.
You will also, of course, need at least one app which talks to RabbitMQ. The app should be containerised, and the container image stored in Artifact Registry. If you’re unsure how to achieve this, refer to “Containerizing an app with Cloud Build” in this document, which should point you in the right direction for containerising your app using Docker (ignore the parts about creating the cluster and everything after that). It’s assumed here that there is a containerised app in Artifact Registry called my-app, so just change the names and settings for your app as needed.
Permissions
In order for GKE to pull the container images, a service account needs to exist with an associated Secret in Kubernetes, with the account given permission to access Artifact Registry via the Secret. I got the following script, permissions.sh, from Stack Overflow and made some additions, primarily to delete any existing keys before creating a new one.
(It’s simplest to assume we don’t have access to any previously-created private keys, since they cannot be re-downloaded, so we create a new key here to ensure we can create the Secret used to pull container images for the apps; and if we’re deploying apps on a new cluster with a new key, there’s no point keeping old keys on the service account.)
If you’ve been through this process before, the service account should already exist (unless it has been manually deleted or a new GCP project is in use), so the first part of the script – where it tries to create the service account – will likely produce an error. This is expected and nothing to worry about; the rest of the script should run correctly:
#!/bin/bash
# Change these as needed
project=myproject
repo=myrepo
location=europe-west2
# Service Account and Kubernetes Secret name
account=artifact-registry
# Email address of the Service Account
email=${account}@${project}.iam.gserviceaccount.com
# Create Service Account
gcloud iam service-accounts create ${account} \
--display-name="Read Artifact Registry" \
--description="Used by GKE to read Artifact Registry repos" \
--project=${project}
# Delete existing user keys
for key_id in $(gcloud iam service-accounts keys list --iam-account $email --managed-by=user | grep -v KEY_ID | awk '{print $1}') ; do
gcloud iam service-accounts keys delete $key_id --iam-account $email
done
# Move old key file out of the way if it exists
[ -f ${account}.json ] && mv -f ${account}.json ${account}.json.old
# Create new Service Account key
gcloud iam service-accounts keys create ${PWD}/${account}.json \
--iam-account=${email} \
--project=${project}
# Grant the Service Account the Artifact Registry reader role
gcloud projects add-iam-policy-binding ${project} \
--member=serviceAccount:${email} \
--role=roles/artifactregistry.reader
# Create a Kubernetes Secret representing the Service Account
kubectl create secret docker-registry ${account} \
--docker-server=https://${location}-docker.pkg.dev \
--docker-username=_json_key \
--docker-password="$(cat ${PWD}/${account}.json)" \
--docker-email=${email} \
--namespace=default
Run the script:
./permissions.sh
This will produce a file artifact-registry.json containing the private key of the service account; the file is therefore created with very restricted permissions and is listed in .gitignore so it’s not pushed to GitHub. It can be shared securely with other users if needed. If artifact-registry.json existed prior to running this script, it will have been moved to artifact-registry.json.old in case it’s needed for reference.
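To verify that the Kubernetes Secret was created and is available to be referenced as an imagePullSecret:
kubectl get secret artifact-registry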
To get a description of the service account and a list of its public keys, run the following (changing “project” to the name of your project):
gcloud iam service-accounts describe artifact-registry@project.iam.gserviceaccount.com
gcloud iam service-accounts keys list --iam-account artifact-registry@project.iam.gserviceaccount.com --managed-by=user
Deployment
my-app.yml is the manifest for deploying the app:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      imagePullSecrets:
      - name: artifact-registry
      containers:
      - name: my-app
        image: europe-west2-docker.pkg.dev/myproject/myrepo/my-app:latest
        imagePullPolicy: Always
Deploy your app:
kubectl apply -f my-app.yml
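To wait for the rollout to complete and check that the image was pulled successfully:
kubectl rollout status deployment/my-app
kubectl get pods | grep my-app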
Logs
To get the logs for the app pods in order to ensure they’re running correctly (change the egrep regex string as needed):
for pod in $(kubectl get pods | egrep '^my-app' | awk '{print $1}') ; do echo ; echo "=================================" ; echo $pod ; echo "=================================" ; kubectl logs $pod ; echo ; done
Deletion
kubectl delete -f my-app.yml
Final thoughts
I hope this helps if you’re looking to better understand how to set up RabbitMQ securely within Kubernetes/GKE using cert-manager and ingress-nginx. Any feedback would be very welcome. At some point I intend to revisit this to improve its use of variables and secrets, and ideally to bring more of it into Terraform for better automation.
If you’re looking for further help with GCP, Kubernetes, Docker, RabbitMQ, nginx, clustering and containerisation, automation, or any other infrastructure, DevOps or SysAdmin issues, don’t hesitate to get in touch to discuss the services I provide.