
Kubernetes High Availability Deployment

Deploy Proxus on Kubernetes with HA PostgreSQL, replicated ClickHouse, NATS JetStream, active-active UI, and singleton gateways.

This guide explains how to deploy Proxus on Kubernetes for a high availability production topology. Use it when Docker Compose is no longer enough for your availability, update, and operations requirements.

warning
Production cluster required

The manifests can be tested on a local Kubernetes environment, but a real HA production deployment requires a multi-node Kubernetes cluster, reliable persistent storage, monitoring, and a backup policy. A single-node cluster can prove the topology, but it cannot survive node failure.

What This Deployment Provides

The HA deployment separates Proxus into a data layer and an application layer:

| Layer | Component | HA behavior |
| --- | --- | --- |
| Data | PostgreSQL | 3-instance cluster with one read-write endpoint and automatic failover |
| Data | ClickHouse Keeper | 3-pod coordination quorum |
| Data | ClickHouse | 1 shard with 2 replicas for replicated telemetry storage |
| Messaging | NATS JetStream | 3-node message hub with three-replica streams |
| Application | Proxus UI | 2 replicas spread across nodes behind a Kubernetes Service |
| Edge | Proxus Gateway | 1 pod per GatewayID, restarted by Kubernetes if it fails |
| Operations | Pod Disruption Budgets | Drain protection on NATS and UI |
| Operations | Prometheus metrics | NATS exporter sidecar on every NATS pod |
| Operations | Kubernetes Secrets | NATS and PostgreSQL credentials kept out of plaintext manifests |
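The drain protection in the table can be pictured as a PodDisruptionBudget that refuses to evict below quorum. This is a minimal sketch, assuming the `app.kubernetes.io/name=hub-server` label used elsewhere in this guide; the shipped `ha/pod-disruption-budgets.yaml` may use different names and values:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hub-server-pdb          # hypothetical name
  namespace: proxus-ha
spec:
  minAvailable: 2               # keep NATS quorum: at most one pod down per voluntary drain
  selector:
    matchLabels:
      app.kubernetes.io/name: hub-server
```

With `minAvailable: 2` on a 3-pod cluster, a node drain can take down at most one NATS pod at a time, so JetStream quorum survives rolling maintenance.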
info
Gateway scaling rule

Gateways are singleton workloads. Do not increase replicas for the same GatewayID. To add capacity for separate sites or lines, create another gateway deployment with a different GatewayID.

Quick HA Deploy

You need a Kubernetes cluster, a default StorageClass, kubectl access, permission to install operators, a Proxus license, and access to the Proxus container images used by your release.

The full HA package brings up:

  • 3 PostgreSQL instances managed by CloudNativePG
  • 3 ClickHouse Keeper pods with a 2-replica ClickHouse cluster
  • 3 NATS pods running JetStream with three-replica streams
  • 2 Proxus UI pods behind a sticky-session ingress
  • 1 Proxus Gateway pod
  • Pod Disruption Budgets, Prometheus metrics, and credential Secrets
warning
Defaults are for first-run only

The package ships with default credentials in two Kubernetes Secrets (nats-credentials, postgres-credentials). Use the Quick Deploy below to get a working cluster in minutes, then change those defaults before exposing the environment to anyone outside your team. See Security and Secrets for the rotation procedure.

Run the full HA deployment:

mkdir -p proxus-kubernetes-ha
cd proxus-kubernetes-ha
curl -L https://proxus.io/deployment/kubernetes-ha/proxus-kubernetes-ha.tar.gz | tar xz

kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.29/releases/cnpg-1.29.0.yaml
kubectl -n cnpg-system rollout status deployment/cnpg-controller-manager --timeout=300s

curl -s https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator-web-installer/clickhouse-operator-install.sh | OPERATOR_NAMESPACE=clickhouse bash
kubectl -n clickhouse rollout status deployment/clickhouse-operator --timeout=300s

kubectl apply -k ha-data
kubectl -n proxus-data wait --for=jsonpath='{.status.phase}'='Cluster in healthy state' cluster/postgresql-ha --timeout=600s
kubectl -n clickhouse wait --for=jsonpath='{.status.status}'=Completed chi/clickhouse-ha --timeout=600s
kubectl -n clickhouse wait --for=condition=complete job/clickhouse-bootstrap --timeout=180s

kubectl apply -k ha
kubectl -n proxus-ha rollout status statefulset/hub-server --timeout=300s
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s
kubectl -n proxus-ha rollout status deployment/proxus-server-gateway-1 --timeout=300s

Total time on a healthy cluster is around 5 to 10 minutes, dominated by CloudNativePG provisioning the PostgreSQL primary and the ClickHouse Keeper quorum forming.

Verify the Install

Confirm the cluster came up healthy and that JetStream streams are running with three replicas:

kubectl -n proxus-ha get pods,pdb
kubectl run -n proxus-ha nats-tmp --rm -i --restart=Never --image=natsio/nats-box:0.18.0 -- \
  nats --server="nats://acc:acc@hub-server:4222" stream report

Every JetStream stream in the report should show R=3. If any stream shows R=1, see Troubleshooting.

First Login

Open http://127.0.0.1:30080/ in your browser. In a remote cluster, replace 127.0.0.1 with a reachable Kubernetes node address or use the included ingress behind your load balancer.

Default login:

  • Username: Admin
  • Password: leave blank, unless your environment has already changed it

After the first login, activate your Proxus license from the UI. A fresh HA PostgreSQL cluster does not contain license data from an earlier Docker or local test database unless you restored that database.

Before Going to Production

Run through the Security and Secrets procedure to rotate the default NATS and PostgreSQL credentials, then walk the Production Checklist.

Check the Deployment

Use these commands when the quick deploy finishes or if one of the wait commands times out:

kubectl -n proxus-data get cluster,pods,svc,pvc
kubectl -n clickhouse get pods,svc,pvc,chi
kubectl -n proxus-ha get pods,svc,endpoints

Expected shape:

| Workload | Expected |
| --- | --- |
| hub-server | 3 NATS pods |
| proxus-ui | 2 UI pods |
| proxus-server-gateway-1 | 1 gateway pod |
| postgresql service | ExternalName to the PostgreSQL read-write endpoint |
| clickhouse service | ExternalName to the ClickHouse service |

For production, expose the UI through your ingress or load balancer. The included ingress example uses sticky sessions for browser sessions:

kubectl -n proxus-ha get ingress
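For orientation, a sticky-session ingress for ingress-nginx typically takes the following shape. This is a sketch, not the shipped `ha/ingress-nginx-sticky.yaml`: the host, cookie settings, and backend port are placeholders you must replace with your own values.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: proxus-ui                # hypothetical name
  namespace: proxus-ha
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "proxus-affinity"  # placeholder
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"          # placeholder
spec:
  ingressClassName: nginx
  rules:
    - host: proxus.example.com   # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: proxus-ui  # the UI Service from the package
                port:
                  number: 8080   # placeholder; use the port the UI Service exposes
```

The affinity cookie pins each browser session to one UI replica, which is why the guide recommends keeping sticky sessions enabled until you have validated the UI without them.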

Production Configuration

Before exposing the environment, review and change these values:

| File | What to review |
| --- | --- |
| ha/secrets.yaml | NATS account credentials and PostgreSQL credentials (do this first; see Security and Secrets) |
| ha-data/postgresql-ha.yaml | PostgreSQL storage size and instance count |
| ha-data/clickhouse-ha.yaml | ClickHouse password, storage size, Keeper and replica count |
| ha/proxus-ui.yaml | UI image tag, replica count, resource requests and limits |
| ha/proxus-gateway-1.yaml | Gateway image tag, GatewayID, exposed protocol ports, resource requests and limits |
| ha/nats-cluster.yaml | JetStream storage size, resource requests and limits |
| ha/pod-disruption-budgets.yaml | Minimum available pods during voluntary disruptions |
| ha/storage.yaml | Persistent volume size and access mode, especially RWX storage for shared UI volumes and nats-hub-config in multi-node production |
| ha/ingress-nginx-sticky.yaml | Host name, ingress class, sticky session settings |
warning
Use kustomize, not single-file apply

Always apply with kubectl apply -k ha (kustomize). Applying individual files with kubectl apply -f does not pick up the namespace declared in the kustomization and can create resources in the wrong namespace.
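The reason the warning holds is that the namespace lives in the kustomization, not in the individual manifests. As an illustrative sketch of the shape of `ha/kustomization.yaml` (resource file names taken from the table above; the shipped file may differ):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: proxus-ha             # injected into every resource by `kubectl apply -k`;
                                 # lost when files are applied individually with -f
resources:
  - secrets.yaml
  - nats-cluster.yaml
  - proxus-ui.yaml
  - proxus-gateway-1.yaml
  - pod-disruption-budgets.yaml
  - storage.yaml
  - ingress-nginx-sticky.yaml
```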

Security and Secrets

Default NATS and PostgreSQL credentials are stored in two Kubernetes Secrets in the proxus-ha namespace, not in plaintext manifest arguments.

| Secret | Keys |
| --- | --- |
| nats-credentials | users-user, users-password, sys-user, sys-password |
| postgres-credentials | user, password, database |

The NATS server reads these values at startup through environment-variable interpolation in accounts.conf. UI and Gateway pods receive the same values as environment variables, and the connection arguments use $(VAR_NAME) substitution at pod start.
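As an illustration of that wiring, the Secret and a consuming container look roughly like the following. The key names come from the table above; the default values, environment variable names, and the connection argument are hypothetical stand-ins for whatever the shipped manifests actually use:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: nats-credentials
  namespace: proxus-ha
type: Opaque
stringData:
  users-user: acc                # first-run default shown here; rotate before production
  users-password: acc
  sys-user: sys                  # placeholder value
  sys-password: sys              # placeholder value
---
# Fragment of a pod spec showing $(VAR_NAME) substitution at container start.
# Variable and argument names are illustrative, not the shipped ones.
containers:
  - name: proxus-ui
    env:
      - name: NATS_USER
        valueFrom:
          secretKeyRef:
            name: nats-credentials
            key: users-user
      - name: NATS_PASSWORD
        valueFrom:
          secretKeyRef:
            name: nats-credentials
            key: users-password
    args:
      - "--nats-url=nats://$(NATS_USER):$(NATS_PASSWORD)@hub-server:4222"  # hypothetical flag
```

Because the credentials are resolved from the Secret at pod start, editing the Secret and restarting the workloads (next section) rotates them without touching any manifest.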

Rotate values without editing manifests:

kubectl -n proxus-ha edit secret nats-credentials
kubectl -n proxus-ha edit secret postgres-credentials
kubectl -n proxus-ha rollout restart statefulset/hub-server
kubectl -n proxus-ha rollout restart deployment/proxus-ui deployment/proxus-server-gateway-1

Before you change postgres-credentials, make sure the new values are also reflected in the PostgreSQL cluster definition (ha-data/postgresql-ha.yaml). ClickHouse credentials are managed in ha-data/clickhouse-ha.yaml and do not yet flow through postgres-credentials.

Observability

Each NATS pod runs a sidecar that exposes Prometheus-format metrics on port 7777. A headless service publishes the endpoint with standard scrape annotations:

kubectl -n proxus-ha get svc hub-server-metrics
kubectl -n proxus-ha get pods -l app.kubernetes.io/name=hub-server -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'

Point your Prometheus instance at the headless service. Useful metrics include JetStream stream and consumer counts, route status, and per-stream message throughput.
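The "standard scrape annotations" on the metrics service conventionally look like the sketch below. Whether they are honored depends on your Prometheus relabeling configuration, so treat this as an illustration rather than the shipped manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hub-server-metrics
  namespace: proxus-ha
  annotations:
    prometheus.io/scrape: "true" # conventional annotation, picked up by common relabel configs
    prometheus.io/port: "7777"   # NATS exporter sidecar port
spec:
  clusterIP: None                # headless: one endpoint per NATS pod, so each pod is scraped
  selector:
    app.kubernetes.io/name: hub-server
  ports:
    - name: metrics
      port: 7777
```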

Scaling Rules

UI

The UI can be scaled horizontally:

kubectl -n proxus-ha scale deployment/proxus-ui --replicas=3
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s

Keep sticky sessions enabled at the ingress layer unless your environment has been validated without them.

Gateways

Do not scale a gateway deployment above one replica:

kubectl -n proxus-ha scale deployment/proxus-server-gateway-1 --replicas=1

To add another gateway, create a separate deployment with a different GatewayID.
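In practice a second gateway is a copy of the first deployment with a new name, label, and GatewayID. The sketch below assumes the labels used elsewhere in this guide; the image name and the way the GatewayID is passed are hypothetical, so mirror whatever `ha/proxus-gateway-1.yaml` actually does:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: proxus-server-gateway-2
  namespace: proxus-ha
spec:
  replicas: 1                    # singleton rule: never scale above 1 per GatewayID
  selector:
    matchLabels:
      app.kubernetes.io/name: proxus-server
      proxus.io/gateway-id: "2"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: proxus-server
        proxus.io/gateway-id: "2"
    spec:
      containers:
        - name: proxus-server
          image: proxus/proxus-server:latest   # hypothetical image; use your release tag
          args:
            - "--GatewayID=2"    # hypothetical flag; match how gateway-1 sets its ID
```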

PostgreSQL

PostgreSQL uses one writable primary. The application connection points to the read-write service:

postgresql-ha-rw.proxus-data.svc.cluster.local:5432

Use the read-only service only for workloads that are explicitly safe to run against replicas:

postgresql-ha-ro.proxus-data.svc.cluster.local:5432

ClickHouse

The package starts with one shard and two replicas. This gives replica availability. For higher analytics scale, plan shard count, table engines, and distributed table strategy before increasing shards.

Updating Proxus

  1. Take backups or verify recent successful backups.
  2. Edit the image tag in ha/proxus-ui.yaml and ha/proxus-gateway-1.yaml.
  3. Apply the manifests:
kubectl apply -k ha
  4. Watch the rollout:
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s
kubectl -n proxus-ha rollout status deployment/proxus-server-gateway-1 --timeout=300s

UI updates are rolling. Gateway updates are singleton updates and can briefly disconnect devices attached to that gateway.

Upgrade NATS Control-Plane Streams

If you are upgrading from an earlier HA package, check for the legacy license_stream. Proxus uses license and license.health as core NATS request/reply subjects, so they should not be backed by a JetStream stream.

nats stream info license_stream
nats stream rm license_stream --force

Only remove it after confirming that your current package version no longer creates license_stream and that there are no custom consumers attached to it.

Updating Operators

Treat operator updates as infrastructure maintenance:

  1. Read the operator release notes.
  2. Confirm backups exist.
  3. Apply the operator upgrade in a maintenance window.
  4. Verify database and ClickHouse cluster health.

PostgreSQL:

kubectl -n proxus-data get cluster,pods

ClickHouse:

kubectl -n clickhouse get chi,pods

Backup and Restore

PostgreSQL

Use your storage snapshot system or CloudNativePG backup configuration. At minimum, verify:

kubectl -n proxus-data get cluster postgresql-ha
kubectl -n proxus-data get pvc

For production, configure scheduled backups and test restore into a separate namespace before relying on the backup plan.

ClickHouse

Use storage snapshots or a ClickHouse backup process that covers all replicas and Keeper metadata requirements. Verify both data PVCs and Keeper PVCs are protected:

kubectl -n clickhouse get pvc

Proxus Configuration and Modules

Back up the Proxus UI and gateway PVCs:

kubectl -n proxus-ha get pvc

The important volumes are:

  • UI config
  • UI modules
  • each gateway config volume
  • NATS JetStream data

Failure Testing

Run failure tests during a maintenance window.

UI Pod Failure

kubectl -n proxus-ha delete pod -l app.kubernetes.io/name=proxus-ui
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s

The service should continue routing to a healthy UI pod while Kubernetes recreates the deleted pod.

Gateway Pod Failure

kubectl -n proxus-ha delete pod -l app.kubernetes.io/name=proxus-server,proxus.io/gateway-id=1
kubectl -n proxus-ha rollout status deployment/proxus-server-gateway-1 --timeout=300s

Connected devices may briefly disconnect and should reconnect after the replacement pod starts.

PostgreSQL Primary Failure

Use the PostgreSQL operator's documented switchover or failover procedure. After failover, verify:

kubectl -n proxus-data get cluster postgresql-ha
kubectl -n proxus-ha logs deploy/proxus-ui --tail=100

ClickHouse Replica Failure

kubectl -n clickhouse delete pod chi-clickhouse-ha-proxus-0-0-0
kubectl -n clickhouse get pods,chi

The remaining replica should stay available while the deleted pod is recreated.

Troubleshooting

UI Does Not Open

Check the service and pods:

kubectl -n proxus-ha get pods,svc,endpoints
kubectl -n proxus-ha logs deploy/proxus-ui --tail=100

The HA package exposes the UI service as NodePort 30080. Open http://127.0.0.1:30080/ locally, or replace 127.0.0.1 with a reachable node address in a remote cluster.

Login Page Opens but Gateway Is Not Active

Check the gateway logs:

kubectl -n proxus-ha logs deploy/proxus-server-gateway-1 --tail=150

If the license is not activated in the current database, activate it from the UI.

PostgreSQL Is Not Ready

kubectl -n proxus-data get cluster,pods,pvc
kubectl -n proxus-data describe cluster postgresql-ha

Common causes:

  • storage class cannot provision volumes
  • database image cannot be pulled
  • cluster has insufficient CPU or memory
  • pod anti-affinity cannot be satisfied in a small cluster

ClickHouse Is Not Ready

kubectl -n clickhouse get chi,pods,svc,pvc
kubectl -n clickhouse logs deploy/clickhouse-operator --tail=150
kubectl -n clickhouse logs statefulset/clickhouse-keeper --tail=150

Common causes:

  • ClickHouse image cannot be pulled
  • Keeper quorum is not healthy
  • PVCs are pending
  • the bootstrap job has not completed

NATS Is Not Ready

kubectl -n proxus-ha get pods -l app.kubernetes.io/name=hub-server
kubectl -n proxus-ha logs statefulset/hub-server --tail=150

Check that all three NATS pods are running and that their PVCs are bound.

Cleanup

For a test environment only:

kubectl delete -k ha
kubectl delete -k ha-data
warning
Data deletion

Deleting namespaces, PVCs, or storage snapshots can permanently delete production data. Do not run cleanup commands in production unless you have a tested restore path and an approved maintenance plan.

Production Checklist

Before going live:

  • Use a multi-node Kubernetes cluster.
  • Confirm storage classes and backup policies.
  • Use RWX-capable storage for shared UI volumes and nats-hub-config in multi-node production.
  • Change the default values in the nats-credentials and postgres-credentials Secrets.
  • Configure ingress TLS.
  • Confirm the UI is reachable through http://<node-address>:30080/ or through your ingress/load balancer.
  • Activate the Proxus license.
  • Keep each gateway deployment singleton.
  • Verify that JetStream streams run with three replicas (nats stream report should show every stream as R=3).
  • Verify Pod Disruption Budgets keep at least two NATS pods and one UI pod available during a node drain.
  • Confirm Prometheus is scraping the NATS metrics endpoint on port 7777.
  • Verify UI failover.
  • Verify PostgreSQL failover.
  • Verify ClickHouse replica recovery.
  • Verify NATS pod recovery.
  • Document your restore procedure.
  • Monitor pod health, storage usage, database health, and application logs.