
Kubernetes High Availability Deployment

Deploy Proxus on Kubernetes with HA PostgreSQL, replicated ClickHouse, NATS JetStream, active-active UI, and singleton gateways.

This guide explains how to deploy Proxus on Kubernetes for a high availability production topology. Use it when Docker Compose is no longer enough for your availability, update, and operations requirements.

warning
Production cluster required

The manifests can be tested on a local Kubernetes environment, but a real HA production deployment requires a multi-node Kubernetes cluster, reliable persistent storage, monitoring, and a backup policy. A single-node cluster can prove the topology, but it cannot survive node failure.

What This Deployment Provides

The HA deployment separates Proxus into a data layer and an application layer:

| Layer | Component | HA behavior |
| --- | --- | --- |
| Data | PostgreSQL | 3-instance cluster with one read-write endpoint and automatic failover |
| Data | ClickHouse Keeper | 3-pod coordination quorum |
| Data | ClickHouse | 1 shard with 2 replicas for replicated telemetry storage |
| Messaging | NATS JetStream | 3-node message hub with three-replica streams |
| Application | Proxus UI | 2 replicas spread across nodes behind a Kubernetes Service |
| Edge | Proxus Gateway | 1 pod per GatewayID, restarted by Kubernetes if it fails |
| Operations | Pod Disruption Budgets | Drain protection on NATS and UI |
| Operations | Prometheus metrics | NATS exporter sidecar on every NATS pod |
| Operations | Kubernetes Secrets | NATS and PostgreSQL credentials kept out of plaintext manifests |
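The drain protection in the table can be pictured as a PodDisruptionBudget that refuses to evict below quorum. This is a minimal sketch, assuming the `app.kubernetes.io/name=hub-server` label used elsewhere in this guide; the shipped `ha/pod-disruption-budgets.yaml` may use different names and values:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hub-server-pdb          # hypothetical name
  namespace: proxus-ha
spec:
  minAvailable: 2               # keep NATS quorum: at most one pod down per voluntary drain
  selector:
    matchLabels:
      app.kubernetes.io/name: hub-server
```

With `minAvailable: 2` on a 3-pod cluster, a node drain can take down at most one NATS pod at a time, so JetStream quorum survives rolling maintenance.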
info
Gateway scaling rule

Gateways are singleton workloads. Do not increase replicas for the same GatewayID. To add capacity for separate sites or lines, create another gateway deployment with a different GatewayID.

Quick HA Deploy

You need a Kubernetes cluster, a default StorageClass, kubectl access, permission to install operators, a Proxus license, and access to the Proxus container images used by your release.

The full HA package brings up:

  • 3 PostgreSQL instances managed by CloudNativePG
  • 3 ClickHouse Keeper pods with a 2-replica ClickHouse cluster
  • 3 NATS pods running JetStream with three-replica streams
  • 2 Proxus UI pods behind a sticky-session ingress
  • 1 Proxus Gateway pod
  • Pod Disruption Budgets, Prometheus metrics, and credential Secrets
warning
Defaults are for first-run only

The package ships with default credentials in two Kubernetes Secrets (nats-credentials, postgres-credentials). Use the Quick Deploy below to get a working cluster in minutes, then change those defaults before exposing the environment to anyone outside your team. See Security and Secrets for the rotation procedure.

Run the full HA deployment:

mkdir -p proxus-kubernetes-ha
cd proxus-kubernetes-ha
curl -L https://proxus.io/deployment/kubernetes-ha/proxus-kubernetes-ha.tar.gz | tar xz

kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.29/releases/cnpg-1.29.0.yaml
kubectl -n cnpg-system rollout status deployment/cnpg-controller-manager --timeout=300s

curl -s https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator-web-installer/clickhouse-operator-install.sh | OPERATOR_NAMESPACE=clickhouse bash
kubectl -n clickhouse rollout status deployment/clickhouse-operator --timeout=300s

kubectl apply -k ha-data
kubectl -n proxus-data wait --for=jsonpath='{.status.phase}'='Cluster in healthy state' cluster/postgresql-ha --timeout=600s
kubectl -n clickhouse wait --for=jsonpath='{.status.status}'=Completed chi/clickhouse-ha --timeout=600s
kubectl -n clickhouse wait --for=condition=complete job/clickhouse-bootstrap --timeout=180s

kubectl apply -k ha
kubectl -n proxus-ha rollout status statefulset/hub-server --timeout=300s
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s
kubectl -n proxus-ha rollout status deployment/proxus-server-gateway-1 --timeout=300s

Total time on a healthy cluster is around 5 to 10 minutes, dominated by CloudNativePG provisioning the PostgreSQL primary and the ClickHouse Keeper quorum forming.

Verify the Install

Confirm the cluster came up healthy and that JetStream streams are running with three replicas:

kubectl -n proxus-ha get pods,pdb
kubectl run -n proxus-ha nats-tmp --rm -i --restart=Never --image=natsio/nats-box:0.18.0 -- \
  nats --server="nats://acc:acc@hub-server:4222" stream report

Every JetStream stream in the report should show R=3. If any stream shows R=1, see Troubleshooting.

First Login

Open http://127.0.0.1:30080/ in your browser. In a remote cluster, replace 127.0.0.1 with a reachable Kubernetes node address or use the included ingress behind your load balancer.

Default login:

  • Username: Admin
  • Password: leave blank, unless your environment has already changed it

After the first login, activate your Proxus license from the UI. A fresh HA PostgreSQL cluster does not contain license data from an earlier Docker or local test database unless you restored that database.

Before Going to Production

Run through the Security and Secrets procedure to rotate the default NATS and PostgreSQL credentials, then walk the Production Checklist.

Check the Deployment

Use these commands when the quick deploy finishes or if one of the wait commands times out:

kubectl -n proxus-data get cluster,pods,svc,pvc
kubectl -n clickhouse get pods,svc,pvc,chi
kubectl -n proxus-ha get pods,svc,endpoints

Expected shape:

| Workload | Expected |
| --- | --- |
| hub-server | 3 NATS pods |
| proxus-ui | 2 UI pods |
| proxus-server-gateway-1 | 1 gateway pod |
| postgresql service | ExternalName to the PostgreSQL read-write endpoint |
| clickhouse service | ExternalName to the ClickHouse service |

For production, expose the UI through your ingress or load balancer. The included ingress example uses sticky sessions for browser sessions:

kubectl -n proxus-ha get ingress
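For orientation, a sticky-session ingress for ingress-nginx typically takes the following shape. This is a sketch, not the shipped `ha/ingress-nginx-sticky.yaml`: the host, cookie settings, and backend port are placeholders you must replace with your own values.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: proxus-ui                # hypothetical name
  namespace: proxus-ha
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "proxus-affinity"  # placeholder
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"          # placeholder
spec:
  ingressClassName: nginx
  rules:
    - host: proxus.example.com   # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: proxus-ui  # the UI Service from the package
                port:
                  number: 8080   # placeholder; use the port the UI Service exposes
```

The affinity cookie pins each browser session to one UI replica, which is why the guide recommends keeping sticky sessions enabled until you have validated the UI without them.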

Production Configuration

Before exposing the environment, review and change these values:

| File | What to review |
| --- | --- |
| ha/secrets.yaml | NATS account credentials and PostgreSQL credentials (do this first; see Security and Secrets) |
| ha-data/postgresql-ha.yaml | PostgreSQL storage size and instance count |
| ha-data/clickhouse-ha.yaml | ClickHouse password, storage size, Keeper and replica count |
| ha/proxus-ui.yaml | UI image tag, replica count, resource requests and limits |
| ha/proxus-gateway-1.yaml | Gateway image tag, GatewayID, exposed protocol ports, resource requests and limits |
| ha/nats-cluster.yaml | JetStream storage size, resource requests and limits |
| ha/pod-disruption-budgets.yaml | Minimum available pods during voluntary disruptions |
| ha/storage.yaml | Persistent volume size and access mode, especially RWX storage for shared UI volumes and nats-hub-config in multi-node production |
| ha/ingress-nginx-sticky.yaml | Host name, ingress class, sticky session settings |
warning
Use kustomize, not single-file apply

Always apply with kubectl apply -k ha (kustomize). Applying individual files with kubectl apply -f does not pick up the namespace declared in the kustomization and can create resources in the wrong namespace.
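The reason the warning holds is that the namespace lives in the kustomization, not in the individual manifests. As an illustrative sketch of the shape of `ha/kustomization.yaml` (resource file names taken from the table above; the shipped file may differ):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: proxus-ha             # injected into every resource by `kubectl apply -k`;
                                 # lost when files are applied individually with -f
resources:
  - secrets.yaml
  - nats-cluster.yaml
  - proxus-ui.yaml
  - proxus-gateway-1.yaml
  - pod-disruption-budgets.yaml
  - storage.yaml
  - ingress-nginx-sticky.yaml
```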

Security and Secrets

Default NATS and PostgreSQL credentials are stored in two Kubernetes Secrets in the proxus-ha namespace, not in plaintext manifest arguments.

| Secret | Keys |
| --- | --- |
| nats-credentials | users-user, users-password, sys-user, sys-password |
| postgres-credentials | user, password, database |

The NATS server reads these values at startup through environment-variable interpolation in accounts.conf. UI and Gateway pods receive the same values as environment variables, and the connection arguments use $(VAR_NAME) substitution at pod start.
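As an illustration of that wiring, the Secret and a consuming container look roughly like the following. The key names come from the table above; the default values, environment variable names, and the connection argument are hypothetical stand-ins for whatever the shipped manifests actually use:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: nats-credentials
  namespace: proxus-ha
type: Opaque
stringData:
  users-user: acc                # first-run default shown here; rotate before production
  users-password: acc
  sys-user: sys                  # placeholder value
  sys-password: sys              # placeholder value
---
# Fragment of a pod spec showing $(VAR_NAME) substitution at container start.
# Variable and argument names are illustrative, not the shipped ones.
containers:
  - name: proxus-ui
    env:
      - name: NATS_USER
        valueFrom:
          secretKeyRef:
            name: nats-credentials
            key: users-user
      - name: NATS_PASSWORD
        valueFrom:
          secretKeyRef:
            name: nats-credentials
            key: users-password
    args:
      - "--nats-url=nats://$(NATS_USER):$(NATS_PASSWORD)@hub-server:4222"  # hypothetical flag
```

Because the credentials are resolved from the Secret at pod start, editing the Secret and restarting the workloads (next section) rotates them without touching any manifest.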

Rotate values without editing manifests:

kubectl -n proxus-ha edit secret nats-credentials
kubectl -n proxus-ha edit secret postgres-credentials
kubectl -n proxus-ha rollout restart statefulset/hub-server
kubectl -n proxus-ha rollout restart deployment/proxus-ui deployment/proxus-server-gateway-1

Before you change postgres-credentials, make sure the new values are also reflected in the PostgreSQL cluster definition (ha-data/postgresql-ha.yaml). ClickHouse credentials are managed in ha-data/clickhouse-ha.yaml and do not yet flow through postgres-credentials.

Observability

Each NATS pod runs a sidecar that exposes Prometheus-format metrics on port 7777. A headless service publishes the endpoint with standard scrape annotations:

kubectl -n proxus-ha get svc hub-server-metrics
kubectl -n proxus-ha get pods -l app.kubernetes.io/name=hub-server -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'

Point your Prometheus instance at the headless service. Useful metrics include JetStream stream and consumer counts, route status, and per-stream message throughput.
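The "standard scrape annotations" on the metrics service conventionally look like the sketch below. Whether they are honored depends on your Prometheus relabeling configuration, so treat this as an illustration rather than the shipped manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hub-server-metrics
  namespace: proxus-ha
  annotations:
    prometheus.io/scrape: "true" # conventional annotation, picked up by common relabel configs
    prometheus.io/port: "7777"   # NATS exporter sidecar port
spec:
  clusterIP: None                # headless: one endpoint per NATS pod, so each pod is scraped
  selector:
    app.kubernetes.io/name: hub-server
  ports:
    - name: metrics
      port: 7777
```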

Scaling Rules

UI

The UI can be scaled horizontally:

kubectl -n proxus-ha scale deployment/proxus-ui --replicas=3
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s

Keep sticky sessions enabled at the ingress layer unless your environment has been validated without them.

Gateways

Do not scale a gateway deployment above one replica:

kubectl -n proxus-ha scale deployment/proxus-server-gateway-1 --replicas=1

To add another gateway, create a separate deployment with a different GatewayID.
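In practice a second gateway is a copy of the first deployment with a new name, label, and GatewayID. The sketch below assumes the labels used elsewhere in this guide; the image name and the way the GatewayID is passed are hypothetical, so mirror whatever `ha/proxus-gateway-1.yaml` actually does:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: proxus-server-gateway-2
  namespace: proxus-ha
spec:
  replicas: 1                    # singleton rule: never scale above 1 per GatewayID
  selector:
    matchLabels:
      app.kubernetes.io/name: proxus-server
      proxus.io/gateway-id: "2"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: proxus-server
        proxus.io/gateway-id: "2"
    spec:
      containers:
        - name: proxus-server
          image: proxus/proxus-server:latest   # hypothetical image; use your release tag
          args:
            - "--GatewayID=2"    # hypothetical flag; match how gateway-1 sets its ID
```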

PostgreSQL

PostgreSQL uses one writable primary. The application connection points to the read-write service:

postgresql-ha-rw.proxus-data.svc.cluster.local:5432

Use the read-only service only for workloads that are explicitly safe to run against replicas:

postgresql-ha-ro.proxus-data.svc.cluster.local:5432

ClickHouse

The package starts with one shard and two replicas. This gives replica availability. For higher analytics scale, plan shard count, table engines, and distributed table strategy before increasing shards.

Updating Proxus

  1. Take backups or verify recent successful backups.
  2. Edit the image tag in ha/proxus-ui.yaml and ha/proxus-gateway-1.yaml.
  3. Apply the manifests:
kubectl apply -k ha
  4. Watch the rollout:
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s
kubectl -n proxus-ha rollout status deployment/proxus-server-gateway-1 --timeout=300s

UI updates are rolling. Gateway updates are singleton updates and can briefly disconnect devices attached to that gateway.

Upgrade NATS Control-Plane Streams

If you are upgrading from an earlier HA package, check for the legacy license_stream. Proxus uses license and license.health as core NATS request/reply subjects, so they should not be backed by a JetStream stream.

nats stream info license_stream
nats stream rm license_stream --force

Only remove it after confirming that your current package version no longer creates license_stream and that there are no custom consumers attached to it.

Updating Operators

Treat operator updates as infrastructure maintenance:

  1. Read the operator release notes.
  2. Confirm backups exist.
  3. Apply the operator upgrade in a maintenance window.
  4. Verify database and ClickHouse cluster health.

PostgreSQL:

kubectl -n proxus-data get cluster,pods

ClickHouse:

kubectl -n clickhouse get chi,pods

Backup and Restore

PostgreSQL

Use your storage snapshot system or CloudNativePG backup configuration. At minimum, verify:

kubectl -n proxus-data get cluster postgresql-ha
kubectl -n proxus-data get pvc

For production, configure scheduled backups and test restore into a separate namespace before relying on the backup plan.

ClickHouse

Use storage snapshots or a ClickHouse backup process that covers all replicas and Keeper metadata requirements. Verify both data PVCs and Keeper PVCs are protected:

kubectl -n clickhouse get pvc

Proxus Configuration and Modules

Back up the Proxus UI and gateway PVCs:

kubectl -n proxus-ha get pvc

The important volumes are:

  • UI config
  • UI modules
  • each gateway config volume
  • NATS JetStream data

Failure Testing

Run failure tests during a maintenance window.

UI Pod Failure

kubectl -n proxus-ha delete pod -l app.kubernetes.io/name=proxus-ui
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s

The service should continue routing to a healthy UI pod while Kubernetes recreates the deleted pod.

Gateway Pod Failure

kubectl -n proxus-ha delete pod -l app.kubernetes.io/name=proxus-server,proxus.io/gateway-id=1
kubectl -n proxus-ha rollout status deployment/proxus-server-gateway-1 --timeout=300s

Connected devices may briefly disconnect and should reconnect after the replacement pod starts.

PostgreSQL Primary Failure

Use the PostgreSQL operator's documented switchover or failover procedure. After failover, verify:

kubectl -n proxus-data get cluster postgresql-ha
kubectl -n proxus-ha logs deploy/proxus-ui --tail=100

ClickHouse Replica Failure

kubectl -n clickhouse delete pod chi-clickhouse-ha-proxus-0-0-0
kubectl -n clickhouse get pods,chi

The remaining replica should stay available while the deleted pod is recreated.

Troubleshooting

UI Does Not Open

Check the service and pods:

kubectl -n proxus-ha get pods,svc,endpoints
kubectl -n proxus-ha logs deploy/proxus-ui --tail=100

The HA package exposes the UI service as NodePort 30080. Open http://127.0.0.1:30080/ locally, or replace 127.0.0.1 with a reachable node address in a remote cluster.

Login Page Opens but Gateway Is Not Active

Check the gateway logs:

kubectl -n proxus-ha logs deploy/proxus-server-gateway-1 --tail=150

If the license is not activated in the current database, activate it from the UI.

PostgreSQL Is Not Ready

kubectl -n proxus-data get cluster,pods,pvc
kubectl -n proxus-data describe cluster postgresql-ha

Common causes:

  • storage class cannot provision volumes
  • database image cannot be pulled
  • cluster has insufficient CPU or memory
  • pod anti-affinity cannot be satisfied in a small cluster

ClickHouse Is Not Ready

kubectl -n clickhouse get chi,pods,svc,pvc
kubectl -n clickhouse logs deploy/clickhouse-operator --tail=150
kubectl -n clickhouse logs statefulset/clickhouse-keeper --tail=150

Common causes:

  • ClickHouse image cannot be pulled
  • Keeper quorum is not healthy
  • PVCs are pending
  • the bootstrap job has not completed

NATS Is Not Ready

kubectl -n proxus-ha get pods -l app.kubernetes.io/name=hub-server
kubectl -n proxus-ha logs statefulset/hub-server --tail=150

Check that all three NATS pods are running and that their PVCs are bound.

Cleanup

For a test environment only:

kubectl delete -k ha
kubectl delete -k ha-data
warning
Data deletion

Deleting namespaces, PVCs, or storage snapshots can permanently delete production data. Do not run cleanup commands in production unless you have a tested restore path and an approved maintenance plan.

Production Checklist

Before going live:

  • Use a multi-node Kubernetes cluster.
  • Confirm storage classes and backup policies.
  • Use RWX-capable storage for shared UI volumes and nats-hub-config in multi-node production.
  • Change the default values in the nats-credentials and postgres-credentials Secrets.
  • Configure ingress TLS.
  • Confirm the UI is reachable through http://<node-address>:30080/ or through your ingress/load balancer.
  • Activate the Proxus license.
  • Keep each gateway deployment singleton.
  • Verify that JetStream streams run with three replicas (nats stream report should show every stream as R=3).
  • Verify Pod Disruption Budgets keep at least two NATS pods and one UI pod available during a node drain.
  • Confirm Prometheus is scraping the NATS metrics endpoint on port 7777.
  • Verify UI failover.
  • Verify PostgreSQL failover.
  • Verify ClickHouse replica recovery.
  • Verify NATS pod recovery.
  • Document your restore procedure.
  • Monitor pod health, storage usage, database health, and application logs.