This guide explains how to deploy Proxus on Kubernetes in a high-availability (HA) production topology. Use it when Docker Compose no longer meets your availability, update, and operations requirements.
The manifests can be tested on a local Kubernetes environment, but a real HA production deployment requires a multi-node Kubernetes cluster, reliable persistent storage, monitoring, and a backup policy. A single-node cluster can prove the topology, but it cannot survive node failure.
## What This Deployment Provides
The HA deployment separates Proxus into a data layer and an application layer:
| Layer | Component | HA behavior |
|---|---|---|
| Data | PostgreSQL | 3-instance cluster with one read-write endpoint and automatic failover |
| Data | ClickHouse Keeper | 3-pod coordination quorum |
| Data | ClickHouse | 1 shard with 2 replicas for replicated telemetry storage |
| Messaging | NATS JetStream | 3-node message hub with three-replica streams |
| Application | Proxus UI | 2 replicas spread across nodes behind a Kubernetes Service |
| Edge | Proxus Gateway | 1 pod per GatewayID, restarted by Kubernetes if it fails |
| Operations | Pod Disruption Budgets | Drain protection on NATS and UI |
| Operations | Prometheus metrics | NATS exporter sidecar on every NATS pod |
| Operations | Kubernetes Secrets | NATS and PostgreSQL credentials kept out of plaintext manifests |
Gateways are singleton workloads. Do not increase replicas for the same GatewayID. To add capacity for separate sites or lines, create another gateway deployment with a different GatewayID.
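For illustration, a second gateway for another site or line might look like the sketch below. The labels follow the package's conventions (`app.kubernetes.io/name=proxus-server`, `proxus.io/gateway-id`), while the image name and the way the GatewayID is passed are assumptions; in practice, copy the package's `ha/proxus-gateway-1.yaml` and change the ID rather than writing a manifest from scratch.

```yaml
# Hypothetical second-gateway Deployment; image and env names are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: proxus-server-gateway-2      # one Deployment per GatewayID
  namespace: proxus-ha
spec:
  replicas: 1                        # gateways are singletons; never scale above 1
  selector:
    matchLabels:
      app.kubernetes.io/name: proxus-server
      proxus.io/gateway-id: "2"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: proxus-server
        proxus.io/gateway-id: "2"
    spec:
      containers:
        - name: gateway
          image: proxus/server:latest   # placeholder; use your release image
          env:
            - name: GATEWAY_ID          # assumed mechanism for setting the ID
              value: "2"
```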
## Quick HA Deploy
You need a Kubernetes cluster, a default StorageClass, kubectl access, permission to install operators, a Proxus license, and access to the Proxus container images used by your release.
The full HA package brings up:
- 3 PostgreSQL instances managed by CloudNativePG
- 3 ClickHouse Keeper pods with a 2-replica ClickHouse cluster
- 3 NATS pods running JetStream with three-replica streams
- 2 Proxus UI pods behind a sticky-session ingress
- 1 Proxus Gateway pod
- Pod Disruption Budgets, Prometheus metrics, and credential Secrets
The package ships with default credentials in two Kubernetes Secrets (nats-credentials, postgres-credentials). Use the Quick Deploy below to get a working cluster in minutes, then change those defaults before exposing the environment to anyone outside your team. See Security and Secrets for the rotation procedure.
Run the full HA deployment:
```bash
mkdir -p proxus-kubernetes-ha
cd proxus-kubernetes-ha
curl -L https://proxus.io/deployment/kubernetes-ha/proxus-kubernetes-ha.tar.gz | tar xz
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.29/releases/cnpg-1.29.0.yaml
kubectl -n cnpg-system rollout status deployment/cnpg-controller-manager --timeout=300s
curl -s https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator-web-installer/clickhouse-operator-install.sh | OPERATOR_NAMESPACE=clickhouse bash
kubectl -n clickhouse rollout status deployment/clickhouse-operator --timeout=300s
kubectl apply -k ha-data
kubectl -n proxus-data wait --for=jsonpath='{.status.phase}'='Cluster in healthy state' cluster/postgresql-ha --timeout=600s
kubectl -n clickhouse wait --for=jsonpath='{.status.status}'=Completed chi/clickhouse-ha --timeout=600s
kubectl -n clickhouse wait --for=condition=complete job/clickhouse-bootstrap --timeout=180s
kubectl apply -k ha
kubectl -n proxus-ha rollout status statefulset/hub-server --timeout=300s
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s
kubectl -n proxus-ha rollout status deployment/proxus-server-gateway-1 --timeout=300s
```

Total time on a healthy cluster is around 5 to 10 minutes, dominated by CloudNativePG bringing up the PostgreSQL primary and the ClickHouse Keeper quorum starting.
## Verify the Install
Confirm the cluster came up healthy and that JetStream streams are running with three replicas:
```bash
kubectl -n proxus-ha get pods,pdb
kubectl run -n proxus-ha nats-tmp --rm -i --restart=Never --image=natsio/nats-box:0.18.0 -- \
  nats --server="nats://acc:acc@hub-server:4222" stream report
```

Every JetStream stream in the report should show R=3. If any stream shows R=1, see Troubleshooting.
## First Login
Open http://127.0.0.1:30080/ in your browser. In a remote cluster, replace 127.0.0.1 with a reachable Kubernetes node address or use the included ingress behind your load balancer.
Default login:

- Username: `Admin`
- Password: leave blank, unless your environment has already changed it
After the first login, activate your Proxus license from the UI. A fresh HA PostgreSQL cluster does not contain license data from an earlier Docker or local test database unless you restored that database.
## Before Going to Production
Run through the Security and Secrets procedure to rotate the default NATS and PostgreSQL credentials, then walk the Production Checklist.
## Check the Deployment
Use these commands when the quick deploy finishes or if one of the wait commands times out:
```bash
kubectl -n proxus-data get cluster,pods,svc,pvc
kubectl -n clickhouse get pods,svc,pvc,chi
kubectl -n proxus-ha get pods,svc,endpoints
```

Expected shape:
| Workload | Expected |
|---|---|
| `hub-server` | 3 NATS pods |
| `proxus-ui` | 2 UI pods |
| `proxus-server-gateway-1` | 1 gateway pod |
| `postgresql` service | ExternalName to the PostgreSQL read-write endpoint |
| `clickhouse` service | ExternalName to the ClickHouse service |
For production, expose the UI through your ingress or load balancer. The included ingress example uses sticky sessions for browser sessions:
```bash
kubectl -n proxus-ha get ingress
```

## Production Configuration
Before exposing the environment, review and change these values:
| File | What to review |
|---|---|
| `ha/secrets.yaml` | NATS account credentials and PostgreSQL credentials (do this first; see Security and Secrets) |
| `ha-data/postgresql-ha.yaml` | PostgreSQL storage size and instance count |
| `ha-data/clickhouse-ha.yaml` | ClickHouse password, storage size, Keeper and replica count |
| `ha/proxus-ui.yaml` | UI image tag, replica count, resource requests and limits |
| `ha/proxus-gateway-1.yaml` | Gateway image tag, GatewayID, exposed protocol ports, resource requests and limits |
| `ha/nats-cluster.yaml` | JetStream storage size, resource requests and limits |
| `ha/pod-disruption-budgets.yaml` | Minimum available pods during voluntary disruptions |
| `ha/storage.yaml` | Persistent volume size and access mode, especially RWX storage for shared UI volumes and `nats-hub-config` in multi-node production |
| `ha/ingress-nginx-sticky.yaml` | Host name, ingress class, sticky session settings |
Always apply with `kubectl apply -k ha` (Kustomize). Applying individual files with `kubectl apply -f` does not pick up the namespace declared in the kustomization and can create resources in the wrong namespace.
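The namespace comes from the kustomization itself. A minimal sketch of what `ha/kustomization.yaml` plausibly contains (the exact resource list in your package will differ):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: proxus-ha      # injected into every resource by `kubectl apply -k`
resources:
  - secrets.yaml
  - nats-cluster.yaml
  - proxus-ui.yaml
  - proxus-gateway-1.yaml
  - pod-disruption-budgets.yaml
```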
## Security and Secrets
Default NATS and PostgreSQL credentials are stored in two Kubernetes Secrets in the proxus-ha namespace, not in plaintext manifest arguments.
| Secret | Keys |
|---|---|
| `nats-credentials` | `users-user`, `users-password`, `sys-user`, `sys-password` |
| `postgres-credentials` | `user`, `password`, `database` |
The NATS server reads these values at startup through environment-variable interpolation in accounts.conf. UI and Gateway pods receive the same values as environment variables, and the connection arguments use $(VAR_NAME) substitution at pod start.
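The pattern looks roughly like the container fragment below; the environment variable names and CLI flags here are illustrative, and only the Secret name and keys match the package.

```yaml
# Secret keys are mapped to env vars, then interpolated into args with
# Kubernetes $(VAR_NAME) substitution when the pod starts.
containers:
  - name: proxus-ui
    image: proxus/ui:example            # placeholder image
    env:
      - name: PG_USER
        valueFrom:
          secretKeyRef:
            name: postgres-credentials
            key: user
      - name: PG_PASSWORD
        valueFrom:
          secretKeyRef:
            name: postgres-credentials
            key: password
    args:
      - "--db-user=$(PG_USER)"          # expanded by the kubelet, not a shell
      - "--db-password=$(PG_PASSWORD)"
```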
Rotate values without editing manifests:

```bash
kubectl -n proxus-ha edit secret nats-credentials
kubectl -n proxus-ha edit secret postgres-credentials
kubectl -n proxus-ha rollout restart statefulset/hub-server
kubectl -n proxus-ha rollout restart deployment/proxus-ui deployment/proxus-server-gateway-1
```

Make sure the new credentials are reflected in the PostgreSQL cluster definition (`ha-data/postgresql-ha.yaml`) before you change `postgres-credentials`. ClickHouse credentials are managed in `ha-data/clickhouse-ha.yaml` and do not yet flow through `postgres-credentials`.
## Observability
Each NATS pod runs a sidecar that exposes Prometheus-format metrics on port 7777. A headless service publishes the endpoint with standard scrape annotations:
```bash
kubectl -n proxus-ha get svc hub-server-metrics
kubectl -n proxus-ha get pods -l app.kubernetes.io/name=hub-server -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'
```

Point your Prometheus instance at the headless service. Useful metrics include JetStream stream and consumer counts, route status, and per-stream message throughput.
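If your Prometheus does not honor scrape annotations, a hand-written scrape job against the sidecar port could look like this sketch (the job name and label selector are assumptions based on the pod labels used elsewhere in this guide):

```yaml
# prometheus.yml fragment: keep only hub-server pods, scrape port 7777.
scrape_configs:
  - job_name: proxus-nats
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [proxus-ha]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: hub-server
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: $1:7777
```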
## Scaling Rules

### UI
The UI can be scaled horizontally:
```bash
kubectl -n proxus-ha scale deployment/proxus-ui --replicas=3
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s
```

Keep sticky sessions enabled at the ingress layer unless your environment has been validated without them.
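For reference, cookie-based affinity on ingress-nginx is expressed with annotations like the ones below; the host, cookie name, and service port are placeholders, and the package's `ha/ingress-nginx-sticky.yaml` remains the authoritative version.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: proxus-ui
  namespace: proxus-ha
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "proxus-affinity"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
spec:
  ingressClassName: nginx
  rules:
    - host: proxus.example.com          # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: proxus-ui
                port:
                  number: 8080          # assumed UI service port
```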
### Gateways

Do not scale a gateway deployment above one replica:

```bash
kubectl -n proxus-ha scale deployment/proxus-server-gateway-1 --replicas=1
```

To add another gateway, create a separate deployment with a different GatewayID.
### PostgreSQL

PostgreSQL uses one writable primary. The application connection points to the read-write service:

```
postgresql-ha-rw.proxus-data.svc.cluster.local:5432
```

Use the read-only service only for workloads that are explicitly safe to run against replicas:

```
postgresql-ha-ro.proxus-data.svc.cluster.local:5432
```

### ClickHouse
The package starts with one shard and two replicas. This gives replica availability. For higher analytics scale, plan shard count, table engines, and distributed table strategy before increasing shards.
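With the Altinity operator, the shard and replica counts live in the ClickHouseInstallation layout. Increasing shards amounts to a change like this fragment of `ha-data/clickhouse-ha.yaml` (sketch with other fields omitted; the cluster name is inferred from pod names such as `chi-clickhouse-ha-proxus-0-0-0`):

```yaml
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: clickhouse-ha
  namespace: clickhouse
spec:
  configuration:
    clusters:
      - name: proxus
        layout:
          shardsCount: 1       # raise only with a distributed-table plan
          replicasCount: 2     # replica availability within each shard
```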
## Updating Proxus

- Take backups or verify recent successful backups.
- Edit the image tag in `ha/proxus-ui.yaml` and `ha/proxus-gateway-1.yaml`.
- Apply the manifests: `kubectl apply -k ha`
- Watch the rollout:

```bash
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s
kubectl -n proxus-ha rollout status deployment/proxus-server-gateway-1 --timeout=300s
```

UI updates are rolling. Gateway updates are singleton updates and can briefly disconnect devices attached to that gateway.
## Upgrade NATS Control-Plane Streams
If you are upgrading from an earlier HA package, check for the legacy `license_stream`. Proxus uses `license` and `license.health` as core NATS request/reply subjects, so they should not be backed by a JetStream stream.

```bash
nats stream info license_stream
nats stream rm license_stream --force
```

Only remove it after confirming that your current package version no longer creates `license_stream` and that there are no custom consumers attached to it.
## Updating Operators
Treat operator updates as infrastructure maintenance:
- Read the operator release notes.
- Confirm backups exist.
- Apply the operator upgrade in a maintenance window.
- Verify database and ClickHouse cluster health.
PostgreSQL:
```bash
kubectl -n proxus-data get cluster,pods
```

ClickHouse:

```bash
kubectl -n clickhouse get chi,pods
```

## Backup and Restore
### PostgreSQL

Use your storage snapshot system or CloudNativePG backup configuration. At minimum, verify:

```bash
kubectl -n proxus-data get cluster postgresql-ha
kubectl -n proxus-data get pvc
```

For production, configure scheduled backups and test restore into a separate namespace before relying on the backup plan.
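With CloudNativePG, scheduled backups are declared as a ScheduledBackup resource once a backup destination (for example a `barmanObjectStore`) is configured on the cluster. A sketch with an illustrative name and schedule:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: postgresql-ha-nightly   # illustrative name
  namespace: proxus-data
spec:
  schedule: "0 0 2 * * *"       # six-field cron (seconds first): daily at 02:00
  cluster:
    name: postgresql-ha
```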
### ClickHouse

Use storage snapshots or a ClickHouse backup process that covers all replicas and Keeper metadata requirements. Verify both data PVCs and Keeper PVCs are protected:

```bash
kubectl -n clickhouse get pvc
```

### Proxus Configuration and Modules
Back up the Proxus UI and gateway PVCs:

```bash
kubectl -n proxus-ha get pvc
```

The important volumes are:
- UI config
- UI modules
- each gateway config volume
- NATS JetStream data
## Failure Testing
Run failure tests during a maintenance window.
### UI Pod Failure

```bash
kubectl -n proxus-ha delete pod -l app.kubernetes.io/name=proxus-ui
kubectl -n proxus-ha rollout status deployment/proxus-ui --timeout=300s
```

The service should continue routing to a healthy UI pod while Kubernetes recreates the deleted pod.
### Gateway Pod Failure

```bash
kubectl -n proxus-ha delete pod -l app.kubernetes.io/name=proxus-server,proxus.io/gateway-id=1
kubectl -n proxus-ha rollout status deployment/proxus-server-gateway-1 --timeout=300s
```

Connected devices may reconnect after the replacement pod starts.
### PostgreSQL Primary Failure

Use the PostgreSQL operator's documented switchover or failover procedure. After failover, verify:

```bash
kubectl -n proxus-data get cluster postgresql-ha
kubectl -n proxus-ha logs deploy/proxus-ui --tail=100
```

### ClickHouse Replica Failure

```bash
kubectl -n clickhouse delete pod chi-clickhouse-ha-proxus-0-0-0
kubectl -n clickhouse get pods,chi
```

The remaining replica should stay available while the deleted pod is recreated.
## Troubleshooting

### UI Does Not Open
Check the service and pods:
```bash
kubectl -n proxus-ha get pods,svc,endpoints
kubectl -n proxus-ha logs deploy/proxus-ui --tail=100
```

The HA package exposes the UI service as NodePort 30080. Open http://127.0.0.1:30080/ locally, or replace 127.0.0.1 with a reachable node address in a remote cluster.
### Login Page Opens but Gateway Is Not Active

Check the gateway logs:

```bash
kubectl -n proxus-ha logs deploy/proxus-server-gateway-1 --tail=150
```

If the license is not activated in the current database, activate it from the UI.
### PostgreSQL Is Not Ready

```bash
kubectl -n proxus-data get cluster,pods,pvc
kubectl -n proxus-data describe cluster postgresql-ha
```

Common causes:
- storage class cannot provision volumes
- database image cannot be pulled
- cluster has insufficient CPU or memory
- pod anti-affinity cannot be satisfied in a small cluster
### ClickHouse Is Not Ready

```bash
kubectl -n clickhouse get chi,pods,svc,pvc
kubectl -n clickhouse logs deploy/clickhouse-operator --tail=150
kubectl -n clickhouse logs statefulset/clickhouse-keeper --tail=150
```

Common causes:
- ClickHouse image cannot be pulled
- Keeper quorum is not healthy
- PVCs are pending
- the bootstrap job has not completed
### NATS Is Not Ready

```bash
kubectl -n proxus-ha get pods -l app.kubernetes.io/name=hub-server
kubectl -n proxus-ha logs statefulset/hub-server --tail=150
```

Check that all three NATS pods are running and that their PVCs are bound.
## Cleanup

For a test environment only:

```bash
kubectl delete -k ha
kubectl delete -k ha-data
```

Deleting namespaces, PVCs, or storage snapshots can permanently delete production data. Do not run cleanup commands in production unless you have a tested restore path and an approved maintenance plan.
## Production Checklist

Before going live:

- Use a multi-node Kubernetes cluster.
- Confirm storage classes and backup policies.
- Use RWX-capable storage for shared UI volumes and `nats-hub-config` in multi-node production.
- Change the default values in the `nats-credentials` and `postgres-credentials` Secrets.
- Configure ingress TLS.
- Confirm the UI is reachable through `http://<node-address>:30080/` or through your ingress/load balancer.
- Activate the Proxus license.
- Keep each gateway deployment a singleton.
- Verify that JetStream streams run with three replicas (`nats stream report` should show every stream as R=3).
- Verify Pod Disruption Budgets keep at least two NATS pods and one UI pod available during a node drain.
- Confirm Prometheus is scraping the NATS metrics endpoint on port `7777`.
- Verify UI failover.
- Verify PostgreSQL failover.
- Verify ClickHouse replica recovery.
- Verify NATS pod recovery.
- Document your restore procedure.
- Monitor pod health, storage usage, database health, and application logs.