SANDI Solr

Production Deployment with Docker Swarm

Overview

This guide covers deploying SANDI Solr across multiple dedicated servers using Docker Swarm. Each service runs on its own host, pinned by node label placement constraints. The result is complete resource isolation, simplified per-service troubleshooting, and the ability to scale individual tiers independently. The guide walks through infrastructure preparation, Swarm cluster formation, node labelling, image distribution via a private registry, stack deployment, verification, day-two operations, and detailed troubleshooting procedures.

Deployment topology

Manager node:  zoo1 (also runs ZooKeeper 1)
Worker nodes:  zoo2, zoo3 (ZooKeeper)
               solr1, solr2 (Apache Solr)
               search1, index1 (SANDI APIs)
               emb3, llm2, rer1 (AI services, GPU)
               nlp1, client_search, client_index (CPU services)

All containers communicate over a Docker overlay network (sandi_net) that spans all hosts. Services resolve each other by hostname (e.g. sandi_zoo1, sandi_solr1) via Swarm's built-in DNS.
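As a sketch of how those DNS names are used in practice, the snippet below assembles the internal connection strings a client on sandi_net might use. Service names follow this guide's compose conventions; the ports shown are illustrative, not confirmed container-internal ports.

```shell
#!/usr/bin/env bash
# Sketch: build internal endpoints from Swarm DNS names instead of host IPs.
# Service names follow this guide; ports are illustrative.
set -euo pipefail

zk_hosts=""
for zk in sandi_zoo1 sandi_zoo2 sandi_zoo3; do
  zk_hosts+="${zk_hosts:+,}${zk}:2181"
done
echo "ZK connect string: $zk_hosts"
# → ZK connect string: sandi_zoo1:2181,sandi_zoo2:2181,sandi_zoo3:2181
```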

Infrastructure Requirements

A minimum of 13 hosts is required for full isolation. All hosts must run Ubuntu 24.04 LTS (or compatible Linux) and must be able to reach each other on the network. Static IP addresses are strongly recommended.

Host label      Service                                    RAM     CPU       Disk          GPU

Coordination layer
zoo1            Swarm manager + ZooKeeper 1                4 GB    2 cores   100 GB SSD    -
zoo2            ZooKeeper 2                                4 GB    2 cores   100 GB SSD    -
zoo3            ZooKeeper 3                                4 GB    2 cores   100 GB SSD    -

Search storage layer
solr1           Apache Solr node 1                         32 GB   8 cores   500 GB SSD    -
solr2           Apache Solr node 2                         32 GB   8 cores   500 GB SSD    -

Application layer
search1         SANDI Search API (port 8081)               4 GB    4 cores   100 GB        -
index1          SANDI Index / Admin API (port 8082)        4 GB    4 cores   100 GB        -

AI services — GPU
emb3            Embedding service (Qwen3-Embedding-0.6B)   16 GB   4 cores   100 GB        NVIDIA ≥ 24 GB VRAM
llm2            LLM service (Qwen3-4B)                     16 GB   4 cores   100 GB        NVIDIA ≥ 24 GB VRAM
rer1            Reranking service (Qwen3-Reranker-0.6B)    16 GB   4 cores   100 GB        NVIDIA ≥ 24 GB VRAM

Other services — CPU
nlp1            NLP service (SpaCy)                        4 GB    4 cores   100 GB        -
client_search   Client search hook service                 4 GB    4 cores   100 GB        -
client_index    Client index hook service                  4 GB    4 cores   100 GB        -

Required open ports

Port(s)      Protocol    Purpose                                    Open on
2377         TCP         Docker Swarm cluster management            All hosts
7946         TCP + UDP   Docker Swarm node-to-node communication    All hosts
4789         UDP         Docker overlay network (VXLAN)             All hosts
2181–2183    TCP         ZooKeeper client connections               zoo1–zoo3
7001–7003    TCP         ZooKeeper Prometheus metrics               zoo1–zoo3
8981–8982    TCP         Solr admin / query API                     solr1–solr2
8081         TCP         SANDI Search REST API                      search1
8082         TCP         SANDI Index / Admin REST API               index1
8083–8088    TCP         AI and hook services                       emb3, llm2, nlp1, rer1, client_search, client_index
5000         TCP         Private Docker registry (if used)          manager (zoo1)
Swarm overlay networking requires ports 7946 (TCP and UDP) and 4789 (UDP) to be open between all hosts, not just between the manager and workers. If any host is behind a firewall or NAT that blocks them, the overlay network will silently fail to form and services will not be able to reach each other.
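Before forming the cluster, a quick TCP probe between hosts can rule out the obvious blocks. A minimal sketch follows; the host and port values are examples, and note that UDP 7946 and VXLAN 4789 cannot be verified this way (they need a UDP-capable tool or an actual overlay test).

```shell
#!/usr/bin/env bash
# Sketch: TCP reachability probe using bash's /dev/tcp (TCP ports only).
check_tcp() {
  local host=$1 port=$2
  if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port blocked"
  fi
}

# Example probes against the manager (IP is a placeholder)
check_tcp 192.168.1.10 2377
check_tcp 192.168.1.10 7946
```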

Phase 1 — Prepare All Hosts

1

Install Docker on every host

Run the following on all 13 hosts:

# Update packages
sudo apt update && sudo apt upgrade -y

# Install Docker using the official convenience script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Allow running Docker without sudo
sudo usermod -aG docker $USER
newgrp docker

# Verify
docker --version
2

Install NVIDIA drivers and Container Toolkit on GPU hosts

Run only on the three GPU hosts: emb3, llm2, rer1.

# Install NVIDIA drivers
sudo apt install -y nvidia-driver-535
sudo reboot

# After reboot — verify GPU is visible
nvidia-smi

# Add NVIDIA Container Toolkit repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-docker2

# Configure Docker to use the NVIDIA runtime by default
sudo tee /etc/docker/daemon.json <<EOF
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker

# Verify GPU access inside a container
docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi
The default-runtime: nvidia setting is required for Docker Swarm. Swarm does not support the runtime: key in compose files, so NVIDIA must be set as the default runtime on the GPU host — otherwise the GPU reservation constraint will be ignored and the service will start without GPU access.
3

Open firewall ports on every host

# Swarm inter-node ports — ALL hosts
sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp

# Service-specific ports — open only on the relevant host
# ZooKeeper hosts (zoo1, zoo2, zoo3)
sudo ufw allow 2181:2183/tcp
sudo ufw allow 7001:7003/tcp

# Solr hosts (solr1, solr2)
sudo ufw allow 8981:8982/tcp

# Application hosts (search1, index1)
sudo ufw allow 8081:8082/tcp

# AI service hosts (emb3, llm2, nlp1, rer1, client_search, client_index)
sudo ufw allow 8083:8088/tcp

# Private registry — manager host (zoo1) only
sudo ufw allow 5000/tcp

sudo ufw --force enable
sudo ufw status

Phase 2 — Form the Docker Swarm Cluster

1

Initialise the Swarm on the manager node (zoo1)

# Replace with the actual IP address of the zoo1 host
docker swarm init --advertise-addr <ZOO1_IP>

# Example output:
Swarm initialized: current node (abc123) is now a manager.
To add a worker to this swarm, run the following command:
    docker swarm join --token SWMTKN-1-<token> <ZOO1_IP>:2377
Save the printed join token immediately. If you lose it, you can retrieve it later with: docker swarm join-token worker (run on the manager).
2

Join all 12 worker nodes

Run on each of the remaining 12 hosts:

docker swarm join --token <TOKEN> <ZOO1_IP>:2377

For bulk automation with Ansible or a shell loop:

# From your workstation — assumes SSH key auth is set up
WORKERS=(
  192.168.1.11  # zoo2
  192.168.1.12  # zoo3
  192.168.1.13  # solr1
  192.168.1.14  # solr2
  192.168.1.15  # search1
  192.168.1.16  # index1
  192.168.1.17  # emb3
  192.168.1.18  # llm2
  192.168.1.19  # nlp1
  192.168.1.20  # rer1
  192.168.1.21  # client_search
  192.168.1.22  # client_index
)
for host in "${WORKERS[@]}"; do
  ssh user@$host "docker swarm join --token <TOKEN> <ZOO1_IP>:2377"
done
3

Verify cluster formation

On the manager node:

docker node ls

Expected output — 13 nodes, all with STATUS = Ready and AVAILABILITY = Active:

ID         HOSTNAME    STATUS   AVAILABILITY   MANAGER STATUS
abc123 *   zoo1-host   Ready    Active         Leader
def456     zoo2-host   Ready    Active
ghi789     zoo3-host   Ready    Active
... (10 more workers)
For resilience, additional nodes can be promoted to managers later: docker node promote <zoo2_node_id>. A three-manager quorum (zoo1, zoo2, zoo3) tolerates one manager failure without losing control of the cluster.

Phase 3 — Label Nodes for Service Placement

Labels tell Swarm which host each service should run on. Every service in the Swarm compose file carries a deploy.placement.constraints entry that matches exactly one label, pinning the service to the correct host.
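For illustration, a single service entry might look like the following sketch. The field values are assumptions for this example; the actual docker-compose.swarm.yml ships with the SANDI project.

```yaml
# Sketch of one service entry in docker-compose.swarm.yml
services:
  sandi_solr1:
    image: solr:9.8.1
    networks:
      - sandi_net
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.service == solr1
```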

1

Collect node IDs

# List all nodes with their IDs and hostnames
docker node ls --format "table {{.ID}}\t{{.Hostname}}\t{{.Status}}"
2

Apply labels to each node

Replace the node IDs with actual values from the previous command:

# Coordination layer
docker node update --label-add service=zoo1 <ZOO1_NODE_ID>
docker node update --label-add service=zoo2 <ZOO2_NODE_ID>
docker node update --label-add service=zoo3 <ZOO3_NODE_ID>

# Search storage layer
docker node update --label-add service=solr1 <SOLR1_NODE_ID>
docker node update --label-add service=solr2 <SOLR2_NODE_ID>

# Application layer
docker node update --label-add service=search1 <SEARCH1_NODE_ID>
docker node update --label-add service=index1 <INDEX1_NODE_ID>

# AI services
docker node update --label-add service=emb3 <EMB3_NODE_ID>
docker node update --label-add service=llm2 <LLM2_NODE_ID>
docker node update --label-add service=nlp1 <NLP1_NODE_ID>
docker node update --label-add service=rer1 <RER1_NODE_ID>
docker node update --label-add service=client_search <CLIENT_SEARCH_NODE_ID>
docker node update --label-add service=client_index <CLIENT_INDEX_NODE_ID>
3

Verify labels

# Quick check — shows hostname and labels for each node
docker node ls -q | xargs -I{} docker node inspect {} \
  --format 'Node: {{.Description.Hostname}}  Labels: {{.Spec.Labels}}'

Each node should show exactly one service=<name> label.
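That check can be scripted. The sketch below runs against a captured sample so the failure case is visible; hostnames and label values are examples, and in practice the input would be the output of the docker node inspect command shown above.

```shell
#!/usr/bin/env bash
# Sketch: flag any node whose label map lacks a service=<name> entry.
set -euo pipefail

# Sample of "hostname labels" lines (real input comes from docker node inspect)
sample='zoo1-host map[service:zoo1]
solr1-host map[service:solr1]
emb3-host map[]'

bad=0
while read -r host labels; do
  if [[ "$labels" != *"service:"* ]]; then
    echo "MISSING label on $host"
    bad=1
  fi
done <<< "$sample"
echo "bad=$bad"
```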

Phase 4 — Private Docker Registry

The six custom Python services (emb3, llm2, nlp1, rer1, client_search, client_index) need their Docker images available on their target hosts. A private registry running on the manager node is the cleanest way to build once and distribute to all workers — no manual per-host builds required.

1

Start the registry service on the manager node

# Deploy registry as a Swarm service pinned to the manager
docker service create \
  --name registry \
  --publish 5000:5000 \
  --constraint 'node.role == manager' \
  --mount type=volume,src=registry-data,dst=/var/lib/registry \
  registry:2

# Verify it is running
docker service ls | grep registry
2

Configure all worker nodes to trust the registry

Because the registry uses HTTP (not HTTPS), Docker must be told to allow it as an insecure registry. Run on all 13 hosts:

# Add the manager IP as an insecure registry
sudo tee /etc/docker/daemon.json <<EOF
{
  "insecure-registries": ["<ZOO1_IP>:5000"]
}
EOF
# On GPU hosts, merge with the existing nvidia runtime config instead:
# { "default-runtime": "nvidia", ..., "insecure-registries": ["<ZOO1_IP>:5000"] }
sudo systemctl restart docker
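For the GPU hosts, the merged file might look like this sketch, validated before restarting Docker. The manager IP is a placeholder, and the file is written to a temp path here; on a real host it would be /etc/docker/daemon.json followed by a Docker restart.

```shell
#!/usr/bin/env bash
# Sketch: combined daemon.json for a GPU host (NVIDIA default runtime
# plus the insecure registry), validated with python3's json module.
set -euo pipefail
ZOO1_IP=192.168.1.10   # placeholder: the manager IP used in this guide

tmp=$(mktemp)
cat > "$tmp" <<EOF
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
  },
  "insecure-registries": ["${ZOO1_IP}:5000"]
}
EOF
python3 -m json.tool "$tmp" > /dev/null && echo "daemon.json is valid JSON"
```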
3

Build and push all custom images

On the manager node, from the project directory:

cd /opt/sandi
for service in emb3 llm2 nlp1 rer1 client_search client_index; do
  echo "Building $service ..."
  docker build -t <ZOO1_IP>:5000/sandi_$service:latest ./$service
  docker push <ZOO1_IP>:5000/sandi_$service:latest
done

# Verify images are in the registry
curl http://<ZOO1_IP>:5000/v2/_catalog
The Swarm compose file must reference the registry-prefixed image names, e.g. <ZOO1_IP>:5000/sandi_emb3:latest. Swarm pulls images from the registry on the target node automatically at deploy time — no manual pre-pulling is needed.

Phase 5 — Distribute Project Files

1

Sync the project directory to all hosts

From your workstation or the manager node:

HOSTS=(
  192.168.1.10 192.168.1.11 192.168.1.12   # zoo1–3
  192.168.1.13 192.168.1.14                # solr1–2
  192.168.1.15 192.168.1.16                # search1, index1
  192.168.1.17 192.168.1.18 192.168.1.19   # emb3, llm2, nlp1
  192.168.1.20 192.168.1.21 192.168.1.22   # rer1, client_search, client_index
)
for host in "${HOSTS[@]}"; do
  rsync -avz --progress /opt/sandi/ user@$host:/opt/sandi/
done
2

Create data directories on each host

Docker bind mounts require the directories to exist on the host before the container starts. Swarm will not create them automatically.

ZooKeeper hosts (zoo1, zoo2, zoo3)

cd /opt/sandi
mkdir -p data/zoo1/{data,datalog}
mkdir -p data/zoo2/{data,datalog}
mkdir -p data/zoo3/{data,datalog}
chmod -R 777 data/

Solr hosts (solr1, solr2)

cd /opt/sandi
mkdir -p data/solr1
mkdir -p data/solr2
chmod -R 777 data/

Search host (search1)

cd /opt/sandi
mkdir -p search/webapps
mkdir -p search/logs
chmod -R 777 search/

Index host (index1)

cd /opt/sandi
mkdir -p index/webapps
mkdir -p index/logs
mkdir -p documents
chmod -R 777 index/ documents/
Copy the built sandi.war files into search/webapps/ and index/webapps/ on the respective hosts before deploying the stack, or the Tomcat containers will start with empty webapps directories and return 404.
3

Place configuration files

The Search and Index API containers read their properties from /sandi/conf/. The bind mount in the compose file maps ./conf on the host to this path. Ensure the correct sandi-solr-search.properties and sandi-solr-index.properties are in /opt/sandi/conf/ on the search1 and index1 hosts respectively, with the ZooKeeper connection string pointing at the actual ZK host IPs:

# In sandi-solr-search.properties and sandi-solr-index.properties
sandi.solr.zk.hosts=<ZOO1_IP>:2181,<ZOO2_IP>:2181,<ZOO3_IP>:2181

# AI service URLs must point to the actual host IPs
sandi.service.emb.url=http://<EMB3_IP>:8083
sandi.service.llm.url=http://<LLM2_IP>:8084
sandi.service.nlp.url=http://<NLP1_IP>:8085
sandi.service.rer.url=http://<RER1_IP>:8086
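One way to keep search1 and index1 consistent is to generate both properties files from a single set of variables. A sketch, with placeholder IPs and only two of the properties shown:

```shell
#!/usr/bin/env bash
# Sketch: render the properties from one set of host variables.
set -euo pipefail
ZOO1_IP=192.168.1.10; ZOO2_IP=192.168.1.11; ZOO3_IP=192.168.1.12
EMB3_IP=192.168.1.17

conf=$(mktemp -d)   # on a real host: /opt/sandi/conf
cat > "$conf/sandi-solr-search.properties" <<EOF
sandi.solr.zk.hosts=${ZOO1_IP}:2181,${ZOO2_IP}:2181,${ZOO3_IP}:2181
sandi.service.emb.url=http://${EMB3_IP}:8083
EOF
grep zk.hosts "$conf/sandi-solr-search.properties"
```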
On the overlay network, services can also be reached by their Swarm DNS hostname (e.g. sandi_emb3) instead of IP. Using hostnames is more resilient to IP changes but requires that the properties files are updated after deployment if the service name changes.

Phase 6 — Deploy the Stack

1

Create the overlay network

# Run on the manager node
docker network create \
  --driver overlay \
  --attachable \
  --subnet 10.10.0.0/16 \
  sandi_net

# Verify
docker network ls | grep sandi_net
2

Deploy the stack

On the manager node, from the directory containing docker-compose.swarm.yml:

cd /opt/sandi
docker stack deploy -c docker-compose.swarm.yml sandi

Swarm will schedule all 13 services. On first run, each worker node pulls its images from the private registry. This may take several minutes, especially for the GPU service images which can be several GB.

3

Monitor deployment progress

# Watch all services until REPLICAS shows 1/1 for each
watch docker stack services sandi

# See which node each task is running on
docker stack ps sandi --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}\t{{.Error}}"

ZooKeeper and Solr can take 30–90 seconds to reach a healthy state. The application API containers (search1, index1) depend on Solr being ready, so they may restart once before settling.
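If the rollout is scripted, a small retry helper can encode that wait instead of sleeping a fixed time. A sketch follows; the attempt counts, delays, and the example Solr URL are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or attempts run out, e.g.
#   retry 18 5 curl -sf http://<SOLR1_IP>:8981/solr/    # up to ~90 s
retry() {
  local attempts=$1 delay=$2; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then return 0; fi
    sleep "$delay"
  done
  return 1
}

retry 3 0 true  && echo "came up"
retry 2 0 false || echo "gave up"
```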

Phase 7 — Verify the Deployment

1

Check all services show 1/1 replicas

docker stack services sandi

Expected output — all 13 services with REPLICAS 1/1:

ID        NAME                        MODE        REPLICAS  IMAGE
xxxxxxxx  sandi_sandi_zoo1            replicated  1/1       zookeeper:3.9.2
xxxxxxxx  sandi_sandi_zoo2            replicated  1/1       zookeeper:3.9.2
xxxxxxxx  sandi_sandi_zoo3            replicated  1/1       zookeeper:3.9.2
xxxxxxxx  sandi_sandi_solr1           replicated  1/1       solr:9.8.1
xxxxxxxx  sandi_sandi_solr2           replicated  1/1       solr:9.8.1
xxxxxxxx  sandi_sandi_search1         replicated  1/1       tomcat:10.1.44-jdk17-temurin-noble
xxxxxxxx  sandi_sandi_index1          replicated  1/1       tomcat:10.1.44-jdk17-temurin-noble
xxxxxxxx  sandi_sandi_emb3            replicated  1/1       <ZOO1_IP>:5000/sandi_emb3:latest
xxxxxxxx  sandi_sandi_llm2            replicated  1/1       <ZOO1_IP>:5000/sandi_llm2:latest
xxxxxxxx  sandi_sandi_nlp1            replicated  1/1       <ZOO1_IP>:5000/sandi_nlp1:latest
xxxxxxxx  sandi_sandi_rer1            replicated  1/1       <ZOO1_IP>:5000/sandi_rer1:latest
xxxxxxxx  sandi_sandi_client_search   replicated  1/1       <ZOO1_IP>:5000/sandi_client_search:latest
xxxxxxxx  sandi_sandi_client_index    replicated  1/1       <ZOO1_IP>:5000/sandi_client_index:latest
2

Test ZooKeeper quorum

# Each ZK node must respond "imok"
echo ruok | nc <ZOO1_IP> 2181   # → imok
echo ruok | nc <ZOO2_IP> 2182   # → imok
echo ruok | nc <ZOO3_IP> 2183   # → imok

# Check which node is the ZK leader
echo stat | nc <ZOO1_IP> 2181 | grep -E "Mode|Connections"
3

Test Solr cluster

# Solr admin should return 200 on both nodes
curl -s -o /dev/null -w "%{http_code}" http://<SOLR1_IP>:8981/solr/
curl -s -o /dev/null -w "%{http_code}" http://<SOLR2_IP>:8982/solr/

# Check the SolrCloud cluster status
curl http://<SOLR1_IP>:8981/solr/admin/collections?action=CLUSTERSTATUS | python3 -m json.tool
4

Test application and AI services

# Search and Index APIs
curl http://<SEARCH1_IP>:8081/sandi/
curl http://<INDEX1_IP>:8082/sandi/

# AI services — should return HTTP 200
curl http://<EMB3_IP>:8083/
curl http://<LLM2_IP>:8084/
curl http://<NLP1_IP>:8085/
curl http://<RER1_IP>:8086/
curl http://<CLIENT_SEARCH_IP>:8087/
curl http://<CLIENT_INDEX_IP>:8088/
5

Open the SANDI web interfaces

Interface    URL
Search UI    http://<SEARCH1_IP>:8081/sandi/en/sandi-search.html
Index UI     http://<INDEX1_IP>:8082/sandi/en/sandi-index.html
Admin UI     http://<INDEX1_IP>:8082/sandi/en/sandi-admin.html
Solr Admin   http://<SOLR1_IP>:8981/solr/

Day-Two Operations

View service status and placement

# All services and replica counts
docker stack services sandi

# Which task runs on which host (detailed)
docker stack ps sandi --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}\t{{.Error}}"

# Tasks that have failed or been restarted
docker stack ps sandi --filter "desired-state=shutdown" --no-trunc

View logs

# Follow logs for a specific service
docker service logs -f --tail 100 sandi_sandi_search1
docker service logs -f --tail 100 sandi_sandi_solr1
docker service logs -f --tail 100 sandi_sandi_emb3

# Show timestamps
docker service logs -f --timestamps sandi_sandi_index1

Update a service image (rolling update)

# Rebuild and push the new image to the registry
docker build -t <ZOO1_IP>:5000/sandi_emb3:v2 ./emb3
docker push <ZOO1_IP>:5000/sandi_emb3:v2

# Update the running service — Swarm pulls and restarts it
docker service update --image <ZOO1_IP>:5000/sandi_emb3:v2 sandi_sandi_emb3

# For the Tomcat-based API services, redeploy the stack after updating the WAR file
docker stack deploy -c docker-compose.swarm.yml sandi

Redeploy the full stack after config changes

# Re-running stack deploy is idempotent — only changed services are updated
docker stack deploy -c docker-compose.swarm.yml sandi

Scale the Search API horizontally

# First remove the placement constraint that pins search1 to a single node
docker service update \
  --constraint-rm "node.labels.service == search1" \
  sandi_sandi_search1

# Scale to 3 replicas — Swarm will distribute across available nodes
docker service scale sandi_sandi_search1=3

# Add a load balancer (nginx / HAProxy) in front of all three instances

Remove the stack

# Stops and removes all services — data in bind-mount volumes is preserved on each host
docker stack rm sandi

Security Hardening

Use Docker Secrets for sensitive values

# Create secrets on the manager node
echo "solradmin" | docker secret create solr_username -
echo "StrongPass1!" | docker secret create solr_password -

# Reference secrets in docker-compose.swarm.yml:
services:
  sandi_solr1:
    secrets:
      - solr_username
      - solr_password
    environment:
      - SOLR_AUTHENTICATION_OPTS=-Dbasicauth=$(cat /run/secrets/solr_username):$(cat /run/secrets/solr_password)

secrets:
  solr_username:
    external: true
  solr_password:
    external: true

Encrypt the overlay network

# Create an encrypted overlay network (IPSec)
docker network create \
  --driver overlay \
  --attachable \
  --opt encrypted \
  --subnet 10.10.0.0/16 \
  sandi_net

Lock the Swarm

# Swarm lock requires a key to unlock after a manager restart
docker swarm update --autolock=true
# Save the printed unlock key securely — losing it means you cannot rejoin after restart

# To unlock after manager restart:
docker swarm unlock

Enable Solr authentication

# In sandi-solr-search.properties and sandi-solr-index.properties
sandi.solr.username=solradmin
sandi.solr.password=StrongPass1!

Backup Procedures

Backup Swarm state (manager node)

# Stop Docker, archive Swarm state, restart
sudo systemctl stop docker
sudo tar -czvf swarm-backup-$(date +%Y%m%d).tar.gz -C /var/lib/docker/swarm .
sudo systemctl start docker

Backup ZooKeeper data

# On each ZooKeeper host (zoo1, zoo2, zoo3)
sudo tar -czvf zoo-backup-$(date +%Y%m%d).tar.gz \
  -C /opt/sandi/data/zoo1 .

Backup Solr data

# Use the Solr backup API to create a consistent snapshot
curl "http://<SOLR1_IP>:8981/solr/<collection>/replication?command=backup&name=backup-$(date +%Y%m%d)"

# Or tar the data directory directly after stopping the Solr service
docker service scale sandi_sandi_solr1=0
sudo tar -czvf solr-backup-$(date +%Y%m%d).tar.gz -C /opt/sandi/data/solr1 .
docker service scale sandi_sandi_solr1=1
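Dated archives accumulate quickly, so a retention step is worth scripting. A sketch that keeps only the newest N archives; the backup directory and retention count are assumptions, and the dated files here are simulated in a temp directory.

```shell
#!/usr/bin/env bash
# Sketch: prune all but the newest $KEEP backup archives.
set -euo pipefail
BACKUP_DIR=$(mktemp -d)   # on a real host e.g. /opt/sandi/backups
KEEP=3

# Simulate five dated backups
for d in 01 02 03 04 05; do
  touch "$BACKUP_DIR/solr-backup-202501$d.tar.gz"
done

# Delete everything beyond the newest $KEEP
# (lexicographic sort matches chronological order for YYYYMMDD names)
ls -1 "$BACKUP_DIR"/solr-backup-*.tar.gz | sort | head -n -"$KEEP" | xargs -r rm --
ls -1 "$BACKUP_DIR" | wc -l
```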

Troubleshooting

Service stuck at 0/1 replicas — never starts

Diagnose:

docker service ps sandi_sandi_<service> --no-trunc
docker service logs sandi_sandi_<service>

Common causes:

  • Image not found: The image has not been pushed to the registry or the registry address in the compose file is wrong. Push the image and verify with curl http://<ZOO1_IP>:5000/v2/_catalog.
  • No node matches placement constraint: The node label is missing or mistyped. Verify with docker node inspect <node_id> --format '{{.Spec.Labels}}'.
  • Bind mount path missing: The directory on the host does not exist. Create it and re-deploy.
  • Port already in use: Another process is occupying the published port. Check with ss -tlnp | grep <port> on the target host.
Service starts but immediately exits (0/1 → restarts in loop)
# See the actual error from the most recent failed task
docker service ps sandi_sandi_<service> --no-trunc --filter "desired-state=shutdown"
docker service logs sandi_sandi_<service> 2>&1 | tail -50

Common causes:

  • ZooKeeper not ready (Solr): Solr cannot connect to ZK on startup. Wait 60–90 s and re-check — Swarm's restart policy will keep retrying.
  • Wrong properties path (Tomcat): Confirm the -Dspring.config.location JAVA_OPTS value points to a file that actually exists inside the container.
  • GPU not accessible (emb3, llm2, rer1): NVIDIA runtime is not set as default on the host. See step 2 of Phase 1.
  • Python import error (AI services): A Python dependency is missing in the image. Rebuild with docker build --no-cache.
  • Out of memory: The host has insufficient RAM. Check docker stats and the host's dmesg | grep -i oom.
GPU not used by AI services (model runs on CPU, very slow)
# Verify GPU runtime is the default on the GPU host
docker info | grep -i runtime

# Check that the container can see the GPU
docker run --rm nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi

# Inspect the running container on the GPU host directly
docker ps | grep sandi_emb
docker inspect <container_id> | grep -A5 '"Runtime"'

daemon.json must have "default-runtime": "nvidia". Docker Swarm ignores the runtime: key in compose files — the NVIDIA runtime must be set at the daemon level.

Services cannot reach each other over the overlay network
# Verify the overlay network exists and spans all nodes
docker network inspect sandi_net | grep -E '"Scope"|"Driver"|Peers'

# Test DNS resolution from inside a running container
docker exec -it $(docker ps -q -f label=com.docker.swarm.service.name=sandi_sandi_solr1) \
  ping -c 3 sandi_zoo1

# Check Swarm ports are open between hosts (run on each host)
sudo ufw status | grep -E '2377|7946|4789'

Common causes:

  • UDP port 4789 (VXLAN) blocked by firewall or cloud security group — this is the most common cause.
  • Hosts are on different network segments with VXLAN not permitted by the router.
  • The overlay network was created before all nodes joined — recreate it after all nodes are in the cluster.
Solr cannot connect to ZooKeeper
# Check ZooKeeper is healthy on all three nodes
echo ruok | nc <ZOO1_IP> 2181

# Check Solr logs for the ZK connection error
docker service logs sandi_sandi_solr1 2>&1 | grep -i "zookeeper\|zk\|connection"

# Verify the ZK_HOST env var in the compose file matches the actual ZK IPs/hostnames
docker service inspect sandi_sandi_solr1 | grep ZK_HOST

The ZK_HOST environment variable must use hostnames that resolve inside the overlay network (e.g. sandi_zoo1:2181) or IP addresses that are reachable from the Solr container.
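A quick format check of the connection string can catch typos (a missing port, a stray space) before a redeploy. A sketch, with example values:

```shell
#!/usr/bin/env bash
# Sketch: validate that a ZK connection string is comma-separated host:port entries.
zk_ok() {
  [[ "$1" =~ ^[A-Za-z0-9._-]+:[0-9]+(,[A-Za-z0-9._-]+:[0-9]+)*$ ]]
}

zk_ok "sandi_zoo1:2181,sandi_zoo2:2181,sandi_zoo3:2181" && echo valid
zk_ok "sandi_zoo1,sandi_zoo2" || echo "invalid: missing ports"
```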

SANDI API returns 503 / cannot connect to Solr or AI services
# Check the application log for connection errors
docker service logs sandi_sandi_search1 2>&1 | grep -iE "error|exception|refused|timeout" | tail -30

# Verify properties file URLs are correct on the search1 / index1 host
cat /opt/sandi/conf/sandi-solr-search.properties | grep -E "url|host|zk"

# Test connectivity from the search1 host directly
curl http://<EMB3_IP>:8083/
curl http://<SOLR1_IP>:8981/solr/
A host goes down — service is stuck, Swarm does not reschedule
# Check the node status
docker node ls

# Option 1: Bring the node back online, Swarm restarts the service automatically

# Option 2: Move the service to another node manually
# Remove the placement constraint and add a new one pointing to an available node
docker service update \
  --constraint-rm "node.labels.service == emb3" \
  --constraint-add "node.labels.service == spare_gpu_host" \
  sandi_sandi_emb3

# Option 3: Remove the placement constraint entirely — Swarm picks any available node
docker service update \
  --constraint-rm "node.labels.service == emb3" \
  sandi_sandi_emb3
Because each service is pinned to a specific node by label constraint, Swarm will not automatically reschedule it to another node when the host goes down. You must intervene manually. For stateless services (AI services, API) removing the constraint and letting Swarm reschedule is safe. For Solr and ZooKeeper, data locality matters — prefer restoring the original node.
docker stack deploy fails with "network not found"
# The overlay network must be created before deploying the stack
docker network ls | grep sandi_net

# If missing, create it
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 sandi_net

# Then redeploy
docker stack deploy -c docker-compose.swarm.yml sandi
Manager node restarted — Swarm is locked
# If autolock is enabled, unlock the Swarm after a manager restart
docker swarm unlock
# Enter the unlock key saved when autolock was enabled

# If the key was lost, you must force a new manager
# (this requires at least one other healthy manager in the cluster)
Useful diagnostic commands — quick reference
# Cluster-wide service state
docker stack ps sandi --no-trunc

# Resource usage on a specific host
ssh user@<HOST_IP> docker stats --no-stream

# Inspect a specific service's full configuration
docker service inspect --pretty sandi_sandi_solr1

# List all tasks ever run (including failed) for a service
docker service ps sandi_sandi_llm2 --no-trunc --filter "desired-state=shutdown"

# Enter a running container interactively for debugging
docker exec -it $(docker ps -q -f label=com.docker.swarm.service.name=sandi_sandi_nlp1) bash

# Force a service restart
docker service update --force sandi_sandi_search1

Monitoring

ZooKeeper already exposes Prometheus metrics on ports 7001–7003 via the metricsProvider configuration in the compose file. Recommended monitoring stack:

Tool                      Purpose                             Notes
Prometheus                Metrics collection                  Scrape ZooKeeper on :7001–7003/metrics, Solr on :8981/solr/admin/metrics
Grafana                   Dashboards                          Pre-built ZooKeeper and Solr dashboards available on grafana.com
cAdvisor                  Container CPU / RAM / GPU metrics   Deploy as a Swarm global service so it runs on every node
Docker Swarm Visualizer   Live service placement map          dockersamples/visualizer — deploy on the manager, port 8080
# Deploy cAdvisor as a global Swarm service (runs on every node)
docker service create \
  --name cadvisor \
  --mode global \
  --publish 9200:8080 \
  --mount type=bind,src=/,dst=/rootfs,readonly \
  --mount type=bind,src=/var/run,dst=/var/run \
  --mount type=bind,src=/sys,dst=/sys,readonly \
  --mount type=bind,src=/var/lib/docker,dst=/var/lib/docker,readonly \
  gcr.io/cadvisor/cadvisor:latest