SANDI Solr

Production Deployment with Docker Swarm

Overview

This guide covers deploying SANDI Solr across multiple dedicated servers using Docker Swarm. Each service runs on its own host, pinned by node label placement constraints. The result is complete resource isolation, simplified per-service troubleshooting, and the ability to scale individual tiers independently. The guide walks through infrastructure preparation, Swarm cluster formation, node labelling, image distribution via a private registry, stack deployment, verification, day-two operations, and detailed troubleshooting procedures.

Deployment topology

Manager node:  zoo1 (also runs ZooKeeper 1)
Worker nodes:  zoo2, zoo3 (ZooKeeper)
               solr1, solr2 (Apache Solr)
               search1, index1 (SANDI APIs)
               emb3, llm2, rer1 (AI services, GPU)
               nlp1, client_search, client_index (CPU services)

All containers communicate over a Docker overlay network (sandi_net) that spans all hosts. Services resolve each other by hostname (e.g. sandi_zoo1, sandi_solr1) via Swarm's built-in DNS.
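As a sketch of how those DNS names are used in practice, the snippet below assembles the internal connection strings a client on sandi_net might use. Service names follow this guide's compose conventions; the ports shown are illustrative, not confirmed container-internal ports.

```shell
#!/usr/bin/env bash
# Sketch: build internal endpoints from Swarm DNS names instead of host IPs.
# Service names follow this guide; ports are illustrative.
set -euo pipefail

zk_hosts=""
for zk in sandi_zoo1 sandi_zoo2 sandi_zoo3; do
  zk_hosts+="${zk_hosts:+,}${zk}:2181"
done
echo "ZK connect string: $zk_hosts"
# → ZK connect string: sandi_zoo1:2181,sandi_zoo2:2181,sandi_zoo3:2181
```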

Infrastructure Requirements

A minimum of 13 hosts is required for full isolation. All hosts must run Ubuntu 24.04 LTS (or compatible Linux) and must be able to reach each other on the network. Static IP addresses are strongly recommended.

Host label      Service                                    RAM     CPU       Disk          GPU

Coordination layer
zoo1            Swarm manager + ZooKeeper 1                4 GB    2 cores   100 GB SSD    -
zoo2            ZooKeeper 2                                4 GB    2 cores   100 GB SSD    -
zoo3            ZooKeeper 3                                4 GB    2 cores   100 GB SSD    -

Search storage layer
solr1           Apache Solr node 1                         32 GB   8 cores   500 GB SSD    -
solr2           Apache Solr node 2                         32 GB   8 cores   500 GB SSD    -

Application layer
search1         SANDI Search API (port 8081)               4 GB    4 cores   100 GB        -
index1          SANDI Index / Admin API (port 8082)        4 GB    4 cores   100 GB        -

AI services — GPU
emb3            Embedding service (Qwen3-Embedding-0.6B)   16 GB   4 cores   100 GB        NVIDIA ≥ 24 GB VRAM
llm2            LLM service (Qwen3-4B)                     16 GB   4 cores   100 GB        NVIDIA ≥ 24 GB VRAM
rer1            Reranking service (Qwen3-Reranker-0.6B)    16 GB   4 cores   100 GB        NVIDIA ≥ 24 GB VRAM

Other services — CPU
nlp1            NLP service (SpaCy)                        4 GB    4 cores   100 GB        -
client_search   Client search hook service                 4 GB    4 cores   100 GB        -
client_index    Client index hook service                  4 GB    4 cores   100 GB        -

Required open ports

Port(s)      Protocol    Purpose                                    Open on
2377         TCP         Docker Swarm cluster management            All hosts
7946         TCP + UDP   Docker Swarm node-to-node communication    All hosts
4789         UDP         Docker overlay network (VXLAN)             All hosts
2181–2183    TCP         ZooKeeper client connections               zoo1–zoo3
7001–7003    TCP         ZooKeeper Prometheus metrics               zoo1–zoo3
8981–8982    TCP         Solr admin / query API                     solr1–solr2
8081         TCP         SANDI Search REST API                      search1
8082         TCP         SANDI Index / Admin REST API               index1
8083–8088    TCP         AI and hook services                       emb3, llm2, nlp1, rer1, client_search, client_index
5000         TCP         Private Docker registry (if used)          manager (zoo1)
Swarm overlay networking requires ports 7946 (TCP and UDP) and 4789 (UDP) to be open between all hosts, not just between the manager and workers. If any host is behind a firewall or NAT that blocks them, the overlay network will silently fail to form and services will not be able to reach each other.
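Before forming the cluster, a quick TCP probe between hosts can rule out the obvious blocks. A minimal sketch follows; the host and port values are examples, and note that UDP 7946 and VXLAN 4789 cannot be verified this way (they need a UDP-capable tool or an actual overlay test).

```shell
#!/usr/bin/env bash
# Sketch: TCP reachability probe using bash's /dev/tcp (TCP ports only).
check_tcp() {
  local host=$1 port=$2
  if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port blocked"
  fi
}

# Example probes against the manager (IP is a placeholder)
check_tcp 192.168.1.10 2377
check_tcp 192.168.1.10 7946
```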

Phase 1 — Prepare All Hosts

1

Install Docker on every host

Run the following on all 13 hosts:

# Update packages
sudo apt update && sudo apt upgrade -y

# Install Docker using the official convenience script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Allow running Docker without sudo
sudo usermod -aG docker $USER
newgrp docker

# Verify
docker --version
2

Install NVIDIA drivers and Container Toolkit on GPU hosts

Run only on the three GPU hosts: emb3, llm2, rer1.

# Install NVIDIA drivers
sudo apt install -y nvidia-driver-535
sudo reboot

# After reboot — verify GPU is visible
nvidia-smi

# Add NVIDIA Container Toolkit repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-docker2

# Configure Docker to use the NVIDIA runtime by default
sudo tee /etc/docker/daemon.json <<EOF
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker

# Verify GPU access inside a container
docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi
The default-runtime: nvidia setting is required for Docker Swarm. Swarm does not support the runtime: key in compose files, so NVIDIA must be set as the default runtime on the GPU host — otherwise the GPU reservation constraint will be ignored and the service will start without GPU access.
3

Open firewall ports on every host

# Swarm inter-node ports — ALL hosts
sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp

# Service-specific ports — open only on the relevant host
# ZooKeeper hosts (zoo1, zoo2, zoo3)
sudo ufw allow 2181:2183/tcp
sudo ufw allow 7001:7003/tcp

# Solr hosts (solr1, solr2)
sudo ufw allow 8981:8982/tcp

# Application hosts (search1, index1)
sudo ufw allow 8081:8082/tcp

# AI service hosts (emb3, llm2, nlp1, rer1, client_search, client_index)
sudo ufw allow 8083:8088/tcp

# Private registry — manager host (zoo1) only
sudo ufw allow 5000/tcp

sudo ufw --force enable
sudo ufw status

Phase 2 — Form the Docker Swarm Cluster

1

Initialise the Swarm on the manager node (zoo1)

# Replace with the actual IP address of the zoo1 host
docker swarm init --advertise-addr <ZOO1_IP>

# Example output:
Swarm initialized: current node (abc123) is now a manager.
To add a worker to this swarm, run the following command:
    docker swarm join --token SWMTKN-1-<token> <ZOO1_IP>:2377
Save the printed join token immediately. If you lose it, you can retrieve it later with: docker swarm join-token worker (run on the manager).
2

Join all 12 worker nodes

Run on each of the remaining 12 hosts:

docker swarm join --token <TOKEN> <ZOO1_IP>:2377

For bulk automation with Ansible or a shell loop:

# From your workstation — assumes SSH key auth is set up
WORKERS=(
  192.168.1.11  # zoo2
  192.168.1.12  # zoo3
  192.168.1.13  # solr1
  192.168.1.14  # solr2
  192.168.1.15  # search1
  192.168.1.16  # index1
  192.168.1.17  # emb3
  192.168.1.18  # llm2
  192.168.1.19  # nlp1
  192.168.1.20  # rer1
  192.168.1.21  # client_search
  192.168.1.22  # client_index
)
for host in "${WORKERS[@]}"; do
  ssh user@$host "docker swarm join --token <TOKEN> <ZOO1_IP>:2377"
done
3

Verify cluster formation

On the manager node:

docker node ls

Expected output — 13 nodes, all with STATUS = Ready and AVAILABILITY = Active:

ID         HOSTNAME    STATUS   AVAILABILITY   MANAGER STATUS
abc123 *   zoo1-host   Ready    Active         Leader
def456     zoo2-host   Ready    Active
ghi789     zoo3-host   Ready    Active
... (10 more workers)
For resilience, additional nodes can be promoted to managers later: docker node promote <zoo2_node_id>. A three-manager quorum (zoo1, zoo2, zoo3) tolerates one manager failure without losing control of the cluster.

Phase 3 — Label Nodes for Service Placement

Labels tell Swarm which host each service should run on. Every service in the Swarm compose file carries a deploy.placement.constraints entry that matches exactly one label, pinning the service to the correct host.
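For illustration, a single service entry might look like the following sketch. The field values are assumptions for this example; the actual docker-compose.swarm.yml ships with the SANDI project.

```yaml
# Sketch of one service entry in docker-compose.swarm.yml
services:
  sandi_solr1:
    image: solr:9.8.1
    networks:
      - sandi_net
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.service == solr1
```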

1

Collect node IDs

# List all nodes with their IDs and hostnames
docker node ls --format "table {{.ID}}\t{{.Hostname}}\t{{.Status}}"
2

Apply labels to each node

Replace the node IDs with actual values from the previous command:

# Coordination layer
docker node update --label-add service=zoo1 <ZOO1_NODE_ID>
docker node update --label-add service=zoo2 <ZOO2_NODE_ID>
docker node update --label-add service=zoo3 <ZOO3_NODE_ID>

# Search storage layer
docker node update --label-add service=solr1 <SOLR1_NODE_ID>
docker node update --label-add service=solr2 <SOLR2_NODE_ID>

# Application layer
docker node update --label-add service=search1 <SEARCH1_NODE_ID>
docker node update --label-add service=index1 <INDEX1_NODE_ID>

# AI services
docker node update --label-add service=emb3 <EMB3_NODE_ID>
docker node update --label-add service=llm2 <LLM2_NODE_ID>
docker node update --label-add service=nlp1 <NLP1_NODE_ID>
docker node update --label-add service=rer1 <RER1_NODE_ID>
docker node update --label-add service=client_search <CLIENT_SEARCH_NODE_ID>
docker node update --label-add service=client_index <CLIENT_INDEX_NODE_ID>
3

Verify labels

# Quick check — shows hostname and labels for each node
docker node ls -q | xargs -I{} docker node inspect {} \
  --format 'Node: {{.Description.Hostname}}  Labels: {{.Spec.Labels}}'

Each node should show exactly one service=<name> label.
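That check can be scripted. The sketch below runs against a captured sample so the failure case is visible; hostnames and label values are examples, and in practice the input would be the output of the docker node inspect command shown above.

```shell
#!/usr/bin/env bash
# Sketch: flag any node whose label map lacks a service=<name> entry.
set -euo pipefail

# Sample of "hostname labels" lines (real input comes from docker node inspect)
sample='zoo1-host map[service:zoo1]
solr1-host map[service:solr1]
emb3-host map[]'

bad=0
while read -r host labels; do
  if [[ "$labels" != *"service:"* ]]; then
    echo "MISSING label on $host"
    bad=1
  fi
done <<< "$sample"
echo "bad=$bad"
```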

Phase 4 — Private Docker Registry

The six custom Python services (emb3, llm2, nlp1, rer1, client_search, client_index) need their Docker images available on their target hosts. A private registry running on the manager node is the cleanest way to build once and distribute to all workers — no manual per-host builds required.

1

Start the registry service on the manager node

# Deploy registry as a Swarm service pinned to the manager
docker service create \
  --name registry \
  --publish 5000:5000 \
  --constraint 'node.role == manager' \
  --mount type=volume,src=registry-data,dst=/var/lib/registry \
  registry:2

# Verify it is running
docker service ls | grep registry
2

Configure all worker nodes to trust the registry

Because the registry uses HTTP (not HTTPS), Docker must be told to allow it as an insecure registry. Run on all 13 hosts:

# Add the manager IP as an insecure registry
sudo tee /etc/docker/daemon.json <<EOF
{
  "insecure-registries": ["<ZOO1_IP>:5000"]
}
EOF
# On GPU hosts, merge with the existing nvidia runtime config instead:
# { "default-runtime": "nvidia", ..., "insecure-registries": ["<ZOO1_IP>:5000"] }
sudo systemctl restart docker
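For the GPU hosts, the merged file might look like this sketch, validated before restarting Docker. The manager IP is a placeholder, and the file is written to a temp path here; on a real host it would be /etc/docker/daemon.json followed by a Docker restart.

```shell
#!/usr/bin/env bash
# Sketch: combined daemon.json for a GPU host (NVIDIA default runtime
# plus the insecure registry), validated with python3's json module.
set -euo pipefail
ZOO1_IP=192.168.1.10   # placeholder: the manager IP used in this guide

tmp=$(mktemp)
cat > "$tmp" <<EOF
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
  },
  "insecure-registries": ["${ZOO1_IP}:5000"]
}
EOF
python3 -m json.tool "$tmp" > /dev/null && echo "daemon.json is valid JSON"
```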
3

Build and push all custom images

On the manager node, from the project directory:

cd /opt/sandi
for service in emb3 llm2 nlp1 rer1 client_search client_index; do
  echo "Building $service ..."
  docker build -t <ZOO1_IP>:5000/sandi_$service:latest ./$service
  docker push <ZOO1_IP>:5000/sandi_$service:latest
done

# Verify images are in the registry
curl http://<ZOO1_IP>:5000/v2/_catalog
The Swarm compose file must reference the registry-prefixed image names, e.g. <ZOO1_IP>:5000/sandi_emb3:latest. Swarm pulls images from the registry on the target node automatically at deploy time — no manual pre-pulling is needed.

Phase 5 — Distribute Project Files

1

Sync the project directory to all hosts

From your workstation or the manager node:

HOSTS=(
  192.168.1.10 192.168.1.11 192.168.1.12   # zoo1–3
  192.168.1.13 192.168.1.14                # solr1–2
  192.168.1.15 192.168.1.16                # search1, index1
  192.168.1.17 192.168.1.18 192.168.1.19   # emb3, llm2, nlp1
  192.168.1.20 192.168.1.21 192.168.1.22   # rer1, client_search, client_index
)
for host in "${HOSTS[@]}"; do
  rsync -avz --progress /opt/sandi/ user@$host:/opt/sandi/
done
2

Create data directories on each host

Docker bind mounts require the directories to exist on the host before the container starts. Swarm will not create them automatically.

ZooKeeper hosts (zoo1, zoo2, zoo3)

cd /opt/sandi
mkdir -p data/zoo1/{data,datalog}
mkdir -p data/zoo2/{data,datalog}
mkdir -p data/zoo3/{data,datalog}
chmod -R 777 data/

Solr hosts (solr1, solr2)

cd /opt/sandi
mkdir -p data/solr1
mkdir -p data/solr2
chmod -R 777 data/

Search host (search1)

cd /opt/sandi
mkdir -p search/webapps
mkdir -p search/logs
chmod -R 777 search/

Index host (index1)

cd /opt/sandi
mkdir -p index/webapps
mkdir -p index/logs
mkdir -p documents
chmod -R 777 index/ documents/
Copy the built sandi.war files into search/webapps/ and index/webapps/ on the respective hosts before deploying the stack, or the Tomcat containers will start with empty webapps directories and return 404.
3

Place configuration files

The Search and Index API containers read their properties from /sandi/conf/. The bind mount in the compose file maps ./conf on the host to this path. Ensure the correct sandi-solr-search.properties and sandi-solr-index.properties are in /opt/sandi/conf/ on the search1 and index1 hosts respectively, with the ZooKeeper connection string pointing at the actual ZK host IPs:

# In sandi-solr-search.properties and sandi-solr-index.properties
sandi.solr.zk.hosts=<ZOO1_IP>:2181,<ZOO2_IP>:2181,<ZOO3_IP>:2181

# AI service URLs must point to the actual host IPs
sandi.service.emb.url=http://<EMB3_IP>:8083
sandi.service.llm.url=http://<LLM2_IP>:8084
sandi.service.nlp.url=http://<NLP1_IP>:8085
sandi.service.rer.url=http://<RER1_IP>:8086
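One way to keep search1 and index1 consistent is to generate both properties files from a single set of variables. A sketch, with placeholder IPs and only two of the properties shown:

```shell
#!/usr/bin/env bash
# Sketch: render the properties from one set of host variables.
set -euo pipefail
ZOO1_IP=192.168.1.10; ZOO2_IP=192.168.1.11; ZOO3_IP=192.168.1.12
EMB3_IP=192.168.1.17

conf=$(mktemp -d)   # on a real host: /opt/sandi/conf
cat > "$conf/sandi-solr-search.properties" <<EOF
sandi.solr.zk.hosts=${ZOO1_IP}:2181,${ZOO2_IP}:2181,${ZOO3_IP}:2181
sandi.service.emb.url=http://${EMB3_IP}:8083
EOF
grep zk.hosts "$conf/sandi-solr-search.properties"
```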
On the overlay network, services can also be reached by their Swarm DNS hostname (e.g. sandi_emb3) instead of IP. Using hostnames is more resilient to IP changes but requires that the properties files are updated after deployment if the service name changes.

Phase 6 — Deploy the Stack

1

Create the overlay network

# Run on the manager node
docker network create \
  --driver overlay \
  --attachable \
  --subnet 10.10.0.0/16 \
  sandi_net

# Verify
docker network ls | grep sandi_net
2

Deploy the stack

On the manager node, from the directory containing docker-compose.swarm.yml:

cd /opt/sandi
docker stack deploy -c docker-compose.swarm.yml sandi

Swarm will schedule all 13 services. On first run, each worker node pulls its images from the private registry. This may take several minutes, especially for the GPU service images which can be several GB.

3

Monitor deployment progress

# Watch all services until REPLICAS shows 1/1 for each
watch docker stack services sandi

# See which node each task is running on
docker stack ps sandi --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}\t{{.Error}}"

ZooKeeper and Solr can take 30–90 seconds to reach a healthy state. The application API containers (search1, index1) depend on Solr being ready, so they may restart once before settling.
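If the rollout is scripted, a small retry helper can encode that wait instead of sleeping a fixed time. A sketch follows; the attempt counts, delays, and the example Solr URL are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or attempts run out, e.g.
#   retry 18 5 curl -sf http://<SOLR1_IP>:8981/solr/    # up to ~90 s
retry() {
  local attempts=$1 delay=$2; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then return 0; fi
    sleep "$delay"
  done
  return 1
}

retry 3 0 true  && echo "came up"
retry 2 0 false || echo "gave up"
```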

Phase 7 — Verify the Deployment

1

Check all services show 1/1 replicas

docker stack services sandi

Expected output — all 13 services with REPLICAS 1/1:

ID        NAME                        MODE        REPLICAS  IMAGE
xxxxxxxx  sandi_sandi_zoo1            replicated  1/1       zookeeper:3.9.2
xxxxxxxx  sandi_sandi_zoo2            replicated  1/1       zookeeper:3.9.2
xxxxxxxx  sandi_sandi_zoo3            replicated  1/1       zookeeper:3.9.2
xxxxxxxx  sandi_sandi_solr1           replicated  1/1       solr:9.8.1
xxxxxxxx  sandi_sandi_solr2           replicated  1/1       solr:9.8.1
xxxxxxxx  sandi_sandi_search1         replicated  1/1       tomcat:10.1.44-jdk17-temurin-noble
xxxxxxxx  sandi_sandi_index1          replicated  1/1       tomcat:10.1.44-jdk17-temurin-noble
xxxxxxxx  sandi_sandi_emb3            replicated  1/1       <ZOO1_IP>:5000/sandi_emb3:latest
xxxxxxxx  sandi_sandi_llm2            replicated  1/1       <ZOO1_IP>:5000/sandi_llm2:latest
xxxxxxxx  sandi_sandi_nlp1            replicated  1/1       <ZOO1_IP>:5000/sandi_nlp1:latest
xxxxxxxx  sandi_sandi_rer1            replicated  1/1       <ZOO1_IP>:5000/sandi_rer1:latest
xxxxxxxx  sandi_sandi_client_search   replicated  1/1       <ZOO1_IP>:5000/sandi_client_search:latest
xxxxxxxx  sandi_sandi_client_index    replicated  1/1       <ZOO1_IP>:5000/sandi_client_index:latest
2

Test ZooKeeper quorum

# Each ZK node must respond "imok"
echo ruok | nc <ZOO1_IP> 2181   # → imok
echo ruok | nc <ZOO2_IP> 2182   # → imok
echo ruok | nc <ZOO3_IP> 2183   # → imok

# Check which node is the ZK leader
echo stat | nc <ZOO1_IP> 2181 | grep -E "Mode|Connections"
3

Test Solr cluster

# Solr admin should return 200 on both nodes
curl -s -o /dev/null -w "%{http_code}" http://<SOLR1_IP>:8981/solr/
curl -s -o /dev/null -w "%{http_code}" http://<SOLR2_IP>:8982/solr/

# Check the SolrCloud cluster status
curl http://<SOLR1_IP>:8981/solr/admin/collections?action=CLUSTERSTATUS | python3 -m json.tool
4

Test application and AI services

# Search and Index APIs
curl http://<SEARCH1_IP>:8081/sandi/
curl http://<INDEX1_IP>:8082/sandi/

# AI services — should return HTTP 200
curl http://<EMB3_IP>:8083/
curl http://<LLM2_IP>:8084/
curl http://<NLP1_IP>:8085/
curl http://<RER1_IP>:8086/
curl http://<CLIENT_SEARCH_IP>:8087/
curl http://<CLIENT_INDEX_IP>:8088/
5

Open the SANDI web interfaces

Interface    URL
Search UI    http://<SEARCH1_IP>:8081/sandi/en/sandi-search.html
Index UI     http://<INDEX1_IP>:8082/sandi/en/sandi-index.html
Admin UI     http://<INDEX1_IP>:8082/sandi/en/sandi-admin.html
Solr Admin   http://<SOLR1_IP>:8981/solr/

Day-Two Operations

View service status and placement

# All services and replica counts
docker stack services sandi

# Which task runs on which host (detailed)
docker stack ps sandi --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}\t{{.Error}}"

# Tasks that have failed or been restarted
docker stack ps sandi --filter "desired-state=shutdown" --no-trunc

View logs

# Follow logs for a specific service
docker service logs -f --tail 100 sandi_sandi_search1
docker service logs -f --tail 100 sandi_sandi_solr1
docker service logs -f --tail 100 sandi_sandi_emb3

# Show timestamps
docker service logs -f --timestamps sandi_sandi_index1

Update a service image (rolling update)

# Rebuild and push the new image to the registry
docker build -t <ZOO1_IP>:5000/sandi_emb3:v2 ./emb3
docker push <ZOO1_IP>:5000/sandi_emb3:v2

# Update the running service — Swarm pulls and restarts it
docker service update --image <ZOO1_IP>:5000/sandi_emb3:v2 sandi_sandi_emb3

# For the Tomcat-based API services, redeploy the stack after updating the WAR file
docker stack deploy -c docker-compose.swarm.yml sandi

Redeploy the full stack after config changes

# Re-running stack deploy is idempotent — only changed services are updated
docker stack deploy -c docker-compose.swarm.yml sandi

Scale the Search API horizontally

# First remove the placement constraint that pins search1 to a single node
docker service update \
  --constraint-rm "node.labels.service == search1" \
  sandi_sandi_search1

# Scale to 3 replicas — Swarm will distribute across available nodes
docker service scale sandi_sandi_search1=3

# Add a load balancer (nginx / HAProxy) in front of all three instances

Remove the stack

# Stops and removes all services — data in bind-mount volumes is preserved on each host
docker stack rm sandi

Security Hardening

Use Docker Secrets for sensitive values

# Create secrets on the manager node
echo "solradmin" | docker secret create solr_username -
echo "StrongPass1!" | docker secret create solr_password -

# Reference secrets in docker-compose.swarm.yml:
services:
  sandi_solr1:
    secrets:
      - solr_username
      - solr_password
    environment:
      - SOLR_AUTHENTICATION_OPTS=-Dbasicauth=$(cat /run/secrets/solr_username):$(cat /run/secrets/solr_password)

secrets:
  solr_username:
    external: true
  solr_password:
    external: true

Encrypt the overlay network

# Create an encrypted overlay network (IPSec)
docker network create \
  --driver overlay \
  --attachable \
  --opt encrypted \
  --subnet 10.10.0.0/16 \
  sandi_net

Lock the Swarm

# Swarm lock requires a key to unlock after a manager restart
docker swarm update --autolock=true
# Save the printed unlock key securely — losing it means you cannot rejoin after restart

# To unlock after manager restart:
docker swarm unlock

Enable Solr authentication

# In sandi-solr-search.properties and sandi-solr-index.properties
sandi.solr.username=solradmin
sandi.solr.password=StrongPass1!

Backup Procedures

Backup Swarm state (manager node)

# Stop Docker, archive Swarm state, restart
sudo systemctl stop docker
sudo tar -czvf swarm-backup-$(date +%Y%m%d).tar.gz -C /var/lib/docker/swarm .
sudo systemctl start docker

Backup ZooKeeper data

# On each ZooKeeper host (zoo1, zoo2, zoo3)
sudo tar -czvf zoo-backup-$(date +%Y%m%d).tar.gz \
  -C /opt/sandi/data/zoo1 .

Backup Solr data

# Use the Solr backup API to create a consistent snapshot
curl "http://<SOLR1_IP>:8981/solr/<collection>/replication?command=backup&name=backup-$(date +%Y%m%d)"

# Or tar the data directory directly after stopping the Solr service
docker service scale sandi_sandi_solr1=0
sudo tar -czvf solr-backup-$(date +%Y%m%d).tar.gz -C /opt/sandi/data/solr1 .
docker service scale sandi_sandi_solr1=1
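Dated archives accumulate quickly, so a retention step is worth scripting. A sketch that keeps only the newest N archives; the backup directory and retention count are assumptions, and the dated files here are simulated in a temp directory.

```shell
#!/usr/bin/env bash
# Sketch: prune all but the newest $KEEP backup archives.
set -euo pipefail
BACKUP_DIR=$(mktemp -d)   # on a real host e.g. /opt/sandi/backups
KEEP=3

# Simulate five dated backups
for d in 01 02 03 04 05; do
  touch "$BACKUP_DIR/solr-backup-202501$d.tar.gz"
done

# Delete everything beyond the newest $KEEP
# (lexicographic sort matches chronological order for YYYYMMDD names)
ls -1 "$BACKUP_DIR"/solr-backup-*.tar.gz | sort | head -n -"$KEEP" | xargs -r rm --
ls -1 "$BACKUP_DIR" | wc -l
```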

Troubleshooting

Service stuck at 0/1 replicas — never starts

Diagnose:

docker service ps sandi_sandi_<service> --no-trunc
docker service logs sandi_sandi_<service>

Common causes:

  • Image not found: The image has not been pushed to the registry or the registry address in the compose file is wrong. Push the image and verify with curl http://<ZOO1_IP>:5000/v2/_catalog.
  • No node matches placement constraint: The node label is missing or mistyped. Verify with docker node inspect <node_id> --format '{{.Spec.Labels}}'.
  • Bind mount path missing: The directory on the host does not exist. Create it and re-deploy.
  • Port already in use: Another process is occupying the published port. Check with ss -tlnp | grep <port> on the target host.
Service starts but immediately exits (0/1 → restarts in loop)
# See the actual error from the most recent failed task
docker service ps sandi_sandi_<service> --no-trunc --filter "desired-state=shutdown"
docker service logs sandi_sandi_<service> 2>&1 | tail -50

Common causes:

  • ZooKeeper not ready (Solr): Solr cannot connect to ZK on startup. Wait 60–90 s and re-check — Swarm's restart policy will keep retrying.
  • Wrong properties path (Tomcat): Confirm the -Dspring.config.location JAVA_OPTS value points to a file that actually exists inside the container.
  • GPU not accessible (emb3, llm2, rer1): NVIDIA runtime is not set as default on the host. See step 2 of Phase 1.
  • Python import error (AI services): A Python dependency is missing in the image. Rebuild with docker build --no-cache.
  • Out of memory: The host has insufficient RAM. Check docker stats and the host's dmesg | grep -i oom.
GPU not used by AI services (model runs on CPU, very slow)
# Verify GPU runtime is the default on the GPU host
docker info | grep -i runtime

# Check that the container can see the GPU
docker run --rm nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi

# Inspect the running container on the GPU host directly
docker ps | grep sandi_emb
docker inspect <container_id> | grep -A5 '"Runtime"'

daemon.json must have "default-runtime": "nvidia". Docker Swarm ignores the runtime: key in compose files — the NVIDIA runtime must be set at the daemon level.

Services cannot reach each other over the overlay network
# Verify the overlay network exists and spans all nodes
docker network inspect sandi_net | grep -E '"Scope"|"Driver"|Peers'

# Test DNS resolution from inside a running container
docker exec -it $(docker ps -q -f label=com.docker.swarm.service.name=sandi_sandi_solr1) \
  ping -c 3 sandi_zoo1

# Check Swarm ports are open between hosts (run on each host)
sudo ufw status | grep -E '2377|7946|4789'

Common causes:

  • UDP port 4789 (VXLAN) blocked by firewall or cloud security group — this is the most common cause.
  • Hosts are on different network segments with VXLAN not permitted by the router.
  • The overlay network was created before all nodes joined — recreate it after all nodes are in the cluster.
Solr cannot connect to ZooKeeper
# Check ZooKeeper is healthy on all three nodes
echo ruok | nc <ZOO1_IP> 2181

# Check Solr logs for the ZK connection error
docker service logs sandi_sandi_solr1 2>&1 | grep -i "zookeeper\|zk\|connection"

# Verify the ZK_HOST env var in the compose file matches the actual ZK IPs/hostnames
docker service inspect sandi_sandi_solr1 | grep ZK_HOST

The ZK_HOST environment variable must use hostnames that resolve inside the overlay network (e.g. sandi_zoo1:2181) or IP addresses that are reachable from the Solr container.
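A quick format check of the connection string can catch typos (a missing port, a stray space) before a redeploy. A sketch, with example values:

```shell
#!/usr/bin/env bash
# Sketch: validate that a ZK connection string is comma-separated host:port entries.
zk_ok() {
  [[ "$1" =~ ^[A-Za-z0-9._-]+:[0-9]+(,[A-Za-z0-9._-]+:[0-9]+)*$ ]]
}

zk_ok "sandi_zoo1:2181,sandi_zoo2:2181,sandi_zoo3:2181" && echo valid
zk_ok "sandi_zoo1,sandi_zoo2" || echo "invalid: missing ports"
```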

SANDI API returns 503 / cannot connect to Solr or AI services
# Check the application log for connection errors
docker service logs sandi_sandi_search1 2>&1 | grep -iE "error|exception|refused|timeout" | tail -30

# Verify properties file URLs are correct on the search1 / index1 host
cat /opt/sandi/conf/sandi-solr-search.properties | grep -E "url|host|zk"

# Test connectivity from the search1 host directly
curl http://<EMB3_IP>:8083/
curl http://<SOLR1_IP>:8981/solr/
A host goes down — service is stuck, Swarm does not reschedule
# Check the node status
docker node ls

# Option 1: Bring the node back online, Swarm restarts the service automatically

# Option 2: Move the service to another node manually
# Remove the placement constraint and add a new one pointing to an available node
docker service update \
  --constraint-rm "node.labels.service == emb3" \
  --constraint-add "node.labels.service == spare_gpu_host" \
  sandi_sandi_emb3

# Option 3: Remove the placement constraint entirely — Swarm picks any available node
docker service update \
  --constraint-rm "node.labels.service == emb3" \
  sandi_sandi_emb3
Because each service is pinned to a specific node by label constraint, Swarm will not automatically reschedule it to another node when the host goes down. You must intervene manually. For stateless services (AI services, API) removing the constraint and letting Swarm reschedule is safe. For Solr and ZooKeeper, data locality matters — prefer restoring the original node.
docker stack deploy fails with "network not found"
# The overlay network must be created before deploying the stack
docker network ls | grep sandi_net

# If missing, create it
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 sandi_net

# Then redeploy
docker stack deploy -c docker-compose.swarm.yml sandi
Manager node restarted — Swarm is locked
# If autolock is enabled, unlock the Swarm after a manager restart
docker swarm unlock
# Enter the unlock key saved when autolock was enabled

# If the key was lost, you must force a new manager
# (this requires at least one other healthy manager in the cluster)
Useful diagnostic commands — quick reference
# Cluster-wide service state
docker stack ps sandi --no-trunc

# Resource usage on a specific host
ssh user@<HOST_IP> docker stats --no-stream

# Inspect a specific service's full configuration
docker service inspect --pretty sandi_sandi_solr1

# List all tasks ever run (including failed) for a service
docker service ps sandi_sandi_llm2 --no-trunc --filter "desired-state=shutdown"

# Enter a running container interactively for debugging
docker exec -it $(docker ps -q -f label=com.docker.swarm.service.name=sandi_sandi_nlp1) bash

# Force a service restart
docker service update --force sandi_sandi_search1

Monitoring

ZooKeeper already exposes Prometheus metrics on ports 7001–7003 via the metricsProvider configuration in the compose file. Recommended monitoring stack:

Tool                      Purpose                             Notes
Prometheus                Metrics collection                  Scrape ZooKeeper on :7001–7003/metrics, Solr on :8981/solr/admin/metrics
Grafana                   Dashboards                          Pre-built ZooKeeper and Solr dashboards available on grafana.com
cAdvisor                  Container CPU / RAM / GPU metrics   Deploy as a Swarm global service so it runs on every node
Docker Swarm Visualizer   Live service placement map          dockersamples/visualizer — deploy on the manager, port 8080
# Deploy cAdvisor as a global Swarm service (runs on every node)
docker service create \
  --name cadvisor \
  --mode global \
  --publish 9200:8080 \
  --mount type=bind,src=/,dst=/rootfs,readonly \
  --mount type=bind,src=/var/run,dst=/var/run \
  --mount type=bind,src=/sys,dst=/sys,readonly \
  --mount type=bind,src=/var/lib/docker,dst=/var/lib/docker,readonly \
  gcr.io/cadvisor/cadvisor:latest