SANDI Solr

OpenAI Cloud Integration

Abstract

SANDI supports pluggable AI back-ends through a service adapter pattern. The emb5, llm3, and rer2 adapter services replace locally hosted GPU models with OpenAI API calls, enabling a fully functional SANDI stack — including vector embeddings, hybrid search, spell checking, query expansion, RAG answer generation, and result reranking — on standard hardware without any GPU. This document describes the integration architecture, configuration, and the trade-offs between OpenAI cloud services and local self-hosted models.

1. Integration Architecture

SANDI's AI functionality is delegated to three independent microservices, each exposing a lightweight HTTP API. The search and index modules communicate with these services via configurable URLs in the properties file. Any service can be swapped between local and cloud-backed implementations with a URL change — the interface contract is identical.
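
As an illustration of that swap, the change might look like the following properties fragment. The property key shown here is hypothetical; consult the actual sandi-solr-search.properties and sandi-solr-index.properties for the real key names.

```properties
# Hypothetical key name -- check the real properties files for the exact keys.
# Local GPU embedding service (opt-in profile):
# embedding.service.url=http://emb4:8084
# OpenAI-backed embedding service (default):
embedding.service.url=http://emb5:8085
```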

Search and indexing pipeline with OpenAI services (diagram):

  • Indexing: document text → emb5 (text-embedding-3-large) → dense vector stored in Solr
  • Search: user query → emb5 (query vector) → KNN + BM25 hybrid search → rer2 (gpt-4o-mini rescoring) → final results
  • Assist: user query → llm3 (o4-mini) → spell check / query expansion / RAG answer

OpenAI adapter services

  • emb5 — Embeddings. Default model: text-embedding-3-large. Port: 8085. Endpoints: GET/POST /embed, POST /embedding, POST /similarity. Replaces: emb3, emb4, emb6 (GPU).
  • llm3 — Language model. Default model: o4-mini. Port: 8087. Endpoints: GET/POST /run. Replaces: llm2 — Qwen3-4B (GPU).
  • rer2 — Reranker. Default model: gpt-4o-mini. Port: 8088. Endpoints: POST /rescore. Replaces: rer1 — Qwen3-0.6B (GPU).
The OpenAI services start by default in docker-compose (emb5, llm3, rer2 have no profile restriction). Local GPU services (emb3, emb4, emb6, llm2, rer1) are opt-in via --profile manual_start. Only one embedding service, one LLM service, and one reranker service should be active at a time — they each bind to the same hostname alias and port.
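
The two startup modes can be sketched as the following compose commands. The exact service selection is illustrative; verify against the project's docker-compose.yml.

```shell
# Default: start the stack with the OpenAI-backed services (emb5, llm3, rer2)
docker compose up -d

# Opt-in: start the local GPU services instead (manual_start profile);
# stop the OpenAI counterparts so only one service holds each alias and port
docker compose --profile manual_start up -d
```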

Embedding model options for emb5

Model Max dimensions Quality Notes
text-embedding-3-large 3072 Best Default; supports dimension reduction via OPENAI_DIMENSIONS
text-embedding-3-small 1536 Good Faster and cheaper; suitable for cost-sensitive workloads
text-embedding-ada-002 1536 (fixed) Legacy Older generation; does not support dimension reduction
Changing the embedding model or OPENAI_DIMENSIONS after documents have been indexed requires a full re-index. Vector similarity scores are not comparable across different models or dimension counts. The Solr collection schema must also be updated to match the new vector dimension.
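
For reference, a dense vector field in a Solr 9 managed schema is declared as below. The field and type names here are illustrative; the actual SANDI collection schema may use different names.

```xml
<!-- Illustrative names; vectorDimension must equal OPENAI_DIMENSIONS -->
<fieldType name="knn_vector_1024" class="solr.DenseVectorField"
           vectorDimension="1024" similarityFunction="cosine"/>
<field name="vector" type="knn_vector_1024" indexed="true" stored="true"/>
```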

2. Configuration Reference

All OpenAI settings are controlled via environment variables. In the provided deployment, these are stored in a .env file in the deployment root directory, which docker-compose reads automatically at startup. Each variable has a documented default defined in docker-compose.yml; the .env file only needs to contain overrides and the required API key.

Environment variables

OPENAI_API_KEY (required; no default; used by emb5, llm3, rer2)
  Your OpenAI platform API key. Starts with sk-. Obtain from platform.openai.com → API keys. Without this key all three services will reject every request.

OPENAI_BASE_URL (optional; default https://api.openai.com/v1; used by emb5, llm3, rer2)
  API endpoint base URL. Override to use Azure OpenAI Service (https://<resource>.openai.azure.com/openai/deployments/<deployment>), a local OpenAI-compatible server (Ollama, vLLM, LM Studio), or any other provider that mirrors the OpenAI REST API.

OPENAI_EMB_MODEL (optional; default text-embedding-3-large; used by emb5)
  Embedding model name passed to the /embeddings endpoint. Accepted values: text-embedding-3-large, text-embedding-3-small, text-embedding-ada-002. When using a custom base URL, this must match the model name or deployment name registered on that server.

OPENAI_DIMENSIONS (optional; default 1024; used by emb5)
  Output vector length. Supported range: 1 to the model maximum (3072 for text-embedding-3-large, 1536 for text-embedding-3-small). Not supported by text-embedding-ada-002 — omit or ignore when using that model. Must match the vectorDimension field in the Solr collection schema. Reducing dimensions lowers storage cost and KNN search latency at the price of a small accuracy loss.

OPENAI_LLM_MODEL (optional; default o4-mini; used by llm3)
  Chat completion model for spell checking, query expansion, and RAG answer generation. Recommended values: o4-mini (fast reasoning, default), gpt-4o (highest quality, higher cost), gpt-4o-mini (balanced cost/quality). The model must support the Chat Completions API (/v1/chat/completions).

OPENAI_RER_MODEL (optional; default gpt-4o-mini; used by rer2)
  Chat completion model used for relevance scoring. All documents in a batch are scored in a single call using JSON output mode. A smaller, faster model (gpt-4o-mini) is sufficient because the reranker prompt is structured and deterministic. Temperature is set to 0.01 for near-deterministic output.

OPENAI_SYSTEM_PROMPT (optional; built-in default; used by llm3)
  Overrides the system prompt used by the LLM service. The default instructs the model to act as a precise search assistant and return only requested output without explanations or markdown. Useful when deploying against a fine-tuned or instruction-specific model that requires a custom system prompt.
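
OpenAI documents the dimension reduction for the text-embedding-3 models as truncating the full vector and re-normalizing it to unit length. A minimal sketch of that operation, for intuition about what OPENAI_DIMENSIONS does:

```python
import math

def shorten_embedding(vec, dims):
    """Truncate an embedding to its first `dims` components and
    re-normalize the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [3.0, 4.0, 12.0]                 # toy "embedding", not a real model output
small = shorten_embedding(full, 2)      # first 2 components, rescaled to unit length
```

This is why vectors produced at different dimension counts are not directly comparable: each count yields a differently rescaled vector.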

Minimal .env file

# Minimal configuration — only the API key is required
OPENAI_API_KEY=sk-your-api-key-here

Full .env file with all options

# OpenAI API configuration for SANDI services

# Required: your OpenAI platform API key
OPENAI_API_KEY=sk-your-api-key-here

# Optional: API base URL — default is the public OpenAI endpoint
# Override for Azure OpenAI or any OpenAI-compatible server
OPENAI_BASE_URL=https://api.openai.com/v1

# Optional: embedding model (emb5)
# text-embedding-3-large — best quality, max 3072 dims (default)
# text-embedding-3-small — faster and cheaper, max 1536 dims
# text-embedding-ada-002 — legacy, fixed 1536 dims, no dimension reduction
OPENAI_EMB_MODEL=text-embedding-3-large

# Optional: output vector dimensions (emb5)
# Must match the vectorDimension in the Solr collection schema
# Supported by text-embedding-3-large (1–3072) and text-embedding-3-small (1–1536)
OPENAI_DIMENSIONS=1024

# Optional: LLM model for spell check, query expansion, RAG answers (llm3)
# o4-mini — fast reasoning model (default)
# gpt-4o — highest quality, higher cost
# gpt-4o-mini — balanced cost and quality
OPENAI_LLM_MODEL=o4-mini

# Optional: reranker model (rer2)
# gpt-4o-mini is sufficient — reranking uses structured prompts at near-zero temperature
OPENAI_RER_MODEL=gpt-4o-mini

# Optional: override the LLM system prompt (llm3)
# Leave empty to use the built-in default
# OPENAI_SYSTEM_PROMPT=
The .env file is read by docker-compose automatically when it is placed in the same directory as docker-compose.yml. It is not mounted into any container — values are injected as environment variables at container start time. The file is listed in .gitignore by convention; never commit a real API key to version control.

3. Advantages

No GPU required — minimal hardware

All three OpenAI adapter services are thin Flask/Gunicorn wrappers that forward requests to the OpenAI API and return results. They consume negligible CPU and RAM (under 100 MB each). Local alternatives — emb3, emb4, emb6, llm2, rer1 — each load multi-gigabyte neural network weights and require a CUDA-capable GPU with several GB of VRAM. With the OpenAI services, a complete SANDI stack runs comfortably on a laptop, a basic VPS, or any machine with internet access.

Ideal for learning, development, and testing

Developers can explore hybrid search, RAG pipelines, reranking, and query expansion without GPU hardware investment. Switching models, adjusting embedding dimensions, or comparing retrieval strategies is a configuration change followed by a container restart. The full SANDI stack — Solr, ZooKeeper, search, index, NLP, embedding, LLM, and reranker — starts with a single docker compose up command.

Access to the newest and most capable models

OpenAI continuously releases improved models. The embedding service can use text-embedding-3-large with state-of-the-art retrieval quality, the LLM service defaults to o4-mini (a reasoning model capable of precise structured output), and the reranker uses gpt-4o-mini. Upgrading to a newly released model requires only an environment variable change — no model downloads, no compatibility testing, no quantization decisions.

Unlimited and elastic capacity

OpenAI's infrastructure scales automatically. There is no need to manage memory pressure, batching queues, or model sharding. Batch embedding requests send multiple texts in a single API call, and throughput grows with OpenAI's rate limits rather than local hardware constraints. For small-to-medium collections this is effectively unconstrained.
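
The batching pattern itself is simple to sketch: group text chunks into fixed-size batches and send each batch as one embeddings request. The batch size of 64 below is an arbitrary illustration, not a SANDI setting.

```python
def batch(items, size):
    """Yield successive fixed-size groups from a list of text chunks."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

chunks = [f"chunk {i}" for i in range(150)]
groups = list(batch(chunks, 64))  # 3 API calls instead of 150
```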

Zero-maintenance deployment

No model downloads, no CUDA driver management, no quantization decisions, no out-of-memory crashes during indexing. The only prerequisite is an active OpenAI API key. Deployments start in seconds and require no ongoing model maintenance.

Azure OpenAI and compatible API support

All three services honour an OPENAI_BASE_URL override, enabling transparent switching to Azure OpenAI Service (which offers private endpoints and data residency options), a self-hosted OpenAI-compatible server (Ollama, vLLM, LM Studio), or another cloud provider. No SANDI application code changes are required — only the base URL and model name in .env.

4. Disadvantages and Limitations

Network latency on every AI call

Each embedding, LLM completion, and reranking request crosses the public internet. Round-trip latency of 200–800 ms per call is typical. For interactive search, users notice slower response times compared to sub-millisecond local model inference. Bulk indexing of large document corpora is especially impacted: every text chunk requires a network round-trip to the embedding service, and the IndexingJobScheduler (which checks for work every second) will spend most of its time waiting on network I/O rather than processing.

Reduced throughput for batch indexing

Local GPU models can embed hundreds of text chunks per second. OpenAI's API, even with batch calls, is constrained by network bandwidth, serialization overhead, and per-account rate limits (tokens per minute and requests per minute). Indexing millions of documents will take significantly longer than with a local embedding service and may require explicit rate-limit handling, retry logic with exponential backoff, and throughput monitoring.
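
The retry logic mentioned above can be sketched as a generic exponential-backoff wrapper. This is not SANDI code; the exception type and delay values are illustrative placeholders for an HTTP 429 response.

```python
import time

class RateLimitError(Exception):
    """Stands in for an HTTP 429 (rate limited) response from the API."""

def with_backoff(call, retries=5, base_delay=1.0, sleep=time.sleep):
    """Invoke `call`, retrying on RateLimitError with exponential backoff:
    waits base_delay, 2*base_delay, 4*base_delay, ... between attempts."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Example: a call that is throttled twice before succeeding.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError()
    return "ok"

result = with_backoff(flaky, sleep=lambda s: None)  # -> "ok" after 3 attempts
```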

Data privacy and security risks

All document content, query text, and search results are transmitted to OpenAI's servers to obtain embeddings and LLM responses. Every text chunk from every indexed document is sent to the embedding API during indexing; every user search query is sent to the embedding API at search time; LLM calls include the full query and retrieved document passages for RAG generation. For use cases involving confidential, proprietary, legally privileged, or personally identifiable data (GDPR, HIPAA, financial, legal), this mode may be unacceptable or require explicit data processing agreements with OpenAI.

Air-gapped, offline, and strict data-residency environments cannot use the public OpenAI endpoint. Azure OpenAI Service with private endpoints and a Virtual Network may satisfy data-residency requirements in some jurisdictions, but transmitting data outside the local network is still required.

Ongoing API cost

Every embedding, completion, and reranking call is billed per token by OpenAI. Embedding costs scale with corpus size (every chunk is embedded once at index time) plus query volume (each search query is embedded at runtime). LLM and reranker costs scale with search query volume and result set size. For large-scale or high-traffic deployments, API costs can exceed the amortized hardware cost of running local models.
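
A back-of-the-envelope estimate of embedding spend follows this arithmetic. The per-million-token price used below is a placeholder, not a quoted rate; always check OpenAI's current pricing page.

```python
def embedding_cost_usd(corpus_tokens, queries_per_day, avg_query_tokens,
                       days, price_per_million_tokens):
    """Estimate embedding spend: one-off corpus indexing plus query traffic."""
    index_cost = corpus_tokens / 1_000_000 * price_per_million_tokens
    query_tokens = queries_per_day * avg_query_tokens * days
    query_cost = query_tokens / 1_000_000 * price_per_million_tokens
    return index_cost + query_cost

# 100M-token corpus, 10k queries/day at ~20 tokens each, over 30 days,
# at a placeholder price of $0.10 per 1M tokens -> about $10.60
total = embedding_cost_usd(100_000_000, 10_000, 20, 30, 0.10)
```

Note the asymmetry: the corpus dominates at index time, while query embedding cost stays small even at high traffic.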

Requires continuous internet connectivity

The OpenAI services are non-functional without a live connection to api.openai.com. Any network outage will cause all AI-dependent features — embedding generation, semantic search, spell checking, query expansion, RAG answers, and reranking — to fail. SANDI's basic BM25 text search through Solr continues to function, but the hybrid and intelligent search features become unavailable.

Rate limits and service interruptions

OpenAI enforces per-account rate limits on tokens per minute and requests per minute. Under heavy indexing load, requests may be throttled (HTTP 429), causing SANDI's indexing jobs to stall or fail. OpenAI API outages and scheduled maintenance windows directly affect SANDI availability when using cloud services.

Vendor dependency and model deprecation

Relying on OpenAI introduces dependency on a third-party provider's roadmap and pricing. Models can be deprecated (e.g., text-embedding-ada-002 was succeeded by the text-embedding-3 family). When the embedding model used at index time is no longer available, the entire corpus must be re-indexed with the replacement model, since vector similarity scores are not comparable across different model families or dimension counts.

5. OpenAI vs. Local Models — Comparison

For each criterion: OpenAI (emb5 / llm3 / rer2) vs. local GPU (emb4 / llm2 / rer1):

  • Hardware requirement: none, any CPU machine vs. CUDA GPU with 8+ GB VRAM per service
  • Setup time: minutes (API key + docker compose up) vs. hours (model downloads, driver setup)
  • Embedding latency: 200–800 ms per call (network bound) vs. 1–10 ms per batch (GPU bound)
  • Indexing throughput: low to medium (rate limited) vs. high (GPU saturated)
  • Model quality: state-of-the-art (text-embedding-3-large) vs. good (GTE-Large, competitive)
  • Model updates: instant (change an env var) vs. manual download and rebuild
  • Ongoing cost: per-token API billing vs. hardware + electricity (fixed)
  • Data privacy: data sent to OpenAI servers vs. fully on-premises, no external calls
  • Offline operation: not supported vs. full offline capability
  • Internet dependency: required at all times vs. none
  • Rate limits: OpenAI per-account limits vs. hardware capacity only
  • Best for: dev, learning, small collections vs. production, large corpora, sensitive data

6. Recommendations by Use Case

Recommended: Use OpenAI services

  • Development and prototyping
  • Learning hybrid search and RAG concepts
  • Demo and evaluation environments
  • Small document collections (under ~100k docs)
  • Infrequent indexing with low query volume
  • Multi-tenant SaaS where GPU provisioning is impractical
  • Rapid model comparison and experimentation

Not recommended for OpenAI services — use local GPU services instead

  • Confidential, personal, or regulated data (GDPR, HIPAA)
  • Air-gapped or offline deployments
  • Large-scale corpora requiring fast indexing
  • Real-time search with sub-100 ms latency requirements
  • High query volume where per-token costs accumulate
  • Environments with strict data residency requirements
  • Production systems requiring predictable latency SLAs
The switch between OpenAI and local services requires only changing the service URLs in sandi-solr-search.properties and sandi-solr-index.properties. If the embedding model or dimension count changes at the same time, the Solr collection schema must be updated and all documents re-indexed. Plan migrations carefully to avoid downtime.

7. Conclusion

The OpenAI integration adapters make SANDI immediately accessible to any developer or team that wants to explore semantic search, RAG, and hybrid retrieval without GPU infrastructure. A single OPENAI_API_KEY in the .env file unlocks the full SANDI feature set — vector embeddings, hybrid KNN + BM25 search, spell checking, query expansion, RAG answer generation, and relevance reranking — on commodity hardware. The trade-offs are real: every AI operation crosses the network, all indexed content reaches OpenAI servers, and costs scale with usage. For production systems with sensitive data, large corpora, or strict latency requirements, the local GPU-backed services remain the appropriate choice. The two modes are interchangeable — transitioning from cloud to on-premises is a configuration change, not an architecture change.

Tags: Apache Solr 9.8.1 · OpenAI API · text-embedding-3-large · o4-mini · gpt-4o-mini · Hybrid Search · RAG · No GPU required · Azure OpenAI compatible · External data transmission