OpenAI Cloud Integration
SANDI's AI functionality is delegated to three independent microservices, each exposing a lightweight HTTP API. The search and index modules communicate with these services via configurable URLs in the properties file. Any service can be swapped between local and cloud-backed implementations with a URL change — the interface contract is identical.
| Service | Default model | Port | Endpoints | Replaces |
|---|---|---|---|---|
| emb5 — Embeddings | text-embedding-3-large | 8085 | GET/POST /embed, POST /embedding, POST /similarity | emb3, emb4, emb6 (GPU) |
| llm3 — Language model | o4-mini | 8087 | GET/POST /run | llm2 — Qwen3-4B (GPU) |
| rer2 — Reranker | gpt-4o-mini | 8088 | POST /rescore | rer1 — Qwen3-0.6B (GPU) |
The cloud services are enabled by default (emb5, llm3, and rer2 have no profile restriction). Local GPU services (emb3, emb4, emb6, llm2, rer1) are opt-in via --profile manual_start. Only one embedding service, one LLM service, and one reranker service should be active at a time, because they each bind to the same hostname alias and port.
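Under these conventions, starting a cloud-only stack versus a local GPU stack is a sketch like the following (service and profile names as listed above; the exact compose invocation for your deployment may differ):

```shell
# Default: the cloud adapters (emb5, llm3, rer2) start with the stack
docker compose up -d

# Opt in to the local GPU services instead; do not run both sets at once,
# since each pair binds the same hostname alias and port
docker compose --profile manual_start up -d
```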
| Model | Max dimensions | Quality | Notes |
|---|---|---|---|
| text-embedding-3-large | 3072 | Best | Default; supports dimension reduction via OPENAI_DIMENSIONS |
| text-embedding-3-small | 1536 | Good | Faster and cheaper; suitable for cost-sensitive workloads |
| text-embedding-ada-002 | 1536 (fixed) | Legacy | Older generation; does not support dimension reduction |
Changing OPENAI_DIMENSIONS after documents have been indexed requires a full re-index: vector similarity scores are not comparable across different models or dimension counts. The Solr collection schema must also be updated to match the new vector dimension.
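As an illustration, Solr's dense vector support declares the dimension on the field type, so the schema fragment that has to stay in sync with OPENAI_DIMENSIONS looks roughly like this (field and type names here are hypothetical; check the actual SANDI collection schema):

```xml
<!-- vectorDimension must equal OPENAI_DIMENSIONS (default 1024) -->
<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="1024" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
```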
All OpenAI settings are controlled via environment variables. In the provided
deployment, these are stored in a .env file in the deployment root
directory, which docker-compose reads automatically at startup.
Each variable has a documented default defined in docker-compose.yml;
the .env file only needs to contain overrides and the required API key.
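A minimal .env therefore contains only the key and any overrides, for example:

```shell
# .env — required key plus optional overrides; keep this file out of version control
OPENAI_API_KEY=sk-your-key-here
OPENAI_EMB_MODEL=text-embedding-3-small
OPENAI_DIMENSIONS=1024
```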
| Variable | Required | Default | Used by | Description |
|---|---|---|---|---|
| OPENAI_API_KEY | Yes | — | emb5, llm3, rer2 | Your OpenAI platform API key. Starts with sk-. Obtain from platform.openai.com → API keys. Without this key all three services will reject every request. |
| OPENAI_BASE_URL | No | https://api.openai.com/v1 | emb5, llm3, rer2 | API endpoint base URL. Override to use Azure OpenAI Service (https://&lt;resource&gt;.openai.azure.com/openai/deployments/&lt;deployment&gt;), a local OpenAI-compatible server (Ollama, vLLM, LM Studio), or any other provider that mirrors the OpenAI REST API. |
| OPENAI_EMB_MODEL | No | text-embedding-3-large | emb5 | Embedding model name passed to the /embeddings endpoint. Accepted values: text-embedding-3-large, text-embedding-3-small, text-embedding-ada-002. When using a custom base URL, this must match the model name or deployment name registered on that server. |
| OPENAI_DIMENSIONS | No | 1024 | emb5 | Output vector length. Supported range: 1 to the model maximum (3072 for text-embedding-3-large, 1536 for text-embedding-3-small). Not supported by text-embedding-ada-002; omit or ignore when using that model. Must match the vectorDimension field in the Solr collection schema. Reducing dimensions lowers storage cost and KNN search latency at the price of a small accuracy loss. |
| OPENAI_LLM_MODEL | No | o4-mini | llm3 | Chat completion model for spell checking, query expansion, and RAG answer generation. Recommended values: o4-mini (fast reasoning, default), gpt-4o (highest quality, higher cost), gpt-4o-mini (balanced cost/quality). The model must support the Chat Completions API (/v1/chat/completions). |
| OPENAI_RER_MODEL | No | gpt-4o-mini | rer2 | Chat completion model used for relevance scoring. All documents in a batch are scored in a single call using JSON output mode. A smaller, faster model (gpt-4o-mini) is sufficient because the reranker prompt is structured and deterministic. Temperature is set to 0.01 for near-deterministic output. |
| OPENAI_SYSTEM_PROMPT | No | (built-in default) | llm3 | Overrides the system prompt used by the LLM service. The default instructs the model to act as a precise search assistant and return only requested output without explanations or markdown. Useful when deploying against a fine-tuned or instruction-specific model that requires a custom system prompt. |
The .env file is read by docker-compose automatically when it is placed in the same directory as docker-compose.yml. It is not mounted into any container; values are injected as environment variables at container start time. The file is listed in .gitignore by convention; never commit a real API key to version control.
All three OpenAI adapter services are thin Flask/Gunicorn wrappers that forward requests to the OpenAI API and return results. They consume negligible CPU and RAM (under 100 MB each). Local alternatives — emb3, emb4, emb6, llm2, rer1 — each load multi-gigabyte neural network weights and require a CUDA-capable GPU with several GB of VRAM. With the OpenAI services, a complete SANDI stack runs comfortably on a laptop, a basic VPS, or any machine with internet access.
Developers can explore hybrid search, RAG pipelines, reranking, and query expansion
without GPU hardware investment. Switching models, adjusting embedding dimensions,
or comparing retrieval strategies is a configuration change followed by a container
restart. The full SANDI stack — Solr, ZooKeeper, search, index, NLP, embedding,
LLM, and reranker — starts with a single docker compose up command.
OpenAI continuously releases improved models. The embedding service can use
text-embedding-3-large with state-of-the-art retrieval quality,
the LLM service defaults to o4-mini (a reasoning model capable of
precise structured output), and the reranker uses gpt-4o-mini.
Upgrading to a newly released model requires only an environment variable change
— no model downloads, no compatibility testing, no quantization decisions.
OpenAI's infrastructure scales automatically. There is no need to manage memory pressure, batching queues, or model sharding. Batch embedding requests send multiple texts in a single API call, and throughput grows with OpenAI's rate limits rather than local hardware constraints. For small-to-medium collections this is effectively unconstrained.
No model downloads, no CUDA driver management, no quantization decisions, no out-of-memory crashes during indexing. The only prerequisite is an active OpenAI API key. Deployments start in seconds and require no ongoing model maintenance.
All three services honour an OPENAI_BASE_URL override, enabling
transparent switching to Azure OpenAI Service (which offers private endpoints and
data residency options), a self-hosted OpenAI-compatible server (Ollama, vLLM,
LM Studio), or another cloud provider. No SANDI application code changes are
required — only the base URL and model name in .env.
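For example, pointing the stack at a local Ollama server is a .env change along these lines (the model names below are illustrative; they must match models actually pulled on that server, and how the adapters treat an unused API key is an assumption):

```shell
# Redirect all three adapters to a local OpenAI-compatible server
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=unused-placeholder
OPENAI_EMB_MODEL=nomic-embed-text
OPENAI_LLM_MODEL=qwen3:4b
OPENAI_RER_MODEL=qwen3:0.6b
```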
Each embedding, LLM completion, and reranking request crosses the public internet.
Round-trip latency of 200–800 ms per call is typical. For interactive search,
users notice slower response times compared to sub-millisecond local model inference.
Bulk indexing of large document corpora is especially impacted: every text chunk
requires a network round-trip to the embedding service, and the
IndexingJobScheduler (which checks for work every second) will spend
most of its time waiting on network I/O rather than processing.
Local GPU models can embed hundreds of text chunks per second. OpenAI's API, even with batch calls, is constrained by network bandwidth, serialization overhead, and per-account rate limits (tokens per minute and requests per minute). Indexing millions of documents will take significantly longer than with a local embedding service and may require explicit rate-limit handling, retry logic with exponential backoff, and throughput monitoring.
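A minimal sketch of the kind of retry logic such a deployment needs around embedding calls, assuming a RateLimitError stand-in for an HTTP 429 response (SANDI's services may or may not implement this internally):

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for an HTTP 429 (rate limited) response from the embedding API."""


def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, jitter=0.5):
    """Retry request_fn on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the error to the indexing job
            # Delays of base_delay * 1, 2, 4, ... with random jitter so that
            # parallel indexing workers do not retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, jitter))
```

The jitter term matters under heavy indexing load: without it, all workers throttled at the same moment would retry at the same moment and be throttled again.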
All document content, query text, and search results are transmitted to OpenAI's servers to obtain embeddings and LLM responses. Every text chunk from every indexed document is sent to the embedding API during indexing; every user search query is sent to the embedding API at search time; LLM calls include the full query and retrieved document passages for RAG generation. For use cases involving confidential, proprietary, legally privileged, or personally identifiable data (GDPR, HIPAA, financial, legal), this mode may be unacceptable or require explicit data processing agreements with OpenAI.
Every embedding, completion, and reranking call is billed per token by OpenAI. Embedding costs scale with corpus size (every chunk is embedded once at index time) plus query volume (each search query is embedded at runtime). LLM and reranker costs scale with search query volume and result set size. For large-scale or high-traffic deployments, API costs can exceed the amortized hardware cost of running local models.
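A back-of-the-envelope estimate of the one-time embedding cost for a corpus can be sketched as follows; the per-million-token price is a placeholder, since actual OpenAI pricing varies by model and changes over time:

```python
def estimate_indexing_cost(num_chunks, avg_tokens_per_chunk, price_per_mtok):
    """Rough embedding cost in USD for a one-time indexing run.

    price_per_mtok is the provider's price per 1M input tokens for the
    chosen embedding model (a placeholder; check current pricing).
    """
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_mtok


# Example: 1M chunks of ~300 tokens at a hypothetical $0.13 per 1M tokens
cost = estimate_indexing_cost(1_000_000, 300, 0.13)
```

Query-time costs come on top of this, since every search query is embedded (and, with reranking and RAG enabled, also sent through the LLM services).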
The OpenAI services are non-functional without a live connection to
api.openai.com. Any network outage will cause all AI-dependent
features — embedding generation, semantic search, spell checking, query expansion,
RAG answers, and reranking — to fail. SANDI's basic BM25 text search through
Solr continues to function, but the hybrid and intelligent search features
become unavailable.
OpenAI enforces per-account rate limits on tokens per minute and requests per minute. Under heavy indexing load, requests may be throttled (HTTP 429), causing SANDI's indexing jobs to stall or fail. OpenAI API outages and scheduled maintenance windows directly affect SANDI availability when using cloud services.
Relying on OpenAI introduces dependency on a third-party provider's roadmap and
pricing. Models can be deprecated (e.g., text-embedding-ada-002 was
succeeded by the text-embedding-3 family). When the embedding model used at
index time is no longer available, the entire corpus must be re-indexed with the
replacement model, since vector similarity scores are not comparable across
different model families or dimension counts.
| Criterion | OpenAI (emb5 / llm3 / rer2) | Local GPU (emb4 / llm2 / rer1) |
|---|---|---|
| Hardware requirement | None — any CPU machine | CUDA GPU, 8+ GB VRAM per service |
| Setup time | Minutes — API key + docker compose up | Hours — model downloads, driver setup |
| Embedding latency | 200–800 ms / call (network bound) | 1–10 ms / batch (GPU bound) |
| Indexing throughput | Low to medium (rate limited) | High (GPU saturated) |
| Model quality | State-of-the-art (text-embedding-3-large) | Good (GTE-Large, competitive) |
| Model updates | Instant — change env var | Manual download and rebuild |
| Ongoing cost | Per-token API billing | Hardware + electricity (fixed) |
| Data privacy | Data sent to OpenAI servers | Fully on-premises, no external calls |
| Offline operation | Not supported | Full offline capability |
| Internet dependency | Required at all times | None |
| Rate limits | OpenAI per-account limits | Hardware capacity only |
| Best for | Dev, learning, small collections | Production, large corpora, sensitive data |
When switching between cloud and local services, update the configured service URLs in sandi-solr-search.properties and sandi-solr-index.properties. If the embedding model or dimension count changes at the same time, the Solr collection schema must be updated and all documents re-indexed. Plan migrations carefully to avoid downtime.
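The URL switch itself is a one-line change per service. A hypothetical fragment of such a properties file (the actual property keys in sandi-solr-search.properties and sandi-solr-index.properties may be named differently):

```properties
# Hypothetical keys — verify the real property names in your deployment
embedding.service.url=http://localhost:8085
llm.service.url=http://localhost:8087
reranker.service.url=http://localhost:8088
```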
The OpenAI integration adapters make SANDI immediately accessible to any
developer or team that wants to explore semantic search, RAG, and hybrid
retrieval without GPU infrastructure. A single OPENAI_API_KEY
in the .env file unlocks the full SANDI feature set — vector
embeddings, hybrid KNN + BM25 search, spell checking, query expansion, RAG
answer generation, and relevance reranking — on commodity hardware. The
trade-offs are real: every AI operation crosses the network, all indexed
content reaches OpenAI servers, and costs scale with usage. For production
systems with sensitive data, large corpora, or strict latency requirements,
the local GPU-backed services remain the appropriate choice. The two modes
are interchangeable — transitioning from cloud to on-premises is a
configuration change, not an architecture change.