SANDI + OpenAI Integration

Abstract

SANDI supports pluggable AI back-ends through a service adapter pattern. In our case the emb5, llm3, and rer2 adapter services replace locally-hosted GPU models with OpenAI API calls, enabling a fully functional SANDI stack — including vector embeddings, hybrid search, spell checking, query expansion, RAG answer generation, and result reranking — on standard hardware without any GPU. This document describes the integration architecture, configuration, and the trade-offs between OpenAI cloud services and local self-hosted models.

1Integration Architecture

SANDI's AI functionality is delegated to three independent microservices, each exposing a lightweight HTTP API. The search and index modules communicate with these services via configurable URLs in the properties file. Any service can be swapped between local and cloud-backed implementations with a URL change — the interface contract is identical.

Search and indexing pipeline with OpenAI services

Document text

→

emb5: text-embedding-3-large

→

Dense vector stored in Solr

User query

→

emb5: query vector

→

KNN + BM25 hybrid search

→

rer2: gpt-4o-mini rescoring

→

Final results

User query

→

llm3: o4-mini

→

Spell check / Query expansion / RAG answer

OpenAI adapter services

Service	Default model	Port	Endpoints	Replaces
emb5 — Embeddings	`text-embedding-3-large`	8085	`GET/POST /embed` `POST /embedding` `POST /similarity`	emb3, emb4, emb6 (GPU)
llm3 — Language model	`o4-mini`	8087	`GET/POST /run`	llm2 — Qwen3-4B (GPU)
rer2 — Reranker	`gpt-4o-mini`	8088	`POST /rescore`	rer1 — Qwen3-0.6B (GPU)

The OpenAI services start by default in docker-compose (emb5, llm3, rer2 have no profile restriction). Local GPU services (emb3, emb4, emb6, llm2, rer1) are opt-in via --profile manual_start. Only one embedding service, one LLM service, and one reranker service should be active at a time — they each bind to the same hostname alias and port.

Embedding model options for emb5

Model	Max dimensions	Quality	Notes
`text-embedding-3-large`	3072	Best	Default; supports dimension reduction via `OPENAI_DIMENSIONS`
`text-embedding-3-small`	1536	Good	Faster and cheaper; suitable for cost-sensitive workloads
`text-embedding-ada-002`	1536 (fixed)	Legacy	Older generation; does not support dimension reduction

Changing the embedding model or OPENAI_DIMENSIONS after documents have been indexed requires a full re-index. Vector similarity scores are not comparable across different models or dimension counts. The Solr collection schema must also be updated to match the new vector dimension.

2Configuration Reference

All OpenAI settings are controlled via environment variables. In the provided deployment, these are stored in a .env file in the deployment root directory, which docker-compose reads automatically at startup. Each variable has a documented default defined in docker-compose.yml; the .env file only needs to contain overrides and the required API key.

Environment variables

Variable	Required	Default	Used by	Description
`OPENAI_API_KEY`	Yes	—	emb5, llm3, rer2	Your OpenAI platform API key. Starts with `sk-`. Obtain from platform.openai.com → API keys. Without this key all three services will reject every request.
`OPENAI_BASE_URL`	No	`https://api.openai.com/v1`	emb5, llm3, rer2	API endpoint base URL. Override to use Azure OpenAI Service (`https://<resource>.openai.azure.com/openai/deployments/<deployment>`), a local OpenAI-compatible server (Ollama, vLLM, LM Studio), or any other provider that mirrors the OpenAI REST API.
`OPENAI_EMB_MODEL`	No	`text-embedding-3-large`	emb5	Embedding model name passed to the `/embeddings` endpoint. Accepted values: `text-embedding-3-large`, `text-embedding-3-small`, `text-embedding-ada-002`. When using a custom base URL, this must match the model name or deployment name registered on that server.
`OPENAI_DIMENSIONS`	No	`1024`	emb5	Output vector length. Supported range: 1 to the model maximum (3072 for `text-embedding-3-large`, 1536 for `text-embedding-3-small`). Not supported by `text-embedding-ada-002` — omit or ignore when using that model. Must match the `vectorDimension` field in the Solr collection schema. Reducing dimensions lowers storage cost and KNN search latency at the price of a small accuracy loss.
`OPENAI_LLM_MODEL`	No	`o4-mini`	llm3	Chat completion model for spell checking, query expansion, and RAG answer generation. Recommended values: `o4-mini` (fast reasoning, default), `gpt-4o` (highest quality, higher cost), `gpt-4o-mini` (balanced cost/quality). The model must support the Chat Completions API (`/v1/chat/completions`).
`OPENAI_RER_MODEL`	No	`gpt-4o-mini`	rer2	Chat completion model used for relevance scoring. All documents in a batch are scored in a single call using JSON output mode. A smaller, faster model (`gpt-4o-mini`) is sufficient because the reranker prompt is structured and deterministic. Temperature is set to 0.01 for near-deterministic output.
`OPENAI_SYSTEM_PROMPT`	No	(built-in default)	llm3	Overrides the system prompt used by the LLM service. The default instructs the model to act as a precise search assistant and return only requested output without explanations or markdown. Useful when deploying against a fine-tuned or instruction-specific model that requires a custom system prompt.

Minimal .env file

# Minimal configuration — only the API key is required
OPENAI_API_KEY=sk-your-api-key-here

Full .env file with all options

# OpenAI API configuration for SANDI services

# Required: your OpenAI platform API key
OPENAI_API_KEY=sk-your-api-key-here

# Optional: API base URL — default is the public OpenAI endpoint
# Override for Azure OpenAI or any OpenAI-compatible server
OPENAI_BASE_URL=https://api.openai.com/v1

# Optional: embedding model (emb5)
# text-embedding-3-large  — best quality, max 3072 dims (default)
# text-embedding-3-small  — faster and cheaper, max 1536 dims
# text-embedding-ada-002  — legacy, fixed 1536 dims, no dimension reduction
OPENAI_EMB_MODEL=text-embedding-3-large

# Optional: output vector dimensions (emb5)
# Must match the vectorDimension in the Solr collection schema
# Supported by text-embedding-3-large (1–3072) and text-embedding-3-small (1–1536)
OPENAI_DIMENSIONS=1024

# Optional: LLM model for spell check, query expansion, RAG answers (llm3)
# o4-mini     — fast reasoning model (default)
# gpt-4o      — highest quality, higher cost
# gpt-4o-mini — balanced cost and quality
OPENAI_LLM_MODEL=o4-mini

# Optional: reranker model (rer2)
# gpt-4o-mini is sufficient — reranking uses structured prompts at near-zero temperature
OPENAI_RER_MODEL=gpt-4o-mini

# Optional: override the LLM system prompt (llm3)
# Leave empty to use the built-in default
# OPENAI_SYSTEM_PROMPT=

The .env file is read by docker-compose automatically when it is placed in the same directory as docker-compose.yml. It is not mounted into any container — values are injected as environment variables at container start time. The file is listed in .gitignore by convention; never commit a real API key to version control.

3Advantages

No GPU required — minimal hardware

All three OpenAI adapter services are thin Flask/Gunicorn wrappers that forward requests to the OpenAI API and return results. They consume negligible CPU and RAM (under 100 MB each). Local alternatives — emb3, emb4, emb6, llm2, rer1 — each load multi-gigabyte neural network weights and require a CUDA-capable GPU with several GB of VRAM. With the OpenAI services, a complete SANDI stack runs comfortably on a laptop, a basic VPS, or any machine with internet access.

Ideal for learning, development, and testing

Developers can explore hybrid search, RAG pipelines, reranking, and query expansion without GPU hardware investment. Switching models, adjusting embedding dimensions, or comparing retrieval strategies is a configuration change followed by a container restart. The full SANDI stack — Solr, ZooKeeper, search, index, NLP, embedding, LLM, and reranker — starts with a single docker compose up command.

Access to the newest and most capable models

OpenAI continuously releases improved models. The embedding service can use text-embedding-3-large with state-of-the-art retrieval quality, the LLM service defaults to o4-mini (a reasoning model capable of precise structured output), and the reranker uses gpt-4o-mini. Upgrading to a newly released model requires only an environment variable change — no model downloads, no compatibility testing, no quantization decisions.

Unlimited and elastic capacity

OpenAI's infrastructure scales automatically. There is no need to manage memory pressure, batching queues, or model sharding. Batch embedding requests send multiple texts in a single API call, and throughput grows with OpenAI's rate limits rather than local hardware constraints. For small-to-medium collections this is effectively unconstrained.

Zero-maintenance deployment

No model downloads, no CUDA driver management, no quantization decisions, no out-of-memory crashes during indexing. The only prerequisite is an active OpenAI API key. Deployments start in seconds and require no ongoing model maintenance.

Azure OpenAI and compatible API support

All three services honour an OPENAI_BASE_URL override, enabling transparent switching to Azure OpenAI Service (which offers private endpoints and data residency options), a self-hosted OpenAI-compatible server (Ollama, vLLM, LM Studio), or another cloud provider. No SANDI application code changes are required — only the base URL and model name in .env.

4Disadvantages and Limitations

Network latency on every AI call

Each embedding, LLM completion, and reranking request crosses the public internet. Round-trip latency of 200–800 ms per call is typical. For interactive search, users notice slower response times compared to sub-millisecond local model inference. Bulk indexing of large document corpora is especially impacted: every text chunk requires a network round-trip to the embedding service, and the IndexingJobScheduler (which checks for work every second) will spend most of its time waiting on network I/O rather than processing.

Reduced throughput for batch indexing

Local GPU models can embed hundreds of text chunks per second. OpenAI's API, even with batch calls, is constrained by network bandwidth, serialization overhead, and per-account rate limits (tokens per minute and requests per minute). Indexing millions of documents will take significantly longer than with a local embedding service and may require explicit rate-limit handling, retry logic with exponential backoff, and throughput monitoring.

Data privacy and security risks

All document content, query text, and search results are transmitted to OpenAI's servers to obtain embeddings and LLM responses. Every text chunk from every indexed document is sent to the embedding API during indexing; every user search query is sent to the embedding API at search time; LLM calls include the full query and retrieved document passages for RAG generation. For use cases involving confidential, proprietary, legally privileged, or personally identifiable data (GDPR, HIPAA, financial, legal), this mode may be unacceptable or require explicit data processing agreements with OpenAI.

Air-gapped, offline, and strict data-residency environments cannot use the public OpenAI endpoint. Azure OpenAI Service with private endpoints and a Virtual Network may satisfy data-residency requirements in some jurisdictions, but transmitting data outside the local network is still required.

Ongoing API cost

Every embedding, completion, and reranking call is billed per token by OpenAI. Embedding costs scale with corpus size (every chunk is embedded once at index time) plus query volume (each search query is embedded at runtime). LLM and reranker costs scale with search query volume and result set size. For large-scale or high-traffic deployments, API costs can exceed the amortized hardware cost of running local models.

Requires continuous internet connectivity

The OpenAI services are non-functional without a live connection to api.openai.com. Any network outage will cause all AI-dependent features — embedding generation, semantic search, spell checking, query expansion, RAG answers, and reranking — to fail. SANDI's basic BM25 text search through Solr continues to function, but the hybrid and intelligent search features become unavailable.

Rate limits and service interruptions

OpenAI enforces per-account rate limits on tokens per minute and requests per minute. Under heavy indexing load, requests may be throttled (HTTP 429), causing SANDI's indexing jobs to stall or fail. OpenAI API outages and scheduled maintenance windows directly affect SANDI availability when using cloud services.

Vendor dependency and model deprecation

Relying on OpenAI introduces dependency on a third-party provider's roadmap and pricing. Models can be deprecated (e.g., text-embedding-ada-002 was succeeded by the text-embedding-3 family). When the embedding model used at index time is no longer available, the entire corpus must be re-indexed with the replacement model, since vector similarity scores are not comparable across different model families or dimension counts.

5OpenAI vs. Local Models — Comparison

Criterion	OpenAI (emb5 / llm3 / rer2)	Local GPU (emb4 / llm2 / rer1)
Hardware requirement	None — any CPU machine	CUDA GPU, 8+ GB VRAM per service
Setup time	Minutes — API key + docker compose up	Hours — model downloads, driver setup
Embedding latency	200–800 ms / call (network bound)	1–10 ms / batch (GPU bound)
Indexing throughput	Low to medium (rate limited)	High (GPU saturated)
Model quality	State-of-the-art (text-embedding-3-large)	Good (GTE-Large, competitive)
Model updates	Instant — change env var	Manual download and rebuild
Ongoing cost	Per-token API billing	Hardware + electricity (fixed)
Data privacy	Data sent to OpenAI servers	Fully on-premises, no external calls
Offline operation	Not supported	Full offline capability
Internet dependency	Required at all times	None
Rate limits	OpenAI per-account limits	Hardware capacity only
Best for	Dev, learning, small collections	Production, large corpora, sensitive data

6Recommendations by Use Case

Recommended: Use OpenAI services

Development and prototyping
Learning hybrid search and RAG concepts
Demo and evaluation environments
Small document collections (under ~100k docs)
Infrequent indexing with low query volume
Multi-tenant SaaS where GPU provisioning is impractical
Rapid model comparison and experimentation

Not recommended: Use local GPU services

Confidential, personal, or regulated data (GDPR, HIPAA)
Air-gapped or offline deployments
Large-scale corpora requiring fast indexing
Real-time search with sub-100 ms latency requirements
High query volume where per-token costs accumulate
Environments with strict data residency requirements
Production systems requiring predictable latency SLAs

The switch between OpenAI and local services requires only changing the service URLs in sandi-solr-search.properties and sandi-solr-index.properties. If the embedding model or dimension count changes at the same time, the Solr collection schema must be updated and all documents re-indexed. Plan migrations carefully to avoid downtime.

7Conclusion

The OpenAI integration adapters make SANDI immediately accessible to any developer or team that wants to explore semantic search, RAG, and hybrid retrieval without GPU infrastructure. A single OPENAI_API_KEY in the .env file unlocks the full SANDI feature set — vector embeddings, hybrid KNN + BM25 search, spell checking, query expansion, RAG answer generation, and relevance reranking — on commodity hardware. The trade-offs are real: every AI operation crosses the network, all indexed content reaches OpenAI servers, and costs scale with usage. For production systems with sensitive data, large corpora, or strict latency requirements, the local GPU-backed services remain the appropriate choice. The two modes are interchangeable — transitioning from cloud to on-premises is a configuration change, not an architecture change.

Apache Solr 9.8.1 OpenAI API text-embedding-3-large o4-mini gpt-4o-mini Hybrid Search RAG No GPU required Azure OpenAI compatible External data transmission

SANDI Solr