Overview

SANDI Solr is a search and indexing platform built on Apache Solr. It provides separate APIs for searching documents and for indexing content, with support for multiple document formats, scheduled processing, and advanced search features including semantic search, reranking, and AI-generated summaries.

The system consists of two main components:

  • Search API - Handles search queries and returns results
  • Index API - Manages document indexing, scheduling, and administration

Search API

The Search API accepts queries as a JSON body or as URL parameters and supports filtering, grouping, faceting, sorting, highlighting, and optional reranking, RAG, and "Did You Mean" suggestions.

Base URL

http://localhost:8081/search

Endpoints

POST /search

Performs a search query using a JSON request body.

Request Body Parameters:

Parameter        Type     Required  Description
requestId        string   Yes       Unique identifier for the request
clientId         string   Yes       Client identifier for authentication
searchQuery      string   No        Main search query text
pageSize         integer  No        Number of results per page (default: 10, max: 10000)
pageNumber       integer  No        Page number to retrieve (default: 1)
filterQuery      string   No        Additional filter query
resultFields     string   No        Comma-separated list of fields to return
groupFields      string   No        Fields to group results by
facetFields      string   No        Fields to generate facets for
sortFields       string   No        Fields to sort results by
highlightFields  string   No        Fields to highlight in results
highlightTags    string   No        Custom highlight tags
precision        string   No        Search precision level: "high", "medium", "low"
group            boolean  No        Enable result grouping
facet            boolean  No        Enable facet generation
highlight        boolean  No        Enable result highlighting
exact            boolean  No        Enable exact matching
legacy           boolean  No        Use legacy search mode
synonyms         boolean  No        Enable synonym expansion
dym              boolean  No        Enable "Did You Mean" suggestions
rerank           boolean  No        Enable result reranking
rag              boolean  No        Enable RAG (Retrieval-Augmented Generation)
collapse         boolean  No        Enable result collapsing

Example Request:

{
    "requestId": "2c38e64d-19ce-4db2-acde-8df40edbf447",
    "clientId": "TREC_001",
    "pageSize": 10,
    "pageNumber": 1,
    "searchQuery": "Case for \"Samsung Galaxy\" with mirror",
    "filterQuery": "",
    "resultFields": "_id,id,title,content,score,rscore,_chunks",
    "group": false,
    "groupFields": "id",
    "facet": false,
    "facetFields": "id",
    "sortFields": "",
    "precision": "medium",
    "legacy": false,
    "rerank": true,
    "rag": true,
    "synonyms": false,
    "dym": true,
    "collapse": false,
    "exact": false
}
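The request above could be sent from Python roughly as follows. This is a minimal sketch using only the standard library: the helper names build_search_request and post_search are illustrative and not part of SANDI Solr, and the URL assumes the endpoint is served at the Base URL exactly as listed.

```python
import json
import urllib.request
import uuid

# Assumption: the POST endpoint is reachable at the Base URL as listed.
SEARCH_URL = "http://localhost:8081/search"


def build_search_request(client_id, query, **options):
    """Return a request body with a fresh requestId plus optional parameters."""
    body = {
        "requestId": str(uuid.uuid4()),  # unique per request
        "clientId": client_id,
        "searchQuery": query,
    }
    body.update(options)  # e.g. pageSize=10, rerank=True, rag=True
    return body


def post_search(body, url=SEARCH_URL, timeout=30):
    """POST the JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the body; call post_search(body) against a live server.
    body = build_search_request(
        "TREC_001", 'Case for "Samsung Galaxy" with mirror',
        pageSize=10, precision="medium", rerank=True, rag=True,
    )
    print(json.dumps(body, indent=4))
```

Letting json.dumps serialize the body takes care of escaping the quotation marks inside searchQuery.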

GET /search

Performs a search query using URL parameters. All parameters from the POST endpoint can be passed as URL parameters.

Example Request:

GET /search?requestId=abc123&clientId=TREC_001&searchQuery=Samsung%20Galaxy&pageSize=5&rerank=true&rag=true
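Query values that contain spaces or quotation marks need to be percent-encoded before they go into the URL. The sketch below shows one way to build such a URL in Python; build_search_url is a hypothetical helper, the base URL is assumed from the section above, and lowercase booleans are an assumption based on the example request.

```python
from urllib.parse import quote, urlencode

# Assumption: the GET endpoint is reachable at the Base URL as listed.
SEARCH_URL = "http://localhost:8081/search"


def build_search_url(params, base_url=SEARCH_URL):
    """Return a GET URL with every parameter value percent-encoded."""
    # Python renders booleans as "True"/"False"; the example uses lowercase.
    encoded = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return base_url + "?" + urlencode(encoded, quote_via=quote)


url = build_search_url({
    "requestId": "abc123",
    "clientId": "TREC_001",
    "searchQuery": "Samsung Galaxy",
    "pageSize": 5,
    "rerank": True,
    "rag": True,
})
print(url)
```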

Response Format

Successful Response:

{
    "requestId": "2c38e64d-19ce-4db2-acde-8df40edbf447",
    "status": "SUCCESS",
    "message": null,
    "dymQuery": "case for samsung galaxy with mirror",
    "ragAnswer": "Based on the search results, here are some cases for Samsung Galaxy phones with mirror features...",
    "foundResults": 1247,
    "start": 0,
    "took": 156,
    "results": [
        {
            "_id": "doc123",
            "id": "product_456",
            "title": "Samsung Galaxy S24 Mirror Case",
            "content": "Premium mirror case for Samsung Galaxy...",
            "score": 0.95,
            "rscore": 0.87,
            "_chunks": ["chunk1", "chunk2"]
        }
    ]
}

Error Response:

{
    "requestId": "abc123",
    "status": "ERROR",
    "message": "Client not found"
}
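Because both response shapes carry a status field, a client can branch on it before touching the results. The following is a minimal sketch of that check; SearchError and parse_search_response are illustrative names, not part of the API.

```python
import json


class SearchError(Exception):
    """Raised when the response does not report status "SUCCESS"."""


def parse_search_response(raw):
    """Decode a response body and return it, or raise SearchError."""
    payload = json.loads(raw)
    if payload.get("status") != "SUCCESS":
        raise SearchError(payload.get("message") or "unknown error")
    return payload


# A shortened version of the successful response shown above.
ok = parse_search_response(
    '{"requestId": "abc123", "status": "SUCCESS", "message": null,'
    ' "foundResults": 1, "results": [{"id": "product_456", "score": 0.95}]}'
)
for hit in ok["results"]:
    print(hit["id"], hit["score"])
```

Passing the error body shown above to parse_search_response raises SearchError with the message "Client not found".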

Search Features

Precision Levels

  • high: Most accurate results, slower performance
  • medium: Balanced accuracy and performance (default)
  • low: Fast results, lower accuracy

Advanced Features

  • Reranking: Improves result relevance using ML models
  • RAG: Generates answers based on search results
  • DYM: Provides query suggestions for typos/misspellings
  • Semantic Search: Uses embeddings for contextual matching

Index API

The Index API manages document ingestion, processing, and scheduling of indexing jobs.

Base URL

http://localhost:8082

Indexing Interface

POST /index

Indexes documents directly via the API.

Request Body:

{
    "requestId": "req123",
    "clientId": "CLIENT_001",
    "data": [
        {
            "id": "doc1",
            "title": "Document Title",
            "content": "Document content...",
            "metadata": {
                "category": "news",
                "date": "2025-01-20"
            }
        }
    ]
}
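The envelope above (requestId, clientId, data) can be assembled programmatically. A minimal sketch, assuming a hypothetical helper named build_index_request; the document fields are copied from the example and are not a required schema as far as this section states.

```python
import json
import uuid


def build_index_request(client_id, documents):
    """Wrap documents in the requestId / clientId / data envelope."""
    return {
        "requestId": str(uuid.uuid4()),  # unique per request
        "clientId": client_id,
        "data": list(documents),
    }


body = build_index_request("CLIENT_001", [
    {
        "id": "doc1",
        "title": "Document Title",
        "content": "Document content...",
        "metadata": {"category": "news", "date": "2025-01-20"},
    },
])
print(json.dumps(body, indent=4))
```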

POST /index/json

Indexes JSON documents with a flexible schema.

Request Body:

{
    "requestId": "req124",
    "clientId": "CLIENT_001",
    "data": [
        {
            "title": "Product Review",
            "description": "Excellent product...",
            "rating": 5,
            "tags": ["electronics", "mobile"]
        }
    ]
}

Scheduler Interface

The scheduler manages automated indexing jobs with support for various document sources and formats.

Job Types

Job Type  Description                                Source              File Extensions
JSON      Single JSON document per file              File system or URL  .json
JSONL     JSON Lines format (one JSON per line)      File system or URL  .jsonl
TXT       Plain text documents                       File system or URL  .txt
TXTL      Text Lines format (one document per line)  File system or URL  .txtl
EXCEL     Excel spreadsheet documents                File system or URL  .xlsx, .xls
SITE      Website crawling                           URL                 Various web formats
SITEMAP   XML sitemap processing                     File system or URL  .xml
JSONMAP   JSON-based URL mapping                     File system or URL  .json

POST /schedule/index

Schedules an indexing job.

Request Body:

{
    "requestId": "schedule123",
    "clientId": "CLIENT_001",
    "jobType": "JSONL",
    "directory": "/sandi/documents/data/",
    "fileExtensions": ".jsonl,.json",
    "forceReindexing": true,
    "scheduledTime": "2025-01-21T10:00:00",
    "cron": "0 0 2 * * ?",
    "jobId": "daily-import-001"
}

Parameters:

  • jobType: One of the supported job types (see table above)
  • directory: Source directory or URL
  • fileExtensions: Comma-separated file extensions to process
  • forceReindexing: Whether to reindex existing documents
  • scheduledTime: When to start the job (ISO 8601 format, for example 2025-01-21T10:00:00)
  • cron: Optional cron expression for recurring jobs
  • jobId: Optional custom job identifier
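The parameters above can be checked on the client before the request is sent. This sketch builds a schedule body and rejects job types that are not in the Job Types table; build_schedule_request is a hypothetical helper, and whether the server accepts a body without cron or jobId is assumed from their description as optional.

```python
from datetime import datetime

# Job types taken from the Job Types table.
JOB_TYPES = {"JSON", "JSONL", "TXT", "TXTL", "EXCEL", "SITE", "SITEMAP", "JSONMAP"}


def build_schedule_request(request_id, client_id, job_type, directory,
                           file_extensions, scheduled_time,
                           force_reindexing=False, cron=None, job_id=None):
    """Return a body for POST /schedule/index, validating jobType first."""
    if job_type not in JOB_TYPES:
        raise ValueError(f"Invalid job type: {job_type}")
    body = {
        "requestId": request_id,
        "clientId": client_id,
        "jobType": job_type,
        "directory": directory,
        "fileExtensions": ",".join(file_extensions),
        "forceReindexing": force_reindexing,
        "scheduledTime": scheduled_time.isoformat(timespec="seconds"),
    }
    # Optional fields are only included when a value is supplied.
    if cron is not None:
        body["cron"] = cron
    if job_id is not None:
        body["jobId"] = job_id
    return body


schedule = build_schedule_request(
    "schedule123", "CLIENT_001", "JSONL", "/sandi/documents/data/",
    [".jsonl", ".json"], datetime(2025, 1, 21, 10, 0, 0),
    force_reindexing=True, cron="0 0 2 * * ?", job_id="daily-import-001",
)
print(schedule["scheduledTime"])
```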

Job Management

Job Status Types

  • SCHEDULED: Waiting to run
  • RUNNING: Currently executing
  • COMPLETED: Successfully finished
  • FAILED: Encountered errors
  • CANCELLED: Manually cancelled

Cron Expression Examples

  • "0 0 2 * * ?" - Daily at 2:00 AM
  • "0 30 1 ? * MON" - Every Monday at 1:30 AM
  • "0 0 */6 * * ?" - Every 6 hours
  • "0 15 10 * * ?" - Daily at 10:15 AM

Available Endpoints:

  • GET /schedule/jobs - Returns all indexing jobs
  • GET /schedule/jobs/{jobId} - Returns specific job details
  • GET /schedule/jobs/status/{status} - Returns jobs by status
  • POST /schedule/jobs/{jobId}/cancel - Cancels a job
  • GET /schedule/stats - Returns job statistics
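A client that schedules a job will often poll GET /schedule/jobs/{jobId} until the job reaches a final state. The sketch below shows that loop with the HTTP call injected as a function, so it runs without a server. The response shape (a JSON object with a "status" key) is an assumption, and wait_for_job is a hypothetical helper.

```python
import time

# Statuses after which a job no longer changes, per the Job Status Types list.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(fetch_job, job_id, poll_seconds=5.0, max_polls=120,
                 sleep=time.sleep):
    """Poll until the job reaches a terminal status and return that status.

    fetch_job(job_id) must return a dict with a "status" key (assumed shape).
    """
    for _ in range(max_polls):
        status = fetch_job(job_id)["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")


# Simulated status sequence standing in for real HTTP responses.
fake_statuses = iter(["SCHEDULED", "RUNNING", "COMPLETED"])
final = wait_for_job(
    lambda job_id: {"status": next(fake_statuses)},
    "daily-import-001",
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(final)  # prints: COMPLETED
```

Bounding the number of polls keeps a stuck job from blocking the caller indefinitely.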

Admin Interface

The admin interface provides management capabilities for clients, collections, and configurations.

Base URL

http://localhost:8082/admin

Client Management

  • GET /admin/clients - Lists all clients
  • GET /admin/clients/{clientId} - Gets specific client details
  • POST /admin/clients - Creates or updates a client
  • DELETE /admin/clients/{clientId} - Deletes a client

Create Client Request:

{
    "clientId": "NEW_CLIENT",
    "name": "Client Name",
    "collection": "client_collection",
    "active": true,
    "createCollection": true,
    "configuration": "default_config"
}

Collection Management

  • GET /admin/collections - Lists all Solr collections
  • GET /admin/collections/{name} - Gets collection details
  • POST /admin/collections - Creates a new collection
  • DELETE /admin/collections/{name} - Deletes a collection

Configuration Management

  • DELETE /admin/configuration/{name} - Deletes a configuration

Error Handling

Common HTTP Status Codes

  • 200 OK: Request successful
  • 400 Bad Request: Invalid request parameters
  • 404 Not Found: Resource not found
  • 500 Internal Server Error: Server error

Error Response Format

{
    "requestId": "req123",
    "status": "ERROR",
    "message": "Detailed error description"
}

Common Errors

  • Client not found: Invalid clientId
  • Client is inactive: Client exists but is disabled
  • Invalid page size: pageSize exceeds maximum (10000)
  • RequestId is required: Missing requestId parameter
  • Invalid job type: Unsupported jobType in scheduler
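Several of these errors can be caught on the client before a request is sent. A minimal sketch, assuming a hypothetical validate_search_request helper; the lower bound of 1 for pageSize and the "clientId is required" and "Invalid precision" messages are illustrative and not taken from the server.

```python
MAX_PAGE_SIZE = 10000  # documented maximum for pageSize
PRECISION_LEVELS = {"high", "medium", "low"}


def validate_search_request(body):
    """Return a list of problems; an empty list means the body looks valid."""
    problems = []
    if not body.get("requestId"):
        problems.append("RequestId is required")
    if not body.get("clientId"):
        problems.append("clientId is required")  # hypothetical message
    page_size = body.get("pageSize", 10)
    if not isinstance(page_size, int) or not 1 <= page_size <= MAX_PAGE_SIZE:
        problems.append("Invalid page size")
    if body.get("precision", "medium") not in PRECISION_LEVELS:
        problems.append("Invalid precision")  # hypothetical message
    return problems


print(validate_search_request({"clientId": "TREC_001", "pageSize": 20000}))
# prints: ['RequestId is required', 'Invalid page size']
```

Server-side checks such as "Client not found" or "Client is inactive" still have to be handled from the response.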

Best Practices

Search API

  • Always include a unique requestId for tracking
  • Use appropriate pageSize values (10-100 for UI, larger for batch)
  • Enable rerank for better relevance on important queries
  • Use precision: "medium" for balanced performance
  • Implement proper error handling for client validation

Index API

  • Batch documents when possible (up to 1000 per request)
  • Use appropriate job types for your data format
  • Schedule heavy indexing during off-peak hours
  • Monitor job status and handle failures gracefully
  • Use forceReindexing: false for incremental updates
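The batching advice above can be sketched as a small generator that splits a document list into chunks of at most 1000, each of which would become the data array of one indexing request. The name batched is illustrative.

```python
def batched(documents, batch_size=1000):
    """Yield consecutive slices of documents with at most batch_size items."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


docs = [{"id": f"doc{i}"} for i in range(2500)]
sizes = [len(batch) for batch in batched(docs)]
print(sizes)  # prints: [1000, 1000, 500]
```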

Performance

  • Cache frequently accessed search results
  • Use filters for better query performance
  • Limit result fields to only what's needed
  • Consider pagination for large result sets
  • Monitor job queue length and processing times
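One detail matters when caching search results: requestId is unique per request, so it has to be left out of the cache key or no two requests will ever match. The sketch below is a minimal in-memory cache built on that idea. SearchCache and cache_key are hypothetical names, and there is no expiry or size limit, which a real deployment would likely need.

```python
import json


def cache_key(body):
    """Build a stable key from a request body, ignoring requestId."""
    relevant = {k: v for k, v in body.items() if k != "requestId"}
    return json.dumps(relevant, sort_keys=True)


class SearchCache:
    """Memoize responses of search_fn by request content."""

    def __init__(self, search_fn):
        self._search_fn = search_fn
        self._store = {}

    def search(self, body):
        key = cache_key(body)
        if key not in self._store:
            self._store[key] = self._search_fn(body)
        return self._store[key]


calls = []


def fake_search(body):
    # Stand-in for a real call to the Search API.
    calls.append(body["requestId"])
    return {"status": "SUCCESS", "requestId": body["requestId"]}


cache = SearchCache(fake_search)
cache.search({"requestId": "a", "clientId": "TREC_001", "searchQuery": "galaxy"})
cache.search({"requestId": "b", "clientId": "TREC_001", "searchQuery": "galaxy"})
print(len(calls))  # prints: 1
```

A cached response carries the requestId of the request that first populated it, so callers that rely on requestId for tracking should account for that.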