Commit b4397fb

Search optimization and indexing based on datetime (#405)
**Related Issue(s):**

- #401

# Index Management System with Time-based Partitioning

## Description

This PR introduces a new index management system that enables automatic index partitioning based on dates and index size control with automatic splitting.

## How it works

### System Architecture

The system consists of several main components:

**1. Search Engine Adapters**

- `SearchEngineAdapter` - base class
- `ElasticsearchAdapter` and `OpenSearchAdapter` - implementations for specific engines

**2. Index Selection Strategies**

- `AsyncDatetimeBasedIndexSelector` / `SyncDatetimeBasedIndexSelector` - date-based index filtering
- `UnfilteredIndexSelector` - returns all indexes (fallback)
- Cache with TTL (default 1 hour) for performance

**3. Data Insertion Strategies**

- **Simple strategy**: one index per collection (behavior as before)
- **Datetime strategy**: indexes partitioned by dates with automatic partitioning

### Datetime Strategy - Operation Details

**Index Format:**

```
items_collection-name_2025-01-01-2025-03-31
```

**Item Insertion Process:**

1. System checks item date (`properties.datetime`)
2. Looks for existing index that covers this date
3. If not found - creates new index from this date
4. Checks target index size
5. If exceeds limit (`DATETIME_INDEX_MAX_SIZE_GB`) - splits index

**Early Date Handling:**

If item has date earlier than oldest index:

1. Creates new index from this earlier date
2. Updates oldest index alias to end one day before new date

**Index Splitting:**

When index exceeds size limit:

1. Updates current index alias to end on last item's date
2. Creates new index from next day
3. New items go to new index

### Cache and Performance

**IndexCacheManager:**

- Stores mapping of collection aliases to index lists
- TTL default 1 hour
- Automatic refresh on expiration
- Manual refresh after index modifications

**AsyncIndexAliasLoader / SyncIndexAliasLoader:**

- Load alias information from search engine
- Use cache manager to store results
- Async and sync versions for different usage contexts

## Configuration

**New Environment Variables:**

```bash
# Enable datetime strategy (default false)
ENABLE_DATETIME_INDEX_FILTERING=true

# Maximum index size in GB before splitting (default 25)
DATETIME_INDEX_MAX_SIZE_GB=50
```

## Usage Examples

### Scenario 1: Adding items to new collection

1. First item with date `2025-01-15` → creates index `items_collection_2025-01-15`
2. Subsequent items with similar dates → go to same index

### Scenario 2: Size limit exceeded

1. Index `items_collection_2025-01-01` reaches 25GB
2. New item with date `2025-03-15` → system splits index:
   - Old: `items_collection_2025-01-01-2025-03-15`
   - New: `items_collection_2025-03-16`

### Scenario 3: Item with early date

1. Existing index: `items_collection_2025-02-01`
2. New item with date `2024-12-15` → creates:
   - New: `items_collection_2024-12-15-2025-01-31`

## Search

System automatically filters indexes during search:

**Query with date range:**

```json
{
  "datetime": {
    "gte": "2025-02-01",
    "lte": "2025-02-28"
  }
}
```

Searches only indexes containing items from this period, instead of all collection indexes.
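
The filtering described above amounts to an interval-overlap test between the query range and the date range each index covers. A minimal sketch, assuming every index is known by a name, a start date, and an optional end date (open-ended for the active index); `IndexRange` and `select_indexes` are hypothetical names for illustration and are not taken from this PR's code:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class IndexRange:
    """Hypothetical record: an index name and the dates it covers."""

    name: str
    start: date
    end: Optional[date] = None  # None means the index is still open


def select_indexes(
    indexes: List[IndexRange], gte: Optional[date], lte: Optional[date]
) -> List[str]:
    """Return names of indexes whose date range overlaps [gte, lte]."""
    selected = []
    for idx in indexes:
        # The index must start no later than the end of the query range...
        starts_in_time = lte is None or idx.start <= lte
        # ...and must not have been closed before the query range begins.
        still_relevant = gte is None or idx.end is None or idx.end >= gte
        if starts_in_time and still_relevant:
            selected.append(idx.name)
    return selected


indexes = [
    IndexRange("items_collection_2024-12-15-2025-01-31", date(2024, 12, 15), date(2025, 1, 31)),
    IndexRange("items_collection_2025-02-01-2025-03-15", date(2025, 2, 1), date(2025, 3, 15)),
    IndexRange("items_collection_2025-03-16", date(2025, 3, 16)),
]
print(select_indexes(indexes, date(2025, 2, 1), date(2025, 2, 28)))
```

With the February query from the JSON example, only the index covering `2025-02-01` to `2025-03-15` is selected; a query with no bounds falls back to every index, which corresponds to the unfiltered selector's behavior.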

## Factories

**IndexSelectorFactory:**

- Creates appropriate selector based on configuration
- `create_async_selector()` / `create_sync_selector()`

**IndexInsertionFactory:**

- Creates insertion strategy based on configuration
- Automatically detects engine type and creates appropriate adapter

**SearchEngineAdapterFactory:**

- Detects whether you're using Elasticsearch or OpenSearch
- Creates appropriate adapter with engine-specific methods

## Backward Compatibility

- When `ENABLE_DATETIME_INDEX_FILTERING=false` → works as before
- Existing indexes remain unchanged

All operations have sync and async versions for different usage contexts in the application.

**PR Checklist:**

- [x] Code is formatted and linted (run `pre-commit run --all-files`)
- [x] Tests pass (run `make test`)
- [x] Documentation has been updated to reflect changes, if applicable
- [x] Changes are added to the changelog

---------

Co-authored-by: Grzegorz Pustulka <[email protected]>
1 parent 59d43f9 commit b4397fb

32 files changed: +2253 -134 lines changed

.github/workflows/cicd.yml

Lines changed: 4 additions & 0 deletions
@@ -28,6 +28,7 @@ jobs:
           xpack.security.enabled: false
           xpack.security.transport.ssl.enabled: false
           ES_JAVA_OPTS: -Xms512m -Xmx1g
+          action.destructive_requires_name: false
         ports:
           - 9200:9200
 
@@ -44,6 +45,7 @@ jobs:
           xpack.security.enabled: false
           xpack.security.transport.ssl.enabled: false
           ES_JAVA_OPTS: -Xms512m -Xmx1g
+          action.destructive_requires_name: false
         ports:
           - 9400:9400
 
@@ -60,6 +62,7 @@ jobs:
           plugins.security.disabled: true
           plugins.security.ssl.http.enabled: true
           OPENSEARCH_JAVA_OPTS: -Xms512m -Xmx512m
+          action.destructive_requires_name: false
         ports:
           - 9202:9202
 
@@ -120,5 +123,6 @@ jobs:
           ES_PORT: ${{ matrix.backend == 'elasticsearch7' && '9400' || matrix.backend == 'elasticsearch8' && '9200' || '9202' }}
           ES_HOST: 172.17.0.1
           ES_USE_SSL: false
+          DATABASE_REFRESH: true
           ES_VERIFY_CERTS: false
           BACKEND: ${{ matrix.backend == 'elasticsearch7' && 'elasticsearch' || matrix.backend == 'elasticsearch8' && 'elasticsearch' || 'opensearch' }}

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,32 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
88

99
## [Unreleased]
1010

11+
### Added
12+
13+
- Added comprehensive index management system with dynamic selection and insertion strategies for improved performance and scalability [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405)
14+
- Added `ENABLE_DATETIME_INDEX_FILTERING` environment variable to enable datetime-based index selection using collection IDs. When enabled, the system creates indexes with UUID-based names and manages them through time-based aliases. Default is `false`. [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405)
15+
- Added `DATETIME_INDEX_MAX_SIZE_GB` environment variable to set maximum size limit in GB for datetime-based indexes. When an index exceeds this size, a new time-partitioned index will be created. Note: add +20% to target size due to ES/OS compression. Default is `25` GB. Only applies when `ENABLE_DATETIME_INDEX_FILTERING` is enabled. [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405)
16+
- Added index operations system with unified interface for both Elasticsearch and OpenSearch [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
17+
- `IndexOperations` class with common index creation and management methods
18+
- UUID-based physical index naming: `{prefix}_{collection-id}_{uuid4}`
19+
- Alias management: main collection alias, temporal aliases, and closed index aliases
20+
- Automatic alias updates when indexes reach size limits
21+
- Added datetime-based index selection strategies with caching support [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
22+
- `DatetimeBasedIndexSelector` for temporal filtering with intelligent caching
23+
- `IndexCacheManager` with configurable TTL-based cache expiration (default 1 hour)
24+
- `IndexAliasLoader` for alias management and cache refresh
25+
- `UnfilteredIndexSelector` as fallback for returning all available indexes
26+
- Added index insertion strategies with automatic partitioning [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
27+
- Simple insertion strategy (`SimpleIndexInserter`) for traditional single-index-per-collection approach
28+
- Datetime-based insertion strategy (`DatetimeIndexInserter`) with time-based partitioning
29+
- Automatic index size monitoring and splitting when limits exceeded
30+
- Handling of chronologically early data and bulk operations
31+
- Added index management utilities [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
32+
- `IndexSizeManager` for size monitoring and overflow handling with compression awareness
33+
- `DatetimeIndexManager` for datetime-based index operations and validation
34+
- Factory patterns (`IndexInsertionFactory`, `IndexSelectorFactory`) for strategy creation based on configuration
35+
36+
1137
## [v6.1.0] - 2025-07-24
1238

1339
### Added

Makefile

Lines changed: 10 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,7 @@ run_os = docker compose \
2727
.PHONY: image-deploy-es
2828
image-deploy-es:
2929
docker build -f dockerfiles/Dockerfile.dev.es -t stac-fastapi-elasticsearch:latest .
30-
30+
3131
.PHONY: image-deploy-os
3232
image-deploy-os:
3333
docker build -f dockerfiles/Dockerfile.dev.os -t stac-fastapi-opensearch:latest .
@@ -71,14 +71,19 @@ test-opensearch:
7171
-$(run_os) /bin/bash -c 'export && ./scripts/wait-for-it-es.sh opensearch:9202 && cd stac_fastapi/tests/ && pytest'
7272
docker compose down
7373

74-
.PHONY: test
75-
test:
76-
-$(run_es) /bin/bash -c 'export && ./scripts/wait-for-it-es.sh elasticsearch:9200 && cd stac_fastapi/tests/ && pytest --cov=stac_fastapi --cov-report=term-missing'
74+
.PHONY: test-datetime-filtering-es
75+
test-datetime-filtering-es:
76+
-$(run_es) /bin/bash -c 'export ENABLE_DATETIME_INDEX_FILTERING=true && ./scripts/wait-for-it-es.sh elasticsearch:9200 && cd stac_fastapi/tests/ && pytest -s --cov=stac_fastapi --cov-report=term-missing -m datetime_filtering'
7777
docker compose down
7878

79-
-$(run_os) /bin/bash -c 'export && ./scripts/wait-for-it-es.sh opensearch:9202 && cd stac_fastapi/tests/ && pytest --cov=stac_fastapi --cov-report=term-missing'
79+
.PHONY: test-datetime-filtering-os
80+
test-datetime-filtering-os:
81+
-$(run_os) /bin/bash -c 'export ENABLE_DATETIME_INDEX_FILTERING=true && ./scripts/wait-for-it-es.sh opensearch:9202 && cd stac_fastapi/tests/ && pytest -s --cov=stac_fastapi --cov-report=term-missing -m datetime_filtering'
8082
docker compose down
8183

84+
.PHONY: test
85+
test: test-elasticsearch test-datetime-filtering-es test-opensearch test-datetime-filtering-os
86+
8287
.PHONY: run-database-es
8388
run-database-es:
8489
docker compose run --rm elasticsearch

README.md

Lines changed: 75 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -230,6 +230,81 @@ You can customize additional settings in your `.env` file:
230230
> [!NOTE]
231231
> The variables `ES_HOST`, `ES_PORT`, `ES_USE_SSL`, `ES_VERIFY_CERTS` and `ES_TIMEOUT` apply to both Elasticsearch and OpenSearch backends, so there is no need to rename the key names to `OS_` even if you're using OpenSearch.
232232
233+
# Datetime-Based Index Management
234+
235+
## Overview
236+
237+
SFEOS supports two indexing strategies for managing STAC items:
238+
239+
1. **Simple Indexing** (default) - One index per collection
240+
2. **Datetime-Based Indexing** - Time-partitioned indexes with automatic management
241+
242+
The datetime-based indexing strategy is particularly useful for large temporal datasets. When a user provides a datetime parameter in a query, the system knows exactly which index to search, providing **multiple times faster searches** and significantly **reducing database load**.
243+
244+
## When to Use
245+
246+
**Recommended for:**
247+
- Systems with large collections containing millions of items
248+
- Systems requiring high-performance temporal searching
249+
250+
**Pros:**
251+
- Multiple times faster queries with datetime filter
252+
- Reduced database load - only relevant indexes are searched
253+
254+
**Cons:**
255+
- Slightly longer item indexing time (automatic index management)
256+
- Greater management complexity
257+
258+
## Configuration
259+
260+
### Enabling Datetime-Based Indexing
261+
262+
Enable datetime-based indexing by setting the following environment variable:
263+
264+
```bash
265+
ENABLE_DATETIME_INDEX_FILTERING=true
266+
```
267+
268+
### Related Configuration Variables
269+
270+
| Variable | Description | Default | Example |
271+
|----------|-------------|---------|---------|
272+
| `ENABLE_DATETIME_INDEX_FILTERING` | Enables time-based index partitioning | `false` | `true` |
273+
| `DATETIME_INDEX_MAX_SIZE_GB` | Maximum size limit for datetime indexes (GB) - note: add +20% to target size due to ES/OS compression | `25` | `50` |
274+
| `STAC_ITEMS_INDEX_PREFIX` | Prefix for item indexes | `items_` | `stac_items_` |
275+
276+
## How Datetime-Based Indexing Works
277+
278+
### Index and Alias Naming Convention
279+
280+
The system uses a precise naming convention:
281+
282+
**Physical indexes:**
283+
```
284+
{ITEMS_INDEX_PREFIX}{collection-id}_{uuid4}
285+
```
286+
287+
**Aliases:**
288+
```
289+
{ITEMS_INDEX_PREFIX}{collection-id} # Main collection alias
290+
{ITEMS_INDEX_PREFIX}{collection-id}_{start-datetime} # Temporal alias
291+
{ITEMS_INDEX_PREFIX}{collection-id}_{start-datetime}_{end-datetime} # Closed index alias
292+
```
293+
294+
**Example:**
295+
296+
*Physical indexes:*
297+
- `items_sentinel-2-l2a_a1b2c3d4-e5f6-7890-abcd-ef1234567890`
298+
299+
*Aliases:*
300+
- `items_sentinel-2-l2a` - main collection alias
301+
- `items_sentinel-2-l2a_2024-01-01` - active alias from January 1, 2024
302+
- `items_sentinel-2-l2a_2024-01-01_2024-03-15` - closed index alias (reached size limit)
303+
304+
### Index Size Management
305+
306+
**Important - Data Compression:** Elasticsearch and OpenSearch automatically compress data. The configured `DATETIME_INDEX_MAX_SIZE_GB` limit refers to the compressed size on disk. It is recommended to add +20% to the target size to account for compression overhead and metadata.
307+
233308
## Interacting with the API
234309

235310
- **Creating a Collection**:
@@ -538,4 +613,3 @@ You can customize additional settings in your `.env` file:
538613
- Ensures fair resource allocation among all clients
539614
540615
- **Examples**: Implementation examples are available in the [examples/rate_limit](examples/rate_limit) directory.
541-
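
The alias naming convention added to the README can be parsed mechanically. A rough sketch, assuming dates appear as `YYYY-MM-DD` exactly as in the README examples (real aliases may use a different datetime format) and using a hypothetical helper `parse_alias` that is not part of this commit:

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumption: dates look like YYYY-MM-DD, as in the README examples.
_DATE = r"\d{4}-\d{2}-\d{2}"
_ALIAS_RE = re.compile(
    rf"^(?P<collection>.+?)(?:_(?P<start>{_DATE})(?:_(?P<end>{_DATE}))?)?$"
)


class ParsedAlias(NamedTuple):
    collection_id: str
    start: Optional[date]  # None for the main collection alias
    end: Optional[date]    # None while the index is still open


def parse_alias(alias: str, prefix: str = "items_") -> ParsedAlias:
    """Split an alias into collection id and optional start/end dates."""
    if not alias.startswith(prefix):
        raise ValueError(f"alias {alias!r} does not start with {prefix!r}")
    match = _ALIAS_RE.match(alias[len(prefix):])
    if match is None:
        raise ValueError(f"alias {alias!r} has no collection id")
    start = date.fromisoformat(match["start"]) if match["start"] else None
    end = date.fromisoformat(match["end"]) if match["end"] else None
    return ParsedAlias(match["collection"], start, end)


for name in (
    "items_sentinel-2-l2a",
    "items_sentinel-2-l2a_2024-01-01",
    "items_sentinel-2-l2a_2024-01-01_2024-03-15",
):
    print(parse_alias(name))
```

The sketch cannot tell a physical index name (`..._{uuid4}`) from a main alias, so it should only be fed names that are already known to be aliases.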

compose.yml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ services:
2121
- ES_USE_SSL=false
2222
- ES_VERIFY_CERTS=false
2323
- BACKEND=elasticsearch
24+
- DATABASE_REFRESH=true
2425
ports:
2526
- "8080:8080"
2627
volumes:
@@ -72,6 +73,7 @@ services:
7273
hostname: elasticsearch
7374
environment:
7475
ES_JAVA_OPTS: -Xms512m -Xmx1g
76+
action.destructive_requires_name: false
7577
volumes:
7678
- ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
7779
- ./elasticsearch/snapshots:/usr/share/elasticsearch/snapshots
@@ -86,6 +88,7 @@ services:
8688
- discovery.type=single-node
8789
- plugins.security.disabled=true
8890
- OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m
91+
- action.destructive_requires_name=false
8992
volumes:
9093
- ./opensearch/config/opensearch.yml:/usr/share/opensearch/config/opensearch.yml
9194
- ./opensearch/snapshots:/usr/share/opensearch/snapshots

stac_fastapi/core/stac_fastapi/core/core.py

Lines changed: 18 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -324,10 +324,15 @@ async def item_collection(
324324
search=search, collection_ids=[collection_id]
325325
)
326326

327-
if datetime:
328-
search = self.database.apply_datetime_filter(
329-
search=search, interval=datetime
327+
try:
328+
search, datetime_search = self.database.apply_datetime_filter(
329+
search=search, datetime=datetime
330330
)
331+
except (ValueError, TypeError) as e:
332+
# Handle invalid interval formats if return_date fails
333+
msg = f"Invalid interval format: {datetime}, error: {e}"
334+
logger.error(msg)
335+
raise HTTPException(status_code=400, detail=msg)
331336

332337
if bbox:
333338
bbox = [float(x) for x in bbox]
@@ -342,6 +347,7 @@ async def item_collection(
342347
sort=None,
343348
token=token,
344349
collection_ids=[collection_id],
350+
datetime_search=datetime_search,
345351
)
346352

347353
items = [
@@ -500,10 +506,15 @@ async def post_search(
500506
search=search, collection_ids=search_request.collections
501507
)
502508

503-
if search_request.datetime:
504-
search = self.database.apply_datetime_filter(
505-
search=search, interval=search_request.datetime
509+
try:
510+
search, datetime_search = self.database.apply_datetime_filter(
511+
search=search, datetime=search_request.datetime
506512
)
513+
except (ValueError, TypeError) as e:
514+
# Handle invalid interval formats if return_date fails
515+
msg = f"Invalid interval format: {search_request.datetime}, error: {e}"
516+
logger.error(msg)
517+
raise HTTPException(status_code=400, detail=msg)
507518

508519
if search_request.bbox:
509520
bbox = search_request.bbox
@@ -560,6 +571,7 @@ async def post_search(
560571
token=search_request.token,
561572
sort=sort,
562573
collection_ids=search_request.collections,
574+
datetime_search=datetime_search,
563575
)
564576

565577
fields = (
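
The calling pattern in this diff (a filter that returns both the updated search and the parsed datetime bounds, with parse failures turned into a client error) can be illustrated in isolation. A minimal sketch under stated assumptions: `parse_interval`, `toy_apply_datetime_filter`, `filtered_search`, and `BadRequest` are hypothetical stand-ins, and the `gte`/`lte` dictionary shape is assumed from the PR description's query example rather than taken from the real `apply_datetime_filter`:

```python
from datetime import datetime
from typing import Dict, Optional, Tuple


class BadRequest(Exception):
    """Stand-in for the HTTP 400 error the diff raises via HTTPException."""


def parse_interval(value: Optional[str]) -> Dict[str, Optional[str]]:
    """Hypothetical parser: 'start/end', 'start/..', '../end', or one instant."""
    if value is None:
        return {"gte": None, "lte": None}
    if not isinstance(value, str):
        raise TypeError(f"expected str, got {type(value).__name__}")
    parts = value.split("/")
    if len(parts) == 1:
        parts = [parts[0], parts[0]]  # a single instant bounds both ends
    if len(parts) != 2:
        raise ValueError("expected 'start/end' or a single datetime")
    bounds = []
    for part in parts:
        if part in ("", ".."):
            bounds.append(None)  # open-ended side of the interval
        else:
            # fromisoformat raises ValueError on malformed input
            parsed = datetime.fromisoformat(part.replace("Z", "+00:00"))
            bounds.append(parsed.isoformat())
    return {"gte": bounds[0], "lte": bounds[1]}


def toy_apply_datetime_filter(
    search: dict, value: Optional[str]
) -> Tuple[dict, Dict[str, Optional[str]]]:
    """Return the updated search plus the parsed bounds, as a pair."""
    bounds = parse_interval(value)
    return {**search, "datetime": bounds}, bounds


def filtered_search(value: Optional[str]) -> Tuple[dict, Dict[str, Optional[str]]]:
    """Mirror the try/except in the diff: bad input becomes a client error."""
    try:
        return toy_apply_datetime_filter({}, value)
    except (ValueError, TypeError) as e:
        raise BadRequest(f"Invalid interval format: {value}, error: {e}") from e


search, datetime_search = filtered_search("2025-02-01/2025-02-28")
print(datetime_search)
```

Returning the bounds alongside the search object lets the caller pass them on (here as `datetime_search`) so that index selection can use the same parsed range without parsing the request a second time.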

stac_fastapi/core/stac_fastapi/core/datetime_utils.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
"""Utility functions to handle datetime parsing."""
2+
23
from datetime import datetime, timezone
34

45
from stac_fastapi.types.rfc3339 import rfc3339_str_to_datetime

stac_fastapi/core/stac_fastapi/core/serializers.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
"""Serializers."""
2+
23
import abc
34
from copy import deepcopy
45
from typing import Any, List, Optional

stac_fastapi/core/stac_fastapi/core/session.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
"""database session management."""
2+
23
import logging
34

45
import attr
