Commit 711a305

dataprep: support air gapped env for redis/qdrant/milvus DB (#1492)

- Enhance dataprep to be able to run in an air gapped environment.
- Add documentation on how to run dataprep in an air gapped environment.

Related to bug #1488.

Signed-off-by: Lianhao Lu <[email protected]>
1 parent 2eaa3a5 commit 711a305

10 files changed: 189 additions, 15 deletions

comps/dataprep/README.md

Lines changed: 17 additions & 0 deletions
@@ -64,3 +64,20 @@ For details, please refer to this [readme](src/README_finance.md)
 ## Dataprep Microservice with MariaDB Vector

 For details, please refer to this [readme](src/README_mariadb.md)
+
+## Running in the air gapped environment
+
+The following steps are common for running the dataprep microservice in an air gapped environment (a.k.a. environment with no internet access), for all DB backends.
+
+1. Download the following models, e.g. `huggingface-cli download --cache-dir <model data directory> <model>`
+
+   - microsoft/table-transformer-structure-recognition
+   - timm/resnet18.a1_in1k
+   - unstructuredio/yolo_x_layout
+
+2. launch the `dataprep` microservice with the following settings:
+
+   - mount the `model data directory` as the `/data` directory within the `dataprep` container
+   - set environment variable `HF_HUB_OFFLINE` to 1 when launching the `dataprep` microservice
+
+   e.g. `docker run -d -v <model data directory>:/data -e HF_HUB_OFFLINE=1 ... ...`
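The two README steps above can be sketched as a dry run that only prints the commands, so it can be reviewed on a connected machine before anything is downloaded. The function name `print_airgap_commands` is hypothetical, and the image name is an assumption based on the compose default `opea/dataprep:latest`; the remaining `docker run` options that the README elides with `... ...` are left out here as well.

```bash
# Models listed in the README section above.
DATAPREP_MODELS="microsoft/table-transformer-structure-recognition timm/resnet18.a1_in1k unstructuredio/yolo_x_layout"

# Print (do not run) the commands for both README steps.
#   $1: model data directory on the host (chosen by the user)
#   $2: dataprep image name (assumption; use the image you actually built)
print_airgap_commands() {
    local model_dir="$1"
    local image="$2"
    local m
    # Step 1: one download command per model, all into the same cache dir.
    for m in $DATAPREP_MODELS; do
        echo "huggingface-cli download --cache-dir $model_dir $m"
    done
    # Step 2: mount the cache dir at /data and force offline mode.
    echo "docker run -d -v $model_dir:/data -e HF_HUB_OFFLINE=1 $image"
}
```

Calling `print_airgap_commands /tmp/models opea/dataprep:latest` prints three download commands followed by one `docker run` command.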

comps/dataprep/deployment/docker_compose/compose.yaml

Lines changed: 52 additions & 0 deletions
@@ -75,6 +75,26 @@ services:
       minio:
         condition: service_healthy

+  dataprep-milvus-offline:
+    extends: dataprep-milvus
+    depends_on:
+      tei-embedding-serving:
+        condition: service_healthy
+      standalone:
+        condition: service_healthy
+      etcd:
+        condition: service_healthy
+      minio:
+        condition: service_healthy
+    environment:
+      HF_HUB_OFFLINE: 1
+      # Use non-existing proxy to mimic air gapped environment
+      no_proxy: localhost,127.0.0.1,${offline_no_proxy}
+      http_proxy: http://localhost:7777
+      https_proxy: http://localhost:7777
+    volumes:
+      - "${DATA_PATH:-./data}:/data"
+
   dataprep-multimodal-milvus:
     image: ${REGISTRY:-opea}/dataprep:${TAG:-latest}
     container_name: dataprep-multimodal-milvus-server
@@ -242,6 +262,22 @@ services:
       retries: 10
     restart: unless-stopped

+  dataprep-qdrant-offline:
+    extends: dataprep-qdrant
+    depends_on:
+      qdrant-vector-db:
+        condition: service_healthy
+      tei-embedding-serving:
+        condition: service_healthy
+    environment:
+      HF_HUB_OFFLINE: 1
+      # Use non-existing proxy to mimic air gapped environment
+      no_proxy: localhost,127.0.0.1,${offline_no_proxy}
+      http_proxy: http://localhost:7777
+      https_proxy: http://localhost:7777
+    volumes:
+      - "${DATA_PATH:-./data}:/data"
+
   dataprep-redis:
     image: ${REGISTRY:-opea}/dataprep:${TAG:-latest}
     container_name: dataprep-redis-server
@@ -271,6 +307,22 @@ services:
       retries: 10
     restart: unless-stopped

+  dataprep-redis-offline:
+    extends: dataprep-redis
+    depends_on:
+      redis-vector-db:
+        condition: service_healthy
+      tei-embedding-serving:
+        condition: service_healthy
+    environment:
+      HF_HUB_OFFLINE: 1
+      # Use non-existing proxy to mimic air gapped environment
+      no_proxy: localhost,127.0.0.1,${offline_no_proxy}
+      http_proxy: http://localhost:7777
+      https_proxy: http://localhost:7777
+    volumes:
+      - "${DATA_PATH:-./data}:/data"
+
   dataprep-multimodal-redis:
     image: ${REGISTRY:-opea}/dataprep:${TAG:-latest}
     container_name: dataprep-multimodal-redis-server
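Because the offline services point `http_proxy` and `https_proxy` at a proxy that does not exist, every host the container must still reach (the vector DB, the TEI endpoint) has to appear in `no_proxy`, which the compose file builds as `localhost,127.0.0.1,${offline_no_proxy}`. The helper below is a minimal sketch of one way to compute that extra part; the name `offline_no_proxy_value` is hypothetical and not part of the commit. It drops empty entries, duplicates, and the two hosts the compose file already prepends.

```bash
# Join the given hosts into a comma-separated list suitable for the
# offline_no_proxy variable, skipping empty, duplicate, and already-covered
# entries.
offline_no_proxy_value() {
    local seen=",localhost,127.0.0.1,"   # entries compose.yaml already prepends
    local result=""
    local host
    for host in "$@"; do
        [ -n "$host" ] || continue
        case "$seen" in
            *",$host,"*) continue ;;      # duplicate or already covered
        esac
        seen="$seen$host,"
        result="${result:+$result,}$host"
    done
    echo "$result"
}

# Possible usage before `docker compose up dataprep-redis-offline ...`:
#   export offline_no_proxy="$(offline_no_proxy_value "$ip_address" "$host_ip")"
```

In the milvus test, `ip_address` and `host_ip` hold the same value, so a helper like this would collapse them to a single entry; the duplicate is harmless either way.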

comps/dataprep/src/Dockerfile

Lines changed: 5 additions & 0 deletions
@@ -46,9 +46,14 @@ RUN pip install --no-cache-dir --upgrade pip setuptools && \
 ENV PYTHONPATH=$PYTHONPATH:/home/user

 RUN mkdir -p /home/user/comps/dataprep/src/uploaded_files && chown -R user /home/user/comps/dataprep/src/uploaded_files
+RUN mkdir -p /data && chown -R user /data

 USER user
 ENV NLTK_DATA=/home/user/nltk_data
+# air gapped support: predownload all needed nltk data
+RUN mkdir -p /home/user/nltk_data && python -m nltk.downloader -d /home/user/nltk_data punkt_tab averaged_perceptron_tagger_eng stopwords
+# air gapped support: set model cache dir
+ENV HF_HUB_CACHE=/data

 WORKDIR /home/user/comps/dataprep/src
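Since the image sets `HF_HUB_CACHE=/data`, the directory mounted at `/data` is expected to follow the Hugging Face hub cache layout, in which each model lives in a folder named `models--<org>--<name>`. The sketch below checks a host directory for the expected folders before it is mounted. Both function names are hypothetical, and the check only tests that a folder exists, not that the snapshot inside it is complete.

```bash
# Map a model id ("org/name") to its folder name inside the hub cache.
hf_cache_dirname() {
    echo "models--$(echo "$1" | sed 's|/|--|g')"
}

# Print every model id that has no matching folder in the cache directory.
#   $1: cache directory on the host; remaining arguments: model ids
missing_models() {
    local cache_dir="$1"
    shift
    local model
    for model in "$@"; do
        [ -d "$cache_dir/$(hf_cache_dirname "$model")" ] || echo "$model"
    done
}
```

An empty result from `missing_models <model data directory> <model>...` suggests the directory is ready to mount; any printed id still needs to be downloaded.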

comps/dataprep/src/README_milvus.md

Lines changed: 4 additions & 0 deletions
@@ -207,3 +207,7 @@ curl -X POST \
     -F "chunk_size=500" \
     http://localhost:6010/v1/dataprep/ingest
 ```
+
+## Running in the air gapped environment
+
+Please follow the [common guide](../README.md#running-in-the-air-gapped-environment) to run dataprep microservice in the air gapped environment.

comps/dataprep/src/README_qdrant.md

Lines changed: 4 additions & 0 deletions
@@ -72,3 +72,7 @@ curl -X POST \
     -F "table_strategy=hq" \
     http://localhost:6007/v1/dataprep/ingest
 ```
+
+## Running in the air gapped environment
+
+Please follow the [common guide](../README.md#running-in-the-air-gapped-environment) to run dataprep microservice in the air gapped environment.

comps/dataprep/src/README_redis.md

Lines changed: 4 additions & 0 deletions
@@ -261,3 +261,7 @@ curl -X POST \
     -d '{"file_path": "all", "index_name": "test_redis_1"}' \
     http://localhost:6007/v1/dataprep/delete
 ```
+
+## Running in the air gapped environment
+
+Please follow the [common guide](../README.md#running-in-the-air-gapped-environment) to run dataprep microservice in the air gapped environment.

tests/dataprep/dataprep_utils.sh

Lines changed: 18 additions & 0 deletions
@@ -224,3 +224,21 @@ function check_healthy() {
     echo "$container_name did not become healthy in time."
     return 1
 }
+
+DATAPREP_MODELS=(microsoft/table-transformer-structure-recognition timm/resnet18.a1_in1k unstructuredio/yolo_x_layout)
+
+function prepare_dataprep_models() {
+    local model_path=$1
+    mkdir -p ${model_path}
+    python3 -m pip install huggingface_hub[cli] --user
+    # Workaround for huggingface-cli reporting error when set --cache-dir to same as default
+    local extra_args=""
+    local default_model_dir=$(readlink -m ~/.cache/huggingface/hub)
+    local real_model_dir=$(echo ${model_path/#\~/$HOME} | xargs readlink -m )
+    if [[ "${default_model_dir}" != "${real_model_dir}" ]]; then
+        extra_args="--cache-dir ${model_path}"
+    fi
+    for m in ${DATAPREP_MODELS[@]}; do
+        PATH=~/.local/bin:$PATH huggingface-cli download ${extra_args} $m
+    done
+}
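The workaround in `prepare_dataprep_models` hinges on comparing two canonical paths: a `~` held in a variable is not expanded by the shell, so it is replaced with `$HOME` by hand, and `readlink -m` (GNU coreutils) canonicalizes paths even when they do not exist yet. Below is a minimal standalone sketch of that comparison. It uses a `case` statement instead of the bash-only `${var/#\~/$HOME}` substitution, handles only `~` and `~/...` (not `~user`), and the function names are hypothetical.

```bash
# Expand a leading "~" by hand and canonicalize the result.
resolve_dir() {
    local path="$1"
    case "$path" in
        "~")   path="$HOME" ;;
        "~/"*) path="$HOME/${path#??}" ;;   # drop the leading "~/" (two chars)
    esac
    readlink -m -- "$path"
}

# Print "--cache-dir <dir>" unless <dir> resolves to the default hub cache,
# mirroring the check in prepare_dataprep_models.
cache_dir_args() {
    local model_path="$1"
    local default_dir
    default_dir=$(resolve_dir "~/.cache/huggingface/hub")
    if [ "$(resolve_dir "$model_path")" != "$default_dir" ]; then
        echo "--cache-dir $model_path"
    fi
}
```

With this sketch, `cache_dir_args /tmp/models` prints `--cache-dir /tmp/models`, while a path that resolves to `~/.cache/huggingface/hub` prints nothing.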

tests/dataprep/test_dataprep_milvus.sh

Lines changed: 24 additions & 4 deletions
@@ -28,20 +28,28 @@ function build_docker_images() {
 }

 function start_service() {
+    local offline=${1:-false}
     export host_ip=${ip_address}
     export TEI_EMBEDDER_PORT=12005
     export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
     export MILVUS_HOST=${ip_address}
     export TEI_EMBEDDING_ENDPOINT="http://${host_ip}:${TEI_EMBEDDER_PORT}"
     export LOGFLAG=true

+    if [[ "$offline" == "true" ]]; then
+        service_name="dataprep-milvus-offline tei-embedding-serving etcd minio standalone"
+        export offline_no_proxy="${ip_address},${host_ip}"
+    else
+        service_name="dataprep-milvus tei-embedding-serving etcd minio standalone"
+    fi
     cd $WORKPATH/comps/dataprep/deployment/docker_compose/
     docker compose up ${service_name} -d > ${LOG_PATH}/start_services_with_compose.log

     check_healthy "dataprep-milvus-server" || exit 1
 }

 function validate_microservice() {
+    local offline=${1:-false}
     # test /v1/dataprep/delete
     delete_all ${ip_address} ${DATAPREP_PORT}
     check_result "dataprep - del" '{"status":true}' dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log
@@ -69,8 +77,10 @@ function validate_microservice() {
     check_result "dataprep - upload - xlsx" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log

     # test /v1/dataprep/ingest upload link
-    ingest_external_link ${ip_address} ${DATAPREP_PORT}
-    check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log
+    if [[ "$offline" != "true" ]]; then
+        ingest_external_link ${ip_address} ${DATAPREP_PORT}
+        check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log
+    fi

     # test /v1/dataprep/get
     get_all ${ip_address} ${DATAPREP_PORT}
@@ -95,11 +105,21 @@ function main() {
     stop_docker

     build_docker_images
-    start_service
+    trap stop_docker EXIT

+    echo "Test normal env ..."
+    start_service
     validate_microservice
-
     stop_docker
+
+    if [[ -n "${DATA_PATH}" ]]; then
+        echo "Test air gapped env ..."
+        prepare_dataprep_models ${DATA_PATH}
+        start_service true
+        validate_microservice true
+        stop_docker
+    fi
+
     echo y | docker system prune

 }
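The `trap stop_docker EXIT` line added to `main` matters because `start_service` can leave through `exit 1` when the health check fails; the trap ensures the containers are still cleaned up on that path. The sketch below isolates that pattern in a subshell, with an `echo` standing in for the real cleanup. `run_with_cleanup` and `failing_step` are hypothetical names used only for illustration.

```bash
# Run a command in a subshell with an EXIT trap, so cleanup happens whether
# the command succeeds, fails, or calls `exit` partway through.
run_with_cleanup() {
    (
        # Placeholder for the real cleanup (stop_docker / docker compose down).
        trap 'echo "cleanup: stopping containers"' EXIT
        "$@"
    )
}

# Stand-in for a start_service whose health check fails.
failing_step() {
    echo "step: service did not become healthy"
    exit 1      # mirrors `check_healthy ... || exit 1`
}
```

Running `run_with_cleanup failing_step` prints the step message and then the cleanup message, and the subshell still exits with status 1.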

tests/dataprep/test_dataprep_qdrant.sh

Lines changed: 29 additions & 4 deletions
@@ -29,6 +29,7 @@ function build_docker_images() {
 }

 function start_service() {
+    local offline=${1:-false}
     export host_ip=${ip_address}
     export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
     export EMBED_MODEL=${EMBEDDING_MODEL_ID}
@@ -37,14 +38,20 @@ function start_service() {
     export COLLECTION_NAME="rag-qdrant"
     export QDRANT_HOST=$ip_address
     export QDRANT_PORT=6360
-    service_name="qdrant-vector-db tei-embedding-serving dataprep-qdrant"
+    if [[ "$offline" == "true" ]]; then
+        service_name="qdrant-vector-db tei-embedding-serving dataprep-qdrant-offline"
+        export offline_no_proxy="${ip_address}"
+    else
+        service_name="qdrant-vector-db tei-embedding-serving dataprep-qdrant"
+    fi
     cd $WORKPATH/comps/dataprep/deployment/docker_compose/
     docker compose up ${service_name} -d

     check_healthy "dataprep-qdrant-server" || exit 1
 }

 function validate_microservice() {
+    local offline=${1:-false}
     # test /v1/dataprep/ingest upload file
     ingest_doc ${ip_address} ${DATAPREP_PORT}
     check_result "dataprep - upload - doc" "Data preparation succeeded" dataprep-qdrant-server ${LOG_PATH}/dataprep-qdrant.log
@@ -68,8 +75,10 @@ function validate_microservice() {
     check_result "dataprep - upload - xlsx" "Data preparation succeeded" dataprep-qdrant-server ${LOG_PATH}/dataprep-qdrant.log

     # test /v1/dataprep/ingest upload link
-    ingest_external_link ${ip_address} ${DATAPREP_PORT}
-    check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-qdrant-server ${LOG_PATH}/dataprep-qdrant.log
+    if [[ "$offline" != "true" ]]; then
+        ingest_external_link ${ip_address} ${DATAPREP_PORT}
+        check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-qdrant-server ${LOG_PATH}/dataprep-qdrant.log
+    fi

 }

@@ -78,14 +87,30 @@ function stop_docker() {
     if [[ ! -z "$cid" ]]; then docker stop $cid && docker rm $cid && sleep 1s; fi
 }

+function stop_service() {
+    cd $WORKPATH/comps/dataprep/deployment/docker_compose/
+    docker compose down || true
+}
+
 function main() {

     stop_docker

     build_docker_images
-    start_service
+    trap stop_service EXIT

+    echo "Test normal env ..."
+    start_service
     validate_microservice
+    stop_service
+
+    if [[ -n "${DATA_PATH}" ]]; then
+        echo "Test air gapped env ..."
+        prepare_dataprep_models ${DATA_PATH}
+        start_service true
+        validate_microservice true
+        stop_service
+    fi

     stop_docker
     echo y | docker system prune

tests/dataprep/test_dataprep_redis.sh

Lines changed: 32 additions & 7 deletions
@@ -28,6 +28,7 @@ function build_docker_images() {
 }

 function start_service() {
+    local offline=${1:-false}

     export host_ip=${ip_address}
     export REDIS_HOST=$ip_address
@@ -38,14 +39,20 @@ function start_service() {
     export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
     export TEI_EMBEDDING_ENDPOINT="http://${ip_address}:${TEI_EMBEDDER_PORT}"
     export INDEX_NAME="rag_redis"
-    service_name="redis-vector-db tei-embedding-serving dataprep-redis"
+    if [[ "$offline" == "true" ]]; then
+        service_name="redis-vector-db tei-embedding-serving dataprep-redis-offline"
+        export offline_no_proxy="${ip_address}"
+    else
+        service_name="redis-vector-db tei-embedding-serving dataprep-redis"
+    fi
     cd $WORKPATH/comps/dataprep/deployment/docker_compose/
     docker compose up ${service_name} -d

     check_healthy "dataprep-redis-server" || exit 1
 }

 function validate_microservice() {
+    local offline=${1:-false}

     # test /v1/dataprep/delete
     delete_all ${ip_address} ${DATAPREP_PORT}
@@ -73,12 +80,14 @@ function validate_microservice() {
     ingest_xlsx ${ip_address} ${DATAPREP_PORT} "redis"
     check_result "dataprep - upload - xlsx" "Data preparation succeeded" dataprep-redis-server ${LOG_PATH}/dataprep_upload_file.log

-    # test /v1/dataprep/ingest upload link
-    ingest_external_link ${ip_address} ${DATAPREP_PORT}
-    check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-redis-server ${LOG_PATH}/dataprep_upload_file.log
+    # test /v1/dataprep/ingest upload link
+    if [[ "$offline" != "true" ]]; then
+        ingest_external_link ${ip_address} ${DATAPREP_PORT}
+        check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-redis-server ${LOG_PATH}/dataprep_upload_file.log

-    ingest_external_link_with_chunk_parameters ${ip_address} ${DATAPREP_PORT} "rag_redis_test_link_params"
-    check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-redis-server ${LOG_PATH}/dataprep_upload_file.log
+        ingest_external_link_with_chunk_parameters ${ip_address} ${DATAPREP_PORT} "rag_redis_test_link_params"
+        check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-redis-server ${LOG_PATH}/dataprep_upload_file.log
+    fi

     ingest_txt_with_index_name ${ip_address} ${DATAPREP_PORT} rag_redis_test
     check_result "dataprep - upload with index - txt" "Data preparation succeeded" dataprep-redis-server ${LOG_PATH}/dataprep_upload_file.log
@@ -114,14 +123,30 @@ function stop_docker() {
     if [[ ! -z "$cid" ]]; then docker stop $cid && docker rm $cid && sleep 1s; fi
 }

+function stop_service() {
+    cd $WORKPATH/comps/dataprep/deployment/docker_compose/
+    docker compose down || true
+}
+
 function main() {

     stop_docker

     build_docker_images
-    start_service
+    trap stop_service EXIT

+    echo "Test normal env ..."
+    start_service
     validate_microservice
+    stop_service
+
+    if [[ -n "${DATA_PATH}" ]]; then
+        echo "Test air gapped env ..."
+        prepare_dataprep_models ${DATA_PATH}
+        start_service true
+        validate_microservice true
+        stop_service
+    fi

     stop_docker
     echo y | docker system prune
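All three test scripts follow the same pattern: an optional `offline` argument, defaulting to `false`, switches the dataprep service to its `-offline` compose variant while the supporting services stay the same. As a rough illustration, the selection could be factored into one helper like the one below. `select_services` is a hypothetical function that does not exist in the commit; the service lists are copied from the three `start_service` functions.

```bash
# Print the compose services to start for a backend, using the "-offline"
# dataprep variant when the second argument is "true".
select_services() {
    local backend="$1"
    local offline="${2:-false}"
    local suffix=""
    if [ "$offline" = "true" ]; then
        suffix="-offline"
    fi
    case "$backend" in
        redis)  echo "redis-vector-db tei-embedding-serving dataprep-redis$suffix" ;;
        qdrant) echo "qdrant-vector-db tei-embedding-serving dataprep-qdrant$suffix" ;;
        milvus) echo "dataprep-milvus$suffix tei-embedding-serving etcd minio standalone" ;;
        *)      echo "unknown backend: $backend" >&2; return 1 ;;
    esac
}
```

A test could then run `docker compose up $(select_services redis true) -d`. Keeping the selection in one place would also make it harder for the three scripts to drift apart.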
