For extracting data and metadata from binary documents.
The Data Extraction Service was built with Elastic connectors in mind. While other clients can use it as well, its chief goal is to enable extracting text data from large binary documents "on the edge" and to provide a simple, stateless, load-balanceable interface.
If you need to extract text from office documents larger than 100 MB (see http.max_content_length) before ingesting them into Elasticsearch, the Data Extraction Service is for you.
For product documentation and version compatibility, see: Connectors -> Content Extraction.
The artifact produced by this repo is the data-extraction-service Docker image.
See https://www.docker.elastic.co/r/integrations/data-extraction-service for the full list of artifacts/versions.
This docker image runs openresty and tika-server as background services, which are managed by openrc.
First, pull your image of choice. You can find the latest version by looking at https://www.docker.elastic.co/r/integrations/data-extraction-service.
# replace "<version>" with your selected version
$ docker pull docker.elastic.co/integrations/data-extraction-service:<version>
Then, start a container from the image with:
$ docker run \
  -p 8090:8090 \
  -it \
  --name extraction-service \
  docker.elastic.co/integrations/data-extraction-service:<version>
You can validate that the service is running with:
$ curl -X GET http://localhost:8090/ping/ # this should output "Running!"
To send a file to be extracted:
$ curl -X PUT http://localhost:8090/extract_text/ \
  -T /path/to/file.name
This will return a response like the following:
{
  "extracted_text": "Hello world!",
  "_meta": {
    "X-ELASTIC:service": "tika",
    "X-ELASTIC:TIKA:parsed_by": ["parser1", "parser2"]
  }
}
To extract a file locally, it must first be added to the docker container.
You can manually do this using docker cp or you can mount a volume to share files with a different system.
You must specify the full filepath in the local_file_path argument.
Note: avoid using only /app as your chosen filedrop path. If a config file is overwritten in this directory, data-extraction-service may break. If you intend to use /app, be sure to append a further directory, e.g. /app/files.
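The two path rules above (a full filepath, and nothing placed directly in /app) can be checked before a request is sent. The following is a minimal sketch, not part of the service: is_safe_filedrop_path is a hypothetical helper name, and the rule it encodes is only the guidance given in this README.

```shell
# Hypothetical helper: succeeds only for absolute paths that do not sit
# directly inside /app.
is_safe_filedrop_path() {
  case "$1" in
    /app/*/*) return 0 ;;     # inside a subdirectory, e.g. /app/files/file.name
    /app|/app/*) return 1 ;;  # directly in /app: risks overwriting config files
    /*) return 0 ;;           # any other absolute path
    *) return 1 ;;            # relative path: the full filepath is required
  esac
}
```

For example, `is_safe_filedrop_path /app/files/file.name` succeeds, while `/app/file.name` and `files/file.name` are rejected.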
With docker cp:
$ docker cp /path/to/file.name extraction-service:/app/files/file.name
$ curl -X PUT "http://localhost:8090/extract_text/?local_file_path=/app/files/file.name" | jq
With volume sharing:
$ docker run \
  -p 8090:8090 \
  -it \
  --name extraction-service \
  -v /local/file/location:/app/files \
  docker.elastic.co/integrations/data-extraction-service:<version>
For volume sharing, /local/file/location:/app/files can also be replaced with docker-volume-name:/app/files if you intend to share files between two docker containers. Check the docker volume docs for more details.
Doing this will also require a shared network.
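A sketch of that setup, assuming a named volume and a user-defined network, might look like the following. The names extraction-files and extraction-net and the function start_shared_extraction_service are placeholders chosen for illustration; the second container would need matching -v and --network options.

```shell
# Hypothetical names: "extraction-files" (volume), "extraction-net" (network).
# Note: `docker network create` fails if the network already exists, so this
# sketch is meant for a first-time setup.
start_shared_extraction_service() {
  local version="$1"   # the image version you selected
  docker volume create extraction-files > /dev/null || return 1
  docker network create extraction-net > /dev/null || return 1
  docker run -d \
    -p 8090:8090 \
    --name extraction-service \
    --network extraction-net \
    -v extraction-files:/app/files \
    "docker.elastic.co/integrations/data-extraction-service:${version}"
}
```

Call it as `start_shared_extraction_service <version>`, replacing `<version>` with your selected version.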
You can read more about local file pointers in the product documentation.
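For scripted use, the steps described so far (wait for /ping/, copy a file in with docker cp, then request extraction by local_file_path) can be combined. This is a rough sketch under assumptions: the function names and the EXTRACTION_URL, EXTRACTION_CONTAINER, and FILEDROP_DIR variables are hypothetical, the defaults mirror the examples in this README, and the once-per-second retry is an arbitrary choice. File names containing spaces or other URL-special characters are not handled.

```shell
# Hypothetical settings; defaults match the examples in this README.
EXTRACTION_URL="${EXTRACTION_URL:-http://localhost:8090}"
EXTRACTION_CONTAINER="${EXTRACTION_CONTAINER:-extraction-service}"
FILEDROP_DIR="${FILEDROP_DIR:-/app/files}"

# Poll /ping/ once per second until it answers or the attempts run out.
wait_for_extraction_service() {
  local attempts="${1:-30}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl --silent --fail "${EXTRACTION_URL}/ping/" > /dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Copy a host file into the container, then extract it by local_file_path.
# Prints the JSON response on stdout.
extract_local_file() {
  local host_path="$1"
  local container_path="${FILEDROP_DIR}/$(basename "$host_path")"
  docker cp "$host_path" "${EXTRACTION_CONTAINER}:${container_path}" > /dev/null || return 1
  curl --silent --fail -X PUT \
    "${EXTRACTION_URL}/extract_text/?local_file_path=${container_path}"
}

# Usage:
#   wait_for_extraction_service 30 && extract_local_file /path/to/file.name | jq
```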
The running docker image produces log files for each of its underlying components. You can find them at:
- OpenResty logs: /var/log/openresty.log
- Tika server Java logs: /var/log/tika.log
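One way to read these files without opening a shell in the container is docker exec. The helper below is a sketch only: show_extraction_log is a hypothetical name, and it assumes the container is named extraction-service (as in the examples above) and that tail is available inside the image.

```shell
# Hypothetical helper: print the last lines of one component's log file.
show_extraction_log() {
  local component="$1"     # "openresty" or "tika"
  local lines="${2:-50}"   # number of trailing lines to show
  docker exec extraction-service tail -n "$lines" "/var/log/${component}.log"
}
```

For example, `show_extraction_log tika 100` would print the last 100 lines of /var/log/tika.log.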
To build the docker image locally, run:
$ docker build --platform=linux/arm64 -t extraction-service .
Run your new image:
$ docker run -p 8090:8090 -it --name extraction-service extraction-service
(Add -d to run detached, or --rm if you want the docker container to be deleted when you exit the window.)
To remove the detached container:
$ docker stop extraction-service
$ docker rm extraction-service