#Vespa

Vespa sample applications - document processing

Data written to Vespa passes through document processing, of which indexing is one example. Applications can add custom processing, normally run before indexing, by adding a Document Processor. Such processing is synchronous, which is problematic when it depends on other resources with high latency: blocking calls can saturate the thread pool.

This application demonstrates how to use Progress.LATER and the asynchronous Document API. Summary:

  • Document Processors: modify / enrich data in the feed pipeline
  • Multiple Schemas: store different kinds of data, like different database tables
  • Enrich data from multiple sources: here, look up data in one schema and add to another
  • Document API: write asynchronous code to fetch data

Flow:

  1. Feed album document with the music schema
  2. Look up in the lyrics schema if album with given ID has lyrics stored
  3. Store album with lyrics in the music schema
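The flow above can be sketched in plain Java. This is only an illustration of the enrichment step, not the sample application's code: the two maps stand in for the lyrics and music schemas, and the class name, the `feedAlbum` method and the `album` field are hypothetical names chosen for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

class EnrichmentFlowSketch {
    // Stand-in for the lyrics schema: album ID -> lyrics text (assumption, not Vespa storage).
    static final Map<String, String> lyricsStore = new HashMap<>();
    // Stand-in for the music schema: album ID -> document fields (assumption, not Vespa storage).
    static final Map<String, Map<String, String>> musicStore = new HashMap<>();

    // Steps 1-3 of the flow: receive the album, look up lyrics by ID, store the result.
    static Map<String, String> feedAlbum(String albumId, Map<String, String> fields) {
        Map<String, String> enriched = new HashMap<>(fields);
        String lyrics = lyricsStore.get(albumId);
        if (lyrics != null) {
            // Lyrics exist for this album ID, so add them to the music document.
            enriched.put("lyrics", lyrics);
        }
        musicStore.put(albumId, enriched);
        return enriched;
    }

    public static void main(String[] args) {
        lyricsStore.put("a-head-full-of-dreams", "placeholder lyrics text");
        Map<String, String> stored =
                feedAlbum("a-head-full-of-dreams", Map.of("album", "A Head Full of Dreams"));
        System.out.println("lyrics added: " + stored.containsKey("lyrics"));
    }
}
```

In the sample application the lookup is an asynchronous Document API request rather than a map read, which is why the Document Processor needs Progress.LATER.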


Validate the environment:

Make sure at least 4 GB of memory is reported. Refer to Docker memory for details and troubleshooting:

$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"

Install the Vespa CLI:

Using Homebrew:

$ brew install vespa-cli

You can also download Vespa CLI for Windows, Linux and macOS.

Set local target:

$ vespa config set target local

Start a Vespa Docker container:

$ docker run --detach --name vespa --hostname vespa-container \
  --publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19071:19071 \
  vespaengine/vespa

Verify it is ready to use:

$ vespa status deploy --wait 300

Initialize myapp/ to a copy of a sample application package:

$ vespa clone examples/document-processing myapp && cd myapp

Build it:

$ mvn -U clean package

Deploy it:

$ vespa deploy --wait 300

Feed a lyrics document, then get it to verify the feed:

$ vespa document src/test/resources/A-Head-Full-of-Dreams-lyrics.json
$ vespa document get id:mynamespace:lyrics::a-head-full-of-dreams

Feed a music document:

$ vespa document src/test/resources/A-Head-Full-of-Dreams.json

Validate that the Document Processor works

Get the music document and check that it now contains the lyrics:

$ vespa document get id:mynamespace:music::a-head-full-of-dreams

Compare with the original document, which has no lyrics. They were added by the LyricsDocumentProcessor:

$ cat src/test/resources/A-Head-Full-of-Dreams.json

Review logs:

Inspect what happened:

$ docker exec vespa sh -c '/opt/vespa/bin/vespa-logfmt | grep LyricsDocumentProcessor'
...LyricsDocumentProcessor	info	In process
...LyricsDocumentProcessor	info	  Added to requests pending: 1
...LyricsDocumentProcessor	info	  Request pending ID: 1, Progress.LATER
...LyricsDocumentProcessor	info	In process
...LyricsDocumentProcessor	info	  Request pending ID: 1, Progress.LATER
...LyricsDocumentProcessor	info	In handleResponse
...LyricsDocumentProcessor	info	  Async response to put or get, requestID: 1
...LyricsDocumentProcessor	info	  Found lyrics for : document 'id:mynamespace:lyrics::1' of type 'lyrics'
...LyricsDocumentProcessor	info	In process
...LyricsDocumentProcessor	info	  Set lyrics, Progress.DONE

  1. In the first invocation of process, an async request is made - set Progress.LATER
  2. In the second invocation of process, the async request has not yet completed (there can be many such invocations) - set Progress.LATER
  3. Then, the handler for the async operation is invoked as the call has completed
  4. In the subsequent process invocation, we see that the async operation has completed - set Progress.DONE
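The sequence described above can be simulated with the Java standard library alone. This is a hedged sketch, not the LyricsDocumentProcessor source: the `Progress` enum here is a local stand-in for Vespa's class of the same name, a `CompletableFuture` plays the role of the asynchronous Document API request and its response handler, and the class and parameter names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class ProgressLaterSketch {
    // Local stand-in for Vespa's Progress values (assumption, not the real class).
    enum Progress { LATER, DONE }

    // Requests pending, keyed by document ID.
    private final Map<String, CompletableFuture<String>> pending = new HashMap<>();
    // Hypothetical async lookup: document ID -> future lyrics text.
    private final Function<String, CompletableFuture<String>> asyncLookup;

    ProgressLaterSketch(Function<String, CompletableFuture<String>> asyncLookup) {
        this.asyncLookup = asyncLookup;
    }

    // Called repeatedly for the same document until it returns DONE.
    Progress process(String docId, Map<String, String> fields) {
        CompletableFuture<String> request = pending.get(docId);
        if (request == null) {
            // First invocation: start the async request and yield the thread.
            pending.put(docId, asyncLookup.apply(docId));
            return Progress.LATER;
        }
        if (!request.isDone()) {
            // Response not yet received: yield again instead of blocking.
            return Progress.LATER;
        }
        // Response received: enrich the document and finish.
        pending.remove(docId);
        String lyrics = request.join();
        if (lyrics != null) {
            fields.put("lyrics", lyrics);
        }
        return Progress.DONE;
    }

    public static void main(String[] args) {
        CompletableFuture<String> response = new CompletableFuture<>();
        ProgressLaterSketch processor = new ProgressLaterSketch(id -> response);
        Map<String, String> fields = new HashMap<>();
        System.out.println(processor.process("doc-1", fields)); // prints LATER
        System.out.println(processor.process("doc-1", fields)); // prints LATER
        response.complete("placeholder lyrics");                // the async call completes
        System.out.println(processor.process("doc-1", fields)); // prints DONE
    }
}
```

The point of the pattern is that `process` never waits on the lookup: each call either starts the request, reports that it is still pending, or consumes the finished result, so a slow backend does not tie up processing threads.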

Shutdown and remove the container:

$ docker rm -f vespa

Further reading: