Metadata parsing and routing for the Prior Art Archive
The file parser is deployed as an Elastic Beanstalk application on AWS (called FileParser). This platform was chosen because it lets us deploy with Docker images and auto-scale behind a load balancer.
This means the code in this repository gets built as a docker image and pushed to Docker Hub at https://hub.docker.com/r/priorartarchive/file-parser. Deploying to AWS is just uploading a Dockerrun.aws.json configuration file (documented here) that tells Elastic Beanstalk to pull priorartarchive/file-parser (along with a sibling container from logicalspark/docker-tikaserver).
To get a Dockerrun.aws.json to upload to Elastic Beanstalk, copy & modify the Dockerrun.aws.sample.json to fill out the environment variables:
- HOSTNAME is either priorartarchive.org or dev.priorartarchive.org.
- IPFS_HOST is the DNS address of an https IPFS API route (e.g. if you can curl https://your.host/api/v0/id, then IPFS_HOST=your.host). For now, we use api.underlay.store for both dev and prod.
- DATABASE_URL is the fully-qualified postgres URI (including the username:password@ at the beginning).
- AWS_REGION is us-east-1.
- AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY need to have AmazonS3FullAccess, AWSLambdaExecute, and AWSLambdaRole permission policies.
- CONFIGURATION_ID is the name of the S3 notification handler that is generating the events. The name of the handlers on both the assets.priorartarchive.org and assets.dev.priorartarchive.org buckets is NewFile.
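The copy-and-modify step can be sketched in Python as below. This is only an illustration: it assumes the multi-container Dockerrun.aws.json layout (version 2), where each entry of containerDefinitions carries an environment list of name/value pairs. The helper name fill_environment and the container name passed to it are hypothetical and not taken from the sample file.

```python
import copy

# The variables described above; every one of them must be given a value.
REQUIRED = [
    "HOSTNAME",
    "IPFS_HOST",
    "DATABASE_URL",
    "AWS_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "CONFIGURATION_ID",
]


def fill_environment(sample, container_name, values):
    """Return a copy of the parsed sample with the environment filled in.

    Assumes the version 2 (multi-container) layout; container_name is
    whatever name the sample gives the file-parser container.
    """
    missing = [key for key in REQUIRED if not values.get(key)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))

    config = copy.deepcopy(sample)  # leave the parsed sample untouched
    for container in config.get("containerDefinitions", []):
        if container.get("name") == container_name:
            container["environment"] = [
                {"name": key, "value": values[key]} for key in REQUIRED
            ]
            return config
    raise KeyError(f"no container named {container_name!r}")
```

Load the sample with json.load, pass the result through fill_environment, and write it back out with json.dump as Dockerrun.aws.json. Keep the filled-in file out of version control, since it contains credentials.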
In addition, edit the "image": "priorartarchive/priorart-file-parser" line to include the tag of the docker image that you want to use: for now there's only a dev tag but there will be a prod tag once v2 goes live.
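Setting the tag can be scripted in the same spirit. The sketch below again assumes the version 2 layout with a containerDefinitions list; set_image_tag is a hypothetical helper, and the repository argument should match whatever the image line in your copy of the sample names.

```python
import copy


def set_image_tag(config, repository, tag):
    """Return a copy of config whose matching image line ends in :tag."""
    updated = copy.deepcopy(config)
    matched = False
    for container in updated.get("containerDefinitions", []):
        image = container.get("image", "")
        # Compare only the repository part, ignoring any existing ":tag".
        # (Good enough for Docker Hub names, which contain no port number.)
        if image.split(":", 1)[0] == repository:
            container["image"] = f"{repository}:{tag}"
            matched = True
    if not matched:
        raise KeyError(f"no container uses image {repository!r}")
    return updated
```

Containers that use a different image, such as the sibling Tika server, are left unchanged.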
The file parser requires access to a remote IPFS node via HTTP API. This is dev-api.underlay.store and api.underlay.store for the dev and prod deployments, respectively.
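Before deploying, it can help to confirm that the chosen host actually answers on the IPFS HTTP API. The following is a minimal sketch using only the Python standard library; the function names are hypothetical. It sends a POST because current IPFS daemons reject GET on /api/v0 routes, whereas older daemons also accepted the plain GET that curl issues by default.

```python
import json
import urllib.request


def ipfs_id_url(host):
    """Build the URL of the id route for a given IPFS_HOST value."""
    return f"https://{host}/api/v0/id"


def check_ipfs_host(host, timeout=10):
    """Return the node's peer ID, or raise if the API route is unreachable."""
    request = urllib.request.Request(ipfs_id_url(host), data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The id route answers with a JSON object whose "ID" field is the peer ID.
        return json.load(response)["ID"]
```

For example, check_ipfs_host("your.host") raises urllib.error.URLError if the host cannot be reached, which is a quick signal that IPFS_HOST is wrong.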
Local changes should be committed to the dev branch. When you're ready to deploy a dev version, build a local image and push to the Docker Hub repo:
docker build -t priorartarchive/file-parser:dev .
docker push priorartarchive/file-parser:dev
Then head over to Elastic Beanstalk and upload a Dockerrun.aws.json file (containing URIs for the development database and elasticsearch, and referencing the priorartarchive/file-parser:dev image) to the file-parser-dev environment of the FileParser application.
Changes to master should only come from pull requests from dev. When you're ready to deploy a prod version, build a local image and push to the Docker Hub repo:
docker build -t priorartarchive/file-parser:prod .
docker push priorartarchive/file-parser:prod
Then head over to Elastic Beanstalk and upload a Dockerrun.aws.json file (containing URIs for the production database and elasticsearch, and referencing the priorartarchive/file-parser:prod image) to the tika-server-env environment of the TikaServer application.
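Since the dev and prod releases differ only in the tag, the build-and-push step could be wrapped in a small script. This is a sketch rather than part of the repository: release_commands and release are hypothetical names, and the script assumes docker is on the PATH and already logged in to Docker Hub.

```python
import subprocess

REPOSITORY = "priorartarchive/file-parser"


def release_commands(tag, context="."):
    """Return the docker commands for one release, without running them."""
    if tag not in ("dev", "prod"):
        raise ValueError(f"unknown tag: {tag!r}")
    image = f"{REPOSITORY}:{tag}"
    return [
        ["docker", "build", "-t", image, context],
        ["docker", "push", image],
    ]


def release(tag):
    """Build and push the image, stopping at the first failing command."""
    for command in release_commands(tag):
        subprocess.run(command, check=True)
```

Uploading the Dockerrun.aws.json file to Elastic Beanstalk remains a separate manual step.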
/Dockerrun.aws.sample.json
- -spawnChild is documented here. "This starts tika-server in a child process, and if there's an OOM, a timeout or other catastrophic problem with the child process, the parent process will kill and/or restart the child process."
- -JXmx1g sets the max heap for the spawned child process at 1GB.
- -JXms256m sets the initial heap for the spawned child process at 256MB.
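To make the two heap flags concrete, the sketch below turns each one into the option the child JVM would see and the heap size in bytes. It rests on two assumptions: that tika-server forwards -J flags to the child with the J removed, and that the JVM reads the k, m, and g suffixes as powers of 1024. Both helper names are hypothetical.

```python
import re

# JVM size suffixes are binary multiples.
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def child_jvm_option(flag):
    """Map a tika-server -J flag to the option passed to the child JVM."""
    if not flag.startswith("-J"):
        raise ValueError(f"not a -J flag: {flag!r}")
    return "-" + flag[2:]


def heap_bytes(flag):
    """Return the heap size in bytes named by a -JXmx or -JXms flag."""
    match = re.fullmatch(r"-JXm[xs](\d+)([kmg]?)", flag, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a heap flag: {flag!r}")
    size, unit = match.groups()
    return int(size) * _UNITS.get(unit.lower(), 1)
```

With the sample's settings, heap_bytes("-JXmx1g") is 1073741824 and heap_bytes("-JXms256m") is 268435456, so the child starts with a quarter of its maximum heap.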
In static/ there are two JSON-LD documents: tika-reference.json (aka dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u) and tika-provenance.json (aka dweb:/ipfs/bafybeiej4oe7qb5jhighp74mmy3st7fakznynjv62lti762bf4xqcdhmxq). These contain "background" knowledge about Tika that is referenced in the provenance of the assertions we generate.
Specifically, we attribute the resulting transcript and metadata documents to dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u#_:c14n74 - the prov:SoftwareAgent that is the Tika software application - with the prov:qualifiedAssociation that the software agent had a prov:Role of dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u#_:c14n5 (for metadata) or dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u#_:c14n54 (for text extraction). These "roles" correspond to REST API endpoints that are structured as schema.org EntryPoints and derived from the HTML API docs that the Tika server serves from GET "/" by default. These are frighteningly & admittedly unwieldy: in the future you'll be able to paste these URIs into the Underlay Playground to get explorable visualizations (both from the source document and from subsequent published references). These sorts of references are a low-level representation that should rarely be seen; it's our job to build better tools for referencing them.
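Each of the references above has the same shape: a dweb:/ipfs/ prefix, the hash of the document, and a fragment naming a blank node inside it. A minimal Python sketch of splitting one apart follows; parse_reference and AssertionReference are hypothetical names, and the reading of the fragment as a canonicalized blank node label is an assumption based on the _:c14n prefix.

```python
from typing import NamedTuple

PREFIX = "dweb:/ipfs/"


class AssertionReference(NamedTuple):
    cid: str    # hash of the referenced document, e.g. tika-reference.json
    label: str  # blank node label within that document, e.g. "c14n74"


def parse_reference(uri):
    """Split a dweb:/ipfs/<cid>#_:<label> URI into its two parts."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a dweb:/ipfs/ URI: {uri!r}")
    cid, separator, fragment = uri[len(PREFIX):].partition("#")
    if not separator or not fragment.startswith("_:"):
        raise ValueError(f"no blank node fragment in {uri!r}")
    return AssertionReference(cid, fragment[2:])
```

Parsing the three URIs in the paragraph above yields the same cid each time, with the labels c14n74 (the software agent), c14n5 (metadata), and c14n54 (text extraction).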
dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u (aka tika-reference.json) is pinned to the cluster and should be considered stable, to be changed only when absolutely necessary. tika-provenance.json contains provenance about tika-reference.json (via explicit reference to dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u as a digital document), citing the HTML API reference (that Tika itself generates!) as its source. In the (near) future we should sign (with some public KFG key) this document and publish it as well, but it's not necessary to get the Prior Art Archive working (unlike tika-reference.json, whose hash we need to use in our assertions).
tika-context.json is copied from and documented at this Gist.