This repository contains the PDF to TXT transformation job that is run on every new ArXiv paper. In addition, it retrieves the papers' metadata and sends it to the Dialect Map private API, using one of the following sources:
- The public ArXiv Kaggle dataset.
- The public ArXiv export API.
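
As an illustration of the second source, the sketch below builds a metadata query URL for the public ArXiv export API using its documented `id_list` and `max_results` parameters. The helper name `build_export_query_url` is hypothetical, and this is not necessarily how the job itself queries the API.

```python
from typing import Sequence
from urllib.parse import urlencode

# Documented query endpoint of the public ArXiv export API.
ARXIV_EXPORT_API = "http://export.arxiv.org/api/query"


def build_export_query_url(paper_ids: Sequence[str]) -> str:
    """Build a query URL asking for the metadata of the given paper IDs.

    Hypothetical helper, for illustration only.
    """
    params = {
        "id_list": ",".join(paper_ids),
        "max_results": len(paper_ids),
    }
    # Keep commas unescaped so the ID list stays readable.
    return f"{ARXIV_EXPORT_API}?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    print(build_export_query_url(["2101.00001", "2101.00002"]))
```

The response is an Atom feed with one entry per paper, which a client would then need to parse.
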
Python dependencies are specified in the files within the `reqs` directory.
To install all the development packages, as well as the defined commit hooks:

    make install-dev

All Python files are formatted using Black, with the custom properties defined in the `pyproject.toml` file:

    make check

Project testing is performed using Pytest. To run the tests:

    make test

The project contains a `main.py` module exposing a CLI with several commands:

    python3 src/main.py [OPTIONS] [COMMAND] [ARGS]...

This command starts a process that recursively traverses a file system tree of PDF files, transforming them into their TXT equivalent.

| ARGUMENT | ENV VARIABLE | REQUIRED | DESCRIPTION |
|---|---|---|---|
| --input-files-path | - | Yes | Path to the list of input PDF files |
| --output-files-path | - | Yes | Path to store the output TXT files |
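
The recursive traversal described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `iter_pdf_to_txt_paths` is hypothetical, and mirroring the input directory layout under the output path is an assumption. The actual PDF to TXT conversion is left out.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_pdf_to_txt_paths(
    input_root: Path, output_root: Path
) -> Iterator[Tuple[Path, Path]]:
    """Yield (pdf_path, txt_path) pairs for every PDF under input_root.

    Hypothetical helper: the output tree is assumed to mirror the input tree.
    """
    for pdf_path in sorted(input_root.rglob("*.pdf")):
        if not pdf_path.is_file():
            continue  # skip directories that happen to end in ".pdf"
        relative = pdf_path.relative_to(input_root)
        # Only the final ".pdf" suffix is replaced, so "2101.00001v1.pdf"
        # becomes "2101.00001v1.txt".
        yield pdf_path, (output_root / relative).with_suffix(".txt")


if __name__ == "__main__":
    for pdf, txt in iter_pdf_to_txt_paths(Path("pdfs"), Path("txts")):
        print(f"{pdf} -> {txt}")
```
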
This command starts a process that recursively traverses a file system tree of PDF files, sending their metadata to the Dialect Map private API along the way. The process assumes that each PDF is an ArXiv paper, with its file name serving as the paper ID.
| ARGUMENT | ENV VARIABLE | REQUIRED | DESCRIPTION |
|---|---|---|---|
| --input-files-path | - | Yes | Path to the list of input PDF files |
| --input-metadata-urls | - | Yes | URLs to the paper metadata sources |
| --gcp-key-path | - | Yes | GCP Service account key path |
| --output-api-url | - | Yes | Private API base URL |
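
Since each file name is assumed to be the paper ID, the ID can be derived from the path alone. The sketch below shows one way to do this, under the assumption that file names may carry a trailing version marker such as `v1`. The function name `arxiv_id_from_path` is hypothetical and not taken from this repository.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Splits a trailing version marker ("v1", "v2", ...) from the rest of the name.
VERSION_SUFFIX = re.compile(r"^(?P<id>.+?)v(?P<rev>\d+)$")


def arxiv_id_from_path(pdf_path: Path) -> Tuple[str, Optional[int]]:
    """Return (paper_id, revision) derived from a PDF file name.

    Hypothetical helper: revision is None when the name has no version marker.
    """
    stem = pdf_path.stem  # file name without the final ".pdf"
    match = VERSION_SUFFIX.match(stem)
    if match:
        return match.group("id"), int(match.group("rev"))
    return stem, None


if __name__ == "__main__":
    print(arxiv_id_from_path(Path("pdfs/2101/2101.00001v2.pdf")))
```
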