# Vana Data Refinement Template

This repository serves as a template for creating Dockerized *data refinement instructions* that transform raw user data into normalized (and potentially anonymized) SQLite-compatible databases, so that data in Vana can be queried by Vana's Query Engine.

## Overview

Here is an overview of the data refinement process on Vana.

1. DLPs upload user-contributed data through their UI and run proof-of-contribution against it. Afterwards, they call the refinement service to refine this data point.
1. The refinement service downloads the file from the Data Registry and decrypts it.
1. The refinement container, containing the instructions for data refinement (this repo), is executed:
    1. The decrypted data is mounted to the container's `/input` directory
    1. The raw data points are transformed against a normalized SQLite database schema (specifically libSQL, a modern fork of SQLite)
    1. Optionally, PII (Personally Identifiable Information) is removed or masked
    1. The refined data is symmetrically encrypted with a derivative of the original file's encryption key
1. The encrypted refined data is uploaded and pinned to DLP-owned IPFS storage
1. The IPFS CID is written to the refinement container's `/output` directory
1. The CID of the file is added as a refinement under the original file in the Data Registry
1. Vana's Query Engine indexes that data point, aggregating it with all other data points of a given refiner. This allows SQL queries to run against all data of a particular refiner (schema).
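Inside the container, steps 3a–3c above boil down to reading raw files from the input directory and writing a normalized SQLite database to the output directory. Here is a minimal sketch under stated assumptions: the raw files are JSON, and the hypothetical `id`/`email` fields and `users` table stand in for whatever your models in `refiner/models/` actually define.

```python
import json
import sqlite3
from pathlib import Path


def refine(input_dir: Path, output_dir: Path) -> Path:
    """Minimal sketch: raw JSON files in, one normalized SQLite database out."""
    output_dir.mkdir(parents=True, exist_ok=True)
    db_path = output_dir / "db.libsql"
    conn = sqlite3.connect(db_path)
    # Hypothetical normalized schema; the real one comes from refiner/models/.
    conn.execute("CREATE TABLE IF NOT EXISTS users (id TEXT PRIMARY KEY, email TEXT)")
    for raw_file in sorted(input_dir.glob("*.json")):
        record = json.loads(raw_file.read_text())
        # Masking the email is the optional PII-removal step (3c).
        conn.execute(
            "INSERT OR REPLACE INTO users VALUES (?, ?)",
            (record["id"], "***masked***"),
        )
    conn.commit()
    conn.close()
    return db_path
```

Encryption and the IPFS upload then happen on the returned database file, not inside the transform itself.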

## Project Structure

- `refiner/`: Contains the main refinement logic
  - `refine.py`: Core refinement implementation
  - `config.py`: Environment variables and settings needed to run your refinement
  - `__main__.py`: Entry point for the refinement execution
  - `models/`: Pydantic and SQLAlchemy data models (for both unrefined and refined data)
  - `transformer/`: Data transformation logic
  - `utils/`: Utility functions for encryption, IPFS upload, etc.
- `input/`: Contains raw data files to be refined
- `output/`: Contains refined outputs:
  - `schema.json`: Database schema definition
  - `db.libsql`: SQLite database file
  - `db.libsql.pgp`: Encrypted database file
- `Dockerfile`: Defines the container image for the refinement task
- `requirements.txt`: Python package dependencies
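The `utils/` helpers handle the encryption step, and the symmetric key used there ultimately comes from `REFINEMENT_ENCRYPTION_KEY`. As a hypothetical sketch only (the template's actual derivation and PGP encryption in `utils/` may differ; the salt and iteration count below are illustrative), a fixed-length key could be derived like this:

```python
import hashlib


def derive_key(refinement_key: str, salt: bytes = b"refinement") -> bytes:
    """Derive a 32-byte symmetric key from the injected refinement key.

    Hypothetical example: the refinement service injects
    REFINEMENT_ENCRYPTION_KEY; PBKDF2-HMAC-SHA256 stretches it into
    key material suitable for symmetric encryption.
    """
    return hashlib.pbkdf2_hmac("sha256", refinement_key.encode(), salt, 100_000)
```

The derivation is deterministic, so the same injected key always yields the same encryption key for a given file.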

## Getting Started

1. Fork this repository
2. Copy `.env.example` to `.env` and modify the values to match your environment
3. Update the schemas in `refiner/models/` to define your raw and normalized data models
4. Modify the refinement logic in `refiner/transformer/` to match your data structure
5. If needed, modify `refiner/refine.py` to point at the file(s) that need to be refined
6. Build and test your refinement container
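For step 3, the model pairing in `refiner/models/` might look like the following sketch: a Pydantic model for the raw (unrefined) data point and a SQLAlchemy model for the normalized table. The `RawDriveStats`/`DriveStats` names and fields are hypothetical stand-ins for a Google Drive analytics schema, not the template's actual models.

```python
from pydantic import BaseModel
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class RawDriveStats(BaseModel):
    """Hypothetical shape of one raw data point, validated with Pydantic."""
    user_id: str
    file_count: int


class DriveStats(Base):
    """Hypothetical normalized table the transformer writes into SQLite."""
    __tablename__ = "drive_stats"
    user_id = Column(String, primary_key=True)
    file_count = Column(Integer, nullable=False)
```

The transformer in `refiner/transformer/` is then responsible for mapping validated raw models onto rows of the SQLAlchemy models.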

### Environment variables

Copy `.env.example` to `.env` and configure the following variables:

```dotenv
# Local directories where inputs and outputs are found
# When running on the refinement service, files will be mounted to the /input and /output directories of the container
INPUT_DIR=input
OUTPUT_DIR=output

# This key is derived from the user file's original encryption key and is automatically injected into the container by the refinement service
# When developing locally, any string can be used here for testing
REFINEMENT_ENCRYPTION_KEY=0x1234

# Schema configuration
SCHEMA_NAME=Google Drive Analytics
SCHEMA_VERSION=0.0.1
SCHEMA_DESCRIPTION=Schema for the Google Drive DLP, representing some basic analytics of the Google user
SCHEMA_DIALECT=sqlite

# IPFS configuration
# Required if using https://pinata.cloud (IPFS pinning service)
PINATA_API_KEY=your_pinata_api_key_here
PINATA_API_SECRET=your_pinata_api_secret_here

# Public IPFS gateway URL for accessing uploaded files
# Recommended: use your own dedicated IPFS gateway to avoid congestion / rate limiting
# Example: "https://ipfs.my-dao.org/ipfs" (Note: won't work for third-party files)
IPFS_GATEWAY_URL=https://gateway.pinata.cloud/ipfs
```
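A sketch of how `refiner/config.py` might pick these values up follows. Only the variable names and local-development defaults come from `.env.example` above; the `Settings` dataclass and `load_settings` helper are illustrative, and the template's actual config may use a settings library instead.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Illustrative settings object mirroring a subset of .env.example."""
    input_dir: str
    output_dir: str
    refinement_encryption_key: str
    schema_name: str
    ipfs_gateway_url: str


def load_settings(env=None):
    """Read refinement settings from environment variables, with local-dev defaults."""
    env = dict(os.environ) if env is None else env
    return Settings(
        input_dir=env.get("INPUT_DIR", "input"),
        output_dir=env.get("OUTPUT_DIR", "output"),
        refinement_encryption_key=env.get("REFINEMENT_ENCRYPTION_KEY", ""),
        schema_name=env.get("SCHEMA_NAME", ""),
        ipfs_gateway_url=env.get("IPFS_GATEWAY_URL", "https://gateway.pinata.cloud/ipfs"),
    )
```

On the refinement service the same names are injected as container environment variables, so no `.env` file is needed there.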

## Local Development

To run the refinement locally for testing:

```bash
# With Python
pip install --no-cache-dir -r requirements.txt
python -m refiner

# Or with Docker
docker build -t refiner .
docker run \
  --rm \
  --volume $(pwd)/input:/input \
  --volume $(pwd)/output:/output \
  --env PINATA_API_KEY=your_key \
  --env PINATA_API_SECRET=your_secret \
  refiner
```
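After a local run, the unencrypted `output/db.libsql` is SQLite-compatible, so Python's standard-library `sqlite3` module can sanity-check it. A small sketch (the `output/db.libsql` path assumes the default `OUTPUT_DIR` from `.env`):

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """List user tables in the refined database for a quick sanity check."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(list_tables("output/db.libsql"))
```

If the table list matches the schema you defined in `refiner/models/`, the transform step worked end to end.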

## Contributing

If you have suggestions for improving this template, please open an issue or submit a pull request.

## License

[MIT License](LICENSE)