141 changes: 97 additions & 44 deletions README.md
@@ -1,90 +1,143 @@
# Vana Data Refinement Template
# Vana Data Refiner for Unwrapped Spotify Contribution Proofs

This repository serves as a template for creating Dockerized *data refinement instructions* that transform raw user data into normalized (and potentially anonymized) SQLite-compatible databases, so that data in Vana can be queried by Vana's Query Engine.
This repository is a customized version of the Vana Data Refinement template, specifically designed to process and refine output data from the `unwrapped-proof-of-contribution` system. It transforms the JSON-based proof results into a normalized and queryable SQLite database, suitable for the Vana ecosystem.

## Overview

Here is an overview of the data refinement process on Vana.
This data refiner takes the `results.json` file generated by the Unwrapped Spotify proof of contribution system and transforms it into a structured SQLite database. The process involves:

![How Refinements Work](https://files.readme.io/25f8f6a4c8e785a72105d6eb012d09449f63ab5682d1f385120eaf5af871f9a2-image.png "How Refinements Work")
1. **Parsing Input**: Reading the `results.json` file which contains details about a Spotify data contribution, its validation status, score, and various attributes.
2. **Data Transformation**: Mapping the input JSON data to a predefined relational schema. This includes separating different aspects of the proof (main details, attributes, points breakdown, source file metadata) into distinct but related tables.
3. **Database Creation**: Storing the transformed data in a libSQL (SQLite compatible) database.
4. **Encryption**: Symmetrically encrypting the resulting SQLite database file using a key derived from the original file's encryption key.
5. **IPFS Upload (Optional)**: If configured, the encrypted database and its schema definition are uploaded to IPFS.

1. DLPs upload user-contributed data through their UI, and run proof-of-contribution against it. Afterwards, they call the refinement service to refine this data point.
1. The refinement service downloads the file from the Data Registry and decrypts it.
1. The refinement container, containing the instructions for data refinement (this repo), is executed
1. The decrypted data is mounted to the container's `/input` directory
1. The raw data points are transformed against a normalized SQLite database schema (specifically libSQL, a modern fork of SQLite)
1. Optionally, PII (Personally Identifiable Information) is removed or masked
1. The refined data is symmetrically encrypted with a derivative of the original file encryption key
1. The encrypted refined data is uploaded and pinned to a DLP-owned IPFS
1. The IPFS CID is written to the refinement container's `/output` directory
1. The CID of the file is added as a refinement under the original file in the Data Registry
1. Vana's Query Engine indexes that data point, aggregating it with all other data points of a given refiner. This allows SQL queries to run against all data of a particular refiner (schema).
The refined, encrypted database can then be registered with the Vana Data Registry, making the structured proof information queryable by permitted entities within the Vana ecosystem.
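The end-to-end flow above can be sketched in a few lines of Python. Everything here is illustrative: the field names (`file_id`, `dlp_id`, `valid`, `score`) and the SHA-256 key derivation only stand in for the real logic in `refiner/transformer/unwrapped_transformer.py` and the key the refinement service injects.

```python
import hashlib
import sqlite3


def refine_proof(results: dict, original_key: str) -> tuple[sqlite3.Connection, str]:
    """Illustrative refinement: map one proof result onto a relational row
    and derive a refinement encryption key. Field names are assumptions
    based on this README, not the actual schema in refiner/models/refined.py."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE unwrapped_proofs ("
        "file_id INTEGER PRIMARY KEY, dlp_id INTEGER, valid INTEGER, score REAL)"
    )
    conn.execute(
        "INSERT INTO unwrapped_proofs VALUES (?, ?, ?, ?)",
        (results["file_id"], results["dlp_id"], int(results["valid"]), results["score"]),
    )
    conn.commit()
    # Hypothetical key derivation: in production the service injects the
    # derived key as REFINEMENT_ENCRYPTION_KEY; SHA-256 here merely stands
    # in for "a derivative of the original file encryption key".
    derived_key = hashlib.sha256(original_key.encode()).hexdigest()
    return conn, derived_key


conn, key = refine_proof({"file_id": 1, "dlp_id": 7, "valid": True, "score": 0.95}, "0x1234")
row = conn.execute("SELECT valid, score FROM unwrapped_proofs").fetchone()
```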

## Refined Database Schema

The refinement process generates a SQLite database with the following main tables:

* **`unwrapped_proofs`**: Stores the core information for each proof, such as `file_id`, `dlp_id`, validity, scores, and basic metadata.
* **`proof_attributes`**: Contains detailed attributes from valid proofs, like `track_count`, `total_minutes_listened`, `unique_artist_count`, etc. Linked to `unwrapped_proofs`.
* **`points_breakdown_scores`**: Stores the breakdown of points (volume, diversity, history) for a contribution. Linked to `proof_attributes`.
* **`source_file_metadata`**: Contains metadata about the original encrypted data file processed by the proof system, including its source URL and checksums. Linked to `unwrapped_proofs`.

For detailed column information, refer to `refiner/models/refined.py`.
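Assuming the columns named in the descriptions above (the authoritative definitions live in `refiner/models/refined.py`), the table relationships can be sketched as plain DDL:

```python
import sqlite3

# Illustrative DDL only -- the actual columns are defined as SQLAlchemy
# models in refiner/models/refined.py.
DDL = """
CREATE TABLE unwrapped_proofs (
    file_id INTEGER PRIMARY KEY,
    dlp_id  INTEGER,
    valid   INTEGER,
    score   REAL
);
CREATE TABLE proof_attributes (
    id       INTEGER PRIMARY KEY,
    proof_id INTEGER REFERENCES unwrapped_proofs(file_id),
    track_count INTEGER,
    total_minutes_listened INTEGER,
    unique_artist_count INTEGER
);
CREATE TABLE points_breakdown_scores (
    id           INTEGER PRIMARY KEY,
    attribute_id INTEGER REFERENCES proof_attributes(id),
    volume REAL, diversity REAL, history REAL
);
CREATE TABLE source_file_metadata (
    id         INTEGER PRIMARY KEY,
    proof_id   INTEGER REFERENCES unwrapped_proofs(file_id),
    source_url TEXT,
    checksum   TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

The foreign keys mirror the "linked to" relationships described in the list above.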

## Project Structure

- `refiner/`: Contains the main refinement logic
- `refine.py`: Core refinement implementation
- `config.py`: Environment variables and settings needed to run your refinement
- `config.py`: Environment variables and settings (customized for Unwrapped proofs)
- `__main__.py`: Entry point for the refinement execution
- `models/`: Pydantic and SQLAlchemy data models (for both unrefined and refined data)
- `models/`: Pydantic and SQLAlchemy data models
- `unrefined.py`: Pydantic models for the input `results.json`
- `refined.py`: SQLAlchemy models for the output SQLite database schema
- `transformer/`: Data transformation logic
- `unwrapped_transformer.py`: Transforms Unwrapped proof data to the refined schema
- `utils/`: Utility functions for encryption, IPFS upload, etc.
- `input/`: Contains raw data files to be refined
- `input/`: Directory where the input `results.json` (or similar) file should be placed
- `output/`: Contains refined outputs:
- `schema.json`: Database schema definition
- `db.libsql`: SQLite database file
- `db.libsql.pgp`: Encrypted database file
- `schema.json`: SQLite schema definition (generated)
- `db.libsql`: SQLite database file (generated, unencrypted)
- `db.libsql.pgp`: Encrypted SQLite database file (generated)
- `output.json`: JSON file containing the `refinement_url` (IPFS or local file path) and schema details.
- `Dockerfile`: Defines the container image for the refinement task
- `requirements.txt`: Python package dependencies

## Getting Started

1. Fork this repository
1. Modify the config to match your environment, or add a .env file at the root. See below for defaults.
1. Update the schemas in `refiner/models/` to define your raw and normalized data models
1. Modify the refinement logic in `refiner/transformer/` to match your data structure
1. If needed, modify `refiner/refine.py` with the file(s) that need to be refined
1. Build and test your refinement container
1. **Clone/Fork this Repository**: This repository contains the refiner logic tailored for Unwrapped.
2. **Input Data**: Place the `results.json` file (or a similarly structured JSON file) generated by the `unwrapped-proof-of-contribution` system into the `input/` directory.
3. **Environment Variables**:
* Create a `.env` file in the root of the project or set environment variables directly.
* The most important ones for local testing are `REFINEMENT_ENCRYPTION_KEY` (any string for testing) and optionally `PINATA_API_KEY` and `PINATA_API_SECRET` if you want to upload to IPFS via Pinata.
* See the "Environment Variables" section below for more details.
4. **Build and Test**: Follow the "Local Development" instructions.

### Environment Variables

Create a `.env` file in the project root or set these environment variables:

### Environment variables
```dotenv
# Local directories where inputs and outputs are found. When running on the refinement service, files will be mounted to the /input and /output directories of the container.
# Local directories where inputs and outputs are found.
# When running on the Vana refinement service, these will be /input and /output.
INPUT_DIR=input
OUTPUT_DIR=output

# This key is derived from the user file's original encryption key, automatically injected into the container by the refinement service. When developing locally, any string can be used here for testing.
REFINEMENT_ENCRYPTION_KEY=0x1234

# Required if using https://pinata.cloud (IPFS pinning service)
PINATA_API_KEY=xxx
PINATA_API_SECRET=yyy
# This key is derived from the user file's original encryption key,
# automatically injected into the container by the Vana refinement service.
# When developing locally, any non-empty string can be used here for testing.
REFINEMENT_ENCRYPTION_KEY=your_test_encryption_key_here_for_local_dev

# Schema details (defaults are set in refiner/config.py for Unwrapped)
# SCHEMA_NAME="Unwrapped Spotify Contribution Proof"
# SCHEMA_VERSION="1.0.0"
# SCHEMA_DESCRIPTION="Schema for refined Unwrapped Spotify listening data contribution proofs."
# SCHEMA_DIALECT="sqlite"

# Optional, required if using https://pinata.cloud (IPFS pinning service)
# If not provided, IPFS uploads will be skipped and output.refinement_url will be a local file:// path.
# PINATA_API_KEY=your_pinata_api_key
# PINATA_API_SECRET=your_pinata_api_secret
```

## Local Development

To run the refinement locally for testing:

**1. Install Dependencies:**
```bash
# With Python
pip install --no-cache-dir -r requirements.txt
```

**2. Prepare Input:**
Place your `results.json` (or equivalent) from the `unwrapped-proof-of-contribution` output into the `input/` directory of this project.

**3. Run with Python:**
Make sure your `.env` file is configured or environment variables are set.
```bash
python -m refiner
```

**4. (Alternatively) Run with Docker:**

First, build the Docker image:
```bash
docker build -t unwrapped-refiner .
```

Then, run the container. Remember to replace `your_test_encryption_key_here_for_local_dev` with an actual string if `REFINEMENT_ENCRYPTION_KEY` is not in your `.env` file or environment.
If using Pinata, also pass `PINATA_API_KEY` and `PINATA_API_SECRET`.

# Or with Docker
docker build -t refiner .
```bash
# Example without Pinata (output URL will be local file path)
docker run \
--rm \
--volume $(pwd)/input:/input \
--volume $(pwd)/output:/output \
--env PINATA_API_KEY=your_key \
--env PINATA_API_SECRET=your_secret \
refiner
--env REFINEMENT_ENCRYPTION_KEY="your_test_encryption_key_here_for_local_dev" \
unwrapped-refiner

# Example with Pinata
# docker run \
# --rm \
# --volume $(pwd)/input:/input \
# --volume $(pwd)/output:/output \
# --env REFINEMENT_ENCRYPTION_KEY="your_test_encryption_key_here_for_local_dev" \
# --env PINATA_API_KEY="your_pinata_api_key" \
# --env PINATA_API_SECRET="your_pinata_api_secret" \
# unwrapped-refiner
```

After execution, check the `output/` directory for `db.libsql` (the SQLite database), `db.libsql.pgp` (the encrypted database), `schema.json`, and `output.json`.
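To verify the result, the unencrypted `db.libsql` can be opened directly with Python's built-in `sqlite3` module (the path assumes the default `OUTPUT_DIR` from the `.env` example):

```python
import sqlite3
from pathlib import Path

# Assumes the refiner has already run with OUTPUT_DIR=output.
db_path = Path("output/db.libsql")
if db_path.exists():
    conn = sqlite3.connect(db_path)
    for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"):
        count = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
        print(f"{name}: {count} rows")
else:
    print(f"{db_path} not found -- run the refiner first")
```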

## Contributing

If you have suggestions for improving this template, please open an issue or submit a pull request.
This refiner is specifically tailored for Unwrapped.
If you have suggestions for improving the Vana Data Refinement template itself, please refer to the [vana-data-refinement-template repository](https://github.com/vana-com/vana-data-refinement-template). For issues or improvements related to this Unwrapped-specific refiner, please open an issue or submit a pull request in this repository.

## License

[MIT License](LICENSE)

[MIT License](LICENSE)
Binary file removed input/user.zip
Binary file not shown.
Empty file removed output/.gitkeep
Empty file.
65 changes: 48 additions & 17 deletions refiner/config.py
@@ -4,55 +4,86 @@

class Settings(BaseSettings):
"""Global settings configuration using environment variables"""

INPUT_DIR: str = Field(
default="/input",
description="Directory containing input files to process"
)

OUTPUT_DIR: str = Field(
default="/output",
description="Directory where output files will be written"
)

REFINEMENT_ENCRYPTION_KEY: str = Field(
default=None,
description="Key to symmetrically encrypt the refinement. This is derived from the original file encryption key"
)

SCHEMA_NAME: str = Field(
default="Google Drive Analytics",
description="Name of the schema"
default="Unwrapped Spotify Data",
description="Schema name for Unwrapped Spotify listening data"
)

SCHEMA_VERSION: str = Field(
default="0.0.1",
description="Version of the schema"
default="0.1.3",
description="Version of the Unwrapped Spotify schema"
)

SCHEMA_DESCRIPTION: str = Field(
default="Schema for the Google Drive DLP, representing some basic analytics of the Google user",
description="Description of the schema"
default="Refined schema for Spotify listening history and derived top artists. Artist details are enriched via Spotify Web API.",
description="Description of the Unwrapped Spotify schema"
)

SCHEMA_DIALECT: str = Field(
default="sqlite",
description="Dialect of the schema"
)

# Optional, required if using https://pinata.cloud (IPFS pinning service)
PINATA_API_KEY: Optional[str] = Field(
default=None,
description="Pinata API key"
)

PINATA_API_SECRET: Optional[str] = Field(
default=None,
description="Pinata API secret"
)


PINATA_API_GATEWAY: Optional[str] = Field(
default="https://gateway.pinata.cloud/ipfs",
description="Pinata API gateway URL. Note: This is the gateway to access, not the API endpoint for upload."
)

# Spotify Web API Credentials
SPOTIFY_CLIENT_ID: Optional[str] = Field(
default=None,
description="Spotify Web API Client ID"
)
SPOTIFY_CLIENT_SECRET: Optional[str] = Field(
default=None,
description="Spotify Web API Client Secret"
)
SPOTIFY_API_URL: str = Field(
default="https://api.spotify.com/v1",
description="Base URL for Spotify Web API"
)
SPOTIFY_TOKEN_URL: str = Field(
default="https://accounts.spotify.com/api/token",
description="Token URL for Spotify Web API"
)
SPOTIFY_MAX_IDS_PER_BATCH: int = Field(
default=50,
description="Max IDs for Spotify batch API calls (artists/tracks)"
)
SPOTIFY_API_CALL_DELAY_SECONDS: float = Field(
default=0.1, # Slightly increased default for Spotify API
description="Delay in seconds between individual Spotify API calls."
)

class Config:
env_file = ".env"
case_sensitive = True

settings = Settings()
settings = Settings()
2 changes: 1 addition & 1 deletion refiner/models/offchain_schema.py
@@ -5,4 +5,4 @@ class OffChainSchema(BaseModel):
version: str
description: str
dialect: str
schema: str
schema_definition: str
2 changes: 1 addition & 1 deletion refiner/models/output.py
@@ -5,4 +5,4 @@

class Output(BaseModel):
refinement_url: Optional[str] = None
schema: Optional[OffChainSchema] = None
output_schema: Optional[OffChainSchema] = None