py-hamt

Key-Value mapping on top of content-addressed systems.

This is a Python implementation of a HAMT (hash array mapped trie), adapted from rvagg's IAMap project, which is written in JavaScript. Like IAMap, py-hamt abstracts over a backing storage layer: you can store any arbitrary amount of data in any store that returns its own IDs for saved values, e.g. content-addressed systems.

The key difference from IAMap is that the py-hamt data structure is mutable and synchronous rather than asynchronous, but the core idea of abstracting over a value store is the same.
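To make the storage abstraction concrete, here is a toy sketch of a content-addressed store in plain Python. This is not py-hamt's actual Store interface; the ContentAddressedStore class and its save/load methods are invented purely for illustration.

```python
import hashlib

# Toy content-addressed store (NOT py-hamt's real Store API):
# it accepts arbitrary bytes and returns an ID it derives itself,
# here a SHA-256 hex digest of the content.
class ContentAddressedStore:
    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def save(self, data: bytes) -> str:
        block_id = hashlib.sha256(data).hexdigest()
        self._blocks[block_id] = data
        return block_id

    def load(self, block_id: str) -> bytes:
        return self._blocks[block_id]

store = ContentAddressedStore()
node_id = store.save(b"serialized HAMT node")
assert store.load(node_id) == b"serialized HAMT node"
```

A structure that saves its nodes through such a store gets deduplication for free, since identical blocks hash to the same ID.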

dClimate created this library to store zarr files on IPFS. To see it in action, see our data ETLs.

Usage

Since this package is not published to PyPI, install it directly from git:

pip install 'git+https://github.com/dClimate/py-hamt'

Below are some examples; for more information, see the API documentation. Looking at the test files, namely test_hamt.py, is also quite helpful. You can also see this library used for data analysis in the dClimate Jupyter Notebooks.

Basic Writing/Reading from an in-memory store

from py_hamt import HAMT, DictStore

# Setup a HAMT with an in memory store
hamt = HAMT(store=DictStore())

# Set and get one value
hamt["foo"] = "bar"
assert "bar" == hamt["foo"]
assert len(hamt) == 1

# Set and get multiple values
hamt["foo"] = "bar1"
hamt["foo2"] = 2
assert 2 == hamt["foo2"]
assert len(hamt) == 2

# Iterate over keys
for key in hamt:
  print(key)
print(list(hamt)) # ["foo", "foo2"], order depends on the hash function used

# Delete a value
del hamt["foo"]
assert len(hamt) == 1

Reading a CID from IPFS

from py_hamt import HAMT, IPFSStore
from multiformats import CID

# Get the CID you wish to read whether from a blog post, a smart contract, or a friend
dataset_cid = "baf..."

# Use the multiformats library to decode the CID into an object
root_cid = CID.decode(dataset_cid)

# Create a HAMT instance using IPFSStore, which connects to the HTTP
# gateway of your locally running IPFS node. You can optionally pass
# your own gateway instead of the default with the argument
# gateway_uri_stem="http://<IP>:<PORT>"
hamt = HAMT(store=IPFSStore(), root_node_id=root_cid)

# Do something with the hamt key/values
...

Partially encrypted zarrs

import xarray as xr

from py_hamt import HAMT, IPFSStore, create_zarr_encryption_transformers

ds = ... # example ds with precip and temp data variables
encryption_key = bytes(32) # change before using, only for demonstration purposes!
header = "sample-header".encode()
encrypt, decrypt = create_zarr_encryption_transformers(
    encryption_key, header, exclude_vars=["temp"]
)
hamt = HAMT(
    store=IPFSStore(), transformer_encode=encrypt, transformer_decode=decrypt
)
ds.to_zarr(store=hamt, mode="w")

print("Attempting to read and print metadata of partially encrypted zarr")
enc_ds = xr.open_zarr(
    store=HAMT(store=IPFSStore(), root_node_id=hamt.root_node_id, read_only=True)
)
print(enc_ds)
assert enc_ds.temp.sum() == ds.temp.sum()
try:
    enc_ds.precip.sum()
except Exception:
    print("Couldn't read encrypted variable")

Development Guide

Setting Up

py-hamt uses uv for project management. Make sure you install that first. Once uv is installed, run

uv sync

to create the project virtual environment at .venv. Don't worry about activating this virtual environment to run tests, formatting, or linting; uv will automatically take care of that.

Run tests, formatting, linting

First, make sure you have the IPFS Kubo daemon installed and running, e.g. via ipfs daemon, so that the tests that use IPFS as a backing store can run. If needed, configure the tests with your custom HTTP gateway and RPC API endpoints. Then run the script

bash run-checks.sh

This will run tests with code coverage information, and then check formatting and linting.

We use pytest with 100% code coverage, with test inputs that are both handwritten and generated by hypothesis. This allows us to try out millions of randomized inputs, creating a more robust library.
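To illustrate the kind of property such randomized tests check, here is a sketch of a set-then-get round-trip property. It uses Python's random module in place of hypothesis, and a plain dict stands in for a HAMT (both behave as mutable mappings); the check_roundtrip helper is invented for this example.

```python
import random
import string

# Property sketch: after inserting n random key/value pairs, the
# mapping must report the right length and return every value back.
# A plain dict stands in for a HAMT here; in the real test suite,
# hypothesis generates the inputs instead of the random module.
def check_roundtrip(mapping, rng: random.Random, n: int = 1000) -> None:
    expected: dict[str, int] = {}
    for _ in range(n):
        key = "".join(rng.choices(string.ascii_lowercase, k=8))
        value = rng.randrange(1_000_000)
        mapping[key] = value
        expected[key] = value
    assert len(mapping) == len(expected)
    for key, value in expected.items():
        assert mapping[key] == value

check_roundtrip({}, random.Random(0))
```

hypothesis additionally shrinks any failing input to a minimal counterexample, which is what makes randomized failures practical to debug.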

Note that due to the randomized inputs, it is sometimes possible to get 99% or lower test coverage by pure chance. Rerun the tests to get back to complete code coverage. If this happens in a GitHub Action, try rerunning the action.

CPU and Memory Profiling

We use Python's native cProfile for collecting CPU profiles and snakeviz for visualizing them. We use memray for memory profiling. We run the profilers on the test suite, since the tests are supposed to have complete code coverage anyway.

Creating the CPU and memory profile requires manual activation of the virtual environment.

source .venv/bin/activate
python -m cProfile -o profile.prof -m pytest
python -m memray run -m pytest

The profile viewers can be directly invoked from uv.

uv run snakeviz profile.prof
uv run memray flamegraph <memray-output> # e.g. <memray-output> = memray-pytest.12398.bin

Generating documentation

py-hamt uses pdoc, chosen for its ease of use. To see a live documentation preview on your local machine, run

uv run pdoc py_hamt

Managing dependencies

Use uv add and uv remove, e.g. uv add numpy or uv add --dev pytest. For more information please see the uv documentation.