This is a Python implementation of a HAMT (Hash Array Mapped Trie), adapted from rvagg's IAMap project written in JavaScript. Like IAMap, py-hamt abstracts over a backing storage layer, which lets you store arbitrary amounts of data in any system that returns its own IDs, e.g. content-addressed storage.
The key differences from IAMap are that the py-hamt data structure is mutable and not asynchronous, but the core idea of abstracting over a value store is the same.
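To make the store abstraction concrete, here is a minimal sketch of a content-addressed store: it saves a blob of bytes, derives an ID from the content, and returns the same bytes when given that ID later. The class and method names here are illustrative assumptions, not py-hamt's actual Store interface; see the API documentation for the real one.
import hashlib

class ToyContentStore:
    """Illustrative only: a content-addressed store mapping hash -> bytes."""

    def __init__(self):
        self.blobs: dict[bytes, bytes] = {}

    def save(self, data: bytes) -> bytes:
        # The ID is derived from the content itself, hence "content-addressed"
        blob_id = hashlib.sha256(data).digest()
        self.blobs[blob_id] = data
        return blob_id

    def load(self, blob_id: bytes) -> bytes:
        return self.blobs[blob_id]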
dClimate created this library to store zarr files on IPFS. To see it in action, take a look at our data ETLs.
Since we do not publish this package to PyPI, install this library into your project directly from git:
pip install 'git+https://github.com/dClimate/py-hamt'
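Because there are no PyPI releases to pin against, you may want to pin to a specific git revision; pip supports an @<ref> suffix for this (the ref below is a placeholder for a real commit hash or tag):
pip install 'git+https://github.com/dClimate/py-hamt@<commit-or-tag>'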
Below are some examples; for more information, see the API documentation. Looking at the test files, namely test_hamt.py, is also quite helpful. You can also see this library used for data analysis in the dClimate Jupyter Notebooks.
from py_hamt import HAMT, DictStore
# Set up a HAMT with an in-memory store
hamt = HAMT(store=DictStore())
# Set and get one value
hamt["foo"] = "bar"
assert "bar" == hamt["foo"]
assert len(hamt) == 1
# Set and get multiple values
hamt["foo"] = "bar1"
hamt["foo2"] = 2
assert 2 == hamt["foo2"]
assert len(hamt) == 2
# Iterate over keys
for key in hamt:
    print(key)
print(list(hamt))  # ["foo", "foo2"], order depends on the hash function used
# Delete a value
del hamt["foo"]
assert len(hamt) == 1
from py_hamt import HAMT, IPFSStore
from multiformats import CID
# Get the CID you wish to read, whether from a blog post, a smart contract, or a friend
dataset_cid = "baf..."
# Use the multiformats library to decode the CID into an object
root_cid = CID.decode(dataset_cid)
# Create a HAMT instance using IPFSStore, which connects to the HTTP gateway
# of your locally running IPFS node
hamt = HAMT(store=IPFSStore(), root_node_id=root_cid)
# You can optionally pass your own gateway instead of the default with the
# argument gateway_uri_stem="http://<IP>:<PORT>"
# Do something with the hamt key/values
...
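For example, since a HAMT supports iteration and indexing as shown earlier, one possible way to inspect the dataset is to walk its keys and values:
# List every key and its value, reusing the mapping API from the first example
for key in hamt:
    print(key, hamt[key])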
import xarray as xr

from py_hamt import HAMT, IPFSStore, create_zarr_encryption_transformers

ds = ...  # example dataset with precip and temp data variables
encryption_key = bytes(32) # change before using, only for demonstration purposes!
header = "sample-header".encode()
encrypt, decrypt = create_zarr_encryption_transformers(
    encryption_key, header, exclude_vars=["temp"]
)
hamt = HAMT(
    store=IPFSStore(), transformer_encode=encrypt, transformer_decode=decrypt
)
ds.to_zarr(store=hamt, mode="w")
print("Attempting to read and print metadata of partially encrypted zarr")
enc_ds = xr.open_zarr(
    store=HAMT(store=IPFSStore(), root_node_id=hamt.root_node_id, read_only=True)
)
print(enc_ds)
assert enc_ds.temp.sum() == ds.temp.sum()
try:
    enc_ds.precip.sum()
except Exception:
    print("Couldn't read encrypted variable")
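To actually read the encrypted variable back, you can reattach the decryption transformer when opening. Here is a sketch reusing the key, header, and root node ID from the example above; it assumes the same exclude_vars configuration must be passed on the read side so variables are matched up correctly.
# Recreate the transformers with the same key, header, and exclusions
_, decrypt = create_zarr_encryption_transformers(
    encryption_key, header, exclude_vars=["temp"]
)
dec_ds = xr.open_zarr(
    store=HAMT(
        store=IPFSStore(),
        root_node_id=hamt.root_node_id,
        transformer_decode=decrypt,
        read_only=True,
    )
)
assert dec_ds.precip.sum() == ds.precip.sum()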
py-hamt uses uv for project management, so make sure you install that first. Once uv is installed, run
uv sync
to create the project virtual environment at .venv. Don't worry about activating this virtual environment to run tests, formatting, or linting; uv will automatically take care of that.
First, make sure you have the IPFS Kubo daemon installed so that you can run the tests that use IPFS as a backing store; start the daemon with ipfs daemon. If needed, configure the tests with your custom HTTP gateway and RPC API endpoints. Then run the script
bash run-checks.sh
This will run the tests with code coverage information, and then check formatting and linting.
We use pytest with 100% code coverage, with test inputs that are both handwritten and generated by hypothesis. This allows us to try out millions of randomized inputs to create a more robust library.
Note that due to the randomized inputs, it is sometimes possible to get 99% or lower test coverage by pure chance. Rerun the tests to get back to complete code coverage. If this happens in a GitHub Action, try rerunning the action.
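As a rough illustration of the approach (not a verbatim test from the suite), a hypothesis property test for the HAMT might look like this:
from hypothesis import given, strategies as st

from py_hamt import HAMT, DictStore

@given(st.dictionaries(st.text(), st.text()))
def test_acts_like_a_dict(kvs):
    # Whatever we put in, we should get back out, with a matching length
    hamt = HAMT(store=DictStore())
    for key, value in kvs.items():
        hamt[key] = value
    assert len(hamt) == len(kvs)
    for key, value in kvs.items():
        assert hamt[key] == value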
We use Python's native cProfile for CPU profiles and snakeviz to visualize them. We use memray for memory profiling. We run the profiles on the test suite, since the tests are supposed to have complete code coverage anyhow.
Creating the CPU and memory profile requires manual activation of the virtual environment.
source .venv/bin/activate
python -m cProfile -o profile.prof -m pytest
python -m memray run -m pytest
The profile viewers can be directly invoked from uv.
uv run snakeviz .
uv run memray flamegraph <memray-output>  # e.g. <memray-output> = memray-pytest.12398.bin
py-hamt uses pdoc for its documentation due to its ease of use. To see a live documentation preview on your local machine, run
uv run pdoc py_hamt
To manage dependencies, use uv add and uv remove, e.g. uv add numpy or uv add --dev pytest. For more information, please see the uv documentation.