To reduce network consumption, pullframe syncs dataframes from other nodes only on demand. If your task is divide-and-conquer style, consider dask instead.
- Once the cache has been synced, no remote calls are made, so cache locality is 1.
- The ideal use case is a dataframe that is read multiple times on several nodes while being updated frequently.
- The only configuration needed to add a new dataframe to the system is a unique string name.
- No configuration or operation is needed when a new node is added, or when a node crashes and is restored.
- Zero configuration and zero operation make it easy to scale up in the cloud.
- Coordination via zookeeper.
- File synchronization via HTTP POST (a rough flow is sketched below).
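The flow could be pictured roughly as follows. This is an illustrative sketch only, not pullframe's actual internals: the znode path, the `/pull` endpoint, the version-marker file, and the `.h5` naming are all assumptions made for the illustration.

```python
from pathlib import Path

import requests
from kazoo.client import KazooClient

def pull_if_stale(name: str, peer: str, directory: str) -> Path:
    """Pull `name` from a peer only when the local cache is out of date."""
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    try:
        # Coordination: zookeeper holds the latest version of each frame.
        remote_version, _ = zk.get(f"/pullframe/{name}")  # hypothetical znode path
    finally:
        zk.stop()

    cached = Path(directory) / f"{name}.h5"
    marker = Path(directory) / f"{name}.version"  # hypothetical version marker
    local_version = marker.read_bytes() if marker.exists() else b""

    if local_version != remote_version:
        # Synchronization: fetch the file from a peer via HTTP POST.
        resp = requests.post(f"http://{peer}/pull", json={"name": name})
        cached.write_bytes(resp.content)
        marker.write_bytes(remote_version)

    # After the sync, every read is served from the local file: locality 1.
    return cached
```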
Start the sender so other nodes can pull files from this node:

$ uvicorn pullframe.sender:app
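To make the sender reachable from other nodes, you would typically bind it to an external interface; the port below is an arbitrary choice:

$ uvicorn pullframe.sender:app --host 0.0.0.0 --port 8000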
from pullframe import pullframe

with pullframe(hosts, directory, sync_timeout=60.0) as pf:
    # set start to None to load from the very beginning
    # set end to None to load until the very end
    df = pf.load(name, start, end)  # start, end: Optional[datetime]
    pf.save(name, df)
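Putting it together, a fuller example might look like this. The host addresses, cache directory, and frame name are placeholders; only `pullframe`, `save`, and `load` come from the snippet above.

```python
from datetime import datetime

import pandas as pd

from pullframe import pullframe

hosts = ["node-a:8000", "node-b:8000"]  # placeholder sender addresses
directory = "/var/cache/pullframe"      # placeholder local cache directory

# pullframe requires a datetime index (see the prerequisites below).
df = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

with pullframe(hosts, directory, sync_timeout=60.0) as pf:
    pf.save("prices", df)

    # None on either side means "unbounded" on that side.
    recent = pf.load("prices", start=datetime(2020, 1, 2), end=None)
```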
TODO:
- Check cache discrepancy/corruption between nodes.
- Stable backup using Amazon S3 / Google Cloud Storage.
- Replace the zookeeper client with zake (a fake kazoo client) during tests (see the sketch after this list).
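As a sketch of that last item: zake ships a `FakeClient` that mirrors kazoo's `KazooClient` interface, so tests could swap it in without a live zookeeper server. The znode path and payload here are made up for the illustration.

```python
from zake import fake_client

# Drop-in stand-in for kazoo.client.KazooClient; no zookeeper server needed.
client = fake_client.FakeClient()
client.start()

client.ensure_path("/pullframe/prices")        # hypothetical znode path
client.set("/pullframe/prices", b"version-1")  # hypothetical payload
data, _stat = client.get("/pullframe/prices")
assert data == b"version-1"

client.stop()
```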
Prerequisites:
- zookeeper
- The dataframe's index must be a datetime index (a conversion example follows this list).
- linux
- python>=3.7
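If an existing frame's index is not already datetime, convert it before saving. This is plain pandas, not a pullframe API:

```python
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0]}, index=["2020-01-01", "2020-01-02"])

# pullframe requires a datetime index; convert string labels up front.
df.index = pd.to_datetime(df.index)
assert isinstance(df.index, pd.DatetimeIndex)
```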
Dependencies:
- python = "^3.7"
- pandas = "^1.0.0"
- tables = "^3.6.1"
- fastapi = "^0.58.0"
- aiofiles = "^0.5.0"
- kazoo = "^2.7.0"
Credits:
- This package was created with Cookiecutter and adapted from the audreyr/cookiecutter-pypackage project template.