Usage
The main purpose of this package is to speed up package imports by loading code objects directly from a memory map (a.k.a. heap archive, or just archive), instead of compiling source code (or deserializing a pyc file) on demand every time.
The archive is significantly larger than the .py files themselves, so we limit its size by saving only the (frequently) used packages, rather than saving everything installed and spending even more time on IO.
We separate the workflow into three roles:
- tracer, which records imported packages to a text file (package list);
- dumper, which creates the archive based on given package list;
- replayer, which will try to load packages from the archive.
The tracer and replayer are normal python instances with environment variables specified, and are expected to run python workloads as usual, without modifying users' code. One thing to pay attention to: they should run the same workload, so that more package imports hit the archive and performance improves.
The dumper is a python script (or a one-line statement passed to python -c) that exits after creating the archive.
The reason we split the tracer and the dumper is mainly about devops: imagine we have a long-running, distributed python script as the workload. We can start one (or more) tracer instance(s) to get the package list, then, after a period of time, generate one single archive based on the package list(s) and run new instances as replayers.
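When several tracers each write their own package list, the lists have to be combined before the dumper runs. The following is only a minimal sketch: it assumes each list is a plain text file with one entry per line (the text above only says it is a text file) and that duplicate entries can be dropped safely. `merge_package_lists` is a hypothetical helper, not part of this package.

```python
from pathlib import Path


def merge_package_lists(list_paths, merged_path):
    """Concatenate package lists, dropping blank lines and duplicates.

    Assumes one entry per line; first-seen order is preserved.
    """
    seen = set()
    merged = []
    for path in list_paths:
        for line in Path(path).read_text().splitlines():
            entry = line.strip()
            if entry and entry not in seen:
                seen.add(entry)
                merged.append(entry)
    # Write the combined list back out, one entry per line.
    Path(merged_path).write_text("".join(entry + "\n" for entry in merged))
    return merged
```

The merged file could then be passed to the dumper as its single package list.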
Start a tracer, which writes the names of imported packages to <package list>. The file is opened with mode 'a+' (append to the end, create if it does not exist):
PYCDSMODE=TRACE PYCDSLIST=<package list> python <python args>
e.g.
PYCDSMODE=TRACE PYCDSLIST=pkg.lst python -c "import json"
PYCDSMODE=TRACE PYCDSLIST=pkg.lst python -m http.server
PYCDSMODE=TRACE PYCDSLIST=pkg.lst python script.py
Create an archive named <heap archive>, containing the code objects of the packages in <package list>:
python -c 'import cds.dump; cds.dump.run_dump(<package list>, <heap archive>)'
e.g.
python -c 'import cds.dump; cds.dump.run_dump("pkg.lst", "pkg.img")'
Start a replayer which will try to import packages from the archive:
PYCDSMODE=SHARE PYCDSARCHIVE=<heap archive> python <python args>
e.g.
PYCDSMODE=SHARE PYCDSARCHIVE=pkg.img python -c "import json"
PYCDSMODE=SHARE PYCDSARCHIVE=pkg.img python -m http.server
PYCDSMODE=SHARE PYCDSARCHIVE=pkg.img python script.py
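The three steps above (trace, dump, replay) can also be driven from one Python script. This is a sketch under assumptions rather than a tool shipped with the package: `build_steps` and `run_steps` are hypothetical names, and only the environment variables and the `cds.dump.run_dump` call come from the commands shown above. Inherited `PYCDS*` variables are dropped so the dump step runs without any mode set, as in the example command.

```python
import os
import subprocess
import sys


def build_steps(package_list, archive, python_args):
    """Return (extra_env, argv) pairs for the trace, dump and replay steps."""
    dump_code = (
        "import cds.dump; "
        f"cds.dump.run_dump({package_list!r}, {archive!r})"
    )
    return [
        # Tracer: run the workload and record imported packages.
        ({"PYCDSMODE": "TRACE", "PYCDSLIST": package_list},
         [sys.executable, *python_args]),
        # Dumper: build the archive from the package list, then exit.
        ({}, [sys.executable, "-c", dump_code]),
        # Replayer: run the same workload against the archive.
        ({"PYCDSMODE": "SHARE", "PYCDSARCHIVE": archive},
         [sys.executable, *python_args]),
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    # Start from the current environment minus any inherited PYCDS* settings.
    base_env = {k: v for k, v in os.environ.items()
                if not k.startswith("PYCDS")}
    for extra_env, argv in steps:
        subprocess.run(argv, env={**base_env, **extra_env}, check=True)
```

With the package installed, `run_steps(build_steps("pkg.lst", "pkg.img", ["-c", "import json"]))` would correspond to the first example of each step above.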
To use the underlying heap archive directly, rather than automatically sharing the code objects of packages, specify PYCDSMODE=MANUALLY and use the following APIs:
- _cds._create_archive(archive: str): Create an archive for writing. This cannot be called multiple times.
- _cds._move_in(o: object): After an archive has been created by the current python instance, copy a python object to the archive. Supported types: None, bool, int, float, complex, str, bytes, tuple, frozenset (will be converted to tuple).
- _cds._load_archive(archive: str): Load an archive for reading.
- _cds._get_obj(): Get the archived object.
e.g.
export PYCDSMODE=MANUALLY
python -c 'import _cds; _cds._create_archive("test.img"); _cds._move_in(())'
python -c 'import _cds; _cds._load_archive("test.img"); print(_cds._get_obj())' # print: ()
python -c 'import _cds; _cds._create_archive("test.img"); _cds._move_in({})' # _cds.CDSException: Trying to move in a dict object
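Because `_cds._move_in` rejects unsupported objects such as a dict, it can help to check an object in plain Python first. The sketch below mirrors the supported-types list above. `is_archivable` is a hypothetical helper, and two points are assumptions the text does not confirm: that the elements of a tuple or frozenset must themselves be supported, and that subclasses of the listed types are not accepted.

```python
# Leaf types listed as supported by _cds._move_in in the text above.
_SUPPORTED_LEAVES = (type(None), bool, int, float, complex, str, bytes)


def is_archivable(obj):
    """Return True if obj is built only from the listed supported types.

    Assumption: container elements must be supported too, and exact
    types are required (subclasses are treated as unsupported).
    """
    if type(obj) in _SUPPORTED_LEAVES:
        return True
    if type(obj) in (tuple, frozenset):
        return all(is_archivable(item) for item in obj)
    return False
```

For example, `is_archivable(())` is true and `is_archivable({})` is false, matching the two `_move_in` calls shown above.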