Inefficient ProxyDatabase slicing #20

@DominikStiller

Description

Operations on a ProxyDatabase that loop over all records slow down as they progress, i.e. the iteration time (as measured by tqdm's it/s) is not constant. This is especially noticeable for large databases. Take slicing, for example:

cfr/cfr/proxy.py, lines 1730 to 1746 in a99c13b:

```python
def slice(self, timespan):
    ''' Slice the records in the proxy database.

    Args:
        timespan (tuple or list):
            The list of time points for slicing, whose length must be even.
            When there are n time points, the output Series includes n/2 segments.
            For example, if timespan = [a, b], then the sliced output includes one segment [a, b];
            if timespan = [a, b, c, d], then the sliced output includes segment [a, b] and segment [c, d].
    '''
    new = ProxyDatabase()
    for pid, pobj in tqdm(self.records.items(), total=self.nrec, desc='Slicing ProxyRecord'):
        spobj = pobj.slice(timespan=timespan)
        new += spobj
    new.refresh()
    return new
```

The operation new += spobj calls .refresh() on every iteration, and .refresh() itself iterates over all proxies already added. The loop is therefore O(n^2) in the number of records, while it could easily be O(n) if the sliced records were collected in a dictionary first and only converted to a ProxyDatabase at the end. Is there a reason this is not being done?
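A minimal sketch of the dict-first approach described above. The ProxyRecord and ProxyDatabase classes below are simplified stand-ins written for illustration (single-segment slicing, a refresh() that only recounts records), not cfr's actual implementation:

```python
class ProxyRecord:
    # Stand-in for cfr's ProxyRecord: just an ID and a list of time points.
    def __init__(self, pid, time):
        self.pid = pid
        self.time = time

    def slice(self, timespan):
        # Simplified: a single [a, b] segment rather than n/2 segments.
        a, b = timespan
        return ProxyRecord(self.pid, [t for t in self.time if a <= t <= b])


class ProxyDatabase:
    # Stand-in for cfr's ProxyDatabase, backed by a dict of pid -> record.
    def __init__(self, records=None):
        self.records = records if records is not None else {}
        self.nrec = len(self.records)

    def refresh(self):
        # Stand-in for the real refresh(), which iterates over all records.
        self.nrec = len(self.records)

    def slice(self, timespan):
        # Collect sliced records in a plain dict first: O(1) per record,
        # so the whole loop is O(n) instead of O(n^2).
        sliced = {}
        for pid, pobj in self.records.items():
            sliced[pid] = pobj.slice(timespan=timespan)
        new = ProxyDatabase(records=sliced)
        new.refresh()  # refresh once, at the end
        return new
```

The key change is that .refresh() runs once after the loop instead of once per record, which removes the quadratic term while producing the same final database.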
