Speed Improvements #71

Open
cdgriffith opened this issue May 12, 2024 · 6 comments

Comments

cdgriffith (Owner) commented May 12, 2024

Talk about ideas to make PureMagic faster!

Initial thoughts:

  • How much does JSON parsing slow us down? (Embedding the data directly in code appears to be a large speedup for repeated initialization, possibly 30%.)
  • How much does linear iteration vs. a graph-based lookup slow us down?
  • Are namedtuples the fastest way to store the data internally?

Optimizations in progress:

  • Remove the max-header-length calculation that iterates through all the data on startup; provide a global integer instead. (~0.4% speedup)
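The JSON question above can be micro-benchmarked roughly like this. This is a sketch, not PureMagic's actual loader: the sample rows and the `magic_data` module name are made up for illustration. It compares re-parsing a JSON payload on every initialization against data embedded as a Python literal, which is compiled once and then served from the `sys.modules` cache on every later import.

```python
import importlib
import json
import sys
import timeit
import types

# Made-up sample rows standing in for the signature data
rows = [["89504e470d0a1a0a", 0, ".png", "image/png", "PNG image"]] * 500
payload = json.dumps(rows)

# Cost of re-parsing the JSON on every initialization
json_time = timeit.timeit(lambda: json.loads(payload), number=1000)

# Data embedded as a Python literal: compiled once, after which every
# "import" is just a sys.modules cache hit
mod = types.ModuleType("magic_data")
exec(compile(f"DATA = {rows!r}", "<magic_data>", "exec"), mod.__dict__)
sys.modules["magic_data"] = mod
import_time = timeit.timeit(lambda: importlib.import_module("magic_data"), number=1000)

print(f"json: {json_time:.4f}s  import: {import_time:.4f}s")
```

This only isolates the parse-vs-import cost; a real comparison would also account for the one-time compile of the embedded literal.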
cdgriffith (Owner):

Quick test script that runs a lookup 1000 times to compare speeds (absolute times will vary by machine, but runs on the same machine can be compared against each other):

start=$( date +"%s.%N" )

for _ in $(seq 1 1000);
do
  python3 -m puremagic test/resources/media/test.iso > /dev/null
done

end=$( date +"%s.%N" )

python3 -c "print(${end} - ${start})"

cdgriffith (Owner):

Tested the difference between using named tuples and classes with slots for the PureMagic internal structure.

class PureMagic:
    __slots__ = ["byte_match", "offset", "extension", "mime_type", "name"]

    def __init__(self, byte_match, offset, extension, mime_type, name):
        self.byte_match = byte_match
        self.offset = offset
        self.extension = extension
        self.mime_type = mime_type
        self.name = name

    def _asdict(self):
        return {
            "byte_match": self.byte_match,
            "offset": self.offset,
            "extension": self.extension,
            "mime_type": self.mime_type,
            "name": self.name,
        }


class PureMagicWithConfidence(PureMagic):
    # "name" is already a slot on PureMagic; redeclaring it here wastes
    # space and shadows the parent descriptor, so only add "confidence"
    __slots__ = ["confidence"]

    def __init__(self, byte_match, offset, extension, mime_type, name, confidence):
        super().__init__(byte_match, offset, extension, mime_type, name)
        self.confidence = confidence

vs. the current namedtuple implementation:

from collections import namedtuple

PureMagic = namedtuple(
    "PureMagic",
    (
        "byte_match",
        "offset",
        "extension",
        "mime_type",
        "name",
    ),
)


PureMagicWithConfidence = namedtuple(
    "PureMagicWithConfidence",
    (
        "byte_match",
        "offset",
        "extension",
        "mime_type",
        "name",
        "confidence",
    ),
)

Named tuples still win: 42.329 seconds vs. 43.922 seconds for the slotted classes.
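At a finer grain, attribute access on the two representations can be timed in-process with `timeit`. This is a sketch with made-up field values; which one wins, and by how much, varies by interpreter version and machine.

```python
import timeit
from collections import namedtuple

NT = namedtuple("NT", ["byte_match", "offset"])

class Slotted:
    __slots__ = ["byte_match", "offset"]

    def __init__(self, byte_match, offset):
        self.byte_match = byte_match
        self.offset = offset

nt = NT(b"\x89PNG", 0)
sl = Slotted(b"\x89PNG", 0)

# Time a million attribute reads on each representation
nt_time = timeit.timeit(lambda: nt.byte_match, number=1_000_000)
sl_time = timeit.timeit(lambda: sl.byte_match, number=1_000_000)
print(f"namedtuple: {nt_time:.3f}s  slots: {sl_time:.3f}s")
```

Attribute reads are only part of the story, of course; construction cost and `_asdict()` calls would need their own measurements.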

NebularNerd (Contributor) commented May 12, 2024

I think speed-wise it's much of a muchness; modern CPUs are fast enough that there's little difference to be made.

On low-power hardware there might be a more measurable difference, say on a Pi or a low-end x86 system where sheer horsepower is lacking.

I was worried when I suggested multi-match or regex searches that we would see a noticeable increase in search times. However, on my main desktop, whatever difference there is is negligible at worst.

Would/could multi-threading the searches be another way to speed up matching? Once the data is in memory, every thread can have a go at identifying it and add to the results pool. This may benefit lower-spec systems by utilising their cores rather than sheer horsepower.
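The threaded idea could be sketched like this. The `SIGNATURES` table and `matches` helper are made up for illustration, not puremagic's API, and one caveat applies: for pure-Python byte comparisons the GIL limits how much real parallelism threads deliver, so this mainly pays off if matching ever involves work that releases the GIL (e.g. file I/O).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical signature table: (magic bytes, offset, extension) --
# not puremagic's real data
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", 0, ".png"),
    (b"GIF89a", 0, ".gif"),
    (b"%PDF", 0, ".pdf"),
]

def matches(data, sig):
    """Return the extension if the signature matches at its offset."""
    magic, offset, ext = sig
    return ext if data[offset:offset + len(magic)] == magic else None

def parallel_identify(data, workers=4):
    # Every thread checks signatures against the in-memory data;
    # all hits are collected into one results pool
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda sig: matches(data, sig), SIGNATURES)
    return [ext for ext in results if ext]

print(parallel_identify(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))
```

For a table of a few thousand small byte comparisons, the thread-pool overhead may well exceed the matching work itself, so this would need measuring before adopting.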

NebularNerd (Contributor):

A thought I just had: would switching to a monolithic file cause issues of its own once it grows beyond a certain point, from both a code-maintenance and a physical-size standpoint?

cclauss (Contributor) commented May 19, 2024

Almost all the time in the benchmark #71 (comment) above is spent restarting Python over and over again.

Once Python is launched, performance is quite fast. See 0.6 seconds for 74 string and file tests:
% python -m pytest --cov=puremagic test/

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.2.0, pluggy-1.5.0
rootdir: /home/runner/work/puremagic/puremagic
plugins: cov-5.0.0
collected 74 items

test/test_common_extensions.py .....................                     [ 28%]
test/test_main.py .....................................................  [100%]

---------- coverage: platform linux, python 3.12.3-final-0 -----------
Name                    Stmts   Miss  Cover
-------------------------------------------
puremagic/__init__.py       2      0   100%
puremagic/__main__.py       0      0   100%
puremagic/main.py         167      0   100%
-------------------------------------------
TOTAL                     169      0   100%

============================== 74 passed in 0.60s ==============================

cdgriffith (Owner):

@cclauss yes, I'm specifically targeting fast multi-run speed including full Python initialization and load.

There are many cases where this will be used from a command line and may be called repeatedly by other non-Python scripts, like the file command, and I want that to be faster.
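The interpreter-startup floor being targeted here can be measured directly. A rough sketch (numbers vary widely by machine and Python build): any one-shot CLI invocation pays at least this cost before puremagic runs a single line.

```python
import subprocess
import sys
import time

# Average wall time to launch the interpreter and do nothing,
# which is the floor for any one-shot CLI invocation
runs = 10
start = time.perf_counter()
for _ in range(runs):
    subprocess.run([sys.executable, "-c", "pass"], check=True)
startup = (time.perf_counter() - start) / runs
print(f"average interpreter startup: {startup * 1000:.1f} ms")
```

Whatever that number is on a given box, the shell benchmark above includes it 1000 times, which is why in-process optimizations like skipping JSON parsing show up so strongly there.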
