Speed Improvements #71

Open
cdgriffith opened this issue May 12, 2024 · 6 comments

Comments

cdgriffith (Owner) commented May 12, 2024

Talk about ideas to make PureMagic faster!

Initial thoughts:

  • How much does JSON parsing slow us down? (Embedding the data directly in code appears to be a large speedup for repeated initialization, possibly 30%.)
  • How much does linear iteration vs. a graph-based lookup slow us down?
  • Are namedtuples the fastest way to store the data internally?

Optimizations in progress:

  • Remove the max-header-length calculation that iterates through all the data on startup; provide a global integer instead. (~0.4% speedup)
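The JSON question above can be micro-benchmarked roughly like this. This is a sketch, not PureMagic's actual loader: the sample rows and the `magic_data` module name are made up for illustration. It compares re-parsing a JSON payload on every initialization against data embedded as a Python literal, which is compiled once and then served from the `sys.modules` cache on every later import.

```python
import importlib
import json
import sys
import timeit
import types

# Made-up sample rows standing in for the signature data
rows = [["89504e470d0a1a0a", 0, ".png", "image/png", "PNG image"]] * 500
payload = json.dumps(rows)

# Cost of re-parsing the JSON on every initialization
json_time = timeit.timeit(lambda: json.loads(payload), number=1000)

# Data embedded as a Python literal: compiled once, after which every
# "import" is just a sys.modules cache hit
mod = types.ModuleType("magic_data")
exec(compile(f"DATA = {rows!r}", "<magic_data>", "exec"), mod.__dict__)
sys.modules["magic_data"] = mod
import_time = timeit.timeit(lambda: importlib.import_module("magic_data"), number=1000)

print(f"json: {json_time:.4f}s  import: {import_time:.4f}s")
```

This only isolates the parse-vs-import cost; a real comparison would also account for the one-time compile of the embedded literal.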
cdgriffith (Owner):

Quick test script that runs a lookup 1000 times to compare speeds (absolute times will vary by machine, but runs on the same machine can be compared against each other):

start=$( date +"%s.%N" )

for _ in $(seq 1 1000);
do
  python3 -m puremagic test/resources/media/test.iso > /dev/null
done

end=$( date +"%s.%N" )

python3 -c "print(${end} - ${start})"

cdgriffith (Owner):

Tested the difference between using named tuples and classes with slots for the PureMagic internal structure.

class PureMagic:
    __slots__ = ["byte_match", "offset", "extension", "mime_type", "name"]

    def __init__(self, byte_match, offset, extension, mime_type, name):
        self.byte_match = byte_match
        self.offset = offset
        self.extension = extension
        self.mime_type = mime_type
        self.name = name

    def _asdict(self):
        return {
            "byte_match": self.byte_match,
            "offset": self.offset,
            "extension": self.extension,
            "mime_type": self.mime_type,
            "name": self.name,
        }


class PureMagicWithConfidence(PureMagic):
    # "name" is already a slot on PureMagic; redeclaring it here wastes
    # space and shadows the parent descriptor, so only add "confidence"
    __slots__ = ["confidence"]

    def __init__(self, byte_match, offset, extension, mime_type, name, confidence):
        super().__init__(byte_match, offset, extension, mime_type, name)
        self.confidence = confidence

vs. the current namedtuple implementation:

from collections import namedtuple

PureMagic = namedtuple(
    "PureMagic",
    (
        "byte_match",
        "offset",
        "extension",
        "mime_type",
        "name",
    ),
)


PureMagicWithConfidence = namedtuple(
    "PureMagicWithConfidence",
    (
        "byte_match",
        "offset",
        "extension",
        "mime_type",
        "name",
        "confidence",
    ),
)

Named tuples still win: 42.329 seconds vs. 43.922 seconds for the slotted classes.
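At a finer grain, attribute access on the two representations can be timed in-process with `timeit`. This is a sketch with made-up field values; which one wins, and by how much, varies by interpreter version and machine.

```python
import timeit
from collections import namedtuple

NT = namedtuple("NT", ["byte_match", "offset"])

class Slotted:
    __slots__ = ["byte_match", "offset"]

    def __init__(self, byte_match, offset):
        self.byte_match = byte_match
        self.offset = offset

nt = NT(b"\x89PNG", 0)
sl = Slotted(b"\x89PNG", 0)

# Time a million attribute reads on each representation
nt_time = timeit.timeit(lambda: nt.byte_match, number=1_000_000)
sl_time = timeit.timeit(lambda: sl.byte_match, number=1_000_000)
print(f"namedtuple: {nt_time:.3f}s  slots: {sl_time:.3f}s")
```

Attribute reads are only part of the story, of course; construction cost and `_asdict()` calls would need their own measurements.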

NebularNerd (Contributor) commented May 12, 2024

I think speed-wise it's much of a muchness; modern CPUs are fast enough that there's little difference to be made.

On low-power hardware there might be a more measurable difference, say on a Pi or a low-end x86 system where sheer horsepower is lacking.

I was worried when I suggested multi-match or regex searches that we would see a noticeable increase in search times. However, on my main desktop, whatever difference there is is negligible at worst.

Would/could multi-threading the searches be another way to speed up matching? Once the data is in memory, every thread can have a go at identifying it and add to the results pool. This may benefit lower-spec systems by utilising their cores rather than sheer horsepower.
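The threaded idea could be sketched like this. The `SIGNATURES` table and `matches` helper are made up for illustration, not puremagic's API, and one caveat applies: for pure-Python byte comparisons the GIL limits how much real parallelism threads deliver, so this mainly pays off if matching ever involves work that releases the GIL (e.g. file I/O).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical signature table: (magic bytes, offset, extension) --
# not puremagic's real data
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", 0, ".png"),
    (b"GIF89a", 0, ".gif"),
    (b"%PDF", 0, ".pdf"),
]

def matches(data, sig):
    """Return the extension if the signature matches at its offset."""
    magic, offset, ext = sig
    return ext if data[offset:offset + len(magic)] == magic else None

def parallel_identify(data, workers=4):
    # Every thread checks signatures against the in-memory data;
    # all hits are collected into one results pool
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda sig: matches(data, sig), SIGNATURES)
    return [ext for ext in results if ext]

print(parallel_identify(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))
```

For a table of a few thousand small byte comparisons, the thread-pool overhead may well exceed the matching work itself, so this would need measuring before adopting.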

NebularNerd (Contributor):

A thought I just had: would switching to a monolithic file cause issues of its own once it grows beyond a certain point, from both a code-maintenance and a physical-size standpoint?

cclauss (Contributor) commented May 19, 2024

Almost all the time in the benchmark #71 (comment) above is spent restarting Python over and over again.

Once Python is launched, performance is quite fast. See 0.6 seconds for 74 string and file tests:
% python -m pytest --cov=puremagic test/

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.2.0, pluggy-1.5.0
rootdir: /home/runner/work/puremagic/puremagic
plugins: cov-5.0.0
collected 74 items

test/test_common_extensions.py .....................                     [ 28%]
test/test_main.py .....................................................  [100%]

---------- coverage: platform linux, python 3.12.3-final-0 -----------
Name                    Stmts   Miss  Cover
-------------------------------------------
puremagic/__init__.py       2      0   100%
puremagic/__main__.py       0      0   100%
puremagic/main.py         167      0   100%
-------------------------------------------
TOTAL                     169      0   100%

============================== 74 passed in 0.60s ==============================

cdgriffith (Owner):

@cclauss yes, I'm specifically targeting fast multi-run speed including full Python initialization and load.

There are many cases where this will be used from a command line and may be called repeatedly by other non-Python scripts, like the file command, and I want that to be faster.
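The interpreter-startup floor being targeted here can be measured directly. A rough sketch (numbers vary widely by machine and Python build): any one-shot CLI invocation pays at least this cost before puremagic runs a single line.

```python
import subprocess
import sys
import time

# Average wall time to launch the interpreter and do nothing,
# which is the floor for any one-shot CLI invocation
runs = 10
start = time.perf_counter()
for _ in range(runs):
    subprocess.run([sys.executable, "-c", "pass"], check=True)
startup = (time.perf_counter() - start) / runs
print(f"average interpreter startup: {startup * 1000:.1f} ms")
```

Whatever that number is on a given box, the shell benchmark above includes it 1000 times, which is why in-process optimizations like skipping JSON parsing show up so strongly there.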
