Performance issues when using api/resources/filter_by_checksums #551

Open
@JonoYang

On certain large purldb instances, calls to the api/resources/filter_by_checksums endpoint from the scancode.io map_deploy_to_develop pipeline make the match_to_purldb_resource step very slow: it can take more than 30 hours to complete.

After debugging, we found that the two biggest reasons for the slowness are:

  • Ordering of Resources: a lot of CPU time is spent ordering the resources returned by a query
  • Decoding large JSON fields: a lot of time is spent parsing JSON fields when they are very large, such as the history field on Package

Immediate solutions that come to mind:

  • Remove ordering for Resources
  • Create a proper History model for Package. The expedient fix would be to empty the history JSON field. Also look into using .only() on queries so that unneeded large fields are not loaded.
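
To make the two ideas above concrete, here is a minimal sketch using only Python's standard-library sqlite3 module rather than the actual purldb Django models. The table layout, the column names, and the `filter_by_checksums` helper are hypothetical simplifications for illustration. The sketch shows a checksum lookup that adds no ORDER BY and never selects the large history column, so nothing is sorted and the JSON is never decoded.

```python
import json
import sqlite3

# Hypothetical, simplified schema; the real purldb models have many more fields.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE package (id INTEGER PRIMARY KEY, purl TEXT, history TEXT);
    CREATE TABLE resource (
        id INTEGER PRIMARY KEY,
        package_id INTEGER REFERENCES package(id),
        path TEXT,
        sha1 TEXT
    );
    CREATE INDEX resource_sha1_idx ON resource (sha1);
    """
)

# A large JSON blob standing in for the Package history field.
big_history = json.dumps([{"message": f"entry {i}"} for i in range(1000)])
conn.execute(
    "INSERT INTO package VALUES (1, 'pkg:generic/example@1.0', ?)",
    (big_history,),
)
conn.executemany(
    "INSERT INTO resource (package_id, path, sha1) VALUES (1, ?, ?)",
    [("src/a.c", "aaa"), ("src/b.c", "bbb"), ("src/c.c", "ccc")],
)


def filter_by_checksums(connection, sha1s):
    """Return (path, sha1, purl) rows matching the given SHA1 checksums.

    The query has no ORDER BY and does not select package.history,
    so rows are not sorted and the large JSON text is never read or parsed.
    """
    sha1s = list(sha1s)
    if not sha1s:
        return []
    placeholders = ", ".join("?" for _ in sha1s)
    query = (
        "SELECT r.path, r.sha1, p.purl "
        "FROM resource r JOIN package p ON p.id = r.package_id "
        f"WHERE r.sha1 IN ({placeholders})"
    )
    return connection.execute(query, sha1s).fetchall()


matches = filter_by_checksums(conn, ["aaa", "ccc"])
```

In Django ORM terms, the rough equivalents would be calling `.order_by()` with no arguments to clear any default ordering, and `.only(...)` or `.defer("history")` to avoid loading the large field. Whether that alone is enough for this endpoint depends on which fields its serializers access, which this sketch does not cover.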
