Missing events #294

Open
lukaszgryglicki opened this issue Nov 20, 2023 · 7 comments

Comments

@lukaszgryglicki
Contributor

Hi, an issue was reported for DevStats. I did a full investigation and found that events are missing from the GHA JSON archives; all details are here.

The GHA archive JSON is missing this PR's opened event: there should be a PullRequestEvent with pranav-pandey0804 as the author, but the archives only contain comments and reviews. The event should be in the 2023-10-18-4 file but is not. For comparison, see my PR, which has a correct PullRequestEvent with lukaszgryglicki as the author.

Other missing events are:

  • This issue is missing the IssuesEvent (issue opened) event; it should have the same author, pranav-pandey0804.
  • This issue is missing 2 comments by pranav-pandey0804.
  • This issue is missing 3 comments by pranav-pandey0804.
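
A check like the ones above can be sketched in a few lines, assuming the hourly archive has already been downloaded as a gzip-compressed file with one JSON event per line. The helper names (`find_events`, `find_in_archive`) and the local file path are hypothetical and not part of DevStats or GH Archive tooling; the `type`, `actor.login`, and `payload.action` fields follow the GitHub event format.

```python
import gzip
import json
from typing import Iterable, Iterator, List, Optional


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def find_events(lines: Iterable[str], actor: str, event_type: str,
                action: Optional[str] = None) -> List[dict]:
    """Return events of the given type by the given actor (hypothetical helper)."""
    matches = []
    for event in iter_events(lines):
        if event.get("type") != event_type:
            continue
        if event.get("actor", {}).get("login") != actor:
            continue
        # Optionally narrow to a payload action such as "opened".
        if action is not None and event.get("payload", {}).get("action") != action:
            continue
        matches.append(event)
    return matches


def find_in_archive(path: str, actor: str, event_type: str,
                    action: Optional[str] = None) -> List[dict]:
    """Scan one locally downloaded hourly archive file (assumed gzip, one event per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return find_events(fh, actor, event_type, action)
```

For example, `find_in_archive("2023-10-18-4.json.gz", "pranav-pandey0804", "PullRequestEvent", "opened")` returning an empty list would be consistent with the missing PR-opened event described above.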

cc @pranav-pandey0804 @igrigorik

@lukaszgryglicki
Contributor Author

cc @caniszczyk

@jiagengliu

I wonder whether CNCF or the Linux Foundation has plans to take on or sponsor @igrigorik's archiving effort, since he seems to have been inactive recently.

@lukaszgryglicki
Contributor Author

I can work on this, but I need all the details about deployment(s) and permissions.
cc @caniszczyk

@igrigorik
Owner

@lukaszgryglicki would appreciate any help! Please ping me via email (see profile).

@lukaszgryglicki
Contributor Author

@igrigorik email sent.

@adityasethCSEK

Adding another case of missed events: StarWarsAdi3. This repo should have appeared in 2024-02-29-15.json but was missed.

@bored-engineer
Contributor

bored-engineer commented May 5, 2024

This is pretty easy to explain and fix: to get complete coverage, the scraper needs to fetch all 3 pages of the events API on each execution instead of just the first page of events.

To explain: the events API can return up to 100 events per page (with ?per_page=100), with a limit of 300 total events (3 pages). As far as I can tell, all 3 pages are replaced at the same moment on the GitHub side. Sometimes an event can only be found on page 2 or 3 and is never seen on page 1, the most obvious case being when there are more than 100 events in a given second. This happens relatively often (graph from the last 48 hours):

image

A naive (but functional) implementation could fetch all 3 pages at the same time once per second, then de-duplicate the returned events against those already seen, based on event ID. However, this requires 10,800 requests/hour (3 pages × 3,600 seconds), which exceeds the 5,000 requests/hour limit for a single API token. You can either switch to a GitHub App token (installed on a paid enterprise), which has a higher rate limit of 15,000, or use multiple tokens/accounts, such as one per page.
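
A minimal sketch of that naive approach, assuming Python and only the standard library; this is not the archive's actual scraper. The function names, the `seen` set, and the one-token-per-page wiring are illustrative assumptions. The page fetcher is passed in as a callable so the de-duplication logic can be exercised without network access.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set

EVENTS_URL = "https://api.github.com/events"
PAGES = 3        # the events API exposes at most 3 pages of 100 events
PER_PAGE = 100


def fetch_page(page: int, token: str) -> List[dict]:
    """Fetch one page of public events (token handling here is an assumption)."""
    req = urllib.request.Request(
        f"{EVENTS_URL}?per_page={PER_PAGE}&page={page}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def poll_once(fetch: Callable[[int], List[dict]], seen: Set[str]) -> List[dict]:
    """Fetch all pages concurrently and return only events not seen before."""
    with ThreadPoolExecutor(max_workers=PAGES) as pool:
        pages = list(pool.map(fetch, range(1, PAGES + 1)))
    fresh = []
    for events in pages:
        for event in events:
            if event["id"] not in seen:   # de-duplicate by event ID
                seen.add(event["id"])
                fresh.append(event)
    return fresh


def requests_per_hour(pages: int = PAGES, polls_per_second: float = 1.0) -> int:
    """Request budget for polling every page at the given rate."""
    return int(pages * polls_per_second * 3600)
```

With one token per page, `fetch` could be something like `lambda page: fetch_page(page, tokens[page - 1])`, where `tokens` is a hypothetical list of three tokens; each token then makes 3,600 requests/hour. A long-running scraper would also need to prune `seen`, which grows without bound in this sketch.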
