Track downloads statistics on timescaledb #4979

Open · wants to merge 16 commits into master
Conversation


@jonatas jonatas commented Aug 23, 2024

This is the second part of #4642.

It adds the downloads and log_downloads tables to the TimescaleDB database.

  1. The downloads table represents gem_downloads, but with flat results: every download is a row in the database.
  2. LogDownload mimics LogTicket and tracks which logs have already been processed into the TimescaleDB database.
  3. A maintenance task keeps picking pending logs to be processed.

The current code duplicates LogTicket, and FastlyLogDownloadsProcessor mimics the previous processor. I think we can live with both until we build trust and can guarantee all statistics work as expected.

TODO

  • Add task to migrate data from LogTicket to LogDownloads
  • Test this PR with timescaledb disabled and no timescaledb database running
  • Make sure that when timescaledb is disabled, no background jobs or maintenance tasks are scheduled

How it works

Here is the final Download class:

```ruby
class Download < DownloadsDB
  extend Timescaledb::ActsAsHypertable
  include Timescaledb::ContinuousAggregatesHelper

  acts_as_hypertable time_column: 'created_at', segment_by: [:gem_name, :gem_version]

  scope :total_downloads, -> { select("count(*) as downloads").order(:created_at) }
  scope :downloads_by_gem, -> { select("gem_name, count(*) as downloads").group(:gem_name).order(:created_at) }
  scope :downloads_by_version, -> { select("gem_name, gem_version, count(*) as downloads").group(:gem_name, :gem_version).order(:created_at) }

  continuous_aggregates(
    timeframes: [:minute, :hour, :day, :month],
    scopes: [:total_downloads, :downloads_by_gem, :downloads_by_version],
    refresh_policy: {
      minute: { start_offset: "10 minutes", end_offset: "1 minute", schedule_interval: "1 minute" },
      hour:   { start_offset: "4 hour",     end_offset: "1 hour",   schedule_interval: "1 hour" },
      day:    { start_offset: "3 day",      end_offset: "1 day",    schedule_interval: "1 day" },
      month:  { start_offset: "3 month",    end_offset: "1 day",    schedule_interval: "1 day" }
    }
  )
end
```

To use raw data, you can query directly from Download:

```ruby
Download.total_downloads
```

It's already faster than a regular table because Download is a hypertable and uses partitions, so it can run a parallel scan to count faster across the partitions.
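A rough pure-Ruby illustration of why partitioning helps (this is not TimescaleDB's actual mechanics, and the sample data is made up): each time chunk can be counted independently, and the partial counts are summed.

```ruby
# Hypothetical sample: one Time entry per download.
downloads = [
  Time.utc(2024, 4, 26, 0, 10), Time.utc(2024, 4, 26, 0, 11),
  Time.utc(2024, 4, 27, 3, 0),  Time.utc(2024, 4, 28, 9, 30)
]

# Partition rows into daily "chunks", the way a hypertable splits rows by time.
chunks = downloads.group_by { |t| [t.year, t.month, t.day] }

# Count each chunk independently (the database can parallelize this step),
# then sum the partial counts.
threads = chunks.map { |_day, rows| Thread.new { rows.size } }
total = threads.sum(&:value)
# total == 4, computed from 3 independent per-day counts
```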

The macro:

```ruby
continuous_aggregates(
  timeframes: [:minute, :hour, :day, :month],
  scopes: [:total_downloads, :downloads_by_gem, :downloads_by_version],
  ...
)
```

The macro automatically maps each scope into every timeframe using hierarchical continuous aggregates, which means the hourly data reuses the minute data. I originally introduced all the boilerplate here and later migrated it to the timescaledb gem. I even wrote a blog post inspired by this challenge.
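The hierarchical rollup can be sketched in plain Ruby (sample buckets and counts are made up): the hourly view aggregates the minute-level buckets instead of rescanning the raw rows.

```ruby
# Hypothetical minute-level aggregate: bucket start time => downloads in that minute.
per_minute = {
  Time.utc(2024, 4, 26, 0, 10) => 110,
  Time.utc(2024, 4, 26, 0, 11) => 1322,
  Time.utc(2024, 4, 26, 1, 5)  => 7
}

# Hierarchical rollup: the hourly view reads the minute view, not the raw rows.
per_hour = per_minute
  .group_by { |bucket, _| Time.utc(bucket.year, bucket.month, bucket.day, bucket.hour) }
  .transform_values { |pairs| pairs.sum { |_, count| count } }
# per_hour maps 00:00 => 1432 (110 + 1322) and 01:00 => 7
```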

With that, you can also query the aggregated data by timeframe through the nested classes generated by the continuous_aggregates macro:

```ruby
Download::DownloadsByVersionPerMonth.where(gem_name: "rails")
 Download::DownloadsByVersionPerMonth Load (2.9ms)  SELECT "downloads_by_version_per_month".* FROM "downloads_by_version_per_month" WHERE "downloads_by_version_per_month"."gem_name" = $1 /* loading for pp */ LIMIT $2  [["gem_name", "rails"], ["LIMIT", 11]]
=>
[#<Download::DownloadsByVersionPerMonth:0x000000013895b008
  created_at: "2024-04-01 00:00:00.000000000 +0000",
  gem_name: "rails",
  gem_version: "1.2.3.4",
  downloads: 0.6e1>,
 #<Download::DownloadsByVersionPerMonth:0x000000013fcd1410
  created_at: "2024-04-01 00:00:00.000000000 +0000",
  gem_name: "rails",
  gem_version: "6.1.7",
  downloads: 0.1e1>,
 #<Download::DownloadsByVersionPerMonth:0x000000013fcd12d0
  created_at: "2024-04-01 00:00:00.000000000 +0000",
  gem_name: "rails",
  gem_version: "7.0.2",
  downloads: 0.1e1>]
```

`TotalDownloadsPerMinute` is the most granular level, and `total_downloads` is also the simplest view:

```ruby
 Download::TotalDownloadsPerMinute.all
  Download::TotalDownloadsPerMinute Load (1.3ms)  SELECT "total_downloads_per_minute".* FROM "total_downloads_per_minute" /* loading for pp */ LIMIT $1  [["LIMIT", 11]]
=>
[#<Download::TotalDownloadsPerMinute:0x000000013d77c818 created_at: "2024-04-26 00:10:00.000000000 +0000", downloads: 110>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c6d8 created_at: "2024-04-26 00:11:00.000000000 +0000", downloads: 1322>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c598 created_at: "2024-04-26 00:12:00.000000000 +0000", downloads: 1461>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c458 created_at: "2024-04-26 00:13:00.000000000 +0000", downloads: 1150>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c318 created_at: "2024-04-26 00:14:00.000000000 +0000", downloads: 1127>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c1d8 created_at: "2024-04-26 00:15:00.000000000 +0000", downloads: 1005>]
```

How to test it

We have a maintenance task that copies the existing LogTickets to be processed again by TimescaleDB. It feeds the LogDownloads table, and the data is then processed directly in the database.

To process a log file and see it in action directly from the command line:

```ruby
log = LogDownload.create(key: "example.log", directory: ".", backend: 1)
FastlyLogDownloadsProcessor.new(log.directory, log.key).perform
```
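The processor's core job can be sketched in plain Ruby. The log line format and helper below are illustrative only, not the PR's actual parsing code (the real Fastly log format differs, and versions like "1.7.3-java" need more careful splitting): each download line yields one flat row of timestamp, gem name, and version.

```ruby
# Hypothetical log line format, for illustration only.
LINE = /(?<time>\S+) GET \/gems\/(?<full_name>.+)\.gem/

def parse_download(line)
  m = LINE.match(line) or return nil
  # Split "rails-7.0.2" into name and version at the last dash.
  name, _, version = m[:full_name].rpartition("-")
  { created_at: m[:time], gem_name: name, gem_version: version }
end

row = parse_download("2024-04-26T00:10:54Z GET /gems/rails-7.0.2.gem")
# row => {created_at: "2024-04-26T00:10:54Z", gem_name: "rails", gem_version: "7.0.2"}
```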

In my example, here is the output from inserting 6175 records in a single transaction:

```
 Download Bulk Insert (135.9ms)  INSERT INTO "downloads" ("created_at","gem_name","gem_version","payload") VALUES ('2024-04-26 00:10:54', 'racc', '1.7.3-java', '{"env":{"bundler":"2.5.9","rubygems":"3.3.25","ruby":"3.1.0"}}'), ('2024-04-26 00:10:54', 'aws-sdk-core', '3.193.0', '{"env":{"RubyGems":"3.1.4","x86_64":"linux","Ruby":"2.7.2"}}'), ('2024-04-26 00:10:54', 'regexp_parser', '2.9.0', '{"env":{"bundler":"2.5.9","rubygems":"3.3.25","ruby":"3.1.0"}}'),
('2024-04-26 00:15:52', 'tty-prompt', '0.23.1', '{"env":{"RubyGems":"3.2.33","aarch64":"linux","Ruby":"3.0.6"}}'),
('2024-04-26 00:15:52', 'metamagic', '1.0.1', '{"env":{"Mozilla":"5.0","Chrome":"87.0.4280.88","Safari":"537.36"}}'),
('2024-04-26 00:15:52', 'grpc', '1.60.0-x86_64-linux', '{"env":{"RubyGems":"3.2.33","x86_64":"linux","Ruby":"3.0.6"}}'),
('2024-04-26 00:15:52', 'babel-transpiler', '0.7.0', '{"env":{"RubyGems":"3.2.32","x86_64":"linux","Ruby":"3.0.3"}}'),
('2024-04-26 00:15:52', 'rufus-scheduler', '3.9.1', '{"env":{"RubyGems":"3.2.33","x86_64":"linux","Ruby":"3.0.6"}}'),
 ....
ON CONFLICT DO NOTHING
# => 6175
```

Migrating data

A maintenance task can be executed to import the actual data from LogTicket.

```shell
bundle exec maintenance_tasks perform Maintenance::BackfillLogTicketsToTimescaleDownloadsTask
```

I haven't tested it because I don't have access to production data, so this step still needs to be validated.

Refreshing continuous aggregates

By default, the refresh happens in the background. After the backfill, we'll need a single manual refresh of the continuous aggregates, which is also useful for testing.

Here's a shortcut to refresh all at once:

```ruby
Download.refresh_aggregates
```
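Under the hood, a "refresh all" helper boils down to one TimescaleDB `refresh_continuous_aggregate` call per view. The sketch below only builds those statements as strings (the view names follow the per-timeframe naming shown earlier; the helper name and NULL window bounds, meaning "refresh everything", are illustrative assumptions, not the gem's actual implementation):

```ruby
# Views generated for the :total_downloads scope across the four timeframes.
VIEWS = %w[
  total_downloads_per_minute total_downloads_per_hour
  total_downloads_per_day total_downloads_per_month
]

# Illustrative helper: build the CALL statement for each continuous aggregate.
# NULL, NULL asks TimescaleDB to refresh the whole time range of the view.
def refresh_statements(views)
  views.map { |v| "CALL refresh_continuous_aggregate('#{v}', NULL, NULL);" }
end

statements = refresh_statements(VIEWS)
# statements.first ==
#   "CALL refresh_continuous_aggregate('total_downloads_per_minute', NULL, NULL);"
```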

@jonatas jonatas changed the title setup downloads on timescaledb Setup downloads on timescaledb Aug 23, 2024
@simi simi self-requested a review August 28, 2024 19:42
@jonatas jonatas force-pushed the setup-downloads-on-timescaledb branch from 23f058c to 11dfd76 Compare December 19, 2024 20:40
@jonatas jonatas changed the title Setup downloads on timescaledb Track downloads statistics on timescaledb Dec 20, 2024
@jonatas jonatas force-pushed the setup-downloads-on-timescaledb branch from 11dfd76 to eb6fcc7 Compare December 20, 2024 17:24
@jonatas jonatas marked this pull request as ready for review December 20, 2024 17:25
codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 22.40000% with 97 lines in your changes missing coverage. Please review.

Project coverage is 20.19%. Comparing base (e28ea63) to head (eb6fcc7).

Files with missing lines Patch % Lines
app/jobs/fastly_log_downloads_processor.rb 34.00% 33 Missing ⚠️
app/models/log_download.rb 0.00% 26 Missing ⚠️
...ce/backfill_log_downloads_from_log_tickets_task.rb 0.00% 15 Missing ⚠️
...ackfill_log_tickets_to_timescale_downloads_task.rb 0.00% 10 Missing ⚠️
app/models/download.rb 12.50% 7 Missing ⚠️
lib/shoryuken/sqs_worker.rb 0.00% 5 Missing ⚠️
app/jobs/fastly_log_downloads_processor_job.rb 85.71% 1 Missing ⚠️
Additional details and impacted files
```
@@             Coverage Diff             @@
##           master    #4979       +/-   ##
===========================================
- Coverage   97.13%   20.19%   -76.95%
===========================================
  Files         457      465        +8
  Lines        9567    11693     +2126
===========================================
- Hits         9293     2361     -6932
- Misses        274     9332     +9058
```

