Track downloads statistics on timescaledb #4979

Open · wants to merge 16 commits into master
Conversation


@jonatas jonatas commented Aug 23, 2024

This is the second part of #4642.

It adds the downloads and log_downloads tables to the TimescaleDB database.

  1. The downloads table represents gem_downloads, but with flat results: every download is a row in the database.
  2. LogDownload mimics LogTicket and tracks which logs have already been processed into the TimescaleDB database.
  3. A maintenance task keeps picking pending logs to be processed.

The current code duplicates LogTicket, and FastlyLogDownloadsProcessor mimics the previous processor. I think we can live with both until we build trust and can guarantee all statistics work as expected.

TODO

  • Add task to migrate data from LogTicket to LogDownloads
  • Test this PR with timescaledb disabled and no timescaledb database running
  • Make sure that when timescaledb is disabled, no background jobs or maintenance tasks are scheduled

How it works

Here is the final Download class:

```ruby
class Download < DownloadsDB
  extend Timescaledb::ActsAsHypertable
  include Timescaledb::ContinuousAggregatesHelper

  acts_as_hypertable time_column: 'created_at', segment_by: [:gem_name, :gem_version]

  scope :total_downloads, -> { select("count(*) as downloads").order(:created_at) }
  scope :downloads_by_gem, -> { select("gem_name, count(*) as downloads").group(:gem_name).order(:created_at) }
  scope :downloads_by_version, -> { select("gem_name, gem_version, count(*) as downloads").group(:gem_name, :gem_version).order(:created_at) }

  continuous_aggregates(
    timeframes: [:minute, :hour, :day, :month],
    scopes: [:total_downloads, :downloads_by_gem, :downloads_by_version],
    refresh_policy: {
      minute: { start_offset: "10 minutes", end_offset: "1 minute", schedule_interval: "1 minute" },
      hour:   { start_offset: "4 hour",     end_offset: "1 hour",   schedule_interval: "1 hour" },
      day:    { start_offset: "3 day",      end_offset: "1 day",    schedule_interval: "1 day" },
      month:  { start_offset: "3 month",    end_offset: "1 day",    schedule_interval: "1 day" }
    }
  )
end
```

To use raw data, you can query directly from Download:

```ruby
Download.total_downloads
```

It's already faster than a regular table because Download is a hypertable and uses partitions, so it can run a parallel scan to count faster across the partitions.
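A rough pure-Ruby illustration of why partitioning helps (this is not TimescaleDB's actual mechanics, and the sample data is made up): each time chunk can be counted independently, and the partial counts are summed.

```ruby
# Hypothetical sample: one Time entry per download.
downloads = [
  Time.utc(2024, 4, 26, 0, 10), Time.utc(2024, 4, 26, 0, 11),
  Time.utc(2024, 4, 27, 3, 0),  Time.utc(2024, 4, 28, 9, 30)
]

# Partition rows into daily "chunks", the way a hypertable splits rows by time.
chunks = downloads.group_by { |t| [t.year, t.month, t.day] }

# Count each chunk independently (the database can parallelize this step),
# then sum the partial counts.
threads = chunks.map { |_day, rows| Thread.new { rows.size } }
total = threads.sum(&:value)
# total == 4, computed from 3 independent per-day counts
```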

The macro:

```ruby
continuous_aggregates(
  timeframes: [:minute, :hour, :day, :month],
  scopes: [:total_downloads, :downloads_by_gem, :downloads_by_version],
  ...
)
```

The macro automatically maps each scope into every timeframe using hierarchical continuous aggregates, which means the hourly data reuses the minute data. I originally introduced all the boilerplate here and later migrated it to the timescaledb gem. I even wrote a blog post inspired by this challenge.
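The hierarchical rollup can be sketched in plain Ruby (sample buckets and counts are made up): the hourly view aggregates the minute-level buckets instead of rescanning the raw rows.

```ruby
# Hypothetical minute-level aggregate: bucket start time => downloads in that minute.
per_minute = {
  Time.utc(2024, 4, 26, 0, 10) => 110,
  Time.utc(2024, 4, 26, 0, 11) => 1322,
  Time.utc(2024, 4, 26, 1, 5)  => 7
}

# Hierarchical rollup: the hourly view reads the minute view, not the raw rows.
per_hour = per_minute
  .group_by { |bucket, _| Time.utc(bucket.year, bucket.month, bucket.day, bucket.hour) }
  .transform_values { |pairs| pairs.sum { |_, count| count } }
# per_hour maps 00:00 => 1432 (110 + 1322) and 01:00 => 7
```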

With that, you can also query the aggregated data by timeframe through the nested classes generated by the continuous_aggregates macro:

```ruby
Download::DownloadsByVersionPerMonth.where(gem_name: "rails")
 Download::DownloadsByVersionPerMonth Load (2.9ms)  SELECT "downloads_by_version_per_month".* FROM "downloads_by_version_per_month" WHERE "downloads_by_version_per_month"."gem_name" = $1 /* loading for pp */ LIMIT $2  [["gem_name", "rails"], ["LIMIT", 11]]
=>
[#<Download::DownloadsByVersionPerMonth:0x000000013895b008
  created_at: "2024-04-01 00:00:00.000000000 +0000",
  gem_name: "rails",
  gem_version: "1.2.3.4",
  downloads: 0.6e1>,
 #<Download::DownloadsByVersionPerMonth:0x000000013fcd1410
  created_at: "2024-04-01 00:00:00.000000000 +0000",
  gem_name: "rails",
  gem_version: "6.1.7",
  downloads: 0.1e1>,
 #<Download::DownloadsByVersionPerMonth:0x000000013fcd12d0
  created_at: "2024-04-01 00:00:00.000000000 +0000",
  gem_name: "rails",
  gem_version: "7.0.2",
  downloads: 0.1e1>]
```

`TotalDownloadsPerMinute` is the most granular level, and `total_downloads` is also the simplest view:

```ruby
 Download::TotalDownloadsPerMinute.all
  Download::TotalDownloadsPerMinute Load (1.3ms)  SELECT "total_downloads_per_minute".* FROM "total_downloads_per_minute" /* loading for pp */ LIMIT $1  [["LIMIT", 11]]
=>
[#<Download::TotalDownloadsPerMinute:0x000000013d77c818 created_at: "2024-04-26 00:10:00.000000000 +0000", downloads: 110>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c6d8 created_at: "2024-04-26 00:11:00.000000000 +0000", downloads: 1322>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c598 created_at: "2024-04-26 00:12:00.000000000 +0000", downloads: 1461>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c458 created_at: "2024-04-26 00:13:00.000000000 +0000", downloads: 1150>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c318 created_at: "2024-04-26 00:14:00.000000000 +0000", downloads: 1127>,
 #<Download::TotalDownloadsPerMinute:0x000000013d77c1d8 created_at: "2024-04-26 00:15:00.000000000 +0000", downloads: 1005>]
```

How to test it

We have a maintenance task that copies the existing LogTickets to be processed again by TimescaleDB. It feeds the LogDownloads table, and the data is then processed directly in the database.

To process a log file and see it in action directly from the command line:

```ruby
log = LogDownload.create(key: "example.log", directory: ".", backend: 1)
FastlyLogDownloadsProcessor.new(log.directory, log.key).perform
```
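The processor's core job can be sketched in plain Ruby. The log line format and helper below are illustrative only, not the PR's actual parsing code (the real Fastly log format differs, and versions like "1.7.3-java" need more careful splitting): each download line yields one flat row of timestamp, gem name, and version.

```ruby
# Hypothetical log line format, for illustration only.
LINE = /(?<time>\S+) GET \/gems\/(?<full_name>.+)\.gem/

def parse_download(line)
  m = LINE.match(line) or return nil
  # Split "rails-7.0.2" into name and version at the last dash.
  name, _, version = m[:full_name].rpartition("-")
  { created_at: m[:time], gem_name: name, gem_version: version }
end

row = parse_download("2024-04-26T00:10:54Z GET /gems/rails-7.0.2.gem")
# row => {created_at: "2024-04-26T00:10:54Z", gem_name: "rails", gem_version: "7.0.2"}
```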

In my example, here is the output from inserting 6175 records in a single transaction:

```
 Download Bulk Insert (135.9ms)  INSERT INTO "downloads" ("created_at","gem_name","gem_version","payload") VALUES ('2024-04-26 00:10:54', 'racc', '1.7.3-java', '{"env":{"bundler":"2.5.9","rubygems":"3.3.25","ruby":"3.1.0"}}'), ('2024-04-26 00:10:54', 'aws-sdk-core', '3.193.0', '{"env":{"RubyGems":"3.1.4","x86_64":"linux","Ruby":"2.7.2"}}'), ('2024-04-26 00:10:54', 'regexp_parser', '2.9.0', '{"env":{"bundler":"2.5.9","rubygems":"3.3.25","ruby":"3.1.0"}}'),
('2024-04-26 00:15:52', 'tty-prompt', '0.23.1', '{"env":{"RubyGems":"3.2.33","aarch64":"linux","Ruby":"3.0.6"}}'),
('2024-04-26 00:15:52', 'metamagic', '1.0.1', '{"env":{"Mozilla":"5.0","Chrome":"87.0.4280.88","Safari":"537.36"}}'),
('2024-04-26 00:15:52', 'grpc', '1.60.0-x86_64-linux', '{"env":{"RubyGems":"3.2.33","x86_64":"linux","Ruby":"3.0.6"}}'),
('2024-04-26 00:15:52', 'babel-transpiler', '0.7.0', '{"env":{"RubyGems":"3.2.32","x86_64":"linux","Ruby":"3.0.3"}}'),
('2024-04-26 00:15:52', 'rufus-scheduler', '3.9.1', '{"env":{"RubyGems":"3.2.33","x86_64":"linux","Ruby":"3.0.6"}}'),
 ....
ON CONFLICT DO NOTHING
# => 6175
```

Migrating data

A maintenance task can be executed to import the actual data from LogTicket.

```shell
bundle exec maintenance_tasks perform Maintenance::BackfillLogTicketsToTimescaleDownloadsTask
```

I haven't tested it because I don't have access to production data, so this step still needs to be validated.

Refreshing continuous aggregates

By default, the refresh happens in the background. After the backfill, we'll need a single manual refresh of the continuous aggregates, which is also useful for testing.

Here's a shortcut to refresh all at once:

```ruby
Download.refresh_aggregates
```
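Under the hood, a "refresh all" helper boils down to one TimescaleDB `refresh_continuous_aggregate` call per view. The sketch below only builds those statements as strings (the view names follow the per-timeframe naming shown earlier; the helper name and NULL window bounds, meaning "refresh everything", are illustrative assumptions, not the gem's actual implementation):

```ruby
# Views generated for the :total_downloads scope across the four timeframes.
VIEWS = %w[
  total_downloads_per_minute total_downloads_per_hour
  total_downloads_per_day total_downloads_per_month
]

# Illustrative helper: build the CALL statement for each continuous aggregate.
# NULL, NULL asks TimescaleDB to refresh the whole time range of the view.
def refresh_statements(views)
  views.map { |v| "CALL refresh_continuous_aggregate('#{v}', NULL, NULL);" }
end

statements = refresh_statements(VIEWS)
# statements.first ==
#   "CALL refresh_continuous_aggregate('total_downloads_per_minute', NULL, NULL);"
```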

@jonatas jonatas changed the title setup downloads on timescaledb Setup downloads on timescaledb Aug 23, 2024
@simi simi self-requested a review August 28, 2024 19:42
@jonatas jonatas force-pushed the setup-downloads-on-timescaledb branch from 23f058c to 11dfd76 Compare December 19, 2024 20:40
@jonatas jonatas changed the title Setup downloads on timescaledb Track downloads statistics on timescaledb Dec 20, 2024
@jonatas jonatas force-pushed the setup-downloads-on-timescaledb branch from 11dfd76 to eb6fcc7 Compare December 20, 2024 17:24
@jonatas jonatas marked this pull request as ready for review December 20, 2024 17:25
codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 22.40000% with 97 lines in your changes missing coverage. Please review.

Project coverage is 20.19%. Comparing base (e28ea63) to head (eb6fcc7).

Files with missing lines Patch % Lines
app/jobs/fastly_log_downloads_processor.rb 34.00% 33 Missing ⚠️
app/models/log_download.rb 0.00% 26 Missing ⚠️
...ce/backfill_log_downloads_from_log_tickets_task.rb 0.00% 15 Missing ⚠️
...ackfill_log_tickets_to_timescale_downloads_task.rb 0.00% 10 Missing ⚠️
app/models/download.rb 12.50% 7 Missing ⚠️
lib/shoryuken/sqs_worker.rb 0.00% 5 Missing ⚠️
app/jobs/fastly_log_downloads_processor_job.rb 85.71% 1 Missing ⚠️
Additional details and impacted files
```
@@             Coverage Diff             @@
##           master    #4979       +/-   ##
===========================================
- Coverage   97.13%   20.19%   -76.95%
===========================================
  Files         457      465        +8
  Lines        9567    11693     +2126
===========================================
- Hits         9293     2361     -6932
- Misses        274     9332     +9058
```

