
Design Doc: Beaconing vs PhantomJS

Jeff Kaufman edited this page Jan 9, 2017 · 2 revisions

Beaconing vs PhantomJS

If we have access to a headless browser we can collect data about above-the-fold images, critical line, reflows, critical CSS, the list of resources referenced by a page, etc.

In mod_pagespeed we’ve mostly been doing the best we can without this data. We have embarked on a strategy of extracting this data from users’ browsers, which send beacons back to the server, where the data is collected and used to populate the property cache.

The advantages of beaconing are:

  • Easy to deploy: the browsers are already computing the data we want; we are just adding a post-onload request back to our servers to collect the data. No new compute servers are needed.

  • Representative of real browsers: beaconing works on mobile browsers, Safari, IE, FF, Opera, Chrome, etc., whereas a headless browser is not really a browser; it’s a different implementation of a rendering engine.

The disadvantages of beaconing are:

  • The data that comes back from the beacons, while more representative, is less consistent due to the variety of browsers and other variables. Thus we must somehow aggregate this data and use it to make intelligent decisions.

  • Some of the data might not be collectible via browser-based JS due to e.g. same-origin policy.

  • The data might be too large to send via a single GET beacon, requiring either multiple GETs or a POST, both of which have challenges in implementation.

  • The data cannot be generated reliably on demand. We need to wait for responses from end users’ browsers, and those responses may get lost. (This is not a problem in aggregate, since even a portion of the data arriving is usually sufficient.)

  • A separate implementation is needed for each new type of data collected, unless this approach can be made to work reliably on PSS.
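The aggregation problem mentioned above can be illustrated with a small voting scheme. This is only a sketch under stated assumptions, not how mod_pagespeed actually aggregates beacons: the class name `BeaconAggregator`, the window size, and the `min_support` threshold are all hypothetical. Each beacon reports the set of images one browser considered critical, and an image is treated as critical only if enough of the recent beacons agree.

```python
from collections import Counter, deque


class BeaconAggregator:
    """Hypothetical aggregator: votes over the most recent beacon reports."""

    def __init__(self, window=10, min_support=0.8):
        # Only the last `window` reports are kept, so stale data ages out.
        self.reports = deque(maxlen=window)
        # Fraction of recent reports that must agree before we trust a URL.
        self.min_support = min_support

    def add_report(self, image_urls):
        """Records the set of critical images reported by one beacon."""
        self.reports.append(frozenset(image_urls))

    def critical_images(self):
        """Returns the URLs reported by at least min_support of the window."""
        if not self.reports:
            return set()
        counts = Counter(url for report in self.reports for url in report)
        needed = self.min_support * len(self.reports)
        return {url for url, n in counts.items() if n >= needed}
```

Because lost beacons simply never enter the window, a scheme like this tolerates partial data. A single inconsistent report is also outvoted as long as the window holds several reports and `min_support` is well above one report's share.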

An alternative approach is to use PhantomJS (PJS). Drawbacks/challenges to the PhantomJS approach:

  • There is no PhantomJS “service”, so we’d need to spawn it as a subprocess, which might be very resource intensive, or build such a “service” and provide some mechanism for site owners to deploy private instances of it in their networks. Note that many mod_pagespeed users have no data centers; they run a small server in their house, use shared hosting, or rent a single virtual machine from a hosting provider. However, a private service is a distinct possibility for large partners.

  • It requires additional RAM & CPU. Note that hosting providers are already very reluctant to allow image rewriting, and starting a browser process will likely be a hard sell.

  • PJS requires a new installation & distribution flow, and adds the burden of version-compatibility & upgrade flow.

  • PJS requires a new invocation/communication/startup strategy. The existing usage model of running it as a command-line program relaying JS console output to stdout is pretty nice, but using it from the server requires creating a daemon that establishes an RPC channel and spawns PJS subprocesses to do the work, redirecting their output to the socket.

  • It is not always possible to perform HTTP fetches for resources from the HTTPD server machine, and mod_pagespeed users and developers have to map origin domains or configure short-circuiting fetches via the filesystem. Providing this layer for PJS is another project.
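The subprocess model described in the list above can be sketched as follows. This is a minimal illustration, not a proposed implementation: the script name `collect.js` is hypothetical, and the sketch assumes the script prints a single JSON document to stdout. It covers only the spawn-and-parse step and leaves out the daemon and RPC channel.

```python
import json
import subprocess


def phantomjs_argv(url, script="collect.js", binary="phantomjs"):
    """Builds a command line of the form `phantomjs script.js [args]`.

    `collect.js` is a hypothetical script assumed to print JSON to stdout.
    """
    return [binary, script, url]


def run_collector(argv, timeout_s=30.0):
    """Runs a collector subprocess and parses its stdout as one JSON document.

    Returns None if the process cannot start, times out, exits non-zero,
    or prints something that is not valid JSON.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if proc.returncode != 0:
        return None
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None
```

A caller would use `run_collector(phantomjs_argv("http://example.com/"))` and treat `None` as "no data available". The timeout matters here because each subprocess holds RAM and CPU for as long as it runs.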

Advantages:

  • We don’t have to deal with real-world variability. (On the flip side, this means being limited to one browser rendering engine.)

Note that all the consumers of the browser data get it via the property cache, and are largely agnostic to how the data gets into the pcache, so the integration with the rest of the system is straightforward. Thus in either case, there is not a huge impact on the existing filters that use rendering data.

Attack Vectors

The beacon path exposes an attack vector that must be considered. Of course, mod_pagespeed (MPS) is completely vulnerable to DoS in the same way that the origin site is; MPS is not a DoS solution. But beacons allow an attacker to inject bogus data into the pcache, inducing suboptimal decisions: critical images that aren’t, non-critical images that are, resources that don’t need to be prefetched, critical CSS that isn’t, non-critical CSS that is, etc.

One reasonable defense against such attacks might be to introduce a random unique ID (e.g. a 20-byte web64 token generated randomly), which is used to validate all incoming beacons at the server. Only valid tokens that have not already been used will be accepted. This same mechanism can also potentially be used to correlate multiple beacon GETs when the data overflows a single request. The first GET for an ID would identify how many more beacons will be sent for the aggregated data.
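The token scheme could look roughly like the sketch below. All names here (`BeaconTokenRegistry`, `issue`, `receive`, the `more` parameter) are hypothetical. The sketch assumes the token is 20 web-safe base64 characters, that part 0 arrives first and announces how many more parts follow, and that state is held in memory. Expiry of unused tokens is omitted.

```python
import secrets


class BeaconTokenRegistry:
    """Hypothetical in-memory registry of single-use beacon tokens."""

    def __init__(self):
        # token -> {"expected": total part count or None, "parts": {index: str}}
        self._pending = {}

    def issue(self):
        """Creates a token to embed in the page served to the browser."""
        # 15 random bytes encode to exactly 20 web-safe base64 characters.
        token = secrets.token_urlsafe(15)
        self._pending[token] = {"expected": None, "parts": {}}
        return token

    def receive(self, token, index, payload, more=0):
        """Accepts one beacon GET.

        Returns the joined payload once every part has arrived, else None.
        Raises ValueError for forged, replayed, or out-of-range beacons.
        """
        state = self._pending.get(token)
        if state is None:
            raise ValueError("unknown or already-used token")
        if index in state["parts"]:
            raise ValueError("duplicate beacon part")
        if index == 0:
            if more < 0:
                raise ValueError("negative part count")
            # The first GET announces how many more beacons will follow.
            state["expected"] = 1 + more
        elif state["expected"] is None or not 0 < index < state["expected"]:
            raise ValueError("unexpected beacon part")
        state["parts"][index] = payload
        if len(state["parts"]) < state["expected"]:
            return None
        # All parts are in: consume the token so it cannot be reused.
        del self._pending[token]
        return "".join(state["parts"][i] for i in range(state["expected"]))
```

Since a token is deleted once its data is complete, a replayed beacon is rejected the same way as one carrying a forged token. An attacker who loads the page can still obtain a fresh token, so this limits the amount of bogus data that can be injected rather than preventing it outright.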
