Design Doc: Flushing individual cache entries in mod_pagespeed

Joshua Marantz, 2013-03-21

Please see Ilya’s requirements spec for a REST API.

Also see Srihari’s spec for cache invalidation of URLs by pattern. Most of the heavy lifting is done here.

I am hoping that mod_pagespeed and ngx_pagespeed can share implementation, mostly. I am also hoping that the RESTful API can be compatible between MPS & NPS.

The rest of this spec muses on implementation options for url-pattern cache invalidation in MPS/NPS.

Existing UI

In mod_pagespeed, there is only support for total cache flushes.

Current infrastructural support

The metadata & HTTP caches are validated against invalidation records held in RewriteOptions, roughly:

  struct UrlCacheInvalidationEntry {
    Wildcard url_pattern;
    int64 timestamp_ms;
    bool is_strict;
  };

  vector<UrlCacheInvalidationEntry*> url_cache_invalidation_entries_;

Right now the MD-cache is entirely invalidated when is_strict is true, and is not invalidated at all for URLs with is_strict=false. That should be resolvable by a small edit to RewriteContext::IsInputValid, though that edit might be expensive. Also, if we are to support wildcards, we must add the input URL to InputInfo; if we are happy with exact matches, we can add just the hash.

Infrastructural changes

There are two new classes, PurgeSet and PurgeContext in third_party/pagespeed/kernel/cache, that are unit-tested but otherwise unused.

PurgeSet is a data structure holding a map and a global invalidation timestamp. It’s designed to bound the total amount of data stored in the map and issue global invalidations when that spills, to avoid losing invalidations. It will be integrated into RewriteOptions replacing the existing global cache invalidation timestamp and its single-use helper class MutexedOptionInt64MergeWithMax. PurgeSet merges fast because it is instantiated via CopyOnWrite<>.
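The spill behavior described above can be illustrated with a minimal sketch. This is not the actual PurgeSet code: the class name PurgeSetSketch, its method names, and the choice to bound the map by entry count rather than by bytes are all assumptions made for illustration, and the CopyOnWrite<> wrapping is omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch: a bounded url -> timestamp map plus a global
// invalidation timestamp. When the map spills, the oldest entry is folded
// into the global timestamp, so no invalidation is ever lost (at the cost
// of invalidating more than was asked for).
class PurgeSetSketch {
 public:
  explicit PurgeSetSketch(size_t max_entries)
      : max_entries_(max_entries), global_ms_(0) {}

  // Records a purge of one URL at time ts_ms.
  void Put(const std::string& url, int64_t ts_ms) {
    if (ts_ms <= global_ms_) return;  // already covered by the global flush
    int64_t& slot = by_url_[url];     // new slots are value-initialized to 0
    slot = std::max(slot, ts_ms);
    while (by_url_.size() > max_entries_) EvictOldest();
  }

  // Records a flush of the whole cache at time ts_ms.
  void UpdateGlobalInvalidation(int64_t ts_ms) {
    global_ms_ = std::max(global_ms_, ts_ms);
  }

  // True if a cache entry for url written at cached_ms is still usable.
  bool IsValid(const std::string& url, int64_t cached_ms) const {
    if (cached_ms <= global_ms_) return false;
    std::map<std::string, int64_t>::const_iterator it = by_url_.find(url);
    return it == by_url_.end() || cached_ms > it->second;
  }

  int64_t global_invalidation_ms() const { return global_ms_; }
  size_t size() const { return by_url_.size(); }

 private:
  typedef std::map<std::string, int64_t>::value_type Entry;

  // Linear scan for the oldest entry; fine for a sketch, too slow for real use.
  void EvictOldest() {
    std::map<std::string, int64_t>::iterator oldest = std::min_element(
        by_url_.begin(), by_url_.end(),
        [](const Entry& a, const Entry& b) { return a.second < b.second; });
    global_ms_ = std::max(global_ms_, oldest->second);
    by_url_.erase(oldest);
  }

  size_t max_entries_;
  int64_t global_ms_;
  std::map<std::string, int64_t> by_url_;
};
```

With a capacity of two, purging a third URL evicts the oldest purge and raises the global timestamp to that purge's time, so every cache entry written at or before it is treated as stale.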

PurgeContext manages reading/writing PurgeSets from a file (cache.flush) in a thread-safe way. It is integrated into the class SystemCachePath, and it wakes up at the beginning of every mod_pagespeed request to check the file timestamp.

The cache-purge integration CL replaces the existing invalidation timestamp.

We would always check the PurgeSet first before checking the wildcards. It might be better if we then protected the iteration through the entry-vector with a FastWildcardGroup match, on the assumption that most URLs would not match. Hopefully we can discourage the use of the wildcarded invalidations by not documenting them for mod_pagespeed.

The existing method RewriteOptions::IsUrlCacheValid is straightforward in usage and implementation. This is already called by OptionsAwareHTTPCacheCallback::IsCacheValid in rewrite_driver.cc, so the HTTP cache is already properly handled. It still needs to be added to RewriteContext::IsInputValid.
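To make the validity check concrete, here is a minimal sketch of what a wildcard-based check might look like. It is not the real RewriteOptions::IsUrlCacheValid: GlobMatch is a hypothetical stand-in for pagespeed's Wildcard class, InvalidationEntry uses a plain string pattern, the is_strict flag is carried but not interpreted, and treating an entry cached exactly at the invalidation time as stale is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for Wildcard: '*' matches any run of characters
// (including none), '?' matches exactly one character.
bool GlobMatch(const std::string& pattern, const std::string& text) {
  size_t p = 0, t = 0, mark = 0;
  size_t star = std::string::npos;  // position of the last '*' seen
  while (t < text.size()) {
    if (p < pattern.size() && pattern[p] == '*') {
      star = p++;  // tentatively let '*' match nothing
      mark = t;
    } else if (p < pattern.size() &&
               (pattern[p] == '?' || pattern[p] == text[t])) {
      ++p;
      ++t;
    } else if (star != std::string::npos) {
      p = star + 1;  // backtrack: let the last '*' absorb one more character
      t = ++mark;
    } else {
      return false;
    }
  }
  while (p < pattern.size() && pattern[p] == '*') ++p;
  return p == pattern.size();
}

// Simplified version of UrlCacheInvalidationEntry for this sketch.
struct InvalidationEntry {
  std::string url_pattern;
  int64_t timestamp_ms;
  bool is_strict;  // kept for shape only; not interpreted here
};

// True if a cache entry for url written at cached_ms survives every
// invalidation record. Assumption: entries written at or before an
// invalidation's timestamp are stale.
bool IsUrlCacheValidSketch(const std::vector<InvalidationEntry>& entries,
                           const std::string& url, int64_t cached_ms) {
  for (size_t i = 0; i < entries.size(); ++i) {
    if (cached_ms <= entries[i].timestamp_ms &&
        GlobMatch(entries[i].url_pattern, url)) {
      return false;
    }
  }
  return true;
}
```

The linear scan over every pattern is the cost that an exact-match PurgeSet lookup or a FastWildcardGroup pre-check would aim to avoid for the common case of a URL that matches nothing.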

This needs more tests, a Merge implementation, and some thinking about how to reduce the cost of merging these maps/vectors, e.g. with ref-counts and copy-on-write. The same goes for DomainLawyer, the LoadFromFile DB, and other objects owned by RewriteOptions.

mod_pagespeed integration

mod_pagespeed currently supports global cache flushing by touching a file "cache.flush" in the cache directory: https://modpagespeed.com/doc/system#flush_cache . It would be easy to add a RESTful interface to this.

The implementation is based on that file because it needs to persist across Apache restarts and be visible across child processes. Our new implementation can add optional content to that file. I propose that the file be line-based. All lines will begin with a timestamp in milliseconds, followed by one or more spaces, and either a filename or a single "*".

Note that our wildcard syntax employs "?" to match a single character, but “?” is frequently used in URLs. The presence of a “?” in the filename will not imply that it is a wildcard.

Currently "cache.flush" is never read; only its timestamp is used. Thus we can interpret an empty or malformed cache.flush file to mean “flush everything based on the timestamp of the file”.

The current polling for changes to the cache.flush file can be augmented to read in the file and, if it is non-empty, override the default behavior of "flush starting at the file’s timestamp". The file will be scanned for “*” entries. The most recent “*” timestamp will be retained, others discarded. All individual URL entries preceding a “*” invalidation will be discarded. URLs occurring multiple times will get all but the newest entry discarded.
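The resolution rules above can be sketched as a small parser. This is only an illustration under stated assumptions: the names ResolvedFlush and ResolveCacheFlush are hypothetical, "preceding" is interpreted by timestamp rather than by position in the file, and a single malformed line is taken to make the whole file malformed, which falls back to the file's own timestamp. URLs are compared as exact strings, so a "?" in a URL is never a wildcard.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical result type: one global flush time plus per-URL purge times.
struct ResolvedFlush {
  int64_t global_ms;
  std::map<std::string, int64_t> url_ms;
};

// Parses lines of the form "<timestamp_ms> <url-or-*>" and resolves them.
// An empty or malformed file means "flush everything as of file_mtime_ms".
ResolvedFlush ResolveCacheFlush(const std::string& contents,
                                int64_t file_mtime_ms) {
  ResolvedFlush fallback;
  fallback.global_ms = file_mtime_ms;
  ResolvedFlush out;
  out.global_ms = 0;

  std::istringstream lines(contents);
  std::string line;
  bool saw_entry = false;
  while (std::getline(lines, line)) {
    if (line.find_first_not_of(" \t\r") == std::string::npos) continue;
    std::istringstream fields(line);
    int64_t ts = 0;
    std::string target, extra;
    // Require exactly two fields: a non-negative timestamp and a target.
    if (!(fields >> ts >> target) || (fields >> extra) || ts < 0) {
      return fallback;
    }
    saw_entry = true;
    if (target == "*") {
      out.global_ms = std::max(out.global_ms, ts);  // keep the newest "*"
    } else {
      int64_t& slot = out.url_ms[target];
      slot = std::max(slot, ts);  // keep the newest entry per URL
    }
  }
  if (!saw_entry) return fallback;

  // Drop URL entries already covered by the newest global flush.
  for (std::map<std::string, int64_t>::iterator it = out.url_ms.begin();
       it != out.url_ms.end();) {
    if (it->second <= out.global_ms) {
      it = out.url_ms.erase(it);
    } else {
      ++it;
    }
  }
  return out;
}
```

For example, a file containing a purge of one URL at 100, a "*" at 200, and a second purge of the same URL at 300 resolves to a global timestamp of 200 and a single URL entry at 300.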

Once the file is fully read and resolved, we can populate the global rewrite options invalidation_url_timestamps_ and cache_invalidation_timestamp_, all protected by the mutex in cache_invalidation_timestamp_. Note that the global_options is concurrently read while it’s being updated, so this mutex is important.

Note: the ‘touch cache.flush’ trick is now part of some Joomla/mod_pagespeed integration: http://extensions.joomla.org/extensions/core-enhancements/performance/cache/23602

RESTful API

mod_pagespeed will add a PURGE handler per quasi-standard: https://www.varnish-cache.org/docs/3.0/tutorial/purging.html . This purge handler will simply append lines to the cache.flush file. It would also be nice to periodically clean the file of old/redundant purge requests that will accumulate over time. There is a risk however that a programmatic burst of PURGE requests could cause entries to be dropped if we tried to clean the file while handling a PURGE request. Two potential remedies:

Use file-locking semantics while cleaning; other processes/threads attempting to append a PURGE command will fail to do the ‘append’, wait a random amount of time, and retry.
Only clean the file during Apache restart (in RootInit). A long-running server might wind up having that file grow in size arbitrarily, but we could manage that by seeking to the last known read-position and reading it incrementally. That last-known read-position could be stored in a shared-memory statistic.

nginx considerations

Currently the code to manage the cache.flush scanning is primarily in ApacheServerContext::PollFilesystemForCacheFlush. This could be moved to ServerContext or another helper class in (say) net/instaweb/system.
