-
-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Low-rank ADVI #3022
base: develop
Are you sure you want to change the base?
Low-rank ADVI #3022
Conversation
Implement low-rank ADVI. Initial pass at unit test cases and service interface required for CmdStan implementation
…4.1 (tags/RELEASE_600/final)
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
…wjn0/stan into feature/issue-2750-advi-lowrank
…4.1 (tags/RELEASE_600/final)
Okay, with the last batch of commits, I think this is in a state that can be looked at. I'm tagging @bob-carpenter and @avehtari as we've discussed this issue previously. Just going to lay out a couple of points that might help someone look at this PR. Math The math is exactly as implemented in Ong et al. (2017), with the exception that we parameterize the log-std additive factor instead of the std additive factor. This prevents a case where both the lower-triangular low-rank factor and the std additive factor might be zero, resulting in bad gradients. It's also more similar to the mean-field approximation implementation. Finally, it also makes for more sensible initializations - just like the mean-field approximation, a "zero" initialization corresponds to a standard independent multivariate Gaussian. Implementation details Some of the internal details of the base ADVI family change. Specifically, the dimensionality of The ADVI interface changes as well. Within the algorithms (eta adaptation, ELBO computation, ELBO gradient computation) in several instances the code needs to instantiate a variational distribution. For the "fixed-rank" approximations (i.e. mean-field/full-rank), this is not problematic - just uses the C++ templates. However, for the low-rank approximation, we need to pass in the rank to properly instantiate it. So we move the fixed-rank interface into a new subclass, leaving most of the implementation as-is, and just creating virtual functions for instantiating variational distributions within the internal algos. Implementations for both the fixed-rank and low-rank approximations are then straightforward in subclasses, and we take care to instantiate properly in the low-rank ADVI service. Testing The low-rank variational family is unit tested in the same way the other two existing families are. I've also gone ahead and begun implementing low-rank parallels to the other variational tests. The following is (as of this writing) an up-to-date list of what files are done and what files need to be done before merge:
C++ conventions C++ is not one of my native languages, so any feedback on this would be significantly appreciated. CmdStan I've got a working CmdStan version at wjn0/cmdstan@277b370. I don't think this is ideal, and I'm not sure I've specified the argument constraints 100% correctly. Any feedback welcome, pls let me know if I should create a separate PR (assume we'll want to restrict discussion here). Using the implementation Now for the fun part 😄 It can be used by setting the variational algorithm to The clearest example I've found of this so far is in the example-models repo, ARM Ch.25 model earnings2. This is (I assume, didn't actually check the book - just tried a bunch until I found one with discrepant mean-field and full-rank performance) a mis-specified/poorly-identified model (posteriors of same betas are highly correlated, probably hyper-correlated predictors? didn't diagnose :). The full-rank approximation captures these correlations well. The low-rank approximation captures some of these correlations well, but not all (increasing as you increase the rank of the approximation). Note the convergence issues are on full display here so to see the behaviour I mention you might have to smash re-run a few times to get meaningful fits (under both full-rank and low-rank approxs, by the way). Would welcome a rec for a better example model to try here! Gotta say, after grokking a few of the C++ features I'd forgotten, working with the code here was a real pleasure. Hopefully I haven't abused the code that was already in place too badly :) Please let me know if anything is unclear. |
Super cool! I'm especially curious how general the low-rank approximation is and whether we could use it elsewhere. What's the complexity of factoring for rank N approximations? We could really use some low-rank covariance estimation in our MCMC algorithm, which currently only allows diagonal and dense settings. @bbbales2 worked on this for his Ph.D. thesis and has some code somewhere we should track down before he disappears into the working world ether in a couple weeks. Thanks for working through the ADVI interface---it's definitely rougher than our optimization or MCMC interfaces. I can't take a look at it today or tomorrow, but should be able to get to it over the weekend with comments. CmdStan has to be a different PR because it's a different interface. So yes, please create a separate PR for that. @mitzimorris has been digging into CmdStan lately and has added a bunch of new features and can probably help on that side. It's very easy to get it into cmdstanr and cmdstanpy, and a bit more difficult to add to RStan and PyStan(3). The user-facing doc is in yet a third repo (docs), which is where the updated CmdStan doc needs to go. We'd also want to describe the algorithm in the reference manual and perhaps suggest/describe it in the user's guide, too. They're all in the user-facing docs repo. I can help with that if you're not keen to write a lot of doc. The test sets Ong et al. used don't look a lot like the models people fit in Stan other than the hierarchical regression. We have a lot of evaluation models to use. The ones in posteriordb have been better vetted than the ones in example-models, but not as well vetted as the one in stats-test-models (or something like that). Rather than the typical ML test error and training error evals, I'd rather measure (a) calibration of probabiliistic predictions, (b) square error of expectation calcuations, and/or (c) measuring the ELBO. With Stan, we care about recovering the posteriors or expectations our users write down as accurately as possible, not maximizing held out classification performance (that's a model tweaking issue, not an approximate fit issue). |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
So, relative to the complexity of the dense ADVI approximation (N^3 where N is the number of model params), complexity is R^3 where R is the rank of the approximation (matrix inversion). This is a happy consequence of the Woodbury lemma for inverses of a low-rank covariance matrix of this form (described in the Ong paper than I ever could, so I'll defer to them). The matrix determinant lemma results in something similar for the entropy computations. I don't know enough about HMC to say whether the inverse/determinant is the bottleneck there too, but I'm guessing yes? I'll plan to do some benchmarking to see how well this is borne out in the implementation as written too. @bbbales2 feel free to let me know if anything jumps out at you as potentially useful from this PR on the MCMC side, unfortunately HMC/MCMC in general just puts me squarely out of my depth mathematically haha. I'll plan to give a more robust evaluation framework some thought over the next couple of days too, with the goal of implementing some stuff this weekend. Thanks for the refs on example models, I'll fit some of those and see if we can get some nice posterior kernel density estimate pairplots for visual assessment. In the original ADVI paper they had a few nice cases (IIRC) of comparing full-rank, mean-field, and MCMC. I totally agree with you re: eval metrics, I think we really want to examine posterior variance (under)estimation as that's the main "gotcha" with mean-field ADVI identified in the lit IIRC. I assume by eval technique (b)/(c) you mean using MCMC samples to approximate the true KL of converged posteriors in different ways? I can't remember if they did that in the original ADVI paper, but definitely would be a nice way of looking at things if it's mathematically feasible (IIRC there are some mathy gotchas around this if not careful). Is there an existing way to extract the variational parameters/constraint transformations used for a given model when using the VI interfaces, or are you thinking we might use samples on the ADVI side too (assuming I've interpreted your comment correctly :)? Anyway, super glad there's still excitement around this feature. I'll follow up with a CmdStan PR soon too. Assistance with other interfaces/documentation would be much appreciated as well, though of course I'm totally willing to assist as necessary! I struggle sometimes with R in particular, but cmdstanpy/pystan shouldn't be too much trouble for me. |
Cool! https://arxiv.org/abs/1905.11916 lists several examples where low-rank mass matrix is better than diagonal mass matrix and not worse than dense. It's a bit different from variational, but the same posteriors would be great for testing low-rank VI. |
Hey @wjn0 this is really interesting and I'm trying to test. I'm getting a compile error. I believe it is due to three of the virtual functions in /**
* Return the approximation family's parameters as a single vector
*/
virtual Eigen::VectorXd return_approx_params() const = 0;
/**
* Set the approximation family's parameters from a single vector
* @param[in] param_vec Vector in which parameter values to be set are stored
*/
virtual void set_approx_params(const Eigen::VectorXd& param_vec) = 0;
/**
* Return the number of approximation parameters lambda for Q(lambda)
*/
virtual const int num_approx_params() const = 0; |
@spinkney thanks for your interest! This branch is developed in reference to |
Yes, me munging two branches together. I'll have to ask those people. |
This may be of interest to others here. Collaborating with @spinkney , I managed to merge together the new RVI with this low-rank approach. It appears to be working. The main merge conflicts had to do with the new RVI advi signatures, and some virtual functions that needed implementing in the normal_lowrank class. Cmdstan here: https://github.com/stephensrmmartin/cmdstan/tree/feature/lowrank |
I got a branch of remotes::install_github("spinkney/cmdstanr", ref = "feature/rvi")
library(cmdstanr)
file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(file)
data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
fit <- mod$variational(
data = data_list,
algorithm = "lowrank",
rank = 2
) |
How does validating low-rank ADVI using multiple samples comparison ecdf sound? So we could constitute the four chains in Fig. 10 of this paper with two chains from ADVI and the rest from posteriordb reference posteriors. This way we could compare the post-stationary samples (most recent a × t samples used to compute Rhat) of ADVI with that of dynamic HMC. It is important that the same prior and model corresponding to the reference posterior are used to get ADVI samples. I expected ADVI ecdfs to be ⋂ shaped as VI would have lower variance but hopefully, if the algorithm is not biased, its ECDF difference plot would cross (0.5,1) in Fig.10-(f). I am not sure what is meant by "we know that ADVI does not produce autocorrelated posterior samples." from the original SBC paper, but as we are using samples that reached stationary, I assume the computed rank would be the same with or without thinning. Please correct me if I wrong.
@wjn0 Would there be any relation between ADVI rank and the variance of the post-stationary samples? If any, confirming this with ADVI ecdf being less ⋂-shaped as we increase its rank would be interesting. |
In general, we can't validate VI by comparing with the truth, because it's drawing from a different distribution. I would try validating it on its own terms instead---does it find the right posterior mean or get stuck during optimization? Is the posterior (co)variance right on its own terms?
ADVI is based on a normal approximation to the posterior, from which independent draws are sampled. So no autocorrelation. And there's no notion of stationarity, so I'm not 100% sure what the next question means.
I think the bigger point is that we know it'll have an effect on the covariance of posterior samples. In order to do full Bayes, we need that posterior covariance, not just the marginal variances. |
Wow, this is incredible - thanks a lot for working on this! I've just tried this on a pretty big model and got what I think is a OOM error:
... and nothing in $output(). Also when I run this and switch to htop is see the 32GB quickly filling up... Is this expected? Is the whole 42K x 42K covariance matrix computed at some step of the computation? |
@adamhaber hmm, I can't say I've seen those errors before. Does the number of parameters in your model scale with the number of data points (e.g. a latent variable model)? If so, I wonder if running a scaled-back version as a first step would help diagnose. Thus far, I have only tested with CmdStan. My next to-do is to get Python bindings (likely PyCmdStan) working so that I can do more in-depth tests. Perhaps you could try fitting your model in CmdStan and seeing if the error pops up there too? It does look R-generated, given the reference to Alternatively, if scaling back your model is not possible, might I recommend disabling eta adaptation temporarily, and seeing if optimization proceeds normally? My original planned use case for this (prototyping latent GP models) has around 10x parameters as yours so hopefully this is not some inherent limitation in the code (I have not had a chance to try these models yet though, unfortunately). I'm reasonably confident we don't ever compute the full covariance matrix (which in your case would be ~15 GB). |
Thanks @wjn0 ! I'll try the scaled-back version and will let you know how it worked. Any chance the issue might be related to differences between CmdStan's variational (2.26.1) vs. RStan's vb (2.21.3)? CmdStan's variational works perfectly on the Bernoulli example, so probably not an installation problem - but on the larger model, vb takes around ~20 mins while variational doesn't proceed past eta adaptation (waited ~1h). Setting |
@adamhaber sounds good. I unfortunately have not used Stan in R (either before this, or @spinkney's bindings [yet] -- which I assume you're using?) Perhaps they might be able to assist with some of your questions. When you say
this is referring to mean-field ADVI in one of the Stan R packages? And when you say
is this in |
Yes, RStan's
Yes, this is cmdstan. What happens is that the program hangs after eta adaptation finishes (or right at the start without it). CPU is still on 100% and memory is filling up quickly, but nothing is printed for more than 1 hour, while the RStan version finishes in less than 20 minutes. |
Update - tried this with the latest release of CmdStan and variational works fine, so this seems specific to this version of CmdStan (but not specific to the |
@adamhaber so, if I understand correctly, something in my branch of CmdStan seems to be breaking something for all variational procedures, not just lowrank (possibly something that comes out specifically only for large models?)? Can I ask how you built this copy of cmdstan? I greatly appreciate the debugging work you've done so far! |
That's my understanding as well, but not 100% sure about.
Thanks for working on this! |
Okay, now I'm wondering if perhaps this is something related to the RVI feature. @stephensrmmartin is there an issue/PR where I can get more info on RVI? @adamhaber if you get a chance and are willing to drop down to the CmdStan CLI, I would be curious if my version of CmdStan here produces the same error: wjn0/cmdstan@277b370 This might help us narrow it down. Thanks again. |
OK, just tried it, and got this on
|
@adamhaber My apologies, looks like I forgot to update my remote cmdstan repo to point the stan submodule to my copy of
and then rebuilding cmdstan/examples as normal. |
Thanks, that's really helpful. I was able to
apologies in advance if this is just me messing up the installation :-) |
…-2750-advi-lowrank
https://github.com/Dashadower/stan/tree/feature/rvi I just merged that branch with this low-rank PR patchset. There were some merge conflicts I had to fix. E.g., some signatures and interfaces changed with RVI, so I changed low-rank's to match. Likewise, I had to tweak the RVI methods to call init_variational, so that low-rank could work. I also had to implement some virtual functions specifically for low-rank; I believe these are done correctly, but I haven't extensively tested it myself. |
@adamhaber definitely not you, just got the same thing on a fresh copy of mine. I just pushed up a version merged with the latest Thanks for the info @stephensrmmartin! |
I've pulled Some more info:
|
My apologies -- I think it's actually my lack of familiarity with git submodules that's giving us trouble! With that said, I think I've updated the right submodules now. The following gives me a working, fresh copy of CmdStan with low-rank support (whereas previously it was giving me the same error as you, about
Please feel free to let me know if this doesn't work for you. |
Thanks @wjn0 , that works for me as well! Mean-field runs as expected on both the Bernoulli example and the large model. Low rank works fine on the Bernoulli example, but seems to be very slow for this model (rank=1) - I'm running it with refresh=1 and nothing is printed in more than 1 hour. Again, CPU is 100% and memory use is high (~25GB), so something seems to be working in the background. Other than refresh=1 (which might be problemtic? I've seen lags in the outputs in previous variational runs, too), is there any way to monitor the progress of the |
Could you try a smaller example and time the mean-field, lowrank, and fullrank? And to confirm, this is without the RVI additions right? It's just the low-rank branch mentioned by @wjn0 (i.e., you are not using my branch, but his branch?) |
Yes, this is based on these installation instructions:
I'll run some benchmarks and post the results later today. |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
Hello! This feature would be very useful for me. It doesn't look like there has been progress in some time though... Are there any plans to pick this up? |
I don't know that anyone's working on this. We are working on coding a new variational inference algorithm that uses a low rank plus diagonal covariance matrix derived from L-BFGS optimization. |
33+3/2
Awesome! |
Submission Checklist
./runTests.py src/test/unit
make cpplint
Summary
Implement low-rank ADVI. See also #2750 and #2751.
Intended Effect
Make available a new variational family called
lowrank
which supports an intermediate between mean-field and full-rank ADVI, intended for more robust posterior variance estimates in large models.How to Verify
Use the CmdStan interface implemented at wjn0/cmdstan@277b370
./model variational algorithm=lowrank rank=4 <...>
where
1 <= rank < # model params
. The example models repo ARM/Ch.25/earnings2 is an interesting example.Side Effects
Documentation
In progress.
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):
Walter Nelson
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: