This repository has been archived by the owner on Apr 11, 2024. It is now read-only.
-
-
Notifications
You must be signed in to change notification settings - Fork 83
/
Copy pathStancon.hs
428 lines (359 loc) · 24.7 KB
/
Stancon.hs
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
{-# OPTIONS_GHC -F -pgmF inlitpp #-}
# Getting more out of Stan: some ideas from the Haskell bindings
Thomas A Nielsen<sup>1</sup>, Dominic Steinitz<sup>1</sup> and Henrik Nilsson<sup>2</sup>
1. Tweag I/O
2. School of Computer Science, University of Nottingham
## Introduction
Probabilistic programming is one of the most promising developments in statistical computing in recent times.
By combining programming languages theory with statistical modelling, it empowers modellers with a flexible
framework for statistical computing in which a great variety of models can be expressed using composition of
arbitrary probability distributions within flexible control flow to express dependencies and data structures known
to, or hypothesised by, the modeller.
Probabilistic programming is distinguished from stochastic programming by the capability to condition the random
variables on observed data; essentially to perform the Bayesian update and move from prior to posterior
distributions. Sometimes the posterior suffices. For instance, a causal link may be predicated on a certain regression
coefficient being non-zero. Nevertheless, probabilistic programming in its full power is not restricted only to calculating
the posterior. Computations based on this posterior may be more directly relevant to the data analyst or to the
decision maker and often require further probabilistic calculations. For instance:
* predicting outcomes based on new observations
* model criticism based on residuals or the posterior predictive distribution
* forecasting timeseries
* risk analysis
* decision-making
* resource allocation
Stan has emerged as the most practical and widely used probabilistic
programming language for Bayesian computation. In its canonical form,
Stan essentially only calculates the posterior. While some further
calculations can be done within Stan in a generated quantities block,
these are quite limited, necessitates repetition, and are coupled to
the inference and thus cannot be calculated independently.
Consequently, further analysis is usually carried out in a generic host
programming languages like Python or R. Unfortunately, this often leads
to repetition. For instance, to calculate residuals or to make
predictions in a predictive model, one must write the model twice: once
in the Stan modelling language, for inference, and once in the host
language, for simulation and prediction. This has the disadvantage that
the two models may become out of sync and some probability
distributions may be parameterised in different ways. This is not much
work for simple models, such as linear regression model; the reader may
feel that the author doth protest too much. But for more complex models
it may not at all be obvious how to transfer the posterior into
simulation environment.
Here, we present the results of some experiments with creating [bindings to Stan in Haskell](https://github.com/diffusionkinetics/open/tree/master/stanhs),
a purely functional
and statically typed programming language. Rather than present “yet another Stan binding” or even worse, try to
persuade the reader to abandon their current programming language and learn Haskell, our aim here is to present
some ideas that enable a richer set of probabilistic computations from
a subset of Stan models. This obviates the need
to also implement the model in the host language, thus addressing the
above problem with existing bindings. Our ideas are general and could,
in principle, be leveraged to improve existing interfaces to Stan.
Nevertheless, we have chosen here to explore these ideas in Haskell
due to its support for embedded languages, ease of re-factoring experimental code, and its emerging data science
ecosystem.
## The Haskell programming language
Haskell is a pure, statically typed, general purpose functional
language with lazy evaluation. It is widely used in academia for
teaching and research, and increasingly for commercial software
development as Haskell is conducive to rapid development of reliable
code. Purity means that functions have no side effects, except where
explicitly reflected in the function type, and then only those effects
that are permitted by the type in question. This enforced effect
discipline, in combination with an expressive type system, make it a
lot easier to understand and maintain code, especially when it grows
large. Purity implies that equational reasoning is valid, making
Haskell a particularly good fit for mathematical applications. Lazy
evaluation means that computation is demand driven. This, to a large
extent, frees the programmer from operational concerns, allowing them
to focus on stating what to compute rather than how to do it, which is
another reason for why Haskell is a good fit for mathematical
applications.
One area where Haskell has proved particularly successful is as a host
language for domain-specific languages. There are a number of reasons
for this. One is Haskell's concise yet flexible syntax with extensive
support for overloading of operators and other notation through type
classes. Another is that all values in Haskell, including functions,
are first-class; i.e., they can be bound to variables, passed to and
returned from functions, and so on, greatly facilitating implementing
appropriate domain-specific abstractions. Haskell is also particularly
well suited for symbolic computations, such as those needed for
compiler applications. In the setting of embedded domain-specific
languages, this allows for a spectrum of implementation strategies,
from interpretation to compilation, as well as programmatic
construction of programs in the domain-specific language, often
referred to as metaprogramming or metamodelling
(Augustsson et al, 2008; Giorgidze and Nilsson, 2011; Svenningsson and Axelsson, 2013).
Finally, thanks to its powerful type
system, it is often possible to enforce domain-specific typing
constraints (Thiemann, 2002).
We will see some of these features being put to good use
in the following.
For a data scientist or statistician, the Haskell language holds
several attractions. Most importantly, Haskell now has an ecosystem
and a community for data science and numerical computing that is, if
not as well developed as Python or R, then increasingly productive with many packages
implementing different methods, in particular for a general purpose
programming language that was not designed with numerical computing in
mind. Haskell's lazy evaluation model can make it more difficult to reason
about performance, this in practice rarely gets in the way of data science work
and can be disabled in recent Haskell compilers. On the other hand, Haskell's type system makes makes it significantly
easier to reason about and refactor code.
On a spectrum of data science needs, Haskell is particularly
suited to productizing models that have been developed in languages or
environments that may be more suited for exploratory data analysis.
The [inline-r](https://tweag.github.io/HaskellR/) project, for instance, gives Haskell programmers direct
access to all of the R programming language, facilitating moving data
science into a production environment. Haskell’s type system makes it
very simple to re-factor large code bases which may blur the boundary
between data science and software engineering.
```html_header
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
```
## Bayesian modelling in Haskell and Stan
```haskell top hide
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE OverloadedStrings #-}
import Graphics.Plotly
import Graphics.Plotly.Histogram
import Stan.AST
import Stan.AST.Pretty
import Stan.Data
import Stan.Run
import Stan.Simulate
import Numeric.Datasets.BostonHousing
import Numeric.Datasets
import Lens.Micro
import Data.Monoid
import Data.Maybe
import Data.Text (unpack, pack)
import Lucid
import Lucid.Bootstrap
import Lucid.Bootstrap3
import qualified Data.Map.Strict as Map
import qualified Data.Random as R
import Data.Random.Source.PureMT
deleteVar = Map.delete
lookupVar k = fromJust . Map.lookup k
```
In our prototype interface, the Stan model itself is described in a
data structure in the host language, here Haskell. This is unlike
typical Stan interfaces, and has the disadvantage that the Stan file has
slightly more syntactic noise and is less familiar. However, the Stan
model description is now a value in a programming language that can be
manipulated and calculated based on the circumstances, a form of
metamodelling as discussed in the previous section. There are also
other advantages that will become apparent later in this paper.
Moreover, there are a number of ways in Haskell to reduce the
syntactic noise by making the model look more like plain Stan code
should that be desired in a more mature implementation; e.g.
parsing the desired Stan syntax directly from a file or a "here document", i.e.
an embedded multi-line string.
In our current implementation, the Stan model value looks like this:
```haskell top
linRegression :: [Stan]
linRegression = [
Data [ lower 0 Int ::: "n"
, lower 0 Int ::: "p"
, Real ::: "y"!["n"]
, Real ::: "x"!["n","p"]
],
Parameters [
Real ::: "beta"!["p"],
Real ::: "sigma"
],
Model [
For "i" 1 "p" ["beta"!["i"] :~ normal (0.0,1.0)]
,"sigma" :~ gamma (1.0,1.0)
,For "i" 1 "n" [
"y"!["i"] :~ normal (("x"!["i"] `dot` "beta"), "sigma")
]
]
]
```
In our implementation, Stan models are encoded as an abstract syntax
tree, defined as an algebraic data type `Stan`.
This makes it straightforward to generate or transform Stan models
programmatically by writing functions working on abstract syntax trees
using pattern matching. For instance, `normal` is a function taking a pair of numbers (mean, standard deviation) and returning
the abstract syntax tree representing a draw from a normal probability distribution. Our encoding as a data type is similar to that employed in the more polished [ScalaStan](https://github.com/cibotech/ScalaStan). Note that we do not yet support vectorised expressions, but this is not an intrinsic limitation of our approach. However, if two models are developed in the same Haskell module, then names will not clash even if they overlap as the Stan code will be treated separately.
In the above code, the syntax `::` indicates a type annotation. Here `linRegression` has
the type `[Stan]`, a list of values of type `Stan`,
an algebraic data type with constructors (variants) Data, Parameters, Model corresponding to sections of a Stan model.
We can easily see the Stan code corresponding to the data structure
(program) we just created by pretty-printing. This is just the Stan model file that we otherwise would use directly.
```haskell eval
ppStans linRegression
```
```haskell top hide
postPlotRow post vnms = row_ $ rowEven MD $ flip map vnms $ \vnm -> toHtml (plotly vnm [histogram 50 $ fromJust $ Map.lookup (unpack vnm) post]
& (layout . margin) ?~ titleMargins & (layout . height) ?~ 300 & (layout . title) ?~ (vnm))
```
We then have to create the data structure holding the input to Stan.
Here, as an example, we will use the Boston Housing dataset from the
[datasets](http://hackage.haskell.org/package/datasets) Haskell package. We load this into the `bh` variable which
will hold a list of records describing Boston housing data (`bh ::
[BostonHousing]` where BostonHousing is a record with fields including `rooms` and `medianValue`)
```haskell do
bh <- getDataset bostonHousing
```
Here, we plot the median value against the number of rooms in a residence:
```haskell eval
plotly "bh" [points (aes & x .~ rooms & y .~ medianValue) bh]
& layout %~ yaxis ?~ (defAxis & axistitle ?~ "Median Value")
& layout %~ xaxis ?~ (defAxis & axistitle ?~ "Rooms")
```
To put this into the Stan data format, we create values of the `StanEnv` type using the custom infix operator `<~`. Haskell allows library
programmers to define their own infix operators describing the model. Here, we have defined `v <~ d` to mean: "Create a Stan environment
where the variable named `v` holds the data contained in the Haskell
variable `d`." `d` can be of any type for which we have defined how to turn
values into Stan values (that is, implemented the `ToStanData` type class). We concatenate these elementary Stan environments using the
append operator `<>`. `StanEnv` itself is represented as a map (known as a dictionary in some other languages) from strings, relating a variable name to its value. This environment is used both to contain data to be passed to Stan for inference and the variables
generated in simulation (see below).
```haskell do
let getRow b = [rooms b, crimeRate b]
sdata = "y" <~ map medianValue bh <>
"x" <~ map getRow bh <>
"n" <~ length bh <>
"p" <~ (2 :: Int)
```
Finally, we run the Stan model using the `runStan` function taking as arguments the model, the data value and a configuration value
that specifies whether we are sampling or optimising the posterior. The resulting posterior will be bound to the variable `res`.
A
```haskell do
res <- runStan linRegression sdata sample {numIterations = 500}
```
At this point the components of the posterior can be plotted as usual:
```haskell eval
postPlotRow res ["beta.1", "beta.2", "sigma" ] :: Html ()
```
## Simulating From a Stan Model
In order to obtain a richer probabilistic programming capability based on the Bayesian update in Stan, it suffices to
add a function to simulate from a probabilistic model with fine control over which variables from the posterior are used in
the simulation. By using no variables from the posterior at all (and therefore not using any data either), we are simulating from the prior (prior predictive distribution); by using all the variables from the posterior, we are simulating from the posterior (posterior predictive distribution). Moreover, by controlling the independent
variables in the dataset, we can make predictions for new observations. In the case of timeseries modelling, by manipulating the
starting value, we can continue a simulation from the endpoint of observed data (that is, forecast). Crucially, we propose
that all of these functions are possible without writing the model twice as is usual: once in Stan, and once in the host language.
Simulation operates on the same model description as that used for inference.
The type of our simulation function is:
```haskell
runSimulate
:: Int -- Number of simulations we would like to perform
-> [Stan] -- The Stan model
-> StanEnv -- The simulation input environment
-> [StanEnv] -- A list of independent simulation outputs
```
Here `runSimulate n m e` performs n independent simulations in the model m using environment e. If m contains values from a
posterior (which has been generated with `runStan`), then each of the independent simulations uses a consistent set of samples from that
posterior representing the state of the Markov chain at a given iteration. Therefore we retain information about posterior correlations in the simulations.
Internally, `runSimulate n` calls `runSimulateOnce` `n` times, simply collecting the results into a list. `runSimulateOnce` steps through the model definition's abstract syntax tree, potentially adding to the provided environment for every line. It will follow these rules:
* If the current line is a static assignment (corresponding to `<-` in the Stan syntax),
calculate the right hand side and assign
it to the variable given on the left-hand side in the environment
* If the current line is a draw from a probability distribution
(corresponding to `~` in the Stan syntax), first check if the variable on the left-hand side
is already present in the environment from the posterior.
* If the variable is present, select one of the iterations from the posterior and insert the posterior
value at that iteration into the environment.
The iteration index selected is constant throughout one invocation of `runSimulateOnce`.
* If the variable is not present, fully evaluate the arguments to the
probability distribution. Then use the random seed in the environment to simulate a
draw from this probability distribution and a new random seed. Insert the drawn value and replace
the random seed with the new seed.
* If the current line is a for loop, repeat the body for the required number of times with the loop
index placed into the environment.
We quickly demonstrate the concept of simulation by simulating a single replica dataset and plotting it:
```haskell do
seed <- seedEnv <$> newPureMT -- generate seed for random numbers
let resEnv = seed <> deleteVar "y" sdata <> mcmcToEnv res -- posterior and data with deleted observation
simEnv = runSimulateOnce linRegression resEnv -- run the simulation
postPredOnce = zip (unPairDoubles (lookupVar "x" simEnv))
(unDoubles (lookupVar "y" simEnv))
```
In the first line, we create an environment holding a random seed to be used in a random number generators during the simulation.
Then, we concatenate three environments to pass to the simulation: the random number seed, the data with the deleted observation
variable `y` such that new variables with this name will be generated in the simulation, and finally the results of the posterior inference
converted to an environment using the function `mcmcToEnv`. `runSimulateOnce` then runs the model in a simulation mode, taking as arguments
the model representation and the environment holding the variables which are not to be simulated but are used as the basis for further simulation.
In the last line, we use `zip` (which converts two lists to a list of pairs) to create a list of simulated `x` vectors, which are those in the data,
and `y` which are the new simulated outcomes.
This can be plotted using the plotly library as:
```haskell eval
plotly "bhpp" [points (aes & x .~ (fst . fst) & y .~ snd) postPredOnce]
& layout %~ yaxis ?~ (defAxis & axistitle ?~ "Median Value")
& layout %~ xaxis ?~ (defAxis & axistitle ?~ "Rooms")
```
This simulation facility can be used for a common operation in model criticism: calculating residuals. In classical statistics residuals are
calculated by averaging the posterior parameters (using a point estimate), making a single prediction and subtracting this
from the observed outcome. From a Bayesian point of view, it almost never makes sense
to average the parameters. Instead, the parameters, the predicted outcome and the residuals themselves are all probability
distributions. In order to achieve something plottable, we average the predicted outcome over all the Markov chain Monte Carlo
samples and subtract this average prediction from the observed outcome; but we could also use intervals or small point clouds for the full residual distribution. This has the advantage that in case of banana-shaped or
multimodal posteriors (arising, for instance from lack of identifiability) the average prediction is more meaningful than
the prediction with the average parameter value.
```haskell do
let -- simulate 100 datasets
simEnvs = runSimulate 100 linRegression resEnv
-- Average the 100 simulations into an average predicted outcome
avgYs = avgVar simEnvs "y"
-- for each average predicted outcome, subtract the observed outcome
differences = zipWith (-) (unDoubles (lookupVar "y" sdata))
(unDoubles avgYs)
-- zip the residual with the dependent variables to prepare for plot
residuals = zip (unPairDoubles (lookupVar "x" sdata))
differences
```
Here we simulate 100 sets of predicted outcomes, that is 100 different outcome vectors `y`.
For each simulated outcome vector index, we average the predictions to obtain a single average
predicted outcome vector `avgYs`. From this average prediction, we subtract the observed outcome
element wise in the list (`zipWith (-)`) and pair these with the dependent variables to facilitate
plotting.
```haskell eval
plotly "bhres" [points (aes & x .~ (fst . fst) & y .~ snd) residuals]
& layout %~ yaxis ?~ (defAxis & axistitle ?~ "Median Value Residual")
& layout %~ xaxis ?~ (defAxis & axistitle ?~ "Rooms")
```
## Discussion
We have shown prototype bindings to Stan in the Haskell language. Unlike other such bindings, our model definition is a data
type in the host language. This enables us to simulate from the model with fine control over the model parameters which may
come from a posterior following inference.
Simulation with fine control over the provenance of the parameter distribution has several advantages. We have already shown
how it can be used to define the residuals and how to simulate from the posterior predictive distribution. In many cases, such
an instruction to simulate from the posterior predictive distribution will be incompletely specified. For instance, if we are
dealing with a hierarchical regression model, should the posterior predictive distribution retain the group identities? One
posterior predictive simulation would have entirely new groups, while another could have new individuals but drawn from groups
that are specified from the data.
### Further work
The Haskell interface to Stan we have shown here is an incomplete prototype. The model definition language can be polished
to reduce the amount of syntactic noise. Ideally, we would also have the ability to parse models written in the standard Stan
language. We can also make it much easier to move data into Stan, and to parse data from the posterior samples and from the
simulation environment. This will be facilitated when the Haskell data science community converges on one of the several
proposed implementations of data frames.
We currently only support a narrow subset of Stan models for post-posterior processing. Some features are simply not implemented
in the work in progress we are presenting here, but we see no technical obstacles to doing so. This includes for instance
vectorized expressions, or with a bit more work, ordinary differential equations. For models that have multiple assignments with `~` in separate lines to the same variable, corresponding to multiple observations, the semantics may be somewhat confusing as in the simulation we would effectively be skipping subsequent assignments as these would be treated as observed after they have been drawn on the first assignment. Therefore the simulated variables would only be influenced by the first draw, unless this is explicitly implemented in a different way.
Other models will be much more difficult to automatically convert to simulation code, for example models
that rely on updating the log posterior using `+=` statements. Simulating posterior predictive data or making predictions from
new independent variables from these models would probably require running a new Markov chain. Thus simulating from these models would be similar to simulating from the Ising model, which is typically done with MCMC.
Representing a Stan model in a data structure opens other possibilities. We plan to implement a function to combine two different
models by specifying a list of the parameters that are common in the two models. Thus, two different manifestations of the same
underlying constants can be investigated independently before the evidence is combined. This functionality would have to deal with name conflicts between different models which could arise when combining to independently developed models. This can be done by appending an identifier corresponding to the model provenance to any conflicting names.
## About this document
The code for this document is [hosted on GitHub](https://github.com/glutamate/stancon). It is written using the [inliterate
notebook format](https://github.com/diffusionkinetics/open/tree/master/inliterate) which allows a mixture of code, text
writing, and the result of running code. Compared to Jupyter notebooks,
it is less interactive (there is no caching) and it emphasises a cleaner, human- readable source format that can be checked
into version control. Plots are generated with Plotly.js.
## References
Lennart Augustsson, Howard Mansell, and Ganesh Sittampalam. Paradise: a
two-stage DSL embedded in Haskell. In ICFP '08: Proceeding of the 13th
ACM SIG-PLAN international conference on Functional programming, pages 225-228, New York, NY, USA, 2008. ACM.
George Giorgidze and Henrik Nilsson. Mixed-level Embedding and JIT Compilation for an Iteratively Staged DSL. In Julio Mariño, editor, Proceedings of the 19th
Workshop on Functional and (Constraint) Logic Programming (WFLP 2010), volume 6559 of Lecture Notes in Computer Science, pages 48–65, Springer-Verlag, 2011.
J. Svenningsson and E. Axelsson. Combining deep and
shallow embedding for EDSL. In Trends in Functional
Programming (TFP) 2012, Revised Selected Papers,
volume 7829 of Lecture Notes in Computer Science,
pages 21–36, 2013.
Peter Thiemann, Programmable Type Systems for Domain Specific Languages,
Electronic Notes in Theoretical Computer Science, Volume 76, 2002,
Pages 233-251, ISSN 1571-0661.