-
Notifications
You must be signed in to change notification settings - Fork 56
[Feauture] Confidence interval bands + errorbars #710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
8f65ac2
234355b
8f7c2d2
494442e
9cbad45
a77a0b5
2cf6b06
5387feb
ac450f4
fe85b63
97b8c63
3118fc9
88ff001
3bb3f5e
13f9dff
c3bbc26
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3,6 +3,9 @@ Base.@kwdef struct LinearAnalysis{I} | |
| dropcollinear::Bool = false | ||
| interval::I = automatic | ||
| level::Float64 = 0.95 | ||
| weighttype::Symbol = :fweights | ||
| weighttransform = identity | ||
| distr::GLM.Distribution = GLM.Normal() | ||
| end | ||
|
|
||
| function add_intercept_column(x::AbstractVector{T}) where {T} | ||
|
|
@@ -12,15 +15,43 @@ function add_intercept_column(x::AbstractVector{T}) where {T} | |
| return mat | ||
| end | ||
|
|
||
| function get_weighttype(s::Symbol) | ||
| weighttype = if s == :fweights | ||
| StatsBase.fweights | ||
| else | ||
| throw(ArgumentError("Currently, GLM.jl only supports `StatsBase.fweights`.")) | ||
| end | ||
|
|
||
| # TODO: Can support these weights as well after GLM v2 is released | ||
| # https://github.com/JuliaStats/GLM.jl/pull/619 | ||
| #weighttype = if s == :aweights | ||
| # StatsBase.aweights | ||
| #elseif s == :pweights | ||
| # StatsBase.pweights | ||
| #elseif s == :fweights | ||
| # StatsBase.fweights | ||
| #else | ||
| # throw(ArgumentError("Currently, GLM.jl only supports `aweights`, `pweights`, and `fweights`.")) | ||
| #end | ||
|
|
||
| return weighttype | ||
| end | ||
|
|
||
| # TODO: add multidimensional version | ||
| function (l::LinearAnalysis)(input::ProcessedLayer) | ||
| output = map(input) do p, n | ||
| x, y = p | ||
| weights = StatsBase.fweights(get(n, :weights, similar(x, 0))) | ||
| default_interval = length(weights) > 0 ? nothing : :confidence | ||
| interval = l.interval === automatic ? default_interval : l.interval | ||
| weights = (get_weighttype(l.weighttype) ∘ l.weighttransform)(get(n, :weights, similar(x, 0))) | ||
| interval = l.interval === automatic ? :confidence : l.interval | ||
| # FIXME: handle collinear case gracefully | ||
| lin_model = GLM.lm(add_intercept_column(x), y; weights, l.dropcollinear) | ||
| # TODO: `wts` --> `weights` after GLM v2 is released | ||
| # https://github.com/JuliaStats/GLM.jl/pull/631 | ||
| lin_model = if isempty(weights) | ||
| GLM.lm(add_intercept_column(x), y; l.dropcollinear) | ||
| else | ||
| # Supports confidence intervals, while `GLM.lm` currently does not | ||
| GLM.glm(add_intercept_column(x), y, l.distr; weights, l.dropcollinear) | ||
| end | ||
| x̂ = range(extrema(x)..., length = l.npoints) | ||
| pred = GLM.predict(lin_model, add_intercept_column(x̂); interval, l.level) | ||
| return if !isnothing(interval) | ||
|
|
@@ -50,7 +81,7 @@ function (l::LinearAnalysis)(input::ProcessedLayer) | |
| end | ||
|
|
||
| """ | ||
| linear(; interval=automatic, level=0.95, dropcollinear=false, npoints=200) | ||
| linear(; interval=automatic, level=0.95, dropcollinear=false, npoints=200, weighttype=:fweights, weighttransform=identity, distr=GLM.Normal()) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you allow passing
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh cool, didn't know about
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't closely examine this but in general, I like it better if options for a different package are pass-through (like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That makes sense. One point to note though is that distribution and link are positional arguments in GLM, so either they would have to be passed as positional arguments to |
||
|
|
||
| Compute a linear fit of `y ~ 1 + x`. An optional named mapping `weights` determines the weights. | ||
| Use `interval` to specify what type of interval the shaded band should represent, | ||
|
|
@@ -63,6 +94,11 @@ it is possible to set `dropcollinear=true`. | |
| `npoints` is the number of points used by Makie to draw the shaded band. | ||
|
|
||
| Weighted data is supported via the keyword `weights` (passed to `mapping`). | ||
| Additional weight support is provided via the `weighttype`, `weighttransform`, and `distr` keywords. | ||
| `weightype` specifies the `StatsBase.AbstractWeights` type to use. | ||
| `weighttransform` accepts an optional function to transform the weights before they are passed to `GLM.glm`. | ||
| `distr` is forwarded to `GLM.glm`. | ||
| See the GLM.jl documentation for more on working with weighted data. | ||
|
|
||
| This transformation creates two `ProcessedLayer`s labelled `:prediction` and `:ci`, which can be styled separately with `[subvisual](@ref)`. | ||
| """ | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is
weighttypereally needed? The whole idea behind theseAbstractWeightstypes in StatsBase is that you only specify once the kind of weights you have when building the vector, and then all functions automatically interpret it correctly. We don't have any API currently that takes the kind of weight separately.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look, Milan! I hear ya, and I as on the fence on if it would be better to have the user explicitly construct the weight vector themselves before passing it to AoG, or to just have AoG handle it internally. I decided to go with the latter for ease of use.
Currently, this just wraps the specified field from the table-like data passed to AoG in
fweights(or to potentially more supported weight kinds after GLM v2 is released). In either case, the weights are still only specified once, no?Maybe calling it
weighttypeis misleading since it's really just wrapping our array in a convenience function call. I think I called itweightkindin an earlier iteration, and could switch back to that if this design makes sense? Sorry if I am misunderstanding what you are asking.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that imagine I'm loading a dataset in
df. I immediately do something likedf.y_unc = aweights(df.y_unc). Then I can create one or more graphs, estimate models, etc., without ever saying again that these areaweights, and everything works automatically. And this is necessary to use weights with StatsBase or GLM as they require anAbstractWeightstype. Introducing a new keyword argument which differs from the method used in other packages actually makes things more complex IMO.As @jkrumbiegel said (https://github.com/MakieOrg/AlgebraOfGraphics.jl/pull/710/changes#r2685551356) it seems better to adopt the API of underlying packages as much as possible.
without introducing new API in AoG.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so your strategy is totally what I would do if I was using vanilla Makie.
In contrast, it's my understanding that the current usage with AoG makes all calls to StatsBase and GLM internally so that the user only needs to pass the raw data without any pre-processing or explicit loading of additional packages needed. It seems this would need to change with your proposed usage, so I guess my question is if this is the direction that we would like to go in this PR.
At any rate, happy to go with either route, was just asking for clarity