Skip to content
Open
2 changes: 1 addition & 1 deletion Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ DataAPI = "1"
Dictionaries = "0.3, 0.4"
DynamicQuantities = "1"
FileIO = "1.1"
GLM = "1.9.1"
GLM = "1.9.2"
GeoInterface = "1"
GeometryBasics = "0.4.1, 0.5"
GridLayoutBase = "0.6, 0.7, 0.8, 0.9, 0.10, 0.11"
Expand Down
28 changes: 28 additions & 0 deletions docs/src/reference/analyses.md
Original file line number Diff line number Diff line change
Expand Up @@ -152,6 +152,34 @@ specs = data(df) * mapping(:x, :y, color=:a => nonnumeric) * (
draw(specs)
```

Below are a few examples showing the [`linear`](@ref) interface for performing weighted fits:

```@example analyses
using Random
Random.seed!(111)
colors = Makie.wong_colors()
df = let
x = 1:5
y_err = randn(length(x))
y = x .+ y_err
y_unc = abs.(y_err)
(; x, y, y_unc)
end
specs = data(df) * mapping(:x, :y) * (
(visual(Scatter) + mapping(:y_unc) * visual(Errorbars)) * visual(; label = "data") +
mapping(; weights = :y_unc) * (
linear(; weighttype = :fweights, weighttransform = x -> inv.(x .^ 2)) * visual(; color = colors[1], label = "fweights")
Comment on lines +170 to +171

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is weighttype really needed? The whole idea behind these AbstractWeights types in StatsBase is that you only specify once the kind of weights you have when building the vector, and then all functions automatically interpret it correctly. We don't have any API currently that takes the kind of weight separately.

Copy link
Contributor Author

@icweaver icweaver Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look, Milan! I hear ya, and I as on the fence on if it would be better to have the user explicitly construct the weight vector themselves before passing it to AoG, or to just have AoG handle it internally. I decided to go with the latter for ease of use.

Currently, this just wraps the specified field from the table-like data passed to AoG in fweights (or to potentially more supported weight kinds after GLM v2 is released). In either case, the weights are still only specified once, no?

Maybe calling it weighttype is misleading since it's really just wrapping our array in a convenience function call. I think I called it weightkind in an earlier iteration, and could switch back to that if this design makes sense? Sorry if I am misunderstanding what you are asking.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that imagine I'm loading a dataset in df. I immediately do something like df.y_unc = aweights(df.y_unc). Then I can create one or more graphs, estimate models, etc., without ever saying again that these are aweights, and everything works automatically. And this is necessary to use weights with StatsBase or GLM as they require an AbstractWeights type. Introducing a new keyword argument which differs from the method used in other packages actually makes things more complex IMO.

As @jkrumbiegel said (https://github.com/MakieOrg/AlgebraOfGraphics.jl/pull/710/changes#r2685551356) it seems better to adopt the API of underlying packages as much as possible.
without introducing new API in AoG.

Copy link
Contributor Author

@icweaver icweaver Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so your strategy is totally what I would do if I was using vanilla Makie.

In contrast, it's my understanding that the current usage with AoG makes all calls to StatsBase and GLM internally so that the user only needs to pass the raw data without any pre-processing or explicit loading of additional packages needed. It seems this would need to change with your proposed usage, so I guess my question is if this is the direction that we would like to go in this PR.

At any rate, happy to go with either route, was just asking for clarity


# Other weights available once GLM v2 is released
#linear(; weighttype = :aweights) * visual(; color = colors[2], label = "aweights")
#linear(; weighttype = :pweights) * visual(; color = colors[3], label = "pweights")
)
)
draw(specs)
```

Note that the usual caveats still apply for working with different kinds of weights. See the [StatsBase.jl documentation](https://juliastats.org/StatsBase.jl/stable/weights/) for more.

## Smoothing

```@docs
Expand Down
46 changes: 41 additions & 5 deletions src/transformations/linear.jl
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,9 @@ Base.@kwdef struct LinearAnalysis{I}
dropcollinear::Bool = false
interval::I = automatic
level::Float64 = 0.95
weighttype::Symbol = :fweights
weighttransform = identity
distr::GLM.Distribution = GLM.Normal()
end

function add_intercept_column(x::AbstractVector{T}) where {T}
Expand All @@ -12,15 +15,43 @@ function add_intercept_column(x::AbstractVector{T}) where {T}
return mat
end

function get_weighttype(s::Symbol)
weighttype = if s == :fweights
StatsBase.fweights
else
throw(ArgumentError("Currently, GLM.jl only supports `StatsBase.fweights`."))
end

# TODO: Can support these weights as well after GLM v2 is released
# https://github.com/JuliaStats/GLM.jl/pull/619
#weighttype = if s == :aweights
# StatsBase.aweights
#elseif s == :pweights
# StatsBase.pweights
#elseif s == :fweights
# StatsBase.fweights
#else
# throw(ArgumentError("Currently, GLM.jl only supports `aweights`, `pweights`, and `fweights`."))
#end

return weighttype
end

# TODO: add multidimensional version
function (l::LinearAnalysis)(input::ProcessedLayer)
output = map(input) do p, n
x, y = p
weights = StatsBase.fweights(get(n, :weights, similar(x, 0)))
default_interval = length(weights) > 0 ? nothing : :confidence
interval = l.interval === automatic ? default_interval : l.interval
weights = (get_weighttype(l.weighttype) ∘ l.weighttransform)(get(n, :weights, similar(x, 0)))
interval = l.interval === automatic ? :confidence : l.interval
# FIXME: handle collinear case gracefully
lin_model = GLM.lm(add_intercept_column(x), y; weights, l.dropcollinear)
# TODO: `wts` --> `weights` after GLM v2 is released
# https://github.com/JuliaStats/GLM.jl/pull/631
lin_model = if isempty(weights)
GLM.lm(add_intercept_column(x), y; l.dropcollinear)
else
# Supports confidence intervals, while `GLM.lm` currently does not
GLM.glm(add_intercept_column(x), y, l.distr; weights, l.dropcollinear)
end
x̂ = range(extrema(x)..., length = l.npoints)
pred = GLM.predict(lin_model, add_intercept_column(x̂); interval, l.level)
return if !isnothing(interval)
Expand Down Expand Up @@ -50,7 +81,7 @@ function (l::LinearAnalysis)(input::ProcessedLayer)
end

"""
linear(; interval=automatic, level=0.95, dropcollinear=false, npoints=200)
linear(; interval=automatic, level=0.95, dropcollinear=false, npoints=200, weighttype=:fweights, weighttransform=identity, distr=GLM.Normal())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you allow passing distr, it would make sense to allow passing link too. Most distributions other than Normal require a link other than identity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh cool, didn't know about link, thanks! I'm wondering now if it might make more sense to start storing these GLM.glm-specific kwargs in their own struct that's then passed to AoG.linear instead of duplicating individual args in AoG.LinearAnalysis internally as I am currently doing. I guess it depends on how much additional API surface we want to introduce. Happy to go with either route depending on what Julius prefers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't closely examine this but in general, I like it better if options for a different package are pass-through (like kwargs...), so AoG doesn't really add API on its own. But if it's just really cumbersome to use that way, that's an argument for making special AoG API. After all, linear() is still supposed to be rather simple to use.

Copy link

@nalimilan nalimilan Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. One point to note though is that distribution and link are positional arguments in GLM, so either they would have to be passed as positional arguments to linear, or they would need special-casing.


Compute a linear fit of `y ~ 1 + x`. An optional named mapping `weights` determines the weights.
Use `interval` to specify what type of interval the shaded band should represent,
Expand All @@ -63,6 +94,11 @@ it is possible to set `dropcollinear=true`.
`npoints` is the number of points used by Makie to draw the shaded band.

Weighted data is supported via the keyword `weights` (passed to `mapping`).
Additional weight support is provided via the `weighttype`, `weighttransform`, and `distr` keywords.
`weightype` specifies the `StatsBase.AbstractWeights` type to use.
`weighttransform` accepts an optional function to transform the weights before they are passed to `GLM.glm`.
`distr` is forwarded to `GLM.glm`.
See the GLM.jl documentation for more on working with weighted data.

This transformation creates two `ProcessedLayer`s labelled `:prediction` and `:ci`, which can be styled separately with `[subvisual](@ref)`.
"""
Expand Down
Loading