Skip to content

Consider shrinking list of dependencies #50

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
juliohm opened this issue Apr 16, 2025 · 5 comments
Open

Consider shrinking list of dependencies #50

juliohm opened this issue Apr 16, 2025 · 5 comments

Comments

@juliohm
Copy link

juliohm commented Apr 16, 2025

This package can be useful in other contexts without GPUs nor neural nets. Could you consider moving some of the dependencies to extensions? Below is the current list of hard dependencies:

[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
Compat = "34da2185-b29b-5c13-b0c7-acf172513d20"
GPUArraysCore = "46192b85-c4d5-4398-a991-12ede77f4527"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
@mcabbott
Copy link
Member

The problem is that the principle thing a OneHotArray is good for is that multiplying from the left with a matrix can be done efficiently... and that calls NNlib.scatter:

function Base.:(*)(A::AbstractMatrix, B::OneHotLike{<:Any, 1})
_isonehot(B) || return invoke(*, Tuple{AbstractMatrix, AbstractMatrix}, A, B)
size(A, 2) == size(B, 1) || throw(DimensionMismatch("Matrix column must correspond with OneHot size: $(size(A, 2)) != $(size(B, 1))"))
return NNlib.gather(A, _indices(B))
end

It this were moved to an extension, what should happen when I call rand(1:99, 4, 3) * onehotbatch([1,1,1], 1:3) without loading NNlib?

@juliohm
Copy link
Author

juliohm commented Apr 16, 2025 via email

@juliohm
Copy link
Author

juliohm commented Apr 17, 2025

In particular, I need a simple OneHotMatrix implementation to store a huge dataset of categorical values. Say I have 10^6 values with 5 levels. A naive Matrix consumes 5 x 10^6 bools. The ideal OneHotMatrix would wrap the original vector of categorical values and would provide getindex and setindex! in terms of the one-hot encoding.

Example:

using CategoricalArrays

categs = categorical([:a, :b, :b, :c, :d, :e])
onehot = onehotmatrix(categs)

onehot[1,1] # true
onehot[2,1] # false
onehot[3,1] # false
...

onehot[1,2] # false
onehot[2,2] # true
onehot[3,2] # false
...

onehot[2,1] = true # equivalent to setting categs[1] = :a

Is this usage in the scope of OneHotArrays.jl? Should this onehotmatrix (and onecoldmatrix) function be implemented in CategoricalArrays.jl instead?

In my current understanding, I think it would be better to move all functionality to CategoricalArrays.jl, then add NNlib and GPU-related packages as extensions for the fast matrix multiplication. Is there a feature in OneHotArrays.jl that justifies the existence of both packages?

@mcabbott
Copy link
Member

mcabbott commented Apr 21, 2025

I have never used CategoricalArrays, but it seems to be backed by a Vector{UInt32} which is exactly what OneHotArray likes too:

julia> using CategoricalArrays, OneHotArrays

julia> categs = categorical(string.([:a, :b, :b, :c, :d, :e]));  # error says Symbol no longer supported

julia> categs.refs
6-element Vector{UInt32}:
 0x00000001
 0x00000002
 0x00000002
 0x00000003
 0x00000004
 0x00000005

julia> oh = OneHotMatrix(categs.refs, Int(maximum(categs.refs)))
5×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 1          
   1  1      
       1    
         1  
           1

julia> oh[1,1], oh[2,1]
(true, false)

julia> oh[2,1] = true  # with PR 51 here
true

julia> categs  # has been updated too
6-element CategoricalArray{String,1,UInt32}:
 "b"
 "b"
 "b"
 "c"
 "d"
 "e"

For setindex! via #51 to mutate the CategoricalArray too, it matters that the OneHotMatrix constructor wraps the same Vector{UInt32}. That won't happen with oh2 = onehotbatch(categs, unique(categs)) below...

...and in general this package hasn't thought carefully about when it does and doesn't preserve such references, as everything was immutable anyway. Maybe it's always true that uppercase OneHotMatrix just wraps without copying?

julia> oh2 = onehotbatch(categs, unique(categs));

julia> oh2[3,1] = true;

julia> oh2
4×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
   1  1      
       1    
 1        1  
           1

julia> categs  # not changed
6-element CategoricalArray{String,1,UInt32}:
 "b"
 "b"
 "b"
 "c"
 "d"
 "e"

That said I don't see a reason to regard this as out-of-scope. A method OneHotArray(:: CategoricalArray) could certainly live in an extension here. It will forget categs.pool, I hope that seems OK. Edit, xref #45

@mcabbott
Copy link
Member

mcabbott commented May 2, 2025

See #54 for a start on CategoricalArrays support... and having package extensions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants