Add support for NamedArray #65

Open
s-celles opened this issue Nov 17, 2017 · 22 comments

@s-celles

Hello,

I'm using the freqtable function from FreqTables.jl.
This function outputs NamedArray objects.
Maybe it could be a good idea to add support for NamedArrays into IterableTables.

Kind regards

@davidanthoff davidanthoff added this to the Backlog milestone Nov 17, 2017
@davidanthoff
Member

Do these have a table structure, though? I'm just not sure...

@s-celles
Author

If the dimension is no more than 2, I think so.
If the dimension is greater, then an error should be raised.
Your comment is very interesting, as it shows that there is room for a project (comparable to yours) that could deal with converting n-dimensional data structures such as NamedArray or IndexedTables currently

@davidanthoff
Member

Interesting, interesting... I think the general pattern here might be some way to get rows out of a matrix, both a normal and a named matrix. For Query.jl integration it would be nice if these rows were tuples and named tuples respectively. But I'm not sure that is in general a good idea, I'm not sure whether very large tuples (if there are very many rows) work well...

It would have to be a special function, in any case, something like rows. I'll have to think a bit more about that scenario.

@s-celles
Author

See davidavdav/NamedArrays.jl#55 about implementing an Iterator of NamedTuples for NamedArrays

@s-celles
Author

Pinging @davidavdav

@davidavdav

You rang?

I am a bit out of context, I've tried to read up to the references, but it is not clear to me what an iterator of named tuples would do with a NamedArray. Do you want to be able to iterate over a dimension on a NamedArray, thereby getting a tuple of the name and the array slice of one dimension lower, and do you want this to be of type Iterator for NamedTuples?

@s-celles
Author

I think we should first tackle NamedArrays of dimension 2 and see how such a NamedArray can be converted to a DataFrame (or any table-like data structure that IterableTables deals with as a sink).
IterableTables can easily use as a source any iterator that produces elements of type NamedTuple.

The problem of dimensions greater than 2 can't be handled in this project (I think... or at least not for now).

@nalimilan

In R, tables of arbitrary dimensionality can be converted to data frames. Each dimension gets transformed into a column, and an additional column holds the values of the array entries.

@davidanthoff
Member

@nalimilan so all the columns except the last one would hold the indices of that respective dimension? For example, this matrix:

1  2  3
4  5  6
7  8  9

Would be transformed into this table:

Dim1 Dim2 Value
1 1 1
1 2 2
1 3 3
2 1 4
2 2 5
2 3 6
3 1 7
3 2 8
3 3 9

? I think that might be a really good general solution. It would still be nice if there was some easy way to handle a matrix differently, i.e. keep the table structure of the matrix, but that could be done by, say, a row function.

I guess this also somehow interacts with the idea of how associatives are handled. This would essentially treat an array as an associative, with the dimension indices as the key. I didn't follow that debate in detail, though...

@nalimilan

Yes, that's it, though it's even clearer when there are dimension names rather than indices.
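
As a rough illustration of the conversion discussed above, here is a minimal sketch in plain Julia. It assumes Julia 1.x and uses only Base; `melt` is a hypothetical helper name and is not part of IterableTables or NamedArrays. It turns an array of any dimension into a vector of named tuples with one column per dimension plus a Value column, sorted so that the first dimension varies slowest, as in the table above.

```julia
# Hypothetical helper (not an existing API): "melt" an N-dimensional array
# into rows of the form (Dim1 = i, Dim2 = j, ..., Value = a[i, j, ...]).
function melt(a::AbstractArray{T,N}) where {T,N}
    # Column names Dim1, ..., DimN followed by Value.
    colnames = (ntuple(d -> Symbol("Dim", d), N)..., :Value)
    # One named tuple per array entry, visited in column-major order.
    rows = [NamedTuple{colnames}((Tuple(I)..., a[I])) for I in vec(CartesianIndices(a))]
    # Sort by the index columns so that Dim1 varies slowest.
    return sort(rows; by = r -> Tuple(r)[1:N])
end

a = [1 2 3; 4 5 6; 7 8 9]
rows = melt(a)
foreach(println, rows)
```

Because the result is just an iterator of named tuples, it is the kind of source that the thread says IterableTables can consume. With dimension names instead of integer indices, the same idea would apply to a NamedArray.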

@davidavdav

It looks like an R-style NamedArray -> DataFrame export would generally be beneficial.

@s-celles
Author

If we can have NamedArray export to IndexedTables (and vice versa) it will be great.
Adding support of NamedArray to IterableTables will help to achieve this goal.

@davidavdav

So you would want to treat 0s in the NamedArray specially for IndexedTables, i.e., leave these entries out? That makes sense for FreqTable output, but for a NamedArray 0 is type-specific and otherwise not very special.

@nalimilan

I agree zeros should not be dropped.

@davidavdav

davidavdav commented Nov 20, 2017

How would this do for you?

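using IndexedTables
using NamedArrays  # provides NamedArray, names, dimnames and array

import IndexedTables.IndexedTable

# Convert a NamedArray of any dimension into a "long" IndexedTable:
# one index column per dimension, with the array entries as the data column.
function IndexedTable(n::NamedArray)
	L = length(n) # elements in array
	cols = Dict{Symbol, Array}()
	factor = 1 # product of the sizes of the dimensions already processed
	for d in 1:ndims(n)
		nlevels = size(n, d)
		nrep = L ÷ (nlevels * factor)
		# each name is repeated `factor` times and the whole pattern `nrep` times,
		# which matches the column-major order of array(n)[:]
		data = repmat(vcat([fill(x, factor) for x in names(n, d)]...), nrep)
		cols[Symbol(dimnames(n, d))] = data
		factor *= nlevels
	end
	# note: a Dict does not guarantee that the columns keep the dimension order
	return IndexedTable(Columns(;cols...), array(n)[:])
end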

@s-celles
Author

s-celles commented Nov 20, 2017

Two behaviours could be considered when converting a NamedArray to IndexedTables:

  1. Treat "0" (or any other value) as a special value that doesn't need to be reported to the IndexedTable (because this kind of data structure was specially designed to deal with sparse data)
  2. Treat all values in the same way (maybe this should be the default)

@davidavdav

The simple implementation above would not be very efficient, memory-wise, for very large and very sparse tables, if we filter out the 0s afterwards. Anyway, the repmat(vcat([fill(...)]) is probably not the most efficient.
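
For what it is worth, the repeat pattern for the index columns can be written without the intermediate vcat and fill calls. The following is only a sketch under some assumptions: it targets Julia 1.x (where repmat no longer exists and repeat with the inner and outer keywords takes its place), it works on plain vectors of level names rather than on a NamedArray, and `dimension_columns` is a hypothetical name. It returns an ordered vector of pairs instead of a Dict, so the columns keep the dimension order.

```julia
# Hypothetical helper: build one "long" index column per dimension.
# `dimnames` is a tuple of Symbols, `levels` a tuple of name vectors.
function dimension_columns(dimnames, levels)
    total = prod(length, levels)   # number of cells in the array
    factor = 1                     # product of the sizes of earlier dimensions
    cols = Pair{Symbol,Vector}[]   # ordered, unlike a Dict
    for (name, lev) in zip(dimnames, levels)
        nrep = total ÷ (length(lev) * factor)
        # Each level is repeated `factor` times, the whole pattern `nrep` times,
        # which follows the column-major layout of the flattened array.
        push!(cols, name => repeat(lev; inner = factor, outer = nrep))
        factor *= length(lev)
    end
    return cols
end

cols = dimension_columns((:A, :B), (["a", "b"], ["x", "y", "z"]))
foreach(println, cols)
```

This does not address the memory concern for very sparse tables, since every cell still gets a row. It only tidies up the construction of the columns.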

@s-celles
Author

s-celles commented Nov 21, 2017

With the aim of filtering out some values, we should probably accept an anonymous function instead of a given value such as "0"

x -> x == 0

@davidanthoff
Member

There are two issues here, right? How to convert something to an IndexedTable, and how to convert something to just any table. Only the latter interacts with iterable tables at this point.

@s-celles
Author

s-celles commented Nov 21, 2017

Not sure if there are really two issues here, in fact...
Your comment #65 (comment) shows that, if we extend your example to a 3-dimensional named array (or more), you can even transform it to a two-dimensional array and so to a table (and also to IndexedTables, as it's a sink that is currently supported by IterableTables)

@s-celles
Author

s-celles commented Nov 21, 2017

On the other hand... there is an issue with IndexedTables output from IterableTables not being able to filter out values in order to keep the sparse nature of IndexedTables

julia> using IterableTables

julia> using IndexedTables

julia> a=[0 0 1 0;2 0 3 0;0 0 5 0;2 0 0 1]
4×4 Array{Int64,2}:
 0  0  1  0
 2  0  3  0
 0  0  5  0
 2  0  0  1

julia> IndexedTable(a)
─────┬──
1  1 │ 0
1  2 │ 0
1  3 │ 1
1  4 │ 0
2  1 │ 2
2  2 │ 0
2  3 │ 3
2  4 │ 0
3  1 │ 0
3  2 │ 0
3  3 │ 5
3  4 │ 0
4  1 │ 2
4  2 │ 0
4  3 │ 0
4  4 │ 1

we could expect an api like

julia> IndexedTable(a, x -> x == 0)
─────┬──
1  3 │ 1
2  1 │ 2
2  3 │ 3
3  3 │ 5
4  1 │ 2
4  4 │ 1

if we want the anonymous function to define which values to filter out

or

julia> IndexedTable(a, x -> x != 0)

if we want the anonymous function to define which values we want to keep

So in this case... this is clearly another issue

Issue opened at JuliaData/IndexedTables.jl#91
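
To make the two predicate conventions concrete, here is a small sketch that does not use IndexedTables at all. It assumes Julia 1.x and Base only, and `kept_entries` is a hypothetical helper, not an existing or proposed API. It collects (indices => value) pairs for the entries that a "keep" predicate accepts. A "filter out" predicate is simply the negation of a "keep" predicate.

```julia
# Hypothetical helper: list the entries of `a` for which `keep(value)` is true,
# as (index tuple) => value pairs sorted by index (first dimension slowest).
function kept_entries(a::AbstractArray, keep)
    entries = [Tuple(I) => a[I] for I in CartesianIndices(a) if keep(a[I])]
    return sort!(entries; by = first)
end

a = [0 0 1 0; 2 0 3 0; 0 0 5 0; 2 0 0 1]

# "Keep" convention: the predicate selects the values to retain.
kept = kept_entries(a, x -> x != 0)

# "Filter out" convention: the predicate selects the values to drop,
# so it is negated before being used as a keep predicate.
dropzero = x -> x == 0
kept_too = kept_entries(a, !dropzero)

foreach(println, kept)
```

Both calls give the same six entries as the expected output shown above. Which convention reads better in an API is the design question raised in this comment.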

@s-celles
Author

s-celles commented Dec 8, 2017

Thanks to @davidavdav's commit davidavdav/NamedArrays.jl@5b8205f, a NamedArray of any dimension can now be flattened (returning a flattened NamedArray), as shown in #65 (comment)

julia> using NamedArrays
julia> srand(1234);

julia> n=NamedArray(rand(2,4,3))
2×4×3 Named Array{Float64,3}

[:, :, C=1] =
A ╲ B │        1         2         3         4
──────┼───────────────────────────────────────
1     │ 0.590845  0.566237  0.794026  0.200586
2     │ 0.766797  0.460085  0.854147  0.298614

[:, :, C=2] =
A ╲ B │         1          2          3          4
──────┼───────────────────────────────────────────
1     │  0.246837   0.648882   0.066423   0.646691
2     │  0.579672  0.0109059   0.956753   0.112486

[:, :, C=3] =
A ╲ B │         1          2          3          4
──────┼───────────────────────────────────────────
1     │  0.276021  0.0566425   0.950498   0.945775
2     │  0.651664   0.842714    0.96467   0.789904

julia> n[:]
24-element Named Array{Float64,1}
(:A, :B, :C)    │
────────────────┼──────────
("1", "1", "1") │  0.590845
("2", "1", "1") │  0.766797
("1", "2", "1") │  0.566237
("2", "2", "1") │  0.460085
("1", "3", "1") │  0.794026
("2", "3", "1") │  0.854147
("1", "4", "1") │  0.200586
("2", "4", "1") │  0.298614
("1", "1", "2") │  0.246837
("2", "1", "2") │  0.579672
("1", "2", "2") │  0.648882
("2", "2", "2") │ 0.0109059
("1", "3", "2") │  0.066423
("2", "3", "2") │  0.956753
("1", "4", "2") │  0.646691
("2", "4", "2") │  0.112486
("1", "1", "3") │  0.276021
("2", "1", "3") │  0.651664
("1", "2", "3") │ 0.0566425
("2", "2", "3") │  0.842714
("1", "3", "3") │  0.950498
("2", "3", "3") │   0.96467
("1", "4", "3") │  0.945775
("2", "4", "3") │  0.789904
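
The order of the name tuples in the flattened output follows column-major traversal, with the first dimension varying fastest. As a small sketch that does not depend on NamedArrays (assuming Julia 1.x and Base only, with the level names copied from the example above), the same sequence of keys can be produced with Iterators.product:

```julia
# Level names per dimension, as in the 2×4×3 example above.
levels_A = ["1", "2"]
levels_B = ["1", "2", "3", "4"]
levels_C = ["1", "2", "3"]

# Iterators.product varies its first argument fastest, which matches the
# column-major order in which an array is flattened by n[:].
flat_names = vec(collect(Iterators.product(levels_A, levels_B, levels_C)))

foreach(println, flat_names[1:4])
```

Pairing each of these tuples with the corresponding flattened value gives the (A, B, C, value) rows that a table sink would need.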
