Question: How to convert the obsm layer Arrow format to long tables #3570

Open
ddemaeyer opened this issue Jan 16, 2025 · 3 comments
@ddemaeyer

I have a question about converting SparseNDArrays that are read in the Arrow format.

I converted an h5ad file to tiledbsoma and am retrieving the X_tsne field with

experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat()

This results in the following Arrow table:

pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: double

In essence, soma_dim_0 holds the row index, soma_dim_1 holds the column index, and soma_data holds the value at that position.

Now my question is how to convert this to a table with one column per coordinate:

x: double
y: double

Sorry for my inexperience. This is probably just an Arrow mapping operation, but I can't get my head around it.

Thx in advance

@johnkerl johnkerl self-assigned this Jan 16, 2025
@johnkerl
Member

Hello @ddemaeyer !

Here are some options:

>>> tsne = experiment.ms["RNA"].obsm["X_tsne"]
>>> table = tsne.read().tables().concat()
>>> coo = tsne.read().coos().concat()
>>> type(table)
<class 'pyarrow.lib.Table'>
>>> type(coo)
<class 'pyarrow.lib.SparseCOOTensor'>
>>> table
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: double
----
soma_dim_0: [[0,0,1,1,2,...,2635,2636,2636,2637,2637]]
soma_dim_1: [[0,1,0,1,0,...,1,0,1,0,1]]
soma_data: [[13.15872093772412,7.006184998533557,37.95347414965122,6.560701668986042,-2.5593245320510536,...,-0.818536313824123,36.08706869759458,-3.090926935634254,-2.9450132420802992,9.521005512302525]]
>>> coo
<pyarrow.SparseCOOTensor>
type: double
shape: (2638, 2)
>>> table.to_pandas()
      soma_dim_0  soma_dim_1  soma_data
0              0           0  13.158721
1              0           1   7.006185
2              1           0  37.953474
3              1           1   6.560702
4              2           0  -2.559325
...          ...         ...        ...
5271        2635           1  -0.818536
5272        2636           0  36.087069
5273        2636           1  -3.090927
5274        2637           0  -2.945013
5275        2637           1   9.521006

[5276 rows x 3 columns]
>>> n = coo.to_numpy()

>>> type(n)
<class 'tuple'>

>>> n[0]
array([[13.15872094],
       [ 7.006185  ],
       [37.95347415],
       ...,
       [-3.09092694],
       [-2.94501324],
       [ 9.52100551]])

>>> n[1]
array([[   0,    0],
       [   0,    1],
       [   1,    0],
       ...,
       [2636,    1],
       [2637,    0],
       [2637,    1]])
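
If a dense layout is the goal, the `(data, coords)` pair shown above can be scattered into a two-column array. This is only a minimal sketch: the small `data` and `coords` arrays below are hand-built stand-ins with the same layout as `n[0]` and `n[1]`, and the shape is inferred from the largest indices, which is an assumption (with a real tensor the shape reported by `coo` would be the safer source).

```python
import numpy as np

# Hand-built stand-ins for n[0] (values) and n[1] (row/column indices).
data = np.array([[13.16], [7.01], [37.95], [6.56], [-2.56], [-3.57]])
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])

# Assumption: infer the shape from the largest row and column index.
n_rows = int(coords[:, 0].max()) + 1
n_cols = int(coords[:, 1].max()) + 1

# Start from NaN so that any coordinate missing from the sparse data stays visible.
dense = np.full((n_rows, n_cols), np.nan)
dense[coords[:, 0], coords[:, 1]] = data[:, 0]

# Column 0 is x, column 1 is y; the row position is the soma_dim_0 index.
x, y = dense[:, 0], dense[:, 1]
```

Here row `i` of `dense` corresponds to `soma_dim_0 == i`, so no join is needed to line up x and y.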

Is this helpful?

@johnkerl johnkerl changed the title QUESTION: how to convert the obsm layer arrow format to long tables Question: How to convert the obsm layer Arrow format to long tables Jan 16, 2025
@ddemaeyer
Author

Hey John, thx for the swift reply. My question is more about reformatting the data into x and y coordinates. I gave it a try and landed on something like:

import pyarrow.compute as pc

select_x = pc.field('soma_dim_1') == 0
select_y = pc.field('soma_dim_1') == 1

experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() \
    .filter(select_x).select(["soma_dim_0","soma_data"]) \
    .rename_columns({"soma_data" : "x"}) \
    .join(
        experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() \
            .filter(select_y).select(["soma_dim_0","soma_data"]) \
            .rename_columns({"soma_data" : "y"}) ,
        keys = ["soma_dim_0"])

Which results in

pyarrow.Table
soma_dim_0: int64
x: double
y: double
----
soma_dim_0: [[0,1,2,3,4,...,2633,2634,2635,2636,2637]]
x: [[13.15872093772412,37.95347414965122,-2.5593245320510536,-32.37378795579511,7.196922105222845,...,-21.379158716529204,34.4453763599561,32.12078558289555,36.08706869759458,-2.9450132420802992]]
y: [[7.006184998533557,6.560701668986042,-3.5671152535833204,-2.0713730645252943,-27.119968344164352,...,-6.2204168512079,3.7516192915535673,-0.818536313824123,-3.090926935634254,9.521005512302525]]

Is this the best way to do this?

@johnkerl
Member

@ddemaeyer interesting! :)

This makes sense to me -- I've not used pyarrow's renaming before.

One optimization I can think of is in

experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() \
    .filter(select_x).select(["soma_dim_0","soma_data"]) \
    .rename_columns({"soma_data" : "x"}) \
    .join(
        experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() \
            .filter(select_y).select(["soma_dim_0","soma_data"]) \
            .rename_columns({"soma_data" : "y"}) ,
        keys = ["soma_dim_0"])

you have the experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() twice -- maybe compute that once, put it in a variable, and reuse it.
