Question: How to convert the `obsm` layer Arrow format to long tables #3570

ddemaeyer · 2025-01-16T13:10:45Z

I have a question about how to convert SparseNDArrays that are read in Arrow format.

I converted an h5ad file to tiledbsoma and am retrieving the X_tsne field with

experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat()

this results in the following arrow table

pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: double

In essence, soma_dim_0 holds the row index, soma_dim_1 holds the column index, and soma_data holds the value at that position.
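For illustration, here is a minimal sketch of how a dense two-column embedding maps onto this long layout. It assumes NumPy, and the values are made up rather than taken from the dataset.

```python
import numpy as np

# Hypothetical 3-cell, 2-component embedding (illustrative values only).
dense = np.array([
    [13.2, 7.0],
    [38.0, 6.6],
    [-2.6, -3.6],
])

# np.indices gives the row index and column index of every entry.
rows, cols = np.indices(dense.shape)

# Flattening in row-major order yields the three "long" columns.
soma_dim_0 = rows.ravel()   # row index of each value
soma_dim_1 = cols.ravel()   # column index: 0 for x, 1 for y
soma_data = dense.ravel()   # the value itself

for r, c, v in zip(soma_dim_0, soma_dim_1, soma_data):
    print(r, c, v)
```

Each cell therefore contributes two rows to the long table, one per embedding component.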

My question is how to convert this into a table of the form

x: double
y: double

Sorry for my inexperience; this is probably just an Arrow mapping operation, but I can't get my head around it.

Thx in advance

johnkerl · 2025-01-16T14:15:31Z

Hello @ddemaeyer !

Here are some options:

>>> tsne = experiment.ms["RNA"].obsm["X_tsne"]

>>> table = tsne.read().tables().concat()
>>> coo = tsne.read().coos().concat()

>>> type(table)
<class 'pyarrow.lib.Table'>

>>> type(coo)
<class 'pyarrow.lib.SparseCOOTensor'>

>>> table
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: double
----
soma_dim_0: [[0,0,1,1,2,...,2635,2636,2636,2637,2637]]
soma_dim_1: [[0,1,0,1,0,...,1,0,1,0,1]]
soma_data: [[13.15872093772412,7.006184998533557,37.95347414965122,6.560701668986042,-2.5593245320510536,...,-0.818536313824123,36.08706869759458,-3.090926935634254,-2.9450132420802992,9.521005512302525]]

>>> coo
<pyarrow.SparseCOOTensor>
type: double
shape: (2638, 2)

>>> table.to_pandas()
      soma_dim_0  soma_dim_1  soma_data
0              0           0  13.158721
1              0           1   7.006185
2              1           0  37.953474
3              1           1   6.560702
4              2           0  -2.559325
...          ...         ...        ...
5271        2635           1  -0.818536
5272        2636           0  36.087069
5273        2636           1  -3.090927
5274        2637           0  -2.945013
5275        2637           1   9.521006

[5276 rows x 3 columns]

>>> n = coo.to_numpy()

>>> type(n)
<class 'tuple'>

>>> n[0]
array([[13.15872094],
       [ 7.006185  ],
       [37.95347415],
       ...,
       [-3.09092694],
       [-2.94501324],
       [ 9.52100551]])

>>> n[1]
array([[   0,    0],
       [   0,    1],
       [   1,    0],
       ...,
       [2636,    1],
       [2637,    0],
       [2637,    1]])
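If the goal is separate x and y arrays, one possible route from the `(data, coords)` pair is to scatter the values into a dense array. The following is only a sketch: `data` and `coords` are small made-up stand-ins shaped like `n[0]` and `n[1]` above, and it assumes the embedding has exactly two columns.

```python
import numpy as np

# Hypothetical stand-ins for n[0] (values) and n[1] (coordinates).
data = np.array([[13.2], [7.0], [38.0], [6.6], [-2.6], [-3.6]])
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])

# Allocate a dense (n_rows, 2) array; NaN marks any entry that is absent
# from the sparse data.
n_rows = coords[:, 0].max() + 1
dense = np.full((n_rows, 2), np.nan)

# Scatter each value to its (row, column) position.
dense[coords[:, 0], coords[:, 1]] = data[:, 0]

# Column 0 is x, column 1 is y.
x, y = dense[:, 0], dense[:, 1]
print(x)
print(y)
```

Scattering by coordinate, rather than reshaping `data` directly, does not depend on the order in which the entries are returned.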

Is this helpful?

ddemaeyer · 2025-01-16T16:50:05Z

Hey John, thanks for the swift reply. My question is more about reformatting the data into x and y coordinates. I gave it a try and landed on something like:

import pyarrow.compute as pc

select_x = pc.field('soma_dim_1') == 0
select_y = pc.field('soma_dim_1') == 1

experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() \
    .filter(select_x).select(["soma_dim_0","soma_data"]) \
    .rename_columns({"soma_data" : "x"}) \
    .join(
        experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() \
            .filter(select_y).select(["soma_dim_0","soma_data"]) \
            .rename_columns({"soma_data" : "y"}) ,
        keys = ["soma_dim_0"])

Which results in

pyarrow.Table
soma_dim_0: int64
x: double
y: double
----
soma_dim_0: [[0,1,2,3,4,...,2633,2634,2635,2636,2637]]
x: [[13.15872093772412,37.95347414965122,-2.5593245320510536,-32.37378795579511,7.196922105222845,...,-21.379158716529204,34.4453763599561,32.12078558289555,36.08706869759458,-2.9450132420802992]]
y: [[7.006184998533557,6.560701668986042,-3.5671152535833204,-2.0713730645252943,-27.119968344164352,...,-6.2204168512079,3.7516192915535673,-0.818536313824123,-3.090926935634254,9.521005512302525]]

Is this the best way to do this?

johnkerl · 2025-01-16T17:10:10Z

@ddemaeyer interesting! :)

This makes sense to me -- I've not used pyarrow's renaming before.

One optimization I can think of is in

experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() \
    .filter(select_x).select(["soma_dim_0","soma_data"]) \
    .rename_columns({"soma_data" : "x"}) \
    .join(
        experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() \
            .filter(select_y).select(["soma_dim_0","soma_data"]) \
            .rename_columns({"soma_data" : "y"}) ,
        keys = ["soma_dim_0"])

you have the experiment.ms["RNA"].obsm["X_tsne"].read().tables().concat() twice -- maybe compute that once and put it in a variable and re-use it ...

johnkerl self-assigned this Jan 16, 2025

johnkerl changed the title ~~QUESTION: how to convert the obsm layer arrow format to long tables~~ Question: How to convert the obsm layer Arrow format to long tables Jan 16, 2025
