iterating [skip ci]
johnkerl committed Nov 7, 2024
1 parent 4a3a4c5 commit 5d38375
Showing 1 changed file with 101 additions and 116 deletions.
217 changes: 101 additions & 116 deletions apis/r/vignettes/soma-shapes.Rmd
@@ -65,7 +65,7 @@ Likewise, the N-dimensional arrays within the experiment have their shapes as we

There's an important difference: while the dataframe domain gives you the inclusive lower and upper bounds for `soma_joinid` writes, the shape for the N-dimensional arrays is the upper bound plus 1.

Since there are 2638 cells here and 1838 genes here, `X`'s shape reflects that.
Since there are 80 cells here and 230 genes here, `X`'s shape reflects that.
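
The relationship between an inclusive domain and a shape can be sketched in base R. This is only an illustration: `shape_from_domain` is a hypothetical helper, not part of the tiledbsoma API, and the domains below are assumed from the cell and gene counts stated above.

```r
# Hypothetical helper (not part of tiledbsoma): convert an inclusive
# (lower, upper) domain on a zero-based dimension into a shape extent.
shape_from_domain <- function(domain) {
  stopifnot(length(domain) == 2, domain[1] == 0)
  domain[2] + 1
}

# Assumed inclusive soma_joinid bounds for 80 cells and 230 genes
obs_domain <- c(0, 79)
var_domain <- c(0, 229)

# The shape of X is the upper bound plus 1 along each dimension
x_shape <- c(shape_from_domain(obs_domain), shape_from_domain(var_domain))
print(x_shape)
```

The helper insists on a lower bound of 0 because the "upper bound plus 1" rule only gives a count of slots when the dimension starts at zero.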

```{r}
obs <- exp$obs
@@ -118,139 +118,124 @@ In particular, the `X` array in this experiment --- and in most experiments ---
As a general rule you'll see the following:

* An `X` array's shape is `nobs` x `nvar`
* An `obsm` array's shape is `nobs` x some number, maybe 50
* An `obsm` array's shape is `nobs` x some number, maybe 20
* An `obsp` array's shape is `nobs` x `nobs`
* A `varm` array's shape is `nvar` x some number, maybe 50
* A `varm` array's shape is `nvar` x some number, maybe 20
* A `varp` array's shape is `nvar` x `nvar`

## How to upgrade older experiments
## Advanced usage: dataframes with non-standard index columns

Experiments created by TileDB-SOMA 1.15 and higher will look as shown above. Let's take a look at an experiment from before TileDB-SOMA 1.15.
In the SOMA data model, the `SOMASparseNDArray` and `SOMADenseNDArray` objects always have int64 dimensions named `soma_dim_0`, `soma_dim_1`, and so on, along with a numeric `soma_data` attribute for the contents of the array.

import tiledbsoma.io
import tarfile
import tempfile

uri <- tempfile.mktemp()
with tarfile.open("data/pbmc3k-sparse-pre-1.15.tgz") as handle:
handle.extractall(uri)
expold <- tiledbsoma.Experiment.open(uri)
This is the same PBMC3K data as above. Compare the old and new shapes:
```{r}
exp$ms$get("RNA")$X$get("data")$schema()
```

expold.obs$domain
((0, 9223372036854773758),)
expold.obs$maxdomain
((0, 9223372036854773758),)
expold.obs.tiledbsoma_has_upgraded_domain
False
[ expold.ms$get("RNA")$X$get("data")$shape, expold.ms$get("RNA")$X$get("data")$maxshape, expold.ms$get("RNA")$X$get("data").tiledbsoma_has_upgraded_shape ]
[(9223372036854773759, 9223372036854773759),
(9223372036854773759, 9223372036854773759),
False]
Note that for the pre-1.15 experiment, the shape is huge --- like the maxshape --- and tiledbsoma_has_upgraded_domain is False.
For dataframes, though, while there must be a `soma_joinid` column of type int64, you can have one or more other index columns in addition, or `soma_joinid` can be a non-index column.

To make the old experiment look like the new experiment, simply call upgrade_experiment_shapes, and re-open:
```{r}
exp$obs$schema()
```

tiledbsoma.io.upgrade_experiment_shapes(expold.uri)
True
expold <- tiledbsoma.open(expold.uri)
[ expold.ms$get("RNA")$X$get("data")$shape, expold.ms$get("RNA")$X$get("data")$maxshape, expold.ms$get("RNA")$X$get("data").tiledbsoma_has_upgraded_shape ]
[(2638, 1838), (9223372036854773759, 9223372036854773759), True]
Additionally, you can call tiledbsoma.io.show_experiment_shapes(expold.uri) before and after doing the upgrade.

To run a pre-check, you can do

tiledbsoma.io.upgrade_experiment_shapes(expold.uri, check_only=True)
This won't change anything --- it'll simply tell you if the operation will be possible.

Advanced usage: dataframes with non-standard index columns
In the SOMA data model, the SparseNDArray and DenseNDArray objects always have int64 dimensions named soma_dim_0, soma_dim_1, and up, and they have a numeric soma_data attribute for the contents of the array. Furthermore, this is always the case.

exp$ms$get("RNA")$X$get("data").schema
soma_dim_0: int64 not null
soma_dim_1: int64 not null
soma_data: float not null
For dataframes, though, while there must be a soma_joinid column of type int64, you can have one or more other index columns in addition, or soma_joinid can be a non-index column.

This means that in the default, simplest, and most common case, you can think of a dataframe as having a shape just as the N-dimensional arrays do.

exp$obs.schema
soma_joinid: int64 not null
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
exp$obs.index_column_names
('soma_joinid',)
But really, dataframes are capable of more than that, via the index-column names you specify at creation time.

Let's create a couple dataframes, with the same data, but different choices of index-column names.

sdfuri1 <- tempfile.mktemp()
sdfuri2 <- tempfile.mktemp()
```{r}
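# tempfile() returns a fresh, unique path on each call; tempdir() would
# return the same session directory both times
sdfuri1 <- tempfile()
sdfuri2 <- tempfile()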
```
import pyarrow as pa

schema <- pa.schema([
("soma_joinid", pa.int64()),
("mystring", pa.string()),
("myint", pa.int32()),
("myfloat", pa.float32()),
])

data <- pa.Table.from_pydict({
"soma_joinid": [0, 1],
"mystring": ["hello", "world"],
"myint": [33, 44],
"myfloat": [4.5, 5.5],
})
with tiledbsoma.DataFrame.create(
sdfuri1,
schema=schema,
index_column_names=["soma_joinid", "mystring"],
domain=[(0, 9), None],
) as sdf1:
sdf1.write(data)
Now let's look at the domain and maxdomain for these dataframes.

sdf1 <- tiledbsoma.DataFrame.open(sdfuri1)
sdf1.index_column_names
('soma_joinid', 'mystring')
Here we see the soma_joinid slot of the dataframe's domain is as requested.

Another point is that domain cannot be specified for string-type index columns.

You can set them at create one of two ways:

domain=[(0, 9), None],
```{r}
asch <- arrow::schema(
  arrow::field("soma_joinid", arrow::int64(), nullable = FALSE),
  arrow::field("mystring", arrow::large_utf8(), nullable = FALSE),
  arrow::field("myint", arrow::int32(), nullable = FALSE),
  arrow::field("myfloat", arrow::float32(), nullable = FALSE)
)
# soma_joinid values must lie within the (0, 9) domain requested for sdf1
soma_joinid <- bit64::as.integer64(c(0, 1))
mystring <- c("hello", "world")
myint <- c(33L, 44L)
myfloat <- c(4.5, 5.5)
# Pass the schema so the column types match the ones declared above
tbl <- arrow::arrow_table(
  soma_joinid = soma_joinid,
  mystring = mystring,
  myint = myint,
  myfloat = myfloat,
  schema = asch
)
```

```{r}
sdf1 <- SOMADataFrameCreate(
  sdfuri1,
  asch,
  index_column_names = c("soma_joinid", "mystring"),
  domain = list(soma_joinid = c(0, 9), mystring = NULL)
)
sdf1$write(tbl)
sdf1$close()
```

Now let's look at the `domain` and `maxdomain` for these dataframes.

```{r}
sdf1 <- SOMADataFrameOpen(sdfuri1)
sdf1$index_column_names()
```

Here we see the `soma_joinid` slot of the dataframe's domain is as requested.

Another point is that a domain cannot be specified for string-typed index columns. At creation time you can express this in one of two ways:

`domain = list(soma_joinid = c(0, 9), mystring = NULL)`

or

domain=[(0, 9), ('', '')],
and in either case the domain slot for a string-typed index column will read back as ('', '').
`domain = list(soma_joinid = c(0, 9), mystring = c('', ''))`

and in either case the domain slot for a string-typed index column will read back as `('', '')`.

sdf1$domain
((0, 9), ('', ''))
```{r}
sdf1$domain()
```

```{r}
sdf1$maxdomain()
((0, 9223372036854775796), ('', ''))
Now let's look at our other dataframe. Here soma_joinid is not an index column at all. This is fine, as long as within the data you write to it, the index-column values uniquely identify each row.

with tiledbsoma.DataFrame.create(
sdfuri2,
schema=schema,
index_column_names=["myfloat", "myint"],
domain=[(0, 999), (-1000, 1000)],
) as sdf2:
sdf2.write(data)
sdf2 <- tiledbsoma.DataFrame.open(sdfuri2)
sdf2.index_column_names
('myfloat', 'myint')
The domain reads back as written.

sdf2$domain
((0.0, 999.0), (-1000, 1000))
sdf2$maxdomain
((-3.4028234663852886e+38, 3.4028234663852886e+38), (-2147483648, 2147481645))
Advanced usage: using resize at the dataframe/array level using the SOMA API
```

Now let's look at our other dataframe. Here `soma_joinid` is not an index column at all. This is fine, as long as within the data you write to it, the index-column values uniquely identify each row.
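
The uniqueness requirement can be checked before writing. The following is a minimal base-R sketch, assuming a plain `data.frame` of toy rows stands in for the Arrow table; `index_values_are_unique` is a hypothetical helper, not a tiledbsoma function.

```r
# Hypothetical helper (not part of tiledbsoma): TRUE when no two rows share
# the same combination of values in the chosen index columns.
index_values_are_unique <- function(df, index_column_names) {
  !any(duplicated(df[, index_column_names, drop = FALSE]))
}

# Toy rows with the same columns as the table in this vignette
df <- data.frame(
  soma_joinid = c(0, 1),
  mystring = c("hello", "world"),
  myint = c(33L, 44L),
  myfloat = c(4.5, 5.5)
)

# Each (myfloat, myint) pair occurs once, so these rows are acceptable
ok <- index_values_are_unique(df, c("myfloat", "myint"))

# Appending a copy of the first row repeats an index pair
with_repeat <- rbind(df, df[1, ])
not_ok <- index_values_are_unique(with_repeat, c("myfloat", "myint"))

print(c(ok = ok, not_ok = not_ok))
```

The check looks only at the index columns, so two rows may share a `soma_joinid` or `mystring` value here without being flagged.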

```{r}
sdf2 <- SOMADataFrameCreate(
  sdfuri2,
  asch,
  index_column_names = c("myfloat", "myint"),
  domain = list(myfloat = c(0, 9999), myint = c(-1000, 1000))
)
sdf2$write(tbl)
sdf2$close()
```

```{r}
sdf2 <- SOMADataFrameOpen(sdfuri2)
sdf2$index_column_names()
```

The domain reads back as written:

```{r}
sdf2$domain()
```

```{r}
sdf2$maxdomain()
```

## Advanced usage: using resize at the dataframe/array level using the SOMA API

Above we saw a simple and convenient way to resize all the dataframes and arrays within an experiment.

However, should you choose to do so, you can apply a resize to one dataframe or array at a time.