iterating [skip ci]

single-cell-data · Nov 7, 2024 · 5d38375
1 parent 4a3a4c5
commit 5d38375
Showing 1 changed file with 101 additions and 116 deletions.
diff --git a/apis/r/vignettes/soma-shapes.Rmd b/apis/r/vignettes/soma-shapes.Rmd
@@ -65,7 +65,7 @@ Likewise, the N-dimensional arrays within the experiment have their shapes as we
 
 There's an important difference: while the dataframe domain gives you the inclusive lower and upper bounds for `soma_joinid` writes, the shape for the N-dimensional arrays is the upper bound plus 1.
 
-Since there are 2638 cells here and 1838 genes here, `X`'s shape reflects that.
+Since there are 80 cells and 230 genes in this experiment, `X`'s shape reflects that.
 
 ```{r}
 obs <- exp$obs
@@ -118,139 +118,124 @@ In particular, the `X` array in this experiment --- and in most experiments ---
 As a general rule you'll see the following:
 
 * An `X` array's shape is nobs x nvar
-* An `obsm` array's shape is `nobs` x some number, maybe 50
+* An `obsm` array's shape is `nobs` x some number, maybe 20
 * An `obsp` array's shape is `nobs` x `nobs`
-* A `varm` array's shape is `nvar` x some number, maybe 50
+* A `varm` array's shape is `nvar` x some number, maybe 20
 * A `varp` array's shape is `nvar` x `nvar`
 
-## How to upgrade older experiments
+## Advanced usage: dataframes with non-standard index columns
 
-Experiments created by TileDB-SOMA 1.15 and higher will look as shown above. Let's take a look at an experiment from before TileDB-SOMA 1.15.
+In the SOMA data model, the `SOMASparseNDArray` and `SOMADenseNDArray` objects always have int64 dimensions named `soma_dim_0`, `soma_dim_1`, and so on, along with a numeric `soma_data` attribute for the contents of the array.
 
-import tiledbsoma.io
-import tarfile
-import tempfile
-
-uri <- tempfile.mktemp()
-with tarfile.open("data/pbmc3k-sparse-pre-1.15.tgz") as handle:
-    handle.extractall(uri)
-expold <- tiledbsoma.Experiment.open(uri)
-This is the same PBMC3K data as above. Compare the old and new shapes:
+```{r}
+exp$ms$get("RNA")$X$get("data")$schema()
+```
 
-expold.obs$domain
-((0, 9223372036854773758),)
-expold.obs$maxdomain
-((0, 9223372036854773758),)
-expold.obs.tiledbsoma_has_upgraded_domain
-False
-[ expold.ms$get("RNA")$X$get("data")$shape, expold.ms$get("RNA")$X$get("data")$maxshape, expold.ms$get("RNA")$X$get("data").tiledbsoma_has_upgraded_shape ]
-[(9223372036854773759, 9223372036854773759),
- (9223372036854773759, 9223372036854773759),
- False]
-Note that for the pre-1.15 experiment, the shape is huge --- like the maxshape --- and tiledbsoma_has_upgraded_domain is False.
+For dataframes, though, while there must be a `soma_joinid` column of type int64, you can have one or more other index columns in addition, or `soma_joinid` can be a non-index column.
 
-To make the old experiment look like the new experiment, simply call upgrade_experiment_shapes, and re-open:
+```{r}
+exp$obs$schema()
+```
 
-tiledbsoma.io.upgrade_experiment_shapes(expold.uri)
-True
-expold <- tiledbsoma.open(expold.uri)
-[ expold.ms$get("RNA")$X$get("data")$shape, expold.ms$get("RNA")$X$get("data")$maxshape, expold.ms$get("RNA")$X$get("data").tiledbsoma_has_upgraded_shape ]
-[(2638, 1838), (9223372036854773759, 9223372036854773759), True]
-Additionally, you can call tiledbsoma.io.show_experiment_shapes(expold.uri) before and after doing the upgrade.
-
-To run a pre-check, you can do
-
-tiledbsoma.io.upgrade_experiment_shapes(expold.uri, check_only=True)
-This won't change anything --- it'll simply tell you if the operation will be possible.
-
-Advanced usage: dataframes with non-standard index columns
-In the SOMA data model, the SparseNDArray and DenseNDArray objects always have int64 dimensions named soma_dim_0, soma_dim_1, and up, and they have a numeric soma_data attribute for the contents of the array. Furthermore, this is always the case.
-
-exp$ms$get("RNA")$X$get("data").schema
-soma_dim_0: int64 not null
-soma_dim_1: int64 not null
-soma_data: float not null
-For dataframes, though, while there must be a soma_joinid column of type int64, you can have one or more other index columns in addtion --- or, soma_joinid can be a non-index column.
-
-This means that in the default, simplest, and most common case, you can think of a dataframe has having a shape just as the N-dimensional arrays do.
-
-exp$obs.schema
-soma_joinid: int64 not null
-obs_id: large_string
-n_genes: int64
-percent_mito: float
-n_counts: float
-louvain: dictionary<values=string, indices=int32, ordered=0>
-exp$obs.index_column_names
-('soma_joinid',)
 But really, dataframes are capable of more than that, via the index-column names you specify at creation time.
 
 Let's create a couple dataframes, with the same data, but different choices of index-column names.
 
-sdfuri1 <- tempfile.mktemp()
-sdfuri2 <- tempfile.mktemp()
+```{r}
+sdfuri1 <- tempfile()
+sdfuri2 <- tempfile()
+```
 import pyarrow as pa
 
-schema <- pa.schema([
-    ("soma_joinid", pa.int64()),
-    ("mystring", pa.string()),
-    ("myint", pa.int32()),
-    ("myfloat", pa.float32()),
-])
-
-data <- pa.Table.from_pydict({
-    "soma_joinid": [0, 1],
-    "mystring": ["hello", "world"],
-    "myint": [33, 44],
-    "myfloat": [4.5, 5.5],
-})
-with tiledbsoma.DataFrame.create(
-    sdfuri1,
-    schema=schema,
-    index_column_names=["soma_joinid", "mystring"],
-    domain=[(0, 9), None],
-) as sdf1:
-        sdf1.write(data)
-Now let's look at the domain and maxdomain for these dataframes.
-
-sdf1 <- tiledbsoma.DataFrame.open(sdfuri1)
-sdf1.index_column_names
-('soma_joinid', 'mystring')
-Here we see the soma_joinid slot of the dataframe's domain is as requested.
-
-Another point is that domain cannot be specified for string-type index columns.
-
-You can set them at create one of two ways:
-
-    domain=[(0, 9), None],
+```{r}
+asch <- arrow::schema(
+    arrow::field("soma_joinid", arrow::int64(), nullable = FALSE),
+    arrow::field("mystring", arrow::large_utf8(), nullable = FALSE),
+    arrow::field("myint", arrow::int32(), nullable = FALSE),
+    arrow::field("myfloat", arrow::float32(), nullable = FALSE)
+)
+
+soma_joinid <- c(0, 1)  # must lie within the soma_joinid domain c(0, 9) used below
+mystring    <- c("hello", "world")
+myint       <- c(33, 44)
+myfloat     <- c(4.5, 5.5)
+
+tbl <- arrow::arrow_table(
+    soma_joinid = c(soma_joinid),
+    mystring = c(mystring),
+    myint = c(myint),
+    myfloat = c(myfloat)
+)
+```
+
+```{r}
+sdf1 <- SOMADataFrameCreate(
+  sdfuri1,
+  asch,
+  index_column_names = c("soma_joinid", "mystring"),
+  domain = list(soma_joinid = c(0, 9), mystring = NULL)
+)
+sdf1$write(tbl)
+sdf1$close()
+```
+
+Now let's look at the `domain` and `maxdomain` for these dataframes.
+
+```{r}
+sdf1 <- SOMADataFrameOpen(sdfuri1)
+sdf1$index_column_names()
+```
+
+Here we see the `soma_joinid` slot of the dataframe's domain is as requested.
+
+Note that a domain cannot be specified for string-typed index columns. At creation time you can write the string slot in either of two ways:
+
+`domain = list(soma_joinid = c(0, 9), mystring = NULL)`
+
 or
 
-    domain=[(0, 9), ('', '')],
-and in either case the domain slot for a string-typed index column will read back as ('', '').
+`domain = list(soma_joinid = c(0, 9), mystring = c('', ''))`
+
+In either case, the domain slot for a string-typed index column reads back as a pair of empty strings.
 
-sdf1$domain
-((0, 9), ('', ''))
+```{r}
+sdf1$domain()
+```
+
+```{r}
-sdf1$maxdomain
+sdf1$maxdomain()
-((0, 9223372036854775796), ('', ''))
-Now let's look at our other dataframe. Here soma_joinid is not an index column at all. This is fine, as long as within the data you write to it, the index-column values uniquely identify each row.
-
-with tiledbsoma.DataFrame.create(
-    sdfuri2,
-    schema=schema,
-    index_column_names=["myfloat", "myint"],
-    domain=[(0, 999), (-1000, 1000)],
-) as sdf2:
-        sdf2.write(data)
-sdf2 <- tiledbsoma.DataFrame.open(sdfuri2)
-sdf2.index_column_names
-('myfloat', 'myint')
-The domain reads back as written.
-
-sdf2$domain
-((0.0, 999.0), (-1000, 1000))
-sdf2$maxdomain
-((-3.4028234663852886e+38, 3.4028234663852886e+38), (-2147483648, 2147481645))
-Advanced usage: using resize at the dataframe/array level using the SOMA API
+```
+
+Now let's look at our other dataframe. Here `soma_joinid` is not an index column at all. This is fine as long as the index-column values in the data you write uniquely identify each row.
+
+```{r}
+sdf2 <- SOMADataFrameCreate(
+  sdfuri2,
+  asch,
+  index_column_names = c("myfloat", "myint"),
+  domain = list(myfloat = c(0, 9999), myint = c(-1000, 1000))
+)
+sdf2$write(tbl)
+sdf2$close()
+```
+
+```{r}
+sdf2 <- SOMADataFrameOpen(sdfuri2)
+sdf2$index_column_names()
+```
+
+The domain reads back as written:
+
+```{r}
+sdf2$domain()
+```
+
+```{r}
+sdf2$maxdomain()
+```
+
+## Advanced usage: resizing at the dataframe/array level with the SOMA API
+
 Above we saw a simple and convenient way to resize all the dataframes and arrays within an experiment.
 
 However, should you choose to do so, you can apply these one dataframe or array at a time.