-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathstarting.qmd
179 lines (134 loc) · 4.77 KB
/
starting.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
---
title: "Starting Out Tips"
resources: data
subtitle: "Data4All"
author: "Ted Laderas, PhD"
format:
live-html:
scrollable: true
toc-location: left
engine: knitr
webr:
render-df: paged-table
packages:
- readxl
- dplyr
- tidyr
resources:
- data
pyodide:
render-df: paged-table
resources:
- data
packages:
- pandas
- openpyxl
---
{{< include ./_extensions/r-wasm/live/_knitr.qmd >}}
## Rows and columns
- Rows are for observations
- Columns are for variables
- Keep the data type the same in a column
- Numbers (Integer vs Decimal)
- Text
- Avoid more than one table per spreadsheet
## Be Consistent
- Use consistent codes for categorical variables
- Use a consistent fixed code for any missing values
- Use consistent subject identifiers
- Use consistent date formatting
## No empty cells
- If the data is missing, explicitly encode it as missing
- Avoid headers with more than one row
### Original
<iframe width="600" height= "150" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR7q0kwXBXTxgpoLZsPjWtJiL_P9khADMpfpNdaNfChn0wMALfjEAjQ35prUoeaSGxQa3e0iWKpESul/pubhtml?gid=0&single=true&widget=true&range=A1:I14&headers=false"></iframe>
### Better Version
<iframe width="600" height = "250" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR7q0kwXBXTxgpoLZsPjWtJiL_P9khADMpfpNdaNfChn0wMALfjEAjQ35prUoeaSGxQa3e0iWKpESul/pubhtml?gid=1953255846&single=true&widget=true&range=A1:E17&headers=false"></iframe>
### Can we load the original?
What does it look like when we try to load?
::::{.panel-tabset group="language"}
## R
```{webr}
library(readxl)
library(dplyr)
fig2 <- read_excel("data/better_excel_examples.xlsx", sheet="fig2")
fig2
```
## Python
```{pyodide}
import pandas as pd
fig2 = pd.read_excel("data/better_excel_examples.xlsx", sheet_name="fig2")
fig2
```
::::
### Loading the Better Version
With the better version of the dataset, we can load it and group by variables.
Try changing the grouping variable from `genotype` to `strain`.
::::{.panel-tabset group="language"}
## R
```{webr}
library(readxl)
library(dplyr)
fig2better <- read_excel("data/better_excel_examples.xlsx", sheet="fig2better")
fig2better |>
group_by(genotype) |>
summarize(mean_response = mean(response),
sd_response = sd(response))
```
## Python
```{pyodide}
import pandas as pd
fig2better = pd.read_excel("data/better_excel_examples.xlsx", sheet_name="fig2better")
fig2better.groupby("genotype").response.agg(["mean", "std"])
```
::::
## Choose Good Names for Things
- Variable names: avoid spaces, avoid starting with numbers.
- Do: Use numbers and letters, avoid special characters & symbols
- Do: use `_` instead of spaces
- Have an internal column name (`max_temp`) and a displayed name (`Maximum Temp (°C)`)
<iframe width="600" height = "250" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR7q0kwXBXTxgpoLZsPjWtJiL_P9khADMpfpNdaNfChn0wMALfjEAjQ35prUoeaSGxQa3e0iWKpESul/pubhtml?gid=1983153915&single=true&widget=true&headers=false&range=A1:C8"></iframe>
Why is this necessary?
- Spaces can be hard to deal with in variable names
- Special characters may not be accepted in R/Python
```r
my_data |>
filter(`Maximum Temp (°C)` > 10)
```
- R doesn't like column names to begin with numbers (it changes `1st_place` to `X1st_place`)
:::{.callout-note}
### For Data Scientists: `janitor::clean_names()`
- `clean_names()` will remove spaces and special characters, and use camel case
- Instead of `Maximum Temp (°C)` - will transform to `maximum_temp_c`
- Removes capitalization
- Removes accents and diacriticals
:::
## Put just one thing in a cell
- Avoid combining columns into a single column
- Avoid putting multiple bits of information in a cell
::::{.columns}
:::{.column width="50%"}
Example:
<iframe height = 200 src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR7q0kwXBXTxgpoLZsPjWtJiL_P9khADMpfpNdaNfChn0wMALfjEAjQ35prUoeaSGxQa3e0iWKpESul/pubhtml?gid=292780420&single=true&widget=true&range=A1:A5&headers=false"></iframe>
:::
:::{.column width="50%"}
Better
<iframe height = 200 src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR7q0kwXBXTxgpoLZsPjWtJiL_P9khADMpfpNdaNfChn0wMALfjEAjQ35prUoeaSGxQa3e0iWKpESul/pubhtml?gid=1002305030&single=true&widget=true&range=A1:C5&headers=false"></iframe>
:::
::::
:::{.callout-note}
## Data Science Tools: `tidy::separate()`
```{webr}
library(readxl)
library(dplyr)
combined <- read_excel("data/better_excel_examples.xlsx", sheet="combined")
combined |>
tidyr::separate(sample_well_replicate,
into=c("Sample", "Well", "Replicate"),
sep="_")
```
:::
## Write Dates as YYYY-MM-DD
- DD-MM-YYYY has a lot of issues
- Convert YYYY-MM-DD to text when you load
- Or use as YYYYMMDD as integer