Commit

updates on the Week 3 slides

czopluoglu committed Oct 3, 2023
1 parent d8530c1 commit 2c5fb85
Showing 7 changed files with 73 additions and 44 deletions.
Binary file removed _site/Syllabus_EDUC654.doc
Binary file not shown.
Binary file modified _site/Syllabus_EDUC654.pdf
Binary file not shown.
10 changes: 6 additions & 4 deletions _site/slides/slide2.Rmd
@@ -1336,9 +1336,9 @@ model
 
 ---
 
-- Each model has a limit on the number of characters it can process.
+- Each model has a limit on the number of **tokens** it can process.
 
-- For instance, RoBERTa can handle a text sequence with a maximum of 512 characters.
+- For instance, RoBERTa can handle a text sequence with a maximum of 512 tokens.
 
 .indent[
 .single[
@@ -1350,7 +1350,7 @@ model$get_max_seq_length()
 ]
 ]
 
-- If we submit any text with more than 512 characters, it will only process the first 512 characters.
+- If we submit any text with more than 512 tokens, it will only process the first 512 tokens.
 
 - Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.
 
@@ -1366,7 +1366,7 @@ model$get_sentence_embedding_dimension()
 ]
 ]
 
-- RoBERTa can take any text sequence of up to 512 characters as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
+- RoBERTa can take any text sequence of up to 512 tokens as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
 
 ---
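
The two calls queried in these slides, `model$get_max_seq_length()` and `model$get_sentence_embedding_dimension()`, come from the sentence-transformers API. A minimal sketch of the full round trip in R, assuming the `sentence_transformers` Python library is available through reticulate and that a RoBERTa checkpoint loads with default pooling (the model id here is an assumption, not the course code):

```r
# Sketch only: the model id and setup are assumptions, not code from the slides.
library(reticulate)

st    <- import("sentence_transformers")
model <- st$SentenceTransformer("roberta-base")   # assumed checkpoint

model$get_max_seq_length()                 # 512: longer inputs are truncated to 512 tokens
model$get_sentence_embedding_dimension()   # 768: length of every embedding vector

# Encoding: a text sequence (up to 512 tokens) -> numeric vector of length 768
embedding <- model$encode("A short reading excerpt.")
length(embedding)                          # 768
```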

@@ -1424,3 +1424,5 @@ dim(embeddings)
 Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the [AllenAI Longformer model](https://huggingface.co/allenai/longformer-base-4096).
 
 [https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii](https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii)
+
+We will use the output of this notebook (a 2834 x 768 matrix of numerical embeddings, **readability_features.csv**) in the following weeks to build a prediction model for readability scores.
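
Since **readability_features.csv** feeds the later prediction model, here is a hedged sketch of loading it back in (only the file name and the 2834 x 768 shape come from the slides; the code itself is an assumption):

```r
# Sketch: read the saved embeddings and confirm their expected shape.
readability <- read.csv("readability_features.csv")
dim(readability)   # expected: 2834 excerpts x 768 embedding features
```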
38 changes: 20 additions & 18 deletions _site/slides/slide2.html

Large diffs are not rendered by default.

21 changes: 21 additions & 0 deletions slides/my_custom.css
@@ -15624,6 +15624,27 @@
 }
 
 
+/* Extra CSS */
+.red {
+  color: red;
+}
+.blue {
+  color: blue;
+}
+.red-pink {
+  color: #e75480; /* note: 'red_pink' is not a valid CSS color; a dark-pink hex is assumed */
+}
+.grey-light {
+  color: lightgrey; /* note: 'grey_light' is not a valid CSS color; the named color lightgrey is assumed */
+}
+.purple {
+  color: purple;
+}
+.small {
+  font-size: 90%;
+}
 
 
 /* Extra CSS */
 .red {
   color: red;
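For context, classes like these follow the remark.js span convention that xaringan slides use; a minimal usage sketch in the slides' own markup (the sentence is invented for illustration):

```markdown
This phrase is .red[red], this one is .blue[blue],
and .small[this part renders at 90% size].
```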
10 changes: 6 additions & 4 deletions slides/slide2.Rmd
@@ -1336,9 +1336,9 @@ model
 
 ---
 
-- Each model has a limit on the number of characters it can process.
+- Each model has a limit on the number of **tokens** it can process.
 
-- For instance, RoBERTa can handle a text sequence with a maximum of 512 characters.
+- For instance, RoBERTa can handle a text sequence with a maximum of 512 tokens.
 
 .indent[
 .single[
@@ -1350,7 +1350,7 @@ model$get_max_seq_length()
 ]
 ]
 
-- If we submit any text with more than 512 characters, it will only process the first 512 characters.
+- If we submit any text with more than 512 tokens, it will only process the first 512 tokens.
 
 - Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.
 
@@ -1366,7 +1366,7 @@ model$get_sentence_embedding_dimension()
 ]
 ]
 
-- RoBERTa can take any text sequence of up to 512 characters as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
+- RoBERTa can take any text sequence of up to 512 tokens as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
 
 ---

@@ -1424,3 +1424,5 @@ dim(embeddings)
 Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the [AllenAI Longformer model](https://huggingface.co/allenai/longformer-base-4096).
 
 [https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii](https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii)
+
+We will use the output of this notebook (a 2834 x 768 matrix of numerical embeddings, **readability_features.csv**) in the following weeks to build a prediction model for readability scores.
38 changes: 20 additions & 18 deletions slides/slide2.html

Large diffs are not rendered by default.
