diff --git a/_site/Syllabus_EDUC654.doc b/_site/Syllabus_EDUC654.doc
deleted file mode 100644
index 26d47f1..0000000
Binary files a/_site/Syllabus_EDUC654.doc and /dev/null differ
diff --git a/_site/Syllabus_EDUC654.pdf b/_site/Syllabus_EDUC654.pdf
index 49b04f1..7e869c0 100644
Binary files a/_site/Syllabus_EDUC654.pdf and b/_site/Syllabus_EDUC654.pdf differ
diff --git a/_site/slides/slide2.Rmd b/_site/slides/slide2.Rmd
index abc8c11..7d72575 100644
--- a/_site/slides/slide2.Rmd
+++ b/_site/slides/slide2.Rmd
@@ -1336,9 +1336,9 @@ model
 ---
-- Each model has a limit for the number of characters they can process.
+- Each model has a limit on the number of **tokens** it can process.
-- For instance, RoBERTa can handle a text sequence with a maximum number of 512 characters.
+- For instance, RoBERTa can handle a text sequence with a maximum number of 512 tokens.
 .indent[
 .single[
@@ -1350,7 +1350,7 @@ model$get_max_seq_length()
 ]
 ]
-- If we submit any text with more than 512 characters, it will only process the first 512 characters.
+- If we submit any text with more than 512 tokens, it will only process the first 512 tokens.
 - Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.
@@ -1366,7 +1366,7 @@ model$get_sentence_embedding_dimension()
 ]
 ]
-- RoBERTa can take any text sequence up to 512 characters as input and then return a numerical vector with a length of 768 that represent this text sequence. This process is also called **encoding**.
+- RoBERTa can take any text sequence up to 512 tokens as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
 ---
@@ -1424,3 +1424,5 @@ dim(embeddings)
 Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the [AllenAI Longformer model](https://huggingface.co/allenai/longformer-base-4096).
 [https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii](https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii)
+
+We will use the output of this notebook (a 2834 x 768 matrix of numerical embeddings, **readability_features.csv**) in the following weeks to build a prediction model for readability scores.
diff --git a/_site/slides/slide2.html b/_site/slides/slide2.html
index 5d8defa..2e3f5fb 100644
--- a/_site/slides/slide2.html
+++ b/_site/slides/slide2.html
@@ -179,8 +179,8 @@
 <br>
 <br>
-
- + +
 <br>
@@ -223,15 +223,15 @@
 <br>
- - + +
 ---
 <br>
- - + +
 ---
@@ -274,8 +274,8 @@
 <br>
- - + +
 ---
@@ -308,8 +308,8 @@
 <br>
- - + +
 ---
@@ -342,8 +342,8 @@
 `$$x_{1} = sin(\frac{2 \pi x}{max(x)})$$`
 `$$x_{2} = cos(\frac{2 \pi x}{max(x)})$$`
- - + +
 ---
@@ -651,8 +651,8 @@
 - Identify the variables with missing data, and then create a binary indicator variable for every variable to indicate missingness (0: not missing, 1: missing).
- - + +
 ---
@@ -964,9 +964,9 @@
 ---
-- Each model has a limit for the number of characters they can process.
+- Each model has a limit on the number of **tokens** it can process.
-- For instance, RoBERTa can handle a text sequence with a maximum number of 512 characters.
+- For instance, RoBERTa can handle a text sequence with a maximum number of 512 tokens.
 .indent[
 .single[
@@ -983,7 +983,7 @@
 ]
 ]
-- If we submit any text with more than 512 characters, it will only process the first 512 characters.
+- If we submit any text with more than 512 tokens, it will only process the first 512 tokens.
 - Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.
@@ -1004,7 +1004,7 @@
 ]
 ]
-- RoBERTa can take any text sequence up to 512 characters as input and then return a numerical vector with a length of 768 that represent this text sequence. This process is also called **encoding**.
+- RoBERTa can take any text sequence up to 512 tokens as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
 ---
@@ -1102,6 +1102,8 @@
 Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the [AllenAI Longformer model](https://huggingface.co/allenai/longformer-base-4096).
 [https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii](https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii)
+
+We will use the output of this notebook (a 2834 x 768 matrix of numerical embeddings, **readability_features.csv**) in the following weeks to build a prediction model for readability scores.
diff --git a/slides/my_custom.css b/slides/my_custom.css
index 21aca8c..968c85e 100644
--- a/slides/my_custom.css
+++ b/slides/my_custom.css
@@ -15624,6 +15624,27 @@
 }
+/* Extra CSS */
+.red {
+  color: red;
+}
+.blue {
+  color: blue;
+}
+.red-pink {
+  color: #e64173; /* assumed hex value; "red_pink" is not a valid CSS color */
+}
+.grey-light {
+  color: lightgrey; /* assumed value; "grey_light" is not a valid CSS color */
+}
+.purple {
+  color: purple;
+}
+.small {
+  font-size: 90%;
+}
+
+
 /* Extra CSS */
 .red {
   color: red;
 }
diff --git a/slides/slide2.Rmd b/slides/slide2.Rmd
index abc8c11..7d72575 100644
--- a/slides/slide2.Rmd
+++ b/slides/slide2.Rmd
@@ -1336,9 +1336,9 @@ model
 ---
-- Each model has a limit for the number of characters they can process.
+- Each model has a limit on the number of **tokens** it can process.
-- For instance, RoBERTa can handle a text sequence with a maximum number of 512 characters.
+- For instance, RoBERTa can handle a text sequence with a maximum number of 512 tokens.
 .indent[
 .single[
@@ -1350,7 +1350,7 @@ model$get_max_seq_length()
 ]
 ]
-- If we submit any text with more than 512 characters, it will only process the first 512 characters.
+- If we submit any text with more than 512 tokens, it will only process the first 512 tokens.
 - Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.
@@ -1366,7 +1366,7 @@ model$get_sentence_embedding_dimension()
 ]
 ]
-- RoBERTa can take any text sequence up to 512 characters as input and then return a numerical vector with a length of 768 that represent this text sequence. This process is also called **encoding**.
+- RoBERTa can take any text sequence up to 512 tokens as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
 ---
@@ -1424,3 +1424,5 @@ dim(embeddings)
 Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the [AllenAI Longformer model](https://huggingface.co/allenai/longformer-base-4096).
 [https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii](https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii)
+
+We will use the output of this notebook (a 2834 x 768 matrix of numerical embeddings, **readability_features.csv**) in the following weeks to build a prediction model for readability scores.
diff --git a/slides/slide2.html b/slides/slide2.html
index 5d8defa..2e3f5fb 100644
--- a/slides/slide2.html
+++ b/slides/slide2.html
@@ -179,8 +179,8 @@
 <br>
 <br>
- - + +
 <br>
@@ -223,15 +223,15 @@
 <br>
- - + +
 ---
 <br>
- - + +
 ---
@@ -274,8 +274,8 @@
 <br>
- - + +
 ---
@@ -308,8 +308,8 @@
 <br>
- - + +
 ---
@@ -342,8 +342,8 @@
 `$$x_{1} = sin(\frac{2 \pi x}{max(x)})$$`
 `$$x_{2} = cos(\frac{2 \pi x}{max(x)})$$`
- - + +
 ---
@@ -651,8 +651,8 @@
 - Identify the variables with missing data, and then create a binary indicator variable for every variable to indicate missingness (0: not missing, 1: missing).
- - + +
 ---
@@ -964,9 +964,9 @@
 ---
-- Each model has a limit for the number of characters they can process.
+- Each model has a limit on the number of **tokens** it can process.
-- For instance, RoBERTa can handle a text sequence with a maximum number of 512 characters.
+- For instance, RoBERTa can handle a text sequence with a maximum number of 512 tokens.
 .indent[
 .single[
@@ -983,7 +983,7 @@
 ]
 ]
-- If we submit any text with more than 512 characters, it will only process the first 512 characters.
+- If we submit any text with more than 512 tokens, it will only process the first 512 tokens.
 - Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.
@@ -1004,7 +1004,7 @@
 ]
 ]
-- RoBERTa can take any text sequence up to 512 characters as input and then return a numerical vector with a length of 768 that represent this text sequence. This process is also called **encoding**.
+- RoBERTa can take any text sequence up to 512 tokens as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
 ---
@@ -1102,6 +1102,8 @@
 Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the [AllenAI Longformer model](https://huggingface.co/allenai/longformer-base-4096).
 [https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii](https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii)
+
+We will use the output of this notebook (a 2834 x 768 matrix of numerical embeddings, **readability_features.csv**) in the following weeks to build a prediction model for readability scores.
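The slide changes above center on two properties of a pretrained language model, its token limit and its embedding dimension, and the encoding step that turns each excerpt into a fixed-length numerical vector. The sketch below illustrates that workflow in R with reticulate and sentence-transformers, mirroring the calls shown in the slides; the model name, the example texts, and the commented write.csv() step are illustrative assumptions rather than code taken from the slides or the Kaggle notebook.

```r
# Minimal sketch of the encoding workflow described in the slides.
# Assumptions: Python and the sentence-transformers package are available to
# reticulate, and the model name below stands in for whichever RoBERTa-style
# model the course actually uses.

library(reticulate)

st    <- import("sentence_transformers")
model <- st$SentenceTransformer("sentence-transformers/all-distilroberta-v1")

# The two model characteristics discussed in the slides
model$get_max_seq_length()                # maximum number of tokens per input
model$get_sentence_embedding_dimension()  # length of each embedding vector (768 here)

# Encode a couple of short excerpts; inputs longer than the token limit
# are truncated to the first max_seq_length tokens.
texts <- c("The quick brown fox jumps over the lazy dog.",
           "Reading passages can be represented as numerical embeddings.")

embeddings <- model$encode(texts)
dim(embeddings)  # 2 x 768: one 768-length vector per text

# Applying the same call to all 2834 CommonLit excerpts yields a 2834 x 768
# matrix, which could then be saved for the later prediction models, e.g.:
# write.csv(as.data.frame(embeddings), "readability_features.csv", row.names = FALSE)
```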