Commit

updates on the Week 3 slides

czopluoglu committed Oct 3, 2023
1 parent d8530c1 commit 2c5fb85
Showing 7 changed files with 73 additions and 44 deletions.
Binary file removed _site/Syllabus_EDUC654.doc
Binary file not shown.
Binary file modified _site/Syllabus_EDUC654.pdf
Binary file not shown.
10 changes: 6 additions & 4 deletions _site/slides/slide2.Rmd
@@ -1336,9 +1336,9 @@ model
 
 ---
 
-- Each model has a limit on the number of characters it can process.
+- Each model has a limit on the number of **tokens** it can process.
 
-- For instance, RoBERTa can handle a text sequence with a maximum of 512 characters.
+- For instance, RoBERTa can handle a text sequence with a maximum of 512 tokens.
 
 .indent[
 .single[
@@ -1350,7 +1350,7 @@ model$get_max_seq_length()
 ]
 ]
 
-- If we submit any text with more than 512 characters, it will only process the first 512 characters.
+- If we submit any text with more than 512 tokens, it will only process the first 512 tokens.
 
 - Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.
 
@@ -1366,7 +1366,7 @@ model$get_sentence_embedding_dimension()
 ]
 ]
 
-- RoBERTa can take any text sequence of up to 512 characters as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
+- RoBERTa can take any text sequence of up to 512 tokens as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
 
 ---
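
The two calls queried in these slides, `model$get_max_seq_length()` and `model$get_sentence_embedding_dimension()`, come from the sentence-transformers API. A minimal sketch of the full round trip in R, assuming the `sentence_transformers` Python library is available through reticulate and that a RoBERTa checkpoint loads with default pooling (the model id here is an assumption, not the course code):

```r
# Sketch only: the model id and setup are assumptions, not code from the slides.
library(reticulate)

st    <- import("sentence_transformers")
model <- st$SentenceTransformer("roberta-base")   # assumed checkpoint

model$get_max_seq_length()                 # 512: longer inputs are truncated to 512 tokens
model$get_sentence_embedding_dimension()   # 768: length of every embedding vector

# Encoding: a text sequence (up to 512 tokens) -> numeric vector of length 768
embedding <- model$encode("A short reading excerpt.")
length(embedding)                          # 768
```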

@@ -1424,3 +1424,5 @@ dim(embeddings)
 Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the [AllenAI Longformer model](https://huggingface.co/allenai/longformer-base-4096).
 
 [https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii](https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii)
+
+We will use the output of this notebook (a 2834 x 768 matrix of numerical embeddings, **readability_features.csv**) in the following weeks to build a prediction model for readability scores.
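
Since **readability_features.csv** feeds the later prediction model, here is a hedged sketch of loading it back in (only the file name and the 2834 x 768 shape come from the slides; the code itself is an assumption):

```r
# Sketch: read the saved embeddings and confirm their expected shape.
readability <- read.csv("readability_features.csv")
dim(readability)   # expected: 2834 excerpts x 768 embedding features
```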
38 changes: 20 additions & 18 deletions _site/slides/slide2.html

Large diffs are not rendered by default.

21 changes: 21 additions & 0 deletions slides/my_custom.css
@@ -15624,6 +15624,27 @@
 }
 
 
+/* Extra CSS */
+.red {
+  color: red;
+}
+.blue {
+  color: blue;
+}
+.red-pink {
+  color: #e75480; /* note: 'red_pink' is not a valid CSS color; a dark-pink hex is assumed */
+}
+.grey-light {
+  color: lightgrey; /* note: 'grey_light' is not a valid CSS color; the named color lightgrey is assumed */
+}
+.purple {
+  color: purple;
+}
+.small {
+  font-size: 90%;
+}
 
 
 /* Extra CSS */
 .red {
   color: red;
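For context, classes like these follow the remark.js span convention that xaringan slides use; a minimal usage sketch in the slides' own markup (the sentence is invented for illustration):

```markdown
This phrase is .red[red], this one is .blue[blue],
and .small[this part renders at 90% size].
```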
10 changes: 6 additions & 4 deletions slides/slide2.Rmd
@@ -1336,9 +1336,9 @@ model
 
 ---
 
-- Each model has a limit on the number of characters it can process.
+- Each model has a limit on the number of **tokens** it can process.
 
-- For instance, RoBERTa can handle a text sequence with a maximum of 512 characters.
+- For instance, RoBERTa can handle a text sequence with a maximum of 512 tokens.
 
 .indent[
 .single[
@@ -1350,7 +1350,7 @@ model$get_max_seq_length()
 ]
 ]
 
-- If we submit any text with more than 512 characters, it will only process the first 512 characters.
+- If we submit any text with more than 512 tokens, it will only process the first 512 tokens.
 
 - Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.
 
@@ -1366,7 +1366,7 @@ model$get_sentence_embedding_dimension()
 ]
 ]
 
-- RoBERTa can take any text sequence of up to 512 characters as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
+- RoBERTa can take any text sequence of up to 512 tokens as input and then return a numerical vector with a length of 768 that represents this text sequence. This process is also called **encoding**.
 
 ---

@@ -1424,3 +1424,5 @@ dim(embeddings)
 Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the [AllenAI Longformer model](https://huggingface.co/allenai/longformer-base-4096).
 
 [https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii](https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii)
+
+We will use the output of this notebook (a 2834 x 768 matrix of numerical embeddings, **readability_features.csv**) in the following weeks to build a prediction model for readability scores.
38 changes: 20 additions & 18 deletions slides/slide2.html

Large diffs are not rendered by default.
