Syllabus Cultural Data Science - Language Analytics

NB: The information presented here has been taken from the AU Course Catalogue.

This page should be viewed as indicative, rather than definitive. In the case of any errors, the official AU version is binding.

Overview

The purpose of this course is to familiarize students with the tools and techniques used to explore questions about culture through digitized textual data. After taking this course, students will be able to: A) understand how text mining and text analytics are employed in humanities research; B) communicate the advantages and challenges of quantitative and computationally-assisted approaches to texts; and C) apply relevant methods of textual analysis to research questions in their primary academic field and present the results.

The course provides an introduction to language and text analytics in theory and in practice, and introduces historical, philosophical, and technological perspectives on this growing field.

Written text is a central focus of scholarship in many areas of the humanities and social sciences, and the large-scale digitization of texts, together with computational text mining tools, offers unprecedented opportunities for pattern extraction and quantitative analysis. Emerging over the last two decades, text analytics draws on recent advances in computational linguistics, information retrieval, and machine learning. We will cover the central approaches used in contemporary research. We will also engage with ongoing debates surrounding the role of text analytics in the humanities and social sciences, and its relationship to traditional methods for analysing texts (e.g. logical analysis, discourse analysis, and genre analysis).

This course will enable students to critically engage with the digitalization revolution in text analytics, and its implications across research disciplines. At the same time, students will gain methodological skills which are valued in both academic research and in the labour market.

Academic Objectives

In the evaluation of the student’s performance, emphasis is placed on the extent to which the student is able to:

  1. Knowledge:
    • contrast strengths and weaknesses of automated text analysis, compared to more traditional text analysis methods, across different use contexts
    • discuss different types of linguistic data and their affordances for automated data analysis
    • explain how language and text analytics bears on central research questions in the humanities, and its relation to traditional methods
  2. Skills:
    • conduct original research using computationally-assisted analysis of digital language data.
  3. Competences:
    • critically reflect on the role of text analytics in the student’s primary field
    • independently identify relevant analytical features of digital language data for an original research question.

Course Assessment

The exam consists of a portfolio of 5-7 assignments. The number of assignments, as well as their form and length, will be announced at the start of the semester. The portfolio may include products. Depending on their length, and subject to the teacher’s approval, these products can replace some of the standard pages in the portfolio.

Participation

Students will be expected to complete the in-class assignments in order to progress to the examination. These assignments are designed first and foremost to develop skills rather than “prove” you have learned concepts.

I encourage you to communicate and work together, so long as you write and explain your code yourself and do not copy work wholesale. You can learn a lot from replicating others’ code but you will learn nothing if you copy it without knowing how it works.

Schedule

Each course session (1-13) is four hours long, consisting of a two-hour lecture and a two-hour code-along session.

| Week | Session | Lecture | Classroom | Reading |
|------|---------|---------|-----------|---------|
| 5 | 1 | Introductions | Work stack - Slack, UCloud, Github | NO ASSIGNED READINGS |
| 6 | 2 | Text mining | Simple text processing with Python | Tahmasebi & Hengchen (2019) |
| 7 | 3 | NLP for linguistic analysis | Introduction to spaCy | Jurafsky & Martin (2023), sections 8.1, 8.2, and 8.3 |
| 8 | 4 | Text classification 1 | Logistic regression with Scikit-Learn | So & Roland (2020); Kim & Klinger (2019) |
| 9 | 5 | Text classification 2 | Simple neural networks | Nielsen (2019[2015]), Chapter 1 |
| 10 | 6 | Word embeddings | Semantic analysis with word2vec | Garg et al. (2018) |
| 11 | 7 | Language modelling 1 | Training a custom NER model | Blanke et al. (2020) |
| 12 | 8 | Language modelling 2 | Advanced Python scripting | Alammar (2018a) |
| 13 | | NO TEACHING | NO TEACHING | CRFM (2019) (sections) |
| 14 | 9 | BERT | Working with BERT | Alammar (2018b); Underwood (2019) |
| 15 | 10 | More BERT | Exploring BERT | Rogers et al. (2020) |
| 16 | 11 | Generative models | Prompt engineering | Contreras-Kallens et al. (2023); Dou et al. (2022) |
| 17 | 12 | Social impact | Measuring impact | Bender et al. (2021) |
| 18 | 13 | Portfolio development | Portfolio development | NO ASSIGNED READINGS |

Reading

Some articles may require you to be on the university VPN; alternatively, they can be accessed through the library website.

  • Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜". In Proceedings of FAccT 2021, pp. 610-623. DOI: 10.1145/3442188.3445922
  • Blanke, T., Bryant, M., & Hedges, M. (2020). 'Understanding memories of the Holocaust—A new approach to neural networks in the digital humanities', Digital Scholarship in the Humanities, 35(1), 17-33. DOI: 10.1093/llc/fqy082
  • Contreras Kallens, P. A., Kristensen-McLachlan, R.D., & Christiansen, M. H. (2023). “Large Language Models demonstrate the potential of statistical learning in language”, Cognitive Science, 47(3), e13256. DOI: 10.1111/cogs.13256
  • Dou, Y., Forbes, M., Koncel-Kedziorski, R., Smith, N.A., & Choi, Y. (2022). 'Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text'. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7250–7274, Dublin, Ireland. Association for Computational Linguistics. DOI: 10.18653/v1/2022.acl-long.501
  • Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). 'Word embeddings quantify 100 years of gender and ethnic stereotypes', PNAS, 115(16), E3635-E3644. DOI: 10.1073/pnas.1720347115
  • Kim, E. & Klinger, R. (2019). 'A Survey on Sentiment and Emotion Analysis for Computational Literary Studies'. In Zeitschrift für digitale Geisteswissenschaften. DOI: 10.17175/2019_008
  • Jurafsky, D. & Martin, J.H. (2021). Speech and Language Processing, 3rd edition online pre-print. Access
  • Nielsen, M.A. (2015). Neural Networks and Deep Learning, Determination Press. Online.
  • Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). "A Primer in BERTology: What we know about how BERT works", Transactions of the Association for Computational Linguistics, 8, 842-866.
  • Tahmasebi, N. & Hengchen, S. (2019). 'The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies', Samlaren, 140, 198-227. Download
  • So, R.J. & Roland, E. (2020). 'Race and Distant Reading', Publications of the Modern Language Association (PMLA), special issue on "Varieties of Digital Humanities", 135(1), 59-73. Download
  • Underwood, T. (2019). 'Do humanists need BERT?', blog post.

Additional Resources

The following resources are not compulsory assigned readings. Instead, these are a mixture of textbooks and other resources which can be used as reference texts. Specifically, these will be useful for people who want to improve their understanding of linear algebra and neural networks. I strongly recommend all of the textbooks by Gilbert Strang - he's a fantastically clear writer, which is a rare skill among mathematicians. VanderPlas (2016) is a useful reference text for basic data science using Python (pandas, matplotlib, scikit-learn). It's below the level we'll be working at but it's good to have nevertheless.

  • Bittinger, M.L., Ellenbogen, D.J., & Surgent, S.A. (2012). Calculus and its Applications, 10th Edition. Boston, MA: Addison-Wesley.
  • Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. New York: Morgan & Claypool Publishers.
  • Strang, G. (2009). Introduction to Linear Algebra (4th Edition). Wellesley, MA: Wellesley-Cambridge Press.
    • (2016). Linear Algebra and its Applications (5th Edition). Wellesley, MA: Wellesley-Cambridge Press.
    • (2019). Linear Algebra and Learning from Data. Wellesley, MA: Wellesley-Cambridge Press.
    • (2020). Linear Algebra for Everyone. Wellesley, MA: Wellesley-Cambridge Press.
  • VanderPlas, J. (2016). Python Data Science Handbook. Access