Syllabus Cultural Data Science - Language Analytics

NB: The information presented here has been taken from the AU Course Catalogue.

This page should be viewed as indicative, rather than definitive. In the case of any errors, the official AU version is binding.

Overview

The purpose of this course is to familiarize students with the tools and techniques used to explore questions about culture through the use of digitized textual data. After taking this course, students will be able to: A) understand how text mining and text analytics are employed in humanities research; B) communicate the advantages and the challenges of quantitative and computationally-assisted approaches to texts; and C) apply relevant methods of textual analysis to research questions in their primary academic field and present the results.

The course provides an introduction to language and text analytics in theory and in practice, and introduces historical, philosophical, and technological perspectives on this growing field.

Written text is a central focus of scholarship in many areas of the humanities and social sciences, and the large-scale digitization of texts and computational text mining tools offer unprecedented opportunities for pattern extraction and quantitative analysis. Emerging over the last two decades, text analytics draws on recent advances in computational linguistics, information retrieval, and machine learning. We will cover the central approaches used in contemporary research. We will also engage with ongoing debates surrounding the role of text analytics in the humanities and social sciences, and its relationship to traditional methods for analysing texts (e.g. logical analysis, discourse analysis, genre analysis, etc.).

This course will enable students to critically engage with the digitalization revolution in text analytics, and its implications across research disciplines. At the same time, students will gain methodological skills which are valued in both academic research and in the labour market.

Academic Objectives

In the evaluation of the student’s performance, emphasis is placed on the extent to which the student is able to:

Knowledge:
- contrast strengths and weaknesses of automated text analysis, compared to more traditional text analysis methods, across different use contexts
- discuss different types of linguistic data and their affordances for automated data analysis
- explain how language and text analytics bears on central research questions in the humanities, and its relation to traditional methods
Skills:
- conduct original research using computationally-assisted analysis of digital language data.
Competences:
- critically reflect on the role of text analytics in the students’ primary field
- independently identify relevant analytical features of digital language data for an original research question.

Course Assessment

The exam consists of a portfolio of 5-7 assignments. The number of assignments as well as their form and length will be announced at the start of the semester. The portfolio may include products. Depending on their length, and subject to the teacher’s approval, these products can replace some of the standard pages in the portfolio.

Participation

Students will be expected to complete the in-class assignments in order to progress to the examination. These assignments are designed first and foremost to develop skills, rather than to “prove” that you have learned concepts.

I encourage you to communicate and work together, so long as you write and explain your code yourself and do not copy work wholesale. You can learn a lot from replicating others’ code but you will learn nothing if you copy it without knowing how it works.

Schedule

Each course session (1-13) is four hours long, consisting of a two-hour lecture and a two-hour code-along session.

Week	Session	Lecture	Classroom	Reading
5	1	Introductions	Work stack - Slack, UCloud, Github	NO ASSIGNED READINGS
6	2	Text mining	Simple text processing with Python	Tahmasebi & Hengchen (2019)
7	3	NLP for linguistic analysis	Introduction to spaCy	Jurafsky & Martin (2023), sections 8.1, 8.2, and 8.3
8	4	Text classification 1	Logistic regression with Scikit-Learn	So & Roland (2020); Kim & Klinger (2019)
9	5	Text classification 2	Simple neural networks	Nielsen (2019[2015]), Chapter 1
10	6	Word embeddings	Semantic analysis with word2vec	Garg et al. (2018)
11	7	Language modelling 1	Training a custom NER model	Blanke et al. (2020)
12	8	Language modelling 2	Advanced Python scripting	Alammar (2018a)
13		NO TEACHING	NO TEACHING	CRFM (2019) (sections)
14	9	BERT	Working with BERT	Alammar (2018b); Underwood (2019)
15	10	More BERT	Exploring BERT	Rogers et al. (2020)
16	11	Generative models	Prompt engineering	Contreras-Kallens et al. (2023); Dou et al. (2022)
17	12	Social impact	Measuring impact	Bender et al. (2021)
18	13	Portfolio development	Portfolio development	NO ASSIGNED READINGS

Reading

Some articles may require you to be on the university VPN; alternatively, they can be accessed through the library website.

Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜". In Proceedings of FAccT 2021, pp. 610-623. DOI: 10.1145/3442188.3445922
Blanke, T., Bryant, M., & Hedges, M. (2020). 'Understanding memories of the Holocaust—A new approach to neural networks in the digital humanities', Digital Scholarship in the Humanities, 35(1), 17-33. DOI: 10.1093/llc/fqy082
Contreras Kallens, P. A., Kristensen-McLachlan, R.D., & Christiansen, M. H. (2023). “Large Language Models demonstrate the potential of statistical learning in language”, Cognitive Science, 47(3), e13256. DOI: 10.1111/cogs.13256
Dou, Y., Forbes, M., Koncel-Kedziorski, R., Smith, N.A., & Choi, Y. (2022). 'Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text'. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7250–7274, Dublin, Ireland. Association for Computational Linguistics. DOI: 10.18653/v1/2022.acl-long.501
Garg, N., Schiebinger, L., Jurafsky, D. & Zou, J. (2018). 'Word embeddings quantify 100 years of gender and ethnic stereotypes', PNAS, 115(16), E3635-E3644. DOI: 10.1073/pnas.1720347115
Kim, E. & Klinger, R. (2019). 'A Survey on Sentiment and Emotion Analysis for Computational Literary Studies'. In Zeitschrift für digitale Geisteswissenschaften. DOI: 10.17175/2019_008
Jurafsky, D. & Martin, J.H. (2021). Speech and Language Processing, 3rd edition online pre-print. Access
Nielsen, M.A. (2015). Neural Networks and Deep Learning, Determination Press. Online.
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). "A Primer in BERTology: what we know about how BERT works", Transactions of the Association for Computational Linguistics, 8, 842-866.
Tahmasebi, N. & Hengchen, S. (2019). 'The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies', Samlaren, 140, 198-227. Download
So, R.J. & Roland, E. (2020). 'Race and Distant Reading', Publications of the Modern Language Association (PMLA), special issue on "Varieties of Digital Humanities", 135(1), 59-73. Download
Underwood, T. (2019). 'Do humanists need BERT?', blog post.

Additional Resources

The following resources are not compulsory assigned readings. Instead, these are a mixture of textbooks and other resources which can be used as reference texts. Specifically, these will be useful for people who want to improve their understanding of linear algebra and neural networks. I strongly recommend all of the textbooks by Gilbert Strang - he's a fantastically clear writer, which is a rare skill among mathematicians. VanderPlas (2016) is a useful reference text for basic data science using Python (pandas, matplotlib, scikit-learn). It's below the level we'll be working at but it's good to have nevertheless.

Bittinger, M.L., Ellenbogen, D.J., & Surgent, S.A. (2012). Calculus and its Applications, 10th Edition. Boston, MA: Addison-Wesley.
Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. New York: Morgan & Claypool Publishers.
Strang, G. (2009). Introduction to Linear Algebra (4th Edition). Wellesley, MA: Wellesley-Cambridge Press.
- (2016). Linear Algebra and its Applications (5th Edition). Wellesley, MA: Wellesley-Cambridge Press.
- (2019). Linear Algebra and Learning from Data. Wellesley, MA: Wellesley-Cambridge Press.
- (2020). Linear Algebra for Everyone. Wellesley, MA: Wellesley-Cambridge Press.
VanderPlas, J. (2016). Python Data Science Handbook. Access
