Since the invention of the printing press, the written word has been the most common format by which knowledge is transferred. Computational linguistics has seen dramatic changes, and some of its tools are very useful for applied economics and public policy research. This course introduces students to quantitative text analysis methods and concepts and illustrates their application using examples drawn from text corpora commonly used in public policy analysis, such as legislative texts or political speeches. We will work with conventional statistical and heuristic methods to summarize and compare corpora of text and to extract information. We will also draw on a range of supervised and unsupervised machine learning methods to perform clustering and classification tasks, and illustrate their use in applied public policy research. The course will also introduce students to programming with R.
There are no programming or machine learning prerequisites for this class. Students with significant programming experience may choose an accelerated pace through the introductory materials presented in the class at their discretion.
What? Quantitative Text Analysis Lectures
When? Every Thursday, 3:00-5:50 pm
Where? Room 140C.
Pass/Fail allowed. The final grade will be based on: 1) class attendance and participation in discussions (20%); 2) programming assignments (50%); 3) results of a group project (30%).
There is no single textbook covering the material we discuss in the course. I will post recommended readings for each week; the books are available for free electronically.
[IIR] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.
[ISLR] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 103). New York, NY: Springer New York.
[SLP] Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (1st ed.). Upper Saddle River, NJ, USA: Prentice Hall PTR.
Part 1: Introduce simple NLP concepts and the R language, and illustrate how to get text into a useful format for subsequent analysis. The unit of analysis in the first part of the course will mainly be (short) sequences of words.
- Introduction: Motivation, setting up, examples, basic introduction to “R”.
- Sourcing data: String manipulation, regular expressions, loading text data, basic web scraping, (social media) APIs.
- Text Normalization: Zipf’s law, Heaps’ law (also known as Herdan’s law), tokenization methods and routines, stemming, Levenshtein distance.
- Describing Text: readability measures
- Identifying and extracting collocations: heuristic and statistical methods.
- N-Gram generative language model.
- Part-of-Speech Tagging.
- Named Entity Recognition
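To give a flavor of the Part 1 topics, here is a minimal sketch in base R covering tokenization, word and bigram counts, and Levenshtein distance. The example sentence and all variable names are made up for illustration and are not course materials; the code actually used in class may rely on other packages.

```r
# Toy text (invented for illustration, not course data)
text <- "The committee approved the bill. The senate approved the budget."

# Normalization: lower-case, drop everything except letters and spaces,
# then split on runs of spaces to obtain word tokens
clean  <- gsub("[^a-z ]", "", tolower(text))
tokens <- strsplit(clean, " +")[[1]]

# Word frequencies, sorted from most to least frequent
# (the kind of rank-frequency table used to discuss Zipf's law)
freq <- sort(table(tokens), decreasing = TRUE)

# Bigrams: pair each token with its successor.
# Note: this simple version also pairs words across the sentence boundary.
bigrams     <- paste(head(tokens, -1), tail(tokens, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Levenshtein (edit) distance between two strings via utils::adist
lev <- adist("senate", "senator")[1, 1]

print(freq)
print(bigram_freq)
print(lev)
```

Counting bigrams this way is the first step toward both collocation extraction and an n-gram language model, where the counts are turned into conditional probabilities.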
Part 2: Turn (larger) documents into a vector-space representation and perform (supervised) machine learning for classification type purposes.
- Vector Space model: vector space representation, bag of words model, measuring distance/similarity between texts.
- Ideological scaling: introduction to supervised learning concepts; naive LBG ideology scaling, Bayesscore.
- Reading: Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. The American Political Science Review, 97(2), 311–331.
- Beauchamp, N. (2010). Text-Based Scaling of Legislatures: A Comparison of Methods with Applications to the US Senate and UK House of Commons. Politics, 1–30.
- R code for week 4
- Classification introduction: Logistic regression, classification error types, measuring accuracy and bias-variance trade-off
- kNN classification
- Naive Bayes: Bernoulli versus multinomial language models
- Topic Modelling [moved ahead]
- Trees and Forests
- [ISLR], Chapter 8.
- Support Vector Machines
- [ISLR], Chapter 9.
- R code for week 8
- Some more application examples
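As an illustration of the vector-space representation that Part 2 builds on, the following base-R sketch turns three short documents into a bag-of-words document-term matrix and compares them with cosine similarity. The documents, the `cosine` helper, and all variable names are hypothetical examples, not course data or code.

```r
# Three toy documents (invented for illustration)
docs <- c(a = "tax cuts for families",
          b = "tax cuts for business",
          c = "protect the environment")

# Tokenize and build the vocabulary
tokens <- strsplit(tolower(docs), " +")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per vocabulary word,
# each cell holding the count of that word in that document (bag of words)
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(names(docs), vocab)

# Cosine similarity between two count vectors
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

sim_ab <- cosine(dtm["a", ], dtm["b", ])  # documents sharing three of four words
sim_ac <- cosine(dtm["a", ], dtm["c", ])  # documents sharing no words

print(dtm)
print(c(sim_ab = sim_ab, sim_ac = sim_ac))
```

The same matrix is the usual input for the classifiers listed above: each row becomes a feature vector and a document label becomes the response.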
Part 3: Turn to unsupervised learning methods for textual data.
- Unsupervised learning
- K-Means clustering: k-medoids (PAM), importance of distance measures
- Hierarchical clustering: different linkage methods
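The clustering methods in Part 3 can be sketched on a tiny document-term matrix using only functions that ship with R (`dist`, `hclust`, `cutree`, `kmeans`). The counts below are invented for illustration; the choice of Euclidean distance and average linkage is an assumption made for this example, not a recommendation from the course.

```r
# Toy document-term matrix (invented counts): two documents about fiscal
# topics and two about climate topics
dtm <- matrix(c(3, 2, 0, 0,
                2, 3, 0, 1,
                0, 0, 3, 2,
                0, 1, 2, 3),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("tax", "budget", "climate", "energy")))

# Hierarchical clustering: Euclidean distances, average linkage,
# then cut the tree into two groups
d      <- dist(dtm)
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)

# K-means with two centers; several random starts make the result stable
set.seed(1)
km <- kmeans(dtm, centers = 2, nstart = 10)

print(groups)
print(km$cluster)
```

Swapping `method = "average"` for `"single"` or `"complete"`, or replacing `dist(dtm)` with a cosine-based distance, is a quick way to see how much linkage and distance choices matter.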
- Campaign Speeches: Speeches delivered by Trump and Clinton during the campaign to similar audiences or on similar issues.
- Congressional Speeches: Speeches given in Congress by (primary) candidates in the 2016 presidential race. Note that this only includes speeches by candidates who have had some role in Congress since 1996, and thus does not include Donald Trump.
- Donald Trump tweets: Some tweets from Donald Trump in the last 3-5 months.