Since the invention of the printing press, the written word has been the most common format by which knowledge is transferred. Computational linguistics has changed dramatically, and some of its tools are very useful for applied economics and public policy research. This course will introduce students to quantitative text analysis methods and concepts and illustrate their application using examples drawn from text corpora commonly used in public policy analysis, such as legislative texts or political speeches. We will work with conventional statistical and heuristic methods to summarize and compare corpora of text and to extract information. We will also draw on a range of supervised and unsupervised machine learning methods to perform clustering and classification tasks, and illustrate their application in applied public policy research. The course will introduce students to the art of programming with R.
There are no programming or machine learning prerequisites for this class. Students with significant programming experience may choose an accelerated pace through the introductory materials presented in the class at their discretion.
Course Hours
What? Quantitative Text Analysis Lectures
When? Every Tuesday, 9:00-11:50am, starting Jan 17th.
Where? Room 140A
Pass/Fail allowed. The final grade will be based on: 1) class attendance and participation in discussions (20%); 2) programming assignments (40%); 3) results of a group project (40%).
There is no single textbook covering the material we discuss in the course. I will post recommended readings for each week; the books are available electronically for free.
[IIR] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.
[ISLR] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 103). New York, NY: Springer New York.
[SLP] Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (1st ed.). Upper Saddle River, NJ, USA: Prentice Hall PTR.
Part 1: Introduce simple NLP concepts and the R language, and illustrate how to get text into a format suitable for subsequent analysis. The unit of analysis in the first part of the course will mainly be (short) sequences of words.
- Introduction: Motivation, setting up, examples, basic introduction to “R”.
- Reading and to-do: RStudio Installation, An Introduction to R, Getting Started with Markdown in RStudio
- Homework (not for hand-in): install and get started with R, and try to work through the code exercises on today's slides. It is imperative that you understand the basics of R, as otherwise you may quickly get lost in later lectures.
- Sourcing data: String manipulation, regular expressions, loading text data, basic web scraping, (social media) APIs.
- Text Normalization: Zipf’s law, Heaps' (Herdan's) law, tokenization methods and routines, stemming, Levenshtein distance.
- Describing Text: readability measures
- Identifying and extracting collocations: heuristic and statistical methods.
- Part-of-Speech Tagging.
- N-Gram generative language model.
- Named Entity Recognition
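As a flavour of the text normalization topics in Part 1, here is a minimal sketch of the Levenshtein (edit) distance between two strings. It is written in Python only for compactness, whereas the course itself works in R, and the function name `levenshtein` is an illustrative choice rather than part of the course materials.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] is the distance between the prefix of a processed so far
    # and the first j characters of b; start with the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A small distance between two tokens suggests a spelling variant or typo, which is one way such a measure can be used when cleaning a corpus.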
Part 2: Turn (larger) documents into a vector-space representation and perform (supervised) machine learning for classification type purposes.
- Vector Space model: vector space representation, bag of words model, measuring distance/ similarity between texts.
- Ideological scaling: introduction to supervised learning concepts; naive LBG (Laver, Benoit, Garry) ideology scaling, Bayesscore.
- Reading: Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. The American Political Science Review, 97(2), 311–331.
- Beauchamp, N. (2010). Text-Based Scaling of Legislatures: A Comparison of Methods with Applications to the US Senate and UK House of Commons. Politics, 1–30.
- Classification introduction: Logistic regression, classification error types, measuring accuracy, and the bias-variance trade-off
- Homework 2 (submission on Feb 10th, 12:00pm)
- Class project proposals, first discussion.
- Naive Bayes: Bernoulli versus multinomial language models
- kNN classification
- Trees and Forests
- Reading: [ISLR], Chapter 8.
- Support Vector Machines
- Application examples
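To make the vector space idea from Part 2 concrete, the following is a minimal sketch of a bag-of-words representation and cosine similarity between two texts. It uses Python for brevity, whereas the course itself uses R; the helper names and the simple regular-expression tokenizer are assumptions made for illustration, not the course's own routines.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count word tokens (a deliberately crude tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # by convention, an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical example sentences, not drawn from the course data.
doc1 = bag_of_words("We will cut taxes and create jobs.")
doc2 = bag_of_words("We will create jobs and raise wages.")
print(cosine_similarity(doc1, doc2))
```

Cosine similarity ignores document length and word order, which is why it is a common baseline for comparing texts once they are represented as count vectors.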
Part 3: Turn to unsupervised learning methods for textual data.
- Campaign Speeches: Speeches delivered by Trump and Clinton during the campaign to similar audiences or on similar issues.
- Congressional Speeches: Speeches given in Congress by (primary) candidates in the 2016 presidential race. Note that this includes only speeches by candidates who have held some role in Congress since 1996, and thus does not include Donald Trump.
- Donald Trump tweets: Some tweets from Donald Trump in the last 3-5 months.