Quantitative Text Analysis for Public Policy

Course Outline
Since the invention of the printing press, the written word has been the most common medium for transferring knowledge. Computational linguistics has advanced dramatically, and many of its tools are directly useful for applied economics and public policy research. This course will introduce students to quantitative text analysis methods and concepts and illustrate their application using examples drawn from text corpora commonly used in public policy analysis, such as legislative texts and political speeches. We will work with conventional statistical and heuristic methods to summarize and compare corpora of text and to extract information. We will also draw on a range of supervised and unsupervised machine learning methods to perform clustering and classification tasks, and illustrate their application in applied public policy research. The course will also introduce students to the art of programming in R.

Prerequisites
There are no programming or machine learning prerequisites for this class. Students with significant programming experience may, at their discretion, take an accelerated pace through the introductory materials presented in class.

Course Hours [UPDATE!!!]

What? Quantitative Text Analysis lectures
When? Every Tuesday, 9:00-11:50am, starting January 17th.
Where? Room 140A

Piazza

This term we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the TA, and me. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza. If you have any problems or feedback for the developers, email team@piazza.com.

Find our class page at:

Class Feedback

Please leave feedback here

Feedback Form

Grading
Pass/Fail grading is allowed. The final grade will be based on: 1) class attendance and participation in discussions (20%); 2) programming assignments (40%); and 3) the results of a group project (40%).

Recommended Reading

There is no single textbook covering the material we discuss in the course. I will post recommended readings for each week; the books listed below are available for free electronically.

[IIR] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.

[ISLR] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 103). New York, NY: Springer New York.

[SLP] Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (1st ed.). Upper Saddle River, NJ, USA: Prentice Hall PTR.

Topics Covered

Part 1: Introduce basic NLP concepts, introduce the R language, and illustrate how to get text into a useful format for subsequent analysis. The unit of analysis in the first part of the course will mainly be (short) sequences of words.

  1. Introduction: Motivation, setting up, examples, basic introduction to “R”.
    1. Slides
    2. Reading and stuff to do: RStudio Installation, An Introduction to R, Getting Started with Markdown in RStudio
    3. Homework (not for hand-in): install and get started with R; try to work through the code exercises on today's slides. It is imperative that you understand the basics of the R language, as otherwise you may quickly feel lost in later lectures.
  2. Sourcing data: String manipulation, regular expressions, loading text data, basic web scraping, (social media) APIs.
    1. Slides
    2. Reading and stuff to do: Regular Expressions in R, Another nice Regex introduction
    3. Homework 1
  3. Text Normalization: Zipf's law, Heaps' law, tokenization methods and routines, stemming, Levenshtein distance.
    1. Slides
    2. Reading: [SLP], Chapter 2, http://web.stanford.edu/~jurafsky/slp3/2.pdf
  4. Describing Text: readability measures
    1. Slides
  5. Identifying and extracting collocations: heuristic and statistical methods.
    1. Slides
    2. Reading: Manning & Schütze, Foundations of Statistical Natural Language Processing, Chapter 5, http://nlp.stanford.edu/fsnlp/promo/colloc.pdf; Optional:

      Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19, 61–74.

  6. Part-of-Speech Tagging.
    1. Slides
    2. [SLP], Chapter 10, http://web.stanford.edu/~jurafsky/slp3/10.pdf
  7. N-Gram generative language model.
    1. Slides
    2. [SLP], Chapter 4, http://web.stanford.edu/~jurafsky/slp3/4.pdf
  8. Named Entity Recognition
    1. Slides
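
The Part 1 topics can be previewed in a few lines of base R. The sketch below is illustrative only (the example sentence and object names are invented here, not course material): it tokenizes a string, counts word frequencies as in a Zipf-style analysis, and computes a Levenshtein distance with base R's adist().

```r
# Illustrative sketch of Part 1 concepts in base R (no packages needed).

text <- "The people of the state elect the representatives of the people."

# Tokenization: lowercase, strip punctuation, split on whitespace
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]

# Word frequencies, sorted -- the long-tailed distribution Zipf's law describes
freqs <- sort(table(tokens), decreasing = TRUE)
print(freqs)  # "the" dominates, as Zipf's law predicts

# Levenshtein (edit) distance via base R's adist()
adist("senate", "senator")  # one substitution + one insertion = 2
```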

Part 2: Turn (larger) documents into a vector-space representation and apply (supervised) machine learning methods to classification tasks.

  1. Vector Space model: vector space representation, bag-of-words model, measuring distance/similarity between texts.
    1. Slides
  2. Ideological scaling: introduction to supervised learning concepts; naive Laver-Benoit-Garry (LBG) ideology scaling, Bayesscore.
    1. Slides
    2. Reading: Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. The American Political Science Review, 97(2), 311–331.
    3. Beauchamp, N. (2010). Text-Based Scaling of Legislatures: A Comparison of Methods with Applications to the US Senate and UK House of Commons. Politics, 1–30.
  3. Classification introduction: Logistic regression, classification error types, measuring accuracy, and the bias-variance trade-off.
    1. Slides
    2. Homework 2 (submission on Feb 10th, 12:00pm)
    3. Class project proposals, first discussion.
    4. Reading
      1. [SLP], Logistic Regression, Chapter 7, https://web.stanford.edu/~jurafsky/slp3/7.pdf
      2. [ISLR], Chapter 1-2.2 and Chapter 5, http://www-bcf.usc.edu/~gareth/ISL/
      3. Note on regularized Logistic regression
  4. Naive Bayes: Bernoulli versus multinomial language models
    1. Slides 1 , Slides 2
    2. Reading
      1. [SLP], Naive Bayes, Chapter 6, https://web.stanford.edu/~jurafsky/slp3/6.pdf 
      2. [IIR], Naive Bayes, Chapter 13, http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf
  5. kNN classification
    1. Slides
    2. Reading
      1. [IIR], Vector space classification, sections on Rocchio and kNN, http://nlp.stanford.edu/IR-book/pdf/14vcat.pdf
  6. Trees and Forests
    1. Slides
    2. Reading
      1. [ISLR], Chapter 8.
  7. Support Vector Machines
    1. Slides
    2. Reading
      1. [ISLR], Chapter 9.
    3. Homework, Assignment 3
  8. Application examples
    1. Slides
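
The starting point of Part 2, the vector space model, can be sketched in base R. The two tiny "documents" and all object names below are invented for illustration: we build a document-term matrix and compute the cosine similarity between the two bag-of-words vectors.

```r
# Illustrative sketch of the vector space model in base R:
# two invented documents, a document-term matrix, and cosine similarity.

docs <- c(a = "tax policy reform tax", b = "tax reform debate")

tokenize <- function(x) strsplit(tolower(x), "\\s+")[[1]]
vocab <- sort(unique(unlist(lapply(docs, tokenize))))

# Term-count (bag-of-words) vector over the shared vocabulary
to_vec <- function(d) as.numeric(table(factor(tokenize(d), levels = vocab)))

dtm <- t(sapply(docs, to_vec))   # 2 x |vocab| document-term matrix
colnames(dtm) <- vocab

cosine <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))
cosine(dtm["a", ], dtm["b", ])   # ~0.71
```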

Part 3: Turn to unsupervised learning methods for textual data.

  1. Unsupervised learning
  2. K-Means clustering: k-medoids (PAM), importance of distance measures
    1. Slides
  3. Hierarchical clustering: different linkage methods
    1. Slides
  4. Topic Modelling
    1. Slides
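
The clustering ideas in Part 3 can be previewed with base R's kmeans(). The sketch below uses synthetic 2-D points for clarity; in the course, the rows being clustered would instead be document vectors.

```r
# Illustrative sketch of k-means on synthetic 2-D data (base R's kmeans()).

set.seed(42)  # reproducible draws
pts <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),  # 10 points near (0, 0)
  matrix(rnorm(20, mean = 5), ncol = 2)   # 10 points near (5, 5)
)

fit <- kmeans(pts, centers = 2, nstart = 10)
table(fit$cluster)  # with data this well separated: 10 points per cluster
```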

Data

  • Campaign Speeches: Speeches delivered by Trump and Clinton during the campaign to similar audiences or on similar issues.
  • Congressional Speeches: Congressional speeches given by (primary) candidates in the 2016 presidential race. Note that this only includes candidates who have served in Congress since 1996, and thus does not include Donald Trump.
  • Donald Trump tweets: A sample of Donald Trump's tweets from the past 3-5 months.