Quantitative Text Analysis for Public Policy

Course Outline
Since the invention of the printing press, the written word has been the most common format by which knowledge is transferred. Computational linguistics has changed dramatically in recent years, and many of its tools are directly useful for applied economics and public policy research. This course introduces students to quantitative text analysis methods and concepts and illustrates their application using examples drawn from text corpora commonly used in public policy analysis, such as legislative texts and political speeches. We will work with conventional statistical and heuristic methods to summarize and compare corpora of text and extract information, and we will also draw on a range of supervised and unsupervised machine learning methods to perform clustering and classification tasks, illustrating their application in applied public policy research. The course will also introduce students to the art of programming in R.

Prerequisites
There are no programming or machine learning prerequisites for this class. Students with significant programming experience may choose an accelerated pace through the introductory materials presented in the class at their discretion.

Course Hours

What? Quantitative Text Analysis Lectures
When? Every Thursday, 3:00-5:50 pm
Where? Room 140C.

Piazza

This term we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the TA, and me. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza. If you have any problems or feedback for the developers, email team@piazza.com.

Find our class page at:

Grading
Pass/Fail allowed. The final grade will be based on: 1) class attendance and participation in discussions (20%); 2) programming assignments (50%); 3) results of a group project (30%).

Recommended Reading

There is no single textbook covering the material we discuss in the course. I will post recommended readings for each week; the books listed below are available for free electronically.

[IIR] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.

[ISLR] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 103). New York, NY: Springer New York.

[SLP] Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (1st ed.). Upper Saddle River, NJ, USA: Prentice Hall PTR.

Topics Covered

Part 1: Introduce simple NLP concepts, introduce the R language, and illustrate how to get text into a useful format for subsequent analysis. The unit of analysis in the first part of the course will mainly be (short) sequences of words.

  1. Introduction: Motivation, setting up, examples, basic introduction to “R”.

    1. Reading and stuff to do: RStudio Installation, An Introduction to R, Getting Started with Markdown in RStudio
    2. Slides
    3. Homework 1, Homework 1 marks
  2. Sourcing data: String manipulation, regular expressions, loading text data, basic web scraping, (social media) APIs.

    1. Slides
    2. Reading and stuff to do: Regular Expressions in R, Another nice Regex introduction
    3. Rcode for Week 1
  3. Text Normalization: Zipf’s law, Heaps’ law, tokenization methods and routines, stemming, Levenshtein distance.

    1. Reading: [SLP], Chapter 2, http://web.stanford.edu/~jurafsky/slp3/2.pdf
    2. Slides
  4. Describing Text: readability measures
    1. Reading: Heylighen, F., & Dewaele, J. (1999). Formality of Language: Definition, Measurement and Behavioral Determinants, 38.
    2. Slides
  5. Identifying and extracting collocations: heuristic and statistical methods.

    1. Reading: Manning, C. D., & Schütze, H., Foundations of Statistical Natural Language Processing, Chapter 5, http://nlp.stanford.edu/fsnlp/promo/colloc.pdf; Optional:

      Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19, 61–74.

    2. Slides
    3. Rcode for Week 2
  6. N-Gram generative language model.
    1. Slides
    2. [SLP], chapter 4, http://web.stanford.edu/~jurafsky/slp3/4.pdf
    3. R-packages: ngram, textcat
  7. Part-of-Speech Tagging.

    1. Slides
    2. [SLP], chapter 10, http://web.stanford.edu/~jurafsky/slp3/10.pdf
    3. R-packages:
    4. Rcode for Week 3
  8. Named Entity Recognition

    1. Slides
    2. Homework 2, getting-started R script prepare-hw2.R, Presidential Debates RData file
    3. Results for Homework 2
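As a taste of the Part 1 material, the sketch below (base R only; the toy sentence is invented for illustration) tokenizes a string with a crude regular-expression splitter and inspects the rank-frequency pattern behind Zipf’s law:

```r
# Minimal sketch: tokenize a string and look at rank-frequency counts.
text <- "the cat sat on the mat and the dog sat on the log"

# Normalize case and split on runs of non-letter characters
# (a deliberately crude tokenizer).
tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Term frequencies, sorted so rank 1 is the most common word type.
freqs <- sort(table(tokens), decreasing = TRUE)
print(freqs)

# Zipf's law predicts frequency roughly proportional to 1/rank, i.e.
# log(freq) falls approximately linearly in log(rank).
zipf <- data.frame(rank = seq_along(freqs), freq = as.integer(freqs))
```

On real corpora one would plot `log(freq)` against `log(rank)` to see the (approximately) straight Zipfian line.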

Part 2: Turn (larger) documents into a vector-space representation and apply (supervised) machine learning methods to classification tasks.

  1. Vector Space model: vector space representation, bag of words model, measuring distance/ similarity between texts.
    1. Slides
  2. Ideological scaling: introduction to supervised learning concepts; naive LBG (Laver, Benoit, and Garry) ideology scaling, Bayesscore.
    1. Slides
    2. Reading: Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. The American Political Science Review, 97(2), 311–331.
    3. Beauchamp, N. (2010). Text-Based Scaling of Legislatures: A Comparison of Methods with Applications to the US Senate and UK House of Commons. Politics, 1–30.
    4. R code for week 4
  3. Classification introduction: Logistic regression, classification error types, measuring accuracy and bias-variance trade-off
    1. Slides
    2. Homework 3
    3. Results for Homework 3
    4. Reading
    5. R code for week 5
      1. [SLP], Logistic Regression, Chapter 7, https://web.stanford.edu/~jurafsky/slp3/7.pdf
      2. [ISLR], Chapter 1-2.2 and Chapter 5, http://www-bcf.usc.edu/~gareth/ISL/
  4. kNN classification

    1. Slides
    2. Reading
      1. [IIR], Vector space classification, sections on Rocchio and knn, http://nlp.stanford.edu/IR-book/pdf/14vcat.pdf
  5. Naive Bayes: Bernoulli versus multinomial language models

    1. Slides (1), Slides (2)
    2. Reading
    3. R code for week 6
      1. [SLP], Naive Bayes, Chapter 6, https://web.stanford.edu/~jurafsky/slp3/6.pdf 
      2. [IIR], Naive Bayes, Chapter 13, http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf
  6. Topic Modelling [moved ahead] 
    1. Slides
    2. Reading
      1. Blei (2001), Introduction to Probabilistic Topic Models
    3. R code for week 7
  7. Trees and Forests

    1. Slides
    2. Reading
      1. [ISLR], Chapter 8.
  8. Support Vector Machines
    1. Slides
    2. Reading
      1. [ISLR], Chapter 9
    3. R code for week 8

  9. Some more application examples

    1. Slides
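To connect the Part 2 topics, here is a minimal base-R sketch of a multinomial Naive Bayes classifier over a bag-of-words representation with add-one (Laplace) smoothing, in the spirit of the [SLP] and [IIR] chapters above. The toy classes and documents are invented for illustration:

```r
# Toy labeled training data (bag-of-words documents with class labels).
train_words  <- list(c("tax", "budget", "tax"),
                     c("budget", "deficit"),
                     c("care", "insurance", "care"))
train_labels <- c("econ", "econ", "health")
vocab   <- unique(unlist(train_words))
classes <- unique(train_labels)

# Log-priors and add-one-smoothed per-class log-likelihoods per word.
log_prior <- log(table(train_labels) / length(train_labels))
log_lik <- sapply(classes, function(k) {
  words  <- unlist(train_words[train_labels == k])
  counts <- table(factor(words, levels = vocab))
  log((counts + 1) / (length(words) + length(vocab)))
})

# Classify by argmax of log-prior plus summed word log-likelihoods.
classify <- function(doc) {
  doc <- doc[doc %in% vocab]          # drop out-of-vocabulary words
  scores <- vapply(classes, function(k)
    log_prior[[k]] + sum(log_lik[doc, k]), numeric(1))
  names(which.max(scores))
}
classify(c("tax", "deficit"))         # -> "econ"
```

Working in log space avoids the numerical underflow that multiplying many small word probabilities would cause on realistic vocabularies.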

Part 3:  In the third part, we will turn to unsupervised learning methods for textual data.

  1. Unsupervised learning
  2. K-Means clustering: k-medoids (PAM), importance of distance measures
  3. Hierarchical clustering: different linkage criteria
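The Part 3 methods can be previewed with base R alone. The sketch below (toy term-frequency matrix invented for illustration) runs k-means and hierarchical clustering on documents represented as rows of a document-term matrix:

```r
# Toy document-term matrix: rows are documents, columns are word counts.
dtm <- rbind(
  doc1 = c(tax = 3, budget = 2, care = 0, insurance = 0),
  doc2 = c(tax = 2, budget = 3, care = 1, insurance = 0),
  doc3 = c(tax = 0, budget = 0, care = 3, insurance = 2),
  doc4 = c(tax = 0, budget = 1, care = 2, insurance = 3)
)

# K-means with k = 2 on raw counts. The distance measure matters: on
# real data one would typically normalize rows or use cosine distance.
set.seed(1)
km <- kmeans(dtm, centers = 2)
print(km$cluster)

# Hierarchical clustering under different linkage criteria.
d <- dist(dtm)                        # Euclidean distance by default
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
print(cutree(hc_complete, k = 2))
```

With this clearly separated toy data, both methods recover the same two groups (economy-themed vs. health-themed documents); on real corpora the choice of distance and linkage can change the result substantially.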


Data

  • Campaign Speeches: Speeches delivered by Trump and Clinton during the campaign to similar audiences or on similar issues.
  • Congressional Speeches: Congressional speeches given by (primary) candidates in the 2016 presidential race. Note this only includes speeches given by candidates who have held some role in Congress since 1996 and thus does not include Donald Trump.
  • Donald Trump tweets: Some tweets from Donald Trump in the last 3-5 months.