Computational Text Analysis
- give students ready-to-use tools for analyzing relatively large collections of text and related data for political science purposes
- learn the types of research tasks that can be solved with text data and the approaches to data preparation, analysis and interpretation, including text classification, clustering, topic modeling and other techniques
- Introduction to computational text analysis in political science. Why modern political science needs automated text analysis. What types of tasks can be solved with such methods. Advantages and limitations of the automated approach to text. What machine learning is and how it relates to text analysis. What natural language processing is.
- Preparing texts for analysis. Text formats: plain text, vector representation, dissimilarity matrix. Text cleaning. Lemmatization and stemming. Stop words and approaches to their removal. Transformation to vector form and types of frequencies: absolute frequencies, tf-idf and their advantages. Types of distances between texts: cosine similarity, Hamming distance, etc., their advantages and limitations. Dimensionality reduction and feature selection.
- Unsupervised machine learning: clustering. Unsupervised learning. What tasks it is suited for, its advantages and limitations. Clustering and the problem of ground truth. Quality metrics for cluster analysis. Flat clustering, K-means. Types of distances between clusters (not to be confused with distances between texts). Instability and approaches to algorithm initialization. How to choose the number of clusters. Hierarchical clustering and when it is preferable. Where to cut the hierarchy. Difficulties of clustering texts and other high-dimensional data. How to interpret clusters and work with outputs. Clustering in Orange: tutorial with a simple dataset.
- Unsupervised machine learning: topic modeling. Topic modeling as bi-clustering of texts and words. Its advantages and limitations. Research examples. Topic modeling output, its labeling and interpretation. Junk, “glued”, wide and narrow topics. Choosing the number of topics by research intuition and by metrics. The problem of measuring topic modeling quality. Perplexity, coherence and entropy as metrics of quality. The problem of stability and approaches to solving it: comparing solutions and choosing stable topics; maximizing stability. The meaning and influence of hyperparameters.
- Supervised machine learning: classification. Preparing data for classification. Approaches to feature engineering: manual and automatic. N-grams, emoticons, linguistic rules and text metadata as features. Feature weighting. Main algorithms: Naïve Bayes, SVM, logistic regression, neural networks. Choosing SVM parameters. Classification quality measures: accuracy, precision, recall, F-measure; overall measures and class-specific measures. Their relative importance for different tasks. The problem of class imbalance. Performing classification in Orange.
- Supervised machine learning: sentiment analysis. What is sentiment? Sentiment, emotion and opinion. Machine learning and dictionary approaches in sentiment analysis. Approaches to creating dictionaries, their labeling and testing. Examples of existing dictionaries. Advantages and limitations of narrow and wide dictionaries. Creating linguistic rules. The importance of grammar and syntax for sentiment analysis. Difficulties of sentiment analysis of political texts.
- Text labeling for supervised methods. The problem of labeled corpora. Examples of labeled corpora for different tasks. Natural mark-up. Mark-up by assessors and the problem of ground truth. Crowdsourcing platforms, their advantages and limitations. Examples. Assessor work quality: criteria and methods of control and incentives. Assessor disagreement, its causes and approaches to dealing with it. Agreement metrics: simple share, Cohen’s kappa, Krippendorff’s alpha, their advantages and limitations. Labeling practice: assessor training, pilot labeling and discussion.
- Designing, discussing and performing political science projects with automated text analysis. This topic is first studied by reading examples of research that applies some of the studied methods to political science tasks. It then includes a team homework task that is presented and discussed in class.
- Introduction to Orange and tools overview. Introduction to Orange runs through all other topics as their practical extension. It is assessed through Gcw1&2. It includes the following subtopics. Opening Orange and importing data. Organizing workflows. Orange functions overview. Data visualization. Exporting results. Text preprocessing with Orange. Clustering with Orange. Classification with Orange.
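
The vector representation, tf-idf weighting and cosine similarity listed under "Preparing texts for analysis" can be illustrated with a minimal sketch. It is not part of the course materials (the course itself uses Orange); the whitespace tokenizer, the unsmoothed idf formula log(N / df) and the three sample sentences are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real text cleaning.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, list of tf-idf vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)  # absolute term frequencies
        # Assumed weighting: tf * log(N / df), with no smoothing.
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical sample texts.
docs = [
    "the party supports tax cuts",
    "the party opposes tax cuts",
    "the court ruled on the appeal",
]
vocab, vectors = tfidf_vectors(docs)
sim_12 = cosine_similarity(vectors[0], vectors[1])
sim_13 = cosine_similarity(vectors[0], vectors[2])
print(round(sim_12, 3), round(sim_13, 3))
```

Because "the" occurs in every document, its idf is zero and it contributes nothing to any similarity score. This is one reason tf-idf is often preferred over absolute frequencies: very common words are down-weighted automatically.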
- Class work 1. Individual in-class practical work on clustering and topic modeling for political science purposes, in Orange.
- Class work 2. Individual in-class practical work on classification and sentiment analysis for political science purposes, in Orange.
- In-class presentation
- Exam. Students can choose one of two tasks: (a) a four-page home essay based on group work results or on a similar individual project; (b) an in-class task on either classification or clustering, similar to CW1 or CW2. The home essay should be submitted 1 day prior to the official examination date. Students may be exempted from the exam at their request, in which case their exam grade is considered equal to the average of their other grades.
- Interim assessment (module 2): 0.28 * Class work 1 + 0.28 * Class work 2 + 0.2 * Exam + 0.24 * In-class presentation
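
As a worked illustration of the interim assessment formula, the sketch below computes the weighted grade, including the exemption case in which the exam grade is replaced by the average of the other grades. It assumes that "average" means the simple unweighted mean of the three other components; the function name and the sample grades are hypothetical.

```python
# Weights taken from the interim assessment formula.
WEIGHTS = {
    "class_work_1": 0.28,
    "class_work_2": 0.28,
    "exam": 0.20,
    "presentation": 0.24,
}


def interim_grade(class_work_1, class_work_2, presentation, exam=None):
    """Weighted interim grade; pass exam=None for a student exempted from the exam."""
    if exam is None:
        # Assumption: the substitute exam grade is the unweighted mean
        # of the three other components.
        exam = (class_work_1 + class_work_2 + presentation) / 3
    return (
        WEIGHTS["class_work_1"] * class_work_1
        + WEIGHTS["class_work_2"] * class_work_2
        + WEIGHTS["exam"] * exam
        + WEIGHTS["presentation"] * presentation
    )


# Hypothetical grades for illustration only.
with_exam = interim_grade(8, 6, 7, exam=9)
exempted = interim_grade(8, 6, 7)
print(round(with_exam, 2), round(exempted, 2))
```

The four weights sum to 1, so the result stays on the same scale as the component grades.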
- Elfrinkhof, A. van, Maks, I., & Kaal, B. (2014). From Text to Political Positions: Text Analysis Across Disciplines. Amsterdam: John Benjamins Publishing Company. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=761345
- Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.BC6A6457
- Benoit, K., Conway, D., Lauderdale, B. E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data. American Political Science Review, 110(2), 278–295. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsrep&AN=edsrep.a.cup.apsrev.v110y2016i02p278.295.00
- Gayo-Avello, D. (2012). A meta-analysis of state-of-the-art electoral prediction from Twitter data. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.D8F9D90B
- Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.D6545538
- van Atteveldt, W., Kleinnijenhuis, J., & Ruigrok, N. (2008). Parsing, Semantic Networks, and Political Authority: Using Syntactic Analysis to Extract Semantic Relations from Dutch Newspaper Articles. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.A3B59093
- Young, L., & Soroka, S. (2012). Affective News: The Automated Coding of Sentiment in Political Texts. Political Communication, 29(2), 205–231. https://doi.org/10.1080/10584609.2012.671234