
Computational Text Analysis

Academic year
Language of instruction: English
Elective course
Taught in:
2nd year, modules 1 and 2


Course Syllabus


Our society leaves more and more “digital traces,” which accumulate at an unprecedented scale. Some of these data are publicly available and produce immediate societal effects; others have to be mined and processed before they yield meaning, yet they too are available for scholarly analysis. These traces are often created in the course of individual or mass communication, or of digitally mediated problem solving (such as online purchases, search, or rating). This process is not just a mirror or a derivative of some offline social reality: it is a new type of social process, and a large portion of such processes are political in nature. Therefore all these traces, whether stored or evolving in real time, can be studied by political science, and they demand new research methods. Much of this data is textual, another large portion is closely linked to texts, and most of it is too large to be processed manually.
Learning Objectives

  • give students ready-to-use tools for analyzing relatively large collections of text and related data for the purposes of political science
Expected Learning Outcomes

  • learn the types of research tasks that may be solved with text data, approaches to data preparation, analysis and interpretation, including text classification, clustering, topic modeling and other techniques
Course Contents

  • Introduction to computational text analysis in political science.
    Why modern political science needs automated text analysis. What types of tasks can be solved with such methods. Advantages and limitations of the automated approach to text. What machine learning is and how it relates to text analysis. What natural language processing is.
  • Preparing texts for analysis.
    Text formats: plain text, vector representation, dissimilarity matrix. Text cleaning. Lemmatization and stemming. Stop words and approaches to their removal. Transformation to vector form and types of frequencies: absolute frequencies, tf-idf and their advantages. Types of distances between texts: cosine similarity, Hamming, etc.; their advantages and limitations. Dimensionality reduction and feature selection.
  • Unsupervised machine learning: clustering
    Unsupervised learning. What tasks it is suited for, its advantages and limitations. Clustering and the problem of ground truth. Quality metrics for cluster analysis. Flat clustering, K-means. Types of distances between clusters (not to be confused with distances between texts). Instability and approaches to algorithm initialization. How to choose the number of clusters. Hierarchical clustering and when it is preferable. Where to cut the hierarchy. Difficulties of clustering texts and other high-dimensional data. How to interpret clusters and work with outputs. Clustering in Orange: tutorial with a simple dataset.
  • Unsupervised machine learning: topic modeling
    Topic modeling as bi-clustering of texts and words. Its advantages and limitations. Research examples. Topic modeling output, its labeling and interpretation. Junk, “glued”, wide and narrow topics. Choosing the number of topics by research intuition and with metrics. The problem of measuring topic modeling quality. Perplexity, coherence and entropy as metrics of quality. The problem of stability and approaches to solving it: comparing solutions and choosing stable topics; maximizing stability. The meaning and influence of hyperparameters.
  • Supervised machine learning: classification
    Preparing data for classification. Approaches to feature engineering: manual and automatic approaches. N-grams, emoticons, linguistic rules and text meta-data as features. Feature weighting. Main algorithms: Naïve Bayes, SVM, Logistic Regression, neural networks. Choosing SVM parameters. Classification quality measures: accuracy, precision, recall, F-measure; overall measures and class-specific measures. Their relative importance for different tasks. The problem of class imbalance. Performing classification in Orange.
  • Supervised machine learning: sentiment analysis
    What is sentiment? Sentiment, emotion and opinion. Machine learning and dictionary approaches in sentiment analysis. Approaches to creating dictionaries, their labeling and testing. Examples of existing dictionaries. Advantages and limitations of narrow and wide dictionaries. Creating linguistic rules. The importance of grammar and syntax for sentiment analysis. Difficulties of sentiment analysis of political texts.
  • Text labeling for supervised methods
    The problem of labelled corpora. Examples of labelled corpora for different tasks. Natural mark-up. Mark-up by assessors and the problem of ground truth. Crowdsourcing platforms, their advantages and limitations. Examples. Assessor work quality, criteria and methods of control and stimulation. Assessor disagreement, its causes and approaches to dealing with it. Agreement metrics: simple share, Cohen’s Kappa, Krippendorff’s Alpha, their advantages and limitations. Labeling practice: assessor training, pilot labeling and discussion.
  • Designing, discussing and performing political science projects with automated text analysis
    This topic is first studied by reading examples of research that apply some of the studied methods to political science tasks. It then includes a team homework task that is presented and discussed in class.
  • Introduction to Orange and tools overview
    Introduction to Orange runs through all other topics as their practical extension. It is assessed through Gcw1&2. It includes the following subtopics: opening Orange and importing data; organizing workflows; overview of Orange functions; data visualization; exporting results; text preprocessing with Orange; clustering with Orange; classification with Orange.
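As an illustration of the “Preparing texts for analysis” topic, the following is a minimal sketch of the path from plain text to tf-idf vectors and cosine similarity. It is written in plain Python rather than Orange, the tool the course uses, and it rests on several assumptions that are not part of the syllabus: the three example sentences, the tiny stop-word list, the whitespace tokenizer, and the idf variant log(N / df), which is only one of several common definitions.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse vector (dict term -> weight) per document.

    Weight = absolute term frequency * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical mini-corpus: two texts about parliament, one about sport.
docs = [
    "The parliament passed the budget law.",
    "Parliament debated the new budget.",
    "The football team won the match.",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # share "parliament", "budget"
sim_02 = cosine_similarity(vectors[0], vectors[2])  # share no terms
print(sim_01 > sim_02)
```

With this idf variant a term that occurs in every document gets weight zero, which is one reason tf-idf can be preferable to absolute frequencies when comparing texts.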
Assessment Elements

  • non-blocking Class work 1
    Individual in-class practical work on clustering and topic modeling for political science purposes, in Orange.
  • non-blocking Class work 2
    Individual in-class practical work on classification and sentiment analysis for political science purposes, in Orange.
  • non-blocking In-class presentation
  • non-blocking Exam
    Students can choose one of two tasks: (a) a four-page home essay based on group work results or on a similar individual project, or (b) an in-class task on either classification or clustering, similar to CW1 or CW2. The home essay should be submitted one day prior to the official examination date. Students may be exempted from the exam at their request, in which case their exam grade is considered equal to the average of their other grades.
Interim Assessment

  • Interim assessment (module 2)
    0.28 * Class work 1 + 0.28 * Class work 2 + 0.2 * Exam + 0.24 * In-class presentation
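The formula above can be sketched as a small calculation. The weights come from the syllabus; the function and variable names, the example grades, and the reading of “average of other grades” as an unweighted mean of the three remaining elements are assumptions made for illustration.

```python
# Weights taken from the interim assessment formula; they sum to 1.
WEIGHTS = {
    "class_work_1": 0.28,
    "class_work_2": 0.28,
    "exam": 0.20,
    "presentation": 0.24,
}


def interim_grade(class_work_1, class_work_2, presentation, exam=None):
    """Weighted interim grade.

    If exam is None the student is treated as exempted, and the exam grade
    is replaced by the unweighted mean of the other three grades
    (an assumed interpretation of the exemption rule).
    """
    if exam is None:
        exam = (class_work_1 + class_work_2 + presentation) / 3
    return (
        WEIGHTS["class_work_1"] * class_work_1
        + WEIGHTS["class_work_2"] * class_work_2
        + WEIGHTS["exam"] * exam
        + WEIGHTS["presentation"] * presentation
    )


# Hypothetical grades: class work 8 and 6, presentation 7.
print(round(interim_grade(8, 6, 7, exam=9), 2))  # 7.4
print(round(interim_grade(8, 6, 7), 2))          # 7.0 (exempted from the exam)
```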


Recommended Core Bibliography

  • Elfrinkhof, A. van, Maks, I., & Kaal, B. (2014). From Text to Political Positions : Text Analysis Across Disciplines. Amsterdam: John Benjamins Publishing Company. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=761345
  • Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.BC6A6457

Recommended Additional Bibliography

  • Benoit, K., Conway, D., Lauderdale, B. E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data. American Political Science Review, (02), 278. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsrep&AN=edsrep.a.cup.apsrev.v110y2016i02p278.295.00
  • Gayo-Avello, D. (2012). A meta-analysis of state-of-the-art electoral prediction from Twitter data. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.D8F9D90B
  • Michael Laver, Kenneth Benoit, & Trinity College. (2003). Extracting Policy Positions from Political Texts Using Words as Data. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.D6545538
  • van Atteveldt, W., Kleinnijenhuis, J., & Ruigrok, N. (2008). Parsing, Semantic Networks, and Political Authority Using Syntactic Analysis to Extract Semantic Relations from Dutch Newspaper Articles. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.A3B59093
  • Young, L., & Soroka, S. (2012). Affective News: The Automated Coding of Sentiment in Political Texts. Political Communication, 29(2), 205–231. https://doi.org/10.1080/10584609.2012.671234