Text Mining and Natural Language Processing

Academic year: 2021/2022
Language of instruction: English
ECTS credits: 3
Course type: Compulsory course
When: Year 1, module 2

Instructor

Course Syllabus

Abstract

For the social and political sciences, written texts provide essential data for studying ideology and political discourse, conflict, sentiment and political affiliation, among many other things. With the growing availability of large text collections in digital form, it is tempting to scale research up in terms of the population studied (e.g. “all of Twitter”), time span (e.g. “all of American history”), and geographical scope (e.g. “all foreign ties of China”). Computational methods for text analysis promise to help at scales where traditional content analysis is not feasible. We will use the R programming environment as a toolbox for text analysis. To “learn by doing”, we will work with real text collections and replicate some methods from recent social research that employs computational text analysis.
Learning Objectives

  • The goal of the course is to provide a basic understanding of how to properly use collections of texts as quantitative evidence, and to make this knowledge practical.
Expected Learning Outcomes

  • Understanding the possibilities of automated text analysis, as well as its pitfalls and the important caveats about applying statistical tests to language data.
  • Being able to apply computational methods of text analysis (e.g. analysis of word frequency and co-occurrence, document classification, topic modeling) to collections of texts.
  • Being able to adequately interpret and report the results of computational text analysis in research papers.
Course Contents

  • Counting words — Preprocessing: transforming text into data in R.
  • Comparing corpora — Comparing word usage in contrast corpora.
  • Document-level modeling — Vector space model. Document classification.
  • Co-occurrence — Distributional semantics.
  • Dictionary methods — Sentiment.
Assessment Elements

  • non-blocking Homework
  • non-blocking Class participation
  • non-blocking Final project
Interim Assessment

  • Interim assessment (module 2)
    0.2 * Class participation + 0.3 * Final project + 0.5 * Homework
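
As a worked illustration of the formula above, the following minimal sketch computes the weighted grade. The weights come from the syllabus. The 0-10 score scale, the component scores, and the names `WEIGHTS` and `interim_grade` are assumptions made for this example only. No rounding rule is applied, since none is stated here. Python is used purely for illustration; the course itself works in R.

```python
# Weights as given in the syllabus formula.
WEIGHTS = {
    "class_participation": 0.2,
    "final_project": 0.3,
    "homework": 0.5,
}


def interim_grade(scores):
    """Return the weighted sum of the component scores.

    `scores` maps each component name in WEIGHTS to a numeric score.
    The score scale (e.g. 0-10) is an assumption, not stated in the formula.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, chosen only to show the arithmetic:
# 0.2 * 8 + 0.3 * 7 + 0.5 * 9 = 1.6 + 2.1 + 4.5 = 8.2
example = {"class_participation": 8, "final_project": 7, "homework": 9}
print(round(interim_grade(example), 2))
```

Because homework carries half of the total weight, a one-point change in the homework score moves the interim grade by 0.5, compared with 0.3 for the final project and 0.2 for class participation.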
Bibliography

Recommended Core Bibliography

  • Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 3, 267.
  • Rule, A., Cointet, J.-P., & Bearman, P. S. (2015). Lexical shifts, substantive changes, and continuity in State of the Union discourse, 1790-2014. Proceedings of the National Academy of Sciences of the United States of America, 112(35). https://doi.org/10.1073/pnas.1512221112
  • Zhai, C., & Aggarwal, C. C. (2012). Mining Text Data. New York: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=537386

Recommended Additional Bibliography

  • Grimmer, J. (2010). A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases. Political Analysis, 1, 1.
  • Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016). Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora.
  • Lipton, Z. C. (2016). The Mythos of Model Interpretability. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.E8C74632