
Advanced Data Analysis

Academic year: 2024/2025
Language of instruction: English
ECTS credits: 5
Course type: Elective course
When: 4th year, modules 1 and 2

Instructor


Bobrikov, Dmitrii

Course Syllabus

Abstract

The course is aimed at undergraduate social science students pursuing careers in data analysis or academia. It consists of seminars and covers prediction and classification models (logistic regression and cluster analysis) as well as more advanced data management topics such as web scraping and data imputation. The course also discusses data culture, from data management and coding style to narrating with data and Bayesian statistics. In addition, students learn coding tools that help produce tidy analyses. The course is a starting point for students interested in advanced training in research methods or planning to use quantitative methods with categorical outcomes in their own research.
Learning Objectives

  • The course covers the foundations and popular techniques of categorical data analysis with the goal of training students to be informed producers and consumers of quantitative research.
Expected Learning Outcomes

  • Students apply classification techniques, propose hypotheses, and choose appropriate methods for categorical data analysis in R, including supervised classification with a binary outcome and unsupervised classification with clustering techniques for mixed data types.
  • Students create customized R Markdown reports.
  • Students create reproducible analysis scripts.
  • Students define basic terms and identify the purposes of Bayesian inference vs frequentist inference.
  • Students describe known problems with null hypothesis significance testing and propose known solutions to them.
  • Students inspect missing data patterns and apply various methods of data imputation.
  • Students interpret outputs correctly, justify their choice of techniques, and assess the quality of models, analytical and visualization solutions, and data stories.
  • Students propose and apply tools for reproducible and ethical data analysis.
  • Students scrape simple web tables and texts with R and convert them into standard data formats.
Course Contents

  • Coding with Style
  • Web Scraping
  • Binary Logistic Regression
  • Missing Data
  • Cluster Analysis
  • Data Culture and Data Acumen
Assessment Elements

  • non-blocking Final Exam
    The exam is an online test in which students answer questions on all topics covered. Question blocks include PCA, Logistic Model Diagnostics, Web Data Collection, Bayesian Thinking, Data Culture, and Reproducible Science.
  • non-blocking Data Imputation
    The binary logistic regression project is another project to be evaluated separately.
  • non-blocking Binary Outcome project
    The project includes an optional part for additional points (up to 2 points, counted only if the main task is complete): use one or more decision tree methods, compare their performance with that of the logistic regression solution, and conclude which method performs better on this task.
  • non-blocking Cluster Analysis project
  • non-blocking Web scraping
  • non-blocking Rmd Customization
  • non-blocking Bayes reaction paper
    This is a non-compulsory task for extra points.
Interim Assessment

  • 2024/2025 2nd module
    Rmd Customization * 0.1 + Web scraping * 0.2 + Binary Outcome project * 0.15 + Cluster Analysis project * 0.15 + Data Imputation * 0.2 + Bayes reaction paper * 0.1 + Final Exam * 0.1
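
The formula above is a weighted sum, which can be sketched in a few lines of Python. This is only an illustration: the weights are copied from the formula, while the function name `interim_grade`, the example scores, the 0-10 scale, and the choice to count a missing element as 0 are assumptions made here, not rules stated in the syllabus.

```python
# Weights copied from the interim assessment formula for the 2nd module.
WEIGHTS = {
    "Rmd Customization": 0.10,
    "Web scraping": 0.20,
    "Binary Outcome project": 0.15,
    "Cluster Analysis project": 0.15,
    "Data Imputation": 0.20,
    "Bayes reaction paper": 0.10,
    "Final Exam": 0.10,
}


def interim_grade(scores):
    """Return the weighted sum of element scores.

    Assumption: an element missing from `scores` counts as 0.
    """
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


# Hypothetical scores on an assumed 0-10 scale, for illustration only.
example_scores = {
    "Rmd Customization": 8,
    "Web scraping": 7,
    "Binary Outcome project": 9,
    "Cluster Analysis project": 6,
    "Data Imputation": 8,
    "Final Exam": 7,
}

print(round(interim_grade(example_scores), 2))
```

The weights sum to 1, so under these assumptions the result stays on the same scale as the individual element scores.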
Bibliography

Recommended Core Bibliography

  • Baker, M. (2015). Reproducibility crisis: Blame it on the antibodies. Nature, 521(7552), 274–276. https://doi.org/10.1038/521274a
  • Ledolter, J. (2013). Data Mining and Business Analytics with R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=587979
  • Munzert, S. (2014). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, West Sussex, United Kingdom: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=878670
  • Upton, G. J. G. (2016). Categorical Data Analysis by Example. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1402878
  • Wickham, H. (2015). Advanced R, Second Edition. Boca Raton, FL: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=934735
  • Wickham, H., & Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1440131

Recommended Additional Bibliography

  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Press. ISBN 9781439898208. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=nlebk&AN=1763244
  • McElreath, R. (2015). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC. ISBN 9781482253467. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=nlebk&AN=1338291
  • Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1175341
  • Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Hoboken: Wiley-Interscience. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=838162
  • Mood, C. (2010). Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. European Sociological Review, 26(1), 67–82. https://doi.org/10.1093/esr/jcp006
  • vanden Broucke, S., & Baesens, B. (2018). Practical Web Scraping for Data Science: Best Practices and Examples with Python. Apress.

Authors

  • BOBRIKOV DMITRII DMITRIEVICH
  • ILINA MARIIA IVANOVNA