Bachelor's Programme "Sociology and Social Informatics"

28 January

Advanced Data Analysis

Academic year: 2021/2022
Language of instruction: English (ENG)
Credits: 5
Status: Elective course
When: 4th year, modules 1 and 2

Instructor

Course Syllabus

Abstract

The course is targeted at undergraduate social science students aiming at careers in data analysis or academia. The course consists of seminars. It covers special prediction and classification models (logistic regression and cluster analysis) and more advanced data management topics such as web scraping and data imputation. The course discusses data culture, from data management to coding styles to narrating with data and Bayesian statistics. It is also a starting point for students interested in pursuing advanced training in research methods or planning to use quantitative methods with categorical outcomes in their own research.

Learning Objectives

  • The course covers the foundations and popular techniques of categorical data analysis with the goal of training students to be informed producers and consumers of quantitative research.

Expected Learning Outcomes

  • Students apply classification techniques, propose hypotheses, and choose methods for categorical data analysis in R, including supervised classification with a binary outcome and unsupervised classification with clustering techniques for mixed data types.
  • Students interpret outputs correctly, justify their choice of techniques, and assess the quality of proposed analytical and visualization solutions, models, and data stories.
  • Students define basic terms and identify the purposes of Bayesian inference vs frequentist inference.
  • Students create customized R Markdown reports.
  • Students create reproducible analysis scripts.
  • Students inspect missing data patterns and apply various methods of data imputation.
  • Students scrape simple web tables and texts with R and convert them into standard data formats.
  • Students describe known problems with null hypothesis significance testing and propose known solutions to them.
  • Students propose and apply tools for reproducible and ethical data analysis.

Course Contents

  • Coding with Style
    Customized reports in R Markdown: from fonts to headers. Review of the grammar of graphics in ggplot2. Creating reproducible scripts. Data simulation for hypothesis testing. Review of R style guide. Tidyverse and easystats R superpackages.
  • Web Scraping
    Web data. Working with APIs. Data scraping with R packages. JSON and XML data. Structure of a web page. HTML tags. Using regular expressions in data management.
  • Binary Logistic Regression
    Models for categorical outcome variables. Variety of goals of analysis with categorical data. Typical goals of analysis and interpretation of results. Binary logistic regression. Objectives of logistic regression. The logistic curve. Maximum likelihood estimation. Assumptions of logistic regression. Perfect separation. Transforming a probability into odds and logit values. Goodness-of-fit measures for logistic regression. Out-of-sample validation. Classification matrix. Interpretation of results with linear and dichotomous predictors. Stepwise model building. Model diagnostics. Binary logistic regression in R. A comparison of binary logistic regression with decision trees.
  • Missing Data
    Patterns of missing data. Types and rules of data imputation (listwise deletion, median imputation, hot-deck, k-Nearest Neighbors, regression, MICE).
  • Cluster Analysis
    Objectives of cluster analysis: segmentation, taxonomy description, data simplification, and relationship identification. A conceptual framework for cluster analysis. Distance between objects. Similarity measures. Distance measures for various types of variables. Proximity matrix. Assumptions of cluster analysis. Hierarchical and non-hierarchical clustering algorithms. K-means clustering, PAM, DBSCAN clustering. Dendrograms. Measures of overall fit. Cluster profiles. Between- and within-cluster variation. Determining the number of clusters. Interpretation of clusters. Cross-classification from several solutions. Cluster analysis in R. A comparison of cluster analysis with data reduction by principal components analysis and factor analysis.
  • Data Culture and Data Acumen
    Data acumen: making meaningful, correct, and useful judgments about data. The data life-cycle: generation, collection, processing, management, analysis, visualization, interpretation, and delivery. Privacy and transparency in data analysis and research. Data culture areas: data life-cycle, data curation, understanding causality, understanding conditional and joint probabilities, false negatives and false positives, critical assessment of popular practices and further use of R functionalities to make sense of the data. Problems in current data modeling. Stevens’ typology of data and the meaningfulness of scaling. Alternative scale taxonomies. Computational (predictive) versus statistical (inferential) thinking. Overview of directed acyclic graphs. The replication crisis: false-negative results, selection bias, p-value misuse. Introduction to Bayesian thinking. The Bayes factor.
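
The Binary Logistic Regression topic above lists "Transforming a probability into odds and logit values" and "The logistic curve". As a minimal sketch of those transformations: the course itself works in R, so the Python below is only an illustration, and the function names are hypothetical rather than taken from the course materials.

```python
import math


def probability_to_odds(p: float) -> float:
    """Odds in favour of an event with probability p: p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def probability_to_logit(p: float) -> float:
    """Logit (natural log of the odds) of a probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must satisfy 0 < p < 1")
    return math.log(p / (1.0 - p))


def logit_to_probability(z: float) -> float:
    """Inverse logit, i.e. the logistic curve: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative probabilities only (not course data).
for p in (0.1, 0.5, 0.8):
    odds = probability_to_odds(p)
    logit = probability_to_logit(p)
    print(f"p={p:.1f}  odds={odds:.3f}  logit={logit:+.3f}")
```

A probability of 0.5 corresponds to even odds and a logit of zero; probabilities above 0.5 give positive logits and those below give negative ones, which is why logistic regression coefficients are read on the log-odds scale.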

Assessment Elements

  • non-blocking Binary Outcome project
    The project includes an optional part for extra points (up to 2 points, counted only if the main task is complete): fit one or more decision tree models, compare their performance with that of the logistic regression solution, and conclude which method performs better on these data.
  • non-blocking Dimension Reduction project
  • non-blocking Cluster Analysis project
  • non-blocking Coding reflection paper
    This is a non-compulsory task for extra points.
  • non-blocking Rmd Customization
  • non-blocking Web scraping
  • non-blocking Bayes reaction paper
    This is a non-compulsory task for extra points.
  • non-blocking Final Exam
  • non-blocking Data Imputation
    The binary logistic regression project is another project to be evaluated separately.

Interim Assessment

  • Interim assessment (module 2)
    0.05 * Bayes reaction paper + 0.25 * Binary Outcome project + 0.2 * Cluster Analysis project + 0.05 * Coding reflection paper + 0.1 * Data Imputation + 0.1 * Dimension Reduction project + 0.1 * Final Exam + 0.05 * Rmd Customization + 0.1 * Web scraping
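
The weighting formula above can be checked with a short calculation. This is a sketch in Python (the course itself uses R) under assumptions the syllabus does not state: each element is scored on a common scale, an element that was not submitted counts as 0, and no rounding rule is applied. The example scores in the test values are hypothetical.

```python
# Weights copied from the interim assessment formula.
WEIGHTS = {
    "Bayes reaction paper": 0.05,
    "Binary Outcome project": 0.25,
    "Cluster Analysis project": 0.20,
    "Coding reflection paper": 0.05,
    "Data Imputation": 0.10,
    "Dimension Reduction project": 0.10,
    "Final Exam": 0.10,
    "Rmd Customization": 0.05,
    "Web scraping": 0.10,
}


def interim_grade(scores: dict[str, float]) -> float:
    """Weighted sum of element scores.

    Assumption: elements missing from `scores` contribute 0.
    """
    unknown = set(scores) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"unknown assessment elements: {sorted(unknown)}")
    return sum(w * scores.get(name, 0.0) for name, w in WEIGHTS.items())


# The weights sum to 1, so equal scores on every element return that score.
print(round(sum(WEIGHTS.values()), 10))
```

Under these assumptions, a student who skips both optional reaction papers (0.05 each) can still reach at most 90% of the maximum score from the remaining elements.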

Bibliography

Recommended Core Bibliography

  • Ledolter, J. (2013). Data Mining and Business Analytics with R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=587979
  • Munzert, S. (2014). Automated Data Collection with R : A Practical Guide to Web Scraping and Text Mining. Chichester, West Sussex, United Kingdom: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=878670
  • Upton, G. J. G. (2016). Categorical Data Analysis by Example. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1402878
  • Wickham, H., & Grolemund, G. (2016). R for Data Science : Import, Tidy, Transform, Visualize, and Model Data (1st ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1440131

Recommended Additional Bibliography

  • Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Hoboken: Wiley-Interscience. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=838162
  • McElreath, R. (2016). Statistical Rethinking : A Bayesian Course with Examples in R and Stan. Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1338291
  • Mood, C. (2010). Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. European Sociological Review, 26(1), 67–82. https://doi.org/10.1093/esr/jcp006
  • vanden Broucke, S., & Baesens, B. (2018). Practical Web Scraping for Data Science : Best Practices and Examples with Python. Apress.