Bachelor's Programme "Sociology and Social Informatics"

Advanced Data Analysis

Academic year: 2020/2021
Language of instruction: English
Credits: 6
Status: Elective course
When taught: 4th year, modules 1 and 2

Instructor

Course Syllabus

Abstract

Advanced Data Analysis in Sociology focuses on categorical data analysis and covers special types of prediction and classification models (logistic regression and cluster analysis). The course continues with a discussion of data culture, from data management to narrating with data. This course is also the starting point for students interested in pursuing advanced training in research methods or planning to use quantitative methods with categorical outcomes in their own research.

Learning Objectives

  • The course covers the foundations and popular techniques of categorical data analysis with the goal of training students to be informed producers and consumers of quantitative research.

Expected Learning Outcomes

  • Students read published research and data narratives critically, express their own opinions and give their interpretation.
  • Students propose recipe-based modelling solutions in R and efficient visualization strategies using dashboards and suitable graph types.
  • Students can identify reproducible research, know good practices in data management and data visualization, and apply them to their own research projects.
  • Students can apply classification techniques, propose hypotheses, and choose methods for categorical data analysis in R, including supervised classification with a binary outcome and unsupervised classification with clustering techniques for mixed data types.
  • Students interpret model outputs correctly, justify their choice of techniques, and assess the quality of analytical and visualization solutions, models, and data stories.

Course Contents

  • Data culture and data acumen
    Building data acumen: making meaningful, correct and useful judgments about data. Privacy and ethical concerns in data analysis and research. Data culture areas: data life-cycle, data curation, understanding causality, understanding conditional and joint probabilities, false negatives and false positives, critical assessment of popular practices and further use of R functionalities to make sense of the data. The data life-cycle: generation, collection, processing, management, analysis, visualization, interpretation, and delivery.
  • Binary logistic regression
    Models for categorical outcome variables. Variety of goals of analysis with categorical data. Typical goals of analysis and interpretation of results. Binary logistic regression. Objectives of logistic regression. The logistic curve. Maximum likelihood estimation. Assumptions of logistic regression. Perfect separation. Transforming a probability into odds and logit values. Goodness-of-fit measures for logistic regression. Out-of-sample validation. Classification matrix. Interpretation of results with linear and dichotomous predictors. Stepwise model building. Model diagnostics. Binary logistic regression in R.
  • Cluster analysis
    Objectives of cluster analysis: segmentation, taxonomy description, data simplification, and relationship identification. Conceptual framework for cluster analysis. Distance between objects. Similarity measures. Distance measures for various types of variables. Proximity matrix. Assumptions of cluster analysis. Hierarchical and non-hierarchical clustering algorithms. K-means clustering, DBSCAN clustering. Dendrograms. Measures of overall fit. Cluster profiles. Between- and within-cluster variation. Determining the number of clusters. Interpretation of clusters. Cross-classification from several solutions. Cluster analysis in R.
  • Data management and visualization
    Getting and wrangling data in R. 'Data curation'. Delivering results in applications using plotly. Data simulation for hypothesis testing. Stevens’ typology of data and the meaningfulness of scaling. Alternative scale taxonomies. Transforming data values to simplify the structure. Research questions and data types. Good practices in data visualization. Choice and critical reading of data narratives using graphs. Learning from data and communicating effectively with decision-makers. Using dashboards and presentations for data stories.
  • Understanding causality and prediction
    Computational (predictive) versus statistical (inferential) thinking. Association and causation in data analysis. Overview of pseudo-experimental techniques and directed acyclic graphs. The description-prediction-prescription framework and its critique. Data modeling culture. Data science language. Problems in current data modeling. The replication crisis: false-negative results, selection bias, p-value misuse. Recipe-based modelling techniques. Introduction to Bayesian thinking. The Bayes factor.
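
The binary logistic regression unit lists "Transforming a probability into odds and logit values". A minimal sketch of those transformations follows. The course itself works in R; Python is used here only for illustration, and the function names are hypothetical, not taken from any course material.

```python
import math


def prob_to_odds(p: float) -> float:
    """Odds in favour of an event with probability p: p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return p / (1.0 - p)


def prob_to_logit(p: float) -> float:
    """Logit (log-odds) of a probability: log(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def logit_to_prob(z: float) -> float:
    """Inverse logit (the logistic curve), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Illustrative probabilities only, not course data.
    for p in (0.1, 0.5, 0.8):
        print(p, prob_to_odds(p), prob_to_logit(p), logit_to_prob(prob_to_logit(p)))
```

A logistic regression model is linear on the logit scale, so a fitted linear predictor can be passed through `logit_to_prob` to obtain a predicted probability, and exponentiating a coefficient gives an odds ratio.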

Assessment Elements

  • non-blocking Binary Outcome project
  • non-blocking Dimension Reduction project
  • non-blocking Cluster Analysis project
  • non-blocking Coding reflection paper
    If you fail to submit this paper on time, you can make up for it by contributing to the 'R Gems' seminar.
  • non-blocking Rmd Customization
  • non-blocking Internet datasets
  • non-blocking Bayes reaction paper
  • non-blocking Viz Quiz
    The Tidy Tuesday: https://github.com/rfordatascience/tidytuesday
  • non-blocking Final Exam

Interim Assessment

  • Interim assessment (module 2)
    0.05 * Bayes reaction paper + 0.2 * Binary Outcome project + 0.2 * Cluster Analysis project + 0.1 * Coding reflection paper + 0.15 * Dimension Reduction project + 0.1 * Final Exam + 0.05 * Internet datasets + 0.05 * Rmd Customization + 0.1 * Viz Quiz
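
The weighted sum above can be checked with a short sketch. The weights are copied from the formula and add up to 1.0. The grading scale is not stated in the syllabus, so the sketch only assumes that all elements are scored on the same numeric scale; the function name and the example scores are hypothetical.

```python
# Weights copied from the interim assessment formula in the syllabus.
WEIGHTS = {
    "Bayes reaction paper": 0.05,
    "Binary Outcome project": 0.20,
    "Cluster Analysis project": 0.20,
    "Coding reflection paper": 0.10,
    "Dimension Reduction project": 0.15,
    "Final Exam": 0.10,
    "Internet datasets": 0.05,
    "Rmd Customization": 0.05,
    "Viz Quiz": 0.10,
}


def interim_grade(scores):
    """Weighted sum of element scores; every element must have a score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical scores on an assumed common scale, for illustration only.
    example_scores = {name: 8.0 for name in WEIGHTS}
    example_scores["Final Exam"] = 6.0
    print(round(interim_grade(example_scores), 2))
```

Because the Final Exam carries a weight of 0.1, each point gained or lost on it moves the interim grade by 0.1 points, while each point on either 0.2-weighted project moves it by 0.2.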

Bibliography

Recommended Core Bibliography

  • Ledolter, J. (2013). Data Mining and Business Analytics with R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=587979
  • Upton, G. J. G. (2016). Categorical Data Analysis by Example. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1402878
  • Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication. The American Statistician, (S1), 262. https://doi.org/10.1080/00031305.2018.1543137
  • Yau, N. (2013). Data Points: Visualization That Means Something. New York: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=566405

Recommended Additional Bibliography

  • Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edswao&AN=edswao.363067604
  • Mood, C. (2010). Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. European Sociological Review, 26(1), 67–82. https://doi.org/10.1093/esr/jcp006