Advanced Data Analysis
- The course covers the foundations and popular techniques of categorical data analysis with the goal of training students to be informed producers and consumers of quantitative research.
- Students apply classification techniques, propose hypotheses, and choose methods for categorical data analysis in R, including supervised classification with a binary outcome and unsupervised classification (clustering) of mixed data types.
- Students interpret outputs correctly, justify their choice of techniques, and assess the quality of models, of proposed analytical and visualization solutions, and of data stories.
- Students define basic terms and identify the purposes of Bayesian inference vs frequentist inference.
- Students create customized R Markdown reports.
- Students create reproducible analysis scripts.
- Students inspect missing data patterns and apply various methods of data imputation.
- Students scrape simple web tables and texts with R and convert them into standard data formats.
- Students describe known problems with null hypothesis significance testing and propose known solutions to them.
- Students propose and apply tools for reproducible and ethical data analysis.
- Coding with Style: Customized reports in R Markdown, from fonts to headers. Review of the grammar of graphics in ggplot2. Creating reproducible scripts. Data simulation for hypothesis testing. Review of R style guide. Tidyverse and easystats R superpackages.
- Web Scraping: Web data. Working with APIs. Data scraping with R packages. JSON and XML data. Structure of a web page. HTML tags. Using regular expressions in data management.
- Binary Logistic Regression: Models for categorical outcome variables. Variety of goals of analysis with categorical data. Typical goals of analysis and interpretation of results. Binary logistic regression. Objectives of logistic regression. The logistic curve. Maximum likelihood estimation. Assumptions of logistic regression. Perfect separation. Transforming a probability into odds and logit values. Goodness-of-fit measures for logistic regression. Out-of-sample validation. Classification matrix. Interpretation of results with linear and dichotomous predictors. Stepwise model building. Model diagnostics. Binary logistic regression in R. A comparison of binary logistic regression with decision trees.
- Missing Data: Patterns of missing data. Types and rules of data imputation (listwise deletion, median imputation, hot-deck, k-Nearest Neighbors, regression, MICE).
- Cluster Analysis: Objectives of cluster analysis, namely segmentation, taxonomy description, data simplification, and relationship identification. A conceptual framework for cluster analysis. Distance between objects. Similarity measures. Distance measures for various types of variables. Proximity matrix. Assumptions of cluster analysis. Hierarchical and non-hierarchical clustering algorithms. K-means clustering, PAM, DBSCAN clustering. Dendrograms. Measures of overall fit. Cluster profiles. Between- and within-cluster variation. Determining the number of clusters. Interpretation of clusters. Cross-classification from several solutions. Cluster analysis in R. A comparison of cluster analysis with data reduction by principal components analysis and factor analysis.
- Data Culture and Data Acumen: Data acumen as making meaningful, correct, and useful judgments about data. The data life-cycle: generation, collection, processing, management, analysis, visualization, interpretation, and delivery. Privacy and transparency in data analysis and research. Data culture areas: data life-cycle, data curation, understanding causality, understanding conditional and joint probabilities, false negatives and false positives, critical assessment of popular practices, and further use of R functionalities to make sense of the data. Problems in current data modeling. Stevens’ typology of data and the meaningfulness of scaling. Alternative scale taxonomies. Computational (predictive) versus statistical (inferential) thinking. Overview of directed acyclic graphs. The replication crisis: false-negative results, selection bias, p-value misuse. Introduction to Bayesian thinking. The Bayes factor.
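The Binary Logistic Regression topic above lists transforming a probability into odds and logit values. A minimal base-R sketch of that transformation follows; the helper names and the example probabilities are illustrative assumptions, not course material.

```r
# Convert a probability into odds and log-odds (logit), and back again.
# Helper names are hypothetical; only base R is used.
prob_to_odds  <- function(p) p / (1 - p)
odds_to_logit <- function(odds) log(odds)
logit_to_prob <- function(logit) 1 / (1 + exp(-logit))

p     <- c(0.1, 0.5, 0.8)       # illustrative probabilities
odds  <- prob_to_odds(p)        # odds in favour of the event
logit <- odds_to_logit(odds)    # log-odds, the scale of logistic regression coefficients
back  <- logit_to_prob(logit)   # inverse transformation recovers p

print(data.frame(p = p, odds = odds, logit = logit, back = back))
```

Base R's `qlogis()` and `plogis()` compute the same logit and inverse-logit values directly, so the helpers above are only there to make each step visible.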
- Binary Outcome project: The project includes an extra part for additional points (up to 2 points, counted only if the main task is complete): apply one or more decision tree methods, compare the quality of the logistic regression and decision tree solutions, and conclude which of the two performs better on the data at hand.
- Dimension Reduction project
- Cluster Analysis project
- Coding reflection paper: This is a non-compulsory task for extra points.
- Rmd Customization
- Web scraping
- Bayes reaction paper: This is a non-compulsory task for extra points.
- Final Exam
- Data Imputation: The binary logistic regression project is another project to be evaluated separately.
- Interim assessment (module 2): 0.05 * Bayes reaction paper + 0.25 * Binary Outcome project + 0.2 * Cluster Analysis project + 0.05 * Coding reflection paper + 0.1 * Data Imputation + 0.1 * Dimension Reduction project + 0.1 * Final Exam + 0.05 * Rmd Customization + 0.1 * Web scraping
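The interim assessment formula above is a weighted sum whose weights add up to 1. As a rough sketch, it can be evaluated in base R as shown here; the variable names, the scores, and the 10-point scale are hypothetical assumptions made only for illustration.

```r
# Weights copied from the interim assessment formula.
weights <- c(
  bayes_reaction_paper        = 0.05,
  binary_outcome_project      = 0.25,
  cluster_analysis_project    = 0.20,
  coding_reflection_paper     = 0.05,
  data_imputation             = 0.10,
  dimension_reduction_project = 0.10,
  final_exam                  = 0.10,
  rmd_customization           = 0.05,
  web_scraping                = 0.10
)
stopifnot(isTRUE(all.equal(sum(weights), 1)))  # the weights sum to 1

# Weighted sum of element scores; `scores` must name every assessment element.
interim_grade <- function(scores, w = weights) {
  stopifnot(setequal(names(scores), names(w)))
  sum(w * scores[names(w)])
}

# Hypothetical scores on an assumed 10-point scale; both optional papers skipped.
scores <- c(
  bayes_reaction_paper = 0, binary_outcome_project = 8,
  cluster_analysis_project = 7, coding_reflection_paper = 0,
  data_imputation = 9, dimension_reduction_project = 6,
  final_exam = 7, rmd_customization = 10, web_scraping = 8
)
print(interim_grade(scores))
```

Read literally, the formula gives the two non-compulsory papers a combined weight of 0.1, so this sketch treats a skipped paper as a score of 0. How extra points are actually credited is not specified here.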
- Ledolter, J. (2013). Data Mining and Business Analytics with R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=587979
- Munzert, S. (2014). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, West Sussex, United Kingdom: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=878670
- Upton, G. J. G. (2016). Categorical Data Analysis by Example. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1402878
- Wickham, H., & Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1440131
- Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Hoboken: Wiley-Interscience. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=838162
- McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1338291
- Mood, C. (2010). Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. European Sociological Review, 26(1), 67–82. https://doi.org/10.1093/esr/jcp006
- vanden Broucke, S., & Baesens, B. (2018). Practical Web Scraping for Data Science: Best Practices and Examples with Python. Apress.