Machine Learning and Data Mining
- Learn algorithms and their main advantages and limitations for social science goals
- Acquire skills in working with machine learning software and code
- Be able to work with different types of data, such as textual or relational data
- Collect data from the Internet for social science research
- Analyze the collected data with machine learning tools
- Visualize results of the analysis
- Present the resulting project
- Introduction to machine learning: Introduction to machine learning and a review of software. Overview of the application of machine learning methods in various industries and in social science. Discussion of how modern machine learning and artificial intelligence methods are changing approaches in many scientific fields, and why knowledge of these methods is becoming part of a researcher's general scientific culture, regardless of subject area. Discussion of data types, quality metrics, and the methodology for conducting experiments on data of various types.
- Overview of the mathematical formalism necessary for understanding machine learning: Overview of the mathematical model of machine learning. Overview of basic concepts from linear algebra. Overview of elements of mathematical analysis. Introduction to the 'Orange' package and the general principles of working with it. Data visualization. Introduction to RStudio and the general principles of working with it. Overview of Jupyter Notebook and the general principles of working with notebooks.
- Data collection and existing databases for machine learning: Overview of various free data repositories. Overview of data collection from VKontakte (API requests).
- Data preprocessing: Vector model of text collections. Discussion of the data cleaning process, text lemmatization, and stop word removal. Implementation of the preprocessing process in Python (Jupyter Notebook).
- Cluster analysis (K-means, C-means, hierarchical clustering): Implementation of cluster analysis in Python. Development of a cluster analysis pipeline based on K-means. Ways to initialize the algorithm. Visualization of cluster analysis results. Development of a pipeline for hierarchical cluster analysis. Visualization of the results of hierarchical clustering. Discussion of the results. Implementation of cluster analysis in Jupyter Notebook. Discussion of clustering results on different datasets.
- Linear models for classification and regression: Introduction to the classification procedure. Discussion of the difference between classification and regression. Mathematical model of logistic regression for classification. Optimization of classification results using regularization. Discussion of classifier quality metrics (precision, recall, F-measure, ROC, confusion matrix). Implementation of logistic regression in Python (Jupyter Notebook). Example of classification on real data.
- KNN and SVM classification: Discussion of the KNN algorithm. Analysis of the advantages and disadvantages of KNN. The problem of choosing the number of neighbors. Evaluation of the method for selecting the number of neighbors. Discussion of the SVM (Support Vector Machines) algorithm. Analysis of the advantages and disadvantages of this algorithm. Discussion of parameters in linear and polynomial SVM models. Implementation of the KNN model in Jupyter Notebook. Quality assessment of the KNN model. Evaluation of the quality of KNN and SVM. Comparison of KNN and SVM results on real data (text and non-text data). Discussion of the results.
- Naïve Bayes classifier: Introduction to probability theory. The classical and Bayesian approaches to calculating the probability of an event. Discussion of Bayes' rule. A priori and a posteriori judgments. Use of the naive Bayes algorithm for classification. Discussion of the advantages and disadvantages of the Bayes classifier. Implementation of a classification pipeline based on the naive Bayes algorithm in Jupyter Notebook. Comparison of the Bayesian classifier with the KNN and SVM classifiers on real datasets (textual and non-textual data).
- Topic modeling: Introduction to topic modeling. Probabilistic formulation of the classification problem. Discussion of various models in the field of topic modeling (E-M and Gibbs sampling algorithms). Discussion of the problem of choosing the number of topics. Assessment of the similarities and differences between topic solutions. Review of software tools for topic modeling. Implementation of a topic modeling pipeline in Jupyter Notebook and the 'TopicMiner' software. Evaluation of the effect of preprocessing on topic modeling results. Application of topic modeling to the analysis of socio-political data.
- Decision trees: Discussion of the tree construction algorithm. Implementation of an ensemble of trees. Discussion of the following quality metrics: 1. Gini index. 2. Gibbs–Shannon entropy. 3. Gini impurity. 4. Misclassification error. Implementation of a decision tree in Python. Use of decision trees in data classification examples.
- Presentation of project: A project is a written self-study on a topic offered by the teacher or proposed by the student and approved by the teacher. The project topic should develop skills in critical thinking and written argumentation. A project should include a clear statement of a research problem and an analysis of the problem using concepts and analytical tools from the subject that generalize the author's point of view.
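
As an illustration of the node-quality measures named in the decision trees topic above, the following is a minimal Python sketch of Gini impurity, Shannon entropy, and misclassification error for the class labels that fall into one tree node. It is not taken from the course materials: the function names and the example labels are hypothetical, and only the Python standard library is used.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the share of each class among the labels in a node."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def shannon_entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over classes."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


def misclassification_error(labels):
    """Share of labels that differ from the majority class."""
    return 1.0 - max(class_proportions(labels))


# Hypothetical node containing three "yes" labels and one "no" label.
node = ["yes", "yes", "yes", "no"]
for measure in (gini_impurity, shannon_entropy, misclassification_error):
    print(measure.__name__, round(measure(node), 4))
```

All three measures equal zero for a node that contains a single class and grow as the classes become more evenly mixed, which is why a tree-building algorithm can use any of them to compare candidate splits.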
- Bell, J. (2015). Machine Learning: Hands-On for Developers and Technical Professionals. Indianapolis, IN: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=872454
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge, MA: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=480968
- A Tutorial on Machine Learning and Data Science Tools with Python. (2017). Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.E5F82B62
- Črt Gorup, Mitar Milutinovič, Matija Polajnar, Marko Toplak, & Lan Umek. (n.d.). Orange: Data Mining Toolbox in Python. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.59267479
- Horton, N. J., & Kleinman, K. (2015). Using R and RStudio for Data Management, Statistical Analysis, and Graphics (2nd ed.). Boca Raton, FL: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=957543