Machine Learning and Data Mining
- Learn algorithms and their main advantages and limitations for social science goals
- Obtain skills to work with machine learning software and code
- Be able to work with different types of data, such as textual or relational data
- Analyze textual and numerical data
- Do textual preprocessing (lemmatization and tokenization)
- Analyze data with machine learning tools
- Visualize results of the analysis
- Topic 1. Introduction to machine learning. Introduction to machine learning and software review. Overview of the application of machine learning methods in various industries, including social science. A discussion of how modern methods of machine learning and artificial intelligence change approaches in many scientific fields, and why knowledge of these methods is becoming part of the researcher’s general scientific culture, regardless of the specific subject area. Discussion of data types, quality metrics, and the methodology for conducting experiments on data of various types.
- Topic 4. Regression (overview of models). The application of linear, logistic and non-linear regressions to text and tabular data. Impact of using categorical variables on regression results. Basic quality measures for regressions. Discussion of the results of applying regressions to different datasets.
- Topic 5. Feature selection. An overview of models for extracting useful features from a dataset (Univariate Selection, Recursive Feature Elimination, Principal Component Analysis, Random Forest). Implementation of the specified models in Python (using the sklearn library).
- Topic 6. Cluster analysis (K-means, C-means, hierarchical clustering). Implementation of cluster analysis in Jupyter Notebook. Development of a cluster analysis pipeline based on K-means. Ways to initialize the algorithm. Visualization of cluster analysis results. Development of a pipeline for hierarchical cluster analysis. Visualization of the results of hierarchical data clustering. Discussion of clustering problems. Discussion of the results of clustering on different datasets.
- Topic 7. Linear models of classification and regression. Introduction to the classification procedure. Discussion of the difference between classification and regression. Mathematical model of logistic regression for classification purposes. Optimization of classification results based on regularization procedures. Discussion of the quality metrics of classifiers (precision, recall, F-measure, ROC, confusion matrix). Implementation of logistic regression in Jupyter Notebook. Example of classification on real data.
- Topic 8. KNN and SVM classification. Discussion of the KNN algorithm. Analysis of the advantages and disadvantages of KNN. The problem of choosing the number of neighbors. Evaluation of the method of selecting the number of neighbors. Discussion of the SVM (Support Vector Machines) algorithm. Analysis of the advantages and disadvantages of this algorithm. Discussion of parameters in linear and polynomial SVM models. Implementation of the KNN model in Jupyter Notebook. Quality assessment of the KNN model. Implementation of the SVM model in Jupyter Notebook. Evaluation of the quality of KNN and SVM. Comparison of KNN and SVM results on real data (text and non-text data). Discussion of the results.
- Topic 9. Naïve Bayes classifier. Introduction to probability theory. The classical and Bayesian approaches to calculating the probability of an event. Discussion of Bayes' rule. A priori and a posteriori judgments. The use of the naive Bayes algorithm for classification purposes. Discussion of the advantages and disadvantages of the Bayes classifier. Implementation of a classification pipeline based on the naive Bayes algorithm in Jupyter Notebook. Comparison of the performance of the Bayesian classifier with the KNN and SVM classifiers on real datasets (textual and non-textual data).
- Topic 10. Topic modeling. Introduction to topic modeling. Probabilistic formulation of the classification problem. Discussion of various models in the field of topic modeling (E-M and Gibbs sampling algorithms). Discussion of the problem of choosing the number of topics. Assessment of the similarities and differences between topic solutions. Review of software tools in the field of topic modeling. Implementation of a pipeline for topic modeling in Jupyter Notebook and the ‘TopicMiner’ software. Evaluation of the effect of preprocessing on the results of topic modeling. Application of topic modeling to the analysis of socio-political data. Review of hierarchical topic models and models with word embeddings, and their implementation in Python (gensim, tomotopy).
- Topic 2. Overview of the mathematical formalism necessary for understanding machine learning. Overview of the mathematical model of machine learning. Overview of basic concepts from the field of linear algebra. Overview of elements of mathematical analysis. Overview of Jupyter Notebook and the general principles of working with notebooks.
- Topic 3. Data preprocessing. Vector model of text collections. Discussion of the data cleansing process, text lemmatization, and stop-word removal. Implementing the preprocessing process in Python (Jupyter Notebook) for the Russian and English languages. Discussion of the lemmatization procedures: 1. Mystem. 2. Pymorphy. 3. NLTK.
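The preprocessing steps named in Topic 3 (tokenization, stop-word removal, and the vector model of a text collection) can be illustrated with a minimal sketch. It uses only the Python standard library so that the mechanics stay visible. The stop-word list, the function names, and the two example sentences are assumptions made for illustration, and lemmatization is deliberately left out because it requires an external tool such as Mystem, Pymorphy, or NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize and drop stop words (no lemmatization in this sketch)."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenized = [preprocess(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical two-document collection.
docs = [
    "The voters trust the parliament.",
    "Trust in the parliament is a matter of trust.",
]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['matter', 'parliament', 'trust', 'voters']
print(matrix)  # [[0, 1, 1, 1], [1, 1, 2, 0]]
```

Each row of the matrix is one document represented as a vector of term counts. Without lemmatization, inflected forms such as "trusts" and "trusted" would occupy separate columns, which is one reason the lemmatization step matters for text collections.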
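The classifier quality metrics listed in Topic 7 (confusion matrix, precision, recall, F-measure) can be computed by hand for a binary problem, as in the sketch below. The function names and the label vectors are hypothetical, and the F-measure shown is the balanced F1 variant. In practice a library implementation, such as the one in sklearn, would normally be used instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # correctly predicted positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # correctly predicted negative
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1; each is 0.0 if its denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical true labels and classifier predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(confusion_counts(y_true, y_pred))     # (3, 1, 1, 3)
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Precision is the share of predicted positives that are truly positive, and recall is the share of true positives that the classifier found. F1 is their harmonic mean, so it is high only when both are high.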
- Presentation of project. A project is a written self-study on a topic offered by the teacher, or proposed by the student and approved by the teacher. The project topic should support the development of skills in critical thinking and written argumentation of ideas. A project should include a clear statement of a research problem and an analysis of the problem using concepts and analytical tools from the subject that generalize the author's point of view.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge, Mass: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=480968
- A Tutorial on Machine Learning and Data Science Tools with Python. (2017). Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.E5F82B62