
Machine Learning and Data Mining

Academic year
Language of instruction: English
Elective course
Delivered in:
1st year, module 4


Course Syllabus


The rapid growth of social networking sites, online media, and other internet-generated data is making machine learning an essential analytical tool for social scientists and industry analysts working with social data. Social researchers now need not only to work with different types of data, such as textual or relational data, but also to interpret results obtained with complex mathematical algorithms. In this course students will, first, get to know basic machine learning algorithms and their main advantages and limitations for social science goals. Second, they will acquire skills in working with machine learning software and code. Third, by the end of the course each student will produce a small-scale research project that may be used in their Master's thesis. Depending on the level of the student group, the course may be based on one of the following software tools: Orange, R, or Python.
Learning Objectives

  • Learn basic machine learning algorithms and their main advantages and limitations for social science goals
  • Acquire skills in working with machine learning software and code
  • Be able to work with different types of data, such as textual or relational data
Expected Learning Outcomes

  • Collect data from the Internet for social science research
  • Analyze the collected data with machine learning tools
  • Visualize results of the analysis
  • Present the resulting project
Course Contents

  • Introduction to machine learning
    Introduction to machine learning and software review. Overview of the application of machine learning methods in various industries and in social science. Discussion of how modern methods of machine learning and artificial intelligence are changing approaches in many scientific fields, and why knowledge of these methods is becoming part of a researcher's general scientific culture, regardless of the specific subject area. Discussion of data types, quality metrics, and the methodology for conducting experiments on data of various types.
  • Overview of the mathematical formalism necessary for understanding machine learning
    Overview of the mathematical model of machine learning. Overview of basic concepts from linear algebra. Overview of elements of mathematical analysis. Introduction to the 'Orange' package and the general principles of working with it. Data visualization. Introduction to RStudio and the general principles of working with it. Overview of Jupyter Notebook and the general principles of working with notebooks.
  • Data collection and existing databases for machine learning
    Overview of various free data warehouses. Overview of data collection from VKontakte (API requests).
  • Data preprocessing
    Vector model of text collections. Discussion of the data cleaning process, text lemmatization, and stop-word removal. Implementing the preprocessing steps in Python (Jupyter Notebook).
  • Cluster analysis (K-means, C-means, hierarchical clustering)
    Implementing cluster analysis in Python. Development of a cluster analysis pipeline based on K-means. Ways to initialize the algorithm. Visualization of cluster analysis results. Development of a pipeline for hierarchical cluster analysis. Visualization of the results of hierarchical clustering. Implementing cluster analysis in Jupyter Notebook. Discussion of clustering results on different datasets.
  • Linear models for classification and regression
    Introduction to the classification procedure. Discussion of the difference between classification and regression. Mathematical model of logistic regression for classification purposes. Optimization of classification results based on regularization procedures. Discussion of the quality metrics of classifiers (precision, recall, F-measure, ROC, confusion matrix). Implementation of logistic regression in Python (Jupyter Notebook). Example of classification on real data.
  • KNN and SVM classification
    Discussion of the KNN algorithm. Analysis of the advantages and disadvantages of KNN. The problem of choosing the number of neighbors. Evaluation of methods for selecting the number of neighbors. Discussion of the SVM (Support Vector Machines) algorithm. Analysis of the advantages and disadvantages of this algorithm. Discussion of parameters in linear and polynomial SVM models. Implementation of the KNN model in Jupyter Notebook. Quality assessment of the KNN and SVM models. Comparison of KNN and SVM results on real data (text and non-text data). Discussion of the results.
  • Naïve Bayes classifier
    Introduction to probability theory. The classical and Bayesian approaches to calculating the probability of an event. Discussion of Bayes' rule. A priori and a posteriori judgments. The use of the naive Bayes algorithm for classification purposes. Discussion of the advantages and disadvantages of the Bayes classifier. Implementation of a classification pipeline based on the naive Bayes algorithm in Jupyter Notebook. Comparison of the Bayesian classifier with the KNN and SVM classifiers on real datasets (textual and non-textual data).
  • Topic Modeling
    Introduction to topic modeling. Probabilistic formulation of the classification problem. Discussion of various models in the field of topic modeling (EM and Gibbs sampling algorithms). Discussion of the problem of choosing the number of topics. Assessing the similarities and differences between topic solutions. Review of software tools in the field of topic modeling. Implementation of a topic modeling pipeline in Jupyter Notebook and the 'TopicMiner' software. Evaluation of the effect of preprocessing on the results of topic modeling. Application of topic modeling to the analysis of socio-political data.
  • Decision trees
    Discussion of the tree construction algorithm. Implementation of an ensemble of trees. Discussion of the following quality metrics: 1. Gini index. 2. Gibbs-Shannon entropy. 3. Gini impurity. 4. Misclassification error. Implementing a decision tree in Python. Using decision trees in data classification examples.
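As an illustration of the split-quality metrics named under the decision trees topic, the sketch below computes Gini impurity, Shannon entropy, and misclassification error for the class labels in one tree node, plus the size-weighted impurity of a candidate split. It is a minimal standard-library sketch, not course material: the function names and the toy "yes"/"no" labels are assumptions made for the example.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the share of each class among the labels in one tree node."""
    if not labels:
        raise ValueError("a node must contain at least one label")
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def shannon_entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * log2(p) for p in class_proportions(labels))


def misclassification_error(labels):
    """Share of labels that do not belong to the majority class."""
    return 1.0 - max(class_proportions(labels))


def weighted_impurity(groups, measure):
    """Average impurity of the child nodes, weighted by their size."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * measure(group) for group in groups if group)


if __name__ == "__main__":
    # Hypothetical toy node: 5 "yes" and 5 "no" labels, then one candidate split.
    parent = ["yes"] * 5 + ["no"] * 5
    left = ["yes"] * 4 + ["no"]
    right = ["yes"] + ["no"] * 4
    for measure in (gini_impurity, shannon_entropy, misclassification_error):
        before = measure(parent)
        after = weighted_impurity([left, right], measure)
        print(f"{measure.__name__}: parent={before:.3f}, after split={after:.3f}")
```

For the balanced toy node the Gini impurity is 0.5 and the entropy is 1 bit; after the split the weighted Gini impurity falls to about 0.32. A tree-building algorithm would prefer the candidate split with the largest such decrease.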
Assessment Elements

  • non-blocking Presentation of project
    A project is a written self-study on a topic offered by the teacher, or proposed by the student and approved by the teacher. The project topic should develop skills in critical thinking and written argumentation. A project should include a clear statement of a research problem and an analysis of that problem using the concepts and analytical tools of the subject, and it should summarize the author's point of view.
  • non-blocking Homework
Interim Assessment

  • Interim assessment (module 4)
    0.7 * Homework + 0.3 * Presentation of project
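The weighting above can be checked with a short calculation. This is only a sketch: the function name and the example scores are hypothetical, and the syllabus does not state a grading scale or rounding rule, so none is applied here.

```python
def interim_grade(homework, presentation):
    """Weighted interim grade: 0.7 * Homework + 0.3 * Presentation of project."""
    return 0.7 * homework + 0.3 * presentation


if __name__ == "__main__":
    # Hypothetical scores, chosen only to illustrate the weighting.
    print(round(interim_grade(homework=8, presentation=6), 2))
```

With these example scores the homework contributes 5.6 and the presentation 1.8, for a total of about 7.4.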


Recommended Core Bibliography

  • Bell, J. (2015). Machine Learning: Hands-On for Developers and Technical Professionals. Indianapolis, Ind: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=872454
  • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge, Mass: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=480968

Recommended Additional Bibliography

  • A Tutorial on Machine Learning and Data Science Tools with Python. (2017). Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.E5F82B62
  • Črt Gorup, Mitar Milutinovič, Matija Polajnar, Marko Toplak, & Lan Umek. (n.d.). Orange: Data Mining Toolbox in Python. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.59267479
  • Horton, N. J., & Kleinman, K. (2015). Using R and RStudio for Data Management, Statistical Analysis, and Graphics (2nd ed.). Boca Raton, FL: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=957543