Text mining: Advanced Level

2023/2024
Academic Year
ENG
Instruction in English
3
ECTS credits
Course type:
Compulsory course
When:
2 year, 1 module

Course Syllabus

Abstract

This course covers a wide range of machine learning algorithms for textual data analysis. The first part of the course deals with preprocessing procedures for text data, including lemmatizers for different languages and procedures for vectorizing text. It also covers classifiers for textual analysis and measures of classifier quality. The second part focuses on flat and hierarchical topic models and their quality measures: coherence, perplexity, log-likelihood, stability, and Renyi entropy. In addition, the course explores the use of word embeddings in textual analysis, in particular topic modeling. The third part considers neural networks for textual data analysis, built with the TensorFlow framework and its Keras API. All the models discussed are accompanied by Python scripts. At the end of the course, students present their data analysis work in the form of a presentation and scripts.
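
Two of the quality measures named above can be sketched in a few lines. This is only an illustration using the generic textbook definitions: perplexity computed from a held-out log-likelihood, and Renyi entropy of order alpha for a probability distribution. The function names and the sample numbers are hypothetical, and the course may use a formulation of Renyi entropy adapted specifically to topic models.

```python
import math


def perplexity(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood / number of tokens); lower is better."""
    return math.exp(-log_likelihood / n_tokens)


def renyi_entropy(probs, alpha):
    """Generic Renyi entropy of order alpha (alpha > 0, alpha != 1)."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


# Hypothetical distributions over four topics, for illustration only.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print(renyi_entropy(uniform, alpha=2.0))  # maximal for four outcomes
print(renyi_entropy(peaked, alpha=2.0))   # much lower: mass sits on one topic
print(perplexity(log_likelihood=-1200.0, n_tokens=400))
```

A uniform distribution gives the largest entropy for a fixed number of outcomes, while a sharply peaked one gives a value close to zero.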
Learning Objectives

  • Learn algorithms and their main advantages and limitations in terms of text data analysis
  • Obtain skills to work with machine learning software / code
  • Be able to work with text data.
Expected Learning Outcomes

  • Have skills to analyze textual data
  • Analyze data with machine learning tools
  • Do textual preprocessing (lemmatization and tokenization)
  • Present the resulting project in terms of machine learning
  • Visualize results of the analysis
Course Contents

  • Objectives of text analysis: preprocessing, lemmatization, vectorization.
  • Overview of classical classifiers such as KNN, Random Forest, and SVM.
  • Bayesian classification for sentiment analysis or topic definition.
  • Topic modeling (flat models), quality metrics (coherence, perplexity, log-likelihood, stability, Renyi entropy), review of some libraries.
  • Topic modeling (hierarchical models, discussion of problems).
  • Embeddings (gensim): what word embeddings are and how to work with them.
  • Topic models with embeddings (ETM, GLDAW).
  • Introduction to neural networks (TensorFlow, Keras): the basics of working with Keras, an overview of some neural networks.
  • Preprocessing of text data for neural networks.
  • Working with recurrent neural networks for textual analysis.
  • Working with LSTM neural networks for textual analysis.
  • Model with multiple outputs (heads).
  • Presentation of student work.
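
The first topics in the list (tokenization, bag-of-words counts, Bayesian classification for sentiment analysis) can be sketched together in plain Python. This is a minimal illustration, not the course's own scripts: the `tokenize` helper, the `NaiveBayes` class, and the toy reviews are all hypothetical, and lemmatization is deliberately left out.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into word tokens (no lemmatization in this sketch)."""
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # label -> bag-of-words counts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, made up for illustration.
train_texts = [
    "great film, loved it",
    "wonderful and great acting",
    "terrible plot, hated it",
    "boring and terrible film",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great acting"))
print(model.predict("terrible boring plot"))
```

Because `\w+` matches Unicode letters in Python 3, the same tokenizer also splits Cyrillic text, although a real pipeline for Russian would normally add a lemmatizer.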
Assessment Elements

  • non-blocking Homework
  • non-blocking Exam
    The exam is a competition (hackathon) to develop the best sentiment analysis model for Russian-language text. At the end of the first part of the course, students receive a Russian-language dataset with sentiment scores and must train their classification models on it. A week before the exam, they receive the second part of the dataset, which they must use to test the trained models. At the exam, students give a presentation on their models. The grade depends, first, on the quality of the presentation and, second, on the results obtained (how well the models are trained and how many models are presented).
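
Testing a trained model on the second part of the dataset amounts to comparing predicted labels with the true ones. The syllabus does not say which metric is used, so the sketch below assumes two common choices, accuracy and macro-averaged F1; the function names and the label lists are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of test examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical true and predicted labels for five held-out texts.
y_true = ["pos", "neg", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Macro-F1 weights every class equally, which matters when one sentiment class is much rarer than the others.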
Interim Assessment

  • 2023/2024 1st module
    Exam: Practical work * 0.4 + Homework: Homework * 0.6
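
The weighting above (0.4 for the exam, 0.6 for the homework) can be checked with a short calculation. The scores below are made up, the function name is hypothetical, and the syllabus does not state the grading scale or any rounding rule, so none is applied here.

```python
def interim_grade(exam, homework):
    """Weighted sum from the formula: 0.4 * exam + 0.6 * homework."""
    return 0.4 * exam + 0.6 * homework


# Hypothetical scores: 8 for the exam, 6 for the homework.
# 0.4 * 8 + 0.6 * 6 = 3.2 + 3.6, so the result is approximately 6.8.
print(round(interim_grade(exam=8, homework=6), 2))
```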
Bibliography

Recommended Core Bibliography

  • Raschka, S., & Mirjalili, V. (2019). Python Machine Learning: Machine Learning and Deep Learning with Python, Scikit-learn, and TensorFlow 2 (3rd ed.). Packt Publishing. ISBN 9781789958294. http://search.ebscohost.com/login.aspx?direct=true&db=nlebk&AN=2329991

Recommended Additional Bibliography

  • Kubat, M. (2017). An Introduction to Machine Learning (2nd ed.). Springer.

Authors

  • KOLTSOV Sergei NIKOLAEVICH
  • TSVETKOVA EKATERINA ANDREEVNA
  • Ильина Мария Ивановна