Topic modeling is a popular technique for clustering large collections of text documents. A variety of regularization types are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Rényi entropy, this approach is inspired by concepts from statistical physics, where the inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models, namely Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA), we first show that the minimum of Rényi entropy coincides with the “true” number of topics, as determined in two labelled collections. Simultaneously, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach to topic number optimization, fails to detect such an optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the entropy minimum away from the optimal topic number, an effect not observed for the hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, which calls for further research.
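As a minimal illustration of the entropy-based criterion (a generic sketch, not the paper's exact construction, which builds a specific information system from topic-word distributions), the Rényi entropy of order q for a discrete distribution can be computed as follows; in the paper's setting one would evaluate such an entropy for topic solutions with different numbers of topics and look for its minimum:

```python
import numpy as np

def renyi_entropy(p, q):
    """Renyi entropy of order q (q != 1) for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes
    return np.log(np.sum(p ** q)) / (1.0 - q)

# For the uniform distribution over N outcomes the Renyi entropy
# equals log(N) for every order q.
```

The function name and interface here are illustrative; the paper's entropy is computed over quantities derived from the fitted topic model, not over a single raw distribution.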
Motivation
Imaging mass spectrometry (imaging MS) is a prominent technique for capturing distributions of molecules in tissue sections. Various computational methods for imaging MS rely on quantifying spatial correlations between ion images, referred to as co-localization. However, no comprehensive evaluation of co-localization measures has ever been performed; this leads to arbitrary choices and hinders method development.
Results
We present ColocML, a machine learning approach addressing this gap. With the help of 42 imaging MS experts from nine laboratories, we created a gold standard of 2210 pairs of ion images ranked by their co-localization. We evaluated existing co-localization measures and developed novel measures using term frequency–inverse document frequency and deep neural networks. The semi-supervised deep learning Pi model and the cosine score applied after median thresholding performed the best (Spearman correlations with expert rankings of 0.797 and 0.794, respectively). We illustrate these measures by inferring the co-localization properties of 10 273 molecules from 3685 public METASPACE datasets.
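A minimal sketch of the best-performing classical measure, the cosine score applied after median thresholding. The function name and the exact thresholding convention (zeroing pixels at or below each image's own median intensity) are our assumptions, not ColocML's implementation:

```python
import numpy as np

def coloc_cosine_median(img_a, img_b):
    """Cosine similarity of two ion images after median thresholding:
    pixels at or below each image's own median intensity are zeroed."""
    a = np.where(img_a > np.median(img_a), img_a, 0.0).ravel()
    b = np.where(img_b > np.median(img_b), img_b, 0.0).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```

An image compared with itself scores 1.0, and a pair with no overlapping above-median signal scores 0.0, matching the intuition that the measure ranks co-localized ion pairs highest.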
Functional classes on a curve in a plane (a special case of a spatial curve) can be described by the speed of approximation by functions that are harmonic in three-dimensional neighbourhoods of the curve. No constructive description of functional classes on rather general surfaces in R^3 and R^4 has been presented in the literature so far. The main result of the paper is Theorem 1.
In practice, the critical step in building machine learning models of big data (BD), namely the procedure of parameter tuning with a grid search, is costly in terms of time and computing resources. Due to their size, BD are comparable to mesoscopic physical systems. Hence, methods of statistical physics can be applied to BD. The paper shows that topic modeling demonstrates self-similar behavior under a varying number of clusters. Such behavior allows the use of a renormalization technique. The combination of a renormalization procedure with the Rényi entropy approach allows for a fast search for the optimal number of clusters. In this paper, the renormalization procedure is developed for the Latent Dirichlet Allocation (LDA) model with a variational Expectation-Maximization algorithm. The experiments were conducted on two document collections with a known number of clusters, in two languages. The paper presents results for three versions of the renormalization procedure: (1) renormalization with random merging of clusters, (2) renormalization based on minimal values of Kullback–Leibler divergence, and (3) renormalization merging the clusters with minimal values of Rényi entropy. The paper shows that the renormalization procedure finds the optimal number of topics 26 times faster than grid search, without significant loss of quality.
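The random-merge variant (1) can be sketched as follows. The function name and the merging rule (summing two topic-word rows and renormalizing) are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def renormalize_random(phi):
    """Random-merge renormalization sketch: phi is a (topics x words)
    matrix whose rows are word distributions; repeatedly merge two
    randomly chosen topics by summing and renormalizing their rows,
    yielding the shrunken matrix at each cluster count."""
    phi = phi.copy()
    while phi.shape[0] > 1:
        i, j = rng.choice(phi.shape[0], size=2, replace=False)
        merged = phi[i] + phi[j]
        merged /= merged.sum()
        phi = np.vstack([np.delete(phi, [i, j], axis=0), merged])
        yield phi  # evaluate Renyi entropy here for each cluster count
```

One pass over this generator visits every cluster count from the initial number of topics down to one, which is what makes the procedure so much cheaper than refitting the model for each candidate number of topics.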
In this paper, we consider the following problem: what affects the Nash equilibrium amount of investment in knowledge when one complete graph enters another one. The solution of this problem will allow us to understand exactly how game agents behave when deciding whether to enter the other network, what conditions and externalities affect this decision, and how the future equilibrium amount of investment in knowledge can be predicted.
Performing exact computations can require significant resources. Approximate computing alleviates resource constraints by sacrificing the accuracy of results. In this work, we consider a generalization of the classical packet classification problem. Our major contribution is to introduce various representations for approximate packet classifiers with controlled accuracy, together with optimization techniques that exploit this new level of flexibility to reduce classifier sizes. We validate our theoretical results with a comprehensive evaluation study.
The chapter considers a multistage multicriteria game in extensive form. We employ the so-called A-subgame concept to examine the dynamical properties of some non-cooperative and cooperative solutions.
We deal with multistage multicriteria games in extensive form and employ the so-called “A-subgame” concept to examine the dynamical properties of some non-cooperative and cooperative solutions. It is proved that, if we take into account only the active players at each A-subgame, the set of all strong Pareto equilibria is time consistent but does not satisfy dynamical compatibility.
We construct an optimal cooperative trajectory and a vector-valued characteristic function using the refined leximin algorithm. To ensure the sustainability of a cooperative agreement, we design the A-incremental imputation distribution procedure for the Shapley value, which provides a better incentive for cooperation than the classical incremental allocation procedure. This specific payment schedule corresponds to the A-subgame concept, satisfies the time consistency and efficiency conditions, and implies a non-zero current payment to the active player immediately after her move.
We suggest a new method for coloring generalized Kneser graphs based on hypergraphs with high discrepancy and a small number of edges. The main result is a proper coloring of K(n, n/2−t, s) in (4 + o(1))(s + t)^2 colors, which is produced by Hadamard matrices. We also show that, for colorings by independent sets of a natural type, this result is the best possible up to a multiplicative constant. Our method extends to Kneser hypergraphs as well.
In our previous papers (2002, 2017), we derived conditions for the existence of a strong Nash equilibrium in multistage nonzero-sum games under additional constraints on the possible deviations of coalitions from their agreed-upon strategies. These constraints allowed only one-time simultaneous deviations of all the players in a coalition. However, it is clear that in real-world problems the deviations of different members of a coalition may occur at different times (at different stages of the game), which makes the punishment strategy approach proposed by the authors earlier inapplicable in the general case. The fundamental difficulty is that in the general case the players who must punish the deviating coalition know neither the members of this coalition nor the times when each player performs the deviation. In this paper, we propose a new punishment strategy, which does not require the full information about the deviating coalition but uses only the fact of deviation of at least one player of the coalition. Of course, this punishment strategy can be realized only under some additional constraints on stage games. Under these additional constraints, it was proved that the punishment of the deviating coalition can be effectively realized. As a result, the existence of a strong Nash equilibrium in the game was established.
This article is dedicated to an alternative method for solving the Chinese remainder problem for polynomials. To construct the solution, a system of linear equations is set up (using the method of undetermined coefficients) and then solved. The complexity of the proposed method is also calculated.
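A sketch of the undetermined-coefficients construction, using floating-point arithmetic over the reals for illustration (the function names and coefficient conventions below are our assumptions, and the article's method and complexity analysis may differ in detail): each basis monomial of the unknown polynomial f is reduced modulo every modulus, which makes the congruences f ≡ r_i (mod m_i) a linear system in the coefficients of f.

```python
import numpy as np

def poly_mod(p, m):
    """Remainder of polynomial p modulo m (coefficients listed from the
    highest degree down), zero-padded to length deg(m)."""
    _, r = np.polydiv(p, m)
    r = np.atleast_1d(r)
    out = np.zeros(len(m) - 1)
    out[len(out) - len(r):] = r
    return out

def crt_poly(residues, moduli):
    """Find f with f = r_i (mod m_i) by undetermined coefficients:
    reduce each basis monomial modulo every m_i and solve the resulting
    linear system for the coefficients of f (deg f < sum of deg m_i)."""
    n = sum(len(m) - 1 for m in moduli)
    rows, rhs = [], []
    for r, m in zip(residues, moduli):
        d = len(m) - 1
        # column k holds the monomial x^(n-1-k) reduced modulo m
        block = np.column_stack([poly_mod(np.eye(n)[k], m) for k in range(n)])
        rows.append(block)
        pad = np.zeros(d)
        rr = np.atleast_1d(np.asarray(r, dtype=float))
        pad[len(pad) - len(rr):] = rr
        rhs.append(pad)
    return np.linalg.solve(np.vstack(rows), np.concatenate(rhs))
```

For example, with moduli x and x − 1 and residues 2 and 5, the solver returns the coefficients [3, 2], i.e. f(x) = 3x + 2, which indeed satisfies f(0) = 2 and f(1) = 5.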
An efficient method for calculating the Reed-Solomon code syndrome is proposed. The method is based on the use of partial normalized cyclic convolutions in the partial inverse cyclotomic discrete Fourier transform. In terms of multiplicative complexity, the method outperforms the known algorithms.
In the paper, a two-level infinitely repeated hierarchical game is considered, with one player (the center) C0 on the first level and subordinate players S1, ..., Sn on the second. At each stage of the game, player C0 selects a vector x = (x1, ..., xn) from a given set X, in which each component represents a vector of resources delivered by C0 to one of the subordinate players, i.e. (formula presented). At the second level, the players Si, i = 1, 2, ..., n, choose the controls (formula presented), where Yi(xi) depends upon the choice of player C0. In this game, a set of different Nash equilibria, also based on threat and punishment strategies, is obtained. In one case, the center enforces a specific behavior of the subordinate firms (a vector of manufactured goods), threatening to deprive them of resources at the next steps if the subordinate firms refuse to implement the prescribed behavior. In another case, the subordinate firms can force the center to use a certain resource allocation by threatening to stop production. Using different combinations of such behaviors at different stages of the game, we obtain a wide class of Nash equilibria in the game under consideration. The cooperative version of the game is also considered. Conditions are derived under which the cooperative behavior can be supported by a Nash equilibrium or a strong Nash equilibrium (a Nash equilibrium stable against deviations of coalitions).
In each node of a network, the economy is described by the simple two-period Romer model of endogenous growth with production and knowledge externalities. The sum of knowledge levels in the neighboring nodes creates an externality in the production of each node of the network. The game equilibrium in the network is investigated. The agents' solutions depending on the size of the externality are obtained. The uniqueness of the inner equilibrium is proved. The role of passive agents in network formation is studied; in particular, we consider the possibilities of adding a passive agent to a regular network and of joining regular networks through nodes with passive agents. It is shown that the sum of knowledge levels in all the nodes decreases when a new link is added.
Topic modeling is a popular approach for clustering text documents. However, current tools have a number of unsolved problems, such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method to partially solve the problems of optimizing model parameters while simultaneously accounting for semantic stability. Our method is inspired by concepts from statistical physics and is based on Sharma–Mittal entropy. We test our approach on two models, probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling, and on two datasets in different languages. We compare our approach against a number of standard metrics, each of which can account for just one of the parameters of our interest. We demonstrate that Sharma–Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters, simultaneously controlling for semantic stability, which none of the existing metrics can do. Furthermore, we show that concepts from statistical physics can contribute to theory construction for machine learning, a rapidly developing field that currently lacks a consistent theoretical ground.
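For reference, the Sharma–Mittal entropy of order q and degree r has a standard two-parameter form that recovers Rényi entropy as r → 1 and Tsallis entropy at r = q; a minimal sketch for a single discrete distribution (the paper applies such an entropy to fitted topic models, not to a raw distribution):

```python
import numpy as np

def sharma_mittal_entropy(p, q, r):
    """Sharma-Mittal entropy of order q and degree r (q, r != 1) for a
    discrete distribution p; tends to Renyi entropy as r -> 1 and
    reduces to Tsallis entropy at r = q."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes
    return (np.sum(p ** q) ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)
```

For the uniform distribution over N outcomes this evaluates to (N^(1−r) − 1)/(1 − r), so at r = q it matches the familiar Tsallis value (1 − Σ p_i^q)/(q − 1).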
In this paper, we apply the multifractal formalism to the analysis of the statistical behaviour of topic models under variation of the number of topics. Fractal analysis of topic models allows us to show that self-similar fractal clusters exist in large textual collections. We provide numerical results for three topic models (pLSA, ARTM, and LDA with Gibbs sampling) on two datasets, an English-language one and a Russian-language one. We demonstrate that the formation of clusters occurs precisely in the transition regions. Linear regions do not lead to changes in the fractals; therefore, it is sufficient to find transition regions for the study of textual collections. Accordingly, the problem of analysing the evolution of topic models can be reduced to the problem of searching for transition regions in topic models.
The logistic family of distributions is one of the important families in probability theory and mathematical statistics. However, goodness-of-fit tests for the composite hypothesis of belonging to the logistic family with an unknown location parameter, against general alternatives, have not been sufficiently explored. We propose two new goodness-of-fit tests, an integral one and a Kolmogorov-type one, based on the recent characterization of the logistic family by Hua and Lin. We discuss the asymptotic properties of the new tests and calculate their Bahadur efficiency for common alternatives.
This paper is devoted to a new class of differential games with continuous updating. It is assumed that, at each time instant, players have or use information about the game defined on a closed time interval. However, as time evolves, the information about the game updates: there is a continuous shift of the time interval that determines the information available to the players. The information about the game consists of the motion equations and the payoff functions of the players. For this class of games, the direct application of classical approaches to the determination of optimality principles, such as the Nash equilibrium, is not possible. The subject of the current paper is the construction of a solution concept similar to the Nash equilibrium for this class of differential games, together with the corresponding optimality conditions, in particular, modernized Hamilton-Jacobi-Bellman equations.
We describe a novel game-theoretic formulation of the optimal mobile agents' placement problem, which arises in the context of Mobile Ad-hoc Networks (MANETs). This problem is modelled as a sequential multistage game. The definitions of both the Nash equilibrium and the cooperative solution are given. A modification is proposed to ensure the existence of a Nash equilibrium. A modelling environment for the analysis of different strategies of the players was developed in MATLAB. The programme generates various game situations and determines each player's move by solving the respective optimisation problems. Using the developed environment, two specific game scenarios were considered in detail. The proposed algorithm was implemented and tested using Network Simulator 3 (NS-3). The results show that the proposed algorithm increases network performance by using game theory principles and techniques.