Topic modeling is a popular technique for clustering large collections of text documents. A variety of regularization types are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where the inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. We test our approach on four models: probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA). First, we show that the minimum of Renyi entropy coincides with the "true" number of topics, as determined in two labelled collections. At the same time, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the entropy minimum away from the optimal topic number, an effect that is not observed for the hyper-parameters of LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, which calls for further research.
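To make the criterion concrete, here is a minimal sketch of a Renyi-type entropy of a fitted word-topic matrix in the spirit of the statistical-physics formulation. The threshold 1/W, the deformation parameter q = 1/T, and the sign conventions are assumptions modeled on the authors' earlier work, not the exact implementation behind this paper.

```python
import numpy as np

def renyi_entropy(phi):
    """Renyi-type entropy of a topic model, sketched after the
    statistical-physics formulation (threshold 1/W, deformation
    parameter q = 1/T); constants are assumptions, not the paper's code.
    phi: (W, T) word-topic matrix, each column summing to 1."""
    W, T = phi.shape
    mask = phi > 1.0 / W               # "high-probability" word-topic pairs
    P_tilde = phi[mask].sum() / T      # cumulative probability -> energy term
    rho = mask.sum() / (W * T)         # density-of-states
    energy = -np.log(P_tilde)
    entropy = np.log(rho)              # Gibbs-Shannon analogue
    q = 1.0 / T                        # deformation parameter
    free_energy = energy - T * entropy
    return free_energy / (q - 1.0)

# Scanning T over a grid and locating the argmin of renyi_entropy(phi_T)
# is then the topic-number selection rule tested in the paper.
```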
Motivation
Imaging mass spectrometry (imaging MS) is a prominent technique for capturing distributions of molecules in tissue sections. Various computational methods for imaging MS rely on quantifying spatial correlations between ion images, referred to as co-localization. However, no comprehensive evaluation of co-localization measures has ever been performed; this leads to arbitrary choices and hinders method development.
Results
We present ColocML, a machine learning approach addressing this gap. With the help of 42 imaging MS experts from nine laboratories, we created a gold standard of 2210 pairs of ion images ranked by their co-localization. We evaluated existing co-localization measures and developed novel measures using term frequency–inverse document frequency and deep neural networks. The semi-supervised deep learning Pi model and the cosine score applied after median thresholding performed the best (Spearman 0.797 and 0.794 with expert rankings, respectively). We illustrate these measures by inferring co-localization properties of 10 273 molecules from 3685 public METASPACE datasets.
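For intuition, here is a minimal sketch of the best-performing classical measure, the cosine score after median thresholding: intensities below each image's median are zeroed before computing cosine similarity between the flattened ion images. Preprocessing steps from the full ColocML pipeline (e.g., hotspot removal) are omitted.

```python
import numpy as np

def coloc_cosine(img_a, img_b):
    """Cosine similarity of two ion images after median thresholding
    (a sketch of the measure; the full pipeline may add preprocessing
    steps not shown here)."""
    a = np.ravel(img_a).astype(float)      # astype copies, inputs untouched
    b = np.ravel(img_b).astype(float)
    a[a < np.median(a)] = 0.0              # median thresholding
    b[b < np.median(b)] = 0.0
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```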
Functional classes on a curve in a plane (a special case of a spatial curve) can be described in terms of the speed of approximation by functions that are harmonic in three-dimensional neighbourhoods of the curve. No constructive description of functional classes on rather general surfaces in R^3 and R^4 has been presented in the literature so far. The main result of the paper is Theorem 1.
In practice, the critical step in building machine learning models of big data (BD) is parameter tuning with a grid search, a procedure that is costly in terms of time and computing resources. In terms of size, BD are comparable to mesoscopic physical systems; hence, methods of statistical physics can be applied to BD. The paper shows that topic modeling demonstrates self-similar behavior under a varying number of clusters. Such behavior allows using a renormalization technique. The combination of a renormalization procedure with the Rényi entropy approach allows for a fast search for the optimal number of clusters. In this paper, the renormalization procedure is developed for the Latent Dirichlet Allocation (LDA) model with a variational Expectation-Maximization algorithm. The experiments were conducted on two document collections with a known number of clusters in two languages. The paper presents results for three versions of the renormalization procedure: (1) renormalization with random merging of clusters, (2) renormalization based on minimal values of Kullback–Leibler divergence, and (3) renormalization merging the clusters with minimal values of Rényi entropy. The paper shows that the renormalization procedure finds the optimal number of topics 26 times faster than grid search, without significant loss of quality.
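A hedged sketch of the simplest variant, renormalization with random merging: instead of refitting a model for every topic number, start from a model with many topics, repeatedly merge two topics (summing and renormalizing their word distributions), and track the entropy of the coarsened model. Variants (2) and (3) would replace the random pair choice with KL- or entropy-based selection; the entropy function is assumed to be a Renyi-type criterion such as the sketch above.

```python
import numpy as np

def renormalize_random(phi, min_topics, entropy_fn, rng=None):
    """Coarsen a (W, T) word-topic matrix by randomly merging topic
    pairs, recording entropy_fn at each step (sketch of variant (1);
    variants (2)-(3) swap in KL- or entropy-based pair selection)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    phi = phi.copy()
    trace = {phi.shape[1]: entropy_fn(phi)}
    while phi.shape[1] > min_topics:
        i, j = rng.choice(phi.shape[1], size=2, replace=False)
        merged = phi[:, i] + phi[:, j]           # merge two topics
        phi = np.delete(phi, [i, j], axis=1)
        phi = np.column_stack([phi, merged / merged.sum()])
        trace[phi.shape[1]] = entropy_fn(phi)
    return trace                                 # argmin gives the T estimate
```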
In this paper, we consider the following problem: what affects the Nash equilibrium amount of investment in knowledge when one complete graph joins another complete graph. The solution of this problem allows us to understand exactly how the agents will behave when deciding whether to enter the other network, what conditions and externalities affect this decision, and how the future equilibrium amount of investment in knowledge can be predicted.
Clustering large and heterogeneous data of user profiles from social media is problematic, as the problem of finding the optimal number of clusters becomes more critical than for clustering smaller and homogeneous data. We propose a new approach based on the deformed Rényi entropy for determining the optimal number of clusters in hierarchical clustering of user-profile data. Our results show that this approach allows us to estimate the Rényi entropy for each level of a hierarchical model and find the entropy minimum (information maximum). Our approach also shows that solutions with the lowest and the highest numbers of clusters correspond to entropy maxima (minima of information).
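As an illustration of evaluating an entropy at each level of a hierarchy, the sketch below cuts a dendrogram at k = 2, ..., max_k clusters and computes the order-q Rényi entropy of the cluster occupancy distribution. The authors' deformed entropy is built from model-specific quantities rather than occupancies alone, so treat this as an assumption-laden toy version.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def renyi_by_level(X, q=2.0, max_k=20):
    """Order-q Renyi entropy of cluster occupancies at each level of a
    hierarchical clustering (toy illustration, not the paper's exact
    entropy); the level minimizing it gives a cluster-number estimate."""
    Z = linkage(X, method="ward")
    out = {}
    for k in range(2, max_k + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        p = np.bincount(labels)[1:] / len(labels)   # occupancy distribution
        p = p[p > 0]
        out[k] = np.log(np.sum(p ** q)) / (1.0 - q)
    return out
```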
We discuss the effect of self-heating on the performance of injection microdisk lasers operating in the continuous-wave (CW) regime at room and elevated temperature. A model is developed that yields analytical expressions for the peak optical power limited by the thermal rollover effect, the corresponding injection current, and the excess temperature of the device. The model predicts that there exist a maximum temperature of microlaser operation in the CW regime and a minimum microdisk diameter at which CW lasing is possible. The model allows one to determine the dependence of the device characteristics on its diameter and on inherent parameters such as thermal resistance, electrical resistance, non-radiative recombination, and the characteristic temperature of the threshold current. It is found that the rapid growth of the threshold current density with decreasing diameter (which takes place even in the absence of the self-heating effect) is the main internal reason for the dependence of the temperature characteristics of the microdisk laser on its size. In the calculations, we used a set of parameters extracted from experiments with InGaAs quantum dot microdisk lasers. The simulation results (in particular, the light-current curve and the dependence of the minimum microdisk diameter on ambient temperature) agree well with the measured dependences.
An InAs/InGaAs quantum dot laser heterostructure epitaxially grown on a silicon substrate was used to fabricate injection microdisk lasers of different diameters (15–31 µm). The post-growth process included photolithography and deep dry etching; no surface protection/passivation was applied. The microlasers are capable of operating heatsink-free in a continuous-wave regime at room and elevated temperatures. A record-low threshold current density of 0.36 kA/cm2 was achieved in 31 µm diameter microdisks operating uncooled. In microlasers with a diameter of 15 µm, the minimum threshold current density was found to be 0.68 kA/cm2. The thermal resistance of microdisk lasers monolithically grown on silicon agrees well with that of microdisks on GaAs substrates. An ageing test performed on microdisk lasers on silicon for 1000 h at a constant current revealed that the output power dropped by only ~9%. A preliminary estimate of the lifetime of quantum-dot (QD) microlasers on silicon (defined by a twofold drop of the power) is 83,000 h. Quantum dot microdisk lasers made of a heterostructure grown on GaAs were also transferred onto a silicon wafer using indium bonding. These microlasers share a common electrical contact through the residual n+ GaAs substrate, whereas their individual addressing is achieved by mounting them p-contact down onto separate contact pads. These microdisks, hybridly integrated onto silicon, lase at room temperature in a continuous-wave mode. No effect of the non-native substrate on device characteristics was found.
We consider difference and differential Stackelberg game-theoretic models of opinion control in marketing networks with several followers. It is assumed that at the network-analysis stage the opinion leaders have already been identified and are the only objects of control. The leading player determines the marketing budgets of the followers by resource allocation. In the basic version of the models, both the leader and the followers maximize the summary opinion of the network agents. In the second version, the leader has a target value of the summary opinion. In all four models, we have found the Stackelberg equilibrium and the respective payoffs of the players analytically. It is shown that the hierarchical control system is ideally compatible in all cases.
In practice, building a machine learning model of big data requires tuning the model parameters. The process of parameter tuning involves an extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques that allow us to optimize this process. The paper shows that a function of the output of topic modeling demonstrates self-similar behavior under variation of the number of clusters. Such behavior allows using a renormalization technique. A combination of the renormalization procedure with the Renyi entropy approach allows for a fast search for the optimal number of topics. In this paper, the renormalization procedure is developed for probabilistic Latent Semantic Analysis (pLSA), the Latent Dirichlet Allocation model with a variational Expectation–Maximization algorithm (VLDA), and the Latent Dirichlet Allocation model with a granulated Gibbs sampling procedure (GLDA). The experiments were conducted on two test datasets with a known number of topics in two different languages and on one unlabeled test dataset with an unknown number of topics. The paper shows that the renormalization procedure finds an approximation of the optimal number of topics at least 30 times faster than grid search, without significant loss of quality.
Topic modeling is a widely used approach for clustering text documents; however, it possesses a set of parameters that must be set by the user, for example, the number of topics. In this paper, we propose a novel approach for fast approximation of the optimal topic number that corresponds well to human judgment. Our method combines renormalization theory and the Renyi entropy approach. The main advantage of this method is its computational speed, which is crucial when dealing with big data. We apply our method to the Latent Dirichlet Allocation model with a Gibbs sampling procedure and test our approach on two datasets in different languages. Numerical results and a comparison of computational speed demonstrate a significant gain in time with respect to standard grid search methods.
Performing exact computations can require significant resources. Approximate computing alleviates resource constraints by sacrificing the accuracy of results. In this work, we consider a generalization of the classical packet classification problem. Our major contribution is to introduce various representations for approximate packet classifiers with controlled accuracy, as well as optimization techniques that exploit this new level of flexibility to reduce classifier sizes. We validate our theoretical results with a comprehensive evaluation study.
We deal with multistage multicriteria games in extensive form and employ the so-called "A-subgame" concept to examine dynamical properties of some non-cooperative and cooperative solutions. It is proved that if we take into account only the active players in each A-subgame, the set of all strong Pareto equilibria is time consistent but does not satisfy dynamical compatibility.
We construct an optimal cooperative trajectory and a vector-valued characteristic function using the refined leximin algorithm. To ensure the sustainability of a cooperative agreement, we design the A-incremental imputation distribution procedure for the Shapley value, which provides a better incentive for cooperation than the classical incremental allocation procedure. This specific payment schedule corresponds to the A-subgame concept, satisfies the time-consistency and efficiency conditions, and implies a non-zero current payment to the active player immediately after her move.
We suggest a new method of coloring generalized Kneser graphs based on hypergraphs with high discrepancy and a small number of edges. The main result is a proper coloring of K(n, n/2−t, s) in (4 + o(1))(s + t)^2 colors, produced using Hadamard matrices. We also show that for colorings by independent sets of a natural type, this result is best possible up to a multiplicative constant. Our method extends to Kneser hypergraphs as well.
In our previous papers (2002, 2017), we derived conditions for the existence of a strong Nash equilibrium in multistage nonzero-sum games under additional constraints on the possible deviations of coalitions from their agreed-upon strategies. These constraints allowed only one-time simultaneous deviations of all the players in a coalition. However, it is clear that in real-world problems the deviations of different members of a coalition may occur at different times (at different stages of the game), which makes the punishment strategy approach proposed by the authors earlier inapplicable in the general case. The fundamental difficulty is that in the general case the players who must punish the deviating coalition know neither the members of this coalition nor the times when each player performs the deviation. In this paper, we propose a new punishment strategy, which does not require the full information about the deviating coalition but uses only the fact of deviation of at least one player of the coalition. Of course, this punishment strategy can be realized only under some additional constraints on stage games. Under these additional constraints, it was proved that the punishment of the deviating coalition can be effectively realized. As a result, the existence of a strong Nash equilibrium in the game was established.
This article is dedicated to an alternative method of solving the Chinese Remainder Theorem for polynomials. To construct the solution, a system of linear equations is set up (using the method of undetermined coefficients) and then solved. The complexity of the proposed method is also calculated.
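To make the idea concrete, here is a minimal sketch of the undetermined-coefficients approach, assuming pairwise-coprime moduli: write the unknown polynomial f with unknown coefficients, expand each congruence f ≡ r_i (mod m_i) into linear equations on those coefficients by reducing the monomials x^j modulo m_i, and solve the resulting square linear system. The paper's exact method and complexity analysis may differ; this sketch works over floating-point numbers rather than the ring the authors use.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def poly_crt(residues, moduli):
    """Solve f = r_i (mod m_i) by undetermined coefficients.
    Polynomials are coefficient lists in increasing degree order.
    A sketch over floats; an exact implementation would use Gaussian
    elimination over the coefficient field instead of np.linalg.solve."""
    degs = [len(m) - 1 for m in moduli]
    n = sum(degs)                        # deg f < sum of moduli degrees
    A = np.zeros((n, n))
    b = np.zeros(n)
    row = 0
    for r, m, d in zip(residues, moduli, degs):
        for j in range(n):
            # the remainder of x^j modulo m_i supplies column j
            xj = np.zeros(j + 1); xj[j] = 1.0
            _, rem = P.polydiv(xj, np.asarray(m, float))
            A[row:row + d, j] = np.pad(rem, (0, d - len(rem)))[:d]
        b[row:row + d] = np.pad(np.asarray(r, float), (0, d - len(r)))[:d]
        row += d
    return np.linalg.solve(A, b)         # coefficients of f
```

For instance, with moduli x and x − 1 and residues 1 and 2, poly_crt([[1], [2]], [[0, 1], [-1, 1]]) returns [1, 1], i.e. f(x) = 1 + x, which indeed satisfies f(0) = 1 and f(1) = 2.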
An efficient method for calculating the Reed-Solomon code syndrome is proposed. The method is based on the use of partial normalized cyclic convolutions in the partial inverse cyclotomic discrete Fourier transform. In terms of multiplicative complexity, the method outperforms the known algorithms.
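For reference, the object being computed is the syndrome vector S_j = r(α^j): the received polynomial evaluated at consecutive powers of a primitive element α. The sketch below computes it naively by Horner's rule over GF(2^4); the field and the primitive polynomial x^4 + x + 1 are illustrative choices, not taken from the paper, whose contribution is to reorganize exactly this computation via partial cyclotomic transforms to cut down multiplications.

```python
# Naive Reed-Solomon syndrome computation, S_j = r(alpha^j), by Horner's
# rule over GF(16) built from x^4 + x + 1 (an illustrative choice).

EXP = [0] * 30                   # antilog table, doubled to avoid mod
x = 1
for i in range(15):
    EXP[i] = EXP[i + 15] = x
    x <<= 1
    if x & 0x10:
        x ^= 0x13                # reduce modulo x^4 + x + 1
LOG = {EXP[i]: i for i in range(15)}

def gf_mul(a, b):
    return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

def syndromes(received, num_syndromes):
    """S_j = r(alpha^j) for j = 1..num_syndromes (Horner's rule);
    received holds coefficients with the highest degree first."""
    out = []
    for j in range(1, num_syndromes + 1):
        alpha_j = EXP[j % 15]
        s = 0
        for c in received:
            s = gf_mul(s, alpha_j) ^ c   # addition in GF(2^m) is XOR
        out.append(s)
    return out
```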
In the paper, a two-level infinitely repeated hierarchical game with one player (the center) C0 on the first level and subordinate players S1, ..., Sn on the second level is considered. At each stage of the game, player C0 selects a vector x = (x1, ..., xn) from a given set X, in which each component represents a vector of resources delivered by C0 to one of the subordinate players, i.e. (formula presented). At the second level, players Si, i = 1, 2, ..., n, choose the controls (formula presented), where Yi(xi) depends upon the choice of player C0. In this game, a set of different Nash equilibria, also based on threat and punishment strategies, is obtained. In one case, the center enforces a special behavior of the subordinate firms (a vector of manufactured goods), threatening to deprive them of resources at the next steps if the subordinate firms refuse to implement the prescribed behavior. In another case, the subordinate firms can force the center to use a certain resource allocation by threatening to stop production. Using different combinations of such behaviors at different stages of the game, we obtain a wide class of Nash equilibria in the game under consideration. The cooperative version of the game is also considered. The conditions are derived under which the cooperative behavior can be supported by a Nash equilibrium or a strong Nash equilibrium (a Nash equilibrium stable against deviations of coalitions).
In each node of a network, the economy is described by the simple two-period Romer model of endogenous growth with production and knowledge externalities. The sum of knowledge levels in the neighboring nodes creates an externality in the production of each node of the network. The game equilibrium in the network is investigated. The agents' solutions depending on the size of the externality are obtained. The uniqueness of the inner equilibrium is proved. The role of passive agents in network formation is studied; in particular, the possibilities of adding a passive agent to a regular network and of joining regular networks through nodes with passive agents. It is shown that the sum of knowledge levels in all the nodes decreases when a new link is added.
Topic modeling is a popular approach for clustering text documents. However, current tools have a number of unsolved problems, such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method to partially solve the problems of optimizing model parameters while simultaneously accounting for semantic stability. Our method is inspired by concepts from statistical physics and is based on Sharma–Mittal entropy. We test our approach on two models, probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling, and on two datasets in different languages. We compare our approach against a number of standard metrics, each of which is able to account for just one of the parameters of interest. We demonstrate that Sharma–Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters, simultaneously controlling for semantic stability, which none of the existing metrics can do. Furthermore, we show that concepts from statistical physics can contribute to theory construction for machine learning, a rapidly developing field that currently lacks a consistent theoretical ground.
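For reference, the two-parameter Sharma–Mittal family underlying the method has the standard form below. How the authors map topic-model quantities onto the two deformation parameters is specific to their paper, so the sketch shows only the generic entropy of a discrete distribution.

```python
import numpy as np

def sharma_mittal(p, q, r):
    """Sharma-Mittal entropy of a discrete distribution p (q, r != 1).
    It recovers Renyi entropy as r -> 1 and Tsallis entropy as r -> q;
    the mapping of topic-model quantities onto (q, r) follows the paper
    and is not reproduced here."""
    p = np.asarray(p, dtype=float)
    s = np.sum(p ** q)                                   # escort sum
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)
```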