Publications
Efficient representation of data aggregations is a fundamental problem in modern big data applications, where network topologies and deployed routing and transport mechanisms play a fundamental role in optimizing desired objectives such as cost, latency, and others. In traditional networking, applications use TCP and UDP transports as a primary interface for implemented applications that hide the underlying network topology from end systems. On the flip side, to exploit network infrastructure in a better way, applications restore characteristics of the underlying network. In this work, we demonstrate that both specified extreme cases can be inefficient to optimize given objectives. We study the design principles of routing and transport infrastructure and identify extra information that can be used to improve implementations of compute-aggregate tasks. We build a taxonomy of compute-aggregate services unifying aggregation design principles, propose algorithms for each class, analyze them theoretically, and support our results with an extensive experimental study.
Buffering architectures and policies for their efficient management are core ingredients of a network architecture. However, despite strong incentives to experiment with and deploy new policies, opportunities for changing anything beyond minor elements are limited. We introduce a new specification language, OpenQueue, that allows to express virtual buffering architectures and management policies representing a wide variety of economic models. OpenQueue allows users to specify entire buffering architectures and policies conveniently through several comparators and simple functions. We show examples of buffer management policies in OpenQueue and empirically demonstrate its impact on performance in various settings.
Cloud applications bring new challenges to the design of network elements, in particular the burstiness of traffic workloads. A shared memory switch is a good candidate architecture to exploit buffer capacity; in this work, we analyze the performance of this architecture. Our goal is to explore the impact of additional traffic characteristics such as varying processing requirements and packet values on objective functions. The outcome of this work is a better understanding of the relevant parameters for buffer management to achieve better performance in dynamic environments of data centers. We consider a model that captures more of the properties of the target architecture than previous work and consider several scheduling and buffer management algorithms that are specifically designed to optimize its performance. In particular, we provide analytic guarantees for the throughput performance of our algorithms that are independent from specific distributions of packet arrivals. We furthermore report on a comprehensive simulation study which validates our analytic results.
Cosmetic items do not provide functional advantages in games, but, nevertheless, they play an important role in the overall player experience. Possessing predominantly socially-constructed dimensions of value, cosmetic items are chosen, discussed, assessed, and valuated in an ongoing iterative collaborative process by communities of players. In our study, we explore the case of Dota 2 and apply Topic Modeling to community-discussions data gathered from Reddit.com. We describe social experiences related to the valuation of cosmetic items in interaction and collision of various logics, including artificial scarcity, decomposition of visual effects, and connectedness to the game lore. Our findings connect the collective experience of players in the game and on online community platforms, suggesting that non-utility-based social value construction becomes an important part of game experience.
Packet processing increasingly involves heterogeneous requirements. We consider the well-known model of a shared memory switch with bounded-size buffer and generalize it in two directions. First, we consider unit-sized packets labeled with an output port and a processing requirement (i.e., packets with heterogeneous processing), maximizing the number of transmitted packets. We analyze the performance of buffer management policies under various characteristics via competitive analysis that provides uniform guarantees across traffic patterns (Borodin and ElYaniv 1998). We propose the Longest-Work-Drop policy and show that it is at most 2-competitive and at least -competitive. Second, we consider another generalization, posed as an open problem in Goldwasser (2010), where each unit-sized packet is labeled with an output port and intrinsic value, and the goal is to maximize the total value of transmitted packets. We show first results in this direction and define a scheduling policy that, as we conjecture, may achieve constant competitive ratio. We also present a comprehensive simulation study that validates our results.
Modern network processors increasingly deal with packets that require heterogeneous processing. We consider a bounded size input queue buffer where each packet requires several rounds of processing before transmission. Usually the transmission order of packets is induced by processing order, but processing order can have significant impact on the performance of buffer management policies even if the transmission order is fixed. We introduce the class of Semi-FIFO policies that decouple processing order from transmission order, restricting the latter to First-In-First-Out (FIFO). We build a taxonomy of Semi-FIFO policies and provide worst case guarantees for different processing orders. We consider various special cases and properties of Semi-FIFO policies: greedy, work-conserving, lazy, and push-out policies, and show how these properties affect performance. We generalize our results to additional constraints related to copying cost and conduct a comprehensive simulation study that validates our results.
We consider the problem of managing a bounded size First-In-First-Out (FIFO) queue buffer, where each incoming unit-sized packet requires several rounds of processing before it can be transmitted out. Our objective is to maximize the total number of successfully transmitted packets. We consider both push-out (when a policy is permitted to drop already admitted packets) and non-push-out cases. We provide worst-case guarantees for the throughput performance of our algorithms, proving both lower and upper bounds on their competitive ratio against the optimal algorithm, and conduct a comprehensive simulation study that experimentally validates predicted theoretical behavior.
Helping behavior is a significant part of social learning process in online games. One type of such a behavior is answering questions in a chat. We provide a method to detect if the question asked in a chat was answered and by whom. Proposed method is based on topic modeling for chat messages and comparison of a detected topic of question with a topic of possible reply. We show its efficiency on chat messages from online games.
Background: The Russian human immunodeficiency virus (HIV) epidemic among people who inject drugs (PWID) originated in Kaliningrad, but research into risk behaviours among PWID has been lacking. The potential for heterosexual spread has not been analysed. Methods: A sample of PWID was accrued using two methods. A questionnaire was administered to assess HIV-related risk behaviours for parenteral and sexual transmission, sociodemographic factors, HIV knowledge and attitudes about sexual risks. Data were analysed focusing on the role of imprisonment, factors associated with awareness of being HIV infected and condom use. Results: More than a quarter of the sample reported having been diagnosed with HIV infection, with higher prevalence among women and those with a history of incarceration. More than half reported having been diagnosed with hepatitis C virus infection. Those reporting being HIV positive were less likely to distribute used syringes to other PWID and more likely to have used a condom the last time they had sex. A history of incarceration was associated with higher rates of receptive syringe sharing among those not having ever received an HIV-positive diagnosis and a lower likelihood of believing that condoms are needed when having sex with a casual partner. Conclusion: Although extensive HIV testing has alerted many PWID to their HIV-positive status, which is associated with less distributive syringe sharing and higher likelihood of condom use, substantial risk for parenteral and especially sexual HIV transmission remains. More active prevention programs will be required to control the heterosexual spread of HIV.
Packet processing increasingly involves heterogeneous requirements. We consider the well-known model of a shared memory switch with bounded-size buffer and generalize it in two directions. First, we consider unit-sized packets labeled with an output port and a processing requirement (i.e., packets with heterogeneous processing), maximizing the number of transmitted packets. We analyze the performance of buffer management policies under various characteristics via competitive analysis that provides uniform guarantees across traffic patterns (Borodin and El-Yaniv, 1998). We propose the Longest-Work-Drop policy and show that it is at most 2-competitive and at least sqrt 2}-competitive. Second, we consider another generalization, posed as an open problem in [10], where each unit-sized packet is labeled with an output port and intrinsic value, and the goal is to maximize the total value of transmitted packets. We show first results in this direction and define a scheduling policy that, as we conjecture, may achieve constant competitive ratio. We also present a comprehensive simulation study that validates our results.
Recent advances in single-cell genomics provide an alternative to largely gene-centric metagenomics studies, enabling whole-genome sequencing of uncultivated bacteria. However, single-cell assembly projects are challenging due to (i) the highly nonuniform read coverage and (ii) a greatly elevated number of chimeric reads and read pairs. While recently developed single-cell assemblers have addressed the former challenge, methods for assembling highly chimeric reads remain poorly explored. We present algorithms for identifying chimeric edges and resolving complex bulges in de Bruijn graphs, which significantly improve single-cell assemblies. We further describe applications of the single-cell assembler SPAdes to a new approach for capturing and sequencing "microbial dark matter" that forms small pools of randomly selected single cells (called a mini-metagenome) and further sequences all genomes from the mini-metagenome at once. On single-cell bacterial datasets, SPAdes improves on the recently developed E+V-SC and IDBA-UD assemblers specifically designed for single-cell sequencing. For standard (cultivated monostrain) datasets, SPAdes also improves on A5, ABySS, CLC, EULER-SR, Ray, SOAPdenovo, and Velvet. Thus, recently developed single-cell assemblers not only enable single-cell sequencing, but also improve on conventional assemblers on their own turf. SPAdes is available for free online download under a GPLv2 license
Modern network processors (NPs) increasingly deal with packets with heterogeneous processing requirements. In this work, we consider the fundamental problem of managing a bounded size buffer at the input queue of an NP. Incoming traffic consists of packets, each packet requiring several rounds of processing before it can be transmitted out of the queue. The objective is to maximize the total number of successfully transmitted packets. In such an environment, it is well known that Shortest-Remaining-Processing-Time (SRPT) first scheduling with push-out is optimal [1]. However, it is hard to implement both priority queueing (PQ) by remaining processing and the push-out mechanism simultaneously in an NP. We explore alternatives for this architecture, addressing the simplicity vs. performance system design tradeoffs. We design a simplified architecture and provide worst-case guarantees for its throughput performance in different settings. We also conduct a comprehensive simulation study that validates our results.
One of the key advances in genome assembly that has led to a significant improvement in contig lengths has been improved algorithms for utilization of paired reads (mate-pairs). While in most assemblers, mate-pair information is used in a post-processing step, the recently proposed Paired de Bruijn Graph (PDBG) approach incorporates the mate-pair information directly in the assembly graph structure. However, the PDBG approach faces difficulties when the variation in the insert sizes is high. To address this problem, we first transform mate-pairs into edge-pair histograms that allow one to better estimate the distance between edges in the assembly graph that represent regions linked by multiple mate-pairs. Further, we combine the ideas of mate-pair transformation and PDBGs to construct new data structures for genome assembly: pathsets and pathset graphs.
The paper describe key points in algebraic bayesian network knowledge pattern implementation on C++ programming language. Knowledge pattern implemented as class that handle and store estimation for knowledge pattern elements. It also provide a couple of methods for processing knowledge pattern such as consistency update and a posteriori inference.
The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V−SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.
The processing of probabilistically uncertain knowledge patterns in intellectual decision support systems falls into three kinds of probabilistic-logic inference, such as reconciliation, a priori and a posteriori inference. The paper presents formulae that allow for putting the process down in terms of matrix-vector language.