# Research Projects

Manan Sanghi

 Computational Biology/BioInformatics Flexible Word Design DNA Word Design NMR Data Interpretation for Protein Structure Determination Protein Pathway Discovery from PPI graphs Signal Discovery in Plasmodium Protein Sequences E-commerce Strategy Proof Pricing of Digital Goods Graph Theory Approximation Algorithm for Bottleneck TSP Graph Labeling Tree Interval Graph Recognition Network Security Automatic Polymorphic Worm Signature Generation Qualitative Reasoning/Artificial Intelligence Landmark Discovery Autonomic Computing Adaptation Algorithms for Virtual Execution Environments

Exhaustive (non-overlapping) List of Projects

(January 25, 2005)

### Automatic Polymorphic Worm Signature Generation

Worms are self-propagating computer programs that threaten to compromise all vulnerable machines over the internet in a matter of few hours or even minutes. Due to this fast propagation, manual generation of signatures is unviable for computer worms. To further complicate the problem, worms could be polymorphic. That is they could change their byte sequence at every successful infection to evade detection using signatures.

In collaboration with the network security group, we propose Hamsa, a network-based automated signature generation system for polymorphic worms which is fast, noise tolerant and provably attack resilient.

Hamsa is based on the assumption that a worm must exploit one or more server specific vulnerabilities which constrains the worm author to include some invariant tokens in the worm packets that are crucial for exploiting the vulnerability. We formally capture this idea by proposing an adversary model to analyze the invariant content of polymorphic worms which allows us to make analytical attack resilience guarantees for the signature generation algorithm.

Keywords: network security, model-based analysis, signature generation, polymorphic worms, pattern based

Future Directions
Hamsa is currently a pure pattern based signature generator. It is widely believed that polymorphic worms of the future may be immune to such pattern-based approaches and hence the signatures should exploit some semantic information. We are currently exploring different approaches to generate a hybrid of semantic and pattern based signatures.

Collaborators
Zhichun Li, Brian Chavez, Yan Chen and Ming-Yang Kao

### Approximation Algorithm for Bottleneck TSP

Consider a truck running along a road. It picks up a load $L_i$ at point $\beta_i$ and delivers it at $\alpha_i$, carrying at most one load at a time. The speed on the various parts of the road in one direction is given by $f(x)$ and that in the other direction is given by $g(x)$. Minimizing the total time spent to deliver loads $L_1,\ldots,L_n$ is equivalent to solving the Traveling Salesman Problem (TSP) where the cities correspond to the loads $L_i$ with coordinates $(\alpha_i, \beta_i)$ and the distance from $L_i$ to $L_j$ is given by $\int^{\beta_j}_{\alpha_i} f(x)dx$ if $\beta_j \geq \alpha_i$ and by $\int^{\alpha_i}_{\beta_j} g(x)dx$ if $\beta_j < \alpha_i$.

Our motivation for this problem was to interpret nuclear magnetic resonance (NMR) spectroscopy data for protein structure determination. The problem was first proposed by Gilmore and Gomory in 1964 and is one of the most celebrated special cases of the TSP. Besides being one of the first non-trivial polynomially solvable cases, it is also the first with significant real-world applications. However, the bottleneck version of the problem (BTSP) was left open. Recently, Vairaktarakis~\cite{Vairaktarakis:2003:OGG} showed that BTSP with this distance metric is NP-complete.

We provide an approximation algorithm for this BTSP by exploiting the underlying geometry in a novel fashion in \cite{BTSP}. Our techniques also allow for an alternate analysis of Gilmore and Gomory's polynomial time algorithm for the TSP. We achieve an approximation ratio of $(2+\gamma)$ where $\gamma \geq \frac{f(x)}{g(x)} \geq \frac{1}{\gamma} \; \forall x$. Note that when $f(x)=g(x)$, the approximation ratio is 3.

Keywords: approximation algorithm, traveling salesman problem, graph theory, protein structure determination

Future Directions
For the case when $f(x)=g(x)$, I believe a tighter analysis could achieve an approximation ratio of $2.25+\epsilon$ for any $\epsilon>0$. A similar analysis could then also improve the result for the general case.

Collaborators
Ming-Yang Kao

(references)

### Flexible Word Design and Graph Labeling

Given an input graph G, we consider the problem of labeling its vertices with equal length binary strings of minimal length such that the Hamming distance is small between words of adjacent nodes and large between words of non-adjacent nodes. The problem is motivated by growing applications in DNA computing and DNA self-assembly.

There is growing need for the efficient design of DNA codes such that the hybridization between any two words in the code is determined by a graph specifying which strands should bond and which should not. For general graphs we have designed algorithms that bound the word length with respect to either the maximum degree of any vertex or the number of edges in either the input graph or its complement. We further provide multiple types of recursive, deterministic algorithms for trees and forests, and an improvement for forests that makes use of randomization. We consider generalizations of this problem to weighted graphs and graphs with optional edges and also explore the extension from simple adjacency queries to more general distance queries. We show how to obtain distance labelings for rings and paths by applying properties of hypercube traversal.

Keywords: DNA word design, bioinformatics, graph theory

Future Directions
For special classes of graphs it may be possible to achieve a much shorter length of labeling than that for general graphs. We achieve labelings of logarithmic size for trees and suspect that similar results may be possible for other classes also. Furthermore, there is a gap between the information theoretic lower bound on the length of labelings and the lengths achieved for the different cases we have considered.

Collaborators
Robert Schweller and Ming-Yang Kao

(references)

### Algorithms, Biology and Music

Typically search for music is done using meta-information like the name of the song, the artist, the album and so on. Query by Humming systems are content based music information retrieval systems that allows the user to find a song by humming part of the tune. This can be achieved by representing songs in the database and the hummed query as strings over an alphabet of notes. Matching the query to a song then is very similar to the string alignment problem biologists face when searching for homologues of a newly discovered gene or protein sequence in an existing database.

However, there are a few differences. While in biology both the query and the target are single sequences, in the case of music the songs in the database could be polyphonic with different tracks playing in parallel which are best represented as a set of multiple sequences rather than a single sequence. The hummed query could jump from one track to another and hence need to be aligned with all the tracks simultaneously. Multiple sequence alignment is known to be a very hard problem in computational biology. However, we soon realized that the multiple sequences corresponding to a musical
score are in a sense already aligned and therefore allow for a polynomial time dynamic programming solution.

Keywords: musical information retrieval, string alignment

Future Directions
Interestingly, the algorithms we developed for musical information retrieval may inspire novel solutions to applications back in computational biology! Given a set of aligned sequences belonging to a particular family of DNA or protein sequences and a newly discovered sequence, it is often of interest to determine whether the new sequence belongs to that family. Typically this is done by comparing the new sequence with the
consensus of the aligned sequences. However, it is conceivable that in some cases a member of the same family could `jump' from one sequence to another in the family (like a hummed query jumps from one track to another) instead of matching some common consensus. Obviously there are important differences but this work inspires
exploring a computational biology problem in a new light.

Collaborators
Bryan Pardo

(references)

### Strategy Proof Pricing of Digital Goods

Having written and compiled an e-book, the author needs to decide how much to charge for it. If the seller knows how much his different buyers value his goods, choosing a price that maximizes revenue is straightforward. The goal of this project is to design revenue maximizing pricing mechanisms for information goods when the distributions of valuations is not known to the seller. The problem is of increasing importance as the Internet provides the infrastructure for a major marketplace for electronic information.

We allow the buyers to be strategic, i.e. they may hide their true valuations to get a better deal. Strategy-proof pricing mechanisms proposed till date usually make the assumption that a buyer gets only one shot at purchasing the product, an assumption that doesn’t hold true in many practical situations. We are exploring pricing mechanisms that do not restrict the buyer in this manner.

Keywords: e-commerce, game theory, pricing mechanisms

Future Directions

Collaborators
Ming-Yang Kao

(references)

### Cubic Time Rooted Tree Interval Graph Recognition

PPI motivated problem for

Keywords: bioinformatics, intersection graph theory, PPI netowrks

Future Directions
Improve running time

Collaborators
Patrick Ming-Yang Kao

(references)

### DNA Word Design

Keywords:

Future Directions

Collaborators

(references)

### Adaptation Algorithms for Virtual Execution Environments

Keywords:

Future Directions

Collaborators

(references)

### Landmark Discovery

Keywords:

Future Directions

Collaborators

(references)

### Signal Discovery in Plasmodium vectors

Keywords:

Future Directions

Collaborators

(references)

### Keyword based search in structured P2P networks

Keywords:

Future Directions

Collaborators