Computational Biology/BioInformatics
Flexible Word Design DNA Word Design NMR Data Interpretation for Protein Structure Determination Protein Pathway Discovery from PPI graphs Signal Discovery in Plasmodium Protein Sequences Ecommerce
Search related

Graph Theory
Approximation Algorithm for Bottleneck TSP Graph Labeling Tree Interval Graph Recognition Network Security
Qualitative Reasoning/Artificial Intelligence
Autonomic Computing

Worms are selfpropagating computer programs that threaten to compromise
all vulnerable machines over the internet in a matter of few hours or even
minutes. Due to this fast propagation, manual generation of signatures
is unviable for computer worms. To further complicate the problem, worms
could be polymorphic. That is they could change their byte sequence at
every successful infection to evade detection using signatures.
In collaboration with the network security group, we propose Hamsa, a networkbased automated signature generation system for polymorphic worms which is fast, noise tolerant and provably attack resilient.
Hamsa is based on the assumption that a worm must exploit one or more server specific vulnerabilities which constrains the worm author to include some invariant tokens in the worm packets that are crucial for exploiting the vulnerability. We formally capture this idea by proposing an adversary model to analyze the invariant content of polymorphic worms which allows us to make analytical attack resilience guarantees for the signature generation algorithm.
Keywords: network security, modelbased analysis, signature generation, polymorphic worms, pattern based
Future Directions
Hamsa is currently a pure pattern based signature generator. It is
widely believed that polymorphic worms of the future may be immune to such
patternbased approaches and hence the signatures should exploit some semantic
information. We are currently exploring different approaches to generate
a hybrid of semantic and pattern based signatures.
Collaborators
Zhichun Li, Brian Chavez, Yan Chen and MingYang Kao
(link to paper)
Consider a truck running along a road. It picks up a load $L_i$
at point $\beta_i$ and delivers it at $\alpha_i$, carrying at most one
load at a time. The speed on the various parts of the road in one direction
is given by $f(x)$ and that in the other direction is given by $g(x)$.
Minimizing the total time spent to deliver loads $L_1,\ldots,L_n$ is equivalent
to solving the Traveling Salesman Problem (TSP) where the cities correspond
to the loads $L_i$ with coordinates $(\alpha_i, \beta_i)$ and the distance
from $L_i$ to $L_j$ is given by $\int^{\beta_j}_{\alpha_i} f(x)dx$ if $\beta_j
\geq \alpha_i$ and by $\int^{\alpha_i}_{\beta_j} g(x)dx$ if $\beta_j <
\alpha_i$.
Our motivation for this problem was to interpret nuclear magnetic resonance (NMR) spectroscopy data for protein structure determination. The problem was first proposed by Gilmore and Gomory in 1964 and is one of the most celebrated special cases of the TSP. Besides being one of the first nontrivial polynomially solvable cases, it is also the first with significant realworld applications. However, the bottleneck version of the problem (BTSP) was left open. Recently, Vairaktarakis~\cite{Vairaktarakis:2003:OGG} showed that BTSP with this distance metric is NPcomplete.
We provide an approximation algorithm for this BTSP by exploiting the underlying geometry in a novel fashion in \cite{BTSP}. Our techniques also allow for an alternate analysis of Gilmore and Gomory's polynomial time algorithm for the TSP. We achieve an approximation ratio of $(2+\gamma)$ where $\gamma \geq \frac{f(x)}{g(x)} \geq \frac{1}{\gamma} \; \forall x$. Note that when $f(x)=g(x)$, the approximation ratio is 3.
Keywords: approximation algorithm, traveling salesman problem, graph theory, protein structure determination
Future Directions
For the case when $f(x)=g(x)$, I believe a tighter analysis could achieve
an approximation ratio of $2.25+\epsilon$ for any $\epsilon>0$. A similar
analysis could then also improve the result for the general case.
Collaborators
MingYang Kao
(link to paper)
(references)
Given an input graph G, we consider the problem of labeling its
vertices with equal length binary strings of minimal length such that the
Hamming distance is small between words of adjacent nodes and large between
words of nonadjacent nodes. The problem is motivated by growing applications
in DNA computing and DNA selfassembly.
There is growing need for the efficient design of DNA codes such that the hybridization between any two words in the code is determined by a graph specifying which strands should bond and which should not. For general graphs we have designed algorithms that bound the word length with respect to either the maximum degree of any vertex or the number of edges in either the input graph or its complement. We further provide multiple types of recursive, deterministic algorithms for trees and forests, and an improvement for forests that makes use of randomization. We consider generalizations of this problem to weighted graphs and graphs with optional edges and also explore the extension from simple adjacency queries to more general distance queries. We show how to obtain distance labelings for rings and paths by applying properties of hypercube traversal.
Keywords: DNA word design, bioinformatics, graph theory
Future Directions
For special classes of graphs it may be possible to achieve a much
shorter length of labeling than that for general graphs. We achieve labelings
of logarithmic size for trees and suspect that similar results may be possible
for other classes also. Furthermore, there is a gap between the information
theoretic lower bound on the length of labelings and the lengths achieved
for the different cases we have considered.
Collaborators
Robert Schweller and MingYang Kao
(link to paper)
(references)
Typically search for music is done using metainformation like the
name of the song, the artist, the album and so on. Query by Humming systems
are content based music information retrieval systems that allows the user
to find a song by humming part of the tune. This can be achieved by representing
songs in the database and the hummed query as strings over an alphabet
of notes. Matching the query to a song then is very similar to the string
alignment problem biologists face when searching for homologues of a newly
discovered gene or protein sequence in an existing database.
However, there are a few differences. While in biology both the query
and the target are single sequences, in the case of music the songs in
the database could be polyphonic with different tracks playing in parallel
which are best represented as a set of multiple sequences rather than a
single sequence. The hummed query could jump from one track to another
and hence need to be aligned with all the tracks simultaneously. Multiple
sequence alignment is known to be a very hard problem in computational
biology. However, we soon realized that the multiple sequences corresponding
to a musical
score are in a sense already aligned and therefore allow for a polynomial
time dynamic programming solution.
Keywords: musical information retrieval, string alignment
Future Directions
Interestingly, the algorithms we developed for musical information
retrieval may inspire novel solutions to applications back in computational
biology! Given a set of aligned sequences belonging to a particular family
of DNA or protein sequences and a newly discovered sequence, it is often
of interest to determine whether the new sequence belongs to that family.
Typically this is done by comparing the new sequence with the
consensus of the aligned sequences. However, it is conceivable that
in some cases a member of the same family could `jump' from one sequence
to another in the family (like a hummed query jumps from one track to another)
instead of matching some common consensus. Obviously there are important
differences but this work inspires
exploring a computational biology problem in a new light.
Collaborators
Bryan Pardo
(link to paper)
(references)
Having written and compiled an ebook, the author needs to decide
how much to charge for it. If the seller knows how much his different buyers
value his goods, choosing a price that maximizes revenue is straightforward.
The goal of this project is to design revenue maximizing pricing mechanisms
for information goods when the distributions of valuations is not known
to the seller. The problem is of increasing importance as the Internet
provides the infrastructure for a major marketplace for electronic information.
We allow the buyers to be strategic, i.e. they may hide their true valuations to get a better deal. Strategyproof pricing mechanisms proposed till date usually make the assumption that a buyer gets only one shot at purchasing the product, an assumption that doesn’t hold true in many practical situations. We are exploring pricing mechanisms that do not restrict the buyer in this manner.
Keywords: ecommerce, game theory, pricing mechanisms
Future Directions
Collaborators
MingYang Kao
(link to paper)
(references)
PPI motivated problem for
Keywords: bioinformatics, intersection graph theory, PPI netowrks
Future Directions
Improve running time
Collaborators
Patrick MingYang Kao
(link to paper)
(references)
Keywords:
Future Directions
Collaborators
(link to paper)
(references)
Keywords:
Future Directions
Collaborators
(link to paper)
(references)
Keywords:
Future Directions
Collaborators
(link to paper)
(references)
Keywords:
Future Directions
Collaborators
(link to paper)
(references)
Keywords:
Future Directions
Collaborators
(link to paper)
(references)