Research

 

Overview


My graduate research has focused on relational data clustering techniques and applications.  Given a set of objects with descriptive attributes and a set of relational links between them, relational data clustering will return a partitioning of the objects into logical groups, or clusters.  Clustering can be used to support the activities of a human data analyst, or to compress, organize, simplify or make predictions about new, incomplete observations. 


My work focused on how to integrate multiple sources of information in a complex, multi-relational data set.  These sources include descriptive attributes for each object in a set and one or more relations over the set of objects. In prior research, two aspects of relational clustering were not adequately addressed: the tendency for related objects to have similar attribute values, and the possibility that not all relations are equally useful for clustering. 


Dissertation Research


My dissertation research focused on stochastic relational clustering models and relation engineering.  Stochastic relational clustering models use parametric probability distributions to describe attribute and relational information and then find a clustering that has the highest probability.  Relation engineering is closely related to stochastic relational clustering.  Relation engineering is the analog to feature engineering; it is the process of selecting and/or extracting the most useful relational information from a collection of different types of relations.  My work is the first to acknowledge that a relation over a set of objects is a special type of descriptive attribute that has inherent qualities that affect its utility for clustering.  This view of relational information leads to formally grounded methods for relation selection and extraction. 


My dissertation proposed the Probabilistic Relational Clustering Framework (PRCF), which is a generalization of several previously proposed clustering models and unifies the probabilistic assumptions made by each model.  The model consisted of two components: a cluster membership distribution (i.e. a prior distribution on cluster membership) and a data likelihood term (i.e. the probability of observing a set of relations and attributes, given a clustering).  I showed that small changes to either of these two components can result in significant changes in the resulting clustering.  The main contribution of PRCF is that it allows researchers to systematically target specific aspects of relational data under a common framework and then to easily compare their work to other algorithms within the framework. 


A survey of prior work that is subsumed by the PRCF revealed that little work has been done in the stochastic modeling of relations between objects.  Most approaches used the same parameterized Bernoulli distribution for modeling individual relational links.  By making the key observation that a Bernoulli distribution is in effect a trivial supervised classifier, I developed several unique clustering models that used supervised classifiers in place of the Bernoulli probability.  I showed that not only could unique clustering algorithms be developed by selecting different classifiers, but that the combination of clustering and supervised learning resulted in a significant boost in classifier performance for the task of link prediction. 


Relational clustering is a relatively young field of research that has followed many of the same paths as more mature areas of machine learning have followed.  One of the key assumptions that researchers have made in relational clustering is that all sets of relations available are highly informative and valuable to the clustering task.  Just as this is not true of object attributes for classification and clustering---as evidenced by the many feature engineering techniques that have been developed---neither is it true of relations. 


The final contribution of my dissertation research was the development of the first method for unsupervised relation selection.  My approach was to apply a fast method for assessing the quality of the structure in an arbitrary graph. The quality is measured by a score that I developed called block modularity, which, given a reference clustering, measures the difference between the distribution of edges in an input graph, compared to what would be expected for a random graph.  By averaging over a sequence of random, weak graph clusterings, I showed that the average block modularity score is indicative of the quality of the underlying structure in a relational set.  As an ongoing research project, I am also investigating techniques for relation extraction, or the derivation of new relation sets from the combination of two distinct relation sets.  This process may be particularly useful in situations where one relation set is much larger than another, and would dominate the clustering process (in the same way that an unnormalized feature with a large range would).


Supervised Undergraduate Research


In addition to my dissertation research, I have also had the opportunity to supervise several undergraduate researchers in independent projects.  I supervised one student on a multi-agent team formation project.  Together we developed a simulation in which simple networked agents had to form teams to complete abstract tasks.  The simulator was developed for a classroom environment in which students could implement their own agents that competed for resources in the simulation.  Another student is investigating the use of block modularity as part of a standalone clustering algorithm.  Thus far, we have shown that the algorithm is simple to implement, scales well to larger graphs, and in some situations performs at least as well as many algorithms in the class of PRCF's.  Finally, a third student project is in constrained relational clustering.  Constrained clustering is a relatively new area of clustering research in which a data analyst provides a set of external constraints that inform a clustering algorithm as to which clusterings are preferred.  Nobody has yet applied constraint techniques to the PRCF class of algorithms and we are currently investigating several techniques for doing so.  The team formation environment work was recently published in the Colloquium on AI Education at AAAI 2008, and I expect the students working on the latter two projects to submit their research to the International Joint Conference on Artificial Intelligence or the International Conference on Machine Learning this winter. 


Research Plan


I believe that data clustering can serve the needs of society through applications to the public and private data stores that have increased in size and complexity in recent years.  In my dissertation, I focused on relational data, because of the recent surge in the creation and collection of massive relational data sources, such as document archives, biological data, business transactions, and social interactions.  Given the recent popularity and importance of these relational data sources, I plan on extending my graduate research to perform the task of unsupervised link classification, a problem that has not been addressed in prior work and has many practical applications.  However, another important problem with clustering in general is the lack of accessibility and understanding of the clustering process by users who would benefit the most from applications of advanced clustering algorithms.  Therefore, I additionally propose an inter-disciplinary research project that first aims to quantify the clustering hypotheses of scientists and analysts.  Then, given a quantitative description of an analyst's needs, I will develop an automatic configuration and selection system that applies the most effective clustering algorithm to the analyst's data source.   



Advanced Relational Clustering Many researchers, including myself, have recently proposed approaches for clustering the objects in a relational data set.  While this is a useful task, it is frequently the case that the individual links in a relation set are just as interesting as the objects they connect.  For example, if one had access to a phone call graph, one might want to distinguish an important business call from a telemarketing call.  Or on the Internet, one might benefit from distinguishing legitimate advertising links from malicious ones.  The process of partitioning the links in a relation set into smaller, disjoint relation sets is referred to as unsupervised link classification. 


To solve this problem, I will extend the PRCF to incorporate a link partitioning variable, and will first develop an inference method that uses a fixed reference clustering for the objects.  The optimal partitioning of the relations in this case is the one that best reinforces the reference clustering.   I will then develop an inference technique that does not require a reference clustering to allow full unsupervised clustering of both objects and relations.  In addition to unsupervised link classification, there are several open problems in relational clustering that I am interested in, including co-association of heterogeneous objects, relation extraction, and applications of clustering in natural language processing and bioinformatics. 


Analyst Modeling and Intelligent Clustering Algorithm Portfolios  Little work has been done in quantifying and standardizing the needs of a human data analyst.  Analysts typically have an intuition about what they hope to discover by applying a clustering algorithm, but typically this intuition cannot be easily used to select the best algorithm for the task.  I intend to work closely with faculty in other departments to develop a systematic approach to encoding their preferences for particular clustering outcomes.  These preferences can be used in two ways.  The first is to develop a constrained clustering algorithm that can take into consideration the preferences, and the second is to use the preferences as an evaluation metric for comparing the performance of different clustering algorithms on the same data set. 

The preferences may also be used in conjunction with other information to compile an intelligent clustering algorithm portfolio. An algorithm portfolio is a collection of algorithms that solve the same task in different ways.  Machine learning techniques can be applied to develop an empirical model that predicts which algorithm will work best for a new data set, depending on the characteristics of the input and output requirements.  Such a tool would be invaluable to researchers who have large data stores, but lack the expertise to hand-pick the best clustering algorithm for the data.