Hi. I’m Behrooz from The Ohio State University … and today I’ll present our paper on Multi-Objective Group Discovery on the Social Web. This is a joint work with with Sihem, Pierre-Francois and Denis from University of Grenoble and CNRS in France. Nowadays, huge amount of user data is ubiquitous. In particular, collaborative rating websites have become essential data resources to make decisions … about mundane tasks such as purchasing a book, renting a movie or going to a restaurant. Examples of such data is MovieLens, a movie-rating site, BookCrossing, a book rating site, and others. This availability of such datasets motivates the data management community to … design approaches that help analysts make better decisions on tasks such as … crowd data sourcing (which users to ask ratings from), advertisers in determining which items to recommend to which users, etc. MovieLens as an instance of user data is shown here. User data is the conjunction of user profiles (gender, age and occupation) and user activities (movie ratings, eating habits, voting, etc.) To analyze user data, a new opportunity, is to focus on labeled user groups … which enables new findings and addresses issues raised by peculiarities of user data … such as noise and sparsity. An example of a group of users in MovieLens is “Young students who have rated Bernardo Bertolucci movies” with 320 members . Here “young and student” are user demographics and “rating Bertolucci movies” is user activity. Given a dataset, e.g., ratings of the movie Last Tango in Paris (by Bernardo Bertolucci), our problem is to discover high quality user groups. Now the question is “What is quality?” Quality is formulated as the optimization of following dimensions Coverage ensures that most input records (as seen in the table) belong to at least one group in the output. It can be computed as … the percentage of covered ratings. On the other hand, diversity ensures that found groups are as different as possible from each other, e.g., … males and females or young and old, and unveils different aspects of ratings. Diversity can be captured by penalizing overlapped ratings. Finally, Homogeneity or heterogeneity of ratings also allow to obtain groups with a consensus on a score or not. Variance can be captured with different functions. One example is diameter, i.e., the difference between the highest and lowest rating score. Now the question is “why do we define these dimensions as quality?” Let’s see that in an example to see how these dimensions benefit an analysis process. We consider an example in the domain of social science. Elena is a social scientist and wants to validate a hypothesis that … “romantic movies are mostly watched by females.” She goes to the IMDB website and collects all ratings for romance genre movies. She wants to validate her hypothesis by exploring diverse user groups that cover most ratings for romance genre movies. Such a group-centric examination would provide the following 3 user groups: female reviewers from DC (District of Columbia) young female reviewers male teenager reviewers with average ratings of 4.6, 3.7 and 3.1 (out of 5) respectively. By observing those groups, Elena finds that the hypothesis holds only for a sub-population of female reviewers, middle-age or residents of DC. Also the results show another group of romance genre lovers, male teenagers, which contradicts the hypothesis. Elena is confident in her observation as the results has high coverage. She can also notice different aspects of her hypothesis as results are diversified. Elena then looks at the variance of ratings in those groups and finds that male teenager reviewers has a higher variance comparing to two other groups. This potentially shows that not all male teenagers like romantic movies. Homogeneous groups have more interest in Elena’s research. So she can either choose the second or third group … or ask the system to find other groups specifically for males or teenagers. We consider each dimension as an objective and define our problem context as a multi-objective optimization problem. Given set of ratings R, our problem is to identify all group-sets where each group-set G satisfies … maximization of coverage(G,R), maximization of diversity(G,R) and minimization of diameter(G,R). Also we want to return at most k groups to prevent overwhelming analyst with too many options. Finally, the last constraint ensures that we don’t end up with small groups that hurts coverage. Note that our problem focuses on group-sets in opposition to individual groups, which is a clear distinction from the literature. In the paper, we prove that our problem is NP-Complete by a reduction form EC3 problem. The main challenge in designing an algorithm for user group discovery, is the Multi-Objective nature of the problem. In an optimization space which has more than one objective, we don’t have one unique optimal solution anymore … but a set called Pareto plans. These solutions dominate others which lie in their dominance area. An exhaustive solution starts by calculating Pareto plans for single groups and iteratively for group-sets of size 2 up to size k. At each iteration, dominated plans are discarded. The algorithm exploits optimality principle and combines sub-plans to obtain new plans. An improvement to the exhaustive algorithm is to approximate prunings. Consider an approximiation ratio alpha to widen the dominance area. In this case more prunings occur. Generating fewer plans makes a Multi-Objective optimization algorithm run faster. This is because the execution time heavily depends on the number of generated plans. Hence a pruning strategy is at the core of our approximation algorithm. The algorithm receives a set of records R, an integer k and alpha as input and returns the alpha-approximated pareto set. In the special case of α=1, the algorithm operates exhaustively. If α>1, the algorithm prunes more and hence is faster. In this case, a new plan is only inserted into the buffer if no other plan approximately dominates it. This means that the approximation algorithm tends to insert fewer plans than the exhaustive algorithm. However, in the worst case, approximation algorithm operates brute-force. What if the analyst wants to have a taste of few results without waiting too much? For that, we propose a heuristic algorithm alongside our approximation algorithm. It starts from n random group-sets … and navigates a lattice of user groups to find group-sets with local maximized coverage … using a single-objective algorithm in the literature like Shotgun Hill Climbing … by exploiting the monotonicity property of coverage. These group-sets which have local maximized coverage will be placed in fixed-width intervals of diversity. In the second step, we remove all dominated group-sets in each diversity interval. Finally, in each interval, we report the group-set with the optimal value of diameter. Of course the heuritsitc prevents any exhaustive search and delivers a subset of interesting group-sets. Now the question is how much the quality is sacrificed comparing to the approximation. Our experiments on different sets of users in MovieLens dataset shows that 62% of solutions of the heuristic algorithm are dominated by solutions of the approximation algorithm. This is because the approximation algorithm generates the complete set of alpha-approximated Pareto plans, while the heuristic algorithm produces a subset. For instance, for the movie American Beauty, the approximation algorithm produces 16 times more solutions than the heuristic algorithm. Concerning the huge difference in the size of solution sets, … potentially a fairer comparison is to consider objective values to neutralize the influence of size. We observe that the heuristic algorithm can achieve a supremacy of around 50% which potentially shows that it can return good representatives. The take-away is both algorithms are useful for analysts for group discovery in different scenarios. The approximation algorithm can be used in an offline context … to produce an exhaustive set of user groups … with a precision defined by α for further analysis. For instance, a movie rating website (like IMDB) … can index user groups generated offline … and execute various user queries like … “what are interesting groups of female teenagers who have rated romantic movies”. On the other hand, in an online or streaming context, … the heuristic algorithm is beneficial, because it can immediately produce a representative subset of results. For instance, in the movie rating website … an analyst can quickly observe interesting user groups of comedy and romantic movies. Concerning the performance of algorithms, … we show results on MovieLens dataset … for users who have rated the movie “American Beauty”. In this figure, we illustrate the execution time in milliseconds and logarithmic scale for different versions of our heuristic and approximation algorithms. For approximation, 3 different values of alpha (2, 1.5 and 1.15) and for the heuristic, 3 different number of diversity intervals (5, 10 and 40) are considered. We observe in general, the heuristic algorithm is two orders of magnitude faster than the approximation algorithm. We also observe that decreasing alpha, increases execution time. This is because more precision leads generation of more results. The same phenomena is observed while increasing number of intervals. In conclusion, we discussed the problem of discovering high quality groups. We formalized the problem as a constraint multi-objective problem where objectives are diversity, coverage and variance. We proposed two algorithms to discover high quality user groups: approximation and heuristic. Our work can be improved in many ways. In particular, we plan to perform an extensive user study to be able to evaluate the quality of returned group-sets. We plan to consider domain-specific online polls as a ground-truth to compare with obtained results of our algorithms. Thanks!