
Workshops



7th International Workshop on Real-Time Business Intelligence

In today's competitive and highly dynamic environment, analyzing data to understand how the business is performing and to predict outcomes and trends has become critical. The traditional approach to reporting is no longer adequate. Instead, users now demand easy-to-use, intelligent platforms and applications capable of analyzing real-time data to provide insight and actionable information at the right time. The end goal is to support better and timelier decision making, enabled by the availability of up-to-date, high-quality information. Although there has been progress in this direction and many companies are introducing products towards meeting this goal, there is still a long way to go. In particular, the whole lifecycle of business intelligence requires innovative techniques and methodologies capable of dealing with the requirements imposed by this new generation of BI applications. From the capture of real-time business data to the transformation and delivery of actionable information, all stages of the Business Intelligence (BI) cycle call for new algorithms and paradigms to support value-added functionalities. These functionalities include dynamic integration of real-time data feeds from operational sources, optimization and evolution of ETL transformations and analytical models, and dynamic generation of adaptive real-time dashboards, just to name a few. The series of BIRTE workshops, started in 2006, has always been held in conjunction with VLDB. The series aims to provide a forum to discuss topics related to this emerging field and to set research directions towards making business intelligence more real-time. Following the success of previous BIRTE editions (2006, 2008-2012), submissions of research, industrial and position papers on relevant topics are encouraged.


4th International Workshop on Accelerating Data Management Systems using Modern Processor and Storage Architectures

The objective of this one-day workshop is to understand the impact of modern hardware technologies on accelerating core components of data management workloads (including traditional OLTP, data warehousing/OLAP, ETL, streaming/real-time workloads, and business analytics workloads such as text analytics, data mining, machine learning, graph analytics, RDF processing, and big data processing) using modern processors (e.g., commodity and specialized multi-core CPUs, GPUs, and FPGAs), storage systems (e.g., storage-class memories like SSDs and phase-change memory), and hybrid programming models like CUDA, OpenCL, and OpenACC. Specifically, the workshop hopes to explore the interplay between overall system design, core algorithms, query optimization strategies, programming approaches, performance modelling and evaluation, etc., from the perspective of data management applications.


TPC Technology Conference on Performance Evaluation and Benchmarking

The TPC has played, and continues to play, a crucial role in providing the computer industry with relevant industry-standard benchmarks. Vendors and end-users rely on TPC benchmarks to provide real-world data that is backed by a stringent and independent review process. Vendors also use TPC benchmarks to demonstrate performance competitiveness for their existing products and to improve and monitor the performance of products under development. Many buyers use TPC benchmark results as points of comparison when purchasing new computing systems. The technology landscape is continually evolving and challenging industry experts and researchers to develop innovative techniques to evaluate and benchmark computing systems. The TPC remains committed to developing highly relevant benchmark standards and will continue to develop new benchmarks to keep pace. With this conference, the TPC encourages researchers and industry experts to submit novel ideas and methodologies in performance evaluation, measurement, and characterization. Selected papers may be considered for future TPC benchmark development. Proceedings will be published by Springer-Verlag as Lecture Notes in Computer Science (LNCS).


11th International Workshop on Quality in DataBases

The problem of poor data quality in databases, data warehousing and information systems broadly affects every application domain. Many data processing tasks (such as information integration, data sharing, information retrieval, and knowledge discovery from databases) require various forms of data preparation and consolidation using complex data processing techniques, because the data input to the algorithms is assumed to conform to nice data distributions, containing no missing, inconsistent or incorrect values. This leaves a large gap between the available "dirty" data and the machinery available to process the data for application purposes. Building on the established tradition of nine previous international workshops on the topic of data and information quality, namely IQIS 2004-2006, CleanDB 2006 and QDB 2007-2012, the 2013 Quality in Databases (QDB) workshop, co-located with VLDB 2013, is a dedicated forum for presenting and discussing novel ideas and solutions related to the problems of assessing, monitoring, improving, and maintaining the quality of data.


1st VLDB Workshop on Databases and Crowdsourcing

Crowdsourcing systems, such as Amazon Mechanical Turk and CrowdFlower, utilize human power to perform difficult tasks, such as entity resolution, search, filtering, image matching, or clustering. The important issues of collecting and managing the large volume of data in these applications have attracted considerable attention from the database community. The goal of DBCrowd 2013, the First VLDB Workshop on Databases and Crowdsourcing, is to provide an avenue for database researchers and practitioners to disseminate and explore new research directions and promising results at the confluence of crowdsourcing and databases.


3rd International Workshop on Information Management in Mobile Applications

The increasing functionality and capacity of mobile devices have enabled new mobile applications that require new approaches for data management. Information management in mobile applications is a complex problem space that requires considering constraints on energy, CPU power, storage, etc. In addition, mobile data can take various forms, such as sensor data, user profiles and user context, spatial data, and multimedia data. The International Workshop on Information Management for Mobile Applications addresses a broad range of mobile application fields and provides a forum for discussing technologies and mechanisms that support the management of mobile, complex, integrated, distributed, and heterogeneous data-focused applications.


2nd International Workshop on Cloud Intelligence

With the increasing success of cloud computing, cloud business intelligence "as a service" offerings have proliferated, both from cloud start-ups and major BI industry vendors. Beyond porting BI features into the cloud, which already raises numerous issues (e.g., Big Data/NoSQL database modeling and storage, data localization, security and privacy, performance, cost and usage models), this trend also poses new, broader challenges for making data analytics available to small and medium-sized enterprises (SMEs), non-governmental organizations, web communities (e.g., supported by social networks), and even the average citizen, a vision that presumably requires a mixture of both private and open data. The aim of the Cloud-I workshop is to become an interdisciplinary, regular exchange forum for researchers, industry and practitioners, as well as all potential users of Cloud Intelligence. The submission of research, industrial, position, visionary, survey and student papers is encouraged to fuel the discussion.


1st Workshop on In-memory Data Management and Analytics

Over the last 30 years, memory prices have been dropping by a factor of 10 every 5 years. Main memory is the “new disk” for data storage. The number of I/O operations per second (IOPS) achievable in DRAM is far greater than that of other storage media such as hard disks and SSDs, and DRAM is readily available in the market at a better price point than DRAM alternatives. These trends make DRAM a better storage medium for latency-sensitive data management applications, large-scale web applications, and future applications such as wearable devices. The first international workshop on In-Memory Data Management and Analytics (IMDM 2013) aims to bring together researchers and practitioners interested in the proliferation of in-memory data management and analytics infrastructures. The workshop is a forum to present research challenges, novel ideas and methodologies that can improve in-memory (main memory) data management and analytics. The proceedings of IMDM 2013 are planned to be published by Springer-Verlag as Lecture Notes in Computer Science (LNCS).


7th International Workshop on Ranking in Databases

In recent years, there has been a great deal of interest in developing effective techniques for ad-hoc search and ranked retrieval in relational databases, XML, RDF and graph databases, text and multimedia databases, scientific information systems, social networks, and many more. In particular, a large number of emerging applications require an explorative form of querying on such general-purpose or domain-specific databases; examples include users wishing to search bibliographic databases or catalogues of products, such as homes, cars, cameras, restaurants, photographs, etc. Current database query languages, such as SQL, XQuery or SPARQL, are designed for expert users and follow a Boolean retrieval model, which is inadequate for exploratory users who cannot articulate their precise query needs. Top-k queries and ranking query results are gaining increasing importance to address the needs of exploratory users. In fact, in many of these applications, ranking is an integral part of the semantics, e.g., for keyword search and similarity search in multimedia as well as document collections. The increasing importance of ranking is directly derived from the explosion in the volume of data handled by current applications. Without ranking, users would frequently be overwhelmed by too many results. Furthermore, the sheer amount of data makes it almost impossible to process queries in a traditional compute-then-sort approach. Hence, ranking is a powerful tool for soliciting user preferences and for data exploration. Ranking also imposes several challenges on almost all data-centric systems. DBRank 2013 serves as a platform for the discussion of challenges, research, and applications in the context of ranking for relational, XML, RDF, text, multimedia, multidimensional, and social data.


VLDB PhD International Workshop

The VLDB PhD workshop is a forum for PhD students working in the broad areas addressed by the VLDB conference itself. This forum aims to facilitate interactions among PhD students and to stimulate feedback from more experienced researchers. We welcome submissions from PhD students at any stage of their PhD work. PhD workshop papers are meant to present ongoing thesis work in 6 pages. Those starting their PhDs should describe the problem they focus on, explain why it is important, detail why the existing solutions are not sufficient, and give an outline of the new solutions that are pursued. Those in the middle or close to completion should be more concrete in describing their contribution, still in the context of their doctoral work. Note that specific portions of the thesis work might have been published or submitted for publication. As in previous years, the accepted papers will be published in the VLDB proceedings. This workshop should be an opportunity for the upcoming generation of database researchers to share their work and get to know each other, and an opportunity for the rest of the database community to meet future colleagues and learn about the new ideas emerging in a variety of universities.


1st International Workshop on Big Dynamic Distributed Data

As the amount of streaming data produced by large-scale systems such as environmental monitoring, scientific experiments and communication networks grows rapidly, new approaches are needed to effectively process and analyze such data. There are several promising directions in the area of large-scale distributed computation, where multiple computing entities work together over partitions of the massive, streaming data to perform complex computations. Two important paradigms in this realm are continuous distributed monitoring (i.e., continually maintaining an accurate estimate of a complex query) and distributed, cluster-based systems for processing big, streaming data (e.g., IBM System S, Yahoo! S4, and Twitter Storm). The aim of the BD3 workshop is to bring together computer scientists with interests in this field to present recent innovations, find topics of common interest, and stimulate further development of new approaches for dealing with massive, dynamic, and distributed data.


14th Biennial Symposium on Data Base Programming Languages

For over 25 years, DBPL has established itself as the principal venue for publishing and discussing new ideas at the intersection of databases and programming languages. Many key contributions in query languages for object-oriented data, persistent databases, nested relational data, and semistructured data, as well as fundamental ideas in types for query languages, were first announced at DBPL. Today, the emergence of new data management applications such as cloud computing and “big data,” social network analysis, bidirectional programming, and data privacy has led to a new flurry of creative research in this area, as well as a tremendous amount of activity in industry. DBPL is an established destination for such new ideas.


The 7th Workshop on Personalized Access, Profile Management, and Context Awareness in Databases

The past decade has produced a rich ecosystem of web sites that provide personalized access to (semi-)structured data: financial asset tracking and management sites, personalized news delivery services, and even customized web search engines are just a few examples. A second wave of innovation has been fueled by the explosive growth of web platforms that enable rich online social interactions, such as online social networks, web communities, wikis, and mashups. These new applications go beyond personalized information access and dissemination. Users can now transcend their role of passive content consumers and engage in content creation, sharing, and various forms of online collaboration as well. This online collaboration has recently moved to the next level through crowdsourcing: applications that enable users to help other users in completing their tasks. All aforementioned applications rely critically on user-centric data — such as profile data, preferences, activity logs, location, group memberships, and social connections — to provide a personalized experience, including personalized search results, personalized ads, product recommendations, coupons and so forth. Additionally, online social applications provide an unprecedented amount of user-contributed social and context data. The interconnected nature of personalized, social, and contextual data management problems, as well as the fertile research ground these represent, motivates a discussion of these problems within the database community. We need to obtain a common understanding of new challenges and to collaborate on the design of new models, algorithms, and systems for emerging applications. The PersDB 2013 workshop aims at providing the appropriate venue for discussion and debate of the relevant issues and at nurturing related future research and applications.


International Symposium on Data-Driven Process Discovery and Analysis

With the increasing automation of business processes, growing amounts of process data become available. This opens new research opportunities for business process data analysis, mining and modeling. The aim of the IFIP 2.6 - 2.12 International Symposium on Data-Driven Process Discovery and Analysis is to offer a forum where researchers from different communities and industry can share their insights into this hot new field. The IFIP International Symposium on Data-Driven Process Discovery and Analysis (SIMPDA 2013) offers a unique opportunity to present new approaches and research results to researchers and practitioners working in business process data modeling, representation and privacy-aware analysis.


Secure Data Management

SDM 2013 brings together people from the security research community and the data management research community to exchange ideas on the secure management of data. The workshop will provide a forum for discussing practical experiences and theoretical research efforts that can help solve critical problems in secure data management. For this 10th anniversary year, we will put special emphasis on high-profile position papers, as well as on one or two special sessions on topics related to security, trust, and privacy in data-driven networked services and/or future internet service architectures.


3rd International Workshop on Semantic Search over the Web

We are witnessing a smooth evolution of the Web from a worldwide information space of linked documents to a global knowledge base composed of semantically interconnected resources. The continuous publishing and integration of the plethora of semantic datasets from companies, government and public sector projects is leading to the creation of the so-called Web of Knowledge. As a matter of fact, researchers are now looking with growing interest at semantic issues in this huge amount of correlated data available on the Web. Much progress has been made in the field of semantic technologies, from formal models to repositories and reasoning engines. The third edition of the International Workshop on Semantic Search over the Web (SSW) will discuss data management issues related to search over the web and its relationship with semantic web technologies, proposing new models, languages and applications. The SSW workshop invites researchers, engineers, and service developers to present their research and work in the field of data management for semantic search. Papers may deal with methods, models, case studies, practical experiences and technologies.



Monday Aug 26th 08:30-10:00

BIRTE / Cloud-I Keynote: (Actual timing 9:00-10:10)

Location: Room 1000B


AsterixDB: A New Platform for Real-Time Big Data BI

Mike Carey (UC Irvine, USA)


ADMS Keynote 1: (Actual timing 8:40-10:15)

Location: Room 300


Opening and welcome


Hadoop: Past, Present, and (possibly) Future

Milind Bhandarkar (Machine Learning Platforms, Pivotal Inc.)

Apache Hadoop has rapidly become the de facto data processing platform and is often mentioned synonymously with "Big Data". Hadoop started as a project within Apache Lucene and Nutch to scale the content backend for a web search engine. However, it is currently used in a majority of Fortune 500 companies and in many other application domains, such as fraud detection at credit card companies, healthcare analytics, and churn detection and prevention at telecom companies. In this talk, we will reminisce about the early days of Hadoop at Yahoo and the lessons learned in scaling this platform from a 20-node prototype to a datacenter-wide production deployment. We will give an overview of the current state of the Hadoop ecosystem and present some prominent patterns and use cases of this platform. We will also discuss how Hadoop is evolving, and its future as a platform for "Big Data" processing.


TPCTC Opening & Research 1

Location: Room Belvedere


Opening Remarks and Welcome

Raghunath Nambiar (TPC)

TPC State of the Council 2013

Raghunath Nambiar (Cisco), Meikel Poess (Oracle), Andrew Masland (NEC), H. Reza Taheri (VMware), Andrew Bond (RedHat), Forrest Carman (Owen Media), Michael Majdalany (L&M Mgmt Group)

TPC-BiH: A Benchmark for Bi-Temporal Databases

Martin Kaufmann (SAP AG, ETH Zurich), Peter Fischer (Albert-Ludwigs-Universität), Norman May (SAP AG), Andreas Tonder (SAP AG), Donald Kossmann (ETH Zurich)

Towards Comprehensive Measurement of Consistency Guarantees for Cloud-Hosted Data Storage Services

David Bermbach (KIT), Liang Zhao (NICTA and University of New South Wales), Sherif Sakr (NICTA and University of New South Wales)


DBCrowd Keynote 1

Location: Room 100A



Mining the Crowd

Tova Milo (Tel Aviv University)

Harnessing a crowd of Web users for data collection has recently become a widespread phenomenon. A key challenge is that human knowledge forms an open world, and it is thus difficult to know what kind of information we should be looking for. Classic databases have addressed this problem with data mining techniques that identify interesting data patterns. These techniques, however, are not suitable for the crowd, mainly due to properties of human memory, such as the tendency to remember simple trends and summaries rather than exact details. Following these observations, we have developed a novel model for crowd mining. The talk will consider the logical, algorithmic, and methodological foundations needed for such a mining process, as well as the applications that can benefit from the knowledge mined from the crowd.


IMMoA Research 1

Location: Room 100B



Towards a Framework for Semantic Exploration of Frequent Patterns

Behrooz Omidvar Tehrani (LIG, France), Sihem Amer-Yahia (CNRS, LIG, France), Alexandre Termier (LIG, France), Aurélie Bertaux (INRIA, France), Eric Gaussier (LIG, France), Marie-Christine Rousset (LIG, France)


A Method for Activity Recognition Partially Resilient on Mobile Device Orientation

Nikola Jajac, Bratislav Predic, Dragan Stojanovic (University of Nis, Serbia)


Cloud-I Keynote: (Together with BIRTE Keynote in BIRTE's Room - Actual timing 9:00-10:10)

Location: Room Meeting


IMDM Keynote 1

Location: Room Presidenza


Composing Scalability for Transactions on Multicore Platforms

Anastasia Ailamaki (EPFL)



Monday Aug 26th 10:30-12:00

BIRTE Research

Location: Room 1000B


LinkViews: An Integration Framework for Relational and Stream Systems

Yannis Sotiropoulos, Damianos Chatziantoniou

OLAP for Multidimensional Semantic Web Databases

Adriana Matei, Kuo-Ming Chao, Nick Godwin

A Multiple Query Optimization Scheme for Change Point Detection on Stream Processing System

Masahiro Ohke, Hideyuki Kawashima


ADMS Research 1: Compute Optimizations

Location: Room 300



Vectorizing Database Column Scans with Complex Predicates

Thomas Willhalm (Intel), Ismail Oukid (Intel and SAP AG), Ingo Muller (Karlsruhe Institute of Technology and SAP AG) and Franz Faerber (SAP AG)


High-Performance XML Twig Filtering using GPUs

Ildar Absalyamov (UC Riverside), Roger Moussalli (IBM T. J. Watson Research Center), Vassilis Tsotras and Walid Najjar (UC Riverside)


Skew Handling in Aggregate Streaming Queries on GPUs

Georgios Koutsoumpakis, Iakovos Koutsoumpakis (Uppsala University) and Anastasios Gounaris (Aristotle University of Thessaloniki)


TPCTC Invited Talk & Research 2

Location: Room Belvedere


(Invited Talk) TPC Express – A New Path for TPC Benchmarks

Karl Huppler

TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark

Peter Boncz (CWI), Thomas Neumann (Technical University Munich), Orri Erling (Openlink Software)


DBCrowd Research 1

Location: Room 100A



Crowdsourcing Feedback for Pay-As-You-Go Data Integration

Fernando Osorno-Gutierrez, Norman Paton and Alvaro A. A. Fernandes


The Palm-tree Index: Indexing with the crowd

Ahmed Mahmood, Walid Aref, Eduard Dragut and Saleh Basalamah


Crowdsourcing to Mobile Users: A Study of the Role of Platforms and Tasks

Vincenzo Della Mea, Eddy Maddalena and Stefano Mizzaro


IMMoA Invited Talk & Research 2

Location: Room 100B



(Invited Talk) Moving objects beyond raw and semantic trajectories

Maria Luisa Damiani (University of Milan, Italy)

A semantic trajectory is a relatively recent concept developed to flexibly represent the history of locations of an entity moving continuously or nearly continuously in a reference space (i.e., its trajectory). The key idea is to supplement the geometric representation of a trajectory (usually called the raw trajectory) with thematic information describing application-dependent, time-varying features of the entity's movement. For example, semantic trajectories can be used to describe the sequence of points of interest visited by tourists in a city, or the sequence of transportation means used by individuals traveling in an urban setting for work or leisure. In semantic trajectories, single positions or sequences of positions inside a trajectory can be semantically annotated, which makes a fine-grained description of the moving object's behaviour possible. The notion of semantic trajectory, however, opens up a number of issues. For example, semantic trajectories magnify the risk to privacy, because behavioral information on individuals is explicitly extracted and represented in machine-readable form, and can therefore be used within information processing applications and easily revealed to third parties. For this reason, semantic trajectories and privacy may clash. Moreover, the notion of semantic trajectory, while extensively investigated at the conceptual level, still lacks an operational and rigorous definition. In particular, the problem of specifying how to handle and query large amounts of semantic trajectories, so as to make the concept usable in real, data-intensive mobile applications, remains open. In this presentation, I will discuss on-going research on these issues and emphasize the evolution of the notion of moving object.
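
To make the data model concrete, here is a minimal C++ sketch of the concept the talk describes; the names and the annotate-a-range granularity are our illustrative choices, not definitions from the talk:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A raw trajectory sample: a position in some reference space plus a timestamp.
struct Sample {
    double x, y;   // coordinates (e.g., lon/lat)
    double t;      // timestamp
};

// A thematic annotation attached to a contiguous run of samples.
// Annotating a single position is the special case first == last.
struct Annotation {
    std::size_t first, last;   // inclusive index range into Trajectory::samples
    std::string tag;           // e.g., "museum visit", "on bus", "at work"
};

// A semantic trajectory = raw (geometric) trajectory + annotations over it.
struct SemanticTrajectory {
    std::vector<Sample> samples;
    std::vector<Annotation> annotations;  // time-varying thematic features
};
```

The privacy tension the speaker raises is visible even in this toy model: the `tag` fields turn raw geometry into explicit, machine-readable behavior.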


Extending Augmented Reality Mobile Application with Structured Knowledge from the LOD Cloud

Betül Aydin (Grenoble Informatics Lab, France), Jerome Gensel (Grenoble Informatics Lab, France), Philippe Genoud (Grenoble Informatics Lab, France), Sylvie Calabretto (INSA de Lyon, France), Bruno Tellez (Claude Bernard Uni. Lyon 1, France)


Cloud-I 1: MapReduce

Location: Room Meeting


Opening and welcome

Jerome Darmont and Torben Bach Pedersen


Cache Conscious Star-Join in MapReduce Environments

Guoliang Zhou, Yongli Zhu and Guilan Wang


Toward Intersection Filter-Based Optimization for Joins in MapReduce

Thuong-Cang Phan, Laurent d'Orazio and Philippe Rigaux


i2MapReduce: Incremental Iterative MapReduce

Yanfeng Zhang and Shimin Chen


IMDM Research 1

Location: Room Presidenza



Massively Parallel NUMA-aware Hash Joins

Harald Lang, Viktor Leis, Martina-Cezara Albutiu, Thomas Neumann, Alfons Kemper (Technische Universität München)


Fast Column Scans: Paged Indices for In-Memory Column Stores

Martin Faust, David Schwalb, Jens Krueger (Hasso Plattner Institut)


Compiled Plans for In-Memory Path-Counting Queries

Brandon Myers, Jeremy Hyrkas, Daniel Halperin, Bill Howe (University of Washington - Seattle)



Monday Aug 26th 13:30-15:30

BIRTE Keynote 2 & Demo

Location: Room 1000B


(Keynote) Query Adaptation and Privacy for Real-time Business Intelligence

Prof. Dr. Christoph Freytag (HUB, Germany)

(Demo) Big Scale Text Analytics and Smart Content Navigation

Karsten Schmidt, Philipp Scholl, Sebastian Bächle, Georg Nold

(Demo) Dynamic Generation of Adaptive Real-time Dashboards for Continuous Data Stream Processing

Timo Michelsen, Marco Grawunder, Dennis Geesen, H.-Jürgen Appelrath


ADMS Research 2: Memory/Storage Optimizations (Actual timing 13.30-15.00)

Location: Room 300



Task Scheduling for Highly Concurrent Analytical and Transactional Main-Memory Workloads

Iraklis Psaroudakis (EPFL), Tobias Scheuer (SAP AG), Norman May (SAP AG) and Anastasia Ailamaki (EPFL)


Modularizing B+-trees: Three-Level B+-trees Work Fine

Shigero Sasaki and Takuya Araki (NEC Corporation)


FBARC: I/O Asymmetry Aware Buffer Replacement Strategy

Paul Dubs (TU Darmstadt), Ilia Petrov (Reutlingen University), Robert Gottstein and Alejandro Buchmann (TU Darmstadt)


TPCTC Research 3: (Actual timing 14:00-15:30)

Location: Room Belvedere


Benchmarking Challenges in the New World of Big Data and Cloud Services

Raghu Ramakrishnan (Microsoft)

Architecture and Performance Characteristics of a PostgreSQL Implementation of TPC-E and TPC-V Workloads

H. Reza Taheri (VMware), Andrew Bond (RedHat), Doug Johnson (InfoSizing), Greg Kopczynski (VMware)


DBCrowd Keynote 2 & Research 2

Location: Room 100A


(Keynote) Multi-Platform, Reactive Crowdsourcing

Stefano Ceri (Politecnico di Milano)


Wrapper Generation Supervised by a Noisy Crowd

Valter Crescenzi, Paolo Merialdo and Disheng Qiu


Condition-Task-Store: A Declarative Abstraction for Microtask-based Complex Crowdsourcing

Kenji Gonnokami, Atsuyuki Morishima and Hiroyuki Kitagawa


IMMoA Research 3

Location: Room 100B



Vanet-X: A Videogame to Evaluate Information Management in Vehicular Networks

Sergio Ilarri, Eduardo Mena, Víctor Rújula (University of Zaragoza, Spain)


Mobile objects and sensors within a video surveillance system: Spatio-temporal model and queries

Dana Codreanu, Ana-Maria Manzat, Florence Sedes (Université de Toulouse, France)


MappingSets for Spatial Observation Data Warehouses

José R.R. Viqueira, David Martínez, Sebastián Villarroya, José A. Taboada (Universidade de Santiago de Compostela, Spain)


To trust, or not to trust: Highlighting the need for data provenance in mobile apps for smart cities

Mikel Emaldi (DeustoTech, Spain), Oscar Peña (DeustoTech, Spain), Jon Lázaro (DeustoTech, Spain), Diego López-de-Ipiña (DeustoTech, Spain), Sacha Vanhecke (Ghent University, Belgium), Erik Mannens (Ghent University, Belgium)


Cloud-I 2: Emerging Topics

Location: Room Meeting



Bloofi: A Hierarchical Bloom Filter Index with Applications to Distributed Data Provenance

Adina Crainiceanu


Cloud Intelligence - Challenges for Research and Industry (Roundtable/Panel discussion)

Jerome Darmont and Torben Bach Pedersen


IMDM Keynote 2 & Research 2

Location: Room Presidenza


(Keynote) Evolving the architecture of SQL Server for modern hardware

Paul Larson (Microsoft)


Bringing Linear Algebra Objects to Life in a Column-Oriented In-Memory Database

David Kernert (SAP), Frank Köhler (SAP), Wolfgang Lehner (TU Dresden)


Dynamic Query Prioritization for In-Memory Databases

Johannes Wust (Hasso Plattner Institut), Martin Grund (EXascale Infolab), Hasso Plattner (Hasso Plattner Institut)



Monday Aug 26th 16:00-18:00

BIRTE Industrial Paper & Panel: (Actual timing 16:00-17:30)

Location: Room 1000B


(Invited industrial talk) The Inverted Data Warehouse based on TARGIT Xbone - How the biggest of data can be mined by "the little guy"

Dr. Morten Middelfart (TARGIT, USA/Denmark)

(Panel)

Moderator: Meichun Hsu (HP Labs)


ADMS Keynote 2: (Actual timing 15:45-17:00)

Location: Room 300



Active Storage: Exploring a Scalable, Compute-In-Storage model by extending the Blue Gene/Q architecture with Integrated Non-volatile Memory

Blake Fitch (IBM T. J. Watson Research Center)

Emerging storage class memories offer a set of challenges and opportunities in system architecture, programming models, and application design. We are exploring the close integration of emerging solid-state storage technologies in conjunction with high performance networks and integrated processing capability. Specifically, we consider the extension of the Blue Gene/Q architecture by integrating Flash into the node to enable a scalable, data-centric computing platform. We are using BG/Q as a rapid prototyping platform allowing us to build a research system based on an infrastructure with proven scalability to thousands of nodes. Our work also involves enabling a Linux environment with standard network interfaces on the BG/Q hardware. We plan to explore applications of this system architecture including existing file systems and middleware as well as more aggressive compute-in-storage approaches. Compute-in-storage is intended to enable the use of high performance (HPC) programming techniques (MPI) to implement data-centric algorithms (e.g. sort, join, graph) that execute on processing elements embedded within a storage system. This presentation will review the architectural extension to BG/Q, present a progress report on the project, and describe some early results.


TPCTC Research 4 & Closing

Location: Room Belvedere


A Practice of TPC-DS Multidimensional Implementation in NoSQL Database Systems

Hongwei Zhao (Tsinghua University), Xiaojun Ye (Tsinghua University)

PRIMEBALL: A Parallel Processing Framework Benchmark for Big Data Applications in the Cloud

Jaume Ferrarons, Mulu Adhana, Carlos Colmenares, Sandra Pietrowska, Fadila Bentayeb, Jerome Darmont (Universite de Lyon)

CEPBen: A Benchmark for Complex Event Processing Systems

Chunhui Li (Aston University), Robert Berry (Aston University)

TPCTC 2013 Closing Remarks

Meikel Poess (Oracle)


DBCrowd Vision & Discussions

Location: Room 100A



Data In Context: Aiding News Consumers while Taming Dataspaces

Eugene Wu, Adam Marcus, Sam Madden


Cost and Quality Trade-Offs in Crowdsourcing

Anja Gruenheid, Donald Kossmann


Crowds, not Drones: Modeling Human Factors in Interactive Crowdsourcing

Senjuti Basu Roy, Ioanna Lykourentzou, Saravanan Thirumuruganathan, Sihem Amer-Yahia and Gautam Das


IMMoA Research 4

Location: Room 100B



HealthNet: A System for Mobile and Wearable Health Information Management

Christoph Quix, Johannes Barnickel, Sandra Geisler, Marwan Hassani, Saim Kim, Xiang Li, Andreas Lorenz, Till Quadflieg, Thomas Gries, Matthias Jarke, Steffen Leonhardt, Ulrike Meyer, Thomas Seidl (RWTH Aachen University, Germany)


A clinical quality feedback loop supported by mobile point of care (POC) data collection

Christopher A. Bain (Alfred Health, Australia), Tracey Bucknall (Deakin University, Australia), Janet Weir-Phyland (Alfred Health, Australia)


IMDM Research 3

Location: Room Presidenza



Aggregates Caching in Columnar In-Memory Databases

Stephan Müller, Hasso Plattner (Hasso Plattner Institut)


An Evaluation of Strict Timestamp Ordering Concurrency Control for Main-Memory Database Systems

Henrik Mühe, Stephan Wolf, Alfons Kemper, Thomas Neumann (Technische Universität München)



Monday Aug 26th 18:15-23:00

Welcome Reception

Location: Palameeting Gardens


A night next to the lake

DJ Paso



Tuesday Aug 27th 09:00-10:30

Welcome and Keynote 1

Location: Room 1000A

Chair: Themis Palpanas (University of Trento)



Data Infrastructure at Web Scale

Jay Parikh, VP of Infrastructure Engineering (Facebook)

Nearly every team at Facebook depends on the company's custom-built data infrastructure for warehousing and analytics, with roughly 1,000 people across the company (technical and non-technical) using these technologies every day. Given Facebook's unique scalability challenges (their data warehouse is more than 250 PB in size, and they add 600 TB of new data every day) and processing needs (they crunch more than 10 PB of data a day), the company's data infrastructure team has to ensure that its systems are prepared to handle not just today's challenges, but tomorrow's as well. In this session, Facebook's Jay Parikh will provide an overview of the company's data infrastructure, focusing on the custom-built technologies they've developed (including Corona, Presto, Morse, and Giraph) to meet the scale challenges they face.

Bio: Jay Parikh is the VP of infrastructure engineering at Facebook. In that role, he leads the engineering and operations teams responsible for building and maintaining an infrastructure that serves more than a billion users, developers, and partners worldwide. Prior to Facebook, Jay was senior vice president of engineering and operations at Ning, where he oversaw the scaling of the company’s social networking platform from 50,000 social networks to more than 1.5 million social networks. Before Ning, Jay was the vice president of engineering at Akamai Technologies, where he helped build the world’s largest and most globally distributed computing platform.



Tuesday Aug 27th 11:00-12:30

Research 1: Emerging Hardware

Location: Room 1000A

Chair: Rajesh Bordawekar (IBM T.J. Watson Research Center)



Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture

Jiong HE (Nanyang Technological University), Mian Lu (A*STAR Institute of High Performance Computing), Bingsheng He (NTU Singapore)

Query co-processing on graphics processors (GPUs) has become an effective means to improve the performance of main memory databases. However, the relatively low bandwidth and high latency of the PCI-e bus is usually a bottleneck for such co-processing. Recently, coupled CPU-GPU architectures have received a lot of attention, e.g., AMD APUs with the CPU and the GPU integrated into a single chip. That opens up new opportunities for optimizing query co-processing. In this paper, we experimentally revisit hash joins, one of the most important join algorithms for main memory databases, on such coupled CPU-GPU architectures. In particular, we study fine-grained co-processing mechanisms for hash joins with and without partitioning. The co-processing outlines an interesting design space. We extend existing cost models to automatically guide decisions in the design space. Our experimental results on a recent AMD APU show that (1) the coupled architecture enables fine-grained co-processing and cache reuse, which are inefficient on discrete CPU-GPU architectures; (2) careful design and tuning are important for the performance of co-processing on the coupled architecture, and the cost model can automatically guide the tuning knobs in the design space; (3) fine-grained co-processing achieves performance improvements of up to 53.35%, 35.82%, and 28.74% over CPU-only, GPU-only, and conventional CPU-GPU co-processing, respectively.


Hardware-Oblivious Parallelism for In-Memory Column-Stores

Max Heimel (Technische Universität Berlin), Michael Saecker (ParStream GmbH), Holger Pirk (CWI), Stefan Manegold (CWI), Volker Markl (Technische Universität Berlin)

The multi-core architectures of today's computer systems make parallelism a necessity for performance critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem: current database systems rely on labor-intensive and error-prone manual tuning to exploit the full potential of modern parallel hardware architectures like multi-core CPUs and GPUs. We propose an alternative design for a parallel database engine, based around a single set of hardware-oblivious operators, which are compiled down to the actual hardware at runtime. This approach reduces the development overhead for parallel database systems, while achieving competitive performance to hand-tuned systems. We provide a proof-of-concept for this design by integrating operators written using the parallel programming framework OpenCL into the open-source database MonetDB. Thereby, we achieve efficient, yet highly portable parallel code without the need for optimization by hand. We evaluate our implementation against MonetDB using TPC-H derived queries and observe performance that rivals that of MonetDB's parallelized query execution on the CPU and surpasses it on the GPU. In addition, we show that the same set of operators runs nearly unchanged on a GPU, demonstrating the feasibility of our approach.


Improving Flash Write Performance by Using Update Frequency

Radu Stoica (EPFL), Anastasia Ailamaki (EPFL)

Solid-state drives (SSDs) are quickly becoming the default storage medium for database systems as the cost of NAND flash memory continues to drop. However, flash memory introduces new challenges due to its asymmetry between reading and writing. A software Flash Translation Layer (FTL) is used to overcome the technology's limitations and to give applications the illusion of a traditional block device by storing data in a log-structured fashion. Despite a large number of existing FTL algorithms, performance, predictability, and device lifetime remain issues, especially for the write-intensive workloads typical of database applications. In this paper, we show that more efficient FTLs and improved SSD endurance are both possible by using the I/O write skew to guide data placement on flash memory. We model the relationship between data placement and write performance for basic I/O write patterns and detail the most important concepts of writing to flash memory: i) the trade-off between the extra capacity available and write overhead, ii) the benefit of adapting data placement to write skew, iii) the impact of the space reclamation policy, and iv) how to estimate the best achievable write performance for a given I/O workload. Based on the findings of the theoretical model, we propose a new principled data placement algorithm that can be incorporated into any existing FTL proposal. We show the benefits of our data placement algorithm when running micro-benchmarks and real database I/O traces: it reduces write overhead by 20%-75% compared to state-of-the-art techniques.
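
As a toy illustration of the general idea of skew-aware placement (a minimal sketch under our own simplifying assumptions, not the paper's model or algorithm), frequently updated pages can be appended to a different log segment than rarely updated ones:

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of skew-aware data placement on a log-structured FTL: pages that
// are updated often ("hot") go to a separate log segment from rarely
// updated ("cold") pages. The threshold and the two-way split below are
// illustrative choices only.
class SkewAwarePlacer {
public:
    enum class Log { Hot, Cold };

    // Called on every logical page write; returns the log to append to.
    Log place(uint64_t pageId) {
        uint64_t n = ++updates_[pageId];
        ++totalWrites_;
        // A page is "hot" if its update count exceeds the average count
        // over all pages observed so far.
        double avg = static_cast<double>(totalWrites_) / updates_.size();
        return (static_cast<double>(n) > avg) ? Log::Hot : Log::Cold;
    }

private:
    std::unordered_map<uint64_t, uint64_t> updates_;  // per-page write counts
    uint64_t totalWrites_ = 0;
};

int main() {
    SkewAwarePlacer p;
    p.place(2);                              // one cold write to page 2
    for (int i = 0; i < 9; ++i) p.place(1);  // page 1 is updated often
    // Page 1 now exceeds the average update count and goes to the hot log.
    return p.place(1) == SkewAwarePlacer::Log::Hot ? 0 : 1;
}
```

The payoff of such separation is that hot segments accumulate invalidated pages quickly, so garbage collection copies little live data out of them, which is where write-overhead savings come from.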


Hybrid Storage Management for Database Systems

Xin Liu (University of Waterloo), Kenneth Salem (University of Waterloo)

Flash-based solid state drives (SSDs) are becoming part of the storage system. Adding SSDs to a storage system not only raises the question of how to manage them, but also the question of whether current buffer pool algorithms will still work effectively. We are interested in the use of hybrid storage systems, consisting of SSDs and hard disk drives (HDDs), for database management. We present cost-aware replacement algorithms for both the DBMS buffer pool and the SSD. These algorithms are aware of the different I/O performance of HDDs and SSDs. In such a hybrid storage system, the physical access pattern to the SSD depends on the management of the DBMS buffer pool. We studied the impact of buffer pool caching policies on the access patterns of the SSD. Based on these studies, we designed a cost-adjusted caching policy to effectively manage the SSD. We implemented these algorithms in MySQL's InnoDB storage engine and used the TPC-C workload to demonstrate that these cost-aware algorithms outperform previous algorithms.
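
For intuition about what cost-aware replacement can mean, here is a sketch of the classic GreedyDual eviction scheme (a generic textbook policy, not the authors' algorithm), where a page's priority combines recency with the cost of re-reading it from its home device:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// GreedyDual cost-aware eviction sketch. Each cached page has priority
// H = L + reloadCost, where L is an "inflation" value equal to the priority
// of the last victim. Evicting the minimum-H page means pages that are
// cheap to re-read (e.g., SSD-resident) leave the cache before equally
// recent pages whose reload would hit the slower HDD.
class CostAwareCache {
public:
    explicit CostAwareCache(std::size_t capacity) : capacity_(capacity) {}

    // Record an access; reloadCost reflects the page's home device,
    // e.g., 1.0 for SSD-resident pages and 10.0 for HDD-resident ones.
    void access(uint64_t pageId, double reloadCost) {
        auto it = prio_.find(pageId);
        if (it != prio_.end()) { it->second = inflation_ + reloadCost; return; }
        if (!prio_.empty() && prio_.size() >= capacity_) evictOne();
        prio_[pageId] = inflation_ + reloadCost;
    }

private:
    void evictOne() {
        auto victim = prio_.begin();
        for (auto it = prio_.begin(); it != prio_.end(); ++it)
            if (it->second < victim->second) victim = it;
        inflation_ = victim->second;  // ages the remaining pages relatively
        prio_.erase(victim);
    }

    std::size_t capacity_;
    double inflation_ = 0.0;                     // L in GreedyDual
    std::unordered_map<uint64_t, double> prio_;  // pageId -> H
};
```

A real implementation would keep priorities in a heap rather than scanning linearly; the abstract's cost-adjusted policy additionally accounts for how the buffer pool filters the access pattern that the SSD actually sees.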


The Yin and Yang of Processing Data Warehousing Queries on GPU Devices

Yuan Yuan (The Ohio State University), Rubao Lee (The Ohio State University), Xiaodong Zhang (The Ohio State University)

The database community has made significant research efforts to optimize query processing on GPUs in the past few years. However, GPUs have hardly been adopted in major warehousing production systems. In preparation for bringing GPUs into warehousing systems, we identify and address several critical issues in a three-dimensional study of warehousing queries on GPUs, varying query characteristics, software techniques, and GPU hardware configurations. We also propose an analytical model to understand and predict query performance on GPUs. Based on our study, we present performance insights for warehousing query execution on GPUs.


Tutorial 1

Location: Room 1000B

Chair: Gianni Mecca (Universita della Basilicata)



Big Data Integration

Xin Luna Dong (Google Inc.) and Divesh Srivastava (AT&T Labs-Research)

The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data. BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands, (ii) many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

Bio: Xin Luna Dong is a senior research scientist at Google Inc. Prior to joining Google, she worked for AT&T Labs-Research. She received her Ph.D. from the University of Washington in 2007, a Master's degree from Peking University in China, and a Bachelor's degree from Nankai University in China. Her research interests include databases, information retrieval and machine learning, with an emphasis on data integration, data cleaning, knowledge bases, and personal information management. She has led the Solomon project, whose goal is to detect copying between structured sources and to leverage the results in various aspects of data integration, and the Semex personal information management system, which received the Best Demo award (one of the top 3) at SIGMOD 2005. She co-chaired the CIKM Demo track 2013, the SIGMOD/PODS PhD Symposium 2012-2013, QDB 2012, and WebDB 2010, and served as an area chair or senior PC member for ICDE'13 and CIKM'11.

Bio: Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech. from the Indian Institute of Technology, Bombay. He is an ACM fellow, on the board of trustees of the VLDB Endowment and an associate editor of the ACM Transactions on Database Systems. He has served as the program committee co-chair of many conferences, including VLDB 2007. His research interests and publications span a variety of topics in data management.


Research 2: Indexing

Location: Room 300

Chair: Stefan Manegold (CWI)



A Performance Study of Three Disk-based Structures for Indexing and Querying Frequent Itemsets

Guimei Liu (National University of Singapore), Andre Suchitra (National University of Singapore), Limsoon Wong (National University of Singapore)

Frequent itemset mining is an important problem in the data mining area. Extensive efforts have been devoted to developing efficient algorithms for mining frequent itemsets. However, not much attention has been paid to managing the large collection of frequent itemsets produced by these algorithms for subsequent analysis and user exploration. In this paper, we study three structures for indexing and querying frequent itemsets: inverted files, signature files and the CFP-tree. The first two structures have been widely used for indexing general set-valued data; we make some modifications to make them more suitable for indexing frequent itemsets. The CFP-tree structure is specially designed for storing frequent itemsets; we add a pruning technique based on length-2 frequent itemsets to make it more efficient for processing superset queries. We study the performance of the three structures in supporting five types of containment queries: exact match, subset/superset search and immediate subset/superset search. Our results show that no structure outperforms the others for all five types of queries on all datasets, but the CFP-tree shows better overall performance than the other two structures.
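
As background for the inverted-file structure, here is a minimal sketch of how a superset query (find stored itemsets that contain every item of the query) reduces to intersecting posting lists; the paper's variants add frequent-itemset-specific modifications on top of this basic scheme, and items within an itemset are assumed distinct:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>
#include <vector>

using Itemset = std::vector<int>;

// Inverted file over a collection of itemsets: each item maps to the
// sorted list of ids of itemsets that contain it.
class InvertedFile {
public:
    explicit InvertedFile(const std::vector<Itemset>& sets) {
        for (std::size_t id = 0; id < sets.size(); ++id)
            for (int item : sets[id]) postings_[item].push_back(id);
    }

    // Ids of stored itemsets that are supersets of q: the intersection of
    // the posting lists of q's items. (Returns {} for an empty q; adjust
    // if "all itemsets" is the desired semantics for that case.)
    std::vector<std::size_t> supersetsOf(const Itemset& q) const {
        std::vector<std::size_t> result;
        bool first = true;
        for (int item : q) {
            auto it = postings_.find(item);
            if (it == postings_.end()) return {};  // no itemset has this item
            if (first) { result = it->second; first = false; continue; }
            std::vector<std::size_t> merged;
            std::set_intersection(result.begin(), result.end(),
                                  it->second.begin(), it->second.end(),
                                  std::back_inserter(merged));
            result.swap(merged);
        }
        return result;
    }

private:
    std::map<int, std::vector<std::size_t>> postings_;
};
```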


Computing Immutable Regions for Subspace Top-k Queries

Kyriakos Mouratidis (Singapore Management University), HweeHwa Pang (Singapore Management University)

Given a high-dimensional dataset, a top-k query can be used to shortlist the k tuples that best match the user's preferences. Typically, these preferences regard a subset of the available dimensions (i.e., attributes) whose relative significance is expressed by user-specified weights. Along with the query result, we propose to compute for each involved dimension the maximal deviation to the corresponding weight for which the query result remains valid. The derived weight ranges, called immutable regions, are useful for performing sensitivity analysis, for fine-tuning the query weights, etc. In this paper, we focus on top-k queries with linear preference functions over the queried dimensions. We codify the conditions under which changes in a dimension's weight invalidate the query result, and develop algorithms to compute the immutable regions. In general, this entails the examination of numerous non-result tuples. To reduce processing time, we introduce a pruning technique and a thresholding mechanism that allow the immutable regions to be determined correctly after examining only a small number of non-result tuples. We demonstrate empirically that the two techniques combine well to form a robust and highly resource-efficient algorithm. We verify the generality of our findings using real high-dimensional data from different domains (documents, images, etc) and with different characteristics.
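
The pairwise condition underlying immutable regions can be made concrete with a brute-force sketch (illustrative only; the paper's contribution is precisely to avoid this full scan over non-result tuples via pruning and thresholding). Perturbing weight w_d by delta changes score(t) by delta * t[d], and the top-k set stays valid exactly when every result tuple still beats every non-result tuple:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Interval { double lo, hi; };

// For linear scoring score(t) = sum_i w[i]*t[i], compute the range of
// perturbations delta of w[d] under which the current top-k set (results)
// remains the top-k versus all other tuples (others). Each (r, s) pair
// contributes the linear constraint delta * (r[d] - s[d]) >= score(s) - score(r).
Interval immutableRegion(const std::vector<std::vector<double>>& results,
                         const std::vector<std::vector<double>>& others,
                         const std::vector<double>& w, std::size_t d) {
    auto score = [&](const std::vector<double>& t) {
        double s = 0;
        for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * t[i];
        return s;
    };
    Interval iv{-std::numeric_limits<double>::infinity(),
                std::numeric_limits<double>::infinity()};
    for (const auto& r : results) {
        for (const auto& s : others) {
            double gap = score(s) - score(r);   // <= 0 for a valid top-k
            double coef = r[d] - s[d];
            if (coef > 0)      iv.lo = std::max(iv.lo, gap / coef);
            else if (coef < 0) iv.hi = std::min(iv.hi, gap / coef);
            else if (gap > 0)  return {0, 0};   // result was already invalid
        }
    }
    return iv;  // w[d] may safely move within [w[d]+iv.lo, w[d]+iv.hi]
}
```

This O(k * n) pairwise scan is exactly the cost the paper's pruning and thresholding techniques are designed to avoid.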


A Data-adaptive and Dynamic Segmentation Index for Whole Matching on Time Series

Yang Wang (Fudan University), Peng Wang (Fudan University), Jian Pei (Simon Fraser University), Wei Wang (Fudan University),Sheng Huang (IBM Research China)

Similarity search on time series is an essential operation in many applications. In state-of-the-art methods, such as R-tree-based methods, SAX and iSAX, time series are by default divided into equi-length segments globally; that is, all time series are segmented in the same way. Those methods then focus on how to approximate or symbolize the segments and construct indexes. In this paper, we make an important observation: global segmentation of all time series may incur unnecessary cost in space and time for indexing time series. We develop DSTree, a data-adaptive and dynamic segmentation index for time series. In addition to savings in space and time, our new index provides tight upper and lower bounds on distances between time series. An extensive empirical study shows that DSTree supports time series similarity search effectively and efficiently.


Efficient Indexing for Diverse Query Results

Lu Li (National University of Singapore), Chee-Yong Chan (National University of Singapore)

This paper examines the problem of computing diverse query results, which is useful for browsing search results in online shopping applications. The search results are diversified with respect to a sequence of output attributes (termed a d-order), where an attribute that appears earlier in the d-order has higher priority for diversification. We present a new indexing technique, D-Index, to efficiently compute diverse query results for queries with static or dynamic d-orders. Our performance evaluation demonstrates that D-Index outperforms the state-of-the-art techniques developed for queries with static or dynamic d-orders.


LLAMA: A Cache/Storage Subsystem for Modern Hardware

Justin Levandoski (Microsoft Research), David Lomet (Microsoft Research), Sudipta Sengupta (Microsoft Research)

LLAMA (Latch-free, Log-structured Access Method Aware) is a caching and storage subsystem for new hardware environments (e.g., flash, multi-core). LLAMA supports an API for page-oriented access methods that provides both cache and storage management, optimizing processor caches and secondary storage. The caching (CL) and storage (SL) layers use a common mapping table that separates a page's logical and physical location. CL supports data updates and management updates (e.g., for index re-organization) via latch-free compare-and-swap atomic state changes on its mapping table. SL uses the same mapping table to cope with the page location changes produced by log structuring on every page flush. To demonstrate LLAMA's suitability, we tailored our latch-free Bw-tree implementation to use LLAMA. The Bw-tree is an ordered, B-tree-style index. Executing on LLAMA, it exhibits much higher performance and multi-core scalability on real workloads compared with BerkeleyDB's B-tree, which is known for good performance.
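
The mapping-table idea is easy to sketch (heavily simplified; LLAMA's real table also encodes secondary-storage locations, and all names below are our own): a page id indexes a slot holding the page's current physical pointer, and updates install new state with a single compare-and-swap, so readers and writers never take latches:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct Page;  // opaque in-memory page representation

// Latch-free mapping table: logical page id -> current physical pointer.
class MappingTable {
public:
    explicit MappingTable(std::size_t n) : slots_(n) {}

    Page* read(std::size_t pid) const {
        return slots_[pid].load(std::memory_order_acquire);
    }

    // Attempt to replace the page state observed as 'expected' with
    // 'desired'. Returns false if another thread won the race, in which
    // case the caller re-reads the slot and retries.
    bool tryUpdate(std::size_t pid, Page* expected, Page* desired) {
        return slots_[pid].compare_exchange_strong(
            expected, desired, std::memory_order_acq_rel);
    }

private:
    std::vector<std::atomic<Page*>> slots_;
};
```

Because both the cache and storage layers go through this single level of indirection, relocating a page (in memory or in the log) only requires swinging one pointer.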


Industry 1: Systems

Location: Room 120

Chair: Shel Finkelstein (SAP)



MillWheel: Fault-Tolerant Stream Processing at Internet Scale

Tyler Akidau (Google), Alex Balikov (Google), Kaya Bekiroglu (Google), Slava Chernyak (Google), Josh Haberman (Google), Reuven Lax (Google), Sam McVeety (Google), Daniel Mills (Google), Paul Nordstrom (Google), Sam Whittle (Google)

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework's fault-tolerance guarantees. This paper describes MillWheel's programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel's features are used. MillWheel's programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel's unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
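
To give a flavor of the programming model (a toy interface of our own devising, not MillWheel's actual API), user code can be thought of as a per-record function that the framework invokes with timestamped records and that key's persistent state, with fault tolerance handled underneath:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// A timestamped record routed to a node by key.
struct Record {
    std::string key;
    double value;
    int64_t timestampUs;  // logical event time
};

// Toy stand-in for a streaming node with per-key persistent state,
// loosely in the spirit of a continuous anomaly detector.
class MeanTracker {
public:
    // Process one record; returns true if the value deviates far enough
    // from the key's running mean to be flagged.
    bool process(const Record& r) {
        State& s = state_[r.key];  // framework-managed state in a real system
        ++s.count;
        s.sum += r.value;
        double mean = s.sum / s.count;
        return s.count > 10 && (r.value > 2 * mean || r.value < mean / 2);
    }

private:
    struct State { int64_t count = 0; double sum = 0; };
    std::unordered_map<std::string, State> state_;
};
```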


F1: A Distributed SQL Database That Scales

Jeff Shute (Google), Radek Vingralek (Google), Bart Samwel (Google), Ben Handy (Google), Chad Whipkey (Google), Eric Rollins (Google), Mircea Oancea (Google), Kyle Littlefield (Google), David Menestrina (Google), Stephan Ellner (Google), John Cieslewicz (Google), Ian Rae (UW Madison), Traian Stancescu (Google), Himani Apte (Google)

F1 is a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.


DB2 with BLU Acceleration: So Much More than Just a Column Store

Vijayshankar Raman (IBM Research), Gopi Attaluri (IBM SWG), Ronald Barber (IBM Research), Naresh Chainani (IBM SWG), David Kalmuk (IBM SWG), Vincent Kulandai Samy (IBM SWG), Jens Leenstra (IBM STG) , Sam Lightstone (IBM SWG), Shaorong Liu (IBM SWG), Guy M. Lohman (IBM Research), Tim Malkemus (IBM Research), Rene Mueller (IBM Research), Ippokratis Pandis (IBM Research), Berni Schiefer (IBM SWG), David Sharpe (IBM SWG), Richard Sidle (IBM Research), Adam Storm (IBM SWG), Liping Zhang (IBM SWG)

DB2 BLU deeply integrates within IBM's DB2 for Linux, UNIX, and Windows new techniques for defining and processing column-organized tables that speed read-mostly Business Intelligence queries by 10 to 50 times and improve compression by 3 to 10 times, compared to traditional row-organized tables, without the complexity of defining indexes or materialized views on those tables. But DB2 BLU is much more than just a column store. Exploiting frequency-based dictionary compression and main-memory query processing technology from the Blink project at IBM Research - Almaden, DB2 BLU performs most SQL operations -- predicate application (even range predicates and IN-lists), joins, and grouping -- on the compressed values, which can be packed bit-aligned so densely that multiple values fit in a register and can be processed simultaneously via SIMD (single-instruction, multiple-data) instructions. Designed and built from the ground up to exploit modern multi-core processors, DB2 BLU's hardware-conscious algorithms are carefully engineered to maximize parallelism by using novel data structures that need little latching, and to minimize data-cache and instruction-cache misses. Though DB2 BLU is optimized for in-memory processing, table size is not limited by the size of main memory. Fine-grained synopses, late materialization, and aggressive prefetching minimize disk I/Os. Full integration with DB2 ensures that DB2 BLU benefits from the full functionality and robust utilities of a mature product, while still enjoying order-of-magnitude performance gains without even having to change the SQL, and can mix column-organized and row-organized tables in the same tablespace and even within the same query.
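
A stripped-down sketch of one core idea, evaluating predicates directly on dictionary codes (assuming an order-preserving dictionary, which is what lets a range predicate on values become a range test on codes; BLU's bit-aligned packing and explicit SIMD are omitted here):

```cpp
#include <cstdint>
#include <vector>

// Scan a column of fixed-width dictionary codes and return qualifying row
// ids for the predicate lo <= value <= hi, where loCode/hiCode are the
// query bounds already encoded through the (order-preserving) dictionary.
// The scan never decompresses the column.
std::vector<uint32_t> scanRange(const std::vector<uint8_t>& codes,
                                uint8_t loCode, uint8_t hiCode) {
    std::vector<uint32_t> qualifyingRows;
    for (uint32_t row = 0; row < codes.size(); ++row) {
        // Compilers typically auto-vectorize this comparison; production
        // systems pack codes bit-aligned and use explicit SIMD instead.
        if (codes[row] >= loCode && codes[row] <= hiCode)
            qualifyingRows.push_back(row);
    }
    return qualifyingRows;
}
```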


The Quantcast File System

Michael Ovsiannikov (Quantcast), Silvius Rus (Quantcast), Damian Reeves (Google), Paul Sutter (Quantcast), Sriram Rao (Microsoft), Jim Kelly (Quantcast), Chris Zimmerman (Quantcast), Dan Adkins (Google), Thilee Subramaniam (Quantcast), Jeremy Fishman (Quantcast)

The Quantcast File System (QFS) is an efficient alternative to the Hadoop Distributed File System (HDFS). QFS is written in C++, is plugin-compatible with Hadoop MapReduce and offers several efficiency improvements relative to HDFS: 50% disk space savings through erasure coding instead of replication, corresponding 2x higher write throughput, faster namenode, support for faster sorting and logging through a concurrent append feature, a native command line client much faster than hadoop fs, and global feedback-directed I/O device management. As QFS works out of the box with Hadoop, migrating data from HDFS to QFS involves simply executing hadoop distcp. QFS is being developed fully open-source and is available under an Apache license from https://github.com/quantcast/qfs. Multi-petabyte QFS instances have been in heavy production use since 2011.
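
The quoted 50% space savings and 2x write throughput follow from simple arithmetic once the code parameters are fixed; the figures below assume the 6 data + 3 parity Reed-Solomon layout commonly associated with QFS (the abstract itself does not state the parameters):

    # Physical bytes written per logical byte stored:
    replication_factor = 3.0              # HDFS-style 3x replication
    rs_data, rs_parity = 6, 3             # assumed Reed-Solomon 6+3 layout
    erasure_factor = (rs_data + rs_parity) / rs_data    # 1.5

    space_savings = 1 - erasure_factor / replication_factor   # 0.5 -> 50%
    write_speedup = replication_factor / erasure_factor       # 2.0x
    print(space_savings, write_speedup)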


Overview of Turn Data Management Platform for Digital Advertising

Hazem Elmeleegy (Turn Inc.), Yinan Li (Turn Inc.), Yan Qi (Turn Inc.), Peter Wilmot (Turn Inc.), Mingxi Wu (Turn Inc.), Santanu Kolay (Turn Inc.), Ali Dasdan (Turn Inc.), Songting Chen (Facebook Inc.)

This paper gives an overview of the Turn Data Management Platform (DMP). We explain the purpose of this type of platform, and show how it is positioned in the current digital advertising ecosystem. We also provide a detailed description of the key components in Turn DMP. These components cover the functions of (1) data ingestion and integration, (2) data warehousing and analytics, and (3) real-time data activation. For all components, we discuss the main technical and research challenges, as well as the alternative design choices. One of the main goals of this paper is to highlight the central role that data management plays in shaping this growing multi-billion-dollar industry.


Demo A: New Platforms

Location: Room Stampa



A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data

Ahmed Eldawy (University of Minnesota), Mohamed Mokbel (University of Minnesota)


Aggregate Profile Clustering for Telco Analytics

Mehmet Ali Abbasoğlu (Bilkent University), Buğra Gedik (Bilkent University), Hakan Ferhatosmanoglu (Bilkent University)


Parallel Graph Processing on Graphics Processors Made Easy

Jianlong Zhong (Nanyang Technological University), Bingsheng He (Nanyang Technological University)


Mosquito: Another One Bites the Data Upload STream

Stefan Richter (Saarland University), Jens Dittrich (Saarland University)


NoFTL: Database Systems on FTL-less Flash Storage

Sergey Hardock (TU-Darmstadt), Ilia Petrov (Reutlingen University), Robert Gottstein (TU-Darmstadt), Alejandro Buchmann (TU-Darmstadt)


EagleTree: Exploring the Design Space of SSD-Based Algorithms

Niv Dayan (IT University of Copenhagen), Martin Kjær Svendsen (IT University of Copenhagen), Matias Bjørling (IT University of Copenhagen), Philippe Bonnet (IT University of Copenhagen), Luc Bouganim (INRIA Rocquencourt and University of Versailles)


Flexible Query Processor on FPGAs

Mohammadreza Najafi (Technical University Munich), Mohammad Sadoghi (IBM T. J. Watson Research Center), Hans-Arno Jacobsen (University of Toronto)


A Demonstration of Iterative Parallel Array Processing in Support of Telescope Image Analysis

Matthew Moyers (University of Washington), Emad Soroush (University of Washington), Spencer Wallace (University of Arizona), Simon Krughoff (University of Washington), Jake Vanderplas (University of Washington), Magdalena Balazinska (University of Washington), Andrew Connolly (University of Washington)


Hone: "Scaling Down" Hadoop on Shared-Memory Systems

K. Ashwin Kumar (University of Maryland, College Park), Jonathan Gluck (University of Maryland, College Park), Amol Deshpande (University of Maryland, College Park), Jimmy Lin (University of Maryland, College Park)


REEF: Retainable Evaluator Execution Framework

Byung-Gon Chun (Microsoft), Tyson Condie (Microsoft), Carlo Curino (Microsoft), Raghu Ramakrishnan (Microsoft), Russell Sears (Microsoft), Markus Weimer (Microsoft)


OmniDB: Towards Portable and Efficient Query Processing on Parallel CPU/GPU Architectures

Shuhao Zhang (Nanyang Technological University), Jiong He (Nanyang Technological University), Bingsheng He (Nanyang Technological University), Mian Lu (A*STAR Institute of High Performance Computing)


DiAl: Distributed Streaming Analytics Anywhere, Anytime

Ivo Santos (Microsoft Research ATL Europe), Marcel Tilly (Microsoft Research ATL Europe), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)



Tuesday Aug 27th 14:00-15:30

Research 3: Cloud Databases

Location: Room 1000A

Chair: Ippokratis Pandis (IBM Almaden Research Center)



DAX: A Widely Distributed Multi-tenant Storage Service for DBMS Hosting

Rui Liu (University of Waterloo), Ashraf Aboulnaga (University of Waterloo), Kenneth Salem (University of Waterloo)

Many applications hosted on the cloud have sophisticated data management needs that are best served by a SQL-based relational DBMS. It is not difficult to run a DBMS in the cloud, and in many cases one DBMS instance is enough to support an application's workload. However, a DBMS running in the cloud (or even on a local server) still needs a way to persistently store its data and protect it against failures. One way to achieve this is to provide a scalable and reliable storage service that the DBMS can access over a network. This paper describes such a service, which we call DAX. DAX relies on multi-master replication and Dynamo-style flexible consistency, which enables it to run in multiple data centers and hence be disaster tolerant. Flexible consistency allows DAX to control the consistency level of each read or write operation, choosing between strong consistency at the cost of high latency or weak consistency with low latency, by applying protocols that we designed based on the characteristics of how a DBMS uses its storage tier. With these protocols, DAX provides a storage service that can host multiple DBMS tenants, scaling in the number of tenants and the required storage capacity and bandwidth. DAX also provides high availability and disaster tolerance for the DBMS storage tier. Experiments using the TPC-C benchmark show that DAX provides up to a factor of 4 performance improvement over baseline solutions that do not exploit flexible consistency.
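
The per-operation choice between strong and weak consistency can be pictured with generic Dynamo-style quorum arithmetic. The sketch below is only an illustrative stand-in with invented parameters; DAX's actual protocols are tailored to how a DBMS uses its storage tier:

    N = 3  # replicas per data item

    def quorums(strong: bool):
        """Pick read/write quorum sizes for one operation. R + W > N
        guarantees a read intersects the latest write (strong consistency,
        higher latency); R = 1 minimizes read latency but may return
        stale data (weak consistency)."""
        if strong:
            R, W = 2, 2          # R + W = 4 > N
        else:
            R, W = 1, 2          # fast reads, weaker guarantee
        return R, W

    print(quorums(strong=True), quorums(strong=False))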


Low-Latency Multi-Datacenter Databases Using Replicated Commit

Hatem Mahmoud (University of California, Santa Barbara), Faisal Nawab (University of California, Santa Barbara), Alexander Pucher (University of California, Santa Barbara), Divyakant Agrawal (University of California, Santa Barbara), Amr El Abbadi (University of California, Santa Barbara)

Web service providers have been using NoSQL datastores to provide scalability and availability for globally distributed data at the cost of sacrificing transactional guarantees. Recently, major web service providers like Google have moved towards building storage systems that provide ACID transactional guarantees for globally distributed data. For example, the newly published system, Spanner, uses Two-Phase Commit and Two-Phase Locking to provide atomicity and isolation for globally distributed data, running on top of Paxos to provide fault-tolerant log replication. We show in this paper that it is possible to provide the same ACID transactional guarantees for multi-datacenter databases with fewer cross-datacenter communication trips, compared to replicated logging, by using a more efficient architecture. Instead of replicating the transactional log, we replicate the commit operation itself, by running Two-Phase Commit multiple times in different datacenters, and we use Paxos to reach consensus among datacenters as to whether the transaction should commit. Doing so not only replaces several inter-datacenter communication trips with intra-datacenter communication trips, but also allows us to integrate atomic commitment and isolation protocols with consistent replication protocols so as to further reduce the number of cross-datacenter communication trips needed for consistent replication; for example, by eliminating the need for an election phase in Paxos. We analyze our approach in terms of communication trips to compare it against the replication log approach, then we conduct an extensive experimental study to compare the performance and scalability of both approaches under various multi-datacenter setups.
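
Structurally, the proposal can be caricatured as a majority vote over per-datacenter Two-Phase Commit outcomes. The Python sketch below is a deliberate simplification with invented names: failure handling and the actual Paxos roles are elided, and it only conveys where the votes come from and how the decision is taken.

    class Participant:
        """Toy transaction participant; a real one votes based on its
        locks and logs during the 2PC prepare phase."""
        def prepared(self, dc):
            return True

    def two_phase_commit_in_dc(dc, participants):
        # Stand-in for running 2PC among the transaction's replicas
        # inside one datacenter: returns that datacenter's commit vote.
        return all(p.prepared(dc) for p in participants)

    def replicated_commit(datacenters, participants):
        # Each datacenter votes via its local 2PC; Paxos (elided) makes
        # the outcome durable, and the transaction commits iff a
        # majority of datacenters voted to commit.
        votes = [two_phase_commit_in_dc(dc, participants) for dc in datacenters]
        return sum(votes) > len(datacenters) // 2

    print(replicated_commit(["us", "eu", "asia"], [Participant()]))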


RACE: A Scalable and Elastic Parallel System for Discovering Repeats in Very Long Sequences

Essam Mansour (King Abdullah University of Science and Technology), Ahmed El-Roby (University of Waterloo), Panos Kalnis (King Abdullah University of Science and Technology), Aron Ahmadia (Columbia University), Ashraf Aboulnaga (Qatar Computing Research Institute)

A wide range of applications, including bioinformatics, time series, and log analysis, depends on the identification of repetitions in very long sequences. Maximal pairs represent a superset of the most important types of repetitions. Existing maximal pair computation methods require both the input sequence and its index structure (which is at least an order of magnitude larger than the input) to fit in memory. Moreover, they are serial algorithms with prohibitively long execution time. Therefore, they are limited to small datasets, despite the fact that modern applications demand orders of magnitude longer sequences. In this paper we present RACE, a parallel system for finding maximal repeats in very long sequences. RACE supports parallel execution on stand-alone multi-core systems, but can scale to thousands of nodes on clusters or supercomputers. RACE does not require the input or the index to fit in memory; therefore, it supports very long sequences with limited memory. Moreover, it uses a novel array representation that traverses the tree in a cache-efficient manner. RACE is particularly suitable for the cloud (e.g., Amazon EC2) since, based on availability, it can scale elastically to more or fewer machines during its execution. Since scaling out introduces overheads, mainly due to load imbalance, we propose a cost model to estimate the expected speedup for a specific problem size and computing infrastructure, based on statistics gathered through sampling. The model allows the user to select the appropriate combination of cloud resources based on the provider's prices and the required deadline. We conducted an extensive experimental evaluation with large real datasets and large computing infrastructures. In contrast to existing methods, RACE can handle the entire human genome on a single machine with 16GB RAM. Moreover, for a problem that takes 10 hours of serial execution, RACE finishes in 28 seconds using 2,048 nodes on an IBM BlueGene/P supercomputer.


XORing Elephants: Novel Erasure Codes for Big Data

Maheswaran Sathiamoorthy (University of Southern California), Megasthenis Asteris (University of Southern California), Dimitris Papailiopoulos (University of Texas at Austin), Alexandros Dimakis (University of Texas at Austin), Ramkumar Vadali (Dropbox), Scott Chen (Facebook), Dhruba Borthakur (Facebook)

Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability. This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance. We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2x on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, they provide reliability that is orders of magnitude higher than that of replication.
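
The locality benefit rests on cheap XOR parities: a lost block is rebuilt from a small local group instead of a full Reed-Solomon stripe. A minimal illustration of the repair arithmetic (not the paper's actual code construction):

    def xor_blocks(*blocks):
        out = bytearray(len(blocks[0]))
        for b in blocks:
            for i, byte in enumerate(b):
                out[i] ^= byte
        return bytes(out)

    b1, b2, b3 = b"\x01\x02", b"\x0f\x00", b"\x10\x20"
    local_parity = xor_blocks(b1, b2, b3)

    # If b2 is lost, XOR the surviving local blocks with the parity,
    # touching only this small group rather than a whole stripe:
    repaired = xor_blocks(b1, b3, local_parity)
    assert repaired == b2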


Distribution-Based Query Scheduling

Yun Chi (NEC Laboratories America), Hakan Hacigumus (NEC Laboratories America), Wang-Pin Hsiung (NEC Laboratories America), Jeffrey Naughton (University of Wisconsin-Madison)

Query scheduling, a fundamental problem in database management systems, has recently received renewed attention, perhaps in part due to the rise of the “database as a service” (DaaS) model for database deployment. While there has been a great deal of work investigating different scheduling algorithms, there has been comparatively little work investigating what the scheduling algorithms can or should know about the queries to be scheduled. In this work, we investigate the efficacy of using histograms describing the distribution of likely query execution times as input to the query scheduler. We propose a novel distribution-based scheduling algorithm, Shepherd, and show that Shepherd substantially outperforms state-of-the-art point-based methods through extensive experimentation with both synthetic and TPC workloads.
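
To make the scheduler's input concrete: a histogram over likely execution times yields an expected running time per query, which a distribution-aware policy can exploit, for instance by running the query with the smallest expectation first. Shepherd itself is considerably more sophisticated; the sketch below, with made-up buckets, shows only the histogram arithmetic:

    def expected_time(histogram):
        """histogram: list of ((lo, hi), probability) buckets.
        Returns the expected execution time using bucket midpoints."""
        return sum((lo + hi) / 2 * p for (lo, hi), p in histogram)

    queries = {
        "q1": [((0, 2), 0.9), ((2, 10), 0.1)],   # usually fast
        "q2": [((0, 2), 0.1), ((2, 10), 0.9)],   # usually slow
    }
    # One simple distribution-based policy: shortest expected time first.
    order = sorted(queries, key=lambda q: expected_time(queries[q]))
    print(order)   # -> ['q1', 'q2']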


Tutorial 1

Location: Room 1000B

Chair: Gianni Mecca (Università della Basilicata)



Big Data Integration

Xin Luna Dong (Google Inc.) and Divesh Srivastava (AT&T Labs-Research)

The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data. BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands, (ii) many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

Bio: Xin Luna Dong is a senior research scientist at Google Inc. Prior to joining Google, she worked for AT&T Labs-Research. She received her Ph.D. from the University of Washington in 2007, a Master's degree from Peking University in China and a Bachelor's degree from Nankai University in China. Her research interests include databases, information retrieval and machine learning, with an emphasis on data integration, data cleaning, knowledge bases, and personal information management. She has led the Solomon project, whose goal is to detect copying between structured sources and to leverage the results in various aspects of data integration, and the Semex personal information management system, which received the Best Demo award (one of the top 3) at SIGMOD 2005. She co-chaired the CIKM demo track in 2013, the SIGMOD/PODS PhD Symposium in 2012-2013, QDB 2012, and WebDB 2010, and served as an area chair or senior PC member for ICDE 2013 and CIKM 2011.

Bio: Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech. from the Indian Institute of Technology, Bombay. He is an ACM fellow, on the board of trustees of the VLDB Endowment and an associate editor of the ACM Transactions on Database Systems. He has served as the program committee co-chair of many conferences, including VLDB 2007. His research interests and publications span a variety of topics in data management.


Research 4: Graph Data 1

Location: Room 300

Chair: Michael Böhlen (University of Zurich)



IS-LABEL: an Independent-Set based Labeling Scheme for Point-to-Point Distance Querying

Ada Wai-Chee Fu (Chinese University of Hong Kong), Huanhuan Wu (CUHK), James Cheng (CUHK), Raymond Chi-Wing Wong (Hong Kong University of Science and Technology)

We study the problem of computing shortest path or distance between two query vertices in a graph, which has numerous important applications. Quite a number of indexes have been proposed to answer such distance queries. However, all of these indexes can only process graphs of size barely up to 1 million vertices, which is rather small in view of many of the fast-growing real-world graphs today such as social networks and Web graphs. We propose an efficient index, which is a novel labeling scheme based on the independent set of a graph. We show that our method can handle graphs of size orders of magnitude larger than existing indexes.
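
Labeling schemes of this kind answer a distance query from precomputed per-vertex labels alone. The lookup step is sketched below for a generic hub-style label with invented numbers; the offline construction, which is where IS-LABEL's independent-set idea lives, is omitted:

    # label[v] maps hub vertices to their distance from v. A correct
    # construction guarantees that some hub on a shortest u-v path
    # appears in both labels.
    label = {
        "u": {"h1": 2, "h2": 5},
        "v": {"h1": 4, "h3": 1},
    }

    def query_distance(u, v):
        common = label[u].keys() & label[v].keys()
        return min(label[u][h] + label[v][h] for h in common)

    print(query_distance("u", "v"))   # -> 6, via hub h1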


Mining and Indexing Graphs For Supergraph Search

Dayu Yuan (Penn State University), Prasenjit Mitra (Penn State University), C. Lee Giles (Penn State University)

We study supergraph search (SPS), that is, given a query graph q and a graph database G that contains a collection of graphs, return graphs that have q as a supergraph from G. SPS has broad applications in bioinformatics, cheminformatics and other scientific and commercial fields. Determining whether a graph is a subgraph (or supergraph) of another is an NP-complete problem. Hence, it is intractable to compute SPS for large graph databases. Two separate indexing methods, a “filter + verify”-based method and a “prefix-sharing”-based method, have been studied to efficiently compute SPS. To implement the above two methods, subgraph patterns are mined from the graph database to build an index. Those subgraphs are mined to optimize either the filtering gain or the prefix-sharing gain. However, no single subgraph-mining algorithm considers both gains. This work is the first to mine subgraphs to optimize both the filtering gain and the prefix-sharing gain while processing SPS queries. First, we show that the subgraph-mining problem is NP-hard. Then, we propose two polynomial-time algorithms to solve the problem with approximation ratios of 1−1/e and 1/4, respectively. In addition, we construct a lattice-like index, LW-index, to organize the selected subgraph patterns for fast index lookup. Our experiments show that our approach improves the query processing time for SPS queries by a factor of 3 to 10.


NeMa: Fast Graph Search with Label Similarity

Arijit Khan (University of California, Santa Barbara), Yinghui Wu (University of California, Santa Barbara), Charu Aggarwal (IBM T. J. Watson Research Center), Xifeng Yan (University of California, Santa Barbara)

It is increasingly common to find real-life data represented as networks of labeled, heterogeneous entities. To query these networks, one often needs to identify the matches of a given query graph in a (typically large) network modeled as a target graph. Due to noise and the lack of fixed schema in the target graph, the query graph can substantially differ from its matches in the target graph in both structure and node labels, thus bringing challenges to the graph querying tasks. In this paper, we propose NeMa (Network Match), a neighborhood-based subgraph matching technique for querying real-life networks. (1) To measure the quality of the match, we propose a novel subgraph matching cost metric that aggregates the costs of matching individual nodes, and unifies both structure and node label similarities. (2) Based on the metric, we formulate the minimum cost subgraph matching problem. Given a query graph and a target graph, the problem is to identify the (top-k) matches of the query graph with minimum costs in the target graph. We show that the problem is NP-hard, and also hard to approximate. (3) We propose a heuristic algorithm for solving the problem based on an inference model. In addition, we propose optimization techniques to improve the efficiency of our method. (4) We empirically verify that NeMa is both effective and efficient compared to the keyword search and various state-of-the-art graph querying techniques.


A Distributed Graph Engine for Web Scale RDF Data

Kai Zeng (UCLA), Jiacheng Yang (Columbia University), Haixun Wang (Microsoft Research), Bin Shao (Microsoft Research Asia), Zhongyuan Wang (Microsoft Research Asia)

Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web scale RDF data effectively. Furthermore, many useful and general purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to maximize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. It achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real life, web scale RDF data to demonstrate the effectiveness of our approach.


Top-K Nearest Keyword Search on Large Graphs

Miao Qiao (CUHK), Lu Qin, Hong Cheng (The Chinese University of Hong Kong), Jeffrey Yu (Chinese University of Hong Kong)

It is quite common for networks emerging nowadays to have labels or textual contents on the nodes. On such networks, we study the problem of top-k nearest keyword (k-NK) search. In a network G modeled as an undirected graph, each node is attached with zero or more keywords, and each edge is assigned with a weight measuring its length. Given a query node q in G and a keyword lambda, a k-NK query seeks k nodes which contain lambda and are nearest to q. k-NK is not only useful as a stand-alone query but also as a building block for tackling complex graph pattern matching problems. The key to an accurate k-NK result is a precise shortest distance estimation in a graph. Based on the latest distance oracle technique, we build a shortest path tree for a distance oracle and use the tree distance as a more accurate estimation. With such representation, the original k-NK query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We propose two efficient algorithms to report the exact k-NK result on a tree. One is query time optimized for a scenario when a small number of result nodes are of interest to users. The other handles k-NK queries for an arbitrarily large k efficiently. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time. Extensive experimental results conform with our theoretical findings, and demonstrate the effectiveness and efficiency of our k-NK algorithms on large real graphs.
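
For contrast with the index-based approach, a k-NK query can always be answered by a best-first (Dijkstra) traversal from q that stops after k keyword-bearing nodes have been popped; avoiding exactly this per-query exploration is the point of the paper's tree-based distance estimation. A naive baseline on a toy directed graph:

    import heapq

    def knk_baseline(graph, keywords, q, lam, k):
        """graph: node -> [(neighbor, weight)]; keywords: node -> set of terms.
        Returns up to k nodes containing lam, nearest to q first."""
        dist, heap, out = {q: 0}, [(0, q)], []
        while heap and len(out) < k:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue                      # stale heap entry
            if lam in keywords.get(u, set()):
                out.append((u, d))
            for v, w in graph.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return out

    g = {"q": [("a", 1), ("b", 4)], "a": [("b", 1)], "b": []}
    print(knk_baseline(g, {"a": {"cafe"}, "b": {"cafe"}}, "q", "cafe", 2))
    # -> [('a', 1), ('b', 2)]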


Industry 2: Knowledge

Location: Room 120

Chair: Eric Simon (SAP)



Online, Asynchronous Schema Change in F1

Ian Rae (University of Wisconsin-Madison), Eric Rollins (Google), Jeff Shute (Google), Sukhdeep Sodhi (Google), Radek Vingralek (Google)

Large-scale, distributed database systems often store critical business data and have stringent availability requirements. We introduce a protocol for schema evolution in these systems that is asynchronous -- it allows different servers in the database system to transition to a new schema at different times -- and online -- all servers can access and update all data during a schema change. We provide a formal model for determining the correctness of schema changes under these conditions, and we demonstrate that many common schema changes can cause anomalies and database corruption. We avoid these problems by replacing corruption-causing schema changes with a sequence of schema changes that is guaranteed to avoid corrupting the database so long as all servers are no more than one schema version behind at any time. Finally, we discuss a practical implementation of our protocol in F1, the database system that stores data for Google AdWords.


WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis

Kedar Bellare (Facebook), Carlo Curino (Microsoft), Ashwin Machanavajjhala (Duke University), Peter Mika (Yahoo! Labs Barcelona), Mandar Rahurkar (Yahoo! Labs), Aamod Sane (Yahoo!)

Search, exploration and social experience on the Web have recently undergone tremendous changes with search engines, web portals and social networks offering a different perspective on information discovery and consumption. This new perspective is aimed at capturing user intents, and providing richer and highly connected experiences. The new battleground revolves around technologies for the ingestion, disambiguation and enrichment of entities from a variety of structured and unstructured data sources – we refer to this process as knowledge base synthesis. This paper presents the design, implementation and production deployment of the Web Of Objects (WOO) system, a Hadoop-based platform tackling such challenges. WOO has been designed and implemented to enable various products in Yahoo! to synthesize knowledge bases (KBs) of entities relevant to their domains. Currently, the implementation of WOO we describe is used by various Yahoo! properties such as IntoNow, Yahoo! Local, Yahoo! Events and Yahoo! Search. This paper highlights: (i) challenges that arise in designing, building and operating a platform that handles multi-domain, multi-version, and multi-tenant disambiguation of web-scale knowledge bases (hundreds of millions of entities), (ii) the architecture and technical solutions we devised, and (iii) an evaluation on real-world production datasets.


Entity Extraction, Linking, Classification, and Tagging for Social Media: a Wikipedia-Based Approach

Abhishek Gattani (WalmartLabs), Digvijay S. Lamba (WalmartLabs), Nikesh Garera (WalmartLabs), Mitul Tiwari (LinkedIn), Xiaoyong Chai (WalmartLabs), Sanjib Das (University of Wisconsin-Madison), Sri Subramaniam (WalmartLabs), Anand Rajaraman (Cambrian Ventures), Venky Harinarayan (Cambrian Ventures), AnHai Doan (University of Wisconsin-Madison, WalmartLabs)

Many applications that process social data, such as tweets, must extract entities from tweets (e.g., “Obama” and “Hawaii” in “Obama went to Hawaii”), link them to entities in a knowledge base (e.g., Wikipedia), classify tweets into a set of predefined topics, and assign descriptive tags to tweets. Few solutions exist today to solve these problems for social data, and they are limited in important ways. Further, even though several industrial systems such as OpenCalais have been deployed to solve these problems for text data, little if anything has been published about them, and it is unclear if any of the systems has been tailored for social media. In this paper we describe in depth an end-to-end industrial system that solves these problems for social data. The system has been developed and used heavily in the past three years, first at Kosmix, a startup, and later at WalmartLabs. We show how our system uses a Wikipedia-based global “real-time” knowledge base that is well suited for social data, how we interleave the tasks in a synergistic fashion, how we generate and use contexts and social signals to improve task accuracy, and how we scale the system to the entire Twitter firehose. We describe experiments that show that our system outperforms current approaches. Finally we describe applications of the system at Kosmix and WalmartLabs, and lessons learned.


Unicorn: A System for Searching the Social Graph

Michael Curtiss (Facebook), Iain Becker (Facebook), Tudor Bosman (Facebook), Sergey Doroshenko (Facebook), Lucian Grijincu (Facebook), Tom Jackson (Facebook), Sandhya Kunnatur (Facebook), Soren Lassen (Facebook), Philip Pronin (Facebook), Sriram Sankar (Facebook), Guanghao Shen (Facebook), Gintaras Woss (Facebook), Chao Yang (Facebook), Ning Zhang (Facebook)

Unicorn is an online, in-memory social graph-aware indexing system designed to search trillions of edges between tens of billions of users and entities on thousands of commodity servers. Unicorn is based on standard concepts in information retrieval, but it includes features to promote results with good social proximity. It also supports queries that require multiple round-trips to leaves in order to retrieve objects that are more than one edge away from source nodes. Unicorn is designed to answer billions of queries per day at latencies in the hundreds of milliseconds, and it serves as an infrastructural building block for Facebook's Graph Search product. In this paper, we describe the data model and query language supported by Unicorn. We also describe its evolution as it became the primary backend for Facebook's search offerings.


Demo B: Personal, Social, and Web Data

Location: Room Stampa



DesTeller: A System for Destination Prediction Based on Trajectories with Privacy Protection

Andy Yuan Xue (University of Melbourne), Rui Zhang (University of Melbourne), Yu Zheng (Microsoft Research Asia), Xing Xie (Microsoft Research Asia, China), Jianhui Yu (South China Normal University), Yong Tang (South China Normal University)


GroupFinder: A New Approach to Top-K Point-of-Interest Group Retrieval

Kenneth Bøgh (Aarhus University), Anders Skovsgaard (Aarhus University), Christian S. Jensen (Aarhus University)


CrowdMiner: Mining association rules from the crowd

Yael Amsterdamer (Tel Aviv University), Yael Grossman (Tel Aviv University), Tova Milo (Tel Aviv University), Pierre Senellart (Télécom ParisTech)


TeRec: A Temporal Recommender System Over Tweet Stream

Chen Chen (Peking University), Hongzhi Yin (Peking University), Junjie Yao (Peking University), Bin Cui (Peking University)


iRoad: A Framework For Scalable Predictive Query Processing On Road Networks

Abdeltawab Hendawi (University of Minnesota), Jie Bao (University of Minnesota), Mohamed Mokbel (University of Minnesota)


SmartMonitor: Using Smart Devices to Perform Structural Health Monitoring

Dimitrios Kotsakos (University of Athens), Panos Sakkos (University of Athens), Vana Kalogeraki (Athens University of Economics and Business), Dimitrios Gunopulos (University of Athens)


EnviroMeter: A Platform for Querying Community-Sensed Data

Saket Sathe (EPFL), Arthur Oviedo (EPFL), Dipanjan Chakraborty (IBM Research - India), Karl Aberer (EPFL)


EvenTweet: Online Localized Event Detection from Twitter

Hamed Abdelhaq (Heidelberg University), Christian Sengstock (Heidelberg University), Michael Gertz (Heidelberg University)


PhotoStand: A Map Query Interface for a Database of News Photos

Hanan Samet (University of Maryland), Marco D. Adelfio (University of Maryland), Brendan C. Fruin (University of Maryland), Michael D. Lieberman (University of Maryland), Jagan Sankaranarayanan (University of Maryland)


Ringtail: A Generalized Nowcasting System

Dolan Antenucci (University of Michigan), Erdong Li (University of Michigan), Shaobo Liu (University of Michigan), Bochun Zhang (University of Michigan), Mike Cafarella (University of Michigan), Christopher Re (University of Wisconsin-Madison)


IPS: An Interactive Package Configuration System for Trip Planning

Min Xie (University of British Columbia), Laks V. S. Lakshmanan (University of British Columbia), Peter Wood (Birkbeck, University of London)


R2-D2: a System to Support Probabilistic Path Prediction in Dynamic Environments

Jingbo Zhou (National University of Singapore), Anthony K.H. Tung (National University of Singapore), Wei Wu (I2R), Wee Siong Ng (I2R)



Tuesday Aug 27th 16:00-17:30

Research 5: Social and Crowd

Location: Room 1000A

Chair: Ken Salem (University of Waterloo)



Piggybacking on social networks

Aristides Gionis (Aalto University), Flavio Junqueira (Microsoft Research Cambridge, UK), Vincent Leroy (University of Grenoble - CNRS, France), Marco Serafini (QCRI), Ingmar Weber (QCRI)

The popularity of social-networking sites has increased rapidly over the last decade. One of the most fundamental functionalities of social-networking sites is to present users with streams of events shared by their friends. At a systems level, materialized per-user views are a common way to assemble and deliver such event streams on-line and with low latency. Access to the data stores, which keep the user views, is a major bottleneck of social-networking systems. We propose improving the throughput of a system by using social piggybacking: process the requests of two friends by querying and updating the view of a third common friend. By using one such hub view, the system can serve requests of the first friend without querying or updating the view of the second. We show that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem. We propose an O(log(n)) approximation algorithm and a heuristic to solve the problem, and evaluate them using the full Twitter and Flickr social graphs, which have up to billions of edges. Compared to existing approaches, using social piggybacking results in similar throughput in systems with few servers, but enables substantial throughput improvements as the size of the system grows, reaching up to a factor of 2. We also evaluate our algorithms on a real social-networking system prototype and we show that the actual increase in throughput corresponds nicely to the gain anticipated by our cost function.


Answering Planning Queries with the Crowd

Haim Kaplan (Tel Aviv University), Ilia Lotosh (Tel Aviv University), Tova Milo (Tel Aviv University), Slava Novgorodov (Tel Aviv University)

Recent research has shown that crowd sourcing can be used effectively to solve problems that are difficult for computers, e.g., optical character recognition and identification of the structural configuration of natural proteins. In this paper we propose to use the power of the crowd to address yet another difficult problem that frequently occurs in daily life - answering planning queries whose output is a sequence of objects/actions, when the goal, i.e., the notion of "best output", is hard to formalize. For example, planning the sequence of places/attractions to visit in the course of a vacation, where the goal is to enjoy the resulting vacation the most, or planning the sequence of courses to take in academic schedule planning, where the goal is to obtain solid knowledge of a given subject domain. Such goals may be easily understandable by humans, but hard or even impossible to formalize for a computer. We present a novel algorithm for efficiently harnessing the crowd to assist in answering such planning queries. The algorithm builds the desired plans incrementally, choosing at each step the 'best' questions so that the overall number of questions that need to be asked is minimized. We prove the algorithm to be optimal within its class and demonstrate experimentally its effectiveness and efficiency.


Query Optimization over Crowdsourced Data

Hyunjung Park (Stanford University), Jennifer Widom (Stanford University)

Deco is a comprehensive system for answering declarative queries posed over stored relational data together with data obtained on-demand from the crowd. In this paper we describe Deco's cost-based query optimizer, building on Deco's data model, query language, and query execution engine presented earlier. Deco's objective in query optimization is to find the best query plan to answer a query, in terms of estimated monetary cost. Deco's query semantics and plan execution strategies require several fundamental changes to traditional query optimization. Novel techniques incorporated into Deco's query optimizer include a cost model distinguishing between "free" existing data versus paid new data, a cardinality estimation algorithm coping with changes to the database state during query execution, and a plan enumeration algorithm maximizing reuse of common subplans in a setting that makes reuse challenging. We experimentally evaluate Deco's query optimizer, focusing on the accuracy of cost estimation and the efficiency of plan enumeration.


Counting with the Crowd

Adam Marcus (Locu/MIT CSAIL), David Karger (MIT CSAIL), Sam Madden (MIT CSAIL), Robert Miller (MIT CSAIL), Sewoong Oh (University of Illinois at Urbana-Champaign)

In this paper, we address the problem of selectivity estimation in a crowdsourced database. Specifically, we develop several techniques for using workers on a crowdsourcing platform like Amazon's Mechanical Turk to estimate the fraction of items in a dataset (e.g., a collection of photos) that satisfy some property or predicate (e.g., photos of trees). We do this without explicitly iterating through every item in the dataset. This is important in crowdsourced query optimization to support predicate ordering and in query evaluation, when performing a GROUP BY operation with a COUNT or AVG aggregate. We compare sampling item labels, a traditional approach, to showing workers a collection of items and asking them to estimate how many satisfy some predicate. Additionally, we develop techniques to eliminate spammers and colluding attackers trying to skew selectivity estimates when using this count estimation approach. We find that for images, counting can be much more effective than sampled labeling, reducing the amount of work necessary to arrive at an estimate that is within 1% of the true fraction by up to an order of magnitude, with lower worker latency. We also find that sampled labeling outperforms count estimation on a text processing task, presumably because people are better at quickly processing large batches of images than they are at reading strings of text. Our spammer detection technique, which is applicable to both the label- and count-based approaches, can improve accuracy by up to two orders of magnitude.


Question Selection for Crowd Entity Resolution

Steven Whang (Google Research), Peter Lofgren (Stanford University), Hector Garcia-Molina (Stanford University)

We study the problem of enhancing Entity Resolution (ER) with the help of crowdsourcing. ER is the problem of clustering records that refer to the same real-world entity and can be an extremely difficult process for computer algorithms alone. For example, figuring out which images refer to the same person can be a hard task for computers, but an easy one for humans. We study the problem of resolving records with crowdsourcing where we ask questions to humans in order to guide ER into producing accurate results. Since human work is costly, our goal is to ask as few questions as possible. We propose a probabilistic framework for ER that can be used to estimate how much ER accuracy we obtain by asking each question and select the best question with the highest expected accuracy. Computing the expected accuracy is NP-hard, so we propose approximation techniques for efficient computation. We evaluate our best question algorithms on real and synthetic datasets and demonstrate how we can obtain high ER accuracy while significantly reducing the number of questions asked to humans.


Tutorial 2

Location: Room 1000B

Chair: Divesh Srivastava (AT&T Labs Research)



Towards Database Virtualization for Database as a Service

Aaron J. Elmore (University of California Santa Barbara), Carlo Curino (Microsoft CISL), Divyakant Agrawal (University of California Santa Barbara), Amr El Abbadi (University of California Santa Barbara)

Advances in operating system and storage-level virtualization technologies have enabled the effective consolidation of heterogeneous applications in a shared cloud infrastructure. Novel research challenges arising from this new shared environment include load balancing, workload estimation, resource isolation, machine replication, live migration, and an emergent need for automation to handle large scale operations with minimal manual intervention. Given that databases are at the core of most applications that are deployed in the cloud, database management systems (DBMSs) represent a very important technology component that needs to be virtualized in order to realize the benefits of virtualization from autonomic management of data-intensive applications in large scale data-centers. This tutorial is organized in three parts. In part one, we provide a general background on the current state of the art of virtualization technologies in modern cloud environments, covering in depth some crucial advances. In part two, we explore consolidated shared storage systems and virtualization techniques used to ensure fair access between applications. In part three, we focus on some of the shortcomings of general purpose operating systems and storage-level virtualization technologies when applied to DBMSs and discuss recent research and development in this space. We will touch on several open problems, providing pointers to areas of research we expect to be growing and for which simple naive solutions are not likely to be very effective. The goal of this tutorial is to survey the techniques used in providing elasticity in virtual machine systems and shared storage systems, and to survey database research on multitenant architectures and elasticity primitives. This foundation of core Database as a Service advances, together with a primer of important related topics in OS and storage-level virtualization, is central for anyone who wants to operate in this area of research. At the end of the tutorial we expect attendees to be well oriented in this exciting research area and ready to participate in it.

Bio: Aaron J. Elmore is currently a PhD candidate at the University of California, Santa Barbara. He has an MS in computer science from the University of Chicago. His research interests involve cloud computing, multitenant databases, and ecoinformatics.

Bio: Carlo A. Curino received a PhD from Politecnico di Milano, and spent two years as a Post Doc Associate at CSAIL MIT leading the relational cloud project. He worked at Yahoo! Research as a Research Scientist focusing on mobile/cloud platforms and entity deduplication at scale. Carlo is currently a Senior Scientist at Microsoft in the recently formed Cloud and Information Services Lab (CISL) where he is working on big-data platforms and cloud computing.

Bio: Divyakant Agrawal is a Professor of Computer Science and the Director of Engineering Computing Infrastructure at the University of California at Santa Barbara. His research expertise is in the areas of database systems, distributed computing, data warehousing, and large-scale information systems. He currently serves as the Editor-in-Chief of Distributed and Parallel Databases and is on the editorial boards of the ACM Transactions on Database Systems and IEEE Transactions of Knowledge and Data Engineering. He serves on the Board of Trustees of the VLDB Endowment and on the Executive Committee of ACM Special Interest Group SIGSPATIAL. Dr. Agrawal is a Fellow of ACM and a Fellow of IEEE.

Bio: Amr El Abbadi is a Professor of Computer Science at the University of California, Santa Barbara. He received his B. Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. Prof. El Abbadi is an ACM Fellow, an AAAS Fellow and was Chair of the Computer Science Department at UCSB from 2007 to 2011. He has served as a journal editor for several database journals, including, currently, The VLDB Journal. He has been Program Chair for multiple database and distributed systems conferences, most recently SIGSPATIAL GIS 2010 and ACM Symposium on Cloud Computing (SoCC) 2011, COMAD India 2012 and ACM COSN (Conference On Social Networks) 2013.


Research 6: Web Data and Information Dissemination

Location: Room 300

Chair: Ivan Bedini (Trento RISE)



DisC Diversity: Result Diversification based on Dissimilarity and Coverage

Marina Drosou (University of Ioannina), Evaggelia Pitoura (University of Ioannina)

Recently, result diversification has attracted a lot of attention as a means to improve the quality of results retrieved by user queries. In this paper, we propose a new, intuitive definition of diversity called DisC diversity. A DisC diverse subset of a query result contains objects such that each object in the result is represented by a similar object in the diverse subset and the objects in the diverse subset are dissimilar to each other. We show that locating a minimum DisC diverse subset is an NP-hard problem and provide heuristics for its approximation. We also propose adapting DisC diverse subsets to a different degree of diversification. We call this operation zooming. We present efficient implementations of our algorithms based on the M-tree, a spatial index structure, and experimentally evaluate their performance.
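
The definition invites a simple greedy heuristic: keep an object only if it is farther than the radius r from everything already kept, so the kept objects are mutually dissimilar and every result object has a similar representative. The authors' M-tree-based algorithms are far more efficient; this sketch with an invented one-dimensional distance merely fixes the idea:

    def greedy_disc(objects, r, dist):
        """Return a DisC-style diverse subset: selected objects are pairwise
        more than r apart, and every object is within r of a selected one."""
        selected = []
        for o in objects:
            if all(dist(o, s) > r for s in selected):
                selected.append(o)
        return selected

    points = [0.0, 0.4, 1.0, 1.1, 2.5]
    print(greedy_disc(points, r=0.5, dist=lambda a, b: abs(a - b)))
    # -> [0.0, 1.0, 2.5]; 0.4 and 1.1 are covered by nearby selections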


Ratio Threshold Queries over Distributed Data Sources

Rajeev Gupta (IBM Research - India), Krithi Ramamritham (Indian Institute of Technology Bombay), Mukesh Mohania (IBM Research - India)

Continuous aggregation queries over dynamic data are used for real-time decision making and timely business intelligence. In this paper we consider queries where a client wants to be notified if the ratio of two aggregates over distributed data crosses a specified threshold. Consider these scenarios: a mechanism designed to defend against distributed denial of service attacks may be triggered when the fraction of packets arriving to a subnet is more than 5% of the total packets; or a distributed store chain withdraws its discount on luxury goods when sales of luxury goods constitute more than 20% of the overall sales. The challenge in executing such ratio threshold queries (RTQs) lies in incurring the minimal amount of communication necessary for propagation of updates from data sources to the aggregator node where the client query is executed. We address this challenge by proposing schemes for converting the client ratio threshold condition into conditions on individual distributed data sources. Whenever the condition associated with a source is violated, the source pushes its data values to the aggregator, which in turn pulls data values from other sources to determine whether the client threshold condition is indeed violated. We present algorithms to minimize the number of source condition violations (i.e., the number of pushes) while ensuring that no violation of the client threshold condition is missed. Further, in case of a source condition violation, we propose efficient selective pulling algorithms for intelligently choosing additional sources whose data should be pulled by the aggregator. Using performance evaluation on synthetic and real traces of data updates we show that our algorithms result in up to an order of magnitude fewer messages compared to existing approaches in the literature.
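
The decomposition rests on rewriting the ratio condition as a sum of per-source terms: for positive D, N/D >= t holds iff sum_i (n_i - t * d_i) >= 0, so each source can monitor only its own term and push when it drifts past a locally assigned slack. The sketch below shows this arithmetic with invented numbers and a deliberately naive even slack split; the paper's algorithms choose and adapt the slacks to minimize pushes:

    TAU = 0.05   # client threshold: notify when N/D crosses 5%

    # Per-source (n_i, d_i) counts as of the last synchronization:
    sources = {"s1": (10, 400), "s2": (5, 300), "s3": (8, 200)}

    def margin(counts):
        # N/D >= TAU  <=>  sum_i (n_i - TAU * d_i) >= 0   (for D > 0)
        return sum(n - TAU * d for n, d in counts.values())

    m = margin(sources)               # here: 23 - 0.05 * 900 = -22
    print("threshold crossed:", m >= 0)

    # Naive slack assignment: split the distance to the threshold evenly;
    # a source pushes only when its own term drifts by more than its share.
    slack_per_source = abs(m) / len(sources)
    print("per-source slack:", slack_per_source)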


Distributed Time-aware Provenance

Wenchao Zhou (Georgetown University), Suyog Mapara (University of Pennsylvania), Yiqing Ren (University of Pennsylvania), Yang Li (University of Pennsylvania), Andreas Haeberlen (University of Pennsylvania), Zachary Ives (University of Pennsylvania), Boon Thau Loo (University of Pennsylvania), Micah Sherr (Georgetown University)

The ability to reason about changes in a distributed system’s state enables network administrators to better diagnose protocol misconfigurations, detect intrusions, and pinpoint performance bottlenecks. We propose a novel provenance model called Distributed Time-aware Provenance (DTaP) that aids distributed system forensics and debugging by explicitly representing time, distributed state, and state changes. Using a distributed Datalog abstraction for modeling distributed protocols, we prove that the DTaP model provides a sound and complete representation that correctly captures dependencies among events in a distributed system. We additionally introduce DistTape, an implementation of the DTaP model that uses novel distributed storage structures, query processing, and cost-based optimization techniques to efficiently query time-aware provenance in a distributed setting. Using two example systems (declarative network routing and Hadoop MapReduce), we demonstrate that DistTape can efficiently maintain and query time-aware provenance at low communication and computation cost.


TripleBit: a Fast and Compact System for Large Scale RDF Data

Pingpeng Yuan (HUST), Pu Liu (Huazhong Univ. of Sci. & Tech.), Buwen Wu (Huazhong University of Science and Technology), Ling Liu (Georgia Institute of Technology), Hai Jin (HUST), Wenya Zhang (Huazhong Univ. of Sci. & Tech.)

The volume of RDF data has continued to grow over the past decade and many known RDF datasets have billions of triples. A grand challenge in managing this huge RDF data is how to access it efficiently. A popular approach to addressing the problem is to build a full set of permutations of (S, P, O) indexes. Although this approach has been shown to accelerate joins by orders of magnitude, the large space overhead limits the scalability of this approach and makes it heavyweight. In this paper, we present TripleBit, a fast and compact system for storing and accessing RDF data. The design of TripleBit has three salient features. First, the compact design of TripleBit reduces both the size of stored RDF data and the size of its indexes. Second, TripleBit introduces two auxiliary index structures, ID-Chunk bit matrix and ID-Predicate bit matrix, to minimize the cost of index selection during query evaluation. Third, its query processor dynamically generates an optimal execution ordering for join queries, leading to fast query execution and effective reduction in the size of intermediate results. Our experiments show that TripleBit outperforms RDF-3X, MonetDB, and BitMat on LUBM, UniProt and BTC 2012 benchmark queries, and it offers orders of magnitude performance improvement for some complex join queries.


Extraction and Integration of Partially Overlapping Web Sources

Mirko Bronzi (Università Roma Tre), Valter Crescenzi (Università Roma Tre), Paolo Merialdo (Università Roma Tre), Paolo Papotti (QCRI)

We present an unsupervised approach for harvesting the data exposed by a set of structured and partially overlapping data-intensive web sources. Our proposal is developed within a formal framework that tackles two problems: the data extraction problem, to generate extraction rules based on the input websites, and the data integration problem, to integrate the extracted data in a unified schema. We introduce an original algorithm, Weir, to solve the stated problems and formally prove its correctness. Weir leverages the overlapping data among sources to make better decisions both in the data extraction (by pruning rules that do not lead to redundant information) and in the data integration (by reflecting local properties of a source over the mediated schema). Along the way, we characterize the amount of redundancy needed by our algorithm to produce a solution, and present experimental results to show the benefits of our approach with respect to existing solutions.


Industry 3: Scalability

Location: Room 120

Chair: Arnab Nandi (Ohio State)



Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce

Ablimit Aji (Emory University), Fusheng Wang (Emory University), Hoang Vo (Emory University), Rubao Lee (The Ohio State University), Qiaoling Liu (Emory University), Xiaodong Zhang (The Ohio State University), Joel Saltz (Emory University)

Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS -- a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through space partitioning, the customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects on MapReduce. Hadoop-GIS takes advantage of global partition indexing and customizable on-demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS in query response and its high scalability on commodity clusters. Comparative experiments have shown that Hadoop-GIS performs close to a parallel SDBMS, outperforms SDBMS for compute-intensive queries, offers much flexibility in query optimization, and is highly cost-effective. The system has been deployed or tested for multiple real-world applications, and is available both as a library for processing spatial queries and as an integrated software package with Hive.


Statistics Collection in Oracle Spatial and Graph: Fast Histogram Construction for Complex Geometry Objects

Bhuvan Bamba (Oracle America Inc.), Siva Ravada (Oracle America Inc.), Ying Hu (Oracle America Inc.), Richard Anderson (Oracle America Inc.)

Oracle Spatial and Graph is a geographic information system (GIS) which provides users the ability to store spatial data alongside conventional data in Oracle. As a result of the coexistence of spatial and other data, we observe a trend towards users performing increasingly complex queries which involve spatial as well as non-spatial predicates. Accurate selectivity values, especially for queries with multiple predicates requiring joins among numerous tables, are essential for the database optimizer to determine a good execution plan. For queries involving spatial predicates, this requires that reasonably accurate statistics collection has been performed on the spatial data. For extensible data cartridges such as Oracle Spatial and Graph, the optimizer expects to receive accurate predicate selectivity and cost values from functions implemented within the data cartridge. Although statistics collection for spatial data has been researched in academia for a few years, to the best of our knowledge this is the first work to present spatial statistics collection implementation details for a commercial GIS database. In this paper, we describe our experiences with the implementation of statistics collection methods for complex geometry objects within Oracle Spatial and Graph. Firstly, we exemplify issues with previous partitioning-based algorithms in the presence of complex geometry objects and suggest enhancements which resolve the issues. Secondly, we propose a main memory implementation which not only speeds up the disk-based partitioning algorithms but also utilizes existing R-tree indexes to provide surprisingly accurate selectivity estimates. Last but not least, we provide extensive experimental results and an example study which display the efficacy of our approach on Oracle query performance.


Scuba: Diving into Data at Facebook

Lior Abraham (Interana), John Allen (Addepar), Oleksandr Barykin (Facebook), Vinayak Borkar (UC Irvine), Bhuwan Chopra (Facebook), Ciprian Gerea (Facebook), Daniel Merl (Facebook), Josh Metzler (Facebook), David Reiss (Facebook), Subbu Subramanian (Facebook), Janet L. Wiener (Facebook), Okay Zed (Rdio)

Facebook takes performance monitoring seriously. Any issue can impact over a billion users, so we track thousands of servers, hundreds of PB of daily network traffic, hundreds of daily code changes, and many other metrics. We require latencies of about a minute from events occurring (a client request on a phone, a bug report filed, a code change checked in) to graphs showing those events on developers' monitors. Scuba is the data management system Facebook uses for most real-time analysis. Scuba is a fast, scalable, distributed, in-memory database built at Facebook. It currently ingests millions of rows (events) per second and expires data at the same rate. Scuba stores data completely in memory on hundreds of servers, each with 144 GB of RAM. To process each query, Scuba aggregates data from all servers. Scuba processes almost a million queries per day. Scuba is used extensively for interactive, ad hoc analysis queries that run in under a second over live data. In addition, Scuba is the workhorse behind Facebook's code regression analysis, bug report monitoring, ads revenue monitoring, and performance debugging.
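
The query model is classic scatter-gather. A toy sketch in Python (field names and the Counter-based merge are illustrative; Scuba's ingestion, expiry, sampling, and wire protocol are not shown):

    # Every server aggregates its own in-memory rows; a coordinator merges
    # the partial aggregates into the final group-by result.
    from collections import Counter

    def local_aggregate(rows, group_key, measure):
        part = Counter()
        for row in rows:
            part[row[group_key]] += row[measure]
        return part

    def coordinate(per_server_rows, group_key, measure):
        total = Counter()
        for rows in per_server_rows:           # one entry per server
            total.update(local_aggregate(rows, group_key, measure))
        return total

    servers = [
        [{"endpoint": "/home", "ms": 12}, {"endpoint": "/feed", "ms": 30}],
        [{"endpoint": "/home", "ms": 9}],
    ]
    print(coordinate(servers, "endpoint", "ms"))
    # Counter({'/feed': 30, '/home': 21})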


Adaptive and Big Data Scale Parallel Execution in Oracle

Srikanth Bellamkonda (Oracle USA), Huagang Li (Oracle USA), Unmesh Jagtap (Oracle USA), Yali Zhu (Oracle USA), Thierry Cruanes (Oracle USA), Vince Liang (MeLLmo Inc.)

This paper showcases some of the newly introduced parallel execution methods in the Oracle RDBMS. These methods provide highly scalable and adaptive parallel evaluation for the most commonly used SQL operations – joins, group-by, rollup/cube, grouping sets, and analytic window functions. The novelty of these techniques lies in their use of multi-stage parallelization models, their accommodation of optimizer mistakes, and their runtime parallelization and data distribution decisions. These parallel plans adapt based on statistics gathered on the real data at query execution time. We realized enormous performance gains from these novel parallelization techniques. The paper also discusses our approach to parallelizing queries with operations that are inherently serial. We believe all these techniques will make their way into big data analytics and other massively parallel database systems.


Demo C: From Data Collection to Analysis

Location: Room Stampa



NADEEF: A Generalized Data Cleaning System

Amr Ebaid (Purdue University), Ahmed Elmagarmid (QCRI), Ihab Ilyas (QCRI), Mourad Ouzzani (QCRI), Jorge-Arnulfo Quiane-Ruiz (QCRI), Nan Tang (QCRi), Si Yin (QCRI)


RecDB in Action: Recommendation Made Easy in Relational Databases

Mohamed Sarwat (University of Minnesota), James Avery (University of Minnesota), Mohamed Mokbel (University of Minnesota)


Graph Queries in a Next-Generation Datalog System

Alexander Shkapsky (UCLA), Kai Zeng (UCLA), Carlo Zaniolo (UCLA)


Lazy ETL in Action: ETL Technology Dates Scientific Data

Yağız Kargın (CWI), Milena Ivanova (Netherlands eScience Center), Stefan Manegold (CWI), Martin Kersten (CWI), Ying Zhang (CWI)


Scolopax: Exploratory Analysis of Scientific Data

Alper Okcan (Northeastern University), Mirek Riedewald, Biswanath Panda, Daniel Fink


PROPOLIS: Provisioned Analysis of Data-Centric Processes

Daniel Deutch (Ben Gurion university), Yuval Moskovitch (Ben Gurion University), Val Tannen (University of Pennsylvania)


Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System

Pradap Konda (University of Wisconsin-Madison), Arun Kumar (University of Wisconsin-Madison), Christopher Re (University of Wisconsin-Madison), Vaishnavi Sashikanth (Oracle)


PLASMA-HD: Probing the LAttice Structure and MAkeup of High-dimensional Data

David Fuhry (The Ohio State University), Yang Zhang (The Ohio State University), Venu Satuluri (Twitter), Arnab Nandi (The Ohio State University), Srinivasan Parthasarathy (The Ohio State University)


IBminer: A Text Mining Tool for Constructing and Populating InfoBox Databases and Knowledge Bases

Hamid Mousavi (UCLA), Shi Gao (UCLA), Carlo Zaniolo (UCLA)


Mining and Linking Patterns across Live Data Streams and Stream Archives

Di Yang (WPI), Kaiyu Zhao (WPI), Maryam Hasan (WPI), Hanyuan Lu (WPI), Elke Rundensteiner (WPI), Matthew Ward (WPI)


User Analytics with UbeOne: Insights into Web Printing

Georgia Koutrika (HP Labs), Qian Lin (HP Labs), Jerry Liu (HP Labs)



Wednesday Aug 28th 08:45-10:00

VLDB 2014 and Keynote 2

Location: Room 1000A

Chair: Christoph Koch (EPFL)



The DataHub: A Collaborative Data Analytics and Visualization Platform

Samuel Madden, Professor of Electrical Engineering and Computer Science (MIT Computer Science and Artificial Intelligence Laboratory)

In this talk, I will describe a new system we are building at MIT, called DataHub. DataHub is a hosted interactive data processing, sharing, and visualization system for large-scale data analytics. Key features of DataHub include: (i) Flexible ingest and data cleaning tools that help massage data into a form users can write programs against. This includes both removing irregularity and exposing structure from unstructured data such as text files and images. (ii) A scalable, parallel, SQL-based analytic data processing engine optimized for extremely low-latency operation on large data sets, exploiting the massive parallelism available in modern GPUs and upcoming “manycore” CPUs. (iii) An interactive visualization system that is tightly coupled to the data processing and lineage engine. Specifically, DataHub provides a workflow-based visualization engine where users can choose from a library of pre-built visualizations, or define their own visualizations via a simple API. Analysis and visualization steps may run on either CPUs or manycore/GPU devices. (iv) Finally, DataHub is a hosted data platform, designed to eliminate the need for users to manage their own database. It includes features that allow users to selectively share their data with other users, using complex context-sensitive predicates (e.g., that data about particular times or locations should be visible to particular users).

Bio: Samuel Madden is a Professor of Electrical Engineering and Computer Science in MIT's Computer Science and Artificial Intelligence Laboratory. His research interests include databases, distributed computing, and networking. Research projects include the C-Store column-oriented database system, the CarTel mobile sensor network system, and the Relational Cloud "database-as-a-service". Madden is a leader in the emerging field of "Big Data", heading the Intel Science and Technology Center (ISTC) for Big Data, a multi-university collaboration on developing new tools for processing massive quantities of data. He also leads BigData@CSAIL, an industry-backed initiative to unite researchers at MIT and leaders from industry to investigate the issues related to systems and algorithms for data that is high rate, massive, or very complex. Madden received his Ph.D. from the University of California at Berkeley in 2003 where he worked on the TinyDB system for data collection from sensor networks. Madden was named one of Technology Review's Top 35 Under 35 in 2005, and is the recipient of several awards, including an NSF CAREER Award in 2004, a Sloan Foundation Fellowship in 2007, best paper awards in VLDB 2004 and 2007, MobiCom 2006, CIDR 2013, EuroSys 2013, and a SIGMOD Test of Time Award for his 2003 paper "The Design of an Acquisitional Query Processor for Sensor Networks."



Wednesday Aug 28th 10:30-12:00

Panel 1

Location: Room 1000A

Chair: Rada Chirkova (NC State University) as Moderator



Big and Useful: What's in the Data for Me?

Rada Chirkova (NC State University), Minos Garofalakis (TU Crete), Joseph M. Hellerstein (UC Berkeley), Yannis Ioannidis (ATHENA Research and Innovation Center and University of Athens), Zachary Ives (University of Pennsylvania), H. V. Jagadish (University of Michigan), Jun Yang (Duke University)

In the context of extracting value for users from the available data, the database community has historically focused on efficient processing of primarily structured queries posed by expert users, mostly on structured, pre-organized data. The rather recent “Big Data” phenomenon has been shaping a world where extreme quantities of data are collected in an ad-hoc, almost accidental, way. Further, the user targets for Big Data are broader than those traditionally considered by the database-research community. The main goals of this panel are to identify the pain points when end users attempt to extract real value from ad hoc collections of Big Data, and to provide alternative viewpoints on what the database community should work on if it is to play a bigger role in bringing the benefits of Big Data to the masses.

Bio: Rada Chirkova is an associate professor in the Department of Computer Science at NC State University. She received her Ph.D. from Stanford University in 2002. She has been a recipient of the US National Science Foundation CAREER Award and of multiple IBM Faculty Awards, and is a senior member of the ACM. Her current research interests include reformulation of data and queries for a variety of purposes and application domains.

Bio: Minos Garofalakis is a Professor and the Director of the SoftNet Lab at the Department of Electronic & Computer Engineering of the Technical University of Crete. He received his PhD in Computer Sciences from the University of Wisconsin in 1998, and held senior researcher positions at Bell Labs, Intel Research Berkeley, and Yahoo! Research, as well as an Adjunct Associate Professor appointment with UC Berkeley EECS (2006-2008). His current research focuses on centralized and distributed data streams, data synopses and approximate query processing, uncertain databases, and big-data analytics and data mining. He serves as a PI for a number of European research projects in these areas. Minos is an ACM Distinguished Scientist (2011), and a recipient of the Bell Labs President's Gold Award (2004) and the IEEE ICDE Best Paper Award (2009).

Bio: Joseph M. Hellerstein is a Chancellor's Professor of Computer Science at the University of California, Berkeley, whose work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Research Fellow and the recipient of two ACM-SIGMOD “Test of Time” awards for his research. In 2010, Fortune Magazine included him in their list of 50 smartest people in technology, and MIT's Technology Review magazine included his Bloom language for cloud computing on their TR10 list of the 10 technologies “most likely to change our world.” He serves on the technical advisory boards of EMC, SurveyMonkey, Platfora and Captricity. Hellerstein is co-founder and CEO of Trifacta, which develops productivity software for data analysts.

Bio: Yannis Ioannidis (UC Berkeley PhD, 1986) is the President and General Director of the ATHENA Research and Innovation Center and a Professor at the Department of Informatics and Telecommunications of the University of Athens. His research work focuses on data and information management systems (especially query processing and optimization as well as complex dataflow processing), user modeling and attitude management systems, scientific data infrastructures, digital libraries and repositories, and computer-human interaction. He is coordinating several European and national projects on the above topics. Yannis Ioannidis is an ACM and IEEE Fellow and a member of Academia Europaea. He has received the 2003 VLDB "10-Year Best Paper" Award as well as several other research and teaching awards, and he currently serves as the ACM SIGMOD Chair.

Bio: Zachary Ives is an Associate Professor and the Markowitz Faculty Fellow at the University of Pennsylvania. His research interests include data integration and sharing, "big data," sensor networks, and data provenance and authoritativeness. He is a recipient of the NSF CAREER award, and an alumnus of the DARPA Computer Science Study Panel and Information Science and Technology advisory panel. He has also been awarded the Christian R. and Mary F. Lindback Foundation Award for Distinguished Teaching. He serves as the undergraduate curriculum chair for Penn's Singh Program in Networked and Social Systems Engineering, which focuses on the convergence of algorithms, game theory, sociology, and network dynamics in the context of the Internet. He is a co-author of the textbook Principles of Data Integration. His recent projects include a highly scalable cluster compute engine called REX, the ieeg.org portal for cloud-based sharing and analysis of network neuroscience data, and the Q system for integrating hundreds or thousands of disparate data sources.

Bio: H.V. Jagadish is the Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science at the University of Michigan. Dr. Jagadish obtained his Ph.D. from Stanford in 1985, and worked many years for AT&T, where he eventually headed the database department. After a brief detour through the University of Illinois, he joined the University of Michigan in the fall of 1999. He is the founding Editor-in-Chief of the Proceedings of the VLDB Endowment (PVLDB). His SIGMOD 2007 keynote on database usability gave the database community a significant impetus to think about usability issues. He recently coordinated a community white paper on Big Data.

Bio: Jun Yang is an Associate Professor of Computer Science at Duke University in USA. He is broadly interested in research on databases and data-intensive systems. He received his B.A. from University of California at Berkeley in 1995, and his Ph.D. from Stanford University in 2001. He has been a recipient of the US National Science Foundation CAREER Award, IBM Faculty Award, and HP Labs Innovation Research Award. His recent interests include continuous querying systems, scalable statistical computing, and computational journalism.


Tutorial 2

Location: Room 1000B

Chair: Alkis Simitsis (HP Labs)



Towards Database Virtualization for Database as a Service

Aaron J. Elmore (University of California Santa Barbara), Carlo Curino (Microsoft CISL), Divyakant Agrawal (University of California Santa Barbara), Amr El Abbadi (University of California Santa Barbara)

Advances in operating system and storage-level virtualization technologies have enabled the effective consolidation of heterogeneous applications in a shared cloud infrastructure. Novel research challenges arising from this new shared environment include load balancing, workload estimation, resource isolation, machine replication, live migration, and an emergent need for automation to handle large scale operations with minimal manual intervention. Given that databases are at the core of most applications deployed in the cloud, database management systems (DBMSs) represent a very important technology component that needs to be virtualized in order to realize the benefits of virtualization in the autonomic management of data-intensive applications in large scale data-centers. This tutorial is organized in three parts. In part one, we provide a general background on the current state of the art of virtualization technologies in modern cloud environments, covering some crucial advances in depth. In part two, we explore consolidated shared storage systems and the virtualization techniques used to ensure fair access between applications. In part three, we focus on some of the shortcomings of general purpose operating system and storage-level virtualization technologies when applied to DBMSs, and discuss recent research and development in this space. We will touch on several open problems, providing pointers to areas of research we expect to grow and for which simple naive solutions are not likely to be very effective. The goal of this tutorial is to survey the techniques used to provide elasticity in virtual machine systems and shared storage systems, and to survey database research on multitenant architectures and elasticity primitives. This foundation of core Database as a Service advances, together with a primer on important related topics in OS and storage-level virtualization, is central for anyone who wants to work in this area of research. At the end of the tutorial we expect attendees to be well oriented in this exciting research area and ready to participate in it.

Bio: Aaron J. Elmore is currently a PhD candidate at the University of California, Santa Barbara. He holds an MS in computer science from the University of Chicago. His research interests involve cloud computing, multitenant databases, and ecoinformatics.

Bio: Carlo A. Curino received a PhD from Politecnico di Milano, and spent two years as a Postdoctoral Associate at MIT CSAIL, leading the Relational Cloud project. He worked at Yahoo! Research as a Research Scientist, focusing on mobile/cloud platforms and entity deduplication at scale. Carlo is currently a Senior Scientist at Microsoft in the recently formed Cloud and Information Services Lab (CISL), where he works on big-data platforms and cloud computing.

Bio: Divyakant Agrawal is a Professor of Computer Science and the Director of Engineering Computing Infrastructure at the University of California at Santa Barbara. His research expertise is in the areas of database systems, distributed computing, data warehousing, and large-scale information systems. He currently serves as the Editor-in-Chief of Distributed and Parallel Databases and is on the editorial boards of the ACM Transactions on Database Systems and IEEE Transactions of Knowledge and Data Engineering. He serves on the Board of Trustees of the VLDB Endowment and on the Executive Committee of ACM Special Interest Group SIGSPATIAL. Dr. Agrawal is a Fellow of ACM and a Fellow of IEEE.

Bio: Amr El Abbadi is a Professor of Computer Science at the University of California, Santa Barbara. He received his B. Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. Prof. El Abbadi is an ACM Fellow, an AAAS Fellow and was Chair of the Computer Science Department at UCSB from 2007 to 2011. He has served as a journal editor for several database journals, including, currently, The VLDB Journal. He has been Program Chair for multiple database and distributed systems conferences, most recently SIGSPATIAL GIS 2010 and ACM Symposium on Cloud Computing (SoCC) 2011, COMAD India 2012 and ACM COSN (Conference On Social Networks) 2013.


Research 7: Data Quality and Cleaning

Location: Room 300

Chair: Pierre Senellart (Telecom ParisTech)



Efficient Querying of Inconsistent Databases with Binary Integer Programming

Phokion Kolaitis (UC Santa Cruz & IBM Research - Almaden), Enela Pema (UC Santa Cruz), Wang-Chiew Tan (UC Santa Cruz)

An inconsistent database is a database that violates one or more integrity constraints. A typical approach for answering a query over an inconsistent database is to first clean the inconsistent database by transforming it into a consistent one, and then apply the query to the consistent database. An alternative and more principled approach, known as consistent query answering, derives the answers to a query over an inconsistent database without changing the database, by taking into account all possible repairs of the database. In this paper, we study the problem of consistent query answering over inconsistent databases for the class of conjunctive queries under primary key constraints. We develop a system, called EQUIP, that represents a fundamental departure from existing approaches for computing the consistent answers to queries in this class. At the heart of EQUIP is a technique, based on Binary Integer Programming (BIP), that repeatedly searches for repairs to eliminate candidate consistent answers until no further such candidates can be eliminated. We rigorously establish the correctness of the algorithms behind EQUIP and carry out an extensive experimental investigation that validates the effectiveness of our approach. Specifically, EQUIP exhibits good and stable performance on conjunctive queries under primary key constraints, it significantly outperforms existing systems for computing the consistent answers of such queries in the case in which the consistent answers are not first-order rewritable, and it scales well.
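
For intuition, consistent answers under primary key constraints can be defined by brute force: a repair keeps exactly one tuple per key value, and a tuple is a consistent answer exactly when it appears in the query result of every repair. The sketch below (illustrative only) makes the exponential blow-up visible, which is precisely what EQUIP's BIP formulation is designed to avoid.

    from itertools import product

    def consistent_answers(table, key_index, query):
        """Intersect query answers over all repairs (tiny instances only)."""
        groups = {}
        for t in table:
            groups.setdefault(t[key_index], []).append(t)
        answers = None
        for repair in product(*groups.values()):   # one tuple per key value
            res = set(query(list(repair)))
            answers = res if answers is None else answers & res
        return answers or set()

    # emp(name PRIMARY KEY, dept) is inconsistent: two tuples for 'alice'.
    emp = [("alice", "sales"), ("alice", "hr"), ("bob", "hr")]
    q = lambda rows: {dept for _, dept in rows}
    print(consistent_answers(emp, 0, q))   # {'hr'}: holds in every repair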


Streaming Quotient Filter: A Near Optimal Approximate Duplicate Detection Approach for Data Streams

Sourav Dutta (Max Planck Institute for Informatics), Ankur Narang (IBM Research - India), Suman K. Bera (IBM Research - India)

The unparalleled growth and popularity of the Internet, coupled with the advent of diverse modern applications such as search engines, on-line transactions, and climate warning systems, has led to an unprecedented expansion in the volume of data stored world-wide. Efficient storage, management, and processing of such massive amounts of data has emerged as a central theme of research in this direction. Detecting and removing redundancies and duplicates from such multi-trillion-record sets in real time, to bolster resource and compute efficiency, constitutes a challenging area of study. The infeasibility of storing the entire data from potentially unbounded data streams, together with the need for precise elimination of duplicates, calls for intelligent approximate duplicate detection algorithms. The literature hosts numerous works based on the well-known probabilistic bitmap structure, the Bloom Filter, and its variants. In this paper we propose a novel data structure, the Streaming Quotient Filter (SQF), for efficient detection and removal of duplicates in data streams. SQF intelligently stores the signatures of elements arriving on a data stream and, along with an eviction policy, provides near zero false positive and false negative rates. We show that the near optimal performance of SQF is achieved with a very low memory requirement, making it ideal for real-time memory-efficient de-duplication applications with extremely low false positive and false negative tolerance. We present a detailed theoretical analysis of the working of SQF, providing a guarantee on its performance. Empirically, we compare SQF to alternate methods and show that the proposed method is superior in terms of memory and accuracy compared to the existing solutions. We also discuss Dynamic SQF for evolving streams and the parallel implementation of SQF.
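
The flavor of signature-based duplicate detection is easy to convey. The sketch below is a simplified stand-in, with an assumed hash-based signature and FIFO eviction; the paper's actual signature construction and eviction policy differ and come with analyzed error bounds.

    import hashlib

    class TinyDedupFilter:
        """Hash each element to a bucket and keep short signatures per
        bucket; a matching signature is reported as a duplicate. Distinct
        elements may collide (false positives) and evictions may drop a
        seen signature (false negatives), the trade-off SQF quantifies."""

        def __init__(self, buckets=1 << 16, slots=4):
            self.slots = slots
            self.table = [[] for _ in range(buckets)]

        def seen(self, item: bytes) -> bool:
            h = hashlib.blake2b(item, digest_size=8).digest()
            row = self.table[int.from_bytes(h[:4], "big") % len(self.table)]
            sig = h[4:]                        # 4-byte signature
            if sig in row:
                return True
            row.append(sig)
            if len(row) > self.slots:          # FIFO eviction (assumed)
                row.pop(0)
            return False

    f = TinyDedupFilter()
    print(f.seen(b"pkt-1"), f.seen(b"pkt-1"))  # False True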


On Repairing Structural Problems In Semi-structured Data

Flip Korn (AT&T Labs-Research), Barna Saha (AT&T Labs-Research), Divesh Srivastava (AT&T Labs-Research), Shanshan Ying (National University of Singapore)

Semi-structured data such as XML are popular for data interchange and storage. However, many XML documents have improper nesting where open- and close-tags are unmatched. Since some semi-structured data (e.g., LaTeX) have a flexible grammar, and since many XML documents lack an accompanying DTD or XSD, we focus on computing a syntactic repair via the edit distance. To solve this problem, we propose a dynamic programming algorithm which takes cubic time. While this algorithm is not scalable, well-formed substrings of the data can be pruned to enable faster computation. Unfortunately, there are still cases where the dynamic program can be very expensive; hence, we give branch-and-bound algorithms based on various combinations of two heuristics, called MinCost and MaxBenefit, that trade off between accuracy and efficiency. Finally, we experimentally demonstrate the performance of these algorithms on real data.
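
The shape of the cubic dynamic program can be shown on a single tag type, where repair reduces to editing a string of parentheses. The sketch below supports renames and deletions only; the paper's algorithm handles arbitrary tag alphabets and adds pruning and branch-and-bound on top.

    from functools import lru_cache

    def repair_cost(s: str) -> int:
        """Minimum renames/deletions making a '('/')' string balanced,
        via an O(n^3) dynamic program over substrings."""
        @lru_cache(maxsize=None)
        def dp(i: int, j: int) -> int:
            if i >= j:
                return 0
            if j - i == 1:
                return 1                       # lone tag: delete it
            # pair the outermost tags, renaming either one if needed ...
            best = (s[i] != '(') + (s[j - 1] != ')') + dp(i + 1, j - 1)
            # ... or split the substring and repair the parts independently
            for k in range(i + 1, j):
                best = min(best, dp(i, k) + dp(k, j))
            return best
        return dp(0, len(s))

    print(repair_cost("(()"))   # 1, e.g. delete the unmatched '('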


The Llunatic Data Cleaning Framework

Floris Geerts (University of Antwerp), Giansalvatore Mecca (Università della Basilicata), Paolo Papotti (QCRI), Donatello Santoro (Università della Basilicata)

Data-cleaning (or data-repairing) is considered a crucial problem in many database-related tasks. It consists in making a database consistent with respect to a set of given constraints. In recent years, repairing methods have been proposed for several classes of constraints. However, these methods rely on ad hoc decisions and tend to hard-code the strategy to repair conflicting values. As a consequence, there is currently no general algorithm to solve database repairing problems that involve different kinds of constraints and different strategies to select preferred values. In this paper we develop a uniform framework to solve this problem. We propose a new semantics for repairs, and a chase-based algorithm to compute minimal solutions. We implemented the framework in a DBMS-based prototype, and we report experimental results that confirm its good scalability and superior quality in computing repairs.


Truth Finding on the Deep Web: Is the Problem Solved?

Xian Li (SUNY at Binghamton), Xin Luna Dong (Google), Kenneth Lyons (AT&T Labs-Research), Weiyi Meng (Binghamton University), Divesh Srivastava (AT&T Labs-Research)

The amount of useful information available on the Web has been growing at a dramatic pace in recent years, and people rely more and more on the Web to fulfill their information needs. In this paper, we study the truthfulness of Deep Web data in two domains where we believed data are fairly clean and data quality is important to people's lives: Stock and Flight. To our surprise, we observed a large amount of inconsistency in data from different sources and also some sources with quite low accuracy. We further applied to these two data sets state-of-the-art data fusion methods that aim at resolving conflicts and finding the truth, analyzed their promise and limitations, and suggested possible improvements. We hope our study can increase awareness of the seriousness of conflicting data on the Web and in turn inspire more research in our community to tackle this problem.


Research 8: Privacy and Security

Location: Room 120

Chair: Panos Kalnis (KAUST)



Practical Differential Privacy via Grouping and Smoothing

Georgios Kellaris (HKUST), Stavros Papadopoulos (HKUST)

We address one-time publishing of non-overlapping counts with ε-differential privacy. These statistics are useful in a wide and important range of applications, including transactional, traffic, and medical data analysis. Prior work on the topic publishes such statistics with prohibitively low utility in several practical scenarios. To this end, we present GS, a method that pre-processes the counts by elaborately grouping and smoothing them via averaging. This step acts as a form of preliminary perturbation that diminishes sensitivity, and enables GS to achieve ε-differential privacy through low Laplace noise injection. The grouping strategy is dictated by a sampling mechanism, which minimizes the smoothing perturbation. We demonstrate the superiority of GS over its competitors, and confirm its practicality, via extensive experiments on real datasets.
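
A minimal sketch of the grouping-and-smoothing idea, assuming fixed disjoint groups of size g and unit count sensitivity (in the paper the grouping is chosen by a sampling mechanism rather than fixed up front):

    import numpy as np

    def grouping_and_smoothing(counts, g, eps):
        """Replace each count by its group average, then add Laplace noise.
        One individual changes one count by 1, hence one group average by
        1/g, so per-group noise of scale 1/(g*eps) suffices (sketch)."""
        rng = np.random.default_rng()
        counts = np.asarray(counts, dtype=float)
        out = np.empty_like(counts)
        for start in range(0, len(counts), g):
            grp = counts[start:start + g]
            out[start:start + g] = grp.mean() + rng.laplace(
                scale=1.0 / (g * eps))
        return out

    print(grouping_and_smoothing([10, 12, 11, 95, 97, 96], g=3, eps=0.5))

Smoothing trades a small bias within each group for a g-fold reduction in noise scale, which is the utility gain the grouping strategy optimizes.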


On Differentially Private Frequent Itemset Mining

Chen Zeng (University of Wisconsin-Madison), Jeffrey Naughton (University of Wisconsin-Madison), Jin-yi Cai (University of Wisconsin-Madison)

We consider differentially private frequent itemset mining. We begin by exploring the theoretical difficulty of simultaneously providing good utility and good privacy in this task. While our analysis proves that in general this is very difficult, it leaves a glimmer of hope in that our proof of difficulty relies on the existence of long transactions (that is, transactions containing many items). Accordingly, we investigate an approach that begins by truncating long transactions, trading off errors introduced by the truncation against those introduced by the noise added to guarantee privacy. Experimental results over standard benchmark databases show that truncating is indeed effective. Our algorithm solves the "classical" frequent itemset mining problem, in which the goal is to find all itemsets whose support exceeds a threshold. Related work has proposed differentially private algorithms for the top-k itemset mining problem ("find the k most frequent itemsets"). An experimental comparison with those algorithms shows that our algorithm achieves better F-score unless k is small.


Processing Analytical Queries over Encrypted Data

Stephen Tu (MIT), Frans Kaashoek (MIT), Sam Madden (MIT), Nickolai Zeldovich (MIT)

Monomi is a system for securely executing analytical workloads over sensitive data on an untrusted database server. Monomi works by encrypting the entire database and running queries over the encrypted data. Monomi introduces split client/server query execution, which can execute arbitrarily complex queries over encrypted data, as well as several techniques that improve performance for such workloads, including per-row precomputation, space-efficient encryption, grouped homomorphic addition, and pre-filtering. Since these optimizations are good for some queries but not others, Monomi introduces a designer for choosing an efficient physical design at the server for a given workload, and a planner to choose an efficient execution plan for a given query at runtime. A prototype of Monomi running on top of Postgres can execute most of the queries from the TPC-H benchmark with a median overhead of only 1.24x (ranging from 1.03x to 2.33x) compared to an unencrypted Postgres database where a compromised server would reveal all data.


CorrectDB: SQL Engine with Practical Query Authentication

Sumeet Bajaj (Stony Brook University), Radu Sion (Stony Brook University)

Clients of outsourced databases need Query Authentication (QA) guaranteeing the integrity (correctness and completeness) and authenticity of the query results returned by potentially compromised providers. Existing results provide QA assurances for a limited class of queries by deploying several software cryptographic constructs. Here, we show that it is significantly cheaper and more practical to achieve QA by deploying server-hosted, tamper-proof co-processors, despite their higher acquisition costs; this approach also provides the ability to handle arbitrary queries. To reach this insight, we extensively survey existing QA work and identify interdependencies and efficiency relationships. We then introduce CorrectDB, a new DBMS with full QA assurances, leveraging server-hosted, tamper-proof, trusted hardware in close proximity to the outsourced data.


Lightweight Privacy-Preserving Peer-to-Peer Data Integration

Ye Zhang (The Pennsylvania State University), Wai-Kit Wong (Hang Seng Management College), Siu Ming Yiu (The University of Hong Kong), Nikos Mamoulis (The University of Hong Kong), David W. Cheung (The University of Hong Kong)

Peer Data Management Systems (PDMS) are an attractive solution for managing distributed heterogeneous information. When a peer (client) requests data from another peer (server) with a different schema, translations of the query and its answer are done by a sequence of intermediate peers (translators). There are two privacy issues in this P2P data integration process: (i) answer privacy: no unauthorized parties (including the translators) should learn the query result; (ii) mapping privacy: the schema and the value mappings used by the translators to perform the translation should not be revealed to other peers. Elmeleegy and Ouzzani proposed the PPP protocol, which was the first to support privacy-preserving querying in PDMS. However, PPP suffers from several shortcomings. First, PPP does not satisfy the requirement of answer privacy, because it is based on commutative encryption; we show that this issue can be fixed by adopting another cryptographic technique called oblivious transfer. Second, PPP adopts a weaker notion of mapping privacy, which allows the client peer to observe certain mappings done by translators. In this paper, we develop a lightweight protocol that satisfies mapping privacy, and extend it to a more complex one that facilitates parallel translation by peers. Furthermore, we consider a stronger adversary model where there may be collusions among peers, and propose an efficient protocol that guards against collusions. We conduct an experimental study on the performance of the proposed protocols using both real and synthetic data. The results show that the proposed protocols not only achieve a better privacy guarantee than PPP, but are also more efficient.


Demo D: Queries and Interfaces

Location: Room Stampa



Senbazuru: A Prototype Spreadsheet Database Management System

Shirley Zhe Chen (University of Michigan), Mike Cafarella (University of Michigan), Jun Chen (University of Michigan), Daniel Prevo, Junfeng Zhuang (University of Michigan)


ReqFlex: Fuzzy Queries for Everyone

Grégory Smits (IRISA-University of Rennes 1), Olivier Pivert (IRISA-University of Rennes 1), Thomas Girault (Freelance Engineer)


Comprehensive and Interactive Temporal Query Processing with SAP HANA

Martin Kaufmann (ETH Zürich), Panagiotis Vagenas (ETH Zurich), Peter Fischer (Albert-Ludwigs-Universität Freiburg, Germany), Donald Kossmann (ETH Zurich), Franz Färber (SAP AG)


Functions Are Data Too (Defunctionalization for PL/SQL)

Torsten Grust (Universität Tübingen, Germany), Nils Schweinsberg (Universität Tübingen, Germany), Alexander Ulrich (Universität Tübingen, Germany)


QUEST: A Keyword Search System for Relational Data based on Semantic and Machine Learning Techniques

Sonia Bergamaschi, Francesco Guerra, and Matteo Interlandi (Università di Modena e Reggio Emilia), Raquel Trillo-Lado (Universidad de Zaragoza), Yannis Velegrakis (Università di Trento)


ROSeAnn: Reconciling Opinions of Semantic Annotators

Luying Chen (Oxford), Stefano Ortona (Oxford), Giorgio Orsi (Oxford), Michael Benedikt (Oxford)


SkySuite: A Framework of Skyline-Join Operators for Static and Stream Environments

Mithila Nagendra (Arizona State University), K. Selcuk Candan (Arizona State University)


MASTRO STUDIO: Managing Ontology-Based Data Access applications

Cristina Civili (Sapienza University of Rome), Marco Console (Sapienza University of Rome), Giuseppe De Giacomo (Sapienza Università di Roma), Domenico Lembo (Sapienza University of Rome), Maurizio Lenzerini (Sapienza Università di Roma), Lorenzo Lepore (Sapienza University of Rome), Riccardo Mancini (Sapienza University of Rome), Antonella Poggi (Sapienza University of Rome), Riccardo Rosati (Sapienza University of Rome), Marco Ruzzi (Sapienza University of Rome), Valerio Santarelli (Sapienza University of Rome), Domenico Fabio Savo (Sapienza University of Rome)


PAQO: A Preference-Aware Query Optimizer for PostgreSQL

Nicholas L. Farnan (University of Pittsburgh), Adam J. Lee (University of Pittsburgh), Panos K. Chrysanthis (University of Pittsburgh), Ting Yu (North Carolina State University & Qatar Computing Research Institute)


eSkyline: Processing Skyline Queries over Encrypted Data

Suvarna Bothe (Rutgers University), Panagiotis Karras (Rutgers University), Akrivi Vlachou (NTNU)


GestureQuery: A Multitouch Database Query Interface

Lilong Jiang (The Ohio State University), Michael Mandel (The Ohio State University), Arnab Nandi (The Ohio State University)


Complete Approximations of Incomplete Queries

Ognjen Savkovic (Free University of Bozen-Bolzano), Paramita Mirza (Fondazione Bruno Kessler), Alex Tomasi (Free University of Bozen-Bolzano), Werner Nutt (Free University of Bozen-Bolzano)


POIKILO: A Tool for Evaluating the Results of Diversification Models and Algorithms

Marina Drosou (University of Ioannina), Evaggelia Pitoura (University of Ioannina)



Wednesday Aug 28th 13:20-15:00

Business Meeting: Announcements & Awards

Location: Room 1000A

Chair: Yannis Velegrakis (University of Trento)


VLDB 2013

VLDB Journal Report

Kian-Lee Tan (NUS)

(Best Paper Award) DisC Diversity: Result Diversification based on Dissimilarity and Coverage

Marina Drosou (University of Ioannina), Evaggelia Pitoura (University of Ioannina)

(Borg Early Career Award)

Yanlei Diao (U Mass Amherst)


(Early Career Research Contribution Award & Presentation) From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Daniel Abadi (Yale University)

Four years ago at VLDB 2009, a paper was published about a research prototype, called HadoopDB, that attempted to transform Hadoop --- a batch-oriented scalable system designed for processing unstructured data --- into a full-fledged parallel database system that can achieve real-time (interactive) query responses across both structured and unstructured data. In 2010 it was commercialized by Hadapt, a start-up that was formed to accelerate the engineering of the HadoopDB ideas, and to harden the codebase for deployment in real-world, mission-critical applications. In this talk I will give an overview of HadoopDB, and how it combines ideas from the Hadoop and database system communities. I will then describe how the project transitioned from a research prototype written by PhD students at Yale University into enterprise-ready software written by a team of experienced engineers. We will examine particular technical features that are required in enterprise Hadoop deployments, and technical challenges that we ran into while making HadoopDB robust enough to be deployed in the real world. The talk will conclude with an analysis of how starting a company impacts the tenure process, and some thoughts for graduate students and junior faculty considering a similar path.

Bio: Daniel Abadi is an Associate Professor of computer science at Yale University and directs the DR@Y research lab, which performs research on database system architecture and implementation, scalable and distributed systems, and cloud computing. Before joining Yale, Prof. Abadi received his Ph.D. from MIT. He is best known for his research in column-store database systems (the C-Store project), high performance transactional systems (the H-Store and Calvin projects), and Hadoop (the HadoopDB project). Abadi has been a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, the 2007 VLDB best paper award, and the 2013 VLDB Early Career Researcher Award. He blogs at http://dbmsmusings.blogspot.com and tweets at @daniel_abadi.



Wednesday Aug 28th 15:30-17:30

Research 9: Graph Data 2

Location: Room 1000A

Chair: James Cheng (Chinese University of Hong Kong)



Efficient SimRank-based Similarity Join Over Large Graphs

Weiguo Zheng (Peking University), Lei Zou (Peking University), Yansong Feng, Lei Chen (Hong Kong University of Science and Technology), Dongyan Zhao (Peking University)

Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting; such queries can benefit friend recommendation in social networks, link prediction, and other applications. In this paper, we adopt SimRank to evaluate the similarity of two vertices in a large graph because of its generality: SimRank is purely structure dependent and does not rely on domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all the vertex pairs satisfying the threshold in a data graph G. In order to reduce the search space, we propose an estimated shortest-path-distance-based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification phase, we propose a novel index, called h-go cover, to efficiently compute the SimRank score of a single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (the h-go cover), from which the SimRank score of any vertex pair can be computed easily. In order to handle large graphs, we extend our technique to a partition-based framework. Thorough theoretical analysis and extensive experiments over both real and synthetic datasets confirm the efficiency and effectiveness of our solution.
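
For reference, SimRank is the fixed point of s(a,b) = C / (|I(a)||I(b)|) * Σ_{i∈I(a)} Σ_{j∈I(b)} s(i,j), with s(a,a) = 1 and I(v) the in-neighbors of v. A naive iteration that materializes all pairs, which is exactly what the h-go cover index avoids, looks like:

    import numpy as np

    def simrank(in_nbrs, C=0.6, iters=10):
        """in_nbrs[v] = list of in-neighbors of vertex v."""
        n = len(in_nbrs)
        S = np.identity(n)
        for _ in range(iters):
            T = np.identity(n)
            for a in range(n):
                for b in range(a):
                    Ia, Ib = in_nbrs[a], in_nbrs[b]
                    if Ia and Ib:
                        T[a, b] = T[b, a] = C * sum(
                            S[i, j] for i in Ia for j in Ib
                        ) / (len(Ia) * len(Ib))
            S = T
        return S

    # vertices 0 and 1 share in-neighbor 2; an SRJ query with threshold 0.5
    S = simrank([[2], [2], []])
    print([(a, b) for a in range(3) for b in range(a) if S[a, b] >= 0.5])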


Incremental and Accuracy-Aware Personalized PageRank through Scheduled Approximation

Fanwei Zhu (Zhejiang Univ City College), Yuan Fang, Kevin Chang (UIUC), Jing Ying (Zhejiang Univ City College)

As Personalized PageRank has been widely leveraged for ranking on a graph, the efficient computation of the Personalized PageRank Vector (PPV) becomes a prominent issue. In this paper, we propose FastPPV, an approximate PPV computation algorithm that is incremental and accuracy-aware. Our approach hinges on a novel paradigm of scheduled approximation: the computation is partitioned and scheduled for processing in an ''organized'' way, such that we can gradually improve our PPV estimation in an incremental manner and quantify the accuracy of our approximation at query time. Guided by this principle, we develop an efficient hub-based realization, where we adopt the metric of hub-length to partition and schedule random walk tours so that the approximation error reduces exponentially over iterations. Furthermore, as tours are segmented by hubs, the shared substructures between different tours (around the same hub) can be reused to speed up query processing both within and across iterations. Finally, we evaluate FastPPV over two real-world graphs, and show that it not only significantly outperforms two state-of-the-art baselines in both the online and offline phases, but also scales well on larger graphs. In particular, we are able to achieve near-constant time online query processing irrespective of the graph size.
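
The quantity being approximated is the standard personalized PageRank fixed point v = α·e_s + (1 − α)·Pᵀv. A plain power-iteration baseline, the kind of exact computation that scheduled approximation replaces, can be sketched as follows (redirecting dangling vertices to the source is an assumed convention):

    import numpy as np

    def ppv(adj, source, alpha=0.15, iters=50):
        """adj[u] = out-neighbors of u; returns the PPV of `source`."""
        n = len(adj)
        v = np.zeros(n)
        v[source] = 1.0
        for _ in range(iters):
            nxt = np.zeros(n)
            for u, nbrs in enumerate(adj):
                if nbrs:
                    share = (1 - alpha) * v[u] / len(nbrs)
                    for w in nbrs:
                        nxt[w] += share
                else:
                    nxt[source] += (1 - alpha) * v[u]  # dangling vertex
            nxt[source] += alpha                       # restart mass
            v = nxt
        return v

    print(ppv([[1, 2], [2], [0]], source=0).round(3))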


Memory Efficient Minimum Substring Partitioning

Yang Li (University of California, Santa Barbara), Pegah Kamousi (University of California, Santa Barbara), Fangqiu Han (University of California, Santa Barbara), Shengqi Yang (University of California, Santa Barbara), Xifeng Yan (University of California, Santa Barbara), Subhash Suri (University of California, Santa Barbara)

Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low cost can be assembled to reconstruct whole genomes. Unfortunately, the large memory footprint of existing de novo assembly algorithms makes it challenging to get the assembly done for higher eukaryotes like mammals. In this work, we investigate the memory issue of constructing the de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundred gigabytes of memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes of memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually, and later merged with others to form the de Bruijn graph. By leveraging the overlaps among the k-mers (substrings of length k), MSP achieves an astonishing compression ratio: the total size of the partitions is reduced from Θ(kn) to Θ(n), where n is the size of the short read database and k is the length of a k-mer. Experimental results show that our method can build de Bruijn graphs using a commodity computer for any large-volume sequence dataset.
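
The key primitive, assigning each k-mer to the partition named by its minimum p-length substring, fits in a few lines (the k and p values below are illustrative):

    def min_substring(kmer: str, p: int) -> str:
        """Lexicographically smallest p-length substring of a k-mer."""
        return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))

    def partition_reads(reads, k=5, p=3):
        """Group k-mers by minimum substring. Adjacent k-mers of a read
        usually share it, so each read contributes long runs to few
        partitions -- the source of the Theta(kn) -> Theta(n) shrinkage."""
        parts = {}
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                parts.setdefault(min_substring(kmer, p), []).append(kmer)
        return parts

    for key, kmers in partition_reads(["ACGTACGGA"]).items():
        print(key, kmers)   # most k-mers land in the 'ACG' partition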


An In-depth Comparison of Subgraph Isomorphism Algorithms in Graph Databases

Jinsoo Lee (Kyungpook National University), Wook-Shin Han (Kyungpook National University), Romans Kasperovics (Kyungpook National University), Jeong-Hoon Lee (Kyungpook National University)

Finding subgraph isomorphisms is an important problem in many applications which deal with data modeled as graphs. While this problem is NP-hard, in recent years many algorithms have been proposed to solve it in reasonable time for real datasets, using different join orders, pruning rules, and auxiliary neighborhood information. However, since they have not been empirically compared with one another in most research work, it is not clear whether later work outperforms earlier work. Another problem is that reported comparisons were often done using the original authors' binaries, which were written in different programming environments. In this paper, we address these serious problems by re-implementing five state-of-the-art subgraph isomorphism algorithms in a common code base and by comparing them using many real-world datasets and their query loads. Through our in-depth analysis of the experimental results, we report surprising empirical findings.


Large Scale Cohesive Subgraphs Discovery for Social Network Visual Analysis

Feng Zhao (NUS), Anthony Tung (NUS)

Graphs are widely used in large scale social network analysis nowadays. Not only do analysts need to focus on cohesive subgraphs to study patterns among social actors, but ordinary users are also interested in discovering what is happening in their neighborhood. However, effectively storing large scale social networks and efficiently identifying cohesive subgraphs is challenging. In this work we introduce a novel subgraph concept to capture the cohesion in social interactions, and propose an I/O efficient approach to discover cohesive subgraphs. In addition, we propose an analytic system which allows users to perform intuitive, visual browsing of large scale social networks. Our system stores the network as a social graph in a graph database, retrieves a local cohesive subgraph based on the input keywords, and then hierarchically visualizes the subgraph with an orbital layout, in which more important social actors are located in the center. By summarizing textual interactions between social actors as a tag cloud, we provide a way to quickly locate active social communities and their interactions in a unified view.


Tutorial 3

Location: Room 1000B

Chair: C. Mohan (IBM Almden Research Center)



Toward Scalable Transaction Processing - Evolution of Shore-MT

Anastasia Ailamaki (EPFL), Ryan Johnson (University of Toronto), Ippokratis Pandis (IBM), Pinar Tözün (EPFL)

Designing scalable transaction processing systems on modern multicore hardware has been a challenge for almost a decade. The typical characteristics of transaction processing workloads lead to a high degree of unbounded communication on multicores for conventional system designs. In this tutorial, we first present a systematic way of eliminating the scalability bottlenecks of a transaction processing system, based on minimizing unbounded communication. Then, we show several techniques that apply this methodology to minimize logging-, locking-, and latching-related bottlenecks of transaction processing systems. In parallel, we demonstrate the internals of the Shore-MT storage manager and how they have evolved over the years in terms of scalability on multicore hardware through such techniques. We also teach how to use Shore-MT, with the various design options it offers, through its sophisticated application layer Shore-Kits and simple Metadata Frontend.

Bio: Anastasia Ailamaki is a Professor of Computer Sciences at EPFL in Switzerland. Her research interests are in database systems and applications, and in particular (a) in strengthening the interaction between the database software and emerging hardware and I/O devices, and (b) in automating database management to support computationally demanding, data-intensive scientific applications. She has received a Finmeccanica endowed chair from the Computer Science Department at Carnegie Mellon (2007), a European Young Investigator Award from the European Science Foundation (2007), an Alfred P. Sloan Research Fellowship (2005), eight best-paper awards at top conferences (2001-2012), and an NSF CAREER award (2002). She earned her Ph.D. in Computer Science from the University of Wisconsin-Madison in 2000. She is a senior member of the IEEE and a member of the ACM, and has also been a CRA-W mentor.

Bio: Ryan Johnson is an Assistant Professor at the University of Toronto specializing in systems aspects of database engines, particularly in the context of modern hardware. He contributed heavily to the initial development and performance tuning of Shore-MT. He graduated with M.S. and PhD degrees in Computer Engineering from Carnegie Mellon University in 2010, after completing a B.S. in Computer Engineering at Brigham Young University in 2004. In addition to his work with database systems, Johnson has interests in computer architecture, operating systems, compilers, and hardware design.

Bio: Ippokratis Pandis is a Research Staff Member (RSM) at IBM Research – Almaden. His research focuses on efficient, scalable data management, and he is actively involved in IBM's DB2 BLU project. Prior to joining IBM, Ippokratis graduated with a PhD in Electrical and Computer Engineering from Carnegie Mellon University, where he worked on scalable transaction processing on multisocket and multicore hardware, contributing to the development of Shore-MT and Shore-Kits. His PhD thesis was on the data-oriented transaction processing architecture (DORA).

Bio: Pinar Tozun is a fourth-year PhD student at EPFL working under the supervision of Prof. Anastasia Ailamaki in the Data-Intensive Applications and Systems (DIAS) Laboratory. Her research focuses on the scalability and efficiency of transaction processing systems on modern hardware, and she actively contributes to the development and maintenance of Shore-MT and Shore-Kits. Before starting her PhD, she received her BSc degree from the Computer Engineering Department of Koç University in 2009 as the top student.


Research 10: Data Mining

Location: Room 300

Chair: Xin Luna Dong (Google)



Scaling Factorization Machines to Relational Data

Steffen Rendle (University of Konstanz)

The most common approach in predictive modeling is to describe cases with feature vectors (aka design matrix). Many machine learning methods such as linear regression or support vector machines rely on this representation. However, when the underlying data has strong relational patterns, especially relations with high cardinality, the design matrix can get very large which can make learning and prediction slow or even infeasible. This work solves this issue by making use of repeating patterns in the design matrix which stem from the underlying relational structure of the data. It is shown how coordinate descent learning and Bayesian Markov Chain Monte Carlo inference can be scaled for linear regression and factorization machine models. Empirically, it is shown on two large scale and very competitive datasets (Netflix prize, KDDCup 2012), that (1) standard learning algorithms based on the design matrix representation cannot scale to relational predictor variables, (2) the proposed new algorithms scale and (3) the predictive quality of the proposed generic feature-based approach is as good as the best specialized models that have been tailored to the respective tasks.
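
For reference, the standard degree-2 factorization machine model at the core of this work is

    \hat{y}(\mathbf{x}) \;=\; w_0 \;+\; \sum_{i=1}^{n} w_i x_i
        \;+\; \sum_{i=1}^{n} \sum_{j=i+1}^{n}
              \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j ,

and the pairwise term can be rewritten as \frac{1}{2} \sum_{f=1}^{k} \big[ \big( \sum_{i} v_{i,f} x_i \big)^2 - \sum_{i} v_{i,f}^2 x_i^2 \big], so one prediction costs O(kn) in the number of non-zero features rather than O(kn^2). The paper's contribution is to additionally exploit the repeating blocks that relational joins induce in the design matrix.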


Scorpion: Explaining Away Outliers in Aggregate Queries

Eugene Wu (MIT), Sam Madden (MIT)

A common way that users of database systems explore large data sets is to run aggregate queries that project the data down to a smaller number of points and dimensions that can then be visualized. Often, such visualizations will reveal outliers that correspond to errors or surprising features of the input data set. Unfortunately, databases and visualization systems do not provide a way to work backwards from an outlier point to the common properties of the (possibly many) unaggregated input tuples that correspond to that outlier. In this paper, we propose Scorpion, a system that, given a set of user-specified outlier points in an aggregate query result, identifies predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results. Specifically, we do this explanation by identifying predicates that, when applied to the input data, cause the outliers to disappear from the output. To find such predicates, we develop a notion of influence of a predicate on a given output, and design several algorithms that efficiently search for maximum influence predicates over the input data. We show that these algorithms can quickly find outliers in two real data sets (from a sensor deployment and a campaign finance data set), and run orders of magnitude faster than a naive search algorithm while providing comparable quality on a synthetic data set.
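
The influence idea admits a compact approximation: how much the aggregate moves, per tuple held out, when the tuples matching a candidate predicate are removed. A toy sketch (a simplification of the paper's definition, which Scorpion's search algorithms optimize over a space of predicates):

    def influence(tuples, predicate, agg):
        """Change in the aggregate when matching tuples are removed,
        normalized by the number of removed tuples (simplified)."""
        kept = [t for t in tuples if not predicate(t)]
        removed = len(tuples) - len(kept)
        if removed == 0 or not kept:
            return 0.0
        return (agg(tuples) - agg(kept)) / removed

    # Which sensor explains the suspiciously high average temperature?
    data = [{"sensor": "a", "temp": 20}, {"sensor": "a", "temp": 21},
            {"sensor": "b", "temp": 90}]
    avg = lambda ts: sum(t["temp"] for t in ts) / len(ts)
    for s in ("a", "b"):
        print(s, round(influence(data, lambda t, s=s: t["sensor"] == s, avg), 2))
    # sensor 'b' gets the large positive influence: removing it drops the avg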


PARAS: A Parameter Space Framework for Online Association Mining

Xika Lin (WPI), Abhishek Mukherji (WPI), Elke Rundensteiner (WPI), Carolina Ruiz (WPI), Matthew Ward (WPI)

Association rule mining is known to be computationally intensive, yet real-time decision-making applications are increasingly intolerant to delays. To enable online association rule mining, existing techniques prestore intermediate results, namely, itemsets in an itemset-based index. However, given particular input parameter values such as minsupport and minconfidence, the actual rule generation must still be performed at query-time. The response time can be unacceptably long for interactive mining, especially when rule redundancy resolution is required as part of rule generation. To tackle this shortcoming, we introduce the parameter space model, called PARAS. PARAS enables fast rule mining by compactly maintaining the final rulesets instead of just storing intermediate itemsets. The PARAS model is based on the notion of stable region abstractions that form the coarse granularity ruleset space. Based on new insights on the redundancy relationships among rules, PARAS establishes a surprisingly compact representation of complex redundancy relationships while enabling efficient redundancy resolution at query-time. Besides the classical rule mining requests, this model supports three novel classes of exploratory queries. Using the proposed PSpace index, these exploratory query classes can all be answered with near real-time responsiveness. Our experimental evaluation using several benchmark datasets demonstrates that PARAS achieves 2-7 orders of magnitude improvement over commonly used techniques in online association rule mining.


Schema Extraction for Tabular Data on the Web

Marco D. Adelfio (University of Maryland), Hanan Samet (University of Maryland)

Tabular data is an abundant source of information on the Web, but remains mostly isolated from the latter’s interconnections since tables lack links and computer-accessible descriptions of their structure. In other words, the schemas of these tables -- attribute names, values, data types, etc. -- are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines and thus not accessible to user search queries. We address this lack of structure with a new method for leveraging the principles of table construction in order to extract table schemas. Discovering the schema by which a table is constructed is achieved by harnessing the similarities and differences of nearby table rows through the use of a novel set of features and a feature processing scheme. The schemas of these data tables are determined using a classification technique based on conditional random fields in combination with a novel feature encoding method called logarithmic binning, which is specifically designed for the data table extraction task. Our method provides considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, thereby being capable of handling general tables such as those found in spreadsheets, instead of being restricted to HTML tables as is the case with the WebTables method. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.
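
Logarithmic binning, the encoding step named above, maps numeric cell values to coarse magnitude classes so that rows playing the same structural role produce similar features. A sketch under the assumption of base-2 bins (the base and sign handling here are illustrative):

    import math

    def log_bin(value: float) -> int:
        """Encode a numeric feature by order of magnitude, so the CRF
        sees 'a three-digit-ish number' rather than the literal 742."""
        if value == 0:
            return 0
        b = int(math.floor(math.log2(abs(value)))) + 1
        return b if value > 0 else -b

    print([log_bin(v) for v in (0, 1, 5, 742, -3.5)])   # [0, 1, 3, 10, -2]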


Partitioning and Ranking Tagged Data Sources

Milad Eftekhar (University of Toronto), Nick Koudas (University of Toronto)

Online types of expression in the form of social networks, microblogging, blogs and rich content sharing platforms have proliferated in the last few years. Such proliferation contributed to the vast explosion in online data sharing we are experiencing today. One unique aspect of online data sharing is tags manually inserted by content generators to facilitate content description and discovery (e.g., hashtags in tweets). In this paper we focus on these tags and we study and propose algorithms that make use of tags in order to automatically organize and categorize this vast collection of socially contributed and tagged information. In particular, we take a holistic approach in organizing such tags and we propose algorithms to partition as well as rank this information collection. Our partitioning algorithms aim to segment the entire collection of tags (and the associated content) into a specified number of partitions for specific problem constraints. In contrast our ranking algorithms aim to identify few partitions fast, for suitably defined ranking functions. We present a detailed experimental study utilizing the full twitter firehose (set of all tweets in the Twitter service) that attests to the practical utility and effectiveness of our overall approach. We also present a detailed qualitative study of our results.


Industry Vision

Location: Room 120

Chair: Min Wang (Google Research), Cong Yu (Google Research)


pdf

Next Generation Data Analytics at IBM Research

Oktie Hassanzadeh (IBM Research), Anastasios Kementsietsidis (IBM Research), Benny Kimelfeld (IBM Research), Rajasekar Krishnamurthy (IBM Research), Fatma Özcan (IBM Research), Ippokratis Pandis (IBM Research)

pdf

Learning and Intelligent Optimization: one ring to rule them all

Mauro Brunato (Lionsolver Inc. and University of Trento), Roberto Battiti (Lionsolver Inc. and University of Trento)

pdf

SAP HANA: The Evolution from a Modern Main-Memory Data Platform to an Enterprise Application Platform

Vishal Sikka (SAP), Franz Färber (SAP), Anil Goel (SAP), Wolfgang Lehner (SAP)

Platform-as-a-Service for Data-enabled Applications

Milind Bhandarkar (Pivotal), George Tuma (Pivotal)

pdf

Athilab presentation

Sergio Ramazzina (Athilab), Chiara L. Ballari (Athilab)

pdf

Context-Aware Computing: Opportunities and Open Issues

Edward Y. Chang (HTC Corp)

pdf

How to maximize the value of big data with the open source SpagoBI suite through a comprehensive approach

Monica Franceschini (SpagoBI)

Facebook Data Analytics

Sambavi Muthukrishnan (Facebook)

pdf

Exploiting the Diversity, Mass and Speed of Territorial Data by TELCO Operators for Better User Services

Fabrizio Antonelli (Telecom Italia), Antonino Casella (Telecom Italia), Cristiana Chitic (Telecom Italia), Roberto Larcher (Telecom Italia), Giovanni Torrisi (Telecom Italia)

pdf

Odyssey: A MultiStore System for Evolutionary Analytics

Hakan Hacıgümüş (NEC Labs), Jagan Sankaranarayanan (NEC Labs), Junichi Tatemura (NEC Labs), Jeff LeFevre, Neoklis Polyzotis (UCSC)

pdf

TPC: A Look Back and a Look Ahead

Raghunath Nambiar (TPC), Meikel Poess (TPC)

pdf

The Trento Big Data Platform for Public Administration and Large Companies: Use cases and Opportunities

Ivan Bedini (Trento RISE), Benedikt Elser (Trento RISE), Yannis Velegrakis (University of Trento and Trento RISE)

pdf

Designing Query Optimizers for Big Data problems of the future

Nga Tran (Vertica)

pdf

Microsoft SQL Server’s Integrated Database Approach for Modern Applications and Hardware

David Lomet (Microsoft Research)

pdf

A global Entity Name System (ENS) for data ecosystems.

Paolo Bouquet (OKKAM srl), Andrea Molinari (OKKAM srl)

Google Data 2020 - The next challenges in big data

Stephan Ellner (Google Inc)


Demo A: New Platforms

Location: Room Stampa


pdf

A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data

Ahmed Eldawy (University of Minnesota), Mohamed Mokbel (University of Minnesota)

pdf

Aggregate Profile Clustering for Telco Analytics

Mehmet Ali Abbasoğlu (Bilkent University), Buğra Gedik (Bilkent University), Hakan Ferhatosmanoglu (Bilkent University)

pdf

Parallel Graph Processing on Graphics Processors Made Easy

Jianlong Zhong (Nanyang Technological University), Bingsheng He (Nanyang Technological University)

pdf

Mosquito: Another One Bites the Data Upload STream

Stefan Richter (Saarland University), Jens Dittrich (Saarland University)

pdf

NoFTL: Database Systems on FTL-less Flash Storage

Sergey Hardock (TU-Darmstadt), Ilia Petrov (Reutlingen University), Robert Gottstein (TU-Darmstadt), Alejandro Buchmann (TU-Darmstadt)

pdf

EagleTree: Exploring the Design Space of SSD-Based Algorithms

Niv Dayan (IT University of Copenhagen), Martin Kjær Svendsen (IT University of Copenhagen), Matias Bjørling (IT University of Copenhagen), Philippe Bonnet (IT University of Copenhagen), Luc Bouganim (INRIA Rocquencourt and University of Versailles)

pdf

Flexible Query Processor on FPGAs

Mohammadreza Najafi (Technical University Munich), Mohammad Sadoghi (IBM T. J. Watson Research Center), Hans-Arno Jacobsen (University of Toronto)

pdf

A Demonstration of Iterative Parallel Array Processing in Support of Telescope Image Analysis

Matthew Moyers (University of Washington), Emad Soroush (University of Washington), Spencer Wallace (University of Arizona), Simon Krughoff (University of Washington), Jake Vanderplas (University of Washington), Magdalena Balazinska (University of Washington), Andrew Connolly (University of Washington)

pdf

Hone: "Scaling Down" Hadoop on Shared-Memory Systems

K.Ashwin Kumar (UMD), Jonathan Gluck (University of Maryland, College Park), Amol Deshpande (University of Maryland), Jimmy Lin (University of Maryland, College Park)

pdf

REEF: Retainable Evaluator Execution Framework

Byung-Gon Chun (Microsoft), Tyson Condie (Microsoft), Carlo Curino (Microsoft), Raghu Ramakrishnan (Microsoft), Russell Sears (Microsoft), Markus Weimer (Microsoft)

pdf

OmniDB: Towards Portable and Efficient Query Processing on Parallel CPU/GPU Architectures

Shuhao Zhang (Nanyang Technological University), Jiong HE (Nanyang Technological University), Bingsheng He (NTU Singapore), Mian Lu (A*STAR Institute of High Performance Computing)

pdf

DiAl: Distributed Streaming Analytics Anywhere, Anytime

Ivo Santos (Microsoft Research ATL Europe), Marcel Tilly (Microsoft Research ATL Europe), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)



Wednesday Aug 28th 18:15-24:00

Banquet Dinner

Location: Café Casino della Città di Arco


pdf

Dinner and Music

Punto Gezz



Thursday Aug 29th 08:45-10:00

Prizes and Keynote 3

Location: Room 1000A

Chair: Renée J. Miller (University of Toronto)


txt

Privacy-Preserving Data Analysis: From Fallacious to Felicitous ... and to Fruition

Cynthia Dwork, Distinguished Scientist (Microsoft Research)

Privacy-preserving data analysis, also known as statistical disclosure control, has a large literature that spans several disciplines. Many early attempts have proved problematic either in practice or on paper. A new approach, based on the definitional concept of "differential privacy," has provided a theoretically sound and powerful framework that has given rise to an explosion of research. This talk motivates and explains the definition of differential privacy, describes some basic techniques for achieving it, and discusses some of the technical and cultural obstacles to bringing this approach to fruition.
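
For reference, the definition motivated in the talk is standardly stated as follows: a randomized mechanism M is epsilon-differentially private if, for every pair of datasets D and D' differing in a single record and every set S of possible outputs,

    \[
    \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S].
    \]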

Bio: Cynthia Dwork, Distinguished Scientist at Microsoft Research, is renowned for placing privacy-preserving data analysis on a mathematically rigorous foundation. A cornerstone of this work is differential privacy, a strong privacy guarantee frequently permitting highly accurate data analysis. Dr. Dwork has also made seminal contributions in cryptography and distributed computing, and is a recipient of the Edsger W. Dijkstra Prize, recognizing some of her earliest work establishing the pillars on which every fault-tolerant system has been built for decades. She is a member of the US National Academy of Engineering and a Fellow of the American Academy of Arts and Sciences.



Thursday Aug 29th 10:30-12:00

Research 11: Big Data

Location: Room 1000A

Chair: Carlo Curino (Microsoft Research)


pdftxt

ClouDiA: A Deployment Advisor for Public Clouds

Tao Zou (Cornell University), Ronan Le Bras (Cornell University), Marcos Vaz Salles (DIKU), Alan Demers (Cornell University), Johannes Gehrke (Cornell University)

An increasing number of distributed data-driven applications are moving into shared public clouds. By sharing resources and operating at scale, public clouds promise higher utilization and lower costs than private clusters. To achieve high utilization, however, cloud providers must inevitably allocate virtual machine instances non-contiguously, i.e., instances of a given application may end up in physically distant machines in the cloud. This allocation strategy can lead to large differences in average latency between instances. For a large class of applications, this difference can result in significant performance degradation, unless care is taken in how application components are mapped to instances. In this paper, we propose ClouDiA, a general deployment advisor that selects application node deployments minimizing either (i) the largest latency between application nodes, or (ii) the longest critical path among all application nodes. ClouDiA employs mixed-integer programming and constraint programming techniques to efficiently search the space of possible mappings of application nodes to instances. Through experiments with synthetic and real applications in Amazon EC2, we show that our techniques yield a 15% to 55% reduction in time-to-solution or service response time, without any need for modifying application code.
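
In miniature, the "largest latency" objective amounts to choosing, among the instances the cloud has allocated, the subset whose worst pairwise latency is smallest. A brute-force Python sketch (for intuition only; ClouDiA searches this space at scale with mixed-integer and constraint programming):

    from itertools import combinations

    def best_deployment(latency, k):
        # latency: symmetric matrix of measured pairwise latencies among
        # allocated instances; pick k of them minimizing the largest
        # pairwise latency.
        n = len(latency)
        best_set, best = None, float('inf')
        for chosen in combinations(range(n), k):
            worst = max(latency[a][b] for a, b in combinations(chosen, 2))
            if worst < best:
                best_set, best = chosen, worst
        return best_set, best

    lat = [[0, 2, 9, 3],
           [2, 0, 8, 4],
           [9, 8, 0, 7],
           [3, 4, 7, 0]]
    print(best_deployment(lat, 3))  # -> ((0, 1, 3), 4)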

pdftxt

Upper and Lower Bounds on the Cost of a Map-Reduce Computation

Foto Afrati (NTUA), Anish Das Sarma (Google Research), Semih Salihoglu (Stanford University), Jeffrey Ullman (Stanford University)

In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of map-reduce computation. This model enables a generic recipe for discovering lower bounds on communication cost as a function of the maximum number of inputs that can be assigned to one reducer. We use the model to analyze the tradeoff for three problems: finding pairs of strings at Hamming distance d, finding triangles and other patterns in a larger graph, and matrix multiplication. For finding strings of Hamming distance 1, we have upper and lower bounds that match exactly. For triangles and many other graphs, we have upper and lower bounds that are the same to within a constant factor. For the problem of matrix multiplication, we have matching upper and lower bounds for one-round map-reduce algorithms. We are also able to explore two-round map-reduce algorithms for matrix multiplication and show that these never have more communication, for a given reducer size, than the best one-round algorithm, and often have significantly less.
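
The generic recipe can be summarized as follows (a standard restatement, with my notation: q is the maximum number of inputs per reducer, g(q) the maximum number of outputs a reducer with q inputs can cover, and |I|, |O| the total numbers of inputs and outputs). Since the reducers must jointly cover all outputs while each covers at most g(q) of them, the replication rate r (the average number of reducers receiving each input) satisfies

    \[
    r(q) \;\ge\; \frac{q\,|O|}{g(q)\,|I|}.
    \]

For instance, for Hamming distance 1 over b-bit strings, |I| = 2^b, |O| = b*2^(b-1), and g(q) <= (q/2) log2 q, which gives the lower bound r(q) >= b / log2 q that the paper's upper bounds match.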

pdftxt

A Distributed Algorithm for Large-Scale Generalized Matching

Faraz Makari Manshadi (Max Planck Institute for Informatics), Baruch Awerbuch (Johns Hopkins University), Rainer Gemulla (Max Planck Institute for Informatics), Rohit Khandekar (Knight Capital Group), Julián Mestre (School of IT, The University of Sydney), Mauro Sozio (Institut Mines-Telecom, Telecom ParisTech, CNRS)

Generalized matching problems arise in a number of applications, including computational advertising, recommender systems, and trade markets. Consider, for example, the problem of recommending multimedia items (e.g., DVDs) to users such that (1) users are recommended items that they are likely to be interested in, (2) every user gets neither too few nor too many recommendations, and (3) only items available in stock are recommended to users. State-of-the-art matching algorithms fail at coping with large real-world instances, which may involve millions of users and items. We propose the first distributed algorithm for computing near-optimal solutions to large-scale generalized matching problems like the one above. Our algorithm is designed to run on a small cluster of commodity nodes (or in a MapReduce environment), has strong approximation guarantees, and requires only a poly-logarithmic number of passes over the input. In particular, we propose a novel distributed algorithm to approximately solve so-called mixed packing-covering linear programs, which include but are not limited to generalized matching problems. Experiments on real-world and synthetic data suggest that our algorithm scales to very large problem sizes and can be orders of magnitude faster than alternative approaches.
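
In linear-programming form, the recommendation example above is a mixed packing-covering program along these lines (notation mine, for illustration): with x_{ui} the fractional assignment of item i to user u, w_{ui} the predicted interest, l_u and c_u the per-user bounds on the number of recommendations, and s_i the stock of item i,

    \[
    \max \sum_{u,i} w_{ui}\,x_{ui}
    \quad \text{s.t.} \quad
    l_u \le \sum_i x_{ui} \le c_u \ \forall u, \qquad
    \sum_u x_{ui} \le s_i \ \forall i, \qquad
    0 \le x_{ui} \le 1.
    \]

The lower bounds are the covering constraints and the upper bounds the packing constraints, which is what makes the program "mixed".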

pdftxt

Making Queries Tractable on Big Data with Preprocessing

Wenfei Fan (University of Edinburgh), Floris Geerts (University of Antwerp), Frank Neven (Hasselt University and transnational University of Limburg)

A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data off-line, so that queries in the class can be subsequently evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of Pi-tractable queries to characterize classes of queries that can be answered in parallel poly-logarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are Pi-tractable and are feasible on big data. (3) We also study a set of query classes that can be effectively converted to Pi-tractable queries by re-factorizing their data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for the set of query classes that can be made Pi-tractable. (5) We also show that unless P = NC, the set of all Pi-tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, all PTIME query classes can be made Pi-tractable via proper re-factorizations. This work is a step towards understanding the tractability of queries in the context of big data.

pdftxt

Hadoop's Adolescence: An analysis of Hadoop usage in scientific workloads

Kai Ren (Carnegie Mellon University), YongChul Kwon (Microsoft), Magdalena Balazinska (University of Washington), Bill Howe (University of Washington)

We analyze Hadoop workloads from three different research clusters from a user-centric perspective. The goal is to better understand data scientists' use of the system and the mismatches between the system design and use. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools. We see significant diversity in resource usage and application styles, including some "interactive" and "iterative" workloads, motivating new tools in the ecosystem. We also observe significant opportunities for optimizations of these workloads. We find that job customization and configuration are used in a narrow scope, suggesting the future pursuit of automatic tuning systems. Overall, we present the first user-centered measurement study of Hadoop and find significant opportunities for improving its efficient use for data scientists.


Tutorial 4

Location: Room 1000B

Chair: Stratis Viglas (University of Edinburgh)


txt

Modern Database Systems

C. Mohan (IBM Almaden Research Center)

This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.

Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan


Research 12: Spatial and Text

Location: Room 300

Chair: Panos Kalnis (KAUST)


pdftxt

A General Framework for Geo-Social Query Processing

Nikos Armenatzoglou (HKUST), Stavros Papadopoulos (HKUST), Dimitris Papadias (HKUST)

The proliferation of GPS-enabled mobile devices and the popularity of social networking have recently led to the rapid growth of Geo-Social Networks (GeoSNs). GeoSNs have created a fertile ground for novel location-based social interactions and advertising. These can be facilitated by GeoSN queries, which extract useful information combining both the social relationships and the current location of the users. This paper constitutes the first systematic work on GeoSN query processing. We propose a general framework that offers flexible data management and algorithmic design. Our architecture segregates the social, geographical and query processing modules. Each GeoSN query is processed via a transparent combination of primitive queries issued to the social and geographical modules. We demonstrate the power of our framework by introducing several "basic" and "advanced" query types, and devising various solutions for each type. Finally, we perform an exhaustive experimental evaluation with real and synthetic datasets, based on realistic implementations with both commercial software (such as MongoDB) and state-of-the-art research methods. Our results confirm the viability of our framework in typical large-scale GeoSNs.

pdftxt

Spatio-Textual Similarity Joins

Panagiotis Bouros (HKU), Shen Ge (HKU), Nikos Mamoulis (University of Hong Kong)

Given a collection of objects that carry both spatial and textual information, a spatio-textual similarity join retrieves the pairs of objects that are spatially close and textually similar. As an example, consider a social network with spatially and textually tagged persons (i.e., their locations and profiles). A useful task (for friendship recommendation) would be to find pairs of persons that are spatially close and whose profiles have a large overlap (i.e., they have common interests). Another application is data de-duplication (e.g., finding photographs that are spatially close to each other and have high overlap in their descriptive tags). Despite the importance of this operation, there is very little previous work studying its efficient evaluation, and that work in fact uses a different definition, in which only the best match for each object is identified. In this paper, we combine ideas from state-of-the-art spatial distance join and set similarity join methods and propose efficient algorithms that take both spatial and textual constraints into account. In addition, we propose a batch processing technique that boosts the performance of our approaches. An experimental evaluation using real and synthetic datasets shows that our optimized techniques are orders of magnitude faster than baseline solutions.
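
The join predicate itself is simple; a naive quadratic baseline in Python (illustrative only; the paper's algorithms exist precisely to avoid this exhaustive scan via combined spatial and textual filtering):

    import math

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def spatio_textual_join(objects, eps, tau):
        # objects: list of ((x, y), set_of_tags) pairs; report pairs that
        # are within distance eps and have Jaccard similarity >= tau.
        results = []
        for i in range(len(objects)):
            for j in range(i + 1, len(objects)):
                (p, ta), (q, tb) = objects[i], objects[j]
                if math.dist(p, q) <= eps and jaccard(ta, tb) >= tau:
                    results.append((i, j))
        return results

    objs = [((0, 0), {'jazz', 'food'}), ((0.5, 0.2), {'jazz', 'wine'}),
            ((9, 9), {'jazz', 'food'})]
    print(spatio_textual_join(objs, eps=1.0, tau=0.3))  # -> [(0, 1)]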

pdftxt

Direction-Preserving Trajectory Simplification

Cheng Long (HKUST), Raymond Chi-Wing Wong (Hong Kong University of Science and Technology), H. V. Jagadish (University of Michigan)

Trajectories of moving objects are collected in many applications. Raw trajectory data is typically very large, and has to be simplified before use. In this paper, we introduce the notion of direction-preserving trajectory simplification, and show that it can support a broader range of applications than traditional position-preserving trajectory simplification. We present a polynomial-time algorithm for optimal direction-preserving simplification, and another approximate algorithm with a quality guarantee. Extensive experimental evaluation with real trajectory data shows the benefit of the new techniques.
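
To see what preserving direction (rather than position) means, here is a greedy Python sketch under my own tolerance criterion (not the paper's optimal algorithm): a point is kept whenever the heading of the next move drifts from the current segment's heading by more than a threshold.

    import math

    def heading(p, q):
        return math.atan2(q[1] - p[1], q[0] - p[0])

    def simplify_by_direction(points, tol):
        # Greedy direction-based simplification: start a new segment when
        # the next move's heading deviates from the segment heading > tol.
        kept = [points[0]]
        base = heading(points[0], points[1])
        for i in range(1, len(points) - 1):
            h = heading(points[i], points[i + 1])
            drift = abs(math.atan2(math.sin(h - base), math.cos(h - base)))
            if drift > tol:
                kept.append(points[i])
                base = h
        kept.append(points[-1])
        return kept

    traj = [(0, 0), (1, 0.02), (2, -0.01), (3, 0), (3.5, 1), (4, 2)]
    print(simplify_by_direction(traj, math.radians(15)))
    # -> [(0, 0), (3, 0), (4, 2)]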

pdftxt

Efficient Error-tolerant Query Autocompletion

Chuan Xiao (Nagoya University), Jianbin Qin (The University of New South Wales), Wei Wang (The University of New South Wales), Yoshiharu Ishikawa (Nagoya University), Koji Tsuda (AIST), Kunihiko Sadakane (National Institute of Informatics)

Query autocompletion is an important feature that saves users many keystrokes by completing queries as they are typed. In this paper we study the problem of query autocompletion that tolerates errors in users’ input, using edit distance constraints. Previous approaches index data strings in a trie and continuously maintain all the prefixes of data strings whose edit distance to the query is within the threshold. The major inherent problem is that the number of such prefixes is huge for the first few characters of the query and is exponential in the alphabet size. This results in slow query times even if the entire query approximately matches only a few prefixes. We propose a novel neighborhood generation-based algorithm, IncNGTrie, which achieves up to three orders of magnitude speedup over existing methods for the error-tolerant query autocompletion problem. Our algorithm maintains only a small set of active nodes, thus saving both space and time when processing the query. We also study efficient duplicate removal methods, which address a core problem in fetching query answers. In addition, we propose optimization techniques to reduce our index size, and discuss several extensions to our method. The efficiency of our method is demonstrated against existing methods through extensive experiments on real datasets.
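
The flavor of neighborhood generation can be seen in the classic deletion-variant trick (a FastSS-style sketch for whole-string matching; IncNGTrie's index over trie prefixes is considerably more refined): strings within small edit distance share a variant obtained by deleting a few characters.

    def deletion_variants(s, max_del=1):
        # All strings obtainable from s by deleting up to max_del characters.
        out = {s}
        if max_del > 0:
            for i in range(len(s)):
                out |= deletion_variants(s[:i] + s[i + 1:], max_del - 1)
        return out

    # Index every data string under each of its deletion variants.
    data = ['sigmod', 'vldb', 'icde']
    index = {}
    for w in data:
        for v in deletion_variants(w):
            index.setdefault(v, set()).add(w)

    # A mistyped query matches if its variant set intersects the index.
    query = 'vlbd'
    hits = set().union(*(index.get(v, set()) for v in deletion_variants(query)))
    print(hits)  # -> {'vldb'}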

pdftxt

Spatial Keyword Query Processing: An Experimental Evaluation

Lisi Chen (NTU), Gao Cong (Nanyang Technological University), Christian S. Jensen (Aarhus University), Dingming Wu (Hong Kong Baptist University)

Geo-textual indices play an important role in spatial keyword querying. The existing geo-textual indices have not been compared systematically under the same experimental framework. This makes it difficult to determine which indexing technique best supports specific functionality. We provide an all-around survey of 12 state-of-the-art geo-textual indices. We propose a benchmark that enables the comparison of the spatial keyword query performance. We also report on the findings obtained when applying the benchmark to the indices, thus uncovering new insights that may guide index selection as well as further research.


Industry 4: Optimization

Location: Room 120

Chair: Paul Larson (Microsoft Research)


pdftxt

Continuous Cloud-Scale Query Optimization and Processing

Nico Bruno (Microsoft), Sapna Jain (IIT Bombay), Jingren Zhou (Microsoft Corp.)

Massive data analysis in cloud-scale data centers plays a crucial role in making critical business decisions. High-level scripting languages free developers from understanding various system trade-offs, but introduce new challenges for query optimization. One key optimization challenge is missing accurate data statistics, typically due to massive data volumes and their distributed nature, complex computation logic, and frequent usage of user-defined functions. In this paper we propose novel techniques to adapt query processing in the SCOPE system, the cloud-scale computation environment in Microsoft Online Services. We continuously monitor query execution, collect actual runtime statistics, and adapt parallel execution plans as the query executes. We discuss similarities and differences between our approach and alternatives proposed in the context of traditional centralized systems. Experiments on large-scale SCOPE production clusters show that the proposed techniques systematically solve the challenge of missing or inaccurate data statistics, detect and resolve partition skew and plan-structure issues, and improve query latency severalfold for real workloads. Although we focus on optimizing high-level languages, the same ideas are also applicable to MapReduce systems.

pdftxt

Optimization Strategies for A/B Testing on HADOOP

Andrii Cherniak (University of Pittsburgh), Huma Zaidi (eBay Inc), Vladimir Zadorozhny (University of Pittsburgh)

In this work, we present a set of techniques that considerably improve the performance of executing concurrent MapReduce jobs. Our proposed solution relies on proper resource allocation for concurrent Hive jobs based on data dependency, inter-query optimization and modeling of Hadoop cluster load. To the best of our knowledge, this is the first work on Hive/MapReduce job optimization that takes Hadoop cluster load into consideration. We perform an experimental study that demonstrates a 2.33x reduction in execution time for concurrent versus sequential execution schedules. We report up to 40% further reduction in execution time for concurrent job execution after resource usage optimization. The results reported in this paper were obtained in a pilot project to assess the feasibility of migrating A/B testing from a Teradata + SAS analytics infrastructure to Hadoop. This work was performed on eBay's production Hadoop cluster.

pdftxt

Piranha: Optimizing Short Jobs in Hadoop

Khaled Elmeleegy (Turn Inc.)

Cluster computing has emerged as a key parallel processing platform for large-scale data. All major Internet companies use it as their central data processing platform. One of cluster computing's most popular examples is MapReduce and its open-source implementation Hadoop. These systems were originally designed for batch and massive-scale computations. Interestingly, over time their production workloads have evolved into a mix of a small fraction of large, long-running jobs and a much bigger fraction of short jobs. This came about because these systems end up being used as data warehouses, which store most of the data sets and attract ad-hoc, short, data-mining queries. Moreover, the availability of higher-level query languages that operate on top of these cluster systems has led to a proliferation of such ad-hoc queries. Since existing systems were not designed for short, latency-sensitive jobs, short interactive jobs suffer from poor response times. In this paper, we present Piranha, a system for optimizing short jobs on Hadoop without affecting the larger jobs. It runs on existing, unmodified Hadoop clusters, facilitating its adoption. Piranha exploits characteristics of short jobs learned from production workloads at Yahoo! clusters to reduce the latency of such jobs. To demonstrate Piranha's effectiveness, we evaluated its performance using three realistic short queries. Piranha was able to reduce the queries' response times by up to 71%.

pdftxt

Making Updates Disk-I/O Friendly Using SSDs

Mohammad Sadoghi (IBM T. J. Watson Research Center), Kenneth Ross (Columbia University), Mustafa Canim (IBM T.J. Watson), Bishwaranjan Bhattacharjee (IBM T.J. Watson)

Multiversion databases store both current and historical data. Rows are typically annotated with timestamps representing the period when the row is/was valid. We develop novel techniques for reducing index maintenance in multiversion databases, so that indexes can be used effectively for analytical queries over current data without being a heavy burden on transaction throughput. To achieve this end, we re-design persistent index data structures in the storage hierarchy to employ an extra level of indirection. The indirection level is stored on solid state disks that can support very fast random I/Os, so that traversing the extra level of indirection incurs a relatively small overhead. The extra level of indirection dramatically reduces the number of magnetic disk I/Os that are needed for index updates, and localizes maintenance to indexes on updated attributes. Further, we batch insertions within the indirection layer in order to reduce physical disk I/Os for indexing new records. By reducing the index maintenance overhead on transactions, we enable operational data stores to create more indexes to support queries. We have developed a prototype of our indirection proposal by extending the widely used Generalized Search Tree (GiST) open-source project, which is also employed in Postgres. Our working implementation demonstrates that we can significantly reduce index maintenance and/or query processing cost, by a factor of 3. For insertions of new records, our novel batching technique can save up to 90% of the insertion time.
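
The core of the proposal is one extra hop through an SSD-resident mapping, so that indexes store stable logical identifiers rather than physical row locations. A minimal sketch (assumed structure, not the paper's exact design):

    class IndirectionLayer:
        # Indexes on magnetic disk store stable logical IDs (LIDs); only
        # this SSD-resident table maps LIDs to current physical locations.
        def __init__(self):
            self.lid_to_phys = {}

        def update_row(self, lid, new_phys):
            # When a new row version is written, one small SSD write
            # suffices; indexes on unchanged attributes stay untouched.
            self.lid_to_phys[lid] = new_phys

        def lookup(self, lid):
            return self.lid_to_phys[lid]

    ind = IndirectionLayer()
    ind.update_row(lid=42, new_phys=('page', 7, 'slot', 3))
    ind.update_row(lid=42, new_phys=('page', 9, 'slot', 0))  # new version
    print(ind.lookup(42))  # -> ('page', 9, 'slot', 0)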


Demo B: Personal, Social, and Web Data

Location: Room Stampa


pdf

DesTeller: A System for Destination Prediction Based on Trajectories with Privacy Protection

Andy Yuan Xue (University of Melbourne), Rui Zhang (University of Melbourne), Yu Zheng (Microsoft Research Asia), Xing Xie (Microsoft Research Asia, China), Jianhui Yu (South China Normal University), Yong Tang (South China Normal University)

pdf

GroupFinder: A New Approach to Top-K Point-of-Interest Group Retrieval

Kenneth Bøgh (Aarhus University), Anders Skovsgaard (Aarhus University), Christian S. Jensen (Aarhus University)

pdf

CrowdMiner: Mining association rules from the crowd

Yael Amsterdamer (Tel Aviv University), Yael Grossman (Tel Aviv University), Tova Milo (Tel Aviv University), Pierre Senellart (Télécom ParisTech)

pdf

TeRec: A Temporal Recommender System Over Tweet Stream

Chen Chen (Peking University), Hongzhi Yin (Peking University), Junjie Yao (Peking University), Bin Cui (Peking University)

pdf

iRoad: A Framework For Scalable Predictive Query Processing On Road Networks

Abdeltawab Hendawi (University of Minnesota), Jie Bao (University of Minnesota), Mohamed Mokbel (University of Minnesota)

pdf

SmartMonitor: Using Smart Devices to Perform Structural Health Monitoring

Dimitrios Kotsakos (University of Athens), Panos Sakkos (University of Athens), Vana Kalogeraki (Athens University of Economics and Business), Dimitrios Gunopulos (University of Athens)

pdf

EnviroMeter: A Platform for Querying Community-Sensed Data

Saket Sathe (EPFL), Arthur Oviedo (EPFL), Dipanjan Chakraborty (IBM Research - India), Karl Aberer (EPFL)

pdf

EvenTweet: Online Localized Event Detection from Twitter

Hamed Abdelhaq (Heidelberg University), Christian Sengstock (Heidelberg University), Michael Gertz (Heidelberg University)

pdf

PhotoStand: A Map Query Interface for a Database of News Photos

Hanan Samet (University of Maryland), Marco D. Adelfio (University of Maryland), Brendan C. Fruin (University of Maryland), Michael D. Lieberman (University of Maryland), Jagan Sankaranarayanan (University of Maryland)

pdf

Ringtail: A Generalized Nowcasting System

Dolan Antenucci (University of Michigan), Erdong Li (University of Michigan), Shaobo Liu (University of Michigan), Bochun Zhang (University of Michigan), Mike Cafarella (University of Michigan), Christopher Re (University of Wisconsin-Madison)

pdf

IPS: An Interactive Package Configuration System for Trip Planning

Min Xie (University of British Columbia), Laks V. S. Lakshmanan (University of British Columbia), Peter Wood (Birkbeck, University of London)

pdf

R2-D2: a System to Support Probabilistic Path Prediction in Dynamic Environments

Jingbo Zhou (National University of Singapore), Anthony K.H. Tung (National University of Singapore), Wei Wu (I2R), Wee Siong Ng (I2R)



Thursday Aug 29th 13:30-15:00

Panel 2

Location: Room 1000A

Moderator: Nick Koudas (University of Toronto)


txt

To startup or not to startup? Academics/Entrepreneurs share their experiences.

Daniel Abadi (Yale University), Nick Koudas (University of Toronto), Yannis Papakonstantinou (University of California - San Diego), Jignesh Patel (University of Wisconsin), Radu Sion (Stony Brook University)

Our community is associated with numerous stories of commercialization of research projects. However starting a company, creating and maintaining a product and everything that this effort involves requires a set of skills that we do not typically acquire via our training. This panel brings together a few academics that went through the commercialization process to share their experience.

Bio: Prof. Abadi's research interests are in database system architecture and implementation, cloud computing, and the Semantic Web. Before joining the Yale computer science faculty, he spent four years at the Massachusetts Institute of Technology where he received his Ph.D. Abadi has been a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, and the 2007 VLDB best paper award. His research on HadoopDB (see below) is currently being commercialized by Hadapt, where Abadi also serves as chief scientist. He blogs at DBMS Musings and tweets at @daniel_abadi.

Bio: Prof. Koudas's research interests are in database systems, the web, social analytics and big data. Before joining the University of Toronto, he was a principal scientist at AT&T Research and an adjunct professor at Columbia University. Prof. Koudas was named inventor of the year in 2011 by the University of Toronto. He is a co-founder of Sysomos, a social media analytics company. Prof. Koudas serves as an advisor to several startups commercializing data-analytics-related technologies. You can follow him at @koudas.

Bio: Yannis Papakonstantinou is a Professor of Computer Science and Engineering at the University of California, San Diego. His research is in the intersection of data management technologies and the web, where he has published over eighty research articles. He has given multiple tutorials and invited talks, has served on journal editorial boards and has chaired and participated in program committees for many international conferences and workshops. Yannis enjoys commercializing his research and informing his research accordingly. He was the CEO and Chief Scientist of Enosys Software, which built and commercialized an early XML-based Enterprise Information Integration platform. Enosys Software was acquired in 2003 by BEA Systems. His lab's FORWARD platform (for the rapid development of data-driven Ajax applications) is now in use by many commercial applications. He is involved in data analytics in the pharmaceutical industry and is on the technical advisory board of Brightscope Inc. He is the inventor of seven patents. Yannis holds a Diploma of Electrical Engineering from the National Technical University of Athens, MS and Ph.D. in Computer Science from Stanford University (1997) and an NSF CAREER award for his work on data integration.

Bio: Professor Doktor Ingenieur Radu Sion, PhD, MSc, BSc, and high-school diploma, is a faculty member in Computer Science at Stony Brook University (on leave) and currently the CEO of Private Machines Inc. He remembers when gophers were digging through the Internets and bits were running at slower paces of 512 per second. He is also interested in efficient computing with a touch of cyber-security paranoia, raising rabbits on space ships and sailing catamarans of the Hobie variety.


Tutorial 5

Location: Room 1000B

Chair: Haixun Wang (Microsoft Research Asia)


pdftxt

Just-in-time compilation for SQL query processing

Stratis D. Viglas (University of Edinburgh)

Just-in-time compilation of SQL queries into native code has recently emerged as a viable technique for query processing and an alternative to the dominant interpretation-based approach. We present the salient results of research in this fresh area, addressing all aspects of the query processing stack: from traditional query compilation techniques, to compilation in managed environments, to state-of-the-art approaches on intermediate and native code emission. Throughout the discussion we refer to, and draw analogies with, the general code-generation techniques used in contemporary compiler technology. At the same time we describe the open research problems of the area.

Bio: Stratis D. Viglas is a Reader in the School of Informatics at the University of Edinburgh. He received a PhD in Computer Science from the University of Wisconsin-Madison in 2003, and BSc and MSc degrees from the Department of Informatics at the University of Athens, Greece, in 1996 and 1999.


Research 14: Temporal, Stream and Event Processing

Location: Room 300

Chair: Yanlei Diao (University of Massachusetts Amherst)


pdftxt

Streaming Algorithms for k-core Decomposition

Erdem Sarıyüce (OSU), Buğra Gedik (Bilkent University), Gabriela Jacques-Silva (IBM T.J. Watson Research Center), Kun-Lung Wu (IBM T.J. Watson Research Center), Ümit Çatalyürek (OSU)

A k-core of a graph is a maximal connected subgraph in which every vertex is connected to at least k vertices in the subgraph. k-core decomposition is often used in large-scale network analysis, such as community detection, protein function prediction, visualization, and solving NP-Hard problems on real networks efficiently, like maximal clique finding. In many real-world applications, networks change over time. As a result, it is essential to develop efficient incremental algorithms for streaming graph data. In this paper, we propose the first incremental k-core decomposition algorithms for streaming graph data. These algorithms locate a small subgraph that is guaranteed to contain the list of vertices whose maximum k-core values have to be updated, and efficiently process this subgraph to update the k-core decomposition. Our results show a significant reduction in run-time compared to non-incremental alternatives. We show the efficiency of our algorithms on different types of real and synthetic graphs, at different scales. For a graph of 16 million vertices, we observe speedups reaching a million times, relative to the non-incremental algorithms.
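
For reference, the non-incremental baseline is the classic peeling algorithm, sketched below in Python; the paper's contribution is maintaining its output under edge insertions and deletions without rerunning it.

    from collections import defaultdict

    def core_numbers(edges):
        # Peeling: repeatedly remove a minimum-degree vertex; a vertex's
        # core number is the largest minimum degree seen up to its removal.
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        degree = {v: len(ns) for v, ns in adj.items()}
        core, k = {}, 0
        while degree:
            v = min(degree, key=degree.get)  # O(n) scan; fine for a sketch
            k = max(k, degree[v])
            core[v] = k
            for u in adj[v]:
                if u in degree:
                    degree[u] -= 1
            del degree[v]
        return core

    # Triangle {1, 2, 3} plus pendant vertex 4: core numbers 2, 2, 2, 1.
    print(core_numbers([(1, 2), (2, 3), (1, 3), (3, 4)]))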

pdftxt

Travel Cost Inference from Sparse, Spatio-Temporally Correlated Time Series Using Markov Models

Bin Yang (Aarhus University), Chenjuan Guo (Aarhus University), Christian S. Jensen (Aarhus University)

The monitoring of a system yields a set of measurements that can be modeled as a collection of time series. These time series are often sparse due to missing measurements and spatio-temporally correlated, meaning that near-by time series exhibit temporal correlations. The analysis of such time series offers insight into the underlying systems and enables prediction of system behavior. While the techniques presented in the paper are applicable more generally, we consider the case of transportation systems and aim to predict travel cost from GPS tracking data obtained from probe vehicles. Specifically, each road segment has an associated travel cost time series. We use spatio-temporal hidden Markov models (STHMM) to model correlations among different traffic time series. We provide a set of algorithms that are able to learn the parameters of an STHMM, while contending with the sparsity, spatio-temporal correlation, and heterogeneity of the time series. Using the resulting STHMM, near future travel costs in the transportation network, e.g., travel time or greenhouse gas emissions, can be inferred, enabling a variety of routing services, e.g., eco-routing. Empirical studies with a substantial GPS data set offer insight into the design properties of the proposed framework and algorithms, demonstrating the effectiveness and efficiency of travel cost inferencing.

pdftxt

Top-k Publish-Subscribe for Social Annotation of News

Alexander Shraer (Google), Maxim Gurevich (Google), Marcus Fontoura (Google), Vanja Josifovski (Google)

Social content, such as Twitter updates, often has the quickest first-hand reports of news events, as well as numerous commentaries that are indicative of the public view of such events. As such, social updates provide a good complement to professionally written news articles. In this paper we consider the problem of automatically annotating news stories with social updates (tweets) at a news website serving a high volume of pageviews. The high rates of both pageviews (millions to billions a day) and incoming tweets (more than 100 million a day) make real-time indexing of tweets ineffective, as this requires an index that is both queried and updated extremely frequently. The rate of tweet updates makes caching techniques almost unusable, since the cache would become stale very quickly. We propose a novel architecture where each story is treated as a subscription for tweets relevant to the story's content, together with new algorithms that efficiently match tweets to stories, proactively maintaining the top-k tweets for each story. Such top-k pub-sub consumes only a small fraction of the resource cost of alternative solutions, and can be applicable to other large-scale content-based publish-subscribe problems. We evaluate and show the effectiveness of our approach on real-world data: a corpus of news stories from Yahoo! News and a log of Twitter updates.
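
The core inversion (stories become standing subscriptions, tweets become the streamed events) can be sketched in a few lines of Python, with a toy term-overlap score standing in for the paper's ranking function:

    import heapq

    class TopKPubSub:
        def __init__(self, stories, k):
            self.k = k
            self.stories = stories                 # story_id -> term set
            self.topk = {s: [] for s in stories}   # story_id -> min-heap

        def publish(self, tweet_id, tweet_terms):
            # Match each incoming tweet against the subscriptions; every
            # story proactively keeps only its k best-scoring tweets.
            for sid, terms in self.stories.items():
                score = len(terms & tweet_terms)   # toy relevance score
                if score == 0:
                    continue
                heap = self.topk[sid]
                if len(heap) < self.k:
                    heapq.heappush(heap, (score, tweet_id))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, tweet_id))

    ps = TopKPubSub({'story1': {'election', 'senate'}}, k=2)
    ps.publish('t1', {'election', 'poll'})
    ps.publish('t2', {'election', 'senate', 'vote'})
    ps.publish('t3', {'weather'})
    print(ps.topk['story1'])  # -> [(1, 't1'), (2, 't2')]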

pdftxt

Sketch-based Geometric Monitoring of Distributed Stream Queries

Minos Garofalakis (Technical University of Crete), Daniel Keren (Haifa University), Vasilis Samoladas (Technical University of Crete)

Emerging large-scale monitoring applications rely on continuous tracking of complex data-analysis queries over collections of massive, physically-distributed data streams. Thus, in addition to the space- and time-efficiency requirements of conventional stream processing (at each remote monitor site), effective solutions also need to guarantee communication efficiency (over the underlying communication network). The complexity of the monitored query adds to the difficulty of the problem --- this is especially true for non-linear queries (e.g., joins), where no obvious solutions exist for distributing the monitor condition across sites. The recently proposed geometric method offers a generic methodology for splitting an arbitrary (non-linear) global threshold-monitoring task into a collection of local site constraints; still, the approach relies on maintaining the complete stream(s) at each site, thus raising serious efficiency concerns for massive data streams. In this paper, we propose novel algorithms for efficiently tracking a broad class of complex aggregate queries in such distributed-streams settings. Our tracking schemes rely on a novel combination of the geometric method with compact sketch summaries of local data streams, and maintain approximate answers with provable error guarantees, while optimizing space and processing costs at each remote site and communication cost across the network. One of our key technical insights for the effective use of the geometric method lies in exploiting a much lower-dimensional space for monitoring the sketch-based estimation query. Due to the complex, highly non-linear nature of these estimates, efficiently monitoring the local geometric constraints poses challenging algorithmic issues for which we propose novel solutions. Experimental results on real-life data streams verify the effectiveness of our approach.
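
As a reminder of what a "sketch summary" buys: a small randomized structure whose counters estimate an aggregate of the full stream. A toy AMS-style sketch in Python (heavily simplified; real implementations use 4-wise independent hashing, and the paper's contribution is monitoring such estimates geometrically across sites):

    import random

    class ToyAMSSketch:
        # Each counter accumulates +/-1 signs per item; the mean of the
        # squared counters estimates F2, the sum of squared frequencies.
        def __init__(self, counters=128, seed=42):
            rng = random.Random(seed)
            self.salts = [rng.getrandbits(32) for _ in range(counters)]
            self.z = [0] * counters

        def update(self, item):
            for i, salt in enumerate(self.salts):
                self.z[i] += 1 if hash((salt, item)) & 1 else -1

        def estimate_f2(self):
            return sum(v * v for v in self.z) / len(self.z)

    s = ToyAMSSketch()
    for x in [1] * 10 + [2] * 5 + [3] * 1:
        s.update(x)
    print(s.estimate_f2())  # roughly 10^2 + 5^2 + 1^2 = 126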

pdftxt

Efficient Recovery of Missing Events

Jianmin Wang (Tsinghua University), Shaoxu Song (Tsinghua University), Xiaochen Zhu (Tsinghua University), Xuemin Lin (University of New South Wales)

Owing to various data-entry and transmission issues caused by humans or systems, missing events often occur in event data, which record the execution logs of business processes. Without recovering these missing events, applications such as provenance analysis or complex event processing built upon event data are not reliable. Following the minimum-change discipline in improving data quality, it is also rational to find a recovery that minimally differs from the original data. Existing recovery approaches fall short on efficiency, owing to enumerating and searching over all possible sequences of events. In this paper, we study efficient techniques for recovering missing events. We prove the recovery problem to be NP-hard. Nevertheless, we are able to concisely represent the space of event sequences in a branching framework. Advanced indexing and pruning techniques are developed to further improve recovery efficiency. Our techniques also make it possible to find top-k recoveries. The experimental results demonstrate that our minimum recovery approach achieves high accuracy and significantly outperforms state-of-the-art techniques, with up to five orders of magnitude improvement in time performance.


Research 13: Analytical Processing

Location: Room 120

Chair: Rainer Gemulla (Max Planck Institute for Informatics)


pdftxt

Skyline Operator on Anti-correlated Distributions

Haichuan Shang (University of Tokyo), Masaru Kitsuregawa (University of Tokyo)

Finding the skyline in a multi-dimensional space is relevant to a wide range of applications. The skyline operator over a set of d-dimensional points selects the points that are not dominated by any other point on all dimensions. Therefore, it provides a minimal set of candidates for the users to make their personal trade-off among all optimal solutions. The existing algorithms establish both the worst case complexity by discarding distributions and the average case complexity by assuming dimensional independence. However, the data in the real world is more likely to be anti-correlated. The cardinality and complexity analysis on dimensionally independent data is meaningless when dealing with anti-correlated data. Furthermore, the performance of the existing algorithms becomes impractical on anti-correlated data. In this paper, we establish a cardinality model for anti-correlated distributions. We propose an accurate polynomial estimation for the expected value of the skyline cardinality. Because the high skyline cardinality degrades the performance of most existing algorithms on anti-correlated data, we further develop a determination and elimination framework which extends the well-adopted elimination strategy. It achieves remarkable effectiveness and efficiency. The comprehensive experiments on both real datasets and benchmark synthetic datasets demonstrate that our approach significantly outperforms the state-of-the-art algorithms under a wide range of settings.
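
The operator itself is easy to state; a brute-force Python sketch (assuming smaller values are better on every dimension) also shows why anti-correlated data is the hard case: few points dominate each other, so the skyline stays large.

    def dominates(p, q):
        # p dominates q: at least as good everywhere, strictly better once.
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))

    def skyline(points):
        return [p for p in points
                if not any(dominates(q, p) for q in points if q != p)]

    # Anti-correlated hotels (cheap <-> far away): almost all survive.
    hotels = [(50, 9), (60, 7), (70, 5), (80, 3), (90, 1), (95, 8)]
    print(skyline(hotels))  # only (95, 8) is dominated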

pdftxt

Permuting Data on Random-Access Block Storage

Risi Thonangi (Duke University), Jun Yang (Duke University)

Permutation is a fundamental operator for array data, with applications in, for example, changing matrix layouts and reorganizing data cubes. We consider the problem of permuting large quantities of data stored on secondary storage supporting fast random block accesses, such as solid state drives and distributed key-value stores. Faster random accesses open up interesting new opportunities for permutation. While external merge sort has often been used for permutation, it is an overkill that fails to exploit the property of permutation fully and carries unnecessary overhead in storing and comparing keys. We propose faster algorithms with lower memory requirements for a large, useful class of permutations. We also tackle practical challenges that traditional permutation algorithms have not dealt with, such as exploiting random block accesses more aggressively, considering the cost asymmetry between reads and writes, and handling arbitrary data dimension sizes (as opposed to perfect powers often assumed by previous work). As a result, our algorithms are faster and more broadly applicable.

pdftxt

A Comparison of Knives for Bread Slicing

Alekh Jindal (MIT), Endre Palatinus (Saarland University), Vladimir Pavlov (Saarland University), Jens Dittrich (Saarland University)

Vertical partitioning is a crucial step in physical database design in row-oriented databases. A number of vertical partitioning algorithms have been proposed over the last three decades for a variety of niche scenarios. In principle, the underlying problem remains the same: decompose a table into one or more vertical partitions. However, it is not clear how good different vertical partitioning algorithms are in comparison to each other. In fact, it is not even clear how to experimentally compare different vertical partitioning algorithms. In this paper, we present an exhaustive experimental study of several vertical partitioning algorithms. We categorize vertical partitioning algorithms along three dimensions. We survey six vertical partitioning algorithms and discuss their pros and cons. We identify the major differences in the use-case settings for different algorithms and describe how to make an apples-to-apples comparison of different vertical partitioning algorithms under the same setting. We propose four metrics to compare vertical partitioning algorithms. We show experimental results from the TPC-H and SSB benchmark and present four key lessons learned: (1) we can do four orders of magnitude less computation and still find the optimal layouts, (2) the benefits of vertical partitioning depend strongly on the database buffer size, (3) HillClimb is the best vertical partitioning algorithm, and (4) vertical partitioning for TPC-H-like benchmarks can improve over column layout by only up to 5%.

pdftxt

Sharing Data and Work Across Concurrent Analytical Queries

Iraklis Psaroudakis (EPFL), Manos Athanassoulis (EPFL), Anastasia Ailamaki (EPFL)

Today's data deluge enables organizations to collect massive data, and analyze it with an ever-increasing number of concurrent queries. Traditional data warehouses (DW) face a challenging problem in executing this task, due to their query-centric model: each query is optimized and executed independently. This model results in high contention for resources. Thus, modern DW depart from the query-centric model to execution models involving sharing of common data and work. Our goal is to show when and how a DW should employ sharing. We evaluate experimentally two sharing methodologies, based on their original prototype systems, that exploit work sharing opportunities among concurrent queries at run-time: Simultaneous Pipelining (SP), which shares intermediate results of common sub-plans, and Global Query Plans (GQP), which build and evaluate a single query plan with shared operators. First, after a short review of sharing methodologies, we show that SP and GQP are orthogonal techniques. SP can be applied to shared operators of a GQP, reducing response times by 20%-48% in workloads with numerous common sub-plans. Second, we corroborate previous results on the negative impact of SP on performance for cases of low concurrency. We attribute this behaviour to a bottleneck caused by the push-based communication model of SP. We show that pull-based communication for SP eliminates the overhead of sharing altogether for low concurrency, and scales better on multi-core machines than push-based SP, further reducing response times by 82%-86% for high concurrency. Third, we perform an experimental analysis of SP, GQP and their combination, and show when each one is beneficial. We identify a trade-off between low and high concurrency. In the former case, traditional query-centric operators with SP perform better, while in the latter case, GQP with shared operators enhanced by SP give the best results.

pdftxt

Efficient Implementation of Generalized Quantification in Relational Query Languages

Antonio Badia (University of Louisville), Bin Cao (Teradata Inc.)

We present research aimed at improving our understanding of the use and implementation of quantification in relational query languages in general and SQL in particular. In order to make our results as general as possible, we use the framework of Generalized Quantification. Generalized Quantifiers (GQs) are high-level, declarative logical operators that in the past have been studied from a theoretical perspective. In this paper we focus on their practical use, showing how to incorporate a dynamic set of GQs in relational query languages, how to implement them efficiently and use them in the context of SQL. We present experimental evidence of the performance of the approach, showing that it improves over traditional (relational) approaches.


Demo C: From Data Collection to Analysis

Location: Room Stampa


pdf

NADEEF: A Generalized Data Cleaning System

Amr Ebaid (Purdue University), Ahmed Elmagarmid (QCRI), Ihab Ilyas (QCRI), Mourad Ouzzani (QCRI), Jorge-Arnulfo Quiane-Ruiz (QCRI), Nan Tang (QCRi), Si Yin (QCRI)

pdf

RecDB in Action: Recommendation Made Easy in Relational Databases

Mohamed Sarwat (University of Minnesota), James Avery (University of Minnesota), Mohamed Mokbel (University of Minnesota)

pdf

Graph Queries in a Next-Generation Datalog System

Alexander Shkapsky (UCLA), Kai Zeng (UCLA), Carlo Zaniolo (UCLA)

pdf

Lazy ETL in Action: ETL Technology Dates Scientific Data

Yağız Kargın (CWI), Milena Ivanova (Netherlands eScience Center), Stefan Manegold (CWI), Martin Kersten (CWI), Ying Zhang (CWI)

pdf

Scolopax: Exploratory Analysis of Scientific Data

Alper Okcan (Northeastern University), Mirek Riedewald, Biswanath Panda, Daniel Fink

pdf

PROPOLIS: Provisioned Analysis of Data-Centric Processes

Daniel Deutch (Ben Gurion university), Yuval Moskovitch (Ben Gurion University), Val Tannen (University of Pennsylvania)

pdf

Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System

Pradap Konda (University of Wisconsin-Madison), Arun Kumar (University of Wisconsin-Madison), Christopher Re (University of Wisconsin-Madison), Vaishnavi Sashikanth (Oracle)

pdf

PLASMA-HD: Probing the LAttice Structure and MAkeup of High-dimensional Data

David Fuhry (The Ohio State University), Yang Zhang (The Ohio State University), Venu Satuluri (Twitter), Arnab Nandi (The Ohio State University), Srinivasan Parthasarathy (The Ohio State University)

pdf

IBminer: A Text Mining Tool for Constructing and Populating InfoBox Databases and Knowledge Bases

Hamid Mousavi (UCLA), Shi Gao (UCLA), Carlo Zaniolo (UCLA)

pdf

Mining and Linking Patterns across Live Data Streams and Stream Archives

Di Yang (WPI), Kaiyu Zhao (WPI), Maryam Hasan (WPI), Hanyuan Lu (WPI), Elke Rundensteiner (WPI), Matthew Ward (WPI)

pdf

User Analytics with UbeOne: Insights into Web Printing

Georgia Koutrika (HP Labs), Qian Lin (HP Labs), Jerry Liu (HP Labs)



Thursday Aug 29th 15:30-17:00

Research 15: Data Integration

Location: Room 1000A

Chair: Amelie Marian (Rutgers University)


pdftxt

Actively Soliciting Feedback for Query Answers in Keyword Search-Based Data Integration

Zhepeng Yan (University of Pennsylvania), Nan Zheng (University of Pennsylvania), Zachary Ives (University of Pennsylvania), Partha Talukdar (Carnegie Mellon University), Cong Yu (Google Research)

The problem of scaling up data integration, such that new sources can be quickly utilized as they are discovered, remains elusive: global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema, and instead to develop keyword search-based data integration --- where the system lazily discovers associations enabling it to join together matches to keywords, and return ranked results. The user is expected to understand the data domain and provide feedback about answers' quality. The system generalizes such feedback to learn how to correctly integrate data. A major open challenge is that under this model, the user only sees and offers feedback on a few top-k results: this result set must be carefully selected to include answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems merely focus on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper we show how to predict the uncertainty associated with a query result's score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.

pdftxt

Query Processing under GLAV Mappings for Relational and Graph Databases

Diego Calvanese (Free Univ. of Bozen-Bolzano), Giuseppe De Giacomo (Sapienza Università di Roma), Maurizio Lenzerini (Sapienza Università di Roma), Moshe Vardi (Rice University)

Schema mappings establish a correspondence between data stored in two databases, called source and target respectively. Query processing under schema mappings has been investigated extensively in the two cases where each target atom is mapped to a query over the source (called GAV, global-as-view), and where each source atom is mapped to a query over the target (called LAV, local-as-view). The general case, called GLAV, in which queries over the source are mapped to queries over the target, has attracted a lot of attention recently, especially for data exchange. However, query processing for GLAV mappings has been considered only for the basic service of query answering, and mainly in the context of conjunctive queries (CQs) in relational databases. In this paper we study query processing for GLAV mappings in a wider sense, considering not only query answering, but also query rewriting, perfectness (the property of a rewriting to compute exactly the certain answers), and query containment relative to a mapping. We deal both with the relational case, and with graph databases, where the basic querying mechanism is that of regular path queries. Query answering in GLAV can be smoothly reduced to a combination of the LAV and GAV cases, and for CQs this reduction can be exploited also for the remaining query processing tasks. In contrast, as we show, GLAV query processing for graph databases is non-trivial and requires new insights and techniques. We obtain upper bounds for answering, rewriting, and perfectness, and show decidability of relative containment.
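
Concretely, a GLAV assertion relates a conjunctive query over the source to a conjunctive query over the target; a textbook-style relational example (schema invented for illustration) is

    \[
    \forall x, y \;\big(\mathit{Emp}(x) \wedge \mathit{WorksIn}(x,y)\big) \rightarrow \exists z \;\big(\mathit{Staff}(x,z) \wedge \mathit{Dept}(z,y)\big).
    \]

GAV restricts the right-hand side to a single target atom, LAV restricts the left-hand side to a single source atom, and GLAV subsumes both by allowing conjunctive queries on either side.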

Discovering Linkage Points over Web Data

Oktie Hassanzadeh (IBM Research), Ken Pu (UOIT), Soheil Hassas Yeganeh (University of Toronto), Renee Miller (University of Toronto), Lucian Popa (IBM Research), Mauricio Hernandez (IBM Research), Howard Ho (IBM Research)

Large-scale integration of heterogeneous data sources is a challenging problem, and has been the topic of extensive research for many years. A basic step in integration is the identification of linkage points, i.e., finding the attributes that are shared (or related) between the data sources, and that can be used to match records or entities across the sources. This is usually performed using a match operator that associates the schema elements of one database to another. However, the massive growth in the amount of unstructured and semi-structured data in data warehouses and on the Web has created new challenges for this task. Such data sources often do not have a fixed pre-defined schema and contain large numbers of diverse attributes. Furthermore, the end goal is not schema alignment as these schemas may be too heterogeneous (and dynamic) to meaningfully align. Rather, the goal is to align any overlapping data shared by these sources. We will show that even attributes with different meanings (that would not qualify as schema matches) can sometimes be useful in aligning data. The solution we propose in this paper replaces the basic schema-matching step with a more complex instance-based schema analysis and linkage discovery. We present a framework consisting of a library of efficient lexical analyzers and similarity functions, and a set of search algorithms for effective and efficient identification of linkage points over Web data. We experimentally evaluate the effectiveness of our proposed algorithms in real-world integration scenarios in several domains.
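
The instance-based analysis can be pictured with a toy sketch (hypothetical, and far simpler than the paper's analyzer library): run each attribute's values through a lexical analyzer to obtain a signature, then rank cross-source attribute pairs by signature overlap.

    from itertools import product

    def analyze(values):
        """A trivial lexical analyzer: lower-case and strip instance values.
        The paper employs a whole library of analyzers and similarity functions."""
        return {v.strip().lower() for v in values if v}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def linkage_candidates(src_attrs, tgt_attrs, threshold=0.3):
        """src_attrs / tgt_attrs: dicts mapping attribute name -> instance values.
        Returns (score, src_attr, tgt_attr) triples whose value overlap
        suggests a linkage point, best first."""
        sigs_s = {a: analyze(vs) for a, vs in src_attrs.items()}
        sigs_t = {a: analyze(vs) for a, vs in tgt_attrs.items()}
        pairs = ((jaccard(sigs_s[a], sigs_t[b]), a, b)
                 for a, b in product(sigs_s, sigs_t))
        return sorted((p for p in pairs if p[0] >= threshold), reverse=True)

Note how this deliberately ignores attribute names: as the abstract argues, even attributes with different meanings can align overlapping data.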

Reducing Uncertainty of Schema Matching via Crowdsourcing

Chen Zhang (HKUST), Lei Chen (Hong Kong University of Science and Technology), Hosagrahar Jagadish (University of Michigan), Chen Cao (HKUST)

Schema matching is a central challenge for data integration systems. Automated tools are often uncertain about the schema matchings they suggest, and this uncertainty is inherent since it arises from the inability of the schema to fully capture the semantics of the represented data. Human common sense can often help. Inspired by the popularity and the success of easily accessible crowdsourcing platforms, we explore the use of crowdsourcing to reduce the uncertainty of schema matching. Since it is typical to ask simple questions on crowdsourcing platforms, we assume that each question, called a Correspondence Correctness Question (CCQ), asks the crowd to decide whether a given correspondence should exist in the correct matching. We propose frameworks and efficient algorithms to dynamically manage the CCQs, in order to maximize the uncertainty reduction within a limited budget of questions. We develop two novel approaches, namely “Single CCQ” and “Multiple CCQ”, which adaptively select, publish and manage the questions. We verify the value of our solutions through simulations and a real implementation.
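
As a rough illustration of the single-question case (a guess at the mechanics, not the paper's algorithm), one can model each candidate correspondence by its probability of being correct and always ask the crowd about the correspondence whose answer removes the most entropy:

    import math

    def entropy(p):
        """Binary entropy of a correspondence being correct with probability p."""
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def next_ccq(correspondences):
        """correspondences: dict mapping correspondence id -> P(correct).
        The most uncertain correspondence (p near 0.5) carries the most
        entropy, so a CCQ about it yields the largest expected reduction."""
        return max(correspondences, key=lambda c: entropy(correspondences[c]))

    # e.g. next_ccq({"title~name": 0.9, "addr~street": 0.55}) -> "addr~street"

Managing several outstanding CCQs additionally requires adapting to answers as they arrive, which is what the "Multiple CCQ" approach addresses.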

Less is More: Selecting Sources Wisely for Integration

Xin Luna Dong (Google Inc.), Barna Saha (AT&T Labs-Research), Divesh Srivastava (AT&T Labs-Research)

We are often thrilled by the abundance of information surrounding us and wish to integrate data from as many sources as possible. However, understanding, analyzing, and using these data are often hard. Too much data can introduce a huge integration cost, such as expenses for purchasing data and resources for integration and cleaning. Furthermore, including low-quality data can even deteriorate the quality of integration results instead of bringing the desired quality gain. Thus, “the more the better” does not always hold for data integration and often “less is more”. In this paper, we study how to select a subset of sources before integration such that we can balance the quality of integrated data and integration cost. Inspired by the Marginalism principle in economic theory, we wish to integrate a new source only if its marginal gain, often a function of improved integration quality, is higher than the marginal cost, associated with data-purchase expense and integration resources. As a first step towards this goal, we focus on data fusion tasks, where the goal is to resolve conflicts from different sources. We propose a randomized solution for selecting sources for fusion and show empirically its effectiveness and scalability on both real-world data and synthetic data.
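
The Marginalism principle translates directly into a stopping rule. A hedged sketch (hypothetical gain and cost functions, and a greedy loop rather than the paper's randomized algorithm):

    def select_sources(sources, gain, cost):
        """Greedily add the source with the best marginal gain and stop as
        soon as that gain no longer exceeds the source's marginal cost.
        gain(S): integration quality achieved by the set of sources S;
        cost(s): purchase/integration cost of a single source s."""
        selected, remaining = [], list(sources)
        while remaining:
            base = gain(selected)
            best = max(remaining, key=lambda s: gain(selected + [s]) - base)
            if gain(selected + [best]) - base <= cost(best):
                break  # marginal gain no longer justifies marginal cost
            selected.append(best)
            remaining.remove(best)
        return selected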


Tutorial 6

Location: Room 1000B

Chair: Serge Abiteboul (INRIA)


Mobility and Social Networking: A Data Management Perspective

Mohamed F. Mokbel (University of Minnesota), Mohamed Sarwat (University of Minnesota)

Online social networks, such as Facebook and Twitter, have become very popular in the past decade. Users register to online social networks in order to keep in touch with their friends and family, learn about their news, get recommendations from them, and engage in online social events. As mobile devices (e.g., smart phones, GPS devices) became ubiquitous, location-based social networking services (e.g., Foursquare and Facebook Places) are getting more and more popular. For instance, as of September 2012, Foursquare claimed over 25 million users worldwide and billions of check-ins, with millions more every day. Users in a location-based social network are associated with a geo-location, and might alert friends when visiting a venue (e.g., restaurant, bar) by checking in on their mobile phones (e.g., iPhone, Android). The rise of location-based social networking applications has brought social networking and mobility together, giving rise to new research challenges and opportunities. This tutorial presents the state-of-the-art research that lies at the intersection of the two: social networking and mobility. Data management research in social networking is mainly concerned with managing users' social interactions and collaboration, storing/retrieving social media (e.g., microblogs, news feeds), and analyzing user behavior. Data management research in mobility focuses on handling users' geospatial locations and contextual information. This tutorial takes an overarching approach by surveying the research that combines social networking and mobility from four different perspectives: (1) microblog search and social news feed queries, (2) recommendation services, (3) crowdsourcing, and (4) social media visualization. We finally highlight the risks and threats (e.g., privacy) that result from combining mobility and social networking, and we conclude the tutorial by summarizing and presenting some open research directions.

Bio: Mohamed F. Mokbel (Ph.D., Purdue University; M.S., B.Sc., Alexandria University) is an associate professor in the Department of Computer Science and Engineering, University of Minnesota. His current research interests focus on providing database and platform support for spatio-temporal data, location-based services 2.0, personalization, and recommender systems. His research work has been recognized by four best paper awards at IEEE MASS 2008, IEEE MDM 2009, SSTD 2011, and the ACM MobiGIS Workshop 2012, and by the NSF CAREER award 2010. Mohamed is/was general co-chair of SSTD 2011, and program co-chair of ACM SIGSPATIAL GIS 2008-2010 and MDM 2011 and 2014. He has served on the editorial boards of the IEEE Data Engineering Bulletin, the Distributed and Parallel Databases Journal, and the Journal of Spatial Information Science. Mohamed is an ACM and IEEE member and a founding member of ACM SIGSPATIAL. For more information, please visit: www.cs.umn.edu/~mokbel

Bio: Mohamed Sarwat is a doctoral candidate in the Department of Computer Science and Engineering, University of Minnesota. He obtained his Bachelor's degree in computer engineering from Cairo University in 2007 and his Master's degree in computer science from the University of Minnesota in 2011. His research interest lies in the broad area of data management systems. More specifically, his interests include database support for recommender systems, personalized databases, location-based services, and social networking applications, as well as distributed graph databases and large-scale data management. Mohamed was awarded the University of Minnesota Doctoral Dissertation Fellowship in 2012. His ICDE 2012 paper was selected for the TKDE special issue on Best Papers of ICDE 2012. His research work has been recognized by the Best Paper Award at the International Symposium on Spatial and Temporal Databases (SSTD) 2011. For more details, please visit: http://www-users.cs.umn.edu/~sarwat/


Research 16: Concurrency and Query Processing

Location: Room 300

Chair: Wolfgang Lehner (Dresden University of Technology)


Towards Predicting Query Execution Time for Concurrent and Dynamic Database Workloads

Wentao Wu (University of Wisconsin-Madison), Yun Chi (NEC Laboratories America), Hakan Hacigumus (NEC Laboratories America), Jeffrey Naughton (University of Wisconsin-Madison)

Predicting query execution time is crucial for many database management tasks including admission control, query scheduling, and progress monitoring. While a number of recent papers have explored this problem, the bulk of the existing work either considers prediction for a single query, or prediction for a static workload of concurrent queries, where by "static" we mean that the queries to be run are fixed and known. In this paper, we consider the more general problem of dynamic concurrent workloads. Unlike most previous work on query execution time prediction, our proposed framework is based on analytic modeling rather than machine learning. We first use the optimizer's cost model to estimate the I/O and CPU requirements for each pipeline of each query in isolation, and then use a combined queueing model and buffer pool model that merges the I/O and CPU requests from concurrent queries to predict running times. We compare the proposed approach with a machine-learning based approach that is a variant of previous work. Our experiments show that our analytic-model based approach can lead to competitive and often better prediction accuracy than its machine-learning based counterpart.
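
A toy rendering of the analytic idea (assumed formulas; the paper's combined queueing and buffer pool model is considerably more detailed): take a query's isolated CPU and I/O demands from the optimizer's cost model and stretch each by the contention on the corresponding shared resource.

    def predict_runtime(cpu_secs, io_secs, rho_cpu, rho_io):
        """cpu_secs / io_secs: the query's CPU and I/O demand in isolation,
        as estimated by the optimizer's cost model.
        rho_cpu / rho_io: utilization (0 <= rho < 1) of each resource by the
        concurrent mix. Each second of demand is inflated by the classic
        M/M/1 slowdown factor 1 / (1 - rho)."""
        assert 0.0 <= rho_cpu < 1.0 and 0.0 <= rho_io < 1.0
        return cpu_secs / (1.0 - rho_cpu) + io_secs / (1.0 - rho_io)

    # e.g. a query needing 2s of CPU and 3s of I/O on a half-loaded system:
    # predict_runtime(2, 3, 0.5, 0.5) -> 10.0 seconds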

Lightweight Locking For Main Memory Database Systems

Kun Ren (Yale University & Northwestern Polytechnical University, China), Alexander Thomson (Yale University), Daniel Abadi (Yale University)

Locking is widely used as a concurrency control mechanism in database systems. As more OLTP databases are stored mostly or entirely in memory, transactional throughput is less and less limited by disk IO, and lock managers increasingly become performance bottlenecks. In this paper, we introduce very lightweight locking (VLL), an alternative approach to pessimistic concurrency control for main-memory database systems that avoids almost all overhead associated with traditional lock manager operations. We also propose a protocol called selective contention analysis (SCA), which enables systems implementing VLL to achieve high transactional throughput under high contention workloads. We implement these protocols both in a traditional single-machine multi-core database server setting and in a distributed database where data is partitioned across many commodity machines in a shared-nothing cluster. Our experiments show that VLL dramatically reduces locking overhead and thereby increases transactional throughput in both settings.
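
The core of VLL, as the abstract sketches it, is to replace the lock manager's state with counters embedded in each record. A simplified single-machine sketch (assumed field names; SCA and the transaction queue are omitted):

    from dataclasses import dataclass

    @dataclass
    class Record:
        value: object = None
        cx: int = 0  # transactions requesting an exclusive lock on this record
        cs: int = 0  # transactions requesting a shared lock on this record

    def request_locks(write_set, read_set):
        """Bump the counters for every record the transaction touches
        (atomically, in the real system). The transaction is 'free' to run
        immediately only if it is first in line everywhere."""
        free = True
        for rec in write_set:
            rec.cx += 1
            if rec.cx > 1 or rec.cs > 0:
                free = False  # someone is queued ahead of us
        for rec in read_set:
            rec.cs += 1
            if rec.cx > 0:
                free = False
        return free

    def release_locks(write_set, read_set):
        for rec in write_set:
            rec.cx -= 1
        for rec in read_set:
            rec.cs -= 1

Blocked transactions wait until the counters show no conflicting predecessor; deciding cheaply when that has happened is where selective contention analysis comes in.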

Supporting User-Defined Functions on Uncertain Data

Thanh Tran (University of Massachusetts Amherst), Yanlei Diao (University of Massachusetts Amherst), Charles Sutton (University of Edinburgh), Anna Liu (University of Massachusetts Amherst)

Uncertain data management has become crucial in many sensing and scientific applications. As user-defined functions (UDFs) become widely used in these applications, an important task is to capture result uncertainty for queries that evaluate UDFs on uncertain data. In this work, we provide a general framework for supporting UDFs on uncertain data. Specifically, we propose a learning approach based on Gaussian processes (GPs) to compute approximate output distributions of a UDF when evaluated on uncertain input, with guaranteed error bounds. We also devise an online algorithm to compute such output distributions, which employs a suite of optimizations to improve accuracy and performance. Our evaluation using both real-world and synthetic functions shows that our proposed GP approach can outperform the state-of-the-art sampling approach with up to two orders of magnitude improvement for a variety of UDFs.
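
One way to picture the approach (a bare Monte Carlo sketch, assuming `gp` is a surrogate already fitted to observed input/output pairs of the UDF, e.g. scikit-learn's GaussianProcessRegressor): push samples of the uncertain input through the surrogate's predictive distribution and aggregate.

    import numpy as np

    def udf_output_distribution(gp, mean, std, n_samples=1000, seed=0):
        """Approximate a UDF's output distribution for a Gaussian-distributed
        scalar input. gp: any fitted regressor exposing
        predict(X, return_std=True), acting as the GP surrogate of the UDF."""
        rng = np.random.default_rng(seed)
        xs = rng.normal(mean, std, size=(n_samples, 1))  # sample the input
        mu, sigma = gp.predict(xs, return_std=True)      # GP predictive moments
        ys = rng.normal(mu, sigma)                       # sample the outputs
        return ys.mean(), ys.std()

Unlike this offline sampler, the paper's method is an online algorithm with guaranteed error bounds, but the input/output picture is the same.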

On Scaling Up Sensitive Data Auditing

Yupeng Fu (University of California, San Diego), Raghav Kaushik (Microsoft Corporation), Ravi Ramamurthy (Microsoft Corporation)

This paper studies the following problem: given (1) a query and (2) a set of sensitive records, find the subset of records "accessed" by the query. The notion of a query accessing a single record is adopted from prior work. There are several scenarios where the number of sensitive records is large (in the millions). The novel challenge addressed in this work is to develop a general-purpose solution for complex SQL that scales in the number of sensitive records. We propose efficient techniques that improve upon straightforward alternatives by orders of magnitude. Our empirical evaluation over the TPC-H benchmark data illustrates the benefits of our techniques.
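
The straightforward baseline that the paper scales past can be pictured as follows (a naive sketch using an instance-based notion of access, namely that a record is "accessed" if deleting it changes the query's answer; the f-string table/column interpolation is for illustration only):

    import sqlite3

    def accessed_records(conn, query, table, key_col, sensitive_keys):
        """Brute force: re-run the query once per sensitive record with that
        record deleted, and flag the record if the answer changes. The cost
        grows linearly with the number of sensitive records, which is exactly
        what a scalable auditing technique must avoid."""
        conn.isolation_level = None  # manual transaction control
        baseline = conn.execute(query).fetchall()
        hits = []
        for key in sensitive_keys:
            conn.execute("BEGIN")
            conn.execute(f"DELETE FROM {table} WHERE {key_col} = ?", (key,))
            changed = conn.execute(query).fetchall() != baseline
            conn.execute("ROLLBACK")
            if changed:
                hits.append(key)
        return hits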

On the Complexity of Query Result Diversification

Ting Deng (Beihang University), Wenfei Fan (University of Edinburgh)

Query result diversification is a bi-criteria optimization problem for ranking query results. Given a database D, a query Q and a positive integer k, the goal is to find a set of k tuples from Q(D) such that the tuples are as relevant as possible to the query and, at the same time, as diverse as possible from each other. Subsets of Q(D) are ranked by an objective function defined in terms of relevance and diversity. Query result diversification has found a variety of applications in databases, information retrieval and operations research. This paper studies the complexity of result diversification for relational queries. We identify three problems in connection with query result diversification: to determine whether there exists a set of k tuples that is ranked above a bound with respect to relevance and diversity, to assess the rank of a given k-element set, and to count how many k-element sets are ranked above a given bound. We study these problems for a variety of query languages and for three objective functions. We establish upper and lower bounds for these problems, all matching, for both combined complexity and data complexity. We also investigate several special settings of these problems, identifying tractable cases.
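
For concreteness, a frequently used shape for such an objective function (one of several in the literature; the paper analyzes three) combines relevance with pairwise distance, shown here with the usual greedy heuristic for contrast with the paper's exact complexity results:

    def greedy_diversify(tuples, k, rel, dist, lam=0.5):
        """Select k tuples from a query result, trading relevance for diversity.
        rel(t): relevance of tuple t to the query; dist(a, b): distance between
        two tuples; lam in [0, 1] weights diversity against relevance."""
        chosen, candidates = [], list(tuples)
        while candidates and len(chosen) < k:
            def score(t):
                diversity = sum(dist(t, c) for c in chosen)
                return (1 - lam) * rel(t) + lam * diversity
            best = max(candidates, key=score)
            chosen.append(best)
            candidates.remove(best)
        return chosen

Deciding whether any k-element set beats a given bound on such an objective is the first of the three decision problems the paper studies.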


Demo D: Queries and Interfaces

Location: Room Stampa


Senbazuru: A Prototype Spreadsheet Database Management System

Shirley Zhe Chen (University of Michigan), Mike Cafarella (University of Michigan), Jun Chen (University of Michigan), Daniel Prevo, Junfeng Zhuang (University of Michigan)

ReqFlex: Fuzzy Queries for Everyone

Grégory Smits (IRISA-University of Rennes 1), Olivier Pivert (IRISA-University of Rennes 1), Thomas Girault (Freelance Engineer)

Comprehensive and Interactive Temporal Query Processing with SAP HANA

Martin Kaufmann (ETH Zurich), Panagiotis Vagenas (ETH Zurich), Peter Fischer (Albert-Ludwigs-Universität Freiburg, Germany), Donald Kossmann (ETH Zurich), Franz Färber (SAP AG)

Functions Are Data Too (Defunctionalization for PL/SQL)

Torsten Grust (Universität Tübingen, Germany), Nils Schweinsberg (Universität Tübingen, Germany), Alexander Ulrich (Universität Tübingen, Germany)

QUEST: A Keyword Search System for Relational Data based on Semantic and Machine Learning Techniques

Sonia Bergamaschi, Francesco Guerra, and Matteo Interlandi (Università di Modena e Reggio Emilia), Raquel Trillo-Lado (Universidad de Zaragoza), Yannis Velegrakis (Università di Trento)

ROSeAnn: Reconciling Opinions of Semantic Annotators

Luying Chen (Oxford), Stefano Ortona (Oxford), Giorgio Orsi (Oxford), Michael Benedikt (Oxford)

SkySuite: A Framework of Skyline-Join Operators for Static and Stream Environments

Mithila Nagendra (Arizona State University), K. Selcuk Candan (Arizona State University)

MASTRO STUDIO: Managing Ontology-Based Data Access applications

Cristina Civili (Sapienza University of Rome), Marco Console (Sapienza University of Rome), Giuseppe De Giacomo (Sapienza Università di Roma), Domenico Lembo (Sapienza University of Rome), Maurizio Lenzerini (Sapienza Università di Roma), Lorenzo Lepore (Sapienza University of Rome), Riccardo Mancini (Sapienza University of Rome), Antonella Poggi (Sapienza University of Rome), Riccardo Rosati (Sapienza University of Rome), Marco Ruzzi (Sapienza University of Rome), Valerio Santarelli (Sapienza University of Rome), Domenico Fabio Savo (Sapienza University of Rome)

PAQO: A Preference-Aware Query Optimizer for PostgreSQL

Nicholas L. Farnan (University of Pittsburgh), Adam J. Lee (University of Pittsburgh), Panos K. Chrysanthis (University of Pittsburgh), Ting Yu (North Carolina State University & Qatar Computing Research Institute)

eSkyline: Processing Skyline Queries over Encrypted Data

Suvarna Bothe (Rutgers University), Panagiotis Karras (Rutgers University), Akrivi Vlachou (NTNU)

GestureQuery: A Multitouch Database Query Interface

Lilong Jiang (The Ohio State University), Michael Mandel (The Ohio State University), Arnab Nandi (The Ohio State University)

Complete Approximations of Incomplete Queries

Ognjen Savkovic (Free University of Bozen-Bolzano), Paramita Mirza (Fondazione Bruno Kessler), Alex Tomasi (Free University of Bozen-Bolzano), Werner Nutt (Free University of Bozen-Bolzano)

POIKILO: A Tool for Evaluating the Results of Diversification Models and Algorithms

Marina Drosou (University of Ioannina), Evaggelia Pitoura (University of Ioannina)



Friday Aug 30th 08:30-10:00

DBRank Keynote 1

Location: Room 1000B


Spatial Keyword Querying of Geo-Tagged Web Content

Christian S. Jensen (Aarhus University)


Phd Workshop 1: The Database Execution Engine

Location: Room 300


Why it is time for a HyPE: A Hybrid Query Processing Engine for Efficient GPU Coprocessing in DBMS

Sebastian Breß, University of Magdeburg

Storing and Processing Temporal Data in a Main Memory Column Store

Martin Kaufmann, ETH Zurich

Database Support for Unstructured Meshes

Alireza Rezaei Mahdiraji, Jacobs University


BD3 Invited Talk & Keynote

Location: Room Stampa


Streaming Balanced Partitioning of Massive Scale Graphs

Milan Vojnovic (MSR Cambridge)

Theory for Large-Scale Storage

Alex Dimakis (UT Austin)


DBPL Invited Talk 1 & Research 1

Location: Room Belvedere


(Invited Talk) Introducing Access Control in Webdamlog

Serge Abiteboul (Inria)

First-Class Functions for First-Order Database Engines

Torsten Grust (Universität Tübingen) and Alexander Ulrich (Universität Tübingen)


PersDB Keynote 1

Location: Room 100A


Serious games meets Adaptive Hypermedia: Integrating games into web-based e-learning systems

Maurice Hendrix (Serious Games Institute, Coventry University)

The use of games in education is becoming increasingly popular, as it can, in some cases, significantly improve learning outcomes over traditional methods (Knight et al., 2010). At the same time, a blended learning approach (Garrison and Vaughan, 2008), in which the use of games is combined with other technologies and classroom-based education, is gaining ground. Many institutions have adopted a Learning Management System (LMS), an online system for managing digital learning material, where course contents are uploaded and where students can log in, explore the course contents, take tests or engage in learning activities. While LMSs are useful for distance-education-only courses, they are also regularly used in conjunction with traditional classroom-based courses in a blended approach. A significant volume of research has been conducted into e-learning and a number of standards have emerged. However, games differ from traditional digital media such as texts in that they often bundle multiple learning objectives into a package coupled with gameplay mechanics and dynamics; as a result, the integration of games into e-learning standards and into systems such as LMSs is mostly limited to linking, leaving the job of blending them into the curriculum up to the teacher. Learning analytics (Ferguson, 2012) is an emerging trend. For educational games this means that the difficulty, environment, amount of guidance via non-player characters, etc., can be adapted to the learner's knowledge and learning style. It also means that data mining can be used to establish what learners really do in these games, and in the course in general, and whether there is a pattern that successful students have in common. This could then be used to improve the game, and the course as a whole. Advances we have made recently include re-use and re-purposing tools for educational games (Protopsaltis et al., 2011), integration of educational game authoring tools and adaptive hypermedia authoring tools (Hendrix et al., 2013), and direct integrations between games and LMSs via a JavaScript library and XML messages (Dunwell et al., 2011). However, a standard is clearly required. Some educational games take a different approach altogether. For example, the European-funded Mobile Assistance for Social Inclusion and Empowerment of Immigrants with Persuasive Learning Technologies and Social Network Services (MASELTOV) project seeks to provide both practical tools and learning services via mobile devices. The learning services include an educational game. However, unlike many other educational games, it is not part of a formal curriculum. Therefore, instead of integrating it into an LMS, it is distributed as a free-to-play game via the Google Play store (https://play.google.com/). A loose integration with other services, for example allowing users to gain points that can be spent on in-game upgrades, is achieved via web services. Deployment in this way has the potential to reach large audiences and to gather data about them, such as play times, locations, which areas of the game are proving problematic, and which other (MASELTOV) sources the user uses. The Play store allows games to be updated, so it is possible to respond to the results of data analysis with improved versions of the game.


SIMPDA Research 1

Location: Room 100B


Sequential Approaches for Predicting Business Process Outcome and Process Failure Warning

Mai Le, Detlef Nauck and Bogdan Gabrys

Graph-Based Business Process Model Refactoring

María Fernández-Ropero, Ricardo Pérez-Castillo and Mario Piattini

Studies on the Discovery of Declarative Control Flows from Error-prone Data

Claudio Di Ciccio and Massimo Mecella


SDM Keynote: (Actual timing 9:00-10:00)

Location: Room Meeting


To Cloud or not to? Musings at the intersection of Clouds, Security and Big Data

Radu Sion (Stony Brook University)

In this talk we explore the economics of cloud computing. We identify cost trade-offs and postulate the key principles of cloud outsourcing that define when cloud deployment is appropriate and why. The results may surprise and are especially interesting for understanding the cybersecurity aspects that impact the appeal of clouds. We outline and investigate some of the main research challenges in optimizing for these trade-offs. If you come to this talk, you are also very likely to find out exactly how many US dollars you need to spend to break your favorite cipher, or to send one of your bits over the network.


SSW Keynote & Teaser

Location: Room Presidenza


New tools for query answering on Semantic Data

Andrea Calì

A quick 3-minute teaser for each of the accepted papers



Friday Aug 30th 10:30-12:00

DBRank Research & Vision

Location: Room 1000B


A Thin Monitoring Layer for Top-k Aggregation Queries over a Database

Foteini Alvanaki & Sebastian Michel

Progressive Ranking Based on a Dominance List

Sadoun Isma, Yann Loyer & Karine Zeitouni

Wearable Queries: Adapting Common Retrieval Needs to Data and Users

Barbara Catania, Giovanna Guerrini, Alberto Belussi, Federica Mandreoli, Riccardo Martoglia & Wilma Penzo

Keyword Search and Evaluation over Relational Databases: an Outlook to the Future

Sonia Bergamaschi, Nicola Ferro, Francesco Guerra, Gianmaria Silvello


Phd Workshop 2 & Panel: NoSQL and Panel Discussion

Location: Room 300


Scalable Transactions across Heterogeneous NoSQL Key-Value Data Stores

Akon Dey, University of Sydney

(Panel) How to Have a Successful Career as a Database Researcher?


BD3 Research 1: Distributed Monitoring

Location: Room Stampa


Safe-Zones for Monitoring Distributed Streams

Daniel Keren (Haifa), Guy Sagy (Technion), Amir Abboud (Technion), Izchak Sharfman (Technion), Assaf Schuster (Technion), David Ben-David (Technion)

Communication-Efficient Distributed Online Prediction using Dynamic Model Synchronizations

Mario Boley (Fraunhofer IAIS), Izchak Sharfman (Technion), Daniel Keren (Haifa), Michael Kamp (Fraunhofer IAIS), Assaf Schuster (Technion)

Communication-efficient Outlier Detection for Scale-out Systems

Moshe Gabel (Technion), Daniel Keren (Haifa), Assaf Schuster (Technion)


DBPL Invited Talk 2 & Research 2

Location: Room Belvedere


(Invited talk) Bestarium vocabulum of NoSQL languages

Jérôme Siméon (IBM)

Managing Schema Evolution in NoSQL Data Stores

Stefanie Scherzinger (Regensburg University of Applied Sciences), Meike Klettke (University of Rostock), and Uta Störl (Darmstadt University of Applied Sciences)


PersDB Research 1: (Actual timing 10:30-11:40)

Location: Room 100A


Peckalytics: Analyzing Experts and Interests on Twitter

Alex Cheng, Nilesh Bansal, Nick Koudas

Recommendation by Examples

Rubi Boim, Tova Milo


SIMPDA Research 2

Location: Room 100B


Development of a knowledge base for enabling non-expert users to apply data mining algorithms

Roberto Espinosa, Diego García, Marta Zorrilla, Jose Zubcoff and Jose-Norberto Mazon

Using Semantic Lifting for improving Process Mining: a Data Loss Prevention System case study

Antonia Azzini, Ernesto Damiani, Francesco Zavatarelli and Chiara Braghin

Challenges of Applying Adaptive Processes to Enable Variability in Sustainability Data Collection

Gregor Grambow, Nicolas Mundbrod, Vivian Steller and Manfred Reichert


SDM Research 1

Location: Room Meeting


A Multi-Party Protocol for Privacy-Preserving Range Queries

Maryam Sepehri, Stelvio Cimato, and Ernesto Damiani

Secure Similar Document Detection with Simhash

Sahin Buyrukbilen (presenter) and Spiridon Bakiras

Query Log Attack on Encrypted Databases

Tahmineh Sanamrad and Donald Kossmann


SSW Research 1

Location: Room Presidenza


Discovering Attribute and Entity Synonyms for Knowledge Integration and Semantic Web Search

Hamid Mousavi, Shi Gao and Carlo Zaniolo

A Framework for Guided Search of Mashup Components

Michele Melchiori, Valeria De Antonellis and Devis Bianchini

Reasoning on the Web of Data

Andrea Calì, Stefano Capuzzi, Mirko Michele Dimartino, Riccardo Frosini



Friday Aug 30th 13:30-15:30

DBRank Keynote 2 & Invited Talk

Location: Room 1000B


(Keynote) Entwining Structure into Web Search

Stelios Paparizos (Microsoft Research)

On the Modelling of Ranking Algorithms in Probabilistic Datalog

Thomas Roelleke & Marco Bonzanini

Ranking and New Database Architectures

Justin Levandoski


Phd Workshop 3: Mining and Similarity Queries

Location: Room 300


Universal Indexing of Arbitrary Similarity Models

Tomas Bartos, Charles University in Prague

Realtime Analysis of Information Diffusion in Social Media

Io Taxidou, University of Freiburg

Mining Frequent Patterns with Differential Privacy

Luca Bonomi, Emory University

Efficiency and Security in Similarity Cloud Services

Stepan Kozak, Masaryk University


BD3 Research 2: CEP and Graphs

Location: Room Stampa


Elastic Complex Event Processing under Varying Query Load

Thomas Heinze (SAP AG), Yuanzhen Ji (SAP AG), Yinying Pan (SAP AG), Franz Josef Grüneberger (SAP AG), Zbigniew Jerzak (SAP AG), Christof Fetzer (TU Dresden)

Adaptive Selective Replication for Complex Event Processing Systems

Franz Josef Grüneberger (SAP AG), Thomas Heinze (SAP AG), Pascal Felber (Université de Neuchâtel)

Dynamic Partitioning of Big Hierarchical Graphs

Vasilis Spyropoulos (Athens University of Economics and Business), Yannis Kotidis (Athens University of Economics and Business)

Scalable and Robust Management of Dynamic Graph Data

Alan Labouseur (SUNY, Albany), Paul Olsen (SUNY, Albany), Jeong-Hyon Hwang (SUNY, Albany)


DBPL Invited Talk 3 & Research 3: (Actual timing 14:00-15:30)

Location: Room Belvedere


(Invited talk)

Søren Lassen (Facebook)

Declarative Ajax Web Applications through SQL++ on a Unified Application State

Yupeng Fu (UCSD), Kian Win Ong (UCSD), and Yannis Papakonstantinou (UCSD)


PersDB Keynote 2 & Research 2

Location: Room 100A


Two of the Many Faces of Ranking: Diversity and Time

Evaggelia Pitoura (University of Ioannina, Greece)

What makes a query result interesting besides relevance? This talk focuses on two of the many ranking criteria: (a) diversity and (b) change through time. Diversity has recently attracted a lot of attention as a means of increasing user satisfaction in recommendation systems, information retrieval and database queries. Diversification takes many forms, such as increasing novelty and covering different query aspects and user information needs. In our recent research, we have proposed efficient indexing techniques for supporting continuous diversification, as well as a new definition of diversity based on graph theory that provides a novel perspective on the problem. Then, I will touch upon our new work on capturing and querying the history of evolving social graphs through time.

Social Search Queries in Time

Georgia Koloniari, Kostas Stefanidis

Coping with the Persistent Coldstart Problem

Siarhei Bykau, Georgia Koutrika and Yannis Velegrakis


SIMPDA Research 3: (Actual timing 13:30-15:00)

Location: Room 100B


Enhancing the Case Handling Paradigm to Support Object-aware Processes

Carolina Ming Chiao, Vera Kuenzle and Manfred Reichert

Knowledge and Business Intelligence Technologies in Cross-Enterprise Environments for Italian Advanced Mechanical Industry

Francesco Arigliano and Paolo Ceravolo

On Process Rewriting for Business Process Security

Rafael Accorsi


SDM Research 2 & Vision 1: (Actual timing 16:00-17:00)

Location: Room Meeting


Privacy Implications of Privacy Settings and Tagging in Facebook

Stan Damen and Nicola Zannone

Big Security for Big Data: Addressing Security Challenges for the Big Data Infrastructure

Yuri Demchenko, Canh Ngo, Cees de Laat, Peter Membrey, Daniil Gordijenko

(Vision) Future of security research


SSW Research 2

Location: Room Presidenza


Supporting Semantic Web Search and Structured Queries on Mobile Devices

Andrea Dessi, Andrea Maxia, Maurizio Atzori and Carlo Zaniolo

Towards a folksonomy of Web APIs

Devis Bianchini

Automatic Web Spreadsheet Data Extraction

Zhe Chen and Michael Cafarella

Effectively and Efficiently Supporting Crowd-Enabled Databases via NoSQL Paradigms

Alfredo Cuzzocrea, Marcello Di Stefano, Paolo Fosci and Giuseppe Psaila



Friday Aug 30th 16:00-18:00

DBRank Poster

Location: Room 1000B


(Poster & Breakout)


Phd Workshop 4: Data Models

Location: Room 300


Automatic ontology based User Profile Learning from heterogeneous Web Resources in a Big Data Context

Anett Hoppe, Université de Bourgogne

Domain Specific Multistage Query Language for Medical Document Repositories

Aastha Madaan, University of Aizu

Getting Unique Solution in Data Exchange

Nhung Ngo, Free University of Bolzano

Fast Cartography for Data Explorers

Thibault Sellam, CWI


BD3 Research 3: Stream Processing

Location: Room Stampa


Towards Elastic Stream Processing: Patterns and Infrastructure

Kai-Uwe Sattler (TU Ilmenau), Felix Beier (TU Ilmenau)

Task Graphs of Stream Mining Algorithms

Sayaka Akioka (Meiji University)

Large-scale Online Mobility Monitoring with Exponential Histograms

Christine Kopp (Fraunhofer IAIS), Michael Mock (Fraunhofer IAIS), Odysseas Papapetrou (Technical Univ. of Crete), Michael May (Fraunhofer IAIS)

Multi-Stage Malicious Click Detection on Large Scale Web Advertising Data

Leyi Song (East China Normal University), Xueqing Gong (East China Normal University), Xiaofeng He (East China Normal University), Rong Zhang (East China Normal University), Aoying Zhou (East China Normal University)


DBPL Research 4: (Actual timing 16:00-17:30)

Location: Room Belvedere


Learning Schemas for Unordered XML

Radu Ciucanu (Inria) and Slawek Staworko (Inria)

Static Enforceability of XPath-Based Access Control Policies

James Cheney (University of Edinburgh)

XPath Satisfiability with Parent Axes or Qualifiers Is Tractable under Many of Real-World DTDs

Yasunori Ishihara (Osaka University), Nobutaka Suzuki (University of Tsukuba), Kenji Hashimoto (Nara Institute of Science and Technology), Shogo Shimizu (Gakushuin Women’s College), and Toru Fujiwara (Osaka University)


IFIP WG 2.6: Business Meeting

Location: Room 100B


SDM Vision 2 & Closing

Location: Room Meeting


(Vision) Future of security research


SSW Panel & Closing

Location: Room Presidenza


(Panel Session & Closing Remarks)