News
About
I am a postdoc at DIMA, Technical University of Berlin.
I am working with Volker Markl on Stratosphere.
I received my Ph.D from Aalborg
University, Denmark in 2011. My advisor was Christian S. Jensen. I was a visiting student at
the Computer Science Department at
the University of Maryland, USA
and an intern at
the Database
Group, Microsoft
Research, Redmond, USA. In 2007, I graduated from
the Department of
Electrical and Computer
Engineering, National
Technical University of Athens, Greece.
I have worked on research with
Christian S. Jensen,
Amol Deshpande,
Man Lung Yiu,
David Lomet, and
Timos Sellis.
My CV (in pdf) contains more details.
Publications
Fabian Hueske, Mathias Peters, Matthias J. Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, Kostas Tzoumas
Opening the Black Boxes in Data Flow Optimization
PVLDB 2012
[abstract]
Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful
of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions.
We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of non-relational data flows, a unique feature among today's systems.
Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, Volker Markl
Spinning Fast Iterative Data Flows
PVLDB 2012
[abstract]
Parallel dataflow systems are nowadays a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for today's systems. While certain types of iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit computational dependencies present in many algorithms, for example for graph analysis. As a result, these algorithms are inefficiently executed and have led to specialized systems based on paradigms other than data flows. We propose a form of incremental iteration for dataflows that is able to exploit the computational dependencies in many iterative problems. After showing how to integrate iterations into a dataflow system and its optimizer, we present an extension to the programming model that alleviates for the lack of mutable state and allows for exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.
Kostas Tzoumas, Amol Deshpande, Christian S. Jensen
Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions
PVLDB 2011
[pdf] [slides-pdf]
[abstract]
As a result of decades of research and industrial development, modern query optimizers are complex software artifacts. However, the quality of the query plan chosen by an optimizer is largely determined by the quality of the underlying statistical summaries. Small selectivity estimation errors, propagated exponentially, can lead to severely sub-optimal plans. Modern optimizers typically maintain one-dimensional statistical summaries and make the attribute value independence and join uniformity assumptions for efficiently estimating selectivities. Therefore, selectivity estimation errors in today's optimizers are frequently caused by missed correlations between attributes. We present a selectivity estimation approach that does not make the independence assumptions. By carefully using concepts from the field of graphical models, we are able to factor the joint probability distribution of all the attributes in the database into small, usually two dimensional distributions. We describe several optimizations that can make selectivity estimation highly efficient, and we present a complete implementation inside PostgreSQL's query optimizer. Experimental results indicate an order of magnitude better selectivity estimates, while keeping optimization time in the range of tens of milliseconds.
David Lomet, Kostas Tzoumas, Michael Zwilling
Implementing Performance Competitive Logical Recovery
PVLDB 2011
[pdf] [slides-pdf]
[abstract]
New hardware platforms, e.g. cloud, multi-core, etc., have led to a reconsideration of database system architecture. Our Deuteronomy project separates transactional functionality from data management functionality, enabling a flexible response to exploiting new platforms. This separation requires, however, that recovery is described logically. In this paper, we extend current recovery methods to work in this logical setting. While this is straightforward in principle, performance is an issue. We show how ARIES style recovery optimizations can work for logical recovery where page information is not captured on the log. In side-by-side performance experiments using a common log, we compare logical recovery with a state-of-the art ARIES style recovery implementation and show that logical redo performance can be competitive.
Kostas Tzoumas, Amol Deshpande, Christian S. Jensen
Sharing-Aware Horizontal Partitioning for Exploiting Correlations during Query Processing
PVLDB 2010 [pdf] [slides-pdf]
[abstract]
Optimization of join queries based on average selectivities is
sub-optimal in highly correlated databases. In such databases,
relations are naturally divided into partitions, each partition having
substantially different statistical characteristics. It is very
compelling to discover such data partitions during query optimization
and create multiple plans for a given query, one plan being optimal
for a particular combination of data partitions. This scenario calls
for the sharing of state among plans, so that common intermediate
results are not recomputed. We study this problem in a setting with a
routing-based query execution engine based on eddies. Eddies naturally
encapsulate horizontal partitioning and maximal state sharing across
multiple plans. We define the notion of a conditional join plan, a
novel representation of the search space that enables us to address
the problem in a principled way. We present a low-overhead greedy
algorithm that uses statistical summaries based on graphical
models. Experimental results suggest an one order of magnitude faster
execution time over traditional optimization for high correlations,
while maintaining the same performance for low correlations.
Man Lung Yiu, Leong Hou U, Simonas Saltenis, Kostas Tzoumas
Efficient Proximity Detection among Mobile Users via Self-Tuning Policies
PVLDB 2010
[pdf]
[abstract]
Given a distance threshold $\epsilon$, a set of users, and the friend
relationships among them, the {\em proximity-detection} problem is to
find each pair of friends such that the Euclidean distance between
them is smaller than $\epsilon$. This problem plays an essential role
in friend-locator applications and massively multiplayer online games.
To reduce the server load and the communication cost between the
server and the mobile clients, efficient proximity-detection
algorithms need to be developed. Existing proximity detection
solutions either incur substantial location update costs or their
performance does not scale well to a large number of users.
Motivated by this, we present a centralized proximity detection
solution that assigns each mobile client with a moving bounding box.
We then design a {\em self-tuning policy} to adjust the width of the
box automatically, in order to minimize communication cost. In
addition, we develop a cost model of our proposed solution and
present techniques for saving the computational resources of the
server. Extensive experiments suggest that our proposed solution is
efficient and robust with respect to various parameters.
Kostas Tzoumas, Man Lung Yiu, Christian S. Jensen
Workload-Aware Indexing of Continuously Moving Objects
PVLDB 2009
[pdf] [slides-pdf]
[abstract]
The increased deployment of sensors and data communication networks
yields data management workloads with update loads that are intense,
skewed, and highly bursty. Query loads resulting from location-based
services are expected to exhibit similar characteristics. In such
environments, index structures can easily become performance
bottlenecks. We address the need for indexing that is adaptive to the
workload characteristics, called workload-aware, in order to cover the
space in between maintaining an accurate index, and having no index at
all. Our proposal, QU-Trade, extends R-tree type indexing and achieves
workload-awareness by controlling the underlying index's filtering
quality. QU-Trade safely drops index updates, increasing the overlap
in the index when the workload is update-intensive, and it restores
the filtering capabilities of the index when the workload becomes
query-intensive. This is done in a non-uniform way in space so that
the quality of the index remains high in frequently queried regions,
while it deteriorates in frequently updated regions. The adaptation
occurs online, without the need for a learning phase. We apply
QU-Trade to the R-tree and the TPR-tree, and we offer analytical and
empirical studies. In the presence of substantial workload skew,
QU-Trade can achieve index update costs close to zero and can also
achieve virtually the same query cost as the underlying index.
Service
Organization and PC co-chair: DanaC 2012
PC member: SWEET 2012, PVLDB 2013
|