Marcus Leich

Research Associate at DIMA


Marcus Leich
TU Berlin
DIMA Group, Sekr. EN7
Einsteinufer 17
10587 Berlin

Email: marcus.leich∂
Twitter: @mleich


Research: Debugging for Big Data

I am a research associate at the Database Systems and Information Management (DIMA) group at Technische Universität Berlin. My primary research interests focus on debugging techniques for massively parallel data processing programs.

Developing and debugging software is already hard if it runs on small scale data. Doing this with several terabytes of data, scattered across hundreds or even thousands of nodes doesn’t make it easier. In the case of Big Data the conventional debug cycle becomes a huge undertaking.

  1. You execute your code. Depending on the number of machines available this will either take a significant amount of time or energy. In any case, it will burn a lot of CPU hours.
  2. Now you analyze the program’s result data and potentially any logging output. Depending on how sophisticated your logging is, this process itself may already be an error prone data analysis task. 
  3. Based on these findings, you now try to fix the code or add some more logging logic.
    If you want to play it safe, you change as few things as possible and run through this cycle again and again. If you want to be brave, change as much things as you can, minimize the number of debug cycle iterations, and use the chance to add some new bugs to the one you’re trying to fix.

It is my goal to find methods that help to reduce the overall cost of the debug cycle. The main opportunities here are to help developers with code modifications, especially the ones that introduce logging code, and to shorten the execution time by slicing away irrelevant parts of the code and the data.

If you’re interested in more details, please have a look at this paper and poster presentation from the PhD Workshop of VLDB 2014.


Thesis Topics

Below you can find outlines of topic areas in which I am currently offering Master theses or Bachelor theses projects. Unless explicetly marked as Master or Bachelor thesis, these topics are deliberately described very broadly. Chances are that all of these topics are actually too large, especially for a Bachelor thesis. Consequently, the first part of writing a thesis will be to narrow its scope and to agree with your supervisor (Prof. Markl) and your advisor (me) on what exactly you will and will not do in your thesis.


Master Thesis: Automated Provenance Tracking for Dataflow Programs

During analysis of dataflow programs, it is often of great interest to identify the input tuples that produced certain intermediate tuples or output tuples.In other words, it is of interest to track the computational origin (i.e. the provenance) of intermediate tuples or output tuples.

Adding provenance tracking logic manually to program code is a tedious and error-prone task, especially since provenance tracking is usually not a functional component, but more of a necessary complication that is only temporarily required for program analysis and debugging purposes. So, why not try and find a way to add provenance tracking logic automatically to the dataflow program?

It is the goal of this research topic to analyze at least two different strategies for automated augmentation of dataflow programs.

Potential candidates technologies are the following:

  1. API mirroring: wrap existing API into a new API that automatically adds provenance tracking.
  2. Aspect/Feature oriented programming: use advanced programming techniques such as aspect oriented programming, to implement provenance tracking as a cross-cutting concern.
  3. Code generation: parse the code of any dataflow program, and augmented it with provenance tracking features.

A thorough analysis of any of these approaches should address strengths and limitations such as:

What you should know, or should be willing to learn:


Master Thesis: Static Dependency Analysis of User Code Variables

Let’s assume you just hacked together a mildly sophisticated dataflow program. Let’s also assume you made a small copy/paste error that will cause one of your data sources to be ignored. So instead of getting a meaningful result, you’ll wait half an hour to look at the nonsensical output of an expensive self-join. Now, wouldn’t it be nice if your IDE could have shown you that before you hit execute?

Also, let’s assume you’re trying to fix a bug in your code and you’re executing it time and again to analyze your program’s behavior and test potential fixes. Wouldn’t it be nice if your execution engine could ignore all the expensive code that does not contribute to the portion you’re trying to fix?

Both the IDE hints, as well as the slicing of irrelevant code, require a static dependency analysis algorithm, which is the scope of this project. Contrary to conventional approaches, this variant needs to consider both the user-code that is passed to higher order processing primitives, as well as the relationship of these higher order primitives like map, reduce, and join.

The research goal of this thesis topic is the find and algorithm that extracts as precise as possible a dependency graph for for intermediate and final tuple fields in the dataflow. Additionally this graph is meant to contain dependency information between temporary variables in UDFs and tuple fields. Care should be taken, to keep the dependency graph as small as possible while still providing accurate dependency information.For evaluation purposes this graph is to be visualized either directly in the IDE, or in the job submission interface of a dataflow processing system, such as Apache Flink.

What you should know, or should be willing to learn:


Previous Thesis Topics

The following topics have already been handed to students and are currently ongoing investigation, or have lead to a finished thesis. These topics are listed here for reference purposes. Should you wish to elaborate on one of these topics, feel free to get in touch.

Glanceable Visualization of Massive Data Flows (Ongoing)

In-situ data processing is currently all the rage. You do no ingestion, no schema transformation, no cleansing, nothing that makes your live easier whatsoever, only to get your results fast. This approach leads to the following problem: you don’t get to know your data before you’re processing it. Instead, you get to know your data when it breaks your code.

Your default statistics like min, max, avg, median, histogram may help for simple numeric data. However, especially text data or (semi-) structured data call for different approaches. Aside from knowing what your raw data looks like at the input stage, in data flow programs it, is also crucial to understand how the different operations affect the data within the data flow.

It is the goal of this project to provide developers with a visualization tool that helps them to quickly understand what their data looks like (at the source and within the data flow) in order to rule out any false assumptions about the input and its behavior in the data flow, as soon as possible.

What you should know, or should be willing to learn:


A Step-Trough Debugger for Distributed Data Processing Programs (Ongoing)

A much more convenient approach for debugging than your quickly hacked printf statement is provided by most IDEs that feature break- and watchpoint based debugging. There you can have the execution stop at any code location (even based on certain boolean expressions), observe the program flow and inspect variables while stepping forward one statement at a time.

While it is possible to connect to remote processes and debug a handful of your worker tasks using conventional IDEs, it is quite apparent that these tools are not carved out for debugging hundreds or even thousands of tasks in parallel. Neither is the UI designed to efficiently communicate the state of so many tasks, neither is the network communication approach prepared to handle so many concurrent connections.

This project aims at bringing classical step-through debugging capabilities for massively parallel data processing programs to modern IDEs. The developer should be able to set breakpoint for very large amounts of remote tasks, inspect variables and step through successive code statements in a cohesive UI. All of that should be done without flooding the developer’s workstation with data packages from the debugged tasks.

What you should know, or should be willing to learn:



Runtime Analysis of Distributed Data Processing Programs
Marcus Leich
PhD Workshop at VLDB 2014
(Received the Best Paper Award of the PhD Workshop)
The Stratosphere platform for big data analytics
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, Daniel Warneke
VLDB Journal 2014
Applying Stratosphere for Big Data Analytics
Marcus Leich, Jochen Adamek, Moritz Schubotz, Arvid Heise, Astrid Rheinländer, Volker Markl
BTW 2013: 507-510
Meteor/Sopremo: An Extensible Query Language and Operator Model
Arvid Heise, Astrid Rheinländer, Marcus Leich, Ulf Leser, Felix Naumann
BigData Workshop (2012), affiliated with VLDB