Spatial Data Analytics
Agora-EO: A Unified Ecosystem for Earth Observation
Spatial Join and Aggregation
The spatial join is a fundamental data operation that plays a key role in spatial analytics. It aims to find pairs of spatial objects that satisfy a spatial predicate, most commonly spatial intersection. For example, connected mobility companies such as Uber join locations of passenger requests with a set of polygonal regions to display available cars and enable dynamic pricing. Existing spatial join algorithms typically first solve the join using approximations (e.g., bounding boxes) of the geometries. Then, for each pair of intersecting approximations, they perform exact geometric tests to eliminate false matches. For real-world objects with complex shapes, these tests are computationally expensive. To improve the efficiency of spatial join and enable real-time evaluation, we introduced Raster Join, an operator that follows a fundamentally different design. Specifically, we used the following key insight: A spatial join between two data sets can be considered as "drawing" the two data sets on the same canvas, and then examining their intersections. This insight allows framing spatial joins as renderings and using the rasterization operation, which is highly optimized for the GPU. Rasterization is an integral component of the graphics rendering pipeline that converts a geometric shape into a set of pixels. We further observed that many emerging applications (e.g., visualization tools) only need approximate results. This observation allows evaluating the join based solely on GPU rendering, thereby eliminating costly geometric tests.
We integrated Raster Join into Urbane, a 3D visual analytics framework that supports data-driven decision-making in urban planning. Thanks to this integration, domain experts can interactively explore, over space and time, several urban data sets at multiple resolutions. Our demonstration of the framework at the ACM SIGMOD 2018 conference won the Best Demonstration Award.
Furthermore, we developed a main-memory point-in-polygon spatial join algorithm. To reduce the compute-intensive geometric tests, we approximate polygons closely with a set of fine-grained raster cells. We index these cells using a novel query-efficient radix tree data structure that we call Adaptive Cell Trie (ACT). ACT allows us to efficiently evaluate the join with an index-nested loop: to perform the point-in-polygon join, we query ACT for every point. Overall, our approach trades memory consumption (for storing the ACT index) for reduced computation (number of performed geometric tests).
To provide further support for spatial aggregation over arbitrary polygons, we also introduced GeoBlocks, a novel pre-aggregating data structure. Instead of pre-computing aggregates over a hierarchy of rectangles as in prior work, GeoBlocks pre-compute aggregates over fine-grained raster cells, enabling error-bounded results. Furthermore, GeoBlocks cache aggregates of frequently queried regions, thereby dynamically adapting to the skew inherently present in query workloads and improving performance over time.
The above work reflects my overall vision of using distance-bounded spatial approximations to accelerate spatial queries. I presented this vision in CIDR 2021, where I showed that distance-bounded raster approximations can enable a broader set of optimization options and can form the basis of approximate spatial query processing techniques that take better advantage of modern hardware and improve performance. In doing so, my work sets the stage for a new generation of spatial systems that employ distance-bounded raster approximations at their core.
See also our edium Blogpost.
Spatial Data Approximation
The Minimum Bounding Box (MBB) is the most widely used spatial approximation. It is the smallest axis-aligned box that completely encloses a given (set of) spatial object(s). While being computation-and space-efficient, the MBB is often a poor approximation of today’s complex real-world data. It encloses a significant amount of dead space, i.e., fragments of the bounded area that do not contain any objects. In contrast, other spatial approximations proposed in prior work, e.g., the convex hull, approximate data tightly but are not computation- and space-efficient. To achieve a compact yet time-efficient approximation, we introduced the concept of Clipped Bounding Boxes (CBBs). CBBs subtract out (clip away) hyper-rectangles from the corners of the MBB using only a few auxiliary multi-dimensional points. We proposed strategies for generating these points based on Pareto optimality. Existing indexes only require a supplemental compact data structure and minor modifications to the construction, query, and update algorithms to work with CBBs. We showed that CBBs eliminate unnecessary recursions in dead space, which results in improved query performance.
Mobility-Aware Data Processing
I contributed to the design of NebulaStream, an end-to-end data management system for the IoT. I am particularly interested in the impact of the mobility of IoT devices on data processing. We introduced NebulaStream-MSS, a fault tolerance strategy for mobile devices to handle their transient disconnections. Our strategy uses a circular buffer to store the data generated during the disconnection. Furthermore, it ensures the correct forwarding of the buffered data upon re-connection, considering changes in the query execution plan caused by mobility.
The abundance of data sources containing complementary information about a spatial region creates the need to issue a spatial range query on multiple data sets simultaneously. Nevertheless, existing indexes do not provide efficient support for querying an arbitrarily chosen combination of data sets while returning an individual answer for each data set. To address this challenge, I introduced a novel data structure, STITCH. STITCH organizes spatial objects not only based on their spatial proximity but also based on application-specific categories. That way, it allows for scalable selection of spatial regions of interest from a subset of data sets that users arbitrarily choose from hundreds of candidate data sets. We further optimized data access using queries as hints to build (partial) spatial data structures incrementally, only for the bits of data needed. Our approach, Space Odyssey, minimizes the amount of data that needs to be organized and optimizes access to data sets frequently queried together, thereby decreasing data-to-query time.
Time Series Indexing
To facilitate the exploration of time series data, I studied the problem of time-bound threshold queries. A time-bound threshold query returns all the time series values that occurred within specified time bounds and are above a given threshold. Time series access methods use single-attribute indexes that index either the time or the value domain and thus cannot effectively prune the search space using constraints on both attributes. To provide efficient and scalable processing of time-bound threshold queries, I developed a new indexing scheme, RUBIK. RUBIK encodes time series values with bitmaps and compresses the bitmaps with a quadtree, effectively exploiting similarities within and across time series. More importantly, RUBIK allows executing time-bound threshold queries directly on the compressed representation, thereby achieving both space- and time-efficiency.