PhD ResearchThe goal of my PhD thesis was to bridge the gap between the requirements of spatial and temporal data exploration and the capabilities of existing data management systems. To that end, I focused on query processing techniques and access methods, on which the performance of data management systems depends, and revisited their design to eliminate bottlenecks that oppose the interactivity and scalability requirements of exploratory applications. Specifically, I developed spatial query operators that leverage GPU rasterization, lightweight geometric object approximations with improved precision, and workload-aware access methods.
Real-Time Spatial Aggregation using GPU Rasterization
Spatial aggregation queries partition and summarize point data over different regions. Enabling fast response times to such queries is challenging for several reasons. First, they comprise computationally-intensive point-in-polygon tests that require time linear with respect to the size of the polygons. Real-world polygonal regions have complex shapes, often consisting of hundreds of vertices. This problem is compounded by the fact that data sets can have hundreds of millions to several billion points. Finally, data cube-based structures which are used to maintain aggregate values, require costly pre-processing and can incur prohibitively high memory overhead, while, more importantly, they do not support queries over arbitrary polygonal regions.
To address these challenges, we introduced Raster Join, a rasterization-based spatial aggregation technique that leverages modern graphics hardware to support interactive response times to ad-hoc aggregation queries over arbitrary polygons. As part of the driver provided by the hardware vendors, rasterization is optimized to make use of the underlying architecture and maximize the occupancy of the GPU. Raster Join processes queries on-the-fly, which allows taking into account dynamic updates to query parameters without requiring any memory-intensive pre-computation
Approximating Spatial DataSpatial data processing techniques rely heavily on approximating groups of spatial objects by their minimum bounding box (MBB). However, MBBs often enclose significant “dead space”, fragments of bounded area that contain no actual objects. We augment MBBs with few auxiliary points that eliminate dead space around the corners of the MBB. We propose strategies for generating these points based on Pareto optimality, and demonstrate that our proposal can be plugged into any R-tree variant with a small auxiliary data structure and minor modifications to the construction, query, and update algorithms.
Spatial IndexingWith the abundance of data sources containing complementary information about a spatial region, there is the need to issue a spatial range query on multiple data sets simultaneously. Nevertheless, existing indexes do not provide adequate support for querying multiple data sets while returning a separate answer for each one of them. To address the challenge of efficiently querying a combination of data sets that are arbitrarily chosen from hundreds of candidate data sets, we developed Space Odyssey, an incremental indexing approach that continuously optimizes access to data sets frequently queried together, and STITCH, a novel index structure for the scalable execution of spatial range queries on multiple data sets.