GEM: An Efficient Entity Matching Framework for Geospatial Data (poster paper)
Published in the 29th ACM SIGSPATIAL, 2021
Abstract
Identifying different mentions of the same real-world location is known as spatial entity matching. While entity matching (EM) has received significant attention for relational data, the same cannot be said for spatial EM. In this work, we build an end-to-end geospatial EM framework, GEM. Instead of confining ourselves to matching spatial entities of only the “point” geometry type, we extend spatial EM to match the more generic “polygon” geometry entities as well. Blocking, feature vector creation, and classification are the core steps of our system. GEM comprises an efficient and lightweight blocking technique, GeoPrune, that uses the geohash encoding mechanism to prune away the obvious non-matching spatial entities. We leverage the Apache Sedona engine to create the feature vectors. In this step, we repurpose the spatial proximity operators in Sedona to create spatial feature dimensions that capture the proximity between a geospatial entity pair. The classification step in GEM is a pluggable component, which consumes the feature vector for a spatial entity pair and determines whether the geolocations match or not. We conduct experiments with three classifiers on multiple large-scale geospatial datasets consisting of both spatial and relational attributes. GEM achieves an F-measure of 1.0 for a “point x point” dataset with 176k total pairs, which is 42% higher than a state-of-the-art spatial EM baseline. It achieves F-measures of 0.966 and 0.993 for the “point x polygon” dataset with 302M total pairs and the “polygon x polygon” dataset with 16M total pairs, respectively.
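To make the blocking idea concrete, the following is a minimal Python sketch of geohash-based blocking for point entities. It is not the GeoPrune implementation, whose details the abstract does not give; it only illustrates the general technique of encoding each location as a geohash and keeping candidate pairs that fall in the same cell. The function names `geohash_encode` and `candidate_pairs`, the `(entity_id, lat, lon)` tuple layout, and the default precision of 5 characters are all assumptions made for illustration.

```python
from collections import defaultdict

# Standard geohash base-32 alphabet (digits plus letters, without a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=5):
    """Encode a latitude/longitude pair as a geohash string.

    Bits alternate between longitude and latitude, starting with longitude;
    every 5 bits become one base-32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def candidate_pairs(left, right, precision=5):
    """Return (left_id, right_id) pairs whose points share a geohash cell.

    `left` and `right` are iterables of (entity_id, lat, lon) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # Index the right-hand entities by their geohash cell.
    blocks = defaultdict(list)
    for rid, lat, lon in right:
        blocks[geohash_encode(lat, lon, precision)].append(rid)

    # Only pairs in the same cell survive; all other pairs are pruned.
    pairs = []
    for lid, lat, lon in left:
        cell = geohash_encode(lat, lon, precision)
        for rid in blocks.get(cell, []):
            pairs.append((lid, rid))
    return pairs


if __name__ == "__main__":
    left = [("a1", 57.64911, 10.40744), ("a2", 40.7128, -74.0060)]
    right = [("b1", 57.64920, 10.40750), ("b2", 48.8566, 2.3522)]
    print(candidate_pairs(left, right))  # [('a1', 'b1')]
```

Exact-cell blocking like this can miss true matches that straddle a cell boundary, so a practical variant would likely also compare against neighboring cells or use a coarser precision. Polygon entities would need a different treatment (for example, encoding a representative point or every cell the polygon covers), which this sketch does not attempt.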