Generating Concise Entity Matching Rules (demo paper)

Published in the ACM SIGMOD Conference on Management of Data, 2017

Abstract

Entity matching (EM) is a critical part of data integration and cleaning. In many applications, users need to understand why two entities are considered a match, which calls for interpretable and concise EM rules. We model EM rules as General Boolean Formulas (GBFs), which allow arbitrary attribute-matching predicates combined by conjunctions ($\wedge$), disjunctions ($\vee$), and negations ($\neg$). GBFs can express rules more concisely than traditional EM rules represented in disjunctive normal form (DNF). We use program synthesis, a powerful tool for automatically generating rules (or programs) that provably satisfy a high-level specification, to synthesize EM rules in GBF format, given only positive and negative matching examples. In this demo, attendees will experience the following features: (1) Interpretability: they can see and measure the conciseness of EM rules defined using GBFs; (2) Easy customization: they can provide custom experiment parameters for various datasets and easily modify a rich predefined (default) synthesis grammar using a Web interface; and (3) High performance: they will be able to compare the generated concise rules, in terms of accuracy, with probabilistic models (e.g., machine learning methods) and hand-written EM rules provided by experts. Moreover, this system will serve as a general platform for evaluating different methods that discover EM rules, and it will be released as an open-source tool on GitHub.
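
The conciseness claim can be illustrated with a small sketch. This is not the system's implementation: the nested-tuple rule encoding, the predicate names (`name`, `phone`, `email`), and the size measure (number of atomic predicates) are all assumptions made for illustration. The sketch compares a DNF rule with a logically equivalent GBF that factors out the shared predicate.

```python
from itertools import product

# Hypothetical rule encoding (not from the paper):
#   ("atom", name) | ("not", rule) | ("and", rule, ...) | ("or", rule, ...)
# An atom stands for "the two records are similar on this attribute".

def evaluate(rule, similar):
    """Evaluate a rule given a dict mapping attribute name -> bool."""
    op = rule[0]
    if op == "atom":
        return similar[rule[1]]
    if op == "not":
        return not evaluate(rule[1], similar)
    if op == "and":
        return all(evaluate(r, similar) for r in rule[1:])
    if op == "or":
        return any(evaluate(r, similar) for r in rule[1:])
    raise ValueError(f"unknown operator: {op!r}")

def size(rule):
    """Count atomic predicates, used here as an assumed conciseness measure."""
    if rule[0] == "atom":
        return 1
    return sum(size(r) for r in rule[1:])

def equivalent(r1, r2, atoms):
    """Check logical equivalence by enumerating every truth assignment."""
    for values in product([False, True], repeat=len(atoms)):
        assignment = dict(zip(atoms, values))
        if evaluate(r1, assignment) != evaluate(r2, assignment):
            return False
    return True

# Illustrative attribute predicates (assumed, not taken from any dataset).
name = ("atom", "name")
phone = ("atom", "phone")
email = ("atom", "email")

# DNF: (name AND phone) OR (name AND email)
dnf_rule = ("or", ("and", name, phone), ("and", name, email))
# GBF: name AND (phone OR email)
gbf_rule = ("and", name, ("or", phone, email))

if __name__ == "__main__":
    print("DNF size:", size(dnf_rule))
    print("GBF size:", size(gbf_rule))
    print("equivalent:", equivalent(dnf_rule, gbf_rule, ["name", "phone", "email"]))
```

Both rules accept exactly the same record pairs, but the GBF mentions `name` only once. The saving grows when more disjuncts share a predicate.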