ALFA: Active Learning for Graph Neural Network-based Semantic Schema Alignment

Published in The VLDB Journal: Special Issue on Machine Learning and Databases, Volume 32, Issue 6, Article No.: 4, 2023

Download paper here

Abstract

Semantic schema alignment aims to match elements across a pair of schemas based on their semantic representation. It is a key primitive for data integration that facilitates the creation of a common data fabric across heterogeneous data sources. Deep learning approaches such as graph representation learning have shown promise for effectively aligning semantically rich schemas, often captured as ontologies. Most of these approaches are supervised and require large amounts of labeled training data, which is costly to obtain in terms of time and manual labor. Active learning (AL) techniques can alleviate this issue by intelligently choosing the data to be labeled through a human-in-the-loop approach, minimizing the amount of labeled training data required.

However, existing active learning techniques are limited in their ability to exploit the rich semantic information in the underlying schemas. As a result, they cannot drive the effective and efficient sample selection for human labeling that is needed to scale to larger datasets. In this paper, we propose ALFA, an active learning framework that overcomes these limitations. ALFA exploits schema element properties as well as the relationships between schema elements (structure) to drive a novel ontology-aware sample selection and label propagation algorithm for training highly accurate alignment models. We also propose semantic blocking to scale to larger datasets without compromising model quality. Our experimental results across three real-world datasets show that (1) ALFA leads to a substantial reduction (27% to 82%) in the cost of human labeling, (2) semantic blocking reduces label skew by up to 40% without adversely affecting model quality and scales AL to large datasets, and (3) sample selection achieves schema matching quality (90% F1-score) comparable to models trained on the entire set of available training data. We also show that ALFA outperforms the state-of-the-art ontology alignment system BERTMap, with (1) 10x shorter time per AL iteration and (2) half as many AL iterations needed to reach the highest convergent F1-score.
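For readers unfamiliar with the human-in-the-loop setup the abstract refers to, the generic AL loop can be sketched as below. This is a minimal, hypothetical illustration using simple uncertainty sampling over candidate-pair similarity scores; all names are invented for illustration, and ALFA's actual ontology-aware sample selection, label propagation, and semantic blocking are substantially more involved than this sketch.

```python
def train(labeled):
    """Toy 'model': thresholds on the mean score of pairs labeled positive."""
    pos = [score for score, label in labeled if label == 1]
    threshold = sum(pos) / len(pos) if pos else 0.5
    return lambda score: 1 if score >= threshold else 0

def uncertainty(score, boundary=0.5):
    # Candidate pairs whose similarity score lies near the decision
    # boundary are the most informative ones to label next.
    return -abs(score - boundary)

def active_learning(pairs, oracle, budget, batch=2):
    """Generic AL loop: repeatedly pick the most uncertain candidate
    pairs, ask the human oracle for labels, then retrain."""
    labeled, pool = [], list(pairs)
    while budget > 0 and pool:
        # Rank unlabeled candidates by uncertainty and take a batch.
        pool.sort(key=uncertainty, reverse=True)
        chosen, pool = pool[:batch], pool[batch:]
        labeled += [(s, oracle(s)) for s in chosen]  # human-in-the-loop step
        budget -= len(chosen)
    return train(labeled)
```

The point of the loop is that the labeling budget is spent only on the samples the current model is least sure about, rather than on a random subset of all candidate pairs.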