Projects

Delivery of Query Optimizer in Watsonx.data 2.0
Initiation and GA of IBM Data Lakehouse (Watsonx.data)
Guided Data Analysis for Conversational Business Intelligence
Active Learning for Ontology Alignment
Human Intent Prediction for Data Exploration
Unified Active Learning for Entity Matching
Data Integration of Electric System Schemata
Rule Discovery in Knowledge Bases
Interpretable Entity Matching
Statistical Data Cleaning
Write-efficient Sort for Phase Change Memory
Sub-query Plan reuse-based Query Optimization

Delivery of Query Optimizer in Watsonx.data 2.0

IBM Research initiated and delivered enterprise-grade query optimization in Watsonx.data. We proposed using the Db2 query optimizer as a disaggregated optimizer for complex Presto SQL queries, prototyped the initial proof of concept, and collaborated with the Data & AI Business Unit to deliver the technology in Watsonx.data 2.0. In internal testing on a query benchmark derived from the public 100 TB TPC-DS, we achieved better price-performance than Databricks' Photon engine: equal query runtime at less than 60% of the cost, using Watsonx.data 2.0 with the query optimizer and Presto C++ v0.286 on IBM Fusion HCI.

Team: Berthold Reinwald, Hamid Pirahesh, Michael Kaufmann, Nasrullah Sheikh, Richard Sidle, Venkata Vamsikrishna Meduri, Zoltan Arnold Nagy, Ronald Barber, Pascal Spoerri, Gregory Kishi, Aditi Pandit, Ajay Gupta, Arin Mathew, Ashok Kumar, Austin Clifford, Calisto Zuzarte, Christian Zentgraf, Deepak Majeti, Ethan Zang, George Lapis, Jason Sizto, Sudheesh Kairali

Initiation and General Availability (GA) of IBM Data Lakehouse (Watsonx.data)

IBM Data Lakehouse became generally available (GA) in July 2023. IBM Research (Almaden Research Center) initiated the effort for IBM to enter the growing data lakehouse market. Research worked closely with the Data and AI Business Unit (Silicon Valley Lab, Toronto, India) to set the strategy and deliver the product. IBM Data Lakehouse is built on open-source PrestoDB, enriched with IBM technologies to make it enterprise-ready. It forms the foundation for Watsonx.data.

Team: Hamid Pirahesh, Berthold Reinwald, Larry Chiu, Ronald Barber, Richard Sidle, Scott Guthridge, Venkata Vamsikrishna Meduri, Nasrullah Sheikh, Frank Schmuck

Guided Data Analysis for Conversational Business Intelligence

We built a Business Intelligence (BI) query recommender system that guides analysts towards the interesting segments of the data during a conversational data analysis session.

Technical Report: BI-REC
Authors: Venkata Vamsikrishna Meduri, Abdul Quamar, Chuan Lei, Vasilis Efthymiou, Fatma Özcan

Active Learning for Ontology Alignment

We built an active learning framework for ontology alignment using Graph Neural Networks (GNNs).

Publications: VLDBJ 2024
Authors: Venkata Vamsikrishna Meduri, Abdul Quamar, Chuan Lei, Xiao Qin, Berthold Reinwald

Human Intent Prediction for Data Exploration

In this project, we aim to make human-database interaction seamless during a data exploration session by predicting the user's dynamically changing intent.

Publications: ICDE 2018 (Lightning Talk Abstract), EDBT 2019 (Short Paper), TODS 2021 (Research Paper)
Authors: Venkata Vamsikrishna Meduri, Kanchan Chowdhury, Mohamed Sarwat

Unified Active Learning for Entity Matching

We build a unified active learning framework for entity matching that evaluates combinations of learners and example selectors with respect to quality, latency, number of labels, and interpretability. We also compare the active learning strategies against state-of-the-art supervised learning approaches.
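
The idea of pairing a learner with an example selector can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the framework from the paper: each candidate pair is reduced to a single integer similarity score, `midpoint_learner` is a toy threshold learner, `uncertainty_selector` picks the pair closest to the decision boundary, and `oracle` stands in for the human labeler.

```python
def midpoint_learner(labeled):
    """Toy learner (hypothetical): threshold halfway between the highest
    labeled non-match and the lowest labeled match."""
    highest_non_match = max(s for s, is_match in labeled if not is_match)
    lowest_match = min(s for s, is_match in labeled if is_match)
    return (highest_non_match + lowest_match) / 2


def uncertainty_selector(unlabeled, threshold):
    """Toy example selector (hypothetical): the score closest to the boundary."""
    return min(unlabeled, key=lambda s: abs(s - threshold))


def active_learning(pool, oracle, seeds, budget, learner, selector):
    """Generic loop: train, pick one example, ask the oracle, repeat."""
    labeled = list(seeds)
    seen = {s for s, _ in labeled}
    unlabeled = [s for s in pool if s not in seen]
    for _ in range(budget):
        if not unlabeled:
            break
        model = learner(labeled)
        query = selector(unlabeled, model)
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return learner(labeled), len(labeled)


# Similarity scores in percent; pairs scoring 62 or more are true matches.
pool = list(range(0, 101, 5))
threshold, labels_used = active_learning(
    pool,
    oracle=lambda s: s >= 62,
    seeds=[(0, False), (100, True)],
    budget=6,
    learner=midpoint_learner,
    selector=uncertainty_selector,
)
print(threshold, labels_used)  # 62.5 8
```

Because the learner and the selector are plain function arguments, other combinations can be swapped in and compared on the same pool, here by final threshold and number of labels used.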

Publications: SIGMOD 2020 (Research Paper)
Authors: Venkata Vamsikrishna Meduri, Prithviraj Sen, Lucian Popa, Mohamed Sarwat

Data Integration of Electric System Schemata

We integrate real-world schemata that contain many inconsistencies, applying approximate entity matching and schema alignment techniques to reconcile electric system transmission, distribution, and location data in diverse formats. This project was a collaboration between the CASCADE team at ASU and Salt River Project (SRP), one of the primary electricity distributors in Arizona.

Collaborators: Stewart Nunn, Dragan Boscovic, Mohamed Sarwat

Rule Discovery in Knowledge Bases

We mine positive and negative rules that satisfy or negate pre-specified relationships between the subject and object entities in a knowledge graph. To do so, we traverse the paths between several instances (RDF triples) of a relationship and generalize them into rules.
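
As a rough illustration of generalizing paths into rules, the sketch below enumerates forward paths of bounded length between the subject and object of each instance of a target relationship, drops the entity names, and counts how many instances each predicate sequence covers. This is a simplified sketch, not the published algorithm: the function name `mine_path_rules`, the toy triples, and the restriction to forward edges and positive rules are all assumptions.

```python
from collections import Counter, defaultdict


def mine_path_rules(triples, target, max_len=2):
    """Count, per predicate sequence, how many (subject, object) instances of
    `target` are connected by a forward path with that sequence."""
    outgoing = defaultdict(list)
    for s, p, o in triples:
        if p != target:  # do not use the target edge itself as evidence
            outgoing[s].append((p, o))

    support = Counter()
    for subj, obj in [(s, o) for s, p, o in triples if p == target]:
        found = set()
        frontier = [(subj, ())]
        for _ in range(max_len):
            next_frontier = []
            for node, preds in frontier:
                for pred, neighbor in outgoing[node]:
                    path = preds + (pred,)
                    if neighbor == obj:
                        found.add(path)  # entities dropped: path is a rule body
                    next_frontier.append((neighbor, path))
            frontier = next_frontier
        support.update(found)
    return support


triples = [
    ("alice", "hasChild", "carol"), ("carol", "hasFather", "bob"),
    ("alice", "spouse", "bob"), ("alice", "livesIn", "paris"),
    ("dave", "hasChild", "erin"), ("erin", "hasFather", "frank"),
    ("dave", "spouse", "frank"),
]
rules = mine_path_rules(triples, target="spouse")
print(rules)  # Counter({('hasChild', 'hasFather'): 2})
```

The surviving sequence reads as the candidate rule hasChild(x, z) and hasFather(z, y) implies spouse(x, y), supported by both instances in the toy data.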

Publications: ICDE 2018 (Research Paper), VLDB 2018 (Demo Paper), JDIQ 2019 (Research Paper)
Authors: Stefano Ortona, Venkata Vamsikrishna Meduri, Paolo Papotti, Naser Ahmadi, Viet-Phi Huynh

Interpretable Entity Matching

We use program synthesis, with the Sketch solver, to generate concise and interpretable Boolean expressions (rules) that satisfy matching and non-matching assertions on the training data, and we use these rules to perform entity matching.
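
To make the notion of a rule that satisfies matching and non-matching assertions concrete, here is a minimal sketch that replaces the Sketch solver with brute-force enumeration. It assumes each record pair has already been turned into per-attribute similarity scores and searches for the smallest conjunction of threshold predicates consistent with every labeled example. The function name `synthesize_rule`, the threshold grid, and the toy data are hypothetical.

```python
from itertools import combinations, product


def synthesize_rule(examples, thresholds=(0.5, 0.8), max_atoms=2):
    """Return the smallest conjunction of (attribute, threshold) atoms that is
    true on every matching example and false on every non-matching one,
    or None if no such rule exists within the search bounds."""
    attributes = sorted(examples[0][0])
    atoms = list(product(attributes, thresholds))
    for size in range(1, max_atoms + 1):
        for rule in combinations(atoms, size):
            consistent = all(
                all(sims[attr] >= t for attr, t in rule) == is_match
                for sims, is_match in examples
            )
            if consistent:
                return rule
    return None


examples = [
    ({"name": 0.90, "address": 0.9}, True),
    ({"name": 0.90, "address": 0.3}, False),
    ({"name": 0.60, "address": 0.9}, False),
    ({"name": 0.85, "address": 0.6}, True),
]
rule = synthesize_rule(examples)
print(rule)  # (('address', 0.5), ('name', 0.8))
```

The result reads directly as "address similarity at least 0.5 and name similarity at least 0.8", which is the kind of short, inspectable rule the project targets.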

Publications: PVLDB 2017 (Research Paper) presented at VLDB 2018, SIGMOD 2017 (Demo Paper)
Authors: Rohit Singh, Venkata Vamsikrishna Meduri, Paolo Papotti, Nan Tang, Armando Solar-Lezama, Samuel Madden, Ahmed K. Elmagarmid, Jorge-Arnulfo Quiane-Ruiz

Statistical Data Cleaning

We design and develop a statistical data cleaning framework called BayesWipe that obviates the need for clean master data. Instead, it learns a model of the clean data from the dirty data itself in a probabilistically principled manner.

Publications: JDIQ 2016 (Research Paper)
Authors: Sushovan De, Yuheng Hu, Venkata Vamsikrishna Meduri, Yi Chen, Subbarao Kambhampati

Write-efficient Sort for Phase Change Memory

We design a sort algorithm that minimizes writes to Phase Change Memory (PCM) without sacrificing latency, in a hybrid main-memory setting comprising a large PCM and a tiny DRAM. The goal is to accommodate the limited write endurance of PCM.
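
To show what counting writes means in practice, the sketch below uses textbook cycle sort, which writes each element to the array at most once. It is not the algorithm from the paper and ignores the DRAM buffer and latency goals entirely. The Python list simply stands in for the PCM-resident array, and the returned counter is the number of writes to it.

```python
def cycle_sort(a):
    """Sort `a` in place with cycle sort; return the number of array writes."""
    writes = 0
    n = len(a)
    for start in range(n - 1):
        item = a[start]
        # Final position of `item` = start + number of smaller elements after it.
        pos = start + sum(1 for i in range(start + 1, n) if a[i] < item)
        if pos == start:
            continue  # already in place, no write needed
        while item == a[pos]:
            pos += 1  # skip over duplicates already in place
        a[pos], item = item, a[pos]
        writes += 1
        # Keep placing the displaced element until the cycle closes.
        while pos != start:
            pos = start + sum(1 for i in range(start + 1, n) if a[i] < item)
            while item == a[pos]:
                pos += 1
            a[pos], item = item, a[pos]
            writes += 1
    return writes


data = [5, 2, 9, 1, 5, 6]
writes = cycle_sort(data)
print(data)  # [1, 2, 5, 5, 6, 9]
```

Cycle sort pays for its low write count with quadratically many reads and comparisons, which is exactly the tension a write-efficient sort for PCM has to resolve.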

Publications: DEXA 2012 (Research Paper)
Authors: Venkata Vamsikrishna Meduri, Zhan Su, Kian-Lee Tan

Sub-query Plan reuse-based Query Optimization

We detect near-isomorphic subquery graphs with similar selectivities and reuse the optimal plan generated for one candidate subquery for another isomorphic subquery enumerated while searching for the optimal plan of a complex query. We implemented this in the PostgreSQL engine to reduce the latency of the (Iterative) Dynamic Programming query optimizer.
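
The reuse idea can be sketched as a plan cache keyed by a structural signature of the subquery's join graph. The sketch is illustrative only and unrelated to the PostgreSQL implementation: `signature` is a hypothetical heuristic (sorted node degrees plus bucketed edge selectivities) that is necessary but not sufficient for isomorphism, and remapping the cached plan's tables onto the new subquery is omitted.

```python
from collections import Counter


def signature(edges, bucket=0.1):
    """Heuristic key for a join graph given as (table_a, table_b, selectivity)
    edges: sorted degree sequence plus selectivities rounded to buckets."""
    degree = Counter()
    for a, b, _ in edges:
        degree[a] += 1
        degree[b] += 1
    degrees = tuple(sorted(degree.values()))
    selectivities = tuple(sorted(round(sel / bucket) for _, _, sel in edges))
    return degrees, selectivities


class PlanCache:
    """Calls the (expensive) optimizer only for signatures not seen before."""

    def __init__(self, optimize):
        self.optimize = optimize
        self.plans = {}
        self.hits = 0

    def plan_for(self, edges):
        key = signature(edges)
        if key in self.plans:
            self.hits += 1  # reuse the plan found for a similar subquery
        else:
            self.plans[key] = self.optimize(edges)
        return self.plans[key]


cache = PlanCache(optimize=lambda edges: f"plan for {len(edges)} joins")
cache.plan_for([("A", "B", 0.12), ("B", "C", 0.31)])  # optimized
cache.plan_for([("X", "Y", 0.09), ("Y", "Z", 0.33)])  # reused: same shape
cache.plan_for([("A", "B", 0.12), ("B", "C", 0.90)])  # optimized: differs
print(cache.hits, len(cache.plans))  # 1 2
```

A coarser `bucket` yields more reuse at the risk of applying a plan that is no longer optimal for the new selectivities.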

Publications: COMAD 2011 (Research Paper)
Authors: Venkata Vamsikrishna Meduri, Kian-Lee Tan