Modern approaches to Entity Resolution


Instructor Donatella Firmani
Primary material Readings in the form of research papers and on-line resources referenced by the instructor.
Assignments The course will include individual and team assignments, including a programming assignment in Python on real-world data using open source frameworks.
Credits 3
When June 15th 2020 - June 24th 2020
Where Lectures will be held online as a Microsoft Teams meeting. Please register here by June 13th in order to receive the Teams meeting link via e-mail.
Audience The course is open to PhD students and young researchers from degree-granting institutions, including but not limited to Roma Tre University.

Abstract

Entity Resolution (ER) is a fundamental Data Integration task. ER seeks to identify and match different manifestations of the same real-world object (i.e., duplicates) across different data sources. Many domains, such as government Open Data and the World Wide Web, can provide thousands of data sources about real-world entities, including profiles of people and institutions, or specifications of products and services. In these scenarios, ER is key to solving complex tasks such as building a knowledge graph, and it has been a long-standing interest of many researchers and practitioners in the data integration community.
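
As a rough illustration of the task, the sketch below flags candidate duplicates by comparing the word tokens of each pair of records with Jaccard similarity. This is only a toy baseline, not one of the methods covered in the course; the function names, the sample records, and the 0.5 threshold are assumptions made for the example.

```python
from itertools import combinations


def tokens(record: str) -> set[str]:
    """Lowercase a record and split it into a set of word tokens."""
    return set(record.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_duplicates(records, threshold=0.5):
    """Return index pairs whose token sets are at least `threshold` similar."""
    toks = [tokens(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(toks[i], toks[j]) >= threshold
    ]


# Hypothetical product records from different sources.
records = [
    "Apple iPhone 11 64GB black",
    "iphone 11 apple 64gb Black",
    "Samsung Galaxy S10 128GB",
]
print(candidate_duplicates(records))  # [(0, 1)]
```

Comparing all pairs is quadratic in the number of records, which is one reason techniques for reducing the duplicates search space matter at scale.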

The sheer number of ways in which real-world entities can be represented and misrepresented makes ER a challenging task for automated strategies, but a relatively easy one for expert humans. For this reason, many frameworks have recently been proposed to better leverage and represent human knowledge in the context of ER, ranging from crowdsourcing approaches, where crowd workers answer questions of the form “do records u and v refer to the same entity?”, to machine learning methods that build on impressive results in the field of natural language processing (e.g., BERT).
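
To make the question-answering setting concrete, here is a minimal sketch in which an oracle answers “do records u and v refer to the same entity?” and questions whose answer already follows by transitivity are skipped. It is a simplified illustration under stated assumptions, not the algorithm of any referenced paper: the oracle is simulated from a hypothetical ground-truth list, pairs are asked in a fixed order, and only positive transitivity is exploited.

```python
from itertools import combinations


class UnionFind:
    """Disjoint-set structure tracking which records are known to match."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def resolve(n, oracle):
    """Cluster records 0..n-1, asking the oracle only when needed."""
    uf = UnionFind(n)
    questions = 0
    for u, v in combinations(range(n), 2):
        if uf.find(u) == uf.find(v):
            continue  # already known to match by transitivity
        questions += 1
        if oracle(u, v):
            uf.union(u, v)
    clusters = {}
    for r in range(n):
        clusters.setdefault(uf.find(r), []).append(r)
    return sorted(clusters.values()), questions


# Hypothetical ground truth: entity id of each record (simulates the crowd).
truth = [0, 0, 1, 0, 1]
clusters, asked = resolve(len(truth), lambda u, v: truth[u] == truth[v])
print(clusters, asked)
```

In this toy run, the pair (1, 3) is never asked, because records 0 and 1 and records 0 and 3 were already confirmed as matches.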

The course aims to illustrate these methods and related notions, such as how to produce explainable interpretations of the output of such machine learning models for the ER task, as well as common variants and applications of the ER problem, including the construction of hierarchies and knowledge graphs. All the subjects addressed during the course are investigated from both practical and methodological perspectives. The course includes practical exercises with real-world data and the assignment of individual projects.


Program

Lectures will be held online as a Microsoft Teams meeting. Slides will be uploaded before each lecture.

Please register here by June 13th in order to receive the Teams meeting link and news updates.

ID | Date and Time (CEST)  | Title                                                                        | Slides | Main References
1  | June 15th, 3PM-5PM    | Introduction to Entity Resolution and Data Integration                       | Slides | [10, 11]
2  | June 17th, 11AM-1PM   | Modern Approaches for Recognition of Duplicates                              | Slides | [1, 2, 5, 6]
3  | June 19th, 3PM-5PM    | Modern Approaches for Clustering and Reducing the Duplicates Search Space    | Slides | [8, 9]
4  | June 22nd, 3PM-5PM    | Explainable AI methods for Entity Resolution                                 | Slides | [3, 7]
5  | June 24th, 3PM-5PM    | Beyond Entity Resolution: Knowledge Graphs                                   | Slides | [4, 12]


Assignments

Assignments consist of individual or team projects, such as a programming project in Python on real-world data using open source frameworks.

For course participants: please fill in the form here by June 25th in order to start planning assignment activities, or to let me know that you are not planning to do the assignment.


Main References

  1. Sainyam Galhotra, Donatella Firmani, Barna Saha, Divesh Srivastava: Robust Entity Resolution using Random Graphs. SIGMOD Conference 2018: 3-18
  2. Donatella Firmani, Barna Saha, Divesh Srivastava: Online Entity Resolution Using an Oracle. Proc. VLDB Endow. 9(5): 384-395 (2016)
  3. Vincenzo Di Cicco, Donatella Firmani, Nick Koudas, Paolo Merialdo, Divesh Srivastava: Interpreting deep learning models for entity resolution: an experience report using LIME. aiDM@SIGMOD 2019: 8:1-8:4
  4. Andrea Rossi, Donatella Firmani, Antonio Matinata, Paolo Merialdo, Denilson Barbosa: Knowledge Graph Embedding for Link Prediction: A Comparative Analysis. CoRR abs/2002.00819 (2020)
  5. Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan: Deep Entity Matching with Pre-Trained Language Models. CoRR abs/2004.00584 (2020)
  6. Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD Conference 2018: 19-34
  7. Amr Ebaid, Saravanan Thirumuruganathan, Walid G. Aref, Ahmed K. Elmagarmid, Mourad Ouzzani: EXPLAINER: Entity Resolution Explanations. ICDE 2019: 2000-2003
  8. Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq R. Joty, Mourad Ouzzani, Nan Tang: Distributed Representations of Tuples for Entity Resolution. Proc. VLDB Endow. 11(11): 1454-1467 (2018)
  9. Guilherme Dal Bianco, Marcos André Gonçalves, Denio Duarte: BLOSS: Effective meta-blocking with almost no effort. Inf. Syst. 75: 75-89 (2018)
  10. AnHai Doan, Alon Halevy, Zachary Ives: Principles of Data Integration. Elsevier 2012
  11. Xin Luna Dong, Divesh Srivastava: Big Data Integration. Synthesis Lectures on Data Management 7(1): 1-198 (2015)
  12. Q. Wang, Z. Mao, B. Wang, L. Guo: Knowledge Graph Embedding: A Survey of Approaches and Applications. IEEE Trans. Knowl. Data Eng. 29(12): 2724-2743 (2017)