Instructor | Donatella Firmani |
---|---|
Primary material | Readings in the form of research papers and on-line resources referenced by the instructor. |
Assignments | The course will contain individual and team assignments, including a programming assignment in Python on real-world data using open source frameworks. |
Credits | 3 |
When | June 15th 2020 - June 24th 2020 |
Where | Lectures will be held online as a Microsoft Teams meeting. Please register here by June 13th in order to receive the Teams meeting link via e-mail. |
Audience | The course is open to PhD students and young researchers from degree-granting institutions, including but not limited to Roma Tre University. |
Entity Resolution (ER) is a fundamental Data Integration task. ER seeks to identify and match different manifestations of the same real-world object across different data sources, i.e., duplicates. Many domains, such as government Open Data and the World Wide Web, can provide thousands of data sources about real-world entities, including profiles of people and institutions, or specifications of products and services. In these scenarios, ER is key to solving complex tasks such as building a knowledge graph, and it has been a long-standing interest of many researchers and practitioners in the data integration community.
The sheer number of ways in which real-world entities can be represented and misrepresented makes ER a challenging task for automated strategies, but a relatively easier one for expert humans. For this reason, many frameworks have recently been proposed to better leverage and represent human knowledge in the context of ER, ranging from crowdsourcing approaches, where crowd workers answer questions of the form “do records u and v refer to the same entity?”, to machine learning methods that build on impressive results in the field of natural language processing (e.g., BERT).
The course aims to illustrate these methods and related notions, such as how to produce explainable interpretations of the output of such machine learning models for the ER task, as well as common variants and applications of the ER problem, including the construction of hierarchies and knowledge graphs. All the subjects addressed during the course are investigated from both practical and methodological perspectives. The course includes practical exercises with real-world data and the assignment of individual projects.
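As a toy illustration of the pairwise question above (“do records u and v refer to the same entity?”), the sketch below compares records by the Jaccard similarity of their word tokens and flags pairs above a threshold as candidate duplicates. This is not taken from the course material: the function names, the threshold of 0.5, and the sample records are all assumptions made for illustration, and real ER systems rely on far richer matching methods.

```python
import re
from itertools import combinations


def tokens(record):
    """Lowercase a record and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", record.lower()))


def jaccard(u, v):
    """Jaccard similarity between the token sets of records u and v."""
    a, b = tokens(u), tokens(v)
    if not a and not b:
        return 0.0  # convention chosen here: two empty records do not match
    return len(a & b) / len(a | b)


def candidate_duplicates(records, threshold=0.5):
    """Return index pairs (i, j), i < j, whose similarity reaches the threshold.

    The threshold value is an illustrative assumption, not a recommended setting.
    """
    return [
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(records), 2)
        if jaccard(u, v) >= threshold
    ]


# Hypothetical records, made up for this example.
records = [
    "Roma Tre University, Rome, Italy",
    "University Roma Tre (Rome, Italy)",
    "Sapienza University of Rome",
]

print(candidate_duplicates(records))
```

Note that this naive version compares every pair of records, so its cost grows quadratically with the number of records.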
Lectures will be held online as a Microsoft Teams meeting. Slides will be uploaded before each lecture.
Please register here by June 13th in order to receive the Teams meeting link and news updates.
ID | Date and Time (CEST) | Title | Slides | Main References |
---|---|---|---|---|
1. | June 15th. 3PM-5PM | Introduction to Entity Resolution and Data Integration | Slides | [10, 11] |
2. | June 17th. 11AM-1PM | Modern Approaches for Recognition of Duplicates | Slides | [1,2,5,6] |
3. | June 19th. 3PM-5PM | Modern Approaches for Clustering and Reducing the Duplicates Search Space | Slides | [8,9] |
4. | June 22nd. 3PM-5PM | Explainable AI methods for Entity Resolution | Slides | [3,7] |
5. | June 24th. 3PM-5PM | Beyond Entity Resolution: Knowledge Graphs | Slides | [4,12] |
Assignments consist of individual or team projects, such as a programming project in Python on real-world data using open source frameworks.
For course participants: please fill in the form here by June 25th in order to start planning assignment activities, or to let me know that you are not planning to do the assignment.