Approximately duplicate records detection based on complete sub-graph
- School of Software: Published Papers
Duplicate record detection is the process of identifying multiple records that refer to the same real-world entity or object. However, duplicate records may not share a common key and may contain errors, which makes detecting them difficult. An analysis of the MPN algorithm shows that the transitive closure applied in its merge step raises the false-positive rate. Our improved method treats a set of similar records as a complete sub-graph, so duplicate record detection becomes the problem of finding complete sub-graphs in an association graph whose vertices represent data records and whose edges reflect the similarity between records. The method first builds the association graph and then searches for complete sub-graphs within each sliding window. It takes the vertex corresponding to the first record of the current window as the first potential vertex of a complete sub-graph, and adds a new vertex to that sub-graph only when it is adjacent to every vertex already in the sub-graph. These steps are repeated until all records have been checked. The algorithm also avoids re-detecting parts of a sub-graph that has already been detected. Experimental results show that the improved algorithm effectively resolves the false clusters caused by transitive closure.
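The windowed search described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `window` parameter, the `is_similar` predicate, and the greedy rule for skipping already detected vertices are all assumptions made for illustration. A transitive-closure clustering is included only for contrast.

```python
def build_association_graph(records, window, is_similar):
    """Connect records i and j when they fall in the same sliding window
    and the (hypothetical) predicate is_similar(a, b) holds."""
    n = len(records)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, min(i + window, n)):
            if is_similar(records[i], records[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def complete_subgraph_clusters(adj, window):
    """Greedy search: the first record of each window seeds a sub-graph, and
    a vertex joins only if it is adjacent to every vertex already in it."""
    n = len(adj)
    detected = set()  # assumption: vertices of a detected sub-graph are skipped
    clusters = []
    for first in range(n):
        if first in detected:
            continue
        clique = [first]
        for cand in range(first + 1, min(first + window, n)):
            if cand not in detected and all(cand in adj[v] for v in clique):
                clique.append(cand)
        if len(clique) > 1:
            clusters.append(clique)
            detected.update(clique)
    return clusters


def transitive_closure_clusters(adj):
    """Contrast case: merge every connected component into one cluster."""
    seen, clusters = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if len(component) > 1:
            clusters.append(sorted(component))
    return clusters


if __name__ == "__main__":
    # Toy data: integers count as "similar" when they differ by at most 1.
    records = [1, 2, 3, 10]
    near = lambda a, b: abs(a - b) <= 1
    graph = build_association_graph(records, window=3, is_similar=near)
    print(complete_subgraph_clusters(graph, window=3))
    print(transitive_closure_clusters(graph))
```

In this toy input, records 1 and 2 are similar and records 2 and 3 are similar, but 1 and 3 are not. Transitive closure merges all three into one cluster, while the complete sub-graph search keeps only the pair whose members are mutually similar.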