A novel data-driven robust framework based on machine learning and knowledge graph for disease classification
Md. Saiful Islam
- School of Informatics: Published Papers
Abstract
Because noncommunicable diseases (NCDs) are affected or controlled by diverse factors such as age, region, timeliness, and seasonality, they are difficult to treat accurately, which affects the daily life and work of patients. Although researchers have made progress on certain diseases through clinical and computer-based approaches, the current situation still needs to be improved with computing technologies such as data mining and deep learning. In addition, progress in NCD research has been hampered by the privacy of health and medical data. In this paper, a hierarchical approach is proposed to study the effects of various factors on diseases, and an extensible data-driven framework named d-DC is presented. d-DC classifies diseases according to occupation, given that the disease occurs in a certain region. To collect data, we combined personal or family medical records with traditional methods to build a data acquisition model. The model not only collects and replenishes data automatically but also tackles the cold-start problem when relatively little data is available. The information gathered is diverse, covering structured and unstructured data (such as plain text, images, and videos), which helps improve classification accuracy and the acquisition of new knowledge. In addition to machine learning methods, d-DC employs a knowledge graph (KG) to classify diseases for the first time. Vectorizing medical texts through knowledge embedding is a novel consideration in disease classification. When classification results are anomalous, a medical expert system is proposed to resolve inconsistencies through knowledge bases or online experts.
The results of d-DC are displayed using a combination of the KG and traditional methods, which gives an intuitive, highly descriptive interpretation of the results. Experiments show that d-DC achieved higher accuracy than previous methods. In particular, a fusion method called RKRE, based on both ResNet and the expert system, attained an average proportion of correct classifications of 86.95%, which supports the feasibility of the approach in the field of disease classification.
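The abstract does not spell out how the expert system is triggered, so the following is only a minimal sketch of one plausible reading: a classifier's answer is kept when it is clearly ahead, and the case is deferred to a knowledge-base lookup when the top two candidates are too close to call. The function names (`classify_with_fallback`, `toy_expert`), the `min_margin` threshold, the rule table, and the disease labels are all hypothetical and are not taken from d-DC or RKRE.

```python
from typing import Callable, Dict, Optional


def classify_with_fallback(
    probs: Dict[str, float],
    expert_lookup: Callable[[Dict[str, float]], Optional[str]],
    min_margin: float = 0.2,  # assumed threshold, not from the paper
) -> str:
    """Return the top label, deferring to an expert lookup when the
    two highest scores are closer than min_margin."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0

    # Clear winner: trust the classifier.
    if top_score - runner_up_score >= min_margin:
        return top_label

    # Ambiguous case: ask the (hypothetical) expert system, and fall
    # back to the classifier's top label if it has no answer.
    expert_label = expert_lookup(probs)
    return expert_label if expert_label is not None else top_label


def toy_expert(probs: Dict[str, float]) -> Optional[str]:
    """Stand-in for a knowledge base: one made-up rule keyed by the
    pair of labels the classifier could not separate."""
    rules = {frozenset({"asthma", "bronchitis"}): "bronchitis"}
    top_two = frozenset(sorted(probs, key=probs.get, reverse=True)[:2])
    return rules.get(top_two)
```

In this sketch the margin test is the only trigger. A real system could equally use disagreement between several models or a rule violation in the knowledge graph as the signal for expert review.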