Distributed affinity propagation clustering algorithm based on GraphLab
Abstract
To cluster massive data sets with nonlinear structure efficiently, a distributed streaming affinity propagation algorithm based on GraphLab is proposed, named GStrAP (GraphLab based stream affinity propagation). Following GraphLab's abstraction, the algorithm represents the data as a directed acyclic graph, with data flowing along the edges between vertices, and applies the "Gather-Apply-Scatter" paradigm to carry out data synchronization and algorithm iteration. Experiments were run on the synthetic manifold data sets 3D Clusters, Aggregation, Flame and Pathbased at different data scales, and the clustering performance was compared with that of traditional k-means. The results show that GStrAP scales well with data size and effectively reduces time complexity while preserving clustering quality.
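For readers unfamiliar with the message passing that GStrAP distributes, the following is a minimal single-machine sketch of standard affinity propagation (the responsibility and availability updates of Frey and Dueck) in NumPy. It is not the GStrAP implementation: it uses no GraphLab API, no streaming and no distribution. The function name `affinity_propagation`, the damping value, the iteration count, the median preference and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def affinity_propagation(S, preference=None, damping=0.5, max_iter=200):
    """Standard affinity propagation on a dense similarity matrix S (n x n).

    Illustrative single-machine sketch; parameter defaults are assumptions.
    """
    S = np.array(S, dtype=float)
    n = S.shape[0]
    if preference is None:
        # Common convention (assumed here): median of off-diagonal similarities.
        preference = np.median(S[~np.eye(n, dtype=bool)])
    np.fill_diagonal(S, preference)

    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    rows = np.arange(n)

    for _ in range(max_iter):
        # Responsibility: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1.0 - damping) * R_new

        # Availability: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # Self-availability: a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0.0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0.0)
        A_new[rows, rows] = self_avail
        A = damping * A + (1.0 - damping) * A_new

    # A point k is an exemplar when r(k,k) + a(k,k) > 0.
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Toy 1-D data (hypothetical): two well-separated groups of three points.
X = np.array([0.0, 0.2, 0.5, 10.0, 10.3, 10.4])
S = -(X[:, None] - X[None, :]) ** 2  # negative squared Euclidean distance
exemplars, labels = affinity_propagation(S)
```

Read against the "Gather-Apply-Scatter" paradigm, the row-wise maxima and column-wise sums above would plausibly correspond to a gather over a vertex's edges, the damped update to the apply step, and the propagation of new messages to neighbours to the scatter step. This mapping is an interpretation for illustration and is not taken from the paper.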