The growth of graph size has created new problems in graph visualization and graph analysis. To solve the problem, several graph sampling techniques have been proposed dedicated to obtaining a representative subgraph from a complex network. While prior research indicates that sampling on a large-scale graph is not an easy task, especially for topology-based sampling methods (e.g. breadth first sampling). Topology-based sampling methods can produce a more accurate subgraph than node sampling and edge sampling in preserving statistical graph properties. In this paper, we propose three types of distributed sampling algorithms and develop a sampling package on Spark. To evaluate the effectiveness of these distributed sampling techniques, we apply them to three graph datasets and compare them with traditional/non-distributed sampling approaches. The results show that (1) our distributed sampling approaches are as reliable as the non-distributed sampling techniques, and (2) they are a great improvement in sampling efficiency, especially for topology-based sampling. In addition, (3) the distributed architecture of these algorithms causes them to have horizontal scalability.
Fangyan Zhang, Song Zhang, Christopher Lightsey, "Implementation and Evaluation of Distributed Graph Sampling Methods with Spark" in Proc. IS&T Int’l. Symp. on Electronic Imaging: Visualization and Data Analysis, 2018, pp 379-1 - 379-9, https://doi.org/10.2352/ISSN.2470-1173.2018.01.VDA-379