\section{Introduction}
The tremendous amount of data that we generate through daily applications such as social networking services, online shopping, and news recommendation provides an opportunity to extract hidden but useful, even invaluable, information. Realizing this opportunity, however, requires significant effort, because traditional machine learning algorithms often become extremely inefficient on large amounts of data.
There have been two main approaches to resolving this issue: machine learning researchers have developed new scalable algorithms~\cite{bottou2010large, boyd2011distributed}, while systems and networking researchers have developed generic infrastructure systems that can be leveraged to construct machine learning solutions more efficiently~\cite{dean2008mapreduce, chang2008bigtable}.
However, the best performance is often achieved by carefully integrating both approaches in a single solution.
One such big data problem is analyzing large graphs such as social networks, where it is not unusual to see a network consisting of billions of edges and tens of millions of vertices~\cite{yang2015defining}. In particular, we are interested in the overlapping community detection problem~\cite{xie2013overlapping}: given a set of vertices, the (usually very sparse) links between them, and the number of latent communities, the goal is to learn, for each vertex, a probability distribution over its participation in each community. A community can be seen as a densely connected group of vertices that is only sparsely connected to the rest of the network. This problem is significantly more complex than the related problem of detecting disjoint communities.
The problem of detecting overlapping communities is modeled by the mixed-membership stochastic blockmodel (MMSB)~\cite{airoldi2009mixed}. In this paper we are particularly interested in a variant of MMSB called the assortative MMSB (a-MMSB\footnote{Although we work on a-MMSB for simplicity, it is straightforward to apply the proposed method to the general MMSB model.})~\cite{gopalan2012scalable}.
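Concretely, the a-MMSB generative process~\cite{gopalan2012scalable} can be sketched as follows (a simplified rendering in our own notation, where $\epsilon$ denotes a small inter-community link probability):
\begin{align*}
\beta_k &\sim \mathrm{Beta}(\eta) && \text{link strength of community } k,\\
\pi_a &\sim \mathrm{Dirichlet}(\alpha) && \text{membership vector of vertex } a,\\
z_{a\to b} \sim \pi_a,\; z_{a\leftarrow b} &\sim \pi_b && \text{community indicators for the pair } (a,b),\\
y_{ab} &\sim \mathrm{Bernoulli}(r_{ab}), & r_{ab} &=
\begin{cases}
\beta_k & \text{if } z_{a\to b} = z_{a\leftarrow b} = k,\\
\epsilon & \text{otherwise.}
\end{cases}
\end{align*}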
The MMSB model is a probabilistic graphical model~\cite{koller2009probabilistic} that represents a convenient paradigm for modeling complex relationships between a potentially large number of random variables. Bayesian graphical models, where we define priors and infer posteriors over parameters, also allow us to quantify model uncertainty and facilitate model selection and averaging. However, an increasingly urgent question is whether these models and their inference procedures will be up to the challenge of handling very large graphs.
There have been two main recent advances in this direction of scalable Bayesian inference: methods based on stochastic variational Bayes (SVB)~\cite{gopalan2012scalable,hoffman2013stochastic,gopalan2013efficient} and stochastic gradient Markov chain Monte Carlo (SG-MCMC)~\cite{welling2011bayesian,patterson2013stochastic,ahn2014distributed,ahn2012bayesian}. Both methods have the important property that they only require a small subset of the data for every iteration. In other words, they can be applied to (infinite) streaming data.
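To illustrate the key property that each update touches only a small minibatch, consider stochastic gradient Langevin dynamics~\cite{welling2011bayesian}, the simplest SG-MCMC method, applied to a toy one-dimensional Gaussian model. This is a minimal illustrative sketch: the data, prior, step size, and minibatch size are arbitrary choices for the example, not the settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000                        # full data set size
data = rng.normal(loc=2.0, scale=1.0, size=N)
n = 10                          # minibatch size: each step touches only n points
theta = 0.0                     # parameter: mean of the Gaussian likelihood
eps = 1e-4                      # step size

def grad_log_prior(theta):
    # weak Gaussian prior N(0, 10^2)
    return -theta / 100.0

def grad_log_lik(theta, x):
    # Gaussian likelihood with unit variance
    return x - theta

for t in range(5000):
    batch = rng.choice(data, size=n, replace=False)
    # unbiased estimate of the full-data gradient via the N/n rescaling
    grad = grad_log_prior(theta) + (N / n) * grad_log_lik(theta, batch).sum()
    # Langevin update: half a gradient step plus Gaussian noise of scale sqrt(eps)
    theta += 0.5 * eps * grad + rng.normal(scale=np.sqrt(eps))

# theta now fluctuates around the posterior mean (close to 2.0)
```

The same per-iteration structure, gradient estimated from a minibatch plus injected noise, underlies the SG-MCMC sampler for a-MMSB used in this paper.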
In this paper, we focus on the SG-MCMC method applied to the a-MMSB model. Recent work~\cite{LiAW15} showed that this methodology is faster and more accurate than the SVB method. The authors of~\cite{LiAW15} further propose a heuristic that scales up the number of communities at the cost of precision; in this paper we scale the algorithm without that heuristic, by means of a custom high-performance implementation. To this end, we propose the design of a parallel and distributed system specifically tailored to the a-MMSB problem. In particular, we use a combination of OpenMP, MPI, and RDMA to efficiently scale and accelerate the algorithm's computation.
Achieving this goal required overcoming several challenges. First, the algorithm's state grows rapidly with the size of the graph and the number of latent communities. Since the full state is too large to fit in a single machine's memory, it is partitioned and distributed across a cluster of machines. Second, to access the full state, each cluster node must read remote memory hosted by its peers. We leverage RDMA to limit the latency of such operations and to increase communication bandwidth. To hide the remaining latency, we pipeline the algorithm's computations so that data can be fetched in advance over the network. Finally, the algorithm's computation is distributed across the cluster nodes and further parallelized within each node by exploiting their multi-core CPUs.
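The prefetch pipeline can be illustrated schematically. The sketch below uses plain Python threads and dummy `fetch_remote_state`/`compute` functions (our own stand-ins, not the system's actual API) in place of the real RDMA reads and SG-MCMC updates; it only demonstrates how communication is overlapped with computation.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_remote_state(batch_id):
    time.sleep(0.01)            # models remote-read latency
    return {"batch": batch_id}  # stand-in for remotely hosted model state

def compute(state):
    time.sleep(0.01)            # models the per-minibatch update
    return state["batch"]

def run_pipelined(num_batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_remote_state, 0)    # prime the pipeline
        for b in range(num_batches):
            state = future.result()                    # wait for prefetched state
            if b + 1 < num_batches:                    # start the next fetch early,
                future = pool.submit(fetch_remote_state, b + 1)
            results.append(compute(state))             # ...overlapping it with compute
    return results

print(run_pipelined(4))  # prints [0, 1, 2, 3]
```

With the fetch for minibatch $b{+}1$ issued before the computation on minibatch $b$ begins, the remote-read latency is hidden behind useful work whenever computation and communication take comparable time.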
The remainder of this paper is organized as follows. Section~2 provides an overview of the algorithm and its theoretical foundation. The design and implementation of the parallel and distributed solution is presented in Section~3. Section~4 evaluates the efficacy of the system and analyzes its performance. Finally, Section~5 provides concluding remarks.