1

Network Topology and
Communication-Computation Tradeoffs in
Decentralized Optimization
Angelia Nedić, Alex Olshevsky, and Michael G. Rabbat

arXiv:1709.08765v2 [math.OC] 15 Jan 2018

Abstract
In decentralized optimization, nodes cooperate to minimize an overall objective function that is the sum (or
average) of per-node private objective functions. Algorithms interleave local computations with communication
among all or a subset of the nodes. Motivated by a variety of applications—decentralized estimation in sensor
networks, fitting models to massive data sets, and decentralized control of multi-robot systems, to name a few—
significant advances have been made towards the development of robust, practical algorithms with theoretical
performance guarantees. This paper presents an overview of recent work in this area. In general, rates of convergence
depend not only on the number of nodes involved and the desired level of accuracy, but also on the structure and
nature of the network over which nodes communicate (e.g., whether links are directed or undirected, static or
time-varying). We survey the state-of-the-art algorithms and their analyses tailored to these different scenarios,
highlighting the role of the network topology.

I. I NTRODUCTION
In multi-agent consensus optimization, n agents or nodes, as we will refer to them throughout this article,
cooperate to solve an optimization problem. A local objective function fi : Rd → R is associated with each
node i = P
1, . . . , n, and the goal is for all nodes to find and agree on a minimizer of the average objective
1
f (x) = n ni=1 fi (x) in a decentralized way. Each node maintains its own copy xi ∈ Rd of the optimization
variable, and node i only has direct access to information about its local objective fi ; for example, node i may be
able to calculate the gradient ∇fi (xi ) of fi evaluated at xi . Throughout this article we focus on the case where the
functions fi are convex (so f is also convex) and where f has a non-empty set of minimizers so that the problem
is well-defined.
Because each node only has access to local information, the nodes must communicate over a network to find
a minimizer of f (x). Multi-agent consensus optimization algorithms are iterative, where each iteration typically
involves some local computation followed by communication over the network.
A. Architectures for Distributed Optimization
Gradient descent is a simple, well-studied, and widely-used method for solving minimization problems, and it is
one of the first methods one typically studies in a course on numerical optimization [1]–[3]. Gradient descent is a
prototypical first-order method because it only makes use of gradients ∇f (x) ∈ Rd of a continuously differentiable
objective function f : Rd → R to find a minimizer x∗ , gradients being the first-order derivatives of f . It is useful
to discuss how one may implement gradient descent in a distributed manner in order to build intuition for the
multi-agent optimization methods on which we focus in this article.
Centralized gradient descent for minimizing the function f (x) starts with an initial value x0 and recursively
updates it for k = 1, 2, . . . , by setting
xk+1 = xk − αk ∇f (xk ),
(1)
A. Nedić is with the School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ, USA.
A. Olshevsky is with the Department of Electrical and Computer Engineering, Boston University, Boston, MA, USA.
M.G. Rabbat is with Facebook AI Research, Montréal, Canada, and the Department of Electrical and Computer Engineering, McGill
University, Montréal, Canada.
Email: angelia.nedich@asu.edu, alexols@bu.edu, michael.rabbat@mcgill.ca
The work of A.N. and A.O. was supported by the Office of Naval Research under grant number N000014-16-1-2245. The work of A.O.
was also supported by NSF under award CMMI-1463262 and AFOSR under award FA-95501510394. The work of M.R. was supported by
the Natural Sciences and Engineering Research Council of Canada under awards RGPIN-2012-341596 and RGPIN-2017-06266.

2

f1

f1

f2

fn

f1

f2

fn

f2

fn

Master

..

.

f3

(a) Master-worker

..

f3

.

(b) Fully-connected

..

.

f3

(c) General connected

Fig. 1. Three example architectures for distributed optimization. (a) In the master-worker architecture, each agent sends and receives messages
from the master node. (b) In a fully-connected architecture, each agent sends and receives messages to every other agent in the network. (c)
In a general multi-agent architecture, each agent only communicates with a subset of the other agents in the network. Note, in these figures
the positions of each agent aren’t meant to reflect geographic locations; rather, the aim is just to depict the communication topology.

where α1 , α2 , . . . , is a sequence of positive scalar step-sizes. When f is convex, it has a unique minimum, and it
is well-known that, for appropriate choices of the step-sizes αk , the sequence of values f (xk ) converges to this
minimum [1]–[3].
Now, recall the multi-agent setup, where
n

1X
fi (x)
f (x) =
n

(2)

i=1

and where the gradient ∇fi (x) can only be evaluated at agent i. There are a variety of distributed architectures one
may consider in this setting, and we discuss three here: 1) the master-worker architecture, 2) the fully-connected
architecture, and 3) a general, connected architecture. They are depicted in Fig. 1 and described next.
1) The master-worker architecture: When f (x) decomposes as in (2), then the gradient also decomposes, and

1
∇f (x) =
∇f1 (x) + ∇f2 (x) + · · · + ∇fn (x) ,
n
that is, the gradient of the overall objective is the average of the gradients of the local objectives.
In a master-worker architecture, one node acts as the master (sometimes also called the fusion center), maintaining
the authoritative copy of the optimization variable xk . At each iteration, it sends xk to every agent, and agent i
returns ∇fi (xk ) to the master. The master averages the gradients it receives from the agents, and once it has
received a gradient from every agent it can perform the gradient descent update (1), before proceeding to the next
iteration.
The master-worker architecture is useful in that it is relatively simple to implement. However in many applications
it may be unattractive or impractical for a variety of reasons. First, as the number of nodes n grows large, the master
node may become a communication bottleneck if it has limited communication resources (e.g., if its bandwidth
does not grow linearly with the size of the network), and at the same time, scaling the bandwidth of the master
with the size of the network may be expensive or impractical. Also, the master node may become a robustness
bottleneck, in the sense that if the master node fails then the entire network fails. In addition, in many scenarios it
may not be practical to have a single master node that communicates with all agents. For example, if agents are
low-power devices communicating via wireless radios, then two devices may only be able to communicate if they
are nearby and it may not be practical to have all nodes within the required proximity of the master.
2) The fully-connected architecture: A natural first step to address some of the issues of the master-worker
architecture is to eliminate the master node, leading to a fully-connected peer-to-peer architecture, where each node
communicates directly with all other nodes. In this case, each node i = 1, . . . , n maintains a local copy of the
optimization variable, xki ∈ Rd . To mimic centralized gradient descent in a similar way, suppose that the local
copies of the optimization variables are initialized to the same value, e.g., x0i = 0 for all i = 1, . . . , n. Then, each
node computes its local gradient ∇fi (x0i ) and sends it to every other node in the network. Once a node has received

3

gradients from all other nodes, it can average them, and since x0i was initialized to be the same at every node, we
have that
n
1X
∇fj (x0j ) = ∇f (x0i ),
for all i = 1, . . . , n.
n
j=1

Thus, using the average of the gradients received from its neighbors, node i can update


n
X
1
x1i = x0i − α0 
∇fj (x0j )
n

(3)

j=1

and x1i is exactly equivalent to having taken one step of centralized gradient descent. Furthermore, the values x1i
will be identical at all nodes, and so we can repeat this process recursively to essentially implement centralized
gradient descent exactly in a distributed manner.
For the fully-connected architecture just described1 , each node acts like a master in the master-worker architecture,
and so the fully-connected architecture suffers from the same issues as the master-worker architecture. Moreover,
the communication overhead of having all nodes communicate at every iteration is even worse than the masterworker architecture (it grows quadratically in the number of nodes n, whereas the communication overhead was
linear in n for the master-worker architecture). Nevertheless, the fully-connected architecture provides a conceptual
transition from the master-worker architecture to general connected (but not fully-connected) architectures.
3) General multi-agent architectures: Consider a peer-to-peer architecture where node i is only connected to
a subset of the other nodes, and not necessarily all of them. Let Ni ⊂ {1, . . . , n} denote the neighbors of node
i: the subset of nodes that sends messages to node i. Similar to the fully connected case, suppose that x0i ∈ Rd
is initialized to the same value at all nodes, and let xki denote the value at node i after k iterations. We can
approximately implement gradient descent in a decentralized manner by mimicking the update (3), but where agent
i only averages over the gradients it receives from its neighbors, so that


X
1
(4)
∇fj (xkj ) ,
= xki − αk 
xk+1
i
|Ni |
j∈Ni

where |Ni | is the size of node i’s neighborhood.
This approach given in (4) is prototypical of most multi-agent optimization algorithms, in that the update equation
can be implemented in the following steps, which are executed in parallel at every node, i = 1, . . . , n:
1) Node i locally computes ∇fi (xki ).
2) Node i transmits its gradient ∇fi (xki ) and receives gradients ∇fj (xkj ) from its neighbors j ∈ Ni .
, e.g., via equation (4).
3) Node i uses this new information to compute the new value xk+1
i
Different multi-agent optimization algorithms may differ in terms of what information gets exchanged in the second
step, and in the precise way they compute the update in the last step, not necessarily using (4), as well as in the
assumptions they make about the local objective functions fi or the communication topology, captured by the
neighborhoods Ni . For example: the communication topology may be static or it may vary from iteration to
iteration; communications may be undirected (node i receives messages from node j if and only if j also receives
messages from i) or directed.
The general multi-agent approach to implementing gradient descent, given in (4), also raises a set of issues
which did not come up in the other architectures. Since the master-worker and fully-connected architectures exactly
implement gradient descent, the well-established convergence theory for gradient descent directly applies to those
architectures. However, when nodes update using the rule (4), they no longer exactly implement centralized gradient
descent, because they use a search direction
1 X
∇fj (xkj ) 6= ∇f (xki )
|Ni |
j∈Ni

which is the average of a subset, rather than all, of the gradients at other nodes. Thus, after the first iteration, the
local values x1j at different nodes are no longer equivalent. Subsequently, at the next iteration, the local gradients
1

We refer to it as fully-connected because every node communicates with every other node at each iteration.

4

being averaged at node i will have been evaluated at different values x1j . Therefore, there are a few ways in which
the values produced by multi-agent gradient descent and deviates from centralized gradient descent. One may hope
that, under the right conditions, the values at different nodes will not be too different from each other and that the
local search directions will be sufficiently similar to the gradient search direction that the nodes still converge to
(and agree on!) a minimizer of f (x).
Indeed, we will see that we can identify a variety of conditions under which multi-agent optimization algorithms
are guaranteed to converge, and we can precisely quantify how the convergence rate differs from that of centralized
gradient descent. Most often, this difference depends directly on the communication topology. In many applications
of interest, either it is not possible or one does not allow each node to communicate with every other node. The
connectivity of the network (i.e., which pairs of nodes may communicate directly with each other) can be represented
as a graph with n vertices and with an edge from j to i if node j receives messages from node i. We will see
that the communication network topology plays a key role in the convergence theory of multi-agent optimization
methods in that it may limit the flow of information between distant nodes and thereby hinder convergence.
During the past decade, multi-agent consensus optimization has been the subject of intense interest, motivated
by a variety of applications which we discuss in next.
B. Motivating Applications
The general multi-agent optimization problem described above was originally introduced and studied in the
1980’s in the context of parallel and distributed numerical methods [4]–[6]. The surge of interest in multi-agent
convex optimization during the past decade has been fueled by a variety of applications where a network of
autonomous agents must coordinate to achieve a common objective. We describe three such examples next; for a
survey describing additional applications of multi-agent methods for coordination, see [7].
1) Decentralized estimation: Consider a wireless sensor network with n nodes where node i has a measurement
ζi which is modeled as a random variable with density p(ζi |x) depending on unknown parameters x. For example,
the network may be deployed to monitor a remote or difficult to reach location, and the estimate of x may be
used for scientific observation (e.g., bird migration patterns) or for detecting events (e.g., avalanches) [8]. In many
applications of sensor networks, uncertainty is primarily due to thermal measurement noise introduced at the sensor
itself, and so it is reasonable to assume that the observations {ζi }ni=1 are conditionally independent given the model
parameters x. In this case, the maximum likelihood estimate of x can obtained by solving
minimize
x

−

n
X

log p(ζi |x),

i=1

which can be addressed by using multi-agent consensus optimization methods [9]–[14] with fi (x) = − log p(ζi |x).
In this example, the data are already being gathered in a decentralized manner at different sensors. When the data
dimension is large (e.g., for image or video sensors), it can be more efficient to perform decentralized estimation
and simply transmit the estimate of x to the end-user, rather than transmitting the raw data and then performing
centralized estimation [9]. Similarly, even if the data dimension is not large, if the number n of nodes in the network
is large, it may still be more efficient to perform decentralized estimation rather than sending raw data to a fusion
center, since the fusion center will become a bottleneck.
When the nodes communicate over a wireless network, whether or not a given pair of nodes can directly
communicate is typically a function of their physical proximity as well as other factors (e.g., fading, shadowing)
affecting the wireless channel, which may possibly result in time-varying and directed network connectivity.
2) Big data and machine learning: Many methods for supervised learning (e.g., classification or regression) can
also be formulated as fitting a model to data. This task may generally be expressed as finding model parameters x
by solving
m
X
minimize
lj (x),
(5)
x

j=1

where the loss function lj (x) measures how well the model with parameters x describes the j th training instance,
and the training data set contains m instances in total. For many popular machine learning models—including

5

linear regression, logistic regression, ridge regression, the LASSO, support vector machines, and their variants—the
corresponding loss function is convex [15].
When m is large, it may not be possible to store the training data on a single server, or it may be desirable, for
other reasons, to partition the training data across multiple nodes (e.g., to speedup training by exploiting parallel
computing resources, or because the data is being gathered and/or stored at geographically distant locations). In this
case, the training task (5) can be solved using multi-agent optimization with local objectives of the form [16]–[20]
X
fi (x) =
lj (x),
j∈Ji

where Ji is the set of indices of training instances at node i.
In this setting, where the nodes are typically servers communicating over a wired network, it may be feasible for
every node to send and receive messages from all other nodes. However, it is often still preferable, for a variety of
reasons, to run multi-agent algorithms over a network with sparser connectivity. Communicating a message takes
time, and reducing the number of edges in the communication graph at any iteration corresponds to reducing the
number of messages to be transmitted. This results in iterations that take less time and also that consume less
bandwidth.
3) Multi-robot systems: Similar to the previous example, multi-agent methods have attracted attention in applications requiring the coordination of multiple robots because they naturally lead to decentralized solutions. One
well-studied problem arising in such systems is that of rendezvous—collectively deciding on a meeting time and
location. When the robots have different battery levels or are otherwise heterogeneous, it may be desirable to design
a rendezvous time and place, and corresponding control trajectories, which minimize the energy to be expended
collectively by all robots. This can be formulated as a multi-agent optimization problem where the local objective
fi (x) at agent i quantifies the energy to be expended by agent i and x encodes the time and place for rendezvous [7],
[21].
When robots communicate over a wireless network, the network connectivity will be dependent on the proximity
of nodes as well as other factors affecting channel conditions, similar to in the first example. Moreover, as the
robots move, the network connectivity is likely to change. It may be desirable to ensure that a certain minimal
level of network connectivity is maintained while the multi-robot system performs its task, and such requirements
can be enforced by introducing constraints in the optimization formulation [22].
C. Outline of the rest of the paper
The purpose of this article is to provide an overview of the main advances in this field, highlighting the stateof-the-art methods and their analyses, and pointing out open questions. During the past decade, a vast literature
has amassed on multi-agent optimization and related methods, and we do not attempt to provide an exhaustive
review (which, in any case, would not be feasible in the space of one article). Rather, in addition to describing the
main advances and results leading to the current state-of-the-art, we also seek to provide an accessible survey of
theoretical techniques arising in the analysis of multi-agent optimization methods.
As we have already seen, decentralized averaging algorithms—where each node initially holds a number or
vector, and the aim is to calculate the average at every node—form a fundamental building block of multi-agent
optimization methods. Section II reviews decentralized averaging algorithms and their convergence theory in the
setting of undirected graphs, where node i receives message from node j if and only if j also receives messages
from i,. The main results of this section provide conditions under which decentralized averaging algorithms converge
asymptotically to the exact average, and they quantify how close the values at each node are to the exact average
after a finite number k of iterations. We initially consider the general scenario where the communication topology is
time-varying, finding that a sufficient condition for convergence is that the topology be sufficiently well-connected
over periodic windows of time. Then, for the particular case where the communication topology is static, we present
stronger results illustrating how the rate of convergence depends intimately on properties of the communication
topology.
With these results for decentralized averaging in hand, Section III presents multi-agent optimization methods
and theory in the setting of undirected communication networks. This section reviews convergence theory for
the centralized subgradient method, which generalizes gradient descent to handle convex functions which are not

6

necessarily differentiable; such functions arise in a variety of important contemporary applications, such as estimators
using `1 regularization (e.g., the LASSO) or fitting support vector machines. The main results of this section establish
conditions for convergence of decentralized, multi-agent subgradient descent, including quantifying how the rate
of convergence depends on the network topology. This section also discusses recent advances which lead to faster
convergence in certain settings, such as when the network size is known in advance, or when the objective function
is strongly convex.
Section IV then describes how decentralized averaging and multi-agent optimization methods can be extended to
run over networks with directed connectivity (i.e., where node i may receive messages from j although j does not
receive messages from i). The key technique we study, which enables this extension, is the so-called “push-sum”
approach. We provide a novel, concise analysis of the push-sum method for decentralized averaging, and then we
describe how it can be used to obtain decentralized optimization methods.
Section V discusses a variety of ways that the basic approaches described in Sections III and IV can be extended.
For example, in both Sections III and IV we limit our attention to methods for unconstrained optimization problems
running in a synchronous manner. Sections V discusses how to handle constrained optimization problem and how
multi-agent optimization methods can be implemented in an asynchronous manner. It also describes other extensions,
such as handling stochastic gradient information or online optimization (where the objective function varies over
time), and discusses connections to other methods for distributed optimization.
We conclude in Section VI and highlight some open problems.
D. Notation
Before proceeding, we summarize some notation that is used throughout the rest of this paper. A matrix is called
stochastic if it is nonnegative and the sum of the elements in each row equals one. A matrix is called doubly
stochastic if, additionally, the sum of the elements in each column equals one. To a stochastic matrix A ∈ Rn×n ,
we associate the directed graph GA with vertex set {1, . . . , n} and edge set EA = {(i, j) | aji > 0}. Note that this
graph may contain self-loops. A directed graph is strongly connected if there exists a directed path from any initial
vertex to every other vertex in the graph. For convenience, we abuse notation slightly by using the matrix A and
the graph GA interchangeably; for example, we say that A is strongly connected. Similarly, we say that the matrix
A is undirected if (i, j) ∈ EA implies (j, i) ∈ EA . Finally, we denote by [A]α the thresholded matrix obtained
from A by setting every element smaller than α to zero.
Given a sequence of stochastic matrices A0 , A1 , A2 . . ., for k > l, we denote by Ak:l the product of matrices Ak
to Al inclusive, i.e.,
Ak:l = Ak Ak−1 · · · Al .
We say that a matrix sequence is B -strongly-connected if the graph with vertex set {1, . . . , n} and edge set
(l+1)B−1

[

EAk

k=lB

is strongly connected for each l = 0, 1, 2, . . .. Intuitively, we partition the iterations k = 1, 2, . . . into consecutive
blocks of length B , and the sequence is B -strongly-connected when the graph obtained by unioning the edges within
each block is always strongly connected. When the graph sequence is undirected, we will simply say B -connected.
The out-neighbors of node i at iteration k refers to the set of nodes that can receive messages from it,
Niout,k = {j | akji > 0},

and similarly, the in-neighbors of i at iteration k are the nodes from which i receives messages,
Niin,k = {j | akij > 0}.

We assume that i is always an neighbor of itself (i.e., the diagonal entries of Ak are non-zero), which means that
we always have i ∈ Niout,k and i ∈ Niin,k . When the graph is not time-varying, we simply refer to the out-neighbors
Niout and in-neighbors Niin . When the graph is undirected, the sets of in-neighbors and out-neighbors are identical,
so we will simply refer to the neighbors Ni of node i, or Nik in the time-varying setting. The out-degree of node
,k
,k
out
in
i at iteration k is defined as the cardinality of Niout,k and is denoted by dout
= |Niout,k |. Similarly, din
i
i , di , di ,
in
,k
dki , and di denote the cardinalities, respectively, of the sets Ni , Niout , Niin , Nik , and Ni .

7

II. D ECENTRALIZED AVERAGING OVER U NDIRECTED G RAPHS
This section reviews methods for decentralized averaging that will form a key building block in our subsequent
discussion of methods for multi-agent optimization.
A. Preliminaries: Results for averaging
We begin by examining the linear consensus process defined as
xk+1 = Ak xk ,

k = 0, 1, . . .

(6)

where the matrices Ak ∈ Rn×n are stochastic, and the initial vector x0 ∈ Rn is given. Various forms of Eq. (6)
can be implemented in a decentralized multi-agent setting, and these form the backbone of many decentralized
algorithms.
For example, consider a collection of nodes interconnected in a directed graph and suppose node i holds the
i’th coordinate of the vector xk . Consider the following update rule: at step k , node i broadcasts the value xki to
its out-neighbors, receives values xkj from its in-neighbors, and sets xk+1
to be the average of the messages it has
i
received, so that
1 X k
xj .
xk+1
= in,k
i
di
in,k
j∈Ni

This is sometimes called the equal neighbor iteration, and by stacking up the variables xki into the vector xk it
can be written in the form of Eq. (6) with an appropriate choice for the matrix Ak .
Intuitively, we may think of the equal-neighbor updates in terms of an opinion dynamics process over a network
wherein node i repeatedly revises it’s opinion vector xki by averaging the opinions of it’s neighbors. As we will
see later, under some relatively mild conditions this process converges to a state where all opinions are identical,
explaining why Eq. (6) is usually referred to as the “consensus iteration.”
Over undirected graphs, an alternative popular choice of update rule is to set

X 
k
k
k
xk+1
=
x
+

x
−
x
i
j
i ,
i
j∈Nik

where  > 0 is sufficiently small. Unfortunately, finding an appropriate choice of  to guarantee convergence of
this iteration can be bothersome (especially when the graphs are time-varying), and it generally requires knowing
an upper bound on the degrees of nodes in the network.
Another possibility (when the underlying graphs are undirected) is the so-called Metropolis update


X
1
k
k
k
xk+1
=
x
+
x
−
x
(7)
i
i .
i
max{dki , dkj } j
k
j∈Ni

The Metropolis update requires node i to broadcast both xki and its degree dki to its neighbors. Observe that the
Metropolis update of Eq. (7) can be written in the form of Eq. (6) where the matrices Ak are doubly stochastic.
A variation on this is the so-called lazy Metropolis update,


X
1
k
k
k
xk+1
x
−
x
(8)
=
x
+
i
i ,
i
2 max{dki , dkj } j
k
j∈Ni

with the key difference being the factor of 2 in the denominator. It is standard convention within the probability
literature that such updates are called “lazy,” since they move half as much per iteration. As we will see later, the
lazy Metropolis iteration possesses a number of attractive convergence properties.
As we have alluded to above, under certain technical conditions, the iteration of Eq. (6) results in consensus,
meaning that all of the xki (for i = 1, . . . , n) approach the same value as k → ∞. We describe one such condition
next. The key properties needed to ensure asymptotic consensus are that the matrices Ak should exhibit sufficient
connectivity and aperiodicity (in the long term). In the following, we use the shorthand Gk for GAk , the graph
corresponding to the matrix Ak . The starting point of our analysis is the following assumption.

8

Assumption 1. The sequence of directed graphs G0 , G1 , G2 , . . . is B -strongly-connected. Moreover, each graph
Gk has a self-loop at every node.
As the next theorem shows, a variation on this assumption suffices to ensure that the update of Eq. (6) converges
to consensus.
Theorem 1 ([5], [23]). [Consensus Convergence over Time-Varying Graphs] Suppose the sequence of stochastic matrices A0 , A1 , A2 , . . . has the property that there exists an α > 0 such that the sequence of graphs G[A0 ]α , G[A1 ]α , G[A2 ]α , . . .
satisfies Assumption 1. Then x(t) converges to a limit in span{1} and the convergence is geometric2 . Moreover, if
all the matrices Ak are doubly stochastic then for all i = 1, . . . , n,
Pn
x0
k
lim x = i=1 i .
k→∞
n
On an intuitive level, the theorem works by ensuring two things. First, there needs to be an assumption of repeated
connectivity in the system over the long-term, and this is what the strong-connectivity condition does. Furthermore,
thresholding the weights at some strictly positive α rules out counterexamples where the weights decay to zero
with time. Secondly, the assumption that every node has a self-loop rules out a class of counterexamples where
the underlying graph is bipartite and the underlying opinions oscillate3 . Once these potential counterexamples are
ruled out, Theorem 1 guarantees convergence.
We now turn to the proof of this theorem, which while being reasonably short, still builds on a sequence of
preliminary lemmas and definitions which we present first. Given a sequence of directed graphs G0 , G1 , G2 , . . . ,,
we say that node b is reachable from node a in time period k : l if there exists a sequence of directed edges
ek , ek−1 , . . . , el+1 , el , such that: (i) ej is present in Gj for all j = l, . . . , k , (ii) the origin of el is a, (iii) the
destination of ek is b. Note that this is the same as stating that [W k:l ]ba > 0 if the matrices W k are nonnegative
with [W k ]ij > 0 if and only if (j, i) belongs to Gk . We use N k:l (a) to denote the set of nodes reachable from
node a in time period k : l.
The first lemma discusses the implications of Assumption 1 for products of the matrices Ak .
Lemma 2 ([5], [23], [24]). Suppose A0 , A1 , A2 , . . . is a sequence of nonnegative matrices with the property that
there exists α > 0 such that the sequence of graphs G[A0 ]α , G[A1 ]α , G[A2 ]α , . . . satisfies Assumption 1 . Then for
any integer l, A(l+n)B−1:lB is a strictly positive matrix. In fact, every entry of A(l+n)B−1:lB is at least αnB .
The proof, given next, is a mathematical formalization of the observation that sufficiently long paths exist between
any two nodes.
Proof. Consider the set of nodes reachable from node i in time period kstart to kfinish in the graph sequence
G[A0 ]α , G[A1 ]α , G[A2 ]α , . . ., and denote this set by N kfinish :kstart (i). Since each of these graphs has a self-loop at
every node by Assumption 1, the reachable set can only be enlarged, i.e.,
N kfinish :kstart (i) ⊆ N kfinish +1:kstart (i) for all i, kstart , kfinish .

A further immediate consequence of Assumption 1 is that if N mB−1:lB (i) 6= {1, . . . , n}, then N (m+1)B−1:lB (i) is
strictly larger than N mB−1:lB (i), because during times (m + 1)B − 1 : mB there is an edge in some [Gi ]α leading
from the set of nodes already reachable from i to those not already reachable from i. Putting together these two
properties, we obtain that from lB to (l + n)B − 1 every node is reachable, i.e.,
N (l+n)B−1:lB (i) = {1, . . . , n}.

But since every non-zero entry of [Ak ]α is at least α by construction, this implies that A(l+n)B−1:lB ≥ αnB , and
the lemma is proved.
A sequence of vectors z(t) converges to the limit z geometrically if kz(t) − zk2 ≤ cαt for some
 c ≥ 0, and 0 < α < 1.
0 1
3
k+1
Indeed, observe that if we do not require that each node has a self-loop, the dynamics x
=
xk , started at x0 6= 0, would
1 0
be a counterexample to Theorem 1.
2

9

Lemma 2 tells us that, over sufficiently long horizons, the products of the matrices Ak have entries bounded
away from zero. The next lemma discusses what multiplication by such a matrix does to the spread of the values
in a vector.
Lemma 3. Suppose W is a stochastic matrix, every entry of which is at least β > 0. If v = W u then


max vi − min vi ≤ (1 − 2β) max ui − min ui .
i=1,...,n

i=1,...,n

i=1,...,n

i=1,...,n

Proof. Without loss of generality, let us assume that the largest entry of u is u1 and the smallest entry of u is un .
Then, for l ∈ {1, . . . , n},
vl ≤ (1 − β)u1 + βun ,
vl ≥ βu1 + (1 − β)un ,

so that for any a, b ∈ {1, . . . , n}, we have
va − vb ≤ (1 − β)u1 + βun − (βu1 + (1 − β)un )
= (1 − 2β)u1 − (1 − 2β)un .

With these two lemmas in place, we are ready to prove Theorem 1. Our strategy is to apply Lemma 3 repeatedly
to show that the spread of the underlying vectors keeps getting smaller.
Proof of Theorem 1. Since we have assumed that G[A0 ]α , G[A1 ]α , G[A2 ]α , . . . satisfy Assumption 1, by Lemma 2,
we have that
i
h
AlB:(l+n)B−1 ≥ αnB
ij

for all l = 0, 1, 2, . . ., and i, j = 1, . . . , n. Applying Lemma 3 gives that
(l+n)B
(l+n)B
max x
− min xi
i=1,...,n i
i=1,...,n

≤ 1 − 2α

nB




max

i=1,...,n

xlB
i −

min

i=1,...,n

xlB
i


.

Applying this recursively, we obtain that |xka − xkb | → 0 for all a, b ∈ {1, . . . , n}.
To obtain further that every xki converges, it suffices to observe that xki lies in the convex hull of the vectors xt
for t ≤ k . Finally, since each Ak is doubly stochastic,
X
X
xk+1
= 1T xk+1 = 1T Ak xk = 1T xk =
xkj ,
j
j=1,...,n

j=1,...,n

where 1 denotes a vector with all entries equal to one, and thus all xki must converge to the initial average.
A potential shortcoming of the proof of Theorem 1 is that the convergence time bounds it leads to tend to
scale poorly in terms of the number of nodes n. We can overcome this shortcoming as illustrated in the following
propositions. These results apply to a much narrower class of scenarios, but they tend to provide more effective
bounds when they are applicable.

The first step is to introduce a precise notion of convergence time. Let T n, , {A0 , A2 , . . . , } denote the first
time k when
Pn
x0i
1
xk − i=1
n
Pn
≤ .
0
xi
x0 − i=1
1
n
In other words, the convergence time is defined as the time until the deviation from the mean shrinks by a factor
of . The convergence time is a function of the desired accuracy  and of the underlying sequence of matrices. In
particular, we emphasize the dependence on the number of nodes, n. When the sequence of matrices is clear from
context, we will simply write T (n, ).

10

Proposition 4. Suppose
xk+1 = Ak xk

where each Ak is a doubly stochastic matrix. Then
!
Pn
Pn
  k
0
x
x0
l
k
0
i=1 i
1 ≤
sup σ2 A
x −
x − i=1 i 1 ,
n
n
l=0,1,2,...
2
2
where σ2 (Al ) denotes the second-largest singular value of the matrix Al .
We skip the proof, which follows quickly from the definition of singular value.
We adopt the slightly non-standard notation
 
λ = sup σ2 Al ,

(9)

l≥0

so that the previous proposition can be conveniently restated as
Pn
Pn
0
x0
k
k
0
i=1 xi
x −
1 ≤ λ x − i=1 i 1 .
n
n
Recalling that log(1/λ) ≤ 1/(1 − λ), a consequence of this equation is that



1
1
0
1
ln
,
T n, , {A , A , . . . , } = O
1−λ 

(10)

(11)

so the number λ provides an upper bound on the convergence rate of decentralized averaging.
In general, there is no guarantee that λ < 1, and the equations we have derived may be vacuous. Fortunately, it
turns out that for the lazy Metropolis matrices on connected graphs, it is true that λ < 1, and furthermore, for many
families of undirected graphs it is possible to give order-accurate estimates on λ, which translate into estimates of
convergence time. This is captured in the following proposition. Note that all of these bounds should be interpreted
as scaling laws, explaining how the convergence time increases as the network size n increases, when the graphs
all come from the same family.
Proposition 5 (Network Scaling for Average Consensus via the Lazy Metropolis Iteration). If each Ak is the lazy
Metropolis matrix on the ...

1) ...path graph, then T (n, ) = O n2 log(1/) .
2) ...2-dimensional grid, then T (n, ) = O (n log n log(1/)).
3) ...2-dimensional torus, then T (n, ) = O (n log(1/)). 
4) ...k -dimensional torus, then T (n, ) = O n2/k
 log(1/) .
5) ...star graph, then T (n, ) = O n2 log(1/) .

6) ...two-star graphs4 , then T (n, ) = O n2 log(1/) .
7) ...complete graph, then T (n, ) = O(1).
8) ...expander graph, then T (n, ) = O(log(1/)).
9) ...Erdős-Rényi random graph5 then with high probability6 T (n, ) = O(log(1/)).
10) ...geometric random graph7 , then with high probability T (n, ) =O(n log n log(1/)).
11) ...any connected undirected graph, then T (n, ) = O n2 log(1/) .
Sketch of the proof. The spectral gap 1/(1 − λ) can be bounded as O(H) where H is the largest hitting time of
the Markov chain whose probability transition matrix is the lazy Metropolis matrix (Lemma 2.13 of [27]). We thus
4

A two-star graph is composed of two star graphs with a link connecting their centers.

An Erdős-Rényi random graph with n nodes and parameter p has a symmetric adjacency matrix A whose n2 distinct off-diagonal
entries are independent Bernoulli random variables taking the value 1 with probability p. In this article we focus on the case where
p = (1 + ε) log(n)/n, where ε > 0, for which it is known that the random graph is connected with high probability [25].
6
A statement is said to hold “with high probability” if the probability of it holding approaches 1 as n → ∞. In this context, n is the
number of nodes of the underlying graph.
7
A geometric random graph is one where n nodes are placed uniformly and independently in the unit square [0, 1]2 and two nodes are
connected with an edge if their distance is at most rn . In this article we focus on the case where rn2 = (1 + ε) log(n)/n for some ε > 0,
for which it is known that the random graph is connected with high probability [26].
5

11

(a)

(b)

(c)

(d)

Fig. 2. Examples of some graph families mentioned in Prop. 5. (a) Path graph with n = 5. (b) 2-d grid with n = 16. (c) Star graph with
n = 6. (d) Two-star graph with n = 12.

only need to bound hitting times on the graphs in question, and these are now standard exercises. For example, the
fact that the hitting time on the path graph is quadratic in the number of nodes is essentially the main finding of the
standard “gambler’s ruin” exercise (see e.g., Proposition 2.1 of [28]). The result for the 2-d grid follows from putting
together Theorem 2.1 and Theorem 6.1 of [29]. For corresponding results on 2-d and k -dimensional tori, please see
Theorem 5.5 of [28]; note that we are treating k as a fixed number and looking at the scaling as a function of the
number of nodes n. Hitting times on star, two-star, and complete graphs are elementary exercises. The result for an
expander graph is a consequence of Cheeger’s inequality; see Theorem 6.2.1 in [25]. For Erdős-Rényi graphs the
result follows because such graphs are expanders with high probability; see the discussion on page 170 of [25]. For
geometric random graphs a bound can be obtained by partitioning the unit square into appropriately-sized regions,
thereby reducing to the case of a 2-d grid; see Theorem 1.1 of [30]. Finally the bound for connected graphs is
from Lemma 2.2 of [31].
Fig. 2 depicts examples of some of the graphs discussed in Proposition 5. Clearly the network structure affects
the time it takes information to diffuse across the graph. For graphs such as the path or 2-d torus, the dependence
on n is intuitively related to the long time it takes information to spread across the network. For other graphs,
such as stars, the dependence is due to the central node (i.e., the “hub”) becoming a bottleneck. For such graphs
this dependence is strongly related to the fact that we have focused on the Metropolis scheme for designing the
entries of the matrices Ak . Because the hub has a much higher degree than the other nodes, the resulting Metropolis
updates lead to very small changes and hence slower convergence (i.e., Ak is diagonally dominant); see Eq. (7).
In general, for undirected graphs in which neighboring nodes may have very different degrees, it is known that
faster rates can be achieved by using linear iterations of the form Eq. (6), where Ak is optimized for the particular
graph topology [32], [33]. However, unlike using the Metropolis weights—which can be implemented locally by
having neighboring nodes exchange their degrees—determining the optimal matrices Ak involves solving a separate
network-wide optimization problem; see [33] for a decentralized approximation algorithm.
On the other hand, the algorithm is evidently fast on certain graphs. For the complete graph (where every node
is directly connected to every other node, this is not surprising—since A = (1/n)11T , the average is computed
exactly at every other node after a single iteration. Expander graphs can be seen as sparse approximations of
the complete graph (sparse here is in the sense of having many fewer edges) which approximately preserve the
spectrum, and hence the hitting time [34]. In applications where one has the option to design the network, expanders
are particularly of practical interest since they allow fast rates of convergence—hence, few iterations—while also
having relatively few links—so each iteration requires few transmissions and is thus fast to implement [35], [36].
B. Worst-case scaling of decentralized averaging
One might wonder about the worst-case complexity of average consensus: how long does it take to get close to
the average on any graph? Initial bounds were exponential in the number of nodes [5], [6], [23], [37]. However,
Proposition 5 tells us that this is at most O(n2 ) using a Metropolis matrix. A recent result [27] shows that if all
the nodes know an upper bound U on the total number of nodes which is reasonably accurate, this convergence
time can be brought down by an order of magnitude. This is a consequence of the following theorem.

12

Theorem 6 ([27], [31]). [Linear8 Time Convergence for Consensus] Suppose each node in an undirected connected
graph G implements the update
1 X ukj − uki
,
2
max(di , dj )
j∈Ni



2
k+1
k+1
ui = wi + 1 −
wik+1 − wik ,
9U + 1

wik+1 = uki +

(12)

where U ≥ n and u0 = w0 . Then


1 k 0
kw − w1k22 ,
kwk − w1k22 ≤ 2 1 −
9U
where w = (1/n)

Pn

0
i=1 wi is the initial average.

Thus if every node knows the upper bound U , the above theorem tells us that the number of iterations until every
element of the vector wk is at most  away from the initial average w is O(U ln(1/)). In the event that U is within
a constant factor of n, (e.g., n ≤ U ≤ 10n) this turns out to be linear in the number of nodes n. One situation in
which this is possible is if every node precisely knows the number of nodes in the network, in which case they can
simply set U = n. However, this scheme is also useful in a number of settings where the exact number of nodes
in the system is not known (e.g., if nodes fail) as long as approximate knowledge of the total number of nodes is
available.
Intuitively, Eq. (12) takes a lazy Metropolis update and accelerates it by adding an extrapolation term. Strategies
of this form are known as over-relaxation in the numerical analysis literature [38] and as Nesterov acceleration in
the optimization literature [39]. On a non-technical level, the extrapolation speeds up convergence by reducing the
inherent oscillations in the underlying sequence. A key feature, however, is that the degree of extrapolation must be
carefully chosen, which is where knowledge of the bound U is required. At present, open questions are whether any
improvement on the quadratic convergence time of Proposition 5 is possible without such an additional assumption,
and whether a linear convergence time scaling can be obtained for time-varying graphs.
III. D ECENTRALIZED OPTIMIZATION OVER UNDIRECTED GRAPHS
We now shift our attention from decentralized averaging back to the problem of optimization. We begin by
describing the (centralized) subgradient method, which is one of the most basic algorithms used in convex optimization.
A. The subgradient method
To avoid confusion in the sequel when we discuss decentralized optimization methods, here we consider an
iterative method for minimizing a convex function h : Rd → R. A vector g ∈ Rd is called a subgradient of h at
the point x if
h(y) ≥ h(x) + g T (y − x), for all x, y ∈ Rd .
(13)
The subgradient may be viewed as a generalization of the notion of the gradient to non-differentiable (but convex)
functions. Indeed, if the function h is continuously differentiable, then g = ∇h(x) is the only subgradient at x. In
general, there are multiple subgradients at points x where the function h is not differentiable. See Figure III-A for
a graphical illustration.
The subgradient method9 for minimizing the function h is defined as the iterate process
uk+1 = uk − αk g k ,

(14)

where g k is a subgradient of the function h at the point uk . The quantity αk is a nonnegative step-size.
8
Note that we do not adhere to the common convention of using “linear convergence” as a synonym for “geometric convergece”; rather,
“linear time” convergence in this paper refers to a convergence time which scales as O(n) in terms of the number of nodes n.
9
The earliest work on subgradient methods appears in [41].

13

Fig. 3. An illustration of the definition of a subgradient. At the point x0 , the function shown is not differentiable. However, there are a
number of possible values g such that the tangent line with slope g at x0 is a global understimate of the function, and some of them are
shown in the figure. Each such g is a subgradient of the function at x0 . The figure is a modified version of an image by Felix Reidel
from [40].

The subgradient method has a somewhat different motivation than the gradient method. It is well known that the
gradient is a descent direction at points that are not the global minima. At these points, unlike the gradient, the
subgradient gives a direction along which either the function h(·) decreases or the distance toward the set of global
minima decreases for small enough stepsizes. In general, it is hard to know what is the best step-size choice for
the convergence of the subgradient method and, as a simple option, a diminishing stepsize αk is commonly used,
i.e., a stepsize αk that decreases to zero as k increases. However, to guarantee the convergence toward a global
minimum of the function, the rate at which the stepsize decreases has to be carefully selected to avoid having the
iterates stuck at a point that is not a minimizer of the function, while controlling the errors that are introduced due
to the use of subgradient directions. This intuition is captured by the following theorem 10 .
Theorem 7 (Convergence and Convergence Time for the Subgradient Method). Let U ∗ be the set of minimizers of
the function h : Rd → R. Assume that (i) h is convex, (ii) U ∗ is nonempty, (iii) kgk ≤ L for all subgradients g of
the function h(·).
1) If the nonnegative step-size sequence αk is “summable but not square summable,” i.e.,
∞
X
k=0

αk = +∞

and

+∞ h
X

αk

i2

< ∞.

k=0

Then, the iterate sequence {uk } converges to some minimizer u∗ ∈ U ∗ .
√
2) If the subgradient method is run for T steps with the (constant) choice of stepsize αk = 1/ T for k =
0, . . . , T − 1, then
PT −1 k !
ku0 − u∗ k2 + L2
k=0 u
√
h
− h∗ ≤
,
T
2 T
where h∗ is the minimal value of the function, i.e., h∗ = h(u∗ ) for any u∗ ∈ U ∗ .
Proof. (1) A proof can be found in Lemma 7 of [44]. (2) The distance kuk − u∗ k22 , for an arbitrary u∗ ∈ U ∗ is
used to measure the progress of the basic subgradient method. From the definition of the method, for the constant
stepsize it follows that
kuk+1 − u∗ k22 = kuk − u∗ k22 − 2α(g k )T (uk − u∗ ) + α2 kg k k2 .

Then, using the subgradient defining inequality in Eq. 13 and the assumption that the subgradient norms are bounded
by L, we obtain


kuk+1 − u∗ k22 = kuk − u∗ k22 − 2α h(uk ) − h(u∗ ) + α2 L2 .
Under weaker assumptions on the stepsize than those of Theorem 7, namely αk → 0 and
lim inf k→∞ h(xk ) = h∗ , where h∗ = inf x∈Rd h(x), see [1], [42], [43].
10

P∞

k=0 α

k

= ∞, one can show that

14

By summing these inequalities over k = 0, 1, . . . T , re-arranging the terms, and dividing by 2αT , one can see that
T −1

kuk − u∗ k22 αL2
1 X
h(uk ) − h(u∗ ) ≤
+
.
T
2αT
2
k=0

The result follows by using the convexity of h(·) which yields
PT −1 k !
T −1
1 X
k=0 u
h
≤
h(uk ),
T
T
k=0

and by letting α = √1T .
For the diminishing step in part (1), since the iterates uk converge to some minimizer u∗ , so does any weighted
average of the iterates (with positive weights). In particular, it follows that
Pt
k k
k=0 α u
= u∗ .
lim P
t
k
t→∞
α
k=0
Furthermore, it is a fact that any convex function whose domain is the entire space of the decision variables is
continuous at every point. Thus, by continuity of h(·), it follows that
!
Pt
k uk
α
k=0
lim h
= h(u∗ ).
P
t
k
t→∞
α
k=0
In the case of a fixed stepsize, part (2) provides an error bound in terms
 √ofthe function values. On a conceptual
level, the main takeaway is that the subgradient method produces O 1/ T convergence to the optimal function
value in terms of the number of iterations T .
Similar to gradients, for the convex functions defined over the entire space, the subgradients are “linear” in the
sense that a subgradient of the sum of two convex functions can be obtained as a sum of two subgradients (one
for each function). Formally, if g1 is a subgradient of a function f1 at x and g2 is a subgradient of f2 at x, then
g1 + g2 is a subgradient of f1 + f2 at x. (This follows directly from the subgradient definition in Eq. (13).)
B. Decentralizing the subgradient method
We now return to the problem of decentralized optimization. To recap, we have n nodes, interconnected in a
time-varying network capturing which pairs of nodes can exchange messages. For now, assume that these networks
are all undirected. (This is relaxed in Section IV, which considers directed graphs.) Node i knows the convex
function fi : Rm → R and the nodes would like to minimize the function
n

1X
f (x) =
fi (x)
n

(15)

i=1

in a decentralized way. If all the functions f1 (x), . . . , fn (x) were available at a single location, we could directly
apply the subgradient method to their average f (x):
k+1

u

k

=u −α

k1

n

n
X

g ki ,

i=1

where g ki is a subgradient of the function fi (·) at uk . Unfortunately, this is not a decentralized method under our
assumptions, since only node i knows the function fi (·), and thus only node i can compute a subgradient of fi (·).
A decentralized subgradient method solves this problem by interpolating between the subgradient method and
an average consensus scheme from Section II. In this scheme, node i maintains the variable xki which is updated
as
X
xk+1
=
akij xkj − αk gik ,
(16)
i
j∈Nik

15

where gik is the subgradient of fi (·) at xki . Here the coefficients [aij ] come from any of the average consensus
methods we discussed in Section II. We will refer to this update as the decentralized subgradient method. Note
that this update is decentralized in the sense that node i only requires local information to execute it. In the case
when f : R → R, the quantities xkj are scalars and we can write this as
xk+1 = Ak xk − αk g k ,

(17)

where the vector xk ∈ Rn stacks up the xki and g k ∈ Rn stacks up the gik . The weights akij should be chosen
by each node in a decentralized way. Later within this section, we will assume that the matrices Ak are doubly
stochastic; perhaps the easiest way to achieve this is to use the lazy Metropolis iteration of Eq. (8).
Intuitively, the decentralized subgradient method of Eq. (16) pulls the value xki at each node in two directions:
on the one hand towards the minimizer (via the subgradient term) and on the other hand towards neighboring
nodes (via the averaging term). Eq. (16) can be thought of as reconciling these pulls; note that the strength of
the consensus pull does not change, but the strength of the subgradient pull is controlled by the stepsize, and this
stepsize αk will be later chosen to decay to zero, so that in the limit the consensus term will prevail. However, if
the rate at which the stepsize decays to zero is slow enough, then under appropriate conditions consensus will be
achieved not on some arbitrary point, but rather on a global minimizer of f (·).
We now turn to the analysis of the decentralized subgradient method. For simplicity, we make the assumption
that all the functions fi (·) are from R to R; this simplifies the presentation but otherwise has no effect on the
results. The same analysis extends in a straightforward manner to functions fi : Rd → R with d > 1 but requires
more cumbersome notation.11
Theorem 8 ([24], [45]). [Convergence and Convergence Time for the Decentralized Subgradient Method] Let X ∗
denote the set of minimizers of the function f . We assume that: (i) each fi is convex; (ii) X ∗ is nonempty; (iii) each
function fi has the property that its subgradients at any point are bounded by the constant L; (iv) the matrices
Ak are doubly stochastic and there exists some α > 0 such that the graph sequence [GA0 ]α , [GA1 ]α , [GA2 ]α , . . .
satisfies Assumption 1; and (v) the initial values x0i are the same across all nodes12 (e.g., x0i = 0). Then:
1) If the positive step-size sequence αk is “summable but not square summable,” i.e.,
∞
X

αk = +∞

k=0
13

then

and

+∞ h
X

αk

i2

< ∞,

k=0

for any x∗ ∈ X ∗ , we have that for all i = 1, . . . , n,
lim f

t→∞

!
Pt
l l
l=0 α xi
f (x∗ ).
Pt
l
l=0 α

√
2) If we P
run the protocol for T steps with (constant) step-size αk = 1/ T , and with the notation y k =
(1/n) ni=1 xki , then we have that for all i = 1, . . . , n,
PT −1 k !
(y 0 − x∗ )2 + L2
L2
l=0 y
√
− f (x∗ ) ≤
+√
,
(18)
f
T
2 T
T (1 − λ)

where λ is defined by Eq. (9).
PT −1 l
We remark that the quantity ( l=0
y )/T on which the suboptimality bound is proved can be computed via an
P −1 l
average consensus protocol after the protocol is finished if node i keeps track of ( Tl=0
xi )/T .
11
If xki are vectors, one can still stack the per-node vectors into a network-wide vector xk , but then in (17) the matrix Ak must be changed
to Ak ⊗ Id where ⊗ is the Kronecker product and Id is the d × d identity matrix. For such details, we refer the interested reader to [24],
[45], [46], which do not assume that xki are scalars.
12
This assumption is not necessary for the results stated here, but we use it to simplify the exposition. When this assumption is violated,
the bound in part (ii) has an additional term depending on the spread of the initial values. This term decays exponentially on the order of
λk .
13
In fact we can show a stronger result that, as k → ∞, the iterate sequences {xki } converge to a common minimizer x∗ ∈ X ∗ , for all i.
However, the proof is more involved; see [45].

16

Comparing part (2) of Theorems 7 and 8, and ignoring the similar terms involving the initial conditions, we see
that the convergence bound gets multiplied by 1/(1 − λ). This term may be thought of as measuring the “price of
decentralization” resulting from having knowledge of the objective function decentralized throughout the network
rather than available entirely at one place.
We can use Proposition 5 to translate this into concrete convergence times on various families of graphs, as the
next result shows. For  > 0, let us define the -convergence time to be the first time when
PT −1 l !
l=0 y
− f (x∗ ) ≤ .
f
T
Naturally, the convergence time will depend on  and on the underlying sequence of matrices/graphs.
Corollary 9 (Network Scaling for the Decentralized Subgradient Method). Suppose all the hypotheses of Theorem
8 are satisfied, and suppose further that the weights akij are the lazy Metropolis weights defined in Eq. (8). Then
the convergence time can be upper bounded as
!
max (y 0 − x∗ )4 , L4 Pn2
,
O
2
where if the graphs Gi are

1) ...path graphs, then Pn = O n2 ;
2) ...2-dimensional grid, then Pn = O (n log n);
3) ...2-dimensional torus, then Pn = O (n); 
4) ...k -dimensional torus, then Pn = O n2/k ;
5) ...complete graphs, then Pn = O(1);
6) ...expander graphs, then Pn = O(1)
 ;
7) ...star graphs, then Pn = O n2 ; 
8) ...two-star graphs, then Pn = O n2 ;
9) ...Erdős-Rényi random graphs, then Pn = O(1);
10) ...geometric random graphs, then Pn = O(n log n); 
11) ...any connected undirected graph, then Pn = O n2 .
These bounds follow immediately by putting together the upper bounds on 1/(1 − λ) discussed in Proposition
5 with Eq. (18).
We remark that it is possible to decrease the scaling from O(L4 Pn2 ) to O(L2 Pn ) in the above corollary if the
constant L, the type of the underlying graph (e.g., star graph, path graph),
√ and the number of nodes n is known to
k
all nodes. Indeed, this can be achieved by setting the stepsize α = β/ T and using a hand-optimized β (which
will depend on L, n, as well as the kind of underlying graphs). We omit the details but this is very similar to the
optimization done in [19].
We now turn to the proof of Theorem 8. We will need two preliminary lemmas covering some background in
optimization. The first lemma discusses how the bound on the norms of the subgradients translate into Lipschitz
continuity of the underlying function.
Lemma 10. Suppose h : Rd → R is a convex function such that h(·) has subgradients gx , gy at the points x, y ,
respectively, satisfying kgx k2 ≤ L and kgy k2 ≤ L. Then
|h(y) − h(x)| ≤ Lky − xk2

Proof. On the one hand, we have by definition of subgradient
h(y) ≥ h(x) + gxT (y − x),

so that, by the Cauchy-Schwarz inequality,
h(y) − h(x) ≥ −Lky − xk2 .

On the other hand
h(x) ≥ h(y) + gyT (x − y),

(19)

17

so that
h(x) − h(y) ≥ −Lkx − yk2 ,

which we rearrange as
h(y) − h(x) ≤ Lky − xk2 .

(20)

Together Eq. (19) and Eq. (20) imply the lemma.
Our overall proof strategy is to view the decentralized subgradient method as a kind of perturbed consensus process. To that end, the next lemma extends our previous analysis of the consensus process to deal with perturbations.
Lemma 11. Suppose
xk+1 = Ak xk + ∆k ,

where Ak are doubly stochastic matrices satisfying Assumption 1 and ∆k ∈ Rn are perturbation vectors.
1) If supk k∆k k2 ≤ L0 then
1T x k
L0
1T x0
xk −
1 ≤ λk x0 −
1 +
,
n
n
1−λ
2
2
where λ is from Eq. (9).
T k
2) If ∆k → 0, then xk − 1 nx 1 → 0.
Proof. For convenience, let us introduce the notation
yk =

1T xk
,
n

ek = xk − y k 1,

mk =

1T ∆k
.
n

Since
y k+1 = y k + mk ,

we have that
ek+1 = Ak ek + ∆k − mk 1,

and therefore
ek = Ak−1:0 e0 + Ak−1:1 (∆0 − m0 1)
+ · · · + Ak−1 (∆k−2 − mk−2 1) + (∆k−1 − mk−1 1).

Now using the fact that the vectors e0 and ∆i − mi 1 have mean zero, by Proposition 4 we have
k

k

0

ke k2 ≤ λ ke k2 +

k−1
X

λk−1−j k∆j k2 .

(21)

j=0

This equation immediately implies the first claim of the lemma.
Now consider the second claim. We define
Lkfirst−half

=

sup k∆j k2
0≤j<k/2

Lksecond−half

=

sup

k∆j k2 .

k/2≤j≤k

Since ∆k → 0, we have
sup Lkfirst−half < ∞,
k≥2

lim Lksecond−half = 0.

k→∞

Now Eq. (21) implies
kek k2 ≤ λk−1 ke0 k2 + λk/2

Lk
Lkfirst−half
+ second−half .
1−λ
1−λ

Combining this with Eq. (22), we have that kek k2 → 0 and this proves the second claim of the lemma.

(22)

18

With these lemmas in place, we now turn to the proof of Theorem 8. Our approach will be to view the decentralized
subgradient method as a perturbation of a subgradient-like process followed by averaging of the entries of the vector
xk . Provided that the step-size αk decays to zero at the appropriate rate, we will argue that (i) the vector xk is not
too far from its average and (ii) this average makes continual progress towards a minimizer of the function f (·).
Proof of Theorem 8. Recall that we are assuming, for simplicity, that the functions fi are from R to R. As before,
let y k be the average of the entries of the vector xk ∈ Rn , i.e.,
yk =

1T xk
.
n

Since the matrices Ak are doubly stochastic, 1T Ak = 1T so that
y k+1 = y k − αk

1T g k
.
n

Now for any x∗ ∈ X ∗ , we have
(y

k+1

∗ 2

k

∗ 2

h

− x ) ≤ (y − x ) + α

k

i2

2

L − 2α

k

Pn

k
i=1 gi

n

(y k − x∗ ).

(23)

Furthermore, for each i = 1, . . . , n,
gik (y k − x∗ ) = gik (xki − x∗ + y k − xki )
= gik (xki − x∗ ) + gik (y k − xki )
≥ fi (xki ) − fi (x∗ ) − L y k − xki
≥ fi (y k ) − fi (x∗ ) − 2L y k − xki ,

where the first inequality uses a rearrangement of the definition of the subgradient and the last inequality uses
Lemma 10. Plugging this into Eq. (23), we obtain
h i2
(y k+1 − x∗ )2 ≤ (y k − x∗ )2 + αk L2 − 2αk (f (y k ) − f (x∗ ))
n

+ 2Lαk

1X k
y − xki ,
n
i=1

or
2αk (f (y k ) − f (x∗ )) ≤ (y k − x∗ )2 − (y k+1 − x∗ )2
n
h i2
1X k
y − xki .
+ αk L2 + 2Lαk
n
i=1

We can sum this up to obtain
2

t
X



αl f (y l ) − f (x∗ )

l=0

≤ (y 0 − x∗ )2 − (y t+1 − x∗ )2
t h i
n
t
X
X
X
2
2
l
l1
y l − xli ,
+L
α + 2L
α
n
l=0

l=0

i=1

19

which in turn implies
!
Pt
l
α
y
l
− f (x∗ )
f
Pl=0
t
α
l
l=0
P  2
P
P
0
(y − x∗ )2 + L2 tl=0 αl + 2L tl=0 αl (1/n) ni=1 y l − xli
≤
P
2 tl=0 αl
P  2
(y 0 − x∗ )2 + L2 tl=0 αl
=
P
2 tl=0 αl
P
P
2L tl=0 αl (1/n) ni=1 y l − xli
.
+
P
2 tl=0 αl
We now turn to the first claim of the theorem statement. The first term on the right-hand side goes to zero because
its numerator is bounded while its denominator is unbounded (due to the assumption that the step-size is summable
but not square summable). For the second term, we view −αk g k as the perturbation ∆k in Lemma 6 to obtain that
xl − y l 1 → 0, and in particular xli − yil → 0 for each i. It follows that the Cesàro sum (which is exactly the second
term on the right-hand side) must go to zero as well. We have thus shown that
!
Pt
l yl
α
− f (x∗ ) → 0.
f
Pl=0
t
l
α
l=0
Putting this together with Lemma 11, which implies that xli − y l → 0 for all i, we complete the proof of the first
claim.
For the second claim, using (i) Lemma
11, (ii) the assumption that the initial conditions are the same across all
√
nodes, and (iii) the inequalities kzk1 ≤ nkzk2 for vectors z ∈ Rn , we have the bound
PT −1 l !
l=0 y
f
− f (x∗ )
T
 √ √ 
P −1 √
n n
(y 0 − x∗ )2 + L2 + 2L Tl=0
1/ T (1/n) √LT (1−λ)
√
≤
2 T
(y 0 − x∗ )2 + L2
L2
√
=
+√
,
(24)
2 T
T (1 − λ)
which completes the proof of the second claim.
C. Improved scaling with the number of nodes
The results of Corollary 9 improve upon those reported in [19], [24], [47]. A natural question is whether it is
possible to further improve the scalings even further. In particular, one might wonder how the worst-case convergence
time of decentralized optimization scales with the number of nodes in the network. In general, this question is open.
Partial progress was made in [27], [31], where, under the assumption that all nodes know an order-accurate bound
on the total number of nodes in the network, it was shown that we can use the linear time convergence of average
consensus described in Theorem 6 to obtain a corresponding convergence time for decentralized optimization when
the underlying graph is fixed and undirected. Specifically, [27], [31] consider the following update rule
yik+1
zik+1

=

1
xki +

=

j∈Ni
k
yi − βgik

X

2



xkj − xji
− βgik
max(di , dj )

xk+1
= yik+1 + 1 −
i

(25)
2
9U + 1





yik+1 − zik+1 ,

where gik is a subgradient of the function fi (·) at the point yik . As in Section II, here the number U is an upper
bound on the number of nodes known to each individual node, and it is assumed that U is within a constant factor

20

of the true number of nodes, i.e., n ≤ U ≤ cn for some constant c (not depending on n, U or any other problem
parameters).
By relying on Theorem 6, it is shown in [27], [31] that the corresponding
by
Pn time until this scheme (followed
2
a round of averaging) is  close to consensus on a minimizer of (1/n) i=1 fi (·) is O(n log n + n/ ). It is an
open question at present whether a similar convergence time can be achieved over time-varying graphs or without
knowledge of the upper bound U .
D. The strongly convex case
√
The error decrease of 1/ T with the number of iterations T is, in general, the best possible rate for dimensionindependent convex optimization [48]. Under the stronger assumption that the underlying functions fi (·) are strongly
convex with Lipschitz-continuous gradients, gradient descent will converge geometrically. Until recently, however,
there were no corresponding decentralized protocols with a geometric rate.
Significant progress on this issue was first made in [49], where, over fixed undirected graphs, the following
scheme was proposed:
h
i
f xk − α ∇f (xk+1 ) − ∇f (xk ) ,
xk+2 = (I + W )xk+1 − W
(26)
with the initialization
x1 = W x0 − α∇f (x0 ).

Here, for simplicity, we continue with the assumption that the functions fi (·) are from R to R. The matrices W and
f are two different, appropriately chosen, symmetric stochastic matrices compatible with the underlying graph; for
W
f = (I +W )/2. It was shown in [49] that this scheme,
example, W might be taken to be the Metropolis matrix and W
called EXTRA, drives all nodes to the global optimal at a geometric rate under natural technical assumptions.
It is not immediately obvious how to extend the EXTRA update to handle time-varying directed graphs; the
original proof in [49] only covered static, undirected graphs. Progress on this question was made in [50] which,
in addition to providing a geometrically convergent method in the time-varying and directed cases, also provides a
new intuitive interpretation of EXTRA. Indeed, [50] observes that the scheme
xk+1 = W k xk − αy k
y k+1 = W k y k + ∇f (xk+1 ) − ∇f (xk )

(27)

is a special case of the EXTRA update of Eq. (26). Here, the initialization x0 can be arbitrary, while y 0 = ∇f (x0 ).
The matrices W k are doubly stochastic. Moreover, Eq. (27) has a natural interpretation. In particular, the second
line of Eq. (27) is a tracking recursion: y k tracks the time-varying gradient average 1T ∇f (xk )/n. Indeed, observe
that, by the double stochasticity of W k , we have that
1T y k /n = 1T ∇f (xk )/n.

In other words, the vector y k has the same average as the average gradient. Moreover, it can be seen that if xk → x
b,
then y k → ∇f (x
b); this is due to the “consensus effect” of repeated multiplications by W k . Such recursions for
tracking were studied in [51].
While the second line of Eq. (27) tracks the average gradient, the first line of Eq. (27) performs a decentralized
gradient step as if y k was the exact gradient direction. The method can be naturally analyzed using methods for
approximate gradient descent. It was shown in [50] that this method converges to the global optimizers geometrically
under the same assumptions as EXTRA, even when the graphs are time-varying; further, the complexity of reaching
an  neighborhood of the optimal solution is polynomial in n.
We conclude by remarking that there is quite a bit of related work in the literature. Indeed, the idea to use a
two-layered scheme as in Eq. (27) originates from [52]–[56]. Furthermore, improved analysis of convergence rates
over an undirected graph is available in [57], [58].

21

IV. AVERAGING AND O PTIMIZATION OVER D IRECTED GRAPHS
A. Decentralized averaging over directed graphs
We have seen in Section II that over time-varying undirected graphs, the lazy Metropolis update results in
consensus on the initial average. In this section, we ask whether this is possible over a sequence of directed graphs.
By way of motivation, we remark that many applications of decentralized optimization involve directed graphs. For
example, in wireless networks the communication radius of a node is a function of its broadcasting power; if nodes do
not all transmit at the same power level, communications will naturally be directed. Any decentralized optimization
protocol meant to work in wireless networks must be prepared to deal with unidirectional communications.
Unfortunately, it turns out that there is no direct analogue of the lazy Metropolis method for average consensus
over directed graphs. In fact, if we consider deterministic protocols where, at each step, nodes broadcast information
to their neighbors and then update their states based on the messages they have received, then it can be proven
that no such protocol can result in average consensus; see [59]. The main obstacle is that the consensus iterations
we have considered up to now (e.g., in Section II-A) relied on doubly stochastic matrices in their updates, which
cannot be done over graphs that are time-varying and directed. We thus need to make an additional assumption to
solve the average consensus problem over directed graphs.
A standard assumption in the field is that every node always knows its out-degree. In other words, whenever
a node broadcasts a message it knows how many other nodes are within listening range. In practice, this can be
accomplished in practice via a two-level scheme, wherein nodes broadcast hello-messages at an identical and high
power level, while the remainder of the messages are transmitted at lower power levels. The initial exchange of
hello-messages provides estimates of distance to neighboring nodes, allowing each node to see how many listeners
it has as a function of its transmission power. Alternatively, the out-degrees can be estimated in a decentralized
manner using linear iterations [60], assuming that the underlying communication topology is strongly connected.
Under this assumption, it turns out that average consensus is indeed possible and may be accomplished via the
following iteration,
xk+1
=
i

xkj
,
dout
j
in,k

X
j∈Ni

yik+1 =

yjk
,
dout
j
in,k

X

(28)

j∈Ni

initialized at an arbitrary x0 and y 0 = 1. This is known as the Push-Sum iteration; it was introduced in [61],
where its correctness was shown for a fully-connected graph (allowing only pairwise communications), while it
was extended to arbitary strongly connected graphs in [62]. In [63] the push-sum was applied to address distributed
energy resources over a static directed graph, with a more recent extensions including imperfect communications
such as those with delays in [64] and with packet drops [65].
On an intuitive level, the update of the variables xki does not lead to consensus because of the lack of doubly
stochasticity. Instead, at each time k , each xki is some linear combination of xkj where j runs over a large enough
neighborhood of i. The main idea of Push-Sum is that an identical iteration started at the all-ones vector (i.e., the
update for yik ) allows the algorithm to estimate the weights of that linear combination. Once these weights are
known, average consensus can be achieved via rescaling. Indeed, we will show later how a decentralized algorithm
can use both xki and yik to achieve average consensus.
The name Push-Sum derives from the nature of the decentralized implementation of Eq. (28). Observe that Eq. (28)
may be implemented with one-directional communication. Specifically, every node i transmits (or broadcasts) the
k out
values xki /dout
i and yi /di to its out-neighbors. After these transmissions, each node has the information it needs to
perform the update (28), which involves summing the pushed values. In contrast, the algorithms for undirected graphs
described in the previous sections required that each node i send a message to all of its neighbors and receive
a message from each neighbor. Protocols of this sort are known as “push-pull” in the decentralized computing
literature because the transmission of a message from node i to node j (the “push”) implies that i also expects to
receive a message from j (the “pull”).
Our next theorem, which is the main result of this subsection, tells us that Push-Sum works. For this result, we
define matrices Ak as follows:
(
1
if j ∈ Niin,k ,
k
dout
j
aij =
(29)
0
else.

22

Theorem 12 ([61], [62], [66]). [Convergence of Push-Sum] Suppose the sequence of graphs GA0 , GA1 , GA2 , . . .
satisfies Assumption 1. Then for each i = 1, . . . , n,
Pn
0
xki
j=1 xj
lim k =
.
k→∞ y
n
i
It is somewhat remarkable that the convergence to the average happens for the ratios xki /yik . Adopting the notation
a./b for the element-wise ratio of two vectors a and b, the above theorem may be restated as
!
Pn
0
x
j
j=1
1.
lim xk ./y k =
k→∞
n
We now turn to the proof of the theorem. Using the matrices Ak as defined in Eq. (29), the iterations in Eq.
(28) may be written as
xk+1 = Ak xk ,

y k+1 = Ak y k .

Observe that Ak is column stochastic by design, i.e.,
1 T Ak = 1 T .

As a consequence of this, the sums of xk and y k are preserved, i.e.,
n
X

xki =

i=1

n
X

n
X

x0i ,

i=1

i=1

yik =

n
X

yi0 = n.

(30)

i=1

For our proof, we will need to use the fact that the vector y k remains strictly positive and bounded away from
zero in each entry. This is shown in the following lemma.
Lemma 13. For all i, k , yik ≥ 1/(n2nB ).
Proof. By Assumption 1, every node has a self-loop, so we have that
1 k
y for all i, k,
(31)
n i
and consequently the lemma is true for k = 1, . . . , nB − 1. Let lnB be the largest multiple of nB which is at most
k . If k > nB − 1 then
y k = Ak:nlB AnlB−1:0 1.
yik+1 ≥

By Lemma 2, the matrix AnlB−1:0 is the transpose product of stochastic matrices satisfying Assumption 1, and
consequently each of its entries is at least αnB (where α = 1/n) by Lemma 2. Thus
 nB
h
i
1
nlB−1:0
A
1 ≥
for all i = 1, . . . , n.
n
i
Applying Eq. (31) to the last k − nlB steps now proves the lemma.
With this lemma in mind, we can give a proof of Theorem 12 that is essentially a quick reduction to the result
already obtained in Theorem 1.
Proof of Theorem 12. Let us introduce the notation zik = xki /yik . Then xki = zik yik , and therefore, we can rewrite
the Push-Sum update as
n
X
zik+1 yik+1 =
[W k ]ij zjk yjk ,
j=1

or
zik+1 =

m
X
j=1

[W k ]ij (yik+1 )−1 zjk yjk ,

23

where the last step used the fact that yik 6= 0, which follows from Lemma 13. Therefore, defining

−1
P k = diag(y k+1 )
W k diag(y k ),
we have that
z k+1 = P k z k .

Moreover, P k is stochastic since

−1
P k 1 = diag(y k+1 )
W k diag(y k )1

−1
W k yk
= diag(y k+1 )

−1
= diag(y k+1 )
y k+1
= 1.

We have thus written the Push-Sum update as an “ordinary” consensus update after a change of coordinates.
However, to apply Theorem 1 about the convergence of the basic consensus process, we need to lower bound the
entries of P k , which we proceed to do next.
Indeed, as a consequence of the definition of P k , if we choose α to be some fixed number such that

−1


α ≤ max yik+1
min
[W k ]ij
min yik
i

i

(i,j) | [W k ]ij >0

always holds, then the sequence of graphs G[P 0 ]α , G[P 1 ]α , G[P 2 ]α , . . . will satisfy Assumption 1. To find an α that
satisfies this condition, we make use of the fact that 1/(n2nB ) ≤ yik ≤ n, which is a consequence of Eq. (30) and
Lemma 13. It follows that the choice α = (1/n) · (1/n) · 1/(n2nB ) suffices. Thus, we can apply Theorem 1 and
obtain that z k converges to a multiple of the all-ones vector.
It remains to show that the final limit point is the initial average. Let z∞ be the ultimate limit of each zik . Then
for each k = 0, 1, 2, . . .,
Pn
yk
z∞ = z∞ Pni=1 ik
yi
Pn i=1
Pn
k
k
(z∞ − zik )yik
i=1 zi yi
=
+ i=1
n
Pn n k Pn
x
(z
−
zik )yik
∞
i=1 i
=
+ i=1
,
n
n
so
Pn
Pn

k k
k
i=1 xi
i=1 (z∞ − zi )yi
= lim
= 0,
lim z∞ −
k→∞
k→∞
n
n
where the last equality used that each yik is a positive number upper bounded by n. Finally, appealing to the first
relation in Eq. (30) we complete the proof of the theorem.
B. Push-Sum based subgradient method
Suppose now that every P
agent i has a (scalar) convex objective function fi (·), and the system objective is
to minimize f (x) = (1/n) ni=1 fi (x). We next describe a decentralized subgradient method for determining a
minimizer of f using the Push-Sum algorithm. Every node i maintains scalar variables xki , yik , wik , and updates
them according to the following rules: for all k ≥ 0 and all i = 1, . . . , n,
wik+1

=

X
j∈Ni

,k
dout
j

X

yjk

in,k

yik+1 =

xkj

in,k

j∈Ni

,k
dout
j

,
,

24

zik+1 =

wik+1
yik+1

,

xk+1
= wik+1 − αk+1 gik+1 ,
i

(32)

where gik+1 is a subgradient of the function fi (z) at z = zik+1 . The method is initiated with an arbitrary vector
x0i ∈ R at node i, and with yi0 = 1 for all i. The Push-Sum updates steer the vectors zik+1 toward each other in
order to converge to a common point, while the subgradients in the updates of xk+1
drive this common point to
i
lie in the set of minimizers of the objective function f .
In the next theorem, we establish the convergence properties of the subgradient method of Eq. (32).
Theorem 14 ([44]). [Convergence of the Push-Sum Subgradient Method] Let X ∗ be the set of minimizers of the
function f . Assume that: (i) each fi is convex; (ii) X ∗ is nonempty; (iii) each function fi has the property that its
subgradients at any point are bounded by a constant L; and (iv) the graph sequence GA0 , GA1 , GA2 , . . . satisfies
Assumption 1.
1) If the stepsizes α1 , α2 , ... are positive, non-increasing, and satisfy the conditions
∞
X

αk = ∞

k=1

and

∞
X

[αk ]2 < ∞,

k=1

then the decentralized subgradient method of Eq. (32) converges asymptotically:
lim zik = x∗

k→∞

for all i and for some x∗ ∈ X ∗ .

√
2) If αk = 1/ k for k ≥ 1 and every node i maintains the variable zeik ∈ R initialized at k = 0 with any zei0 ∈ R
and updated by
αk+1 zik+1 + S k zeik
zeik+1 =
for k ≥ 0,
S k+1
P
s+1 for k ≥ 1, then for all k ≥ 1, i = 1, . . . , n, and any x∗ ∈ X ∗ ,
where S 0 = 0 and S k = k−1
s=0 α


f zeik+1 − f ∗
≤

n |x̄(0) − x∗ | L2 (1 + ln(k + 1))
√
√
+
2
k+1
2n k + 1
P
24L nj=1 |x0j |
24L2 (1 + ln k)
√
√
+
+
,
δ(1 − λ) k + 1 δ(1 − λ) k + 1

where f ∗ is the optimal value of the problem, i.e., f ∗ = minz∈R f (z), and x̄(0) = n1
and λ are functions of the graph sequence GA0 , GA1 , GA2 , . . .; in particular,


1
1 1/(nB)
δ ≥ nB
and λ ≤ 1 − nB
.
n
n

Pn

0
i=1 xi . The scalars δ

We note that the lower bounds on δ and λ can be refined when some additional structure is imposed on the
underlying time-varying graphs. The results of Theorem 14 and their proofs can be found in [44] for a more
general case when fi are defined over Rd with d ≥ 1. A better convergence rate can be obtained under the
additional assumption that the objective functions fi are strongly convex; see [67].
The first work to have employed Push-Sum decentralized averaging within a decentralized optimization methods
is [17], and it was further investigated in [68]–[70]. This work focused on static graphs, and it has been proposed
as an alternative to the algorithm based on synchronous decentralized averaging over undirected graphs in order to
avoid deadlocks and synchronization issues, among others. This work also described a decentralized method based
on Push-Sum for multi-agent optimization problems with constraints by using Nesterov’s dual-averaging approach.
This Push-Sum consensus-based algorithm has been extended to the subgradient-push algorithm in [44], [67] that
can deal with convex optimization problems over time-varying directed graphs. More recently, the paper [71] has

25

extended the Push-Sum algorithm to a larger class of decentralized algorithms that are applicable to nonconvex
objectives, convex constraint sets, and time-varying graphs.
References [72], [73] combine EXTRA with the Push-Sum approach to produce the DEXTRA (Directed ExtraPush) algorithm for optimization over a directed graph. It has been shown that DEXTRA converges at a geometric
(R-linear) rate for a strongly convex objective function, but it requires a careful stepsize selection. It has been noted
in [72] that the feasible region of stepsizes which guarantees this convergence rate can be empty in some cases.
V. E XTENSIONS AND OTHER W ORK ON D ECENTRALIZED O PTIMIZATION
We discuss here some extension as well as other algorithms for minimizing the average sum f (·) = n1
in a decentralized manner.

Pn

i=1 fi (·)

A. Extensions
1) Per-agent constraints: When the nodes have a common convex and closed constraint set (known to each node)
X ⊂ Rd , the algorithms discussed in Section III easily extend to handle the simple set constraints by performing
projections on the set X . For example, the decentralized subgradient method in Eq. (16) can be modified to the
following update rule:


X
xk+1
= ΠX 
akij xkj − αk gik  ,
(33)
i
j∈Nik

where ΠX [·] is the Euclidean projection
X . The subgradient gik of the function fi can be evaluated
P on thek set
k
k
at the past iterate xi or at the point
j∈Nik aij xj . Since the projection mapping ΠX [·] is non-expansive (i.e.,
kΠX [x] − ΠX [y]k2 ≤ kx − yk2 for all x, y ), the convergence properties of the algorithm with projections remain
the same as that of the algorithm without projections.
A more complicated case arises when the constraint set is given as the intersection of per-node constraint sets,
i.e., X = ∩ni=1 Xi , where each Xi is a convex closed set and only known to node i. In this case, the node i update
in Eq. (33) is modified by replacing ΠX [·] with ΠXi [·], thus resulting in the following updates


X
akij xkj − αk gik  .
= ΠXi 
xk+1
i
j∈Nik

The projections on the individual agents’ constraint sets Xi (instead of the true constraint set X = ∩ni=1 Xi ) introduce
additional “perturbations”, which can be controlled with the step-size αk , provided that the sets Xi exhibit some
form of regularity. Regularity is a condition requiring that the sum of the distances of
Pa point from the individual sets
Xi is lower bounded by the distance of the point to the intersection of the sets, i.e., ni=1 kx−ΠXi k22 ≥ ckx−ΠX k22
for all x. As a result of these additional perturbations coming from the sets Xi , the convergence analysis of the
method is much more involved. This algorithm, including random set-selections, has been studied in [74]–[77]
for synchronous updates over time-varying graphs and for (random) asynchronous updates over a static graph. A
variant of this algorithm (using the Laplacian formulation of the consensus problem) for decentralized optimization
with decentralized constraints in noisy networks has been studied in [78], [79].
2) Effect of noise: We will discuss two possibilities, the case of (stochastic) noisy (sub)gradients and the case of
noisy links. In the former case, the decentralized subgradient method proceeds by using stochastic (sub)gradients
instead of subgradients. In particular, it assumes the following form:
X
xk+1
=
akij xkj − αk g̃ik (ω),
i
j∈Nik

where g̃ik (ω) is a stochastic vector (depending on a random variable ω ). As long as the stochastic subgradients have
zero mean and bounded variance (the conditions typically needed for convergence of the centralized stochastic
subgradient method) the decentralized method above can converge with probability 1 to a minimizers of f (·).
In particular,
if the conditions of Theorem 8 are satisfied,
and the stochastic
subgradient errors are such that



Eω g̃ik (ω) | xki = gik for some subgradient gik and Eω kg̃ik (ω) − gik k22 | xki ≤ σ 2 for all k and i, then it can be
seen that limk→∞ xki = x∗ for all i and for some x∗ ∈ X ∗ . Such a result can be shown by incorporating the analysis

26

of stochastic approximation methods with that of the decentralized subgradient method; the work addressing the
decentralized stochastic methods can be found in [47], [79], [80] for undirected time-varying graphs. Tighter bounds
on the rate of convergence are obtained for a stochastic version of the distributed dual averaging algorithm in [81].
For the case of directed time-varying graphs, the push-sum based method is studied for the case of stochastic
subgradients in [44].
k instead of the actual quantity xk that was sent by its
When the links are noisy, the agent i may receive xkj + ξij
j
k is a random link noise. The decentralized algorithm has the following form in this case:
neighbor j , where ξij


X
k
k
k
k k
xk+1
=
a
x
+
ξ
ij
j
j − α g̃i (ω).
i
j∈Nik
k } has zero mean and bounded variance, one can show that all the iterate
Assuming that the noise process {ξij
k
sequences {xi } converge to the same minimizer of f (·) almost surely (see for example [79], [82]). Decentralized
inference algorithms for general estimation problems (including nonlinear least squares) in stochastically timevarying networks under noisy gradient computation have been considered in [12]–[14].
3) Random graphs: In the literature of the decentralized methods for multi-agent optimization, the graph sequence
G1 , G2 , . . . is typically assumed to be externally given. The objective is to develop decentralized methods, given
a graph sequence that constrains the agent communications. Under such a point of view, the algorithmic design
does not address the question of designing the graph sequence, hence, does not optimize the network connectivity
structure.
Another common assumption encountered in the literature is that the graph sequence G1 , G2 , . . . is deterministic.
The only work known to us that departures from such an assumption is the case when the graph sequence {G` } is
independent and identically distributed (iid) random process. Specifically, each G` is a random realization drawn
from a given distribution on the set of all possible graphs on n nodes. In such a case, the connectivity assumption
is imposed on the expected graph Ḡ = E[G` ]. In this case, the decentralized subgradient becomes stochastic
X
akij xkj − αk g̃ik (ω),
xk+1
=
i
j∈Nik

where the neighbor sets Nik are random. To ease the representation, the method is re-written as
=
xk+1
i

n
X

akij xkj − αk g̃ik (ω),

(34)

j=1

where Ak is a stochastic (or double stochastic), and random with Ā = E[A` ]. Assuming that the graph G

Ā is

undirected and connected, the above decentralized subgradient method has been studied in [83]. A related version
of the distributed dual averaging algorithm is studied in the setting of iid undirected graphs in [19].
A special case of such a random iid graph sequence corresponds to the case when the agents use a random gossip
or a random broadcast to communicate over a network. These random protocols have traditionally been used in
network communication literature as protocols designed for asynchronous information exchange. They have also
been used in design of decentralized multi-agent optimization methods, as discussed in the next subsection.
4) Asynchronous vs synchronous computations: All the algorithms we discussed so far have been synchronous in
the sense that all nodes update at the same time and also use the same stepsize αk at iteration k . To accommodate
the asynchronous updates and, also, allow that agents use different stepsizes, one may resort to a random gossip
or broadcast communications, where a random link is activated for communication (gossip) or a random node is
activated to broadcast its information to the neighbors. In this case, the decentralized method assumes the form
as given in Eq. (34) where the matrix Ak takes a particular form. Specifically, for the random gossip scheme, the
underlying undirected graph G is static and, at any time k , only one edge is activated at random, say the edge
connecting agents ik and jk . In this case, the matrix Ak has the following form
Ak = I − γ(eik − ejk (eik − ejk )T ,

where γ ∈ (0, 1), and ei denotes the unit-norm vector with i entry equal to 1 and all other entries equal to 0. Each
matrix Ak is doubly stochastic, implying that the expected matrix Ā is also doubly stochastic.

27

In the case of a random broadcast, at every iteration k , each node can be activated with probability 1/n, and the
activated node broadcasts its value xki to all of the neighbors j ∈ Ni . Given that a node ik was activated at time
k , the updates follow the rule given in Eq. (34), where
Akj,ik = γ,
Akii = 1

Akjj = 1 − γ

for all i 6= ik ,

for all j ∈ Nik ,
Akij = 0

otherwise,

where γ ∈ (0, 1). Here, the node ik and the nodes that are not its neighbors, ` 6∈ Nik , do not update. Only the
neighbors of the node ik update. The matrices Ak are stochastic but not doubly stochastic. However, they have a
special property (in expectation) that pushes the iterates toward a consensus (see [46]), while the subgradients drive
the iterates toward a minimizer of f (·), resulting in an asynchonous algorithm converging with probability 1.
Consensus algorithms implemented in a network using a gossip-based or a broadcast-based communications have
been studied in [84]–[87], while a different consensus algorithm (the push-sum method) has been considered in [61],
[62]. A nonlinear gossip method is investigated in [88], while the survey paper [89] provides a detailed account of
gossip algorithms for decentralized averaging and their applications to signal processing in sensor networks.
B. Additional work on decentralized optimization
A decentralized algorithm preserving an optimality condition at every iteration has been proposed in [90]. Decentralized convex optimization algorithms for weight-balanced directed graphs have been investigated in continuoustime [91].
A different type of a decentralized algorithm for convex optimization has been proposed in [92], where each agent
keeps an estimate for all agents’ decisions. This algorithm solves a problem where the agents have to minimize
a global cost function f (x1 , . . . , xm ) while each agent i can control only its variable xi . The algorithm of [92]
has been recently extended to the online optimization setting in [80], [93], [94]. Decentralized algorithms based on
the augmented Lagrangian approach with gossip-type communications have been studied in [20], and accelerated
versions of decentralized gradient methods have been proposed and studied in [95]. A consensus-based algorithm
for solving problems with a separable constraint structure and the use of primal-dual decentralized methods have
been studied in [51], [96], [97], while a decentralized primal-dual approach with perturbations have been explored
in [98]. Work in [99] provides algorithms for centralized and decentralized convex optimization from the control
perspective, while [100] considers an event-triggered decentralized optimization for sensor networks. In [101], a
decentralized simplex algorithm has been developed for linear programming problems, while a Newton-Raphson
consensus-based method has been proposed in [102] for decentralized convex problems.
Although our discussion has mainly focused on studying asymptotic rates of convergence of iterative decentralized
optimization methods, in [103] it was shown that a related approach based on consensus can solve general
constrained abstract optimization problems in a finite number of iterations.
All of the work mentioned above relies on the use of state-independent weights, i.e., the weights that do not
depend on the agents’ iterates. A consensus-based algorithm employing state-dependent weights has been proposed
and analyzed in [104].
Another popular decentralized approach for consensus optimization over a static network is the alternating
direction method of multipliers (ADMM). This method is based on an equivalent formulation of the consensus
constraints. Unlike consensus-based (sub)-gradient method, which operates in the space of the primal-variables, the
ADMM solves a corresponding Lagrangian dual problem (obtained by relaxing the equality constraints that are
associated with consensus requirement). Just as any dual method, the ADMM is applicable to problems where the
structure of the objective functions fi is simple enough so that the ADMM updates can be executed efficiently. The
algorithm has the potential solve the problem with a geometric convergence rate, which requires global knowledge
of some parameters including eigenvalues of a weight matrix associated with the graph. A recent survey on the
ADMM and its various applications is given in [105]. The first work to address the development of decentralized
ADMM over a network is [106]–[108], and it has been investigated in [109], while its linear rate has been shown
in [110]. For an explicit analysis of the relationship to network topology, see [111], [112]. In [113] the ADMM
with linearization has been proposed for special composite optimization problems over graphs.
The work in [52], [53] utilizes an adapt-then-combine (ATC) strategy [114], [115] of dynamic weighted-average
consensus approach [116] to develop a distribute algorithm, termed Aug-DGM algorithm. This algorithm can be

28

used over static directed or undirected graphs (but requires doubly stochastic matrix). The most interesting aspect
of the Aug-DGM algorithm is that it can produce convergent iterates even when different agents use different
(constant) stepsizes.
Simultaneously and independently, the idea of tracking the gradient averages through the use of consensus
has been proposed in [52] for convex unconstrained problems and in [54] for non-convex problems with convex
constraints. The work in [54]–[56] develops a large class of decentralized algorithms, referred to as NEXT, which
utilizes various “function-surrogate modules” thus providing a great flexibility in its use and rendering a new class
of algorithms that subsumes many of the existing decentralized algorithms. The work in [55], [56] and in [53] have
also been proposed independently, with the former preceding the latter. The algorithm framework of [54]–[56] is
applicable to nonconvex problems with convex constraint sets over time-varying graphs, but requires the use of
doubly stochastic matrices. This assumption was recently removed in [71] by using column-stochastic matrices,
which are more general than the degree-based column-stochastic matrices of the push-sum method. Simultaneously
and independently, the papers [56] and [117] have appeared to treat nonconvex problems over graphs. The work
in [117] proposes and analyzes a decentralized gradient method based on the push-sum consensus in deterministic
and stochastic setting for unconstrained problems.
VI. C ONCLUSION AND O PEN P ROBLEMS
We have discussed decentralized optimization methods for minimizing the average of the nodes’ objectives
over graphs. We have considered undirected and directed time varying graphs, and computational models for
solving consensus problem in such graphs. Then, we have discussed decentralized optimization algorithms that
combine optimization techniques with decentralized averaging algorithms. We have also discussed extensions of
the consensus-based approaches and other decentralized optimization algorithms.
In terms of algorithm scalability with the number n of nodes, at present, it is an open question whether any
improvement on the quadratic convergence time of Proposition 5 is possible without an additional assumption about
the knowledge of n (recall that the assumption of knowing a reasonable upper bound on n was made in Theorem 6
to show a linear scaling with n in the convergence time). Also, it is not known whether a linear convergence-time
scaling can be obtained for time-varying graphs.
Another question for future research is the implementation of decentralized algorithms with lower communication
requirements. In particular, even broadcast based communications can be expensive, in terms of the power needed
to broadcast in some sensor networks. A question is how to implement decentralized algorithms with fewer
communications, and what trade-offs are involved in such implementations. Some initial investigations along these
lines were presented in [118] in the context of stochastic optimization, where progressively more time is spent
calculating gradients between each round of communication as the number of iterations progresses. There remains
much further work to be done along these lines.
Finally, we remark that although there are well-understood lower bounds on the number of iterations required
to achieve an -optimal solution in the context of centralized convex optimization [48], [119], much less is
understood about the fundamental limits of decentralized optimization. Although bounds on the number of iterations
for centralized algorithms carry over directly to synchronous decentralized algorithms, since any decentralized
algorithm can always be emulated on a centralized processor, these results do not provide insight into how
much communication is fundamentally required to reach consensus on an -optimal solution. In communicationconstrained settings (e.g., where network links have very low bandwidth), it remains an open question as to how
many iterations may be required, and a related line of questioning would be to understand when there may be
tradeoffs between communication and computation (e.g., to reach an -optimal solution there may be algorithms
which require significant computation and lower communication, or vice versa).
ACKNOWLEDGEMENTS
M.R. thanks Mido Assran for a careful reading and suggestions that improved this paper.
R EFERENCES
[1] D. Bertsekas, A. Nedić, and A. Ozdaglar, Convex Analysis and Optimization. Athena Scientific, 2003.
[2] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge Univeristy Press, 2004.

29

[3] J. Nocedal and S. Wright, Numerical Optimization, 2nd ed. Springer, 2006.
[4] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,”
IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803 – 812, Sep. 1986.
[5] J. Tsitsiklis, “Problems in decentralized decision making and computation,” Ph.D. dissertation, M.I.T., Nov. 1984.
[6] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods, 1st ed. Upper Saddle River, NJ, USA:
Prentice-Hall, Inc., 1989.
[7] Y. Cao, W. Yu, W. Ren, and G. Chen, “An overview of recent progress in the study of distributed multi-agent coordination,” IEEE
Transactions on Industrial Informatics, vol. 9, no. 1, pp. 427–438, Feb. 2013.
[8] G. Barrenetxea, F. Inglerest, G. Schaefer, and M. Vetterli, “Wireless sensor networks for environmental monitoring: The SensorScope
experience,” in Proc. IEEE Intl. Zurich Seminar on Communications, Zurich, Switzerland, Mar. 2008.
[9] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proc. ACM/IEEE Intl. Conf. on Information Processing
in Sensor Networks (IPSN), Berkeley, CA, USA, Apr. 2004, pp. 20–27.
[10] I. Schizas, G. Giannakis, S. Roumeliotis, and A. Ribeiro, “Consensus in ad hoc WSNs with noisy links - part ii: Distributed estimation
and smoothing of random signals,” IEEE Transactions on Signal Processing, vol. 56, no. 4, pp. 1650–1666, Apr. 2008.
[11] F. Cattivelli and A. Sayed, “Diffusion LMS strategies for distributed estimation,” IEEE Transactions on Signal Processing, vol. 58,
no. 3, pp. 1035–1048, Mar. 2010.
[12] S. Kar and J. Moura, “Convergence rate analysis of distributed gossip (linear parameter) estimation: Fundamental limits and tradeoffs,”
IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 674–690, Aug. 2011.
[13] S. Kar, J. Moura, and K. Ramanan, “Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect
communication,” IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575–3605, Jun. 2012.
[14] S. Kar and J. Moura, “Asymptotically efficient distributed estimation with exponential family statistics,” IEEE Transactions on
Information Theory, vol. 60, no. 8, pp. 4811–4831, Aug. 2014.
[15] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press,
2014.
[16] S. Ram, A. Nedic, and V. Veeravalli, “A new class of distributed optimization algorithms: Application to regression of distributed
data,” Optimization Methods and Software, vol. 27, no. 1, pp. 71–88, 2009.
[17] K. Tsianos, S. Lawlor, and M. Rabbat, “Consensus-based distributed optimization: Practical issues and applications in large-scale
machine learning,” in 50th Allerton Conference on Communication, Control, and Computing, 2012.
[18] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Transactions
on Signal Processing, vol. 60, no. 8, pp. 4289–4305, Aug. 2012.
[19] J. Duchi, A. Agarwal, and M. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,”
IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
[20] D. Jakovetić, J. Xavier, and J. Moura, “Cooperative convex optimization in networked systems: Augmented lagrangian algorithms
with directed gossip communication,” IEEE Transactions on Signal Processing, vol. 59, no. 8, pp. 3889–3902, 2011.
[21] J. Cortés, S. Martı́nez, and F. Bullo, “Robust rendezvous for mobile autonomous agents via proximity graphs in arbitrary dimensions,”
IEEE Transactions on Automatic Control, vol. 51, no. 8, pp. 1289–1298, Aug. 2006.
[22] M. Zavlanos, A. Ribeiro, and G. Pappas, “Network integrity in mobile robotic networks,” IEEE Transactions on Automatic Control,
vol. 58, no. 1, pp. 3–18, Jan. 2013.
[23] A. Jadbabaie, J. Lin, and A. Morse, “Coordination of groups of mobile autonomous agents using nearest neighbor rules,” IEEE
Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[24] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control,
vol. 54, no. 1, pp. 48–61, 2009.
[25] R. Durrett, Random Graph Dynamics. Cambridge University Press, 2007.
[26] M. Penrose, Random Geometric Graphs. Oxford University Press, 2003.
[27] A. Olshevsky, “Linear time average consensus on fixed graphs,” in Proceedings of NecSys, the 3rd IFAC Workshop on Distributed
Estimation and Control in Networked Systems, 2015.
[28] E. W. D. Levin, Y. Peres, Markov Chains and Mixing Times. American Mathematical Society, 2009.
[29] A. K. Chandra, P. Raghavan, W. Ruzzo, R. Smolensky, and P. Tiwari, “The electrical resistance of a graph captures its commute and
cover times,” Computational Complexity, vol. 6, no. 4, pp. 312–340, 1996.
[30] C. Avin and G. Ercal, “On the cover time and mixing time of random geometric graphs,” Theoretical Computer Science, vol. 380,
no. 4, pp. 2–22, 2007.
[31] A. Olshevsky, “Linear time average consensus and distributed optimization on fixed graphs,” SIAM Journal on Control and
Optimization, vol. 55, no. 6, pp. 3990–4014, 2017.
[32] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems and Control Letters, vol. 53, pp. 65–78, 2004.
[33] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized gossip algorithms,” IEEE Transactions on Information Theory, vol. 52,
no. 6, pp. 2508–2530, Jun. 2006.
[34] D. Spielman, “Graphs, vectors, and matrices,” Bulletin of the American Mathematical Society, vol. 54, no. 1, pp. 45–61, Jan. 2017.
[35] R. Olfati-Saber, “Algebraic connectivity ratio of Ramanujan graphs,” in Proc. IEEE American Conf. on Control, New York, USA, Jul.
2007.
[36] S. Kar, S. Aldosari, and J. Moura, “Topology for distributed inference in graphs,” IEEE Transactions on Signal Processing, vol. 56,
no. 6, pp. 2609–2613, Jun. 2008.
[37] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,”
IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.
[38] R. Varga, Matrix iterative analysis. Springer, 2000.
[39] Y. Nesterov, “A method of solving a convex optimization problem with convergence rate o(1/k2 ),” Soviet Mathematics Doklady,
vol. 27, no. 2, pp. 372–376, 1983.

30

[40] https://de.wikipedia.org/wiki/Subdifferential, image released into the public domain by Felix Reidel.
[41] B. Polyak, “A general method for solving extremum problems,” Soviet Mathematical Doklady, vol. 8, no. 3, pp. 593–597, 1967.
[42] N. Shor, Minimization methods for Nondifferentiable Functions. Berlin: Translated from Russian by K.C. Kiwiel and A. Ruszczynski,
Springer, 1985.
[43] B. Polyak, Introduction to Optimisation. New York: Optimization Software, Inc., 1987.
[44] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Transactions on Automatic Control,
vol. 60, no. 3, pp. 601–615, 2015.
[45] S. S. Ram, A. Nedić, and V. V. Veeravalli, “A new class of distributed optimization algorithms: application to regression of distributed
data,” Optimization Methods and Software, vol. 27, no. 1, pp. 71–88, 2012.
[46] A. Nedić, “Random projection algorithms for convex minimization problems,” Mathematical Programming, Series B, vol. 129, pp.
225–253, 2011.
[47] S. Ram, A. Nedić, and V. Veeravalli, “Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization,” Journal
of Optimization Theory and Applications, vol. 147, pp. 516–545, 2010.
[48] A. Nemirovski and D. Yudin, Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.
[49] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: an exact first order algorithm for decentralized consensus optimization,” SIAM Journal
on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[50] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” 2016,
accepted at SIAM Journal on Optimization 2017, available on arXiv at https://arxiv.org/abs/1607.03218.
[51] M. Zhu and S. Martı́nez, “On distributed convex optimization under inequality and equality constraints,” IEEE Transactions on
Automatic Control, vol. 57, no. 1, pp. 151–164, 2012.
[52] J. Xu, S. Zhu, Y. Soh, and L. Xie, “Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant
stepsizes,” in Proceedings of the 54th IEEE Conference on Decision and Control (CDC), 2015, pp. 2055–2060.
[53] J. Xu, “Augmented distributed optimization for networked systems,” Ph.D. dissertation, School of Electrical and Electronic Engineering,
Nanyang Technological University, 2016.
[54] P. D. Lorenzo and G. Scutari, “Distributed nonconvex optimization over networks,” in Proceedings of IEEE International Conference
on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP 2015), Dec. 13–16, 2015, Cancun, Mexico, 2015.
[55] ——, “Distributed nonconvex optimization over time-varying networks,” in Proceedings of IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP 16), March 20–25, 2016, Shanghai, China, 2016.
[56] ——, “NEXT: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2,
no. 2, pp. 120–136, 2016.
[57] G. Qu and N. Li, “Accelerated distributed nesterov gradient descent,” preprint. Available at https://arxiv.org/abs/1705.07176.
[58] ——, “Harnessing smoothness to accelerate distributed optimization,” 2017, to appear in IEEE Transactions on Control of Network
Systems.
[59] J. M. Hendrickx and J. N. Tsitsiklis, “Fundamental limitations for anonymous distributed systems with broadcast communications,”
in Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing, 2015.
[60] T. Charalambous, M. Rabbat, M. Johansson, and C. Hadjicostis, “Distributed finite-time computation of digraph parameters: Lefteigenvector, out-degree and spectrum,” IEEE Transactions on Control of Network Systems, vol. 3, no. 2, pp. 137–148, 2016.
[61] D. Kempe, A. Dobra, and J. Gehrke, “Gossip-based computation of aggregate information,” in Proceedings of the 44th Annual IEEE
Symposium on Foundations of Computer Science, 2003, pp. 482–491.
[62] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli, “Weighted gossip: distributed averaging using non-doubly stochastic
matrices,” in Proceedings of the 2010 IEEE International Symposium on Information Theory, Jun. 2010.
[63] A. Dominguez-Garcia and C. Hadjicostis, “Distributed algorithms for control of demand responses and distributed energy resources,”
in Proceedings of the 50th IEEE Conference on Decision and Control and European Control Conference, Dec 2011, pp. 27–32.
[64] C. N. Hadjicostis and T. Charalambous, “Average consensus in the presence of delays in directed graph topologies,” IEEE Transactions
on Automatic Control, vol. 59, no. 3, pp. 763–768, 2014.
[65] C. N. Hadjicostis, N. H. Vaidya, and A. D. Domı́nguez-Garcı́a, “Robust distributed average consensus via exchange of running sums,”
IEEE Transactions on Automatic Control, vol. 61, no. 6, pp. 1492–1507, 2016.
[66] A. Dominguez-Garcia and C. Hadjicostis, “Distributed strategies for average consensus in directed graphs,” in Proceedings of the 50th
IEEE Conference on Decision and Control and European Control Conference, Dec 2011.
[67] A. Nedić and A. Olshevsky, “Stochastic gradient-push for strongly convex functions on time-varying directed graphs,” IEEE
Transactions on Automatic Control, vol. 61, no. 12, pp. 3936 – 3947, 2016.
[68] K. Tsianos, S. Lawlor, and M. Rabbat, “Push-sum distributed dual averaging for convex optimization,” in Proceedings of the IEEE
Conference on Decision and Control, 2012.
[69] K. Tsianos and M. Rabbat, “Distributed consensus and optimization under communication delays,” in Proc. of Allerton Conference
on Communication, Control, and Computing, 2011, pp. 974–982.
[70] K. Tsianos, “The role of the network in distributed optimization algorithms: Convergence rates, scalability, communication /
computation tradeoffs and communication delays,” Ph.D. dissertation, McGill University, Dept. of Electrical and Computer Engineering,
2013.
[71] Y. Sun, G. Scutari, and D. Palomar, “Distributed nonconvex multiagent optimization over time-varying networks,” in Proc. of the
Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, Nov. 2016.
[72] C. Xi and U. Khan, “On the linear convergence of distributed optimization over directed graphs,” 2015, available on arXiv at
http://arxiv.org/abs/1510.02149.
[73] J. Zeng and W. Yin, “Extrapush for convex smooth decentralized optimization over directed networks,” 2015, available on arXiv at
http://arxiv.org/abs/1511.02942.
[74] A. Nedić, A. Ozdaglar, and P. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on
Automatic Control, vol. 55, no. 4, pp. 922 –938, April 2010.

31

[75] S. Lee, “Optimization over networks: Efficient algorithms and analysis,” Ph.D. dissertation, Department of Electrical and Computer
Engineering, University of Illinois at Urbana-Champaign, 2013.
[76] S. Lee and A. Nedić, “Distributed random projection algorithm for convex optimization,” IEEE Journal on Selected Topics in Signal
Processing, vol. 48, no. 6, pp. 988–1001, 2012.
[77] ——, “Asynchronous gossip-based random projection algorithms over networks,” IEEE Transactions on Automatic Control, vol. 61,
no. 4, pp. 953–968, 2016.
[78] K. Srivastava, A. Nedić, and D. Stipanović, “Distributed constrained optimization over noisy networks,” in Proceedings of the 49th
IEEE Conference on Decision and Control (CDC), Dec. 2010, pp. 1945 –1950.
[79] K. Srivastava and A. Nedić, “Distributed asynchronous constrained stochastic optimization,” IEEE Journal of Selected Topics in Signal
Processing, vol. 5, no. 4, pp. 772–790, 2011.
[80] S. Lee, A. Nedić, and M. Raginsky, “Stochastic dual-averaging for decentralized online optimization on time-varying communication
graphs,” to appear, accepted for publication in IEEE Transactions on Automatic Control, January 2017.
[81] K. Tsianos and M. Rabbat, “Efficient distributed online prediction and stochastic optimization with approximate distributed averaging,”
IEEE Transactions on Signal and Information Processing Over Networks, vol. 2, no. 4, pp. 489–506, Dec. 2016.
[82] K. Srivastava, A. Nedić, and D. Stipanovic, “Distributed constrained optimization over noisy networks,” in Proceedings of the 49th
IEEE Conference on Decision and Control, 2010, pp. 1945–1950.
[83] I. Lobel and A. Ozdaglar, “Distributed subgradient methods for convex optimization over random networks,” IEEE Transactions on
Automatic Control, vol. 56, no. 6, pp. 1291 –1306, June 2011.
[84] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Gossip algorithms: Design, analysis, and applications,” in Proceedings of IEEE
INFOCOM, vol. 3, 2005, pp. 1653–1664.
[85] T. Aysal, M. Yildiz, A. Sarwate, and A. Scaglione, “Broadcast gossip algorithms: Design and analysis for consensus,” in Proceedings
of the 47th IEEE Conference on Decision and Control, 2008, pp. 4843–4848.
[86] T. Aysal, M. Yildriz, A. Sarwate, and A. Scaglione, “Broadcast gossip algorithms for consensus,” IEEE Transactions on Signal
processing, vol. 57, pp. 2748–2761, 2009.
[87] J. Liu, S. Mou, A. Morse, B. Anderson, and C. Yu, “Deterministic gossiping,” Proceedings of the IEEE, vol. 99, no. 9, pp. 1505–1524,
2011.
[88] A. Mathkar and V. Borkar, “Nonlinear gossip,” SIAM Journal on Control and Optimization, vol. 54, no. 3, pp. 1535–1557, 2016.
[89] A. Dimakis, S. Kar, J. Moura, M. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proceedings of
the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
[90] J. Lu and C. Tang, “Zero-gradient-sum algorithms for distributed convex optimization: the continuous-time case,” IEEE Transactions
on Automatic Control, vol. 57, no. 9, pp. 2348–2354, 2012.
[91] B. Gharesifard and J. Cortes, “Distributed continuous-time convex optimization on weight-balanced digraphs,” 2012, preprint, available
at http://arxiv.org/abs/1204.0304.
[92] N. Li and J. R. Marden, “Designing games for distributed optimization,” IEEE Journal on Selected Topics in Signal Processing, vol. 7,
no. 2, pp. 230–242, 2013.
[93] A. Nedić, S. Lee, and M. Raginsky, “Decentralized online optimization with global objectives and local communication,” in Proceedings
of the 2016 American Control Conference (ACC), Boston, MA, July 6–8, 2016, 2016, pp. 4497–4503.
[94] S. Lee, A. Nedić, and M. Raginsky, “Coordinate dual averaging for decentralized online optimization with nonseparable global
objectives,” to appear, accepted in IEEE Transactions on Control of Network Systems, May 2016.
[95] D. Jakovetić, J. Xavier, and J. Moura, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5,
pp. 1131–1146, 2014.
[96] M. Zhu and S. Martı́nez, “An approximate dual subgradient algorithm for distributed non-convex constrained optimization,” IEEE
Transactions on Automatic Control, vol. 58, no. 6, pp. 1534–1539, 2013.
[97] K. Srivastava, A. Nedić, and D. Stipanovic, “Distributed bregman-distance algorithms for min-max optimization,” in Agent-Based
Optimization. Springer Studies in Computational Intelligence (SCI), 2013, pp. 143–174.
[98] T.-H. Chang, A. Nedić, and A. Scaglione, “Distributed constrained optimization by consensus-based primal-dual perturbation method,”
IEEE Transactions on Automatic Control, vol. 59, no. 6, pp. 1524–1538, 2014.
[99] J. Wang and N. Elia, “A control perspective for centralized and distributed convex optimization,” in Proceedings of the IEEE Conference
on Decision and Control, (Florida, USA), 2011, pp. 3800–3805.
[100] P. Wan and M. Lemmon, “Event-triggered distributed optimization in sensor networks,” in Symposium on Information Processing of
Sensor Networks, (San Francisco, CA), 2009, pp. 49–60.
[101] M. Bürger, G. Notarstefano, F. Bullo, and F. Allgöwer, “A distributed simplex algorithm for degenerate linear programs and multi-agent
assignments,” Automatica, vol. 48, no. 9, pp. 2298–2304, 2012.
[102] F. Zanella, D. Varagnolo, A. Cenedese, G. Pillonetto, and L. Schenato, “Newton-raphson consensus for distributed convex optimization,”
in Proceedings of the IEEE Conference on Decision and Control, (Florida, USA), 2011, pp. 5917–5922.
[103] G. Notarstefano and F. Bullo, “Distributed abstract optimization via constraints consensus: Theory and applications,” IEEE Transactions
on Automatic Control, vol. 56, no. 10, pp. 2247–2261, 2011.
[104] I. Lobel, A. Ozdaglar, and D. Feijer, “Distributed multi-agent optimization with state-dependent communication,” Mathematical
Programming, vol. 129, no. 2, pp. 255–284, 2011.
[105] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction
method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010.
[106] I. Schizas, A. Ribeiro, and G. Giannakis, “Consensus in ad hoc WSNs with noisy links—part I: Distributed estimation of deterministic
signals,” IEEE Transactions on Signal Processing, vol. 56, no. 1, pp. 350–364, Jan. 2008.
[107] E. Wei and A. Ozdaglar, “Distributed alternating direction method of multipliers,” in Proceedings of the 51st IEEE Conference on
Decision and Control and European Control Conference, 2012, pp. 5445–5450.

32

[108] ——, “On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers,” in Proceedings of IEEE
Global Conference on Signal and Information Processing, 2013, pp. 551–554.
[109] Q. Ling and A. Ribeiro, “Decentralized dynamic optimization through the alternating direction method of multiplier,” IEEE Transactions
on Signal Processing, vol. 62, no. 5, pp. 1185–1197, 2014.
[110] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,”
IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750 – 1761, 2014.
[111] G. Franca and J. Bento, “Markov chain lifting and distributed admm,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 294–298,
2017.
[112] ——, “How is distributed admm affected by network topology?” https://arxiv.org/abs/1710.00889.
[113] N. Aybat, Z. Wang, T. Lin, and S. Ma, “Distributed linearized alternating direction method of multipliers for composite convex
consensus optimization,” 2015, preprint, available on arxiv at http://arxiv.org/abs/1512.08122.
[114] A. Sayed, “Diffusion adaptation over networks,” in Academic Press Library in Signal Processing, R. Chellapa and S. Theodoridis,
Eds. Elsevier, 2013, vol. 3, pp. 323–454.
[115] ——, Adaptation, Learning, and Optimization over Networks. Foundations and Trends in Machine Learning, 2014, vol. 7.
[116] M. Zhu and S. Martı́nez, “Discrete-time dynamic average consensus,” Automatica, vol. 46, pp. 322 – 329, 2010.
[117] T. Tatarenko and B. Touri, “Non-convex distributed optimization,” 2016, preprint. Available at http://arxiv.org/abs/1512.00895.
[118] K. Tsianos, S. Lawlor, and M. Rabbat, “Communication/computation tradeoffs in consensus-based distributed optimization,” in Proc.
Advances in Neural Information Processing Systems, Lake Tahoe, USA, Dec. 2012, pp. 1943–1951.
[119] Y. Nesterov, Introductory Lectures on Convex Optimization. Springer, 2003.

