1

Distributed optimization over time-varying
directed graphs

arXiv:1303.2289v2 [math.OC] 16 Mar 2014

Angelia Nedić and Alex Olshevsky

Abstract—We consider distributed optimization by a
collection of nodes, each having access to its own convex
function, whose collective goal is to minimize the sum
of the functions. The communications between nodes are
described by a time-varying sequence of directed graphs,
which is uniformly strongly connected. For such communications, assuming that every node knows its outdegree, we develop a broadcast-based algorithm, termed
the subgradient-push, which steers every node to an optimal value under a standard assumption of subgradient
boundedness. The subgradient-push requires no knowledge
of either the number of agents or the graph sequence
to implement. Our analysis shows that the subgradient√
push algorithm converges at a rate of O ln t/ t . The
proportionality constant in the convergence rate depends
on the initial values at the nodes, the subgradient norms
and, more interestingly, on both the speed of the network
information diffusion and the imbalances of influence
among the nodes.

I. I NTRODUCTION
We consider the problem of distributed convex optimization by a network of nodes when knowledge of the
objective function is scattered throughout the network
and unavailable at any singe location. There has been
much recent interest in multi-agent optimization problems of this type that arise whenever a large collections
of nodes - which may be processors, nodes of a sensor
network, vehicles, or UAVs - desire to collectively optimize a global objective by means of local actions taken
by each node without any centralized coordination.
Specifically, we will study the problem of optimizing
a sum of n convex functions by a network of n nodes
when each function is known to only a single node.
This problem frequently arises when control and signal
processing protocols need to be implemented in sensor
networks. For example, the problems including robust
statistical inference [23], formation control [21], nonautonomous power control [26], distributed message
routing [20], and spectrum access coordination [11],
can be reduced to variations of this problem. We will
The authors are with the Department of Industrial and Enterprise Systems Engineering, University of Illinois at UrbanaChampaign, 104 S. Mathews Avenue, Urbana IL, 61801, Emails:
{angelia,aolshev2}@illinois.edu. A. Nedić gratefully acknowledges
the support by the NSF grants CMMI 07-42538 and CCF 11-11342,
and by the ONR Navy Basic Research Challenge N00014-12-1-0998.

be focusing on the case when communication between
nodes is directed and time-varying.
Distributed optimization of a sum of convex functions
has received a surge of interest in recent years [18],
[23], [9], [14], [12], [13], [28], [16], [3], [24], [5].
There is now a considerable theory justifying the use
of distributed subgradient methods in this setting, and
their performance limitations and convergence times
are well-understood. Moreover, distributed subgradient
methods have been used to propose new solutions for a
number of problems in distributed control and sensor
networks [26], [20], [11]. However, the works cited
above assumed communications among nodes are either
fixed or undirected.
Our paper is the first to demonstrate a working subgradient protocol in the setting of directed time-varying
communications. We develop a broadcast-based protocol, termed the subgradient-push, which steers every
node to an optimal value under a standard assumption of
subgradient boundedness. The subgradient-push requires
each node to know its out-degree at all times, but
beyond this it needs no knowledge of the graph sequence
of even of the number of agents to implement. √Our

results show that it converges at a rate of O ln t/ t ,
where the constant depends, among other factors, on
the information diffusion speed of the corresponding
directed graph sequence and a measure of the influence
imbalance among the nodes.
Our work is closest to the recent papers [30], [31],
[32], [29], [6]. The papers [30], [31], [32], [29] proposed
a distributed subgradient algorithm which is very similar
to the one we study here, involving the introduction of
subgradients into an information aggregation procedure
known as “push-sum.” The convergence of this scheme
for directed but fixed topologies was shown in [30],
[31], [32], [29]; implementation of the protocol proposed
in these papers appears to require knowledge of the
graph or of the number of agents. By contrast, our
results work in time-varying networks and are fully
distributed, requiring no knowledge of either the graph
sequence or the number of agents. The paper [6] shows
the convergence of a distributed optimization protocol
in continuous time, also for directed but fixed graphs;
moreover, an additional assumption is made in [6] that
the graph is “balanced.”

2

All the prior work in distributed optimization, except for [30], [31], [32], [29], requires time-varying
communications with some form of balancedness, often
reflected in a requirement of having a sequence of
doubly stochastic matrices that are commensurate with
the sequence of underlying communication graphs. In
contrast, our proposed method removes the need for
the doubly stochastic matrices. The proposed distributed
optimization model is motivated by applications that are
characterized by time-varying directed communications,
such as those arising in a mobile sensor network where
the links among nodes will come and go as nodes
move in and out of line-of-sight or broadcast range of
each other. Moreover, if different nodes are capable of
broadcasting messages at different power levels, then
communication links connecting the nodes will necessarily be unidirectional.
The remainder of this paper is organized as follows.
We begin in Section II where we describe the problem
of interest, outline the subgradient-push algorithm, and
state the main convergence results. Section III is devoted
to the proof of a key lemma, namely the convergence
rate result for a perturbed version of the so-called pushsum protocol; this lemma is then used in the subsequent
proofs of convergence and convergence rate for the
subgradient-push in Section IV. Finally, some conclusions are offered in Section VI.
Notation: We use boldface to distinguish between the
vectors in Rd and scalars or vectors of different dimensions. For example, the vector xi (t) is in boldface while
the scalar yi (t) is not. The vectors such as y(t) ∈ Rn
obtained by stacking scalar values yi (t) associated with
the nodes are thus not bolded. Additionally, for a vector
y, we will also occasionally use [y]j to denote its j’th
entry. For a matrix A, we will use Aij or [A]ij to
denote its i, j’th entry. The notation A0 will refer to
the transpose of the matrix A. The vectors are seen as
column vectors unless otherwise explicitly stated. We use
1 to denote the vector of ones, and kyk for the Euclidean
norm of a vector y.
II. P ROBLEM , A LGORITHM AND M AIN R ESULTS
We consider a network of n nodes whose goal is to
solve distributedly the following minimization problem:
n
X
minimize F (z) ,
fi (z) over z ∈ Rd ,
i=1

where only node i knows the convex function fi : Rd →
R. Under the assumption that the set of optimal solutions
Z ∗ = Argminz∈Rd F (z) is nonempty, we would like to
design a protocol by which all agents maintain variables
zi (t) converging to the same point in Z ∗ with time.
We will assume that, at each time t, node i can only
send messages to its out-neighbors in some directed

graph G(t). Naturally, the graph G(t) will have vertex
set {1, . . . , n}, and we will use E(t) to denote its
edge set. Also, naturally, the sequence {G(t)} should
posses some good long-term connectivity properties. A
standard assumption, which we will be making, is that
the sequence {G(t)} is uniformly strongly connected (or,
as it is sometimes called, B-strongly-connected), namely,
that there exists some ineger B > 0 (possibly unknown
to the nodes) such that the graph with edge set
(k+1)B−1

EB (k) =

[

E(i)

i=kB

is strongly connected for every k ≥ 0. This is a typical
assumption for many results in multi-agent control: it is
considerably weaker than requiring each G(t) be connected for it allows the edges necessary for connectivity
to appear over a long time period and in arbitrary order;
however, it is still strong enough to derive bounds on the
speed of information propagation from one part of the
network to another.
Finally, we introduce the notation Niin (t) and Niout (t)
for the in- and out-neighborhoods of node i, respectively,
at time t. We will require these neighborhoods to include
the node i itself1 ; formally, we have
Niin (t)

= {j | (j, i) ∈ E(t)} ∪ {i},

Niout (t)

= {j | (i, j) ∈ E(t)} ∪ {i},

and di (t) for the out-degree of node i, i.e.,
di (t) = |Niout (t)|.
Crucially, we will be assuming that every node i knows
its out-degree di (t) at every time t.
Our main contribution is the design of an algorithm
which successfully accomplishes the task of distributed
minimization of F (z) assumptions we have laid out
above. Our scheme is a combination of the subgradient
method and the so-called push-sum protocol, recently
studied in the papers [1], [4], [10], [7]. We will refer
to our protocol as the subgradient-push method. A
subgradient method using the push-sum protocol has
been considered in [32], [30], [31], [29] for a fixed
communication network, whose implementation requires
some knowledge of the graph or the number of agents. In
contrast, our algorithm can handle time-varying networks
and its implementation requires node i knowing its local
out-degree di (t) only.
A. The subgradient-push method
Every node i maintains vector variables xi (t), wi (t) ∈
Rd , as well as a scalar variable yi (t). These quantities are
1 Alternatively, one may define these neighborhoods in a standard
way of the graph theory, but require that each graph in the sequence
{G(t)} has a self-loop at every node.

3

updated according to the following rules: for all t ≥ 0
and all i = 1, . . . , n,
X xj (t)
wi (t + 1) =
,
dj (t)
j∈Niin (t)
X yj (t)
,
yi (t + 1) =
dj (t)
in
j∈Ni (t)

zi (t + 1)

=

wi (t + 1)
,
yi (t + 1)

xi (t + 1)

=

wi (t + 1) − α(t + 1)gi (t + 1), (1)

where gi (t + 1) is a subgradient of the function fi (z) at
z = zi (t + 1). The method is initiated with an arbitrary
vector xi (0) ∈ Rd at node i, and with yi (0) = 1 for
all i. The stepsize α(t + 1) > 0 satisfies the following
decay conditions
∞
X

α(t) = ∞,

t=1

∞
X

kgi k ≤ Li for all subgradients gi of fi (z).
Then, the distributed subgradient-push method of Eq. (1)
with the stepsize satisfying the conditions in Eq. (2) has
the following property
lim zi (t) = z∗

t→∞

α2 (t) < ∞,

t=1

α(t) ≤ α(s) for all t > s ≥ 1.

(2)

We note that the above equations have simple broadcastbased implementation: each node j broadcasts the quantities xj (t)/dj (t), yj (t)/dj (t) to all of the nodes i in its
out-neighborhood2 , which simply sum all the messages
they receive to obtain wi (t+1) and yi (t+1). The update
equations for zi (t+1), xi (t+1) can be executed without
any further communications among nodes during step t.
Without the subgradient term in the final equation, our
protocol would be a version of the push-sum protocol
[10] for average computation studied recently in [1],
[4], [7]. For intuition on the precise form of these
equations, we refer the reader to these three papers;
roughly speaking, the somewhat involved form of the
updates is intended to ensure that every node receives
an equal weighting after all the linear combinations and
ratios have been taken. In this case the vectors zi (t + 1)
converge to some common point, i.e., a consensus is
achieved. The inclusion of the subgradient terms in the
updates of xi (t + 1) is intended to steer the consensus
point towards the optimal set Z ∗ , while the push-sum
updates steer the vectors zi (t + 1) towards each other.
Our main results, which we describe in the next section,
demonstrate that this scheme succeeds in steering all
vectors zi (t + 1) towards the same point in the solution
set Z ∗ .
B. Our results
Our first theorem demonstrates the correctness of the
subgradient-push method for an arbitrary stepsize α(t)
satisfying Eq. (2); this holds under the assumptions we
2 We note that we make use here of the assumption that node i knows

its out-degree di (t).

have laid out above, as well as an additional technical
assumption on the subgradient boundedness.
Theorem 1: Suppose that:
(a) The graph sequence {G(t)} is uniformly strongly
connected.
(b) Each function fi (z) is convex over Rd and the set
Z ∗ = Argminz∈Rd F (z) is nonempty.
(c) The subgradients of each fi (z) are uniformly
bounded, i.e., there exists Li < ∞ such that for
all z ∈ Rd ,

for all i and for some z∗ ∈ Z ∗ .

Our second theorem makes explicit the rate at which
the objective function converges to its optimal value. As
standard with subgradient methods, we will make two
tweaks in order to get a convergence rate √
result: (i) we
take a stepsize which decays as α(t) = 1/ t (stepsizes
which decay at faster rates usually produce inferior
convergence rates), and (ii) each node i will maintain
a convex combination of the values zi (1), zi (2), . . . for
which the convergence rate will be obtained. We then
demonstrate
√that the subgradient-push converges at a rate
of O(ln t/ t); this is formally stated in the following
theorem. The theorem makes use of the matrix A(t) that
captures the weights used in the construction of wi (t+1)
and yi (t + 1) in Eq. (1), which are defined by

1/dj (t) whenever j ∈ Niin (t),
Aij (t) =
(3)
0
otherwise.
Moreover, we will use the notation
n
X
L=
Li .
i=1

Theorem 2: Suppose that all √
the assumptions of
Theorem 1 hold, and let α(t) = 1/ t for t ≥ 1. Moreover, suppose that every node i maintains the variable
e
zi (t) ∈ Rd initialized at time t = 0 with any e
zi (0) ∈ Rd
and updated by
α(t + 1)zi (t + 1) + S(t)e
zi (t)
for t ≥ 0,
S(t + 1)
Pt−1
where S(0) = 0 and S(t) = s=0 α(s + 1) for t ≥ 1.
Then, we have for all t ≥ 1, i = 1, . . . , n, and any
z∗ ∈ Z ∗ ,
e
zi (t + 1) =

F (e
zi (t + 1)) − F (z∗ ) ≤
+

L2 (1 + ln(t + 1))
√
2n t + 1

n kx̄(0) − z∗ k1
√
2
t+1

4

24L

Pn

j=1 kxj (0)k1

√
δ(1 − λ) t + 1
24dL2 (1 + ln t)
√
,
+
δ(1 − λ) t + 1

+

where

n

x̄(0) =

1X
xi (0).
n i=1

The scalars λ and δ are functions of the graph sequence
G(1), G(2), . . . , which are given by

1/(nB)
1
1
λ ≤ 1 − nB
.
δ ≥ nB ,
n
n
If each of the graphs G(t), t ≥ 1, is regular3 , then
(
)
1/B
1
δ = 1, λ ≤ min
1− 3
, max σ2 (A(t)) ,
t≥0
4n
where A(t) is defined by Eq. (3) and σ2 (A) is the
second-largest singular value of a matrix A.
Theorem 2 implies that, along the time-averages e
zi (t)
for each node i, the network objective function F (z)
converges to the optimal objective value F ∗ , i.e.,
lim F (e
zi (t)) = F ∗

t→∞

for all i.

However, the theorem does not state anything about the
convergence of the sequences {e
zi (t)} for i = 1, . . . , n.
Theorem 2 provides the rate at which F (e
zi (t)) converges to F ∗ for any i, which is expected. Specifically,
it is standard for a distributed √
subgradient method to
converge at the rate of O(ln t/ t) with the constant
depending on the subgradient-norm upper bounds Li
and the initial conditions xi (0) [25], [5]. Moreover,
it is also standard for the convergence rate of the
distributed methods over networks to depend on some
measure of the connectivity of the directed graph sequence G(1), G(2), . . .. Namely, here, the closeness of
λ to 1 measures the speed at which the (connectivity)
graph sequence {G(t)} diffuses the information among
the nodes over time. Additionally, our rate results also
include the parameter δ, which is a measure of the
imbalance of influences among the nodes, as we will
later see. Time-varying directed regular networks are
uniform in influence and will have δ = 1, so that δ will
disappear from the bounds entirely; however, networks
which have a row very close to zero, corresponding
nodes which are only weakly influenced by all others,
will suffer a corresponding blow-up in the convergence
time of the subgradient-push algorithm. Consequently, δ
may be thought of as a measure of the uniformity of
long-term influence among the nodes.
3 A directed graph G(t) is regular if every out-degree and every
in-degree of a node in G(t) equals d(t) for some d(t).

Moreover, while the term 1/(δ(1 − λ)) appearing in
our rate estimate is bounded exponentially by n2nB in
the worst case, the term need not be this large for every
graph sequence. Indeed, Theorem 2 shows that for a class
of time-varying regular directed graphs, 1/(δ(1 − λ))
scales polynomially in n. Our work therefore motivates
the search for effective bounds on consensus speed
and the influence imbalances for time-varying directed
graphs. Finally, we note that previous research [18], [25],
[5] has studied the case when the matrices A(t) in (3) are
doubly stochastic, which occurs when the directed graph
sequence {G(t)} is regular. In this case, our polynomial
bounds match the previously known results.
We conclude by briefly summarizing the idea of the
proofs as well as the organization of the remainder of
this paper. As previously remarked, our protocol is a
perturbation of the so-called “push-sum” protocol for
averaging studied in [1], [4], [10]. We begin in Section
III by showing that as a consequence of well-known facts
about consensus protocols, such perturbed push-sum protocols are guaranteed to converge when the perturbations
are well behaved (in a sense). In the subsequent Section
IV, we interpret the subgradient-push algorithm as a
special perturbation of the push-sum protocol and use
the convergence results of Section III to show that,
after a transient period, our subgradient-push algorithm
begins to approximate a centralized subgradient scheme.
Finally some simulations are performed in Section V and
conclusions are drawn in Section VI.
III. P ERTURBED P USH -S UM P ROTOCOL
This section is dedicated to the analysis of a perturbed
version of the so-called push-sum protocol, originally
introduced in the groundbreaking work [10] and recently
analyzed in time-varying directed graphs in [1], [4]. The
push-sum is a protocol for node interaction which allows
nodes to compute averages and other aggregates in the
network with directed communication links.
The work in [10], [1], [4] demonstrates the convergence of the push-sum protocol. Here, we generalize this
result by showing that the protocol remains convergent
even if the state of the nodes is perturbed at each step,
as long as the perturbations decay to zero. However,
while the iterate sequences (at the nodes) obtained by the
push-sum method converge to the average of the initial
values of the nodes, the iterate sequences produced by
the perturbed push-sum method need not converge to a
common point; instead, they converge to each other over
time. We will later use this result to prove Theorems 1
and 2. Since the result has a self-contained interpretation
and analysis, we sequester it to this section.
We begin with a statement of the perturbed pushsum update rule. Every node i maintains scalar variables

5

xi (t), yi (t), zi (t), wi (t), where yi (0) = 1 for all i. These
variables are updated as follows: for t ≥ 0,
wi (t + 1)

=

X
j∈Niin (t)

yi (t + 1)

=

X
j∈Niin (t)

xj (t)
,
dj (t)
yj (t)
,
dj (t)

zi (t + 1)

=

wi (t + 1)
,
yi (t + 1)

xi (t + 1)

=

wi (t + 1) + i (t + 1),

zi (t + 1) −

=

A(t)x(t),

y(t + 1)

=

A(t)y(t),

zi (t + 1)

=

wi (t + 1)
yi (t + 1)

x(t + 1)

=

w(t + 1) + (t + 1),

10 x(t)
n

≤

(4)

where i (t) is a perturbation at time t, perhaps adversarially chosen. Recall that Niin (t) is the in-neighborhood
of node i in a directed graph G(t) and dj (t) is the outdegree of node j, as defined in Section II.
Without the perturbation term i (t), the method in
Eq. (4) reduces to the push-sum protocol. Moreover, our
subgradient-push method of Eq. (1) is simply a vectorspace analog of the perturbed pus-sum method of Eq. (4)
with a specific choice of the perturbations.
The intuition behind the push-sum equations of Eq. (4)
is somewhat involved. This dynamic has been introduced
for the purpose of average computation (in the case
when all the perturbations i (t) are zero) and has a
simple motivating intuition. The push-sum is a variation
of a consensus-like protocol wherein every node updates
its values by taking linear combinations of the values
of its neighbors. Due to taking linear combinations,
some nodes are bound to be more influential than
others (meaning that other nodes end up placing larger
weights on their information), for example by virtue of
being more centrally placed. To cancel out the effect
of these influence imbalances of the nodes, the ratios
zi (t) = wi (t)/yi (t) are P
used, which ensures that each
n
zi (t) converges to (1/n) i=1 xi (0). We refer the reader
to [10], [4], [1] for more details.
Next, we rewrite the perturbed push-sum equations
more compactly by using the definition of the matrix
A(t) in Eq. (3). Then, the relations in Eq. (4) assume
the following form: for all t ≥ 0,
w(t + 1)

Eq. (4), or equivalently Eq. (5). Specifically, the bulk of
this section is dedicated to proving the following lemma.
Lemma 1: Consider the sequences {zi (t)}, i =
1, . . . , n, generated by the method in Eq. (4). Assuming
that the graph sequence {G(t)} is uniformly strongly
connected, the following statements hold:
(a) For all t ≥ 1 we have

for all i = 1, . . . , n,
(5)

where (t) = (1 (t), . . . , n (t))0 . Each of matrices A(t)
is column-stochastic but not necessarily row-stochastic.
We are concerned with demonstrating a convergence
result and a convergence rate for the updates given in

8 t
λ kx(0)k1
δ
!
t
X
+
λt−s k(s)k1 ,
s=1

where δ > 0 and λ ∈ (0, 1) satisfy

1/B
1
1
δ ≥ nB , λ ≤ 1 − nB
.
n
n
If each of the matrices A(t) in Eq. (3) are doubly
stochastic, then
(
)
1/B
1
δ = 1, λ ≤
1− 3
, max σ2 (A(t)) .
t≥0
4n
(b) If limt→∞ i (t) = 0 for all i = 1, . . . , n, then
lim zi (t + 1) −

t→∞

10 x(t)
= 0 for all i.
n

(c) If {α(t)} is P
a non-increasing positive scalar se∞
quence with t=1 α(t)|i (t)| < ∞ for all i, then
∞
X
t=0

α(t + 1) zi (t + 1) −

10 x(t)
< ∞ for all i.
n

For part (b) of Lemma 1, observe that each of the
matrices A(t) is doubly stochastic if each of the graphs
G(t) is regular. Furthermore, observe that if i (t) = 0,
this lemma implies that the push-sum method converges
at a geometric rate. In this case, it is easy to see that
10 x(t)/n = 10 x(0)/n and, therefore zi (t) → 10 x(0)/n,
so that the push-sum protocol successfully computes the
average of the initial values. When the perturbations are
nonzero, Lemma 1 states that if these perturbations decay
to zero, then the push-sum method still converges. Of
course, it will no longer be true that the convergence is
necessarily to the average of the initial values.
We will prove a series of auxiliary lemmas before
beginning the proof of Lemma 1. We first remark that
the matrices A(t) have a special structure that allows
us to efficiently analyze their products. Specifically, we
have the following properties of the matrices A(t) (see
[2], [8], [15], [19], [33] for proofs of this and similar
statements).
Lemma 2: Suppose that the graph sequence {G(t)}
is uniformly strongly connected. Then for each integer

6

s ≥ 0, there is a stochastic vector φ(s) such that for all
i, j and t ≥ s,
|[A0 (t)A0 (t − 1) · · · A0 (s + 1)A0 (s)]ij − φj (s)| ≤ Cλt−s ,
for some C and λ ∈ (0, 1).
There are known bounds on the parameters C, λ in
Lemma 2, which characterize how large C is and how
far away λ is from 1. Moreover, these bounds improve
if the sequence {G(t)} has some nice properties. The
following lemma is a formal statement to this effect.
Lemma 3: Let the graph sequence {G(t)} be uniformly strongly connected. Then, in the statement of
Lemma 2(b) we have
1/B

1
.
C = 2,
λ = 1 − nB
n
If in addition every graph G(t) is regular, we have
(
)
1/B
√
1
C = 2, λ = min
1− 3
, max σ2 (A(t)) .
t≥0
4n
Remark 1: We remark that this lemma is not novel
relative to the previous literature, but rather is an explicit
statement of the bounds that have been implicitly used
in the previous papers on the subject.
Proof: From [2], [8], [33], under the assumption of
the uniform strong connectivity of the graphs and our
definition of neighborhoods, we have that: if
x(t) = A0 (t − 1) · · · A0 (s)x(s) for all t > s ≥ 0,
then

b(t−s)/(nB)c
1
max xi (t)− min xi (t) ≤ 1 − nB
1≤i≤n
1≤i≤n
n


× max xi (s) − min xi (s) .
1≤i≤n

1≤i≤n

The preceding relation implies that for all t > s ≥ 0,

1/(nB) !t−s
1
max xi (t)− min xi (t) ≤ 2
1 − nB
1≤i≤n
1≤i≤n
n


× max xi (s) − min xi (s) .
1≤i≤n

1≤i≤n

This holds for every x(s). By choosing x(s) to be each
of the n basis vectors, we see that for every j = 1, . . . , n,
max[A0 (t) · · · A0 (s)]ij − min[A0 (t) · · · A0 (s)]ij
i
i
1/(nB) !t−s

1
.
≤2
1 − nB
n
Since each matrix A0 (t) is row-stochastic, the entry
φj (s) is a limit of the convex combinations of the n
numbers [A0 (t) · · · A0 (s)]ij , i = 1, . . . , n, as t → ∞.
Hence, we have proven the first relation of the lemma

for t > s. For t = s, since the matrix A(s) is columnstochastic and φ(s) is a stochastic vector, we obviously
have |[A0 (s)]ij − φj (s)| ≤ 2 for all i, j, showing that the
relation also holds for t = s.
As for the second statement, when the graphs G(t) are
regular, each of the matrices A(t) is doubly stochastic.
Then, the results of [17] imply that: if
x(t) = A0 (t − 1) · · · A0 (s)x(s) for all t > s ≥ 0,
then for all t > s ≥ 0 we have
b(t−s)/Bc

1
kx(s) − x̄s 1k2 ,
kx(t) − x̄s 1k2 ≤ 1 − 3
2n
where x̄s is the average of the entries of x(s). From the
preceding inequality, we can see that for all t > s ≥ 0,

(t−s)/B
1
kx(t) − x̄s 1k2 ≤ 2 1 − 3
kx(s) − x̄s 1k2 ,
2n
implying that
s
max |xi (s)−x̄s | ≤
i


(t−s)/B
1
2 1− 3
kx(s)−x̄s 1k.
2n

Moreover, since the relation above holds for any vector
x(s) ∈ Rn , by plugging in each basis vector ej ∈ Rn
and noting that kej − (1/n)1k ≤ 1, we obtain for each
j and all t > s ≥ 0,
!(t−s)/B
r
√
1
1
0
0
1− 3
.
max [A (t) · · · A (s)]ij − ≤ 2
1≤i≤n
n
2n
p
Since 1 − β/2 ≤ 1 − β/4
√ for all β ∈ (0, 1), it follows
that we may choose C = 2 and λ = (1−1/(4n3 ))1/B .
The same line of argument can be used to show that we
may choose C = 1 and λ = maxt≥0 σ2 (A(t)).
We will end up applying this lemma to a sequence
of graphs which is only uniformly strongly connected
after throwing out a few graphs at the start. To that
end, we have the following corollary whose proof is
straightforward.
Corollary 1: Suppose that the graph sequence G(t)
has the following property: there exists an integer T > 0
such that given any integer s ≥ 0, there is a time 0 ≤
ts ≤ T for which the graph sequence G(s + ts ), G(s +
ts +1), G(s+ts +2), . . . is uniformly strongly connected.
Then, for each integer s ≥ 0 there is a stochastic vector
φ(s) such that for all i, j and t ≥ s,
|[A0 (t) · · · A0 (s)]ij − φj (s)| ≤ Cλt−s−T ,
where C, λ may be taken as in Lemma 3.
In the proof of Lemma 1, we will make use of a lower
bound on the entries of the vectors 10 A(t) · · · A(0). To
provide such a bound, we employ the following intermediate result which provides a uniform lower bound
on the entries of the vectors 10 A0 (t) · · · A0 (0).

7

Lemma 4: Given a graph sequence {G(t)}, define


0
0 0
0
δ , inf
min [1 A (t) · · · A (0)]i .
t=0,1,...

1≤i≤n

If the graph sequence {G(t)} is uniformly strongly
1
. If each G(t) is regular, then
connected, then δ 0 ≥ nnB
0
δ = 1.
Proof: By the definition of matrices A(t) in Eq. (3),
we have [A(t)]ii = 1/di (t). Since di (t) ≤ n, it follows
that [A(t)]ii ≥1/n for all t and i. Therefore, for all i,
[A0 (t + 1) · · · A0 (0)]ii ≥

1 0
[A (t) · · · A0 (0)]ii for t ≥ 0.
n

Thus, we certainly have [10 A0 (t) · · · A0 (1)]i ≥ 1/nnB
for all i and all t in the range 1 ≤ t ≤ nnB . However,
it was shown in [8], [33] that for t > (n − 1)B, every
entry of A0 (t) · · · A0 (1) is positive and has value at least
1/nnB . Since nnB > (n − 1)B, this proves the bound
δ 0 ≥ 1/nnB . The final claim that δ 0 = 1 for a sequence
of regular graphs is trivial.
Remark 2: As an immediate consequence of the
definition of δ 0 , we have φj (s) ≥ δ 0 /n for all j.
By taking transposes and applying Lemmas 2, 3, 4, we
immediately obtain the following result on the products
A(t) · · · A(s). For convenience, let us adopt the notation
of denoting these products by A(t : s), i.e.,
A(t : s) , A(t) · · · A(s) for all t ≥ s ≥ 0.
Corollary 2: Let the graph sequence {G(t)} be
uniformly strongly connected. Then, the following statements are valid:
(a) There is a sequence {φ(t)} of stochastic vectors
φ(t) ∈ Rn such that the matrix difference A(t : s)−
φ(t)10 for t ≥ s decays geometrically, i.e., for all
i, j = 1, . . . , n,
|[A(t : s)]ij − φi (t)| ≤ Cλt−s

for all t ≥ s ≥ 0,

where we can always choose
λ = 1 − 1/nnB

C = 4,

1/B

.

If in addition each G(t) is regular, we may choose
√
1/B
C = 2 2, λ = 1 − 1/(4n3 )
,
or
C=

√

2,

λ = max σ2 (A(t)),
t≥0

whenever supt≥0 σ2 (A(t)) < 1.
(b) The quantity


δ = inf
min [A(t) · · · A(0)1]i
t=0,1,...

1≤i≤n

satisfies
δ≥

1
.
nnB

Moreover, if the graphs G(t) are regular, we have
δ = 1.
(c) The stochastic vectors φ(t) satisfy for all j,
φj (t) ≥

δ
n

for all times t ≥ 0.

Proof: The results follow by employing Corollary
1, Lemmas 3, 4, and Remark 2, where we take the
transposes of the matrices. The only aspect that needs to
be verified is that these lemmas can be applied, which is
routine, with the exception of the issue of the uniform
strong connectivity requiring elaboration.
When we transpose A(t : s) to obtain the product
A0 (s) · · · A0 (t − 1)A0 (t), we have reversed the order in
which the matrices appear. Moreover, by taking the transposes of each matrix, we have effectively reversed the
direction of every edge in each graph G(t). In the case
when supt≥0 σ2 (A(t)) < 1, it is easy to see that B = 1
and the resulting “reversed” sequence is connected at
every step, so the bound of Lemma 3 applies verbatim.
In the other cases, the resulting sequence is still strongly
connected once we throw out at most B graphs from the
start of the sequence. The bounds of Corollary 1 thus
apply with t − s replaced by t − s − B, which we take
care of by instead doubling the constant C.
The parameter δ as defined in Lemma 4 may be
thought of as a measure of imbalance of the influence
among the nodes. Indeed, δ is defined as the best lower
bound on the row sums of the matrices A(t : s). In
the case when each of the graphs G(t) is regular, the
matrices A(t) will be doubly stochastic and we will have
δ = 1 as previously remarked (i.e., no imbalance of
influence). By contrast, when δ ≈ 0, there is a node i
such that the i’th row in some A(t : s) will have entries
which are all nearly zero; in short, it is almost as if node
i has no in-neighbors at all.
We proceed with our sequence of intermediate lemmas
for the proof of Lemma 1. We will now need some auxiliary results on the convolution of two scalar sequences,
as in the following statement.
Lemma 5: ([25] Lemma 3.1) Let {γk } be a scalar
sequence.
(a) If limk→∞
Pkγk = γ and γ 0 < β < 1, then
limk→∞ `=0 β k−` γ` = 1−β
.
P∞
(b) If γk ≥ 0 for
all
k,
γ
<
∞ and 0 < β < 1,
k=0 k
P∞ Pk
k−`
then k=0
γ` < ∞.
`=0 β
With these pieces in places, we can now proceed to the
proof of Lemma 1. Our argument will rely on Corollary 2
on the products A(t : s) and the just-stated Lemma 5 on
the convolution sequences.
Proof of Lemma 1: (a) By inspecting Eq. (5) it is

8

easy to see that for all t ≥ 0,
x(t+1) = A(t : 0)x(0)+

t
X

Therefore,
A(t : s)(s)+(t+1), (6)

s=1

which implies
A(t+1)x(t+1) = A(t+1 : 0)x(0)+

t+1
X

A(t+1 : s)(s).

s=1

(7)
Moreover, since each A(t) is column-stochastic, we have
that 10 A(t) = 10 and Eq. (6) further implies that
10 x(t + 1) = 10 x(0) +

t+1
X

10 (s)

for all t ≥ 0. (8)

s=1

Now, from Eq. (7) and Eq. (8) we obtain for all t ≥ 0,
A(t + 1)x(t + 1) − φ(t + 1)10 x(t + 1)
= (A(t + 1 : 0) − φ(t + 1)10 ) x(0)
t+1
X
(A(t + 1 : s) − φ(t + 1)10 ) (s). (9)
+
s=1

According to Corollary 2, if we define D(t, s) to be

Observe that the denominator of the above fraction is n
times the i’th row sum of A(t : 0). By definition of δ,
this row sum is at least δ, and consequently
φi (t) n + [D(t : 0)1]i = [A(t : 0)1]i ≥ δ.

zi (t + 1) −

then we have the entry-wise decay bound
|[D(t, s)]ij | ≤ Cλt−s for all i, j and t ≥ s ≥ 0, (10)
where the constants C > 0 and λ ∈ (0, 1) have the
properties listed in Corollary 2.
Therefore, from relation (9) it follows for t ≥ 0,
A(t + 1)x(t + 1) = φ(t + 1)10 x(t + 1)
t+1
X

10 x(t)
n
Pt
n[D(t : 0)x(0)]i + n s=1 [D(t : s)(s)]i
=
n (φi (t) n + [D(t : 0)1]i )
10 x(t)[D(t : 0)1]i
−
.
n (φi (t) n + [D(t : 0)1]i )

zi (t + 1) −

Thus, for all i and t ≥ 1,

D(t, s) = A(t : s) − φ(t)10

+ D(t + 1, 0)x(0) +

10 x(t)
n
Pt
φi (t) 10 x(t) + [D(t : 0)x(0)]i + s=1 [D(t : s)(s)]i
=
φi (t) n + [D(t : 0)1]i
10 x(t)
.
−
n
By bringing the fractions to a common denominator,
after the cancellation of some terms, we find that
zi (t + 1) −

D(t + 1, s)(s).

s=1

Thus, for t ≥ 1 we have
w(t + 1) =A(t)x(t) = φ(t)10 x(t)
t
X
+ D(t, 0)x(0) +
D(t, s)(s). (11)
s=1

[D(t : 0)x(0)]i +

Factoring n out in the last term, and using estimates for
|[D(t : s)]ij | as given in (10), we obtain

+

0

= φ(t)1 y(0) + D(t, 0)y(0)
(12)

which holds for all t ≥ 0. From (11) and (12) we obtain
for every t ≥ 1 and all i,

s=1 [D(t : s)(s)]i

φi (t) n + [D(t : 0)1]i
|10 x(t)[D(t : 0)1]i |
+
n
(φi (t) n + [D(t : 0)1]i )
1
max |[D(t : 0)]ij | kx(0)k1
≤
j
δ
!


t
X
+
max |[D(t : s)]ij | k(s)k1
j
s=1


1 0
+ |1 x(t)| max |[D(t : 0)]ij | n.
j
nδ

zi (t + 1) −

y(t + 1) = A(t : 0)y(0)

Pt

≤

We may derive a similar expression for y(t + 1):

= φ(t)n + D(t : 0)1,

10 x(t)
n

C
δ

t
X

10 x(t)
C
≤ λt kx(0)k1
n
δ
!

λt−s k(s)k1 + |10 x(t)|λt

.

s=1

Now we look at the term 10 x(t). From Eq. (8), we have
|10 x(t)| ≤ kx(0)k1 +

t
X

k(s)k1 .

s=1

wi (t + 1)
From the preceding two relations it follows that for all
i and t ≥ 1,
yi (t + 1)
Pt
0
φi (t) 1 x(t) + [D(t : 0)x(0)]i + s=1 [D(t : s)(s)]i
10 x(t)
C
=
.
zi (t + 1) −
≤ λt kx(0)k1
φi (t) n + [D(t : 0)1]i
n
δ

zi (t + 1) =

9

 √
√
and the relation 2 t + 2 − 1 ≥ t + 1 for all t ≥ 1.
Corollary 3: Suppose that all the assumptions
√ of
Lemma 1 are satisfied. Moreover, let α(t) = 1/ t for
all t ≥ 1 and the perturbations i (t) are bounded as
follows:
√
for all t ≥ 1,
k(t)k1 ≤ D/ t

t

C X t−s
λ k(s)k1
δ s=1
!
t
X
C t
+ λ kx(0)k1 +
k(s)k1
δ
s=1
!
t
X
C
2λt kx(0)k1 + 2
λt−s k(s)k1 .
δ
s=1
+

=

Since we were able to choose C ≤ 4 in all the cases
considered in Lemma 2, we may choose C = 4 to obtain
the result in part (a).
(b) By letting t → ∞ in the preceding relation, since
λ ∈ (0, 1), we find that for all i,
lim zi (t + 1) −

t→∞

for some scalar D > 0. Defining
n

1X
x̄(t) =
xi (t)
n i=1

we have for every i = 1, . . . , n and t ≥ 1,

t
X
10 x(t)
≤ lim
λt−s k(s)k1 .
t→∞
n
s=1

t
X

t
X

t→∞

λt−s k(s)k1 = 0,

s=1

and the result follows from the preceding two relations.
(c) Since {α(t} is positive and non-increasing sequence,
we have α(t + 1) ≤ α(1) for all t ≥ 0 and α(t + 1) ≤
α(s) for all t ≥ s ≥ 0. Using these relations, we obtain
0

1 x(t)
n
!
t
X
t
t−s
α(1)λ kx(0)k1 +
λ α(s)k(s)k1 .(13)

α(t + 1) zi (t + 1) −
≤

2C
δ

α(k + 1) |zi (k + 1) − x̄(k)|

k=0

When i (t) → 0 for all i, then k(t)k1 → 0 and, by
Lemma 5(a), we conclude that
lim

for all t ≥ 0,

≤

8
(kx(0)k1 + D(1 + ln t)) ,
δ(1 − λ)

(15)

t
X
1
α(k + 1)|zi (k + 1) − x̄(k)|
Pt
k=0 α(k + 1) k=0

≤8

kx(0)k1 + D(1 + ln t)
√
,
δ(1 − λ) t + 1

(16)

where δ ∈ (0, 1] and λ ∈ (0, 1) are as given in Lemma 1.
Proof: From Lemma 1(a) we have for all i and all
t ≥ 1,
t
X

α(k + 1) |zi (k + 1) − x̄(k)|

k=1

s=1

P∞
Since λ ∈ (0, 1), the P
sum Pt=1 λt is finite, and by
∞
t
t−`
Lemma 5(b) the sum
α(`)k(`)k1 is
t=1
`=1 λ
finite. Therefore
∞
X
10 x(t)
< ∞.
α(t + 1) zi (t + 1) −
n
t=0

t

≤

8 X λk
√
kx(0)k1
δ
k+1
k=1

t
k
X
8X
+
α(k + 1)
λk−s k(s)k1
δ
s=1
k=1

t

≤

k

8 λ
8D X X λk−s
kx(0)k1 +
.
δ1−λ
δ
s
s=1
k=1

Lemma 1, which we have just proved, is the central
result of this section. It states that each of the sequences
zi (t+1) tracks the average x̄(t) = 10 x(t)/n increasingly
well as time goes on. We will later require a corollary
of this lemma showing that a weighted time-average of
each zi (t + 1) tracks a weighted time-average of x̄(t).
The next corollary gives a precise statement of this result.
In the derivation, we also use the following inequality
t
X
k=0

√

√
1
≥ t+1
k+1

Note that
t X
k
X
λk−s

(14)

t
X
1

k=0

√

1
≥
k+1

0

√

√

du
=2 t+2−1
u+1

s=1

=1+

s

λk−s ≤

k=s

t
X
1

1
.
s1−λ
s=1

t
X
1

s
s=2

Z t
≤1+

du
= 1 + ln t,
1 u

it follows that
t
X

Z t+1

t
t
X
1X

Furthermore, since

which follows by
t
X

s

k=1 s=1

s
s=1
for all t ≥ 1,

=

α(k + 1) |zi (k + 1) − x̄(k)|

k=1

≤

8λ
8D (1 + ln t)
kx(0)k1 +
.
δ(1 − λ)
δ 1−λ

(17)

10

Note that for k = 0, we have
α(1) |zi (1) − x̄(0)| ≤ |zi (1)| + |x̄(0)|.
From the definition of the perturbed push-sum method,
we have
wi (1)
,
zi (1) =
yi (1)
X xj (0)
X
1
,
yi (1) =
,
wi (1) =
dj (0)
dj (0)
in
in
j∈Ni (0)

j∈Ni (0)

where the last equality follows from yi (0) = 1 for all i.
Thus, zi (1) is a weighted average of the entries in x(0),
while x̄(0) is the average of these entries. Hence,
|zi (1)| + |x̄(0)| ≤ 2 max |xi (0)| ≤ 2kx(0)k1 .
i

Since δ ≤ 1, we see that
2
8
kx(0)k1 < kx(0)k1 . (18)
δ
δ
By summing the relations in Eqs. (17) and (18), we
obtain the estimate in Eq.
√ (15). The relation in Eq. (16)
follows from α(t) = 1/ t and Eqs. (14)–(15).
We conclude this section by noting that Lemma 1
and Corollary 3 can be extended to the case when xi (t)
(and, by extension, zi (t)) is a d-dimensional vector, by
applying the results to each coordinate component of the
space.
α(1) |zi (1) − x̄(0)| ≤

IV. C ONVERGENCE R ESULTS FOR
S UBGRADIENT-P USH M ETHOD
We turn now to the proofs of our main results, namely
Theorems 1 and 2. Our arguments will crucially rely
on the convergence results for the perturbed push-sum
method we have established in Section III.
We give a brief, informal summary of the main ideas
behind our argument. The convergence result for the
perturbed push-sum method of Section III implies that,
under the appropriate assumptions, the entries of zi (t)
get close to each other over time, and consequently
zi (t) approaches a multiple of the all-ones vector. Thus,
every node takes a subgradient of its own function fi at
nearly the same point. Over time, these subgradients are
averaged over the nodes by the push-sum-like updates of
our method. Consequently, the subgradient-push method
approximates the ordinary (centralized) subgradient
alPn
gorithm applied to the average function n1 j=1 fj .
We now begin the formal process of proving Theorems 1 and 2. In what follows, we will use a deterministic counterpart of the well-known (almost) supermartingale convergence result ([27]; see also [22], Lemma 11,
Chapter 2.2). The result is given in the following lemma.
Lemma 6: Let {vt } be a non-negative scalar sequence such that
vt+1 ≤ (1 + bt )vt − ut + ct

for all t ≥ 0,

where bt ≥ 0, ut ≥P0 and ct ≥ 0 for all t ≥ 0 with
P
∞
∞
Then, the sequence
t=0 bt < ∞, and
t=0 ct < ∞. P
∞
{vt } converges to some v ≥ 0 and t=0 ut < ∞.
Our first step is to establish a lemma pertinent to the
convergence of a sequence satisfying a subgradient-like
recursion. In the proof of this lemma, we make use of
Lemma 6.
Lemma 7: Consider a convex minimization problem minx∈Rm f (x), where f : Rm → R is a continuous
function. Assume that the solution set X ∗ of the problem
is nonempty. Let {xt } be a sequence such that for all
x ∈ X ∗ and for all t ≥ 0,
kxt+1 −xk2 ≤ (1+bt )kxt −xk2 −αt (f (xt ) − f (x))+ct ,
where bt ≥ 0, P
αt ≥ 0 and ct ≥ 0Pfor all t ≥ 0, with
P
∞
∞
∞
t=0 bt < ∞,
t=0 αt = ∞ and
t=0 ct < ∞. Then,
the sequence {xt } converges to some solution x∗ ∈ X ∗ .
Proof: By letting x = x∗ for arbitrary x∗ ∈ X ∗
and by defining f ∗ = minx∈Rm f (x), we obtain for all
t ≥ 0,
kxt+1 −x∗ k2 ≤ (1+bt )kxt −x∗ k2 −αt (f (xt ) − f ∗ )+ct .
Thus, all the conditions of Lemma 6 are satisfied, and
by this lemma we obtain the following statements:
{kxt − x∗ k2 } converges for each x∗ ∈ X ∗ ,
∞
X

αt (f (xt ) − f ∗ ) < ∞.

(19)
(20)

t=0

Since

P∞

t=0 αt = ∞, it follows from (20) that

lim inf f (xt ) = f ∗ .
t→∞

Let {xt` } be a subsequence of {xt } such that
lim f (xt` ) = lim inf f (xt ) = f ∗ .
t→∞

`→∞

(21)

Now, Eq. (19) implies that the sequence {xt } is bounded.
Thus, without loss of generality, we can assume that
{xt` } is converging to some x̃ (for otherwise, we can in
turn select a convergent subsequence of {xt` }). Therefore,
lim f (xt` ) = f (x̃),
`→∞

which by Eq. (21) implies that x̃ ∈ X ∗ . By letting x∗ =
x̃ in Eq. (19) we obtain that {xt } converges to x̃.
A key relations in the proofs of Theorems 1 and 2
will be obtained by applying Lemma 7 to the average
process
n

x̄(t) =

1X
xi (t)
n i=1

for t ≥ 0,

where xi (t) is the sequence generated by the
subgradient-push method. We will need to argue that

11

x̄(t) satisfies the assumptions of Lemma 7, for which
the following result will be instrumental.
Lemma 8: Under the same assumptions as in Theorem 1, we have for all v ∈ Rd and t ≥ 0,
2

We next consider a cross-term gi0 (t + 1)(x̄(t) − v)
in (23). For this term, we write
gi0 (t + 1)(x̄(t) − v) =gi0 (t + 1)(x̄(t) − zi (t + 1))
+ gi0 (t + 1)(zi (t + 1) − v).
(24)

2

kx̄(t + 1) − vk ≤ kx̄(t) − vk
2α(t + 1)
(F (x̄(t)) − F (v))
−
n
n
4α(t + 1) X
+
Li kzi (t + 1) − x̄(t)k
n
i=1
L2
+α2 (t + 1) 2 .
n

Using the subgradient boundedness and the CauchySchwarz inequality, we can lower bound the first term
gi0 (t + 1)(x̄(t) − zi (t + 1)) as
gi0 (t+1)(x̄(t)−zi (t+1)) ≥ −Li kx̄(t)−zi (t+1)k. (25)

Proof: Let us define x
e` (t) to be the vector in Rn
which stacks up the `’th entries of all the vectors xi (t):
formally, we define x
e` (t) to be the vector whose j’th
entry is the `’th entry of xj (t). Similarly, we define ge` (t)
to be the vector stacking up the `’th entries of the vectors
gi (t): the j’th entry of ge` (t) is the `’th entry of gj (t).
From the definition of the subgradient-push in Eq. (1)
it can be seen that for ` = 1, . . . , d,

As for the second term gi0 (t + 1)(zi (t + 1) − v), we can
use the fact that gi0 (t + 1) is the subgradient of fi (θ) at
θ = zi (t + 1) to obtain:
gi0 (t + 1)(zi (t + 1) − v) ≥ fi (zi (t + 1)) − fi (v),
from which, by adding and subtracting fi (x̄(t)) and
using the Lipschitz continuity of fi (implied by the
subgradient boundedness), we further obtain
gi0 (t + 1)(zi (t + 1) − v) ≥

x
e` (t + 1) = A(t)e
x` (t) − α(t + 1)e
g` (t + 1).
Since A(t) is a column-stochastic matrix, it follows that
for all ` = 1, . . . , d,

By substituting the estimates of P
Eqs. (25)–(26) back in
n
relation (24), and using F (x) = i=1 fi (x) we obtain

n
n
n
1X
1X
α(t + 1) X
[e
x` (t+1)]j =
[e
x` (t)]j −
[e
g` (t+1)]j .
n j=1
n j=1
n
j=1

n
X

gi0 (t + 1)(x̄(t) − v) ≥ F (x̄(t)) − F (v)

i=1

−2
Since the `’th entry of x̄(t + 1) is exactly the left-hand
side above, we can conclude that
x̄(t + 1) = x̄(t) −

n
X

α(t + 1)
gj (t + 1).
n
j=1

(22)

Now let v ∈ Rd be an arbitrary vector. From relation (22) it follows that for all t ≥ 0,
kx̄(t + 1) − vk2 =kx̄(t) − vk2
−

n
2α(t + 1) X 0
gi (t + 1)(x̄(t) − v)
n
i=1

n
α2 (t + 1) X
+
gi (t + 1)
n2
i=1

2

.

Since the subgradient norms of each fi are uniformly
bounded by Li , it further follows that for all t ≥ 0,
kx̄(t + 1)−vk2 ≤ kx̄(t) − vk2
n
2α(t + 1) X 0
gi (t + 1)(x̄(t) − v)
−
n
i=1
+ α2 (t + 1)

L2
.
n2

(23)

−Li kzi (t + 1) − x̄(t)k
+fi (x̄(t)) − fi (v). (26)

n
X

Li kzi (t + 1) − x̄(t)k.

(27)

i=1

Now, we substitute estimate (27) into relation (23) and
obtain for any v ∈ Rd and all t ≥ 0,
kx̄(t + 1) − vk2 ≤ kx̄(t) − vk2
2α(t + 1)
(F (x̄(t)) − F (v))
−
n
n
4α(t + 1) X
+
Li kzi (t + 1) − x̄(t)k
n
i=1
+ α2 (t + 1)

L2
.
n2

With all the pieces in place, we are finally ready to
prove Theorem 1. The proof idea is to show that the
averages x̄(t), as defined in Lemma 8, converge to some
solution x∗ ∈ Z ∗ and then show that zi (t + 1) − x̄(t)
converges to 0 for all i, as t → ∞. The last step will
be accomplished by invoking Lemma 1 on the perturbed
push-sum protocol.
Proof of Theorem 1: We begin by observing that the
subgradient-push method may be viewed as an instance
of the perturbed push-sum protocol. Indeed, let us adopt
the notation x
e` (t), ge` (t) from the proof of Lemma 8, and
moreover let us define w
e` (t), ze` (t) similarly. Then, the

12

definition of subgradient-push method implies that for
all ` = 1, . . . , d,
w
e` (t + 1) = A(t)e
x` (t),
y(t + 1) = A(t)y(t),
[w
e` (t + 1)]i
for i = 1, . . . , n,
[e
z` (t + 1)]i =
yi (t + 1)
x
e` (t + 1) = w
e` (t + 1) − α(t + 1)e
g` (t + 1).
Next, we will apply Lemma 1(b) to each coordinate
` = 1, . . . , d. Since α(t) → 0 and the subgradients are
bounded, the assumptions of Lemma 1(b) are satisfied
with (t+1) = α(t+1)e
g` (t+1) for each `. From Lemma
1(b) we conclude that
Pn
x` (t)]j
j=1 [e
=0
z` (t + 1)]i −
lim [e
t→∞
n
for all ` = 1, . . . , d and all i = 1, . . . , n, which is
equivalent to
lim kzi (t+1)− x̄(t)k = 0 for all i = 1, . . . , n. (28)

t→∞

Now, we will apply Lemma 1(c) to each coordinate
` = 1, . . . , d. Let ` be arbitrary but fixed coordinate index. Since the subgradients gi (s) are uniformly bounded,
using (t + 1) = α(t + 1)e
g` (t + 1) we can see that for
all i = 1, . . . , n and all t ≥ 1,
|i (t)| ≤ α(t)ke
g` (t)k∞ ≤ α(t) max kgi (t)k.
i
P∞
Therefore, since by assumption t=1 α(t)2 < ∞ and
kgi (t)k ≤ Li , we obtain for all i = 1, . . . , n,
∞
∞

X
X
α(t)|i (t)| ≤ max Li
α2 (t) < ∞.
i

t=1

t=1

In view of the preceding relation and the assumption
that the sequence {α(t)} is non-increasing, by applying
Lemma 1(b) to each coordinate ` = 1, . . . , d, we obtain
Pn
∞
X
x` (t)]j
j=1 [e
α(t + 1) [e
z` (t + 1)]i −
< ∞,
n
t=0
for all ` = 1, . . . , d and all i = 1, . . . , n, implying that
∞
X

α(t + 1)kzi (t + 1) − x̄(t)k < ∞

for all i. (29)

where F ∗ is the optimal value (i.e., F ∗ = F (z ∗ ) for any
z ∗ ∈ Z ∗ ). In view of Eq. (29), it follows that
n
∞
X
4α(t + 1) X
t=0

n

P∞
Also,
t=1 α(t) = ∞ and
P∞ by2 assumption we have
α
(t)
<
∞.
Thus,
the
conditions
of Lemma 7
t=1
are satisfied (note that F (·) is continuous since it is
convex over Rd ), and by this lemma we conclude that
the average sequence {x̄(t)} converges to a solution
zb ∈ Z ∗ . By Eq. (28) it follows that each sequence
{zi (t)}, i = 1, . . . , n, converges to the same solution zb.
Having proven Theorem 1, we now turn to the proof
of the convergence rate results of Theorem 2. The first
step will be a slight modification of the result we have
just proved - whereas the proof of Theorem 1 relies on
F (x̄(t)) → F ∗ , in order to obtain a convergence rate we
will now need to argue that we can replace F (x̄(t)) by
F evaluated at a running average of the vectors zi (t) for
any i. This is stated precisely in the following lemma.
Lemma 9: If all the√assumptions of Theorem 1 are
satisfied and α(t) = 1/ t, then for all t ≥ 1 and any
z∗ ∈ Z ∗,
!
Pt
α(k
+
1)x̄(k)
k=0
F
− F (z ∗ )
P
t
α(k
+
1)
k=0
L2 (1 + ln(t + 1))
n kx̄(0) − z∗ k1
√
√
+
≤
2
t
+
1
2n
t+1
Pn
16L j=1 kxj (0)k1
√
+
δ(1 − λ) t + 1
16dL2 (1 + ln t)
√
+
.
δ(1 − λ) t + 1
Proof: From Lemma 8 we have for any v and t ≥ 0,
t
X
2α(k + 1)
k=0

+
+

n

Now, observe that we can apply Lemma 8 with v = z ∗
for any solution z ∗ ∈ Z ∗ to obtain
kx̄(t + 1) − z ∗ k2 ≤ kx̄(t) − z ∗ k2
2α(t + 1)
−
(F (x̄(t)) − F ∗ )
n
n
4α(t + 1) X
+
Li kzi (t + 1) − x̄(t)k
n
i=1
+ α2 (t + 1)

L2
,
n2

(30)

(F (x̄(k)) − F (v)) ≤ kx̄(0) − vk2

t
n
X
4α(k + 1) X
k=0
t
X

n
α2 (k + 1)

k=0

t=0

Li kzi (t + 1) − x̄(t)k < ∞.

i=1

Li kzi (k + 1) − x̄(k)k

i=1
2

L
.
n2

From the
preceding relation, recalling the notation
Pt−1
S(t) = k=0 α(k + 1) and dividing by (2/n)S(t + 1),
we obtain for any v ∈ Rd and t ≥ 0,
Pt
n kx̄(0) − vk2
k=0 α(k + 1)F (x̄(k))
− F (v) ≤
S(t + 1)
2 S(t + 1)
t
n
X
X
2
+
α(k + 1)
Li kzi (k + 1) − x̄(k)k
S(t + 1)
i=1
k=0
t
X
1
L2
+
α2 (k + 1) .
S(t + 1)
2n
k=0

13

Now setting v = z∗ for some z∗ ∈ Z ∗ and using the
convexity of F , we obtain
!
Pt
k=0 α(k + 1)x̄(k)
F
− F (z∗ )
S(t + 1)
n kx̄(0) − z∗ k2
≤
2 S(t + 1)
t
n
X
X
2
+
α(k + 1)
Li kzi (k + 1) − x̄(k)k
S(t + 1)
i=1
k=0
t
X
1
L2
+
α2 (k + 1) .
(31)
S(t + 1)
2n
k=0

We now bound the quantities kzi (k + 1) − x̄(k)k in
Eq. (31) by applying Corollary 3 to each of its components. Specifically, we will apply Eq. (15) of the corollary. Observe that all the assumption of that corollary
have been assumed to hold. Moreover, the constant
Pn D in
the statement of the corollary can be taken to be i=1 Li
when the corollary is applied to the `’th component of
the vector zi (t+1)−x̄(t). Adopting our notation of x
e` (t)
and ze` (t) as the vectors consisting of the `’th components
of the vectors xi (t) and zi (t), i = 1, . . . , n, respectively,
we have for all ` = 1, . . . , d, all i = 1, . . . , n, and t ≥ 1,

We are now finally in position to prove Theorem 2.
At this point, the proof is a simple
of
 Pcombination

t
k=0 α(k+1)x̄(k)
P
Lemma 9, which tells us that F
t
α(k+1)
k=0
approaches
F (z ∗ ), along with
Pt
Pt Corollary 3, which tells us
α(k+1)zi (k+1)
k=0 α(k+1)x̄(k)
Pt
that P
and k=0
get close
t
k=0 α(k+1)
k=0 α(k+1)
to each other over time for all i = 1, . . . , n.
Proof of Theorem 2: It is easy to see by induction
that the vectors e
zi (t) defined in the statement of Theorem 2 are weighted time-averages of zi (t), given by: for
all i,
Pt
α(k + 1)zi (k + 1)
e
for all t ≥ 0.
zi (t + 1) = k=0
Pt
k=0 α(k + 1)
By the boundedness of subgradients we obtain for all i
and t ≥ 0,
!
Pt
k=0 α(k + 1)x̄(k)
F (e
zi (t + 1)) − F
Pt
k=0 α(k + 1)
t
X
L
α(k + 1)kzi (k + 1) − x̄(k)k.
k=0 α(k + 1) k=0

≤ Pt

8
(ke
x` (0)k1 + L(1 + ln t))
δ(1 − λ)

Applying Corollary 3 to each of the coordinates of the
vectors zi (k + 1) and x̄(t), we obtain for all i and t ≥ 1,
!
Pt
α(k
+
1)x̄(k)
k=0
F (e
zi (t + 1)) − F
P
t
α(k + 1)
 k=0

n
X
8L

√
kxj (0)k1 + dL(1 + ln t) .
≤
δ(1 − λ) t + 1 j=1

(cf. Eq. (15) of Corollary 3), which after some elementary algebra implies that

By summing the preceding relation and that of Lemma
9, we have

t
X
k=0

≤

t
X

√

n
1X
1
[e
z` (k + 1)]i −
[e
x` (k)]j
n j=1
k+1

α(k + 1)

n
X

F (e
zi (t + 1)) − F (z ∗ ) ≤

Li kzi (k + 1) − x̄(k)k1

i=1

k=0

+

n
X
8
≤
L
kxj (0)k1
δ(1 − λ) j=1

+

8d
L2 (1 + ln t) .
δ(1 − λ)

(32)

Now, using the fact that the Euclidean norm of a vector
is not larger that its 1-norm, we substitute Eq. (32) into
Eq. (31). Then, using the definition of S(t) and Eq. (14)
we bound the denominator in Eq. (31) as follows
S(t + 1) =

t
X

α(k + 1) ≥

√

t + 1.

k=0

The preceding relation and
t
X
k=0

α2 (k + 1) =

t+1
X
1
s=1

s

Z t+1
≤ 1+

yield the stated estimate.

1

L2 (1 + ln(t + 1))
√
2n
Pn t + 1
24L j=1 kxj (0)k1
√
+
δ(1 − λ) t + 1
24dL2 (1 + ln t)
√
+
,
δ(1 − λ) t + 1

n kx̄(0) − z∗ k1
√
2
t+1

dx
= 1 + ln(t + 1),
x

thus proving Theorem 2.
V. S IMULATIONS
We report some simulations of the subgradient-push
method which experimentally demonstrate that its performance is often quite scalable.
Pn We will be optimizing
the scalar function F (θ) = i=1 pi (θ − ui )2 where ui
is a variable that is known only to node i. This is a
canonical problem in distributed estimation: the nodes
b and node i
are attempting to measure a parameter θ,
b
measures ui = θ + wi where wi are jointly Gaussian
and zero mean. Letting pi be the inverse of the variance

14

Error decay with time

Error decay with time

35

40

30

35

30
25

25
20
20
15
15
10
10

5

0

5

0

20

40

60

80

100

120

140

160

180

200

1800

0

9000

1600

20

40

60

80

100

120

140

160

180

200

Number of iterations needed to reach a neighborhood of the optimal point

8000

1400

7000

1200

6000

1000

5000

800

4000

600

3000

400

2000

200

0
10

0

1000

20

30

40

50

60

70

80

90

Fig. 1. Top plot: the number of iterations (x-axis) and kz(t) − θ∗ 1k
(y-axis), where z(t) is the vector stacking up all zi (t), for an instance
of a graph with 1000 nodes. Bottom plot: the number of nodes (x-axis)
and the average time until kz(t) − θ∗ 1k ≤ 0.1 over 30 simulations
(y-axis).

of wi , the maximum likelihood estimate is the minimizer
θ∗ of F (θ) (θ∗ is unique provided that pi > 0 for at least
one i). We set a half of the pi ’s to zero (corresponding
to nodes not taking any measurements) and the other
half of the pi ’s to uniformly random values between 0
and 1. The initial points xi (0) are independent random
variables, each with a standard Gaussian distribution.
Figure 1 shows the results for simple random graphs
where every node has two out-neighbors, one belonging
to a fixed cycle and the other one chosen uniformly at
random at each step. The plot to the left shows how
kz(t) − θ∗ 1k decays on a graph instance with n = 1000
nodes, while one to the right shows the (average) time
until kz(t)−θ∗ 1k ≤ 0.1 as the size n of the graph varies
from 10 to 90. Figure 2 illustrates the same quantities
for the sequence of graphs which alternate between two
(undirected) star graphs.
We see that in both cases the initial decay is quite
fast and it takes only a small number of iterations to
bring all the nodes reasonably close to the optimal value.
Nevertheless, the decay is clearly sub-geometric, consis-

0
10

20

30

40

50

60

70

80

90

Fig. 2. Top plot: the number of iterations (x-axis) and kz(t) − θ∗ 1k
(y-axis), where z(t) is the vector stacking up all zi (t), for an instance
of a graph with n = 1000 nodes. Bottom plot: the number of nodes
(x-axis) and the average time until kz(t) − θ∗ 1k ≤ 0.1 over 50
simulations (y-axis).

tently with the bounds we have proven. However, in both
cases, the time to get the error below the threshold of
0.1 starting from random initial points appears to scale
linearly in the number of nodes.
VI. C ONCLUSIONS
We have introduced the subgradient-push, a broadcastbased distributed protocol for distributed optimization
of a sum of convex functions over directed graphs. We
have shown that, as long as the communication graph
sequence {G(t)} is uniformly strongly connected, the
subgradient-push succeeds in driving all the nodes to
the same point in the set of optimal solutions. Moreover,
√
the objective function converges at a rate of O(ln t/ t),
where the constant depends on the initial vectors, bounds
on subgradient norms, consensus speed λ of the graph
sequence {G(t)}, as well as a measure of the imbalance
of influence δ among the nodes.
Our results motivate the open problems associated
with understanding how the consensus speed λ depends
on properties of the sequence {G(t)}. Similarly, it is also

15

natural to ask how the measure of imbalance of influence
δ depends on the combinatorial properties of the graphs
G(t), namely how it depends on the diameters, the size
of the smallest cuts, and other pertinent features of the
graphs in the sequence {G(t)}.
R EFERENCES
[1] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli.
Weighted gossip: distributed averaging using non-doubly stochastic matrices. In Proceedings of the 2010 IEEE International
Symposium on Information Theory, Jun. 2010.
[2] V.D. Blondel, J.M. Hendrickx, A. Olshevsky, and J.N. Tsitsiklis.
Convergence in multiagent coordination, consensus, and flocking.
In Proceedings of the IEEE Conference on Decision and Control,
Dec 2005.
[3] J. Chen and A. H. Sayed. Diffusion adaptation strategies for
distributed optimization and learning over networks. IEEE
Transactions on Signal Processing, 60(8):4289–4305, 2012.
[4] A.D. Dominguez-Garcia and C. Hadjicostis. Distributed strategies for average consensus in directed graphs. In Proceedings of
the IEEE Conference on Decision and Control, Dec 2011.
[5] J.C. Duchi, A. Agarwal, and M.J. Wainwright. Dual averaging
for distributed optimization: Convergence analysis and network
scaling. IEEE Transactions on Automatic Control, 57(3):592 –
606, March 2012.
[6] B. Gharesifard and J. Cortes.
Distributed continuoustime convex optimization on weight-balanced digraphs.
http://arxiv.org/pdf/1204.0304.pdf, 2012.
[7] F. Iutzeler, P. Ciblat, and W. Hachem. Analysis of sum-weightlike algorithms for averaging inwireless sensor networks. IEEE
Transactions on Signal Processing, 6(11), 2013.
[8] A. Jadbabaie, J. Lin, and A.S. Morse. Coordination of groups of
mobile autonomous agents using nearest neighbor rules. IEEE
Transactions on Automatic Control, 48:988–1001, 2003.
[9] B. Johansson. On distributed optimization in networked systems.
PhD thesis, Royal Institute of Technology (KTH), tRITA-EE
2008:065, 2008.
[10] D. Kempe, A Dobra, and J. Gehrke. Gossip-based computation
of aggregate information. In Proceedings of the 44th Annual
IEEE Symposium on Foundations of Computer Science, pages
482–491, Oct. 2003.
[11] H. Li and Z. Han. Competitive spectrum access in cognitive
radio networks: graphical game and learning. In IEEE Wireless
Communications and Networking Conference, pages 1–6, 2010.
[12] I. Lobel and A. Ozdaglar. Distributed subgradient methods for
convex optimization over random networks. IEEE Transactions
on Automatic Control, 56(6):1291 –1306, June 2011.
[13] I. Lobel, A. Ozdaglar, and D. Feijer. Distributed multi-agent
optimization with state-dependent communication. Mathematical
Programming, 129(2):255–284, 2011.
[14] C. Lopes and A.H. Sayed. Incremental adaptive strategies over
distributed networks. IEEE Transactions on Signal Processing,
55(8):4046–4077, 2007.
[15] L. Moreau. Stability of multiagent systems with time-dependent
communication links. IEEE Transactions on Automatic Control,
50:169–182, 2005.
[16] A. Nedić. Asynchronous broadcast-based convex optimizatio
over a network. IEEE Transactions on Automatic Control,
56(6):1337–1351, 2011.
[17] A. Nedić, A. Olshevsky, A. Ozdaglar, and J.N. Tsitsiklis. On
distributed averaging algorithms and quantization effects. IEEE
Transactions on Automatic Control, 54(11):2506–2517, 2009.
[18] A. Nedić and A. Ozdaglar. Distributed subgradient methods
for multi-agent optimization. IEEE Transactions on Automatic
Control, 54(1):48 –61, Jan. 2009.
[19] A. Nedić and A. Ozdaglar. Convergence Rate for Consensus with
Delays. Journal of Global Optimization, 47:437456, 2010.

[20] G. Neglia, G. Reina, and S. Alouf. Distributed gradient optimization for epidemic routing: A preliminary evaluation. In IEEE
Wireless Days, 2nd IFIP, pages 1–6, 2009.
[21] A. Olshevsky. Efficient information aggregation for distributed
control and signal processing. PhD thesis, MIT, 2010.
[22] B. Polyak. Introduction to optimization. Optimization software,
Inc., Publications division, New York, 1987.
[23] M. Rabbat and R.D. Nowak. Distributed optimization in sensor
networks. In IPSN, pages 20–27, 2004.
[24] S. S. Ram, A. Nedić, and V. V. Veeravalli. A new class of
distributed optimization algorithms: application to regression of
distributed data. Optimization Methods and Software, 27(1):71–
88, 2012.
[25] S.S. Ram, A. Nedić, and V.V. Veeravalli. Distributed Stochastic
Subgradient Projection Algorithms for Convex Optimization.
Journal of Optimization Theory and Applications, 147:516–545,
2010.
[26] S.S. Ram, V.V. Veeravalli, and A. Nedić. Distributed nonautonomous power control through distributed convex optimization. In IEEE INFOCOM, pages 3001–3005, 2009.
[27] H. Robbins and D. Siegmund. A convergence theorem for
nonnegative almost supermartingales and some applications. Optimizing Methods in Statistics, Academic Press, New York, pages
233 – 257, 1971.
[28] K. Srivastava and A. Nedić. Distributed asynchronous constrained stochastic optimization. IEEE J. Sel. Topics. Signal
Process., 5(4):772–790, 2011.
[29] K.I. Tsianos. The role of the Network in Distributed Optimization
Algorithms: Convergence Rates, Scalability, Communication /
Computation Tradeoffs and Communication Delays. PhD thesis,
McGill University, Dept. of Electrical and Computer Engineering,
2013.
[30] K.I. Tsianos, S. Lawlor, and M.G. Rabbat. Consensus-based
distributed optimization: Practical issues and applications in
large-scale machine learning. In Proceedings of the 50th Allerton
Conference on Communication, Control, and Computing, 2012.
[31] K.I. Tsianos, S. Lawlor, and M.G. Rabbat. Push-sum distributed
dual averaging for convex optimization. In Proceedings of the
IEEE Conference on Decision and Control, 2012.
[32] K.I. Tsianos and M.G. Rabbat. Distributed consensus and
optimization under communication delays. In Proc. of Allerton
Conference on Communication, Control, and Computing, pages
974–982, 2011.
[33] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31:803–812,
1986.

