A Push-Pull Gradient Method for Distributed Optimization in Networks

arXiv:1803.07588v3 [math.OC] 1 Aug 2019

Shi Pu, Wei Shi, Jinming Xu, and Angelia Nedić

Abstract— In this paper, we focus on solving a distributed
convex optimization problem in a network, where each agent
has its own convex cost function and the goal is to minimize
the sum of the agents’ cost functions while obeying the network
connectivity structure. In order to minimize the sum of the
cost functions, we consider a new distributed gradient-based
method where each node maintains two estimates, namely, an
estimate of the optimal decision variable and an estimate of the
gradient for the average of the agents’ objective functions. From
the viewpoint of an agent, the information about the decision
variable is pushed to the neighbors, while the information about
the gradients is pulled from the neighbors (hence giving the
name “push-pull gradient method”). The method unifies the
algorithms with different types of distributed architecture, including decentralized (peer-to-peer), centralized (master-slave),
and semi-centralized (leader-follower) architecture. We show
that the algorithm converges linearly for strongly convex and
smooth objective functions over a directed static network. In
our numerical test, the algorithm performs well even for timevarying directed networks.

This is a preliminary version of the paper [1].
I. I NTRODUCTION
In this paper, we consider a system involving n agents
whose goal is to collaboratively solve the following problem:
minp f (x) :=

x∈R

n
P

fi (x),

(1)

i=1

where x is the global decision variable and each function
fi : Rp → R is convex and known by agent i only. The
agents are embedded in a communication network, and their
goal is to obtain an optimal and consensual solution through
local neighbor communications and information exchange.
This local exchange is desirable in situations where privacy
needs to be preserved, or the exchange of a large amount of
data is prohibitively expensive due to limited communication
resources.
To solve problem (1) in a networked system of n agents,
many algorithms have been proposed under various assumptions on the objective functions and the underlying network [2]–[27]. Centralized algorithms are discussed in [2],
where extensive applications in learning can be found. Parallel, coordinated, and asynchronous algorithms are discussed
in [3] and the references therein.
Our emphasis in the literature review is on the decentralized optimization since our approach builds on a new
*This work was supported in parts by the NSF grant CCF-1717391 and
by the ONR grant no. N00014-12-1-0998.
Shi Pu, Wei Shi, Jinming Xu and Angelia Nedić are
with the School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ 85287, USA.

(emails: shipu3@asu.edu, wshi36@asu.edu,
jinming.xu@asu.edu, Angelia.Nedich@asu.edu)

understanding of the decentralized consensus-based methods for directed communication networks. Most references,
including [4]–[15], often restrict the underlying network
connectivity structure, or more commonly require doubly
stochastic mixing matrices. The work in [4] has been the
first to demonstrate the linear convergence of an ADMMbased decentralized optimization scheme. Reference [5] uses
a gradient difference structure in the algorithm to provide
the first-order decentralized optimization algorithm which is
capable of achieving the typical convergence rates of a centralized gradient method, while references [6], [7] deal with
the second-order decentralized methods. By using Nesterov’s
acceleration, reference [8] has obtained a method whose
convergence time scales linearly in the number of agents
n, which is the best scaling with n currently known. More
recently, for a class of so-termed dual friendly functions,
papers [9], [10] have obtained an optimal decentralized
consensus optimization algorithm whose dependency on the
condition
number1 κ of the system’s objective function
Pn
f (xi ) achieves the best known scaling in the order
i=1√
of O( κ). Work in [14], [15] investigates proximal-gradient
methods which can tackle (1) with proximal friendly component functions. Paper [20] extends the work in [4] to handle
asynchrony and delays. References [21], [22] considers a
stochastic variant of problem (1) in asynchronous networks.
A tracking technique has been recently employed to develop decentralized algorithms for tracking the average of
the Hessian/gradient in second-order methods [7], allowing
uncoordinated step-size [11], [12], handling non-convexity
[13], and achieving linear convergence over time-varying
graphs [23].
For directed graphs, to eliminate the need of constructing
a doubly stochastic matrix in reaching consensus2 , reference
[28] proposes the push-sum protocol. Reference [29] has
been the first to propose a push-sum based distributed
optimization algorithm for directed graphs. Then, based on
the push-sum technique again, a decentralized subgradient
method for time-varying directed graphs has been proposed
and analyzed in [16]. Aiming to improve convergence for
a smooth objective function and a fixed directed graph,
work in [17], [18] modifies the algorithm from [5] with the
push-sum technique, thus providing a new algorithm which
converges linearly for a strongly convex objective function
on a static graph. However, the algorithm has some stability
1 The condition number of a smooth and strongly convex function is the
ratio of its gradient Lipschitz constant and its strong convexity constant.
2 Constructing a doubly stochastic matrix over a directed graph needs
weight balancing which requires an independent iterative procedure across
the network; consensus is a basic element in decentralized optimization.

issues, which have been resolved in [23] in a more general
setting of time-varying directed graphs.
In this paper, we introduce a modified gradient-tracking algorithm for decentralized (consensus-based) optimization in
directed graphs. Unlike the push-sum protocol, our algorithm
uses a row stochastic matrix for the mixing of the decision
variables, while it employs a column stochastic matrix for
tracking the average gradients. Although motivated by a fully
decentralized scheme, we will show that our algorithm can
work both in fully decentralized networks and in two-tier
networks. The contributions of this paper include the design
of a new decentralized algorithm3 and the establishment
of its linear convergence for a static directed graph. We
numerically evaluate our proposed algorithm for both static
and time-varying graphs, and find that the algorithm is
competitive as compared to the linearly convergent algorithm
developed in [23].
The structure of this paper is as follows. We first provide
notation and state basic assumptions in Subsection I-A. Then,
we introduce our algorithm in Section II along with the
intuition of its design and some examples explaining how it
relates to (semi-)centralized and decentralized optimization.
We establish the linear convergence of our algorithm in
Section III, while in Section IV we conduct numerical test to
verify our theoretical claims. Concluding remarks are given
in Section V.
A. Notation and Assumptions
Throughout the paper, vectors default to columns if not
otherwise specified. Let each agent i ∈ {1, 2, . . . , n} hold
a local copy xi ∈ Rp of the decision variable and an
auxiliary variable yi ∈ Rp tracking the average gradients,
where their values at iteration k are denoted by xi,k and
yi,k , respectively. Let
x := [x1 , x2 , . . . , xn ]⊺ ∈ Rn×p ,
y := [y1 , y2 , . . . , yn ]⊺ ∈ Rn×p .

Define F (x) to be an aggregate
Pn objective function of the
local variables, i.e., F (x) := i=1 fi (xi ), and write
∇F (x) := [∇f1 (x1 ), ∇f2 (x2 ), . . . , ∇fn (xn )] ∈ Rn×p .
⊺

Definition 1: Given an arbitrary vector norm k · k on Rn ,
for any x ∈ Rn×p , we define
h
i
kxk := kx(1) k, kx(2) k, . . . , kx(p) k
,
2

(1)

(2)

(p)

n

where x , x , . . . , x ∈ R are columns of x, and k · k2
represents the 2-norm.
A directed graph is a pair G = (V, E), where V is the set
of vertices (nodes) and the edge set E ⊆ V × V consists of
ordered pairs of vertices. A directed tree is a directed graph
where every vertex, except for the root, has only one parent.
A spanning tree of a directed graph is a directed tree that
connects the root to all other vertices in the graph (see [31]).
3 While completing the paper, we became aware of a recent work
discussing an algorithm that is similar to ours [30]. We have independently
arrived to our method and results.

Given a nonnegative matrix M ∈ Rn×n , the directed graph
induced by the matrix M is denoted by GM = (VM , EM ),
where VM = {1, 2, . . . , n} and (j, i) ∈ EM iff Mij > 0. We
let RM denote the set of roots of all directed spanning trees
in the graph GM .
We use the following assumption on the functions fi in (1).
Assumption 1: Each fi is µ-strongly convex and its gradient is L-Lipschitz continuous, i.e., for any x, x′ ∈ Rp ,
h∇fi (x) − ∇fi (x′ ), x − x′ i ≥ µkx − x′ k2 ,

(2)
k∇fi (x) − ∇fi (x′ )k ≤ Lkx − x′ k.
Under Assumption 1, there exists a unique optimal solution
x∗ ∈ R1×p to problem (1).
II. A P USH -P ULL G RADIENT M ETHOD
The aggregated form of the proposed algorithm, termed
push-pull gradient method (Push-Pull), works as follows:
Initialize with any x0 and y0 = ∇F (x0 ), and update
according to the following rule for k ≥ 0,
xk+1 = R(xk − αyk ),
yk+1 = C (yk + ∇F (xk+1 ) − ∇F (xk )) ,

(3a)
(3b)

where R, C ∈ Rn×n . We make the following standard
assumption on the matrices R and C.
Assumption 2: We assume that R ∈ Rn×n is nonnegative4
row-stochastic and C ∈ Rn×n is nonnegative columnstochastic, i.e., R1 = 1 and 1⊺ C = 1⊺ .
Lemma 1: Under Assumption 2, the matrix R has a nonnegative left eigenvector u⊺ (w.r.t. eigenvalue 1) with u⊺ 1 =
n, and the matrix C has a nonnegative right eigenvector v
(w.r.t. eigenvalue 1) with 1⊺ v = n (see [32]).
The next condition ensures that 1 is a simple eigenvalue of
both R and C.
Assumption 3: The diagonal entries of R and C are positive, i.e., Rii > 0 and Cii > 0 for all i ∈ V.
Finally, we give the condition on the structures of GR and
GC . This assumption is weaker than requiring that both GR
and GC are strongly connected.
Assumption 4: The graphs GR and GC T each contain at
least one spanning tree. Moreover, RR ∩ RC ⊺ 6= ∅.
Supposing that we have a strongly connected communication graph G, there are multiple ways to construct GR and
GC satisfying Assumption 4. One trivial approach is to set
GR = GC = G. Another way is to pick at random ir ∈ V and
let GR (respectively, GC ) be a spanning tree (respectively,
reversed spanning tree) contained in G with ir as its root.
Once graphs GR and GC are established, matrices R and C
can be designed accordingly.
To provide some intuition for the development of this
algorithm, let us consider the optimality condition for (1)
in the following form:
x∗ ∈ null{I − R},
1⊺ ∇F (x∗ ) = 0,
4 A matrix is nonnegative iff all its elements are nonnegative.

(4a)
(4b)

where R satisfies Assumption 2. Consider now the algorithm
in (3). Suppose that the algorithm produces two sequences
{xk } and {yk } converging to some points x∞ and y∞ ,
respectively. Then from (3a) and (3b) we would have
(I − R)(x∞ − αy∞ ) + αy∞ = 0,
(I − C)y∞ = 0.

(5a)



1
 0
C =
 0
0


0.5 0.5 0.5
0.5 0
0 
.
0 0.5 0 
0
0 0.5

For a graphical illustration, the corresponding network
topologies of GR and GC are shown in Fig. 1. The central
node 1 pushes (diffuses) information regarding x1,k to the
neighbors (in this case the entire network) through GR , while
the others can only passively infuse the information from
node 1. At the same time, node 1 pulls (collects) information
regarding yi,k (i = 2, 3, 4) from the neighbors through GC ,
5 This is indeed a consequence of Assumption 4.

4
2

2

1

3

1

3

(5b)

If span{I − R} and null{I − C} are disjoint5 , from (5) we
would have x∞ ∈ null{I − R} and y∞ = 0. Hence x∞
satisfies the optimality condition in (4a). Then by induction
we know 1⊺ ∇F (x∞ ) = 1⊺ y∞ = 0, which is exactly the
optimality condition in (4b).
The structure of the algorithm in (3) is similar to that
of the DIGing algorithm proposed in [23] with the mixing
matrices distorted (doubly stochastic matrices split into a
row-stochastic matrix and a column-stochastic matrix). The
x-update can be seen as an inexact gradient step with
consensus, while the y-update can be viewed as a gradient
tracking step. Such an asymmetric R-C structure design has
already been used in the literature of average consensus [33].
However, we can not analyze the proposed optimization
algorithm using linear dynamical systems since we have a
nonlinear dynamics due to the gradient terms.
We now show how the proposed algorithm (3) unifies
different types of distributed architecture. For the fully
decentralized case, suppose we have a graph G that is
undirected and connected. Then R and C can be chosen as
symmetric matrices, in which case the proposed algorithm
degrades to the one considered in [23]; if the graph is directed
and strongly connected, we can set GR = GC = G and design
the weights for R and C correspondingly.
To illustrate the less straightforward situation of (semi)centralized networks, let us give a simple example. Consider
a four-node star network composed by {1, 2, 3, 4} where
node 1 is situated at the center and nodes 2, 3, and 4 are
(bidirectionally) connected with node 1 but not connected to
each other. In this case, the matrix R in our algorithm can
be chosen as


1
0
0
0
 0.5 0.5 0
0 

R=
 0.5 0 0.5 0 
0.5 0
0 0.5
and

4

Fig. 1.

The left is GR and the right is GC .

while the other nodes can only actively comply with the
request from node 1. This motivates the algorithm’s name
push-pull gradient method. Although nodes 2, 3, and 4
are updating their yi ’s accordingly, these quantities do not
have to contribute to the optimization procedure and will
die out geometrically fast due to the weights in the last
three rows of C. Consequently, in this special case, the
local step-size α for agents 2, 3, and 4 can be set to 0.
Without loss of generality, suppose f1 (x) = 0, ∀x. Then
the algorithm
P4becomes a typical centralized algorithm for
minimizing i=2 fi (x) where the master node 1 utilizes the
slave nodes 2, 3, and 4 to compute the gradient information
in a distributed way.
Taking the above as an example for explaining the semicentralized case, it is worth nothing that node 1 can be
replaced by a strongly connected subnet in GR and GC ,
respectively. Correspondingly, nodes 2, 3, and 4 can all be
replaced by subnets as long as the information from the
master layer in these subnets can be diffused to all the
slave layer agents in GR , while the information from all the
slave layer agents can be diffused to the master layer in GC .
Specific requirements on connectivities of slave subnets can
be understood by using the concept of rooted trees. We refer
to the nodes as leaders if their roles in the network are similar
to the role of node 1; and the other nodes are termed as
followers. Note that after the replacement of the individual
nodes by subnets, the network structure in all subnets are
decentralized, while the relationship between leader subnet
and follower subnets is master-slave. This is why we refer
to such an architecture as semi-centralized.
Remark 1: There can be multiple variants of the proposed
algorithm depending on whether the Adapt-then-Combine
(ATC) strategy [34] is used in the x-update and/or the yupdate (see Remark 3 in [23] for more details). Our following
analysis can be easily adapted for these variants. We have
also tested one of the variants in Section IV.
III. C ONVERGENCE A NALYSIS
In this section, we study the convergence properties of the
proposed algorithm. We first define the following variables:
1 ⊺
1
u xk , ȳk := 1⊺ yk .
n
n
Our strategy is to bound kx̄k+1 −x∗ k2 , kxk+1 −1x̄k+1 kR and
kyk+1 − v ȳk+1 kC in terms of linear combinations of their
previous values, where k · kR and k · kC are specific norms
to be defined later. In this way we establish a linear system
x̄k :=

of inequalities which allows us to derive the convergence
results. The proof technique was inspired by [24], [25].
A. Preliminary Analysis
From the algorithm (3) and Lemma 1, we have
x̄k+1 =

α
1 ⊺
u R(xk − αyk ) = x̄k − u⊺ yk ,
n
n

(6)

and
1 ⊺
1 C (yk + ∇F (xk+1 ) − ∇F (xk ))
n
(7)
1
= ȳk + 1⊺ (∇F (xk+1 ) − ∇F (xk )) .
n
With the initialization y0 = ∇F (x0 ), we obtain by induction
ȳk+1 =

ȳk =

1 ⊺
1 ∇F (xk ), ∀k.
n

(8)

Let us further define gk := n1 1⊺ ∇F (1x̄k ). Then, we obtain
from relation (6)
α
x̄k+1 = x̄k − u⊺ (yk − v ȳk + v ȳk )
n
α
α ⊺
= x̄k − u v ȳk − u⊺ (yk − v ȳk )
n
n
α
′
′
= x̄k − α gk − α (ȳk − gk ) − (u − 1)⊺ (yk − v ȳk ) ,
n
(9)
where

α ⊺
u v.
(10)
n
We will show later that Assumption 4 ensures α′ > 0.
In view of (3) and Lemma 1, using (6) we have
α′ :=

α
xk+1 − 1x̄k+1 = R(xk − αyk ) − 1x̄k + 1u⊺ yk
n


1u⊺
= R(xk − 1x̄k ) − α R −
yk
n




1u⊺
1u⊺
(xk −1x̄k )−α R −
(yk − 1ȳk ) ,
= R−
n
n
(11)
and from (7) we obtain
yk+1 − v ȳk+1 = Cyk − v ȳk


v1⊺
(∇F (xk+1 ) − ∇F (xk ))
+ C−
n


v1⊺
(yk − v ȳk )
= C−
n


v1⊺
(∇F (xk+1 ) − ∇F (xk )) . (12)
+ C−
n
B. Supporting Lemmas
Before proceeding to the main results, we state a few
useful lemmas.
Lemma 2: Under Assumption 1, there holds
L
kȳk − gk k2 ≤ √ kxk − 1x̄k k2 ,
n
kgk k2 ≤ Lkx̄k − x∗ k2 .

(13)
(14)

In addition, when α′ ≤ 2/(µ + L), we have
kx̄k − α′ gk − x∗ k2 ≤ (1 − α′ µ)kx̄k − x∗ k2 , ∀k. (15)
Proof: See Appendix VI-A.
Lemma 3: Suppose Assumption 2 holds, and assume that
RR 6= ∅ and RC ⊺ 6= ∅. Then, RR ∩ RC ⊺ 6= ∅ iff u⊺ v > 0.
Proof: See Appendix VI-B.
Lemma 3 explains why Assumption 4 is essential for the
Push-Pull algorithm (3) to work. Without the condition, α′ =
0 by its definition in (10).
Lemma 4: Suppose Assumptions 2-4 hold. Let ρR and ρC
be the spectral radii of (R − 1u⊺ /n) and (C − v1⊺ /n),
respectively. Then, we have ρR < 1 and ρC < 1.
Proof: See Appendix VI-C.
Lemma 5: There exist matrix norms k · kR and k · kC such
⊺
⊺
that σR := kR − 1un kR < 1, σC := kC − v1n kC < 1, and
σR and σC are arbitrarily close to ρR and ρC , respectively.
Proof: See [32, Lemma 5.6.10] and the discussions
thereafter.
In the rest of this paper, with a slight abuse of notation, we
do not distinguish between the vector norms on Rn and their
induced matrix norms.
Lemma 6: Given an arbitrary norm k · k, for any W ∈
Rn×n and x ∈ Rn×p , we have kW xk ≤ kW kkxk. For any
w ∈ Rn×1 and x ∈ R1×p , we have kwxk = kwkkxk2 .
Proof: See Appendix VI-D.
Lemma 7: There exist constants δC,R , δC,2 , δR,C , δR,2 >
0 such that for all x ∈ Rn×p , we have kxkC ≤ δC,R kxkR ,
kxkC ≤ δC,2 kxk2 , kxkR ≤ δR,C kxkC , and kxkR ≤
δR,2 kxk2 . In addition, with a proper rescaling of the norms
k · kR and k · kC , we have kxk2 ≤ kxkR and kxk2 ≤ kxkC .
Proof: The result follows from the equivalence relation
of all norms on Rn and Definition 1.
C. Main Results
The following lemma establishes a linear system of inequalities that bound kx̄k+1 − x∗ k2 , kxk+1 − 1x̄k kR and
kyk+1 − v ȳk kC .
Lemma 8: Under Assumptions 1-4, when α′ ≤ 2/(µ+L),
we have the following linear system of inequalities:




kx̄k+1 − x∗ k2
kx̄k − x∗ k2
kxk+1 − 1x̄k+1 kR  ≤ A kxk − 1x̄k kR  ,
(16)
kyk+1 − v ȳk+1 kC
kyk − v ȳk kC

where the inequality is to be taken component-wise, and
elements of the transition matrix A = [aij ] are given by:
 


a11
1 − α′ µ
a21  =  ασR kv − 1kR L  ,
a31
ασC δC,2 kRvk2 L2


′
α
 
√L
n


a12


,
a22  = 
σR 1 + αkv − 1kR √Ln



a32
σC δC,2 L kR − Ik2 + αkRvk2 √Ln
 


αku−1k2
a13
n
a23  = 
.
ασR δR,C
a33
σC (1 + αδC,2 kRk2 L)

Proof: See Appendix VI-E.
In light of Lemma 8, kx̄k − x∗ k2 , kxk − 1x̄k kR and kyk −
v ȳk kC all converge to 0 linearly at rate O(ρkA ) if the spectral
radius of A satisfies ρA < 1. The next lemma provides some
sufficient conditions for the relation ρA < 1 to hold.
Lemma 9: Given a nonnegative, irreducible matrix M =
[mij ] ∈ R3×3 with m11 , m22 , m33 < λ∗ for some λ∗ > 0. A
necessary and sufficient condition for ρM < λ∗ is det(λ∗ I −
M ) > 0.
Proof: See Appendix VI-F.
Now, we are ready to deliver our main convergence result
for the Push-Pull algorithm in (3).
Theorem 1: Suppose Assumptions 1-4 hold and
(
)
(1 − σC )
2c3
p
,
α ≤ min
, (17)
c2 + c22 + 4c1 c3 2σC δC,2 kRk2 L

where c1 , c2 , c3 are given in (22)-(24). Then, the quantities
kx̄k − x∗ k2 , kxk − 1x̄k kR and kyk − v ȳk kC all converge to
0 at the linear rate O(ρkA ) with ρA < 1, where ρA denotes
the spectral radius of A.
Proof: In light of Lemma 9, it suffices to ensure
a11 , a22 , a33 < 1 and det(I − A) > 0, or equivalently
det(I − A) = (1 − a11 )(1 − a22 )(1 − a33 ) − a12 a23 a31
−a13 a21 a32 −(1−a22 )a13 a31 −(1−a11 )a23 a32 −(1−a33 )a12 a21
3

L
= (1 − a11 )(1 − a22 )(1 − a33 ) − α′ α2 σR σC δR,C δC,2 kRvk2 √
n
 2

L
L
−α2 σR σC δC,2 ku−1k2 kv−1kR kR − Ik2 + αkRvk2 √
n n
2
L
− α2 σC δC,2 kRvk2 ku − 1k2 (1 − a22 )
n


L
− ασR σC δR,C δC,2 L kR − Ik2 + αkRvk2 √
(1 − a11 )
n
L2
− α′ ασR kv − 1kR √ (1 − a33 ) > 0. (18)
n

We now provide some sufficient conditions under which
a11 , a22 , a33 < 1 and (18) holds true. First, a11 < 1 is
ensured by choosing α′ ≤ 2/(µ + L). let
1 − a22

≥

1 − a33

≥

1
(1 − σR ),
2
1
(1 − σC ).
2

(19)
(20)

We get
α ≤ min




√
(1 − σC )
(1 − σR ) n
.
,
2σR kv − 1kR L 2σC δC,2 kRk2 L

(21)

Second, notice that a22 > σR and a33 > σC . A sufficient
condition for det(I −A) > 0 is to substitute the first (1−a22 )
(respectively, (1 − a33 )) in (18) by (1 − σR )/2 (respectively,
(1−σC )/2), and substitute the second (1−a22 ) (respectively,
(1 − a33 )) by (1 − σR ) (respectively, (1 − σC )). We have
c1 α2 + c2 α − c3 < 0,

where
c1 =

L3
u⊺ v
σR σC δR,C δC,2 kRvk2 √
n
n
L L2
+ σR σC δC,2 ku − 1k2 kv − 1kR kRvk2 √
n n
L
u⊺ v
µσR σC δR,C δC,2 LkRvk2 √
+
n
n
L2
= σR σC δC,2 kRvk2 √ [u⊺ vδR,C (L + µ)
n n
+ku − 1k2 kv − 1kR L] , (22)
L2
n
L2
+ σC δC,2 kRvk2 ku − 1k2 (1 − σR )
n
u⊺ v
µ
+ σR σC δR,C δC,2 LkR − Ik2
n
2
L
u⊺ v
+ σR kv − 1kR √ (1 − σC )
, (23)
n
n

c2 = σR σC δC,2 ku − 1k2 kv − 1kR kR − Ik2

and
c3 =

u⊺ v
µ(1 − σR )(1 − σC ).
4n

Hence
α≤

c2 +

2c3
p
.
c22 + 4c1 c3

(24)
(25)

Relations (21) and (25) yield the final bound on α.
Remark 2: When α is sufficiently small, it can be shown
that ρA ≃ 1 − α′ µ, in which case the Push-Pull algorithm is
comparable to its centralized counterpart with step-size α′ .
IV. SIMULATIONS
In this section, we provide numerical comparisons of a
few different algorithms under various network settings. Our
settings for objective functions are the same as that described
in [23]. Each node in the network holds a Huber-typed
objective function fP
i (x) and the goal is to optimize the total
Huber loss f (x) = ni=1 fi (x). The objective functions fi ’s
are randomly generated but are manipulated such that the
global optimizer x∗ is located at the ℓ22 zone of f (x) while
the origin (which is set to be the initial state of xk for all
involved algorithms) is located outside of that zone.
We first conduct an experiment over time-invariant directed graphs. The network is generated randomly with 12
nodes and 24 unidirectional links (at most 12 × 11 =
131 possible links in this case) and is guaranteed to be
strongly connected. We test our proposed algorithm, PushPull, against Push-DIGing [23] and Xi-Row [25]. Among
these algorithms, Push-DIGing is a push-sum based algorithm which only needs push operations for information
dissemination in the network; Xi-row is an algorithm that
only uses row stochastic mixing matrices and thus only
needs pull operations to fetch information in the network; in
comparison, our algorithm needs the network to support both
push operations and pull operations. The per-node storage
complexity of Push-Pull (or Push-DIGing) is O(p) while

10

100

10-2

Residual

that of Xi-row is O(n + p). Note that at each iteration,
the amount of data transmitted over each link also scales at
such orders for these algorithms, respectively. For large-scale
networks (n ≫ p), Xi-row may suffer from high needs in
storage/bandwidth and/or become under limited transmission
kx −x∗ k2
rates. The evolution of the (normalized) residual kxk0 −x∗ k22
2
is illustrated in Fig. 2. The step-sizes are hand-tuned for all
the algorithms to optimize the convergence speed.

10-4

10-6
Leader-Follower, α = 0.017
Push-Pull-half, α = 0.42
Push-DIGing, α = 0.069
Leader-Follower, α = 0.024
Push-Pull-half, α = 0.59
Push-DIGing, α = 0.031

0

10

10-2

-8

10-10

0

5000

10000

15000

Residual

k
10-4

Fig. 3. Plots of (normalized) residuals against number of iterations over a
time-varying directed graph sequence.
10-6

10

10

Push-Pull, α = 0.77
Push-DIGing, α = 0.21
Xi-row, α = 0.016
Push-Pull, α = 0.93
Push-DIGing, α = 0.25
Xi-row, α = 0.02

-8

-10

0

500

1000

1500

2000

2500

3000

k
Fig. 2. Plots of (normalized) residuals against number of iterations over a
time-invariant directed graph.

Although our algorithm is designed and analyzed over
time-invariant directed graphs, its extension to time-varying
directed graphs is straightforward. Let us use the above generated directed graph as a base graph. To test our theory for
a leader-follower architecture, we randomly select multiple
nodes as leaders and randomly add enough links between the
leaders (in this example, number of leaders is 2) so that they
form a strongly connected subnet. Then at each iteration,
only 50% randomly chosen links will be activated. In Fig. 3,
we plot the performance of Push-Pull-half (a variant of PushPull where the ATC strategy is not employed in the y-update;
it needs only one round of communication at each iteration)
and Push-DIGing without considering the leader-follower
structure. That is, for Push-Pull-half and Push-DIGing, a
time-varying directed graph sequence based on random link
activation is used where the underlying graph is strongly
connected.
Then in Fig. 3 we further show the performance of
Push-Pull under the leader-follower architecture. The major
difference on the graph sequence is that, in the leaderfollower architecture, all the outbound information links of
the leader subnet are not used when performing the y-update;
all the inbound information links of the leader subnet are
not used when performing the x-update. Note that in such a
way, the union of all directed graphs corresponding to Rk (or
Ck ) is not strongly connected. The numerical results show, as
expected, that the convergence of Push-Pull under the leaderfollower architecture is slower than that of Push-Pull-half
with strongly connected underlying graphs.

In the experiment, we observe that there are many spikes
on the residual curve of Push-DIGing. Push-DIGing for
time-varying graphs can be numerically unstable due to the
use of division operations in the algorithm and the divisors
can scale badly at the order of Ω(n−Bn ) where n is the
number of nodes and B a bounded constant that describes
the connectivity of time-varying graphs (the smaller B is, the
better the network is connected; see [23] for the definition
of B). A simple network with number of nodes n = 15 and
time-varying constant B = 10 will easily give a number
that is recognized by common computers (using doubleprecision floating-point format) as 0. As a contrast, in PushPull, there is no divisors that scale at such level. In addition,
the theoretical upper bound on the step-size of Push-DIGing
also scales at the order of O(n−Bn ). This implies that PushDIGing will not work well for either large scale networks or
time-varying networks with large variations.
V. C ONCLUSIONS
In this paper, we have studied the problem of distributed
optimization over a network. In particular, we proposed a
new distributed gradient-based method (Push-Pull) where
each node maintains estimates of the optimal decision
variable and the average gradient of the agents’ objective
functions. From the viewpoint of an agent, the information
about the decision variable is pushed to its neighbors, while
the information about the gradients is pulled from its neighbors. This method works for different types of distributed
architecture, including decentralized, centralized, and semicentralized architecture. We have showed that the algorithm
converges linearly for strongly convex and smooth objective
functions over a directed static network. In the simulations,
we have demonstrated the effectiveness of the proposed
algorithm for both static and time-varying directed networks.
R EFERENCES
[1] S. Pu, W. Shi, J. Xu, and A. Nedić, “Push-pull gradient methods for distributed optimization in networks,” arXiv preprint
arXiv:1810.06653, 2018.

[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends R in Machine
Learning, vol. 3, no. 1, pp. 1–122, 2011.
[3] Z. Peng, Y. Xu, M. Yan, and W. Yin, “Arock: an algorithmic framework for asynchronous parallel coordinate updates,” SIAM Journal on
Scientific Computing, vol. 38, no. 5, pp. A2851–A2879, 2016.
[4] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the Linear Convergence of the ADMM in Decentralized Consensus Optimization,”
IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–
1761, 2014.
[5] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An Exact First-Order
Algorithm for Decentralized Consensus Optimization,” SIAM Journal
on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[6] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A Decentralized
Second-Order Method with Exact Linear Convergence Rate for Consensus Optimization,” arXiv preprint arXiv:1602.00596, 2016.
[7] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato, “Newton-raphson consensus for distributed convex optimization,”
IEEE Transactions on Automatic Control, vol. 61, no. 4, pp. 994–1009,
2016.
[8] A. Olshevsky, “Linear time average consensus and distributed optimization on fixed graphs,” SIAM Journal on Control and Optimization,
vol. 55, no. 6, pp. 3990–4014, 2017.
[9] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié, “Optimal
algorithms for smooth and strongly convex distributed optimization in
networks,” arXiv preprint arXiv:1702.08704, 2017.
[10] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić, “Optimal algorithms
for distributed optimization,” arXiv preprint arXiv:1712.00232, 2017.
[11] J. Xu, S. Zhu, Y. Soh, and L. Xie, “Augmented Distributed Gradient
Methods for Multi-Agent Optimization Under Uncoordinated Constant
Stepsizes,” in Proceedings of the 54th IEEE Conference on Decision
and Control (CDC), 2015, pp. 2055–2060.
[12] A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe, “Geometrically
convergent distributed optimization with uncoordinated step-sizes,” in
American Control Conference (ACC), 2017. IEEE, 2017, pp. 3950–
3955.
[13] P. Di Lorenzo and G. Scutari, “Next: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over
Networks, vol. 2, no. 2, pp. 120–136, 2016.
[14] W. Shi, Q. Ling, G. Wu, and W. Yin, “A Proximal Gradient Algorithm
for Decentralized Composite Optimization,” IEEE Transactions on
Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.
[15] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method
with network independent step-sizes and separated convergence rates,”
arXiv preprint arXiv:1704.07807, 2017.
[16] A. Nedić and A. Olshevsky, “Distributed optimization over timevarying directed graphs,” IEEE Transactions on Automatic Control,
vol. 60, no. 3, pp. 601–615, 2015.
[17] C. Xi and U. A. Khan, “On the linear convergence of distributed
optimization over directed graphs,” arXiv preprint arXiv:1510.02149,
2015.
[18] J. Zeng and W. Yin, “ExtraPush for Convex Smooth Decentralized Optimization over Directed Networks,” arXiv preprint arXiv:1511.02942,
2015.
[19] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Convergence of asynchronous
distributed gradient methods over stochastic networks,” IEEE Transactions on Automatic Control, 2017.
[20] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed, “Decentralized
consensus optimization with asynchrony and delays,” in Signals,
Systems and Computers, 2016 50th Asilomar Conference on. IEEE,
2016, pp. 992–996.
[21] S. Pu and A. Garcia, “A flocking-based approach for distributed
stochastic optimization,” Operations Research, vol. 1, pp. 267–281,
2018.
[22] S. Pu and A. Nedić, “Distributed stochastic gradient tracking methods,”
arXiv preprint arXiv:1805.11454, 2018.
[23] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM
Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
[24] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed
optimization,” IEEE Transactions on Control of Network Systems,
2017.
[25] C. Xi, V. S. Mai, R. Xin, E. H. Abed, and U. A. Khan, “Linear

convergence in optimization over directed graphs with row-stochastic
matrices,” IEEE Transactions on Automatic Control, 2018.
[26] E. Wei, A. Ozdaglar, and A. Jadbabaie, “A distributed newton method
for network utility maximization–i: Algorithm,” IEEE Transactions on
Automatic Control, vol. 58, no. 9, pp. 2162–2175, 2013.
[27] A. Mokhtari, Q. Ling, and A. Ribeiro, “Network newton distributed
optimization methods,” IEEE Transactions on Signal Processing,
vol. 65, no. 1, pp. 146–161, 2017.
[28] D. Kempe, A. Dobra, and J. Gehrke, “Gossip-Based Computation
of Aggregate Information,” in Proceedings of the 44th Annual IEEE
Symposium on Foundations of Computer Science, 2003, pp. 482–491.
[29] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, “Push-sum distributed dual
averaging for convex optimization,” in Decision and Control (CDC),
2012 IEEE 51st Annual Conference on. IEEE, 2012, pp. 5453–5458.
[30] R. Xin and U. A. Khan, “A linear algorithm for optimization
over directed graphs with geometric convergence,” arXiv preprint
arXiv:1803.02503, 2018.
[31] C. Godsil and G. F. Royle, Algebraic graph theory. Springer Science
& Business Media, 2013, vol. 207.
[32] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge university
press, 1990.
[33] K. Cai and H. Ishii, “Average consensus on general strongly connected
digraphs,” Automatica, vol. 48, no. 11, pp. 2750–2761, 2012.
[34] A. Sayed, “Diffusion Adaptation over Networks,” Academic Press
Library in Signal Processing, vol. 3, pp. 323–454, 2013.
[35] W. Ren and R. W. Beard, “Consensus seeking in multiagent systems
under dynamically changing interaction topologies,” IEEE Transactions on automatic control, vol. 50, no. 5, pp. 655–661, 2005.

VI. APPENDIX
A. Proof of Lemma 2
In light of Assumption 1 and (8),
1 ⊺
k1 ∇F (xk ) − 1⊺ ∇F (1x̄k )k2
n
n
L
LX
kxi,k − x̄k k2 ≤ √ kxk − 1x̄k k2 ,
≤
n i=1
n

kȳk − gk k2 =

and
kgk k2 =

1 ⊺
k1 ∇F (1x̄k ) − 1⊺ ∇F (1x∗ )k2
n
n
LX
kx̄k − x∗ k2 = Lkx̄k − x∗ k2 .
≤
n i=1

Proof of the last relation can be found in [24] Lemma 10.
B. Proof of Lemma 3
We first demonstrate that ui > 0 iff i ∈ RR . Note that
there exists an order of vertices such that R can be written
as


R1 0
(26)
R̃ =
R2 R3

where R1 is a square matrix corresponding to vertices in RR .
R1 is row stochastic and irreducible (since the associated
graph GR1 is strongly connected). In light of the PerronFrobenius theorem, R1 has a strictly positive left eigenvector
u⊺1 (u⊺1 1 = n) corresponding to eigenvalue 1. It follows that
[u1 , 0]⊺ is a row eigenvector of R̃, which is also unique
from the Perron-Frobenius theorem. Since reordering of
vertices does not change the corresponding eigenvector (up
to permutation in the same oder of vertices), ui > 0 iff
i ∈ RR .
Similarly, vj > 0 iff j ∈ RC ⊺ . We conclude that RR ∩
RC ⊺ 6= ∅ iff u⊺ v > 0.

C. Proof of Lemma 4

Lastly, it follows from (12), Lemma 6 and Lemma 7 that

In light of [35, Lemma 3.4], under Assumptions 2-4, spectral radii of R and C are both equal to 1 (the corresponding
eigenvalues have multiplicity 1). Suppose for some λ, ũ 6= 0,


1u⊺
ũ⊺ R −
= λũ⊺ .
n

kyk+1 − v ȳk+1 kC ≤ σC kyk − v ȳk kC + σC δC,2 Lkxk+1 − xk k2
= σC kyk − v ȳk kC + σC δC,2 Lk(R − I)(xk − 1x̄k ) − αRyk k2
≤ σC kyk − v ȳk kC + σC δC,2 LkR − Ik2 kxk − 1x̄k k2
+ ασC δC,2 L kR(yk − v ȳk ) + Rv ȳk k2
≤ σC kyk − v ȳk kC + σC δC,2 LkR − Ik2 kxk − 1x̄k k2
+ ασC δC,2 L (kRk2 kyk − v ȳk k2 + kRvk2 kȳk k2 ) . (31)

Since 1 is a right eigenvector of (R − 1u⊺ /n) corresponding
to eigenvalue 0, ũ⊺ 1 = 0 (see [32] Theorem 1.4.7). We have

In light of Lemma 2,
kȳk k2 ≤

ũ⊺ R = λũ.




L
√ kxk − 1x̄k k2 + Lkx̄k − x∗ k2 .
n

(32)

Hence
Hence λ is also an eigenvalue of R. Noticing that u⊺ 1 = n,
we have ũ⊺ 6= u⊺ so that λ < 1. We conclude that σR < 1.
Similarly we can obtain σC < 1.
D. Proof of Lemma 6
By Definition 1,
kW xk = k[kW x1 k, kW x2 k, . . . , kW xp k]k2

≤ k[kW kkx1 k, kW kkx2 k, . . . , kW kkxp k]k2

= kW kk[kx1k, kx2 k, . . . , kxp k]k2 = kW kkxk, (27)
and
kwxk = k[kwx1 k, kwx2 k, . . . , kwxp k]k2
1

2

p

= kwkk[|x |, |x |, . . . , |x |]k2 = kwkkxk2 . (28)

E. Proof of Lemma 8
The three inequalities embedded in (16) come from (9),
(11), and (12), respectively. First, by Lemma 2 and Lemma
7, we obtain from (9) that
kx̄k+1 − x∗ k2 ≤ kx̄k − α′ gk − x∗ k2 + α′ kȳk − gk k2
α
+ k(u − 1)⊺ (yk − v ȳk )k2
n
α′ L
′
≤ (1 − α µ)kx̄k − x∗ k2 + √ kxk − 1x̄k k2
n
αku − 1k2
+
kyk − v ȳk k2
n
α′ L
≤ (1 − α′ µ)kx̄k − x∗ k2 + √ kxk − 1x̄k kR
n
αku − 1k2
kyk − v ȳk kC .
+
n

kyk+1 − v ȳk+1 kC ≤ σC (1 + αδC,2 kRk2 L) kyk − v ȳk kC
+ σC δC,2 LkR − Ik2 kxk − 1x̄k k2


L
+ ασC δC,2 kRvk2 L √ kxk − 1x̄k k2 + Lkx̄k − x∗ k2
n
≤ σC (1 + αδC,2 kRk2 L) kyk − v ȳk kC


L
kxk − 1x̄k kR
+ σC δC,2 L kR − Ik2 + αkRvk2 √
n
+ ασC δC,2 kRvk2 L2 kx̄k − x∗ k2 . (33)

F. Proof of Lemma 9
The characteristic function of M is given by
g(λ) := det(λI − M ) = (λ − m11 )(λ − m22 )(λ − m33 )
− a23 a32 (λ − m11 ) − a13 a31 (λ − m22 ) − a12 a21 (λ − m33 )

− a12 a23 a31 − a13 a32 a21 . (34)

Necessity is trivial since det(λ∗ I −M ) ≤ 0 implies g(λ) = 0
for some λ ≥ λ∗ . We now show det(λ∗ I − M ) > 0 is also
a sufficient condition.
Given that g(λ∗ ) = det(λ∗ I − M ) > 0,
(λ∗ − m11 )(λ∗ − m22 )(λ∗ − m33 )
> a23 a32 (λ∗ −m11 )+a13 a31 (λ∗ −m22 )+a12 a21 (λ∗ −m33 ).
It follows that
γ1 (λ∗ − m22 )(λ∗ − m33 ) >
γ2 (λ∗ − m11 )(λ∗ − m33 ) >
γ3 (λ∗ − m11 )(λ∗ − m22 ) >

(29)

Second, by relation (11), Lemma 6 and Lemma 7, we see
that
kxk+1 − 1x̄k+1 kR ≤ σR kxk − 1x̄k kR + ασR kyk − 1ȳk kR
≤ σR kxk − 1x̄k kR + ασR kyk − v ȳk kR + ασR kv − 1kR kȳk k2
≤ σR kxk − 1x̄k kR + ασR kyk − v ȳk kR


L
+ ασR kv − 1kR √ kxk − 1x̄k k2 + Lkx̄k − x∗ k2
n


L
≤ σR 1 + αkv − 1kR √
kxk − 1x̄k kR
n
+ ασR δR,C kyk − v ȳk kC + ασR kv − 1kR Lkx̄k − x∗ k2 . (30)

a23 a32
a13 a31
a12 a21

(35)

for some γ1 , γ2 , γ3 > 0 with γ1 + γ2 + γ3 ≤ 1. Consider
g ′ (λ) = (λ − m22 )(λ − m33 ) + (λ − m11 )(λ − m33 )

+ (λ − m11 )(λ − m22 ) − a23 a32 − a13 a31 − a12 a21 .

We have g ′ (λ) > 0 for λ ∈ (−∞, −λ∗ ]∪[λ∗ , +∞). Noticing
that
g(−λ∗ ) ≤ −(λ∗ + m11 )(λ∗ + m22 )(λ∗ + m33 )

+ a23 a32 (1 + m11 ) + a13 a31 (λ∗ + m22 )
+ a12 a21 (λ∗ + m33 ) < 0,

all real roots of g(λ) = 0 lie in the interval (−λ∗ , λ∗ ). By
the Perron-Frobenius theorem, ρM is an eigenvalue of M .
We conclude that ρM < λ∗ .

