EXTRA: AN EXACT FIRST-ORDER ALGORITHM FOR
DECENTRALIZED CONSENSUS OPTIMIZATION∗

arXiv:1404.6264v4 [math.OC] 30 Nov 2014

WEI SHI , QING LING , GANG WU , AND WOTAO YIN
Abstract. Recently, there have been growing interests in solving consensus optimization problems in
a multi-agent network. In this paper, we develop a decentralized algorithm for the consensus optimization
problem
n
X
¯(x) = 1
minimize
f
fi (x),
x∈Rp
n i=1

which is defined over a connected network of n agents, where each function fi is held privately by agent i and
encodes the agent’s data and objective. All the agents shall collaboratively find the minimizer while each
agent can only communicate with its neighbors. Such a computation scheme avoids a data fusion center or
long-distance communication and offers better load balance to the network.
This paper proposes a novel decentralized exact first-order algorithm (abbreviated as EXTRA) to solve
the consensus optimization problem. “Exact” means that it can converge to the exact solution. EXTRA can
use a fixed large step size, which is independent of the network size, and has synchronized iterations. The
local variable of every agent i converges uniformly and consensually to an exact minimizer of f¯. In contrast,
the well-known decentralized gradient descent (DGD) method must use diminishing step sizes in order to
converge to an exact minimizer. EXTRA and DGD have the same choice of mixing matrices and similar
per-iteration complexity. EXTRA, however, uses the gradients of last two iterates, unlike DGD which uses
just that of last iterate.
EXTRA has the best known convergence rates among the existing first-order decentralized algorithms
for decentralized consensus optimization with convex Lipschitz–differentiable objectives. Specifically, if fi ’s

are convex and have Lipschitz continuous gradients, EXTRA has an ergodic convergence rate O k1 in
terms of the first-order optimality residual. If f¯ is also (restricted) strongly convex, EXTRA converges to
an optimal solution at a linear rate O(C −k ) for some constant C > 1.
Key words. Consensus optimization, decentralized optimization, gradient method, linear convergence

1. Introduction. This paper focuses on decentralized consensus optimization, a problem defined on a connected network and solved by n agents cooperatively
n
X
¯(x) = 1
fi (x),
minimize
f
x∈Rp
n i=1

(1.1)

over a common variable x ∈ Rp , and for each agent i, fi : Rp → R is a convex function
privately known by the agent. We assume that fi ’s are continuously differentiable and will
introduce a novel first-order algorithm to solve (1.1) in a decentralized manner. We stick to
the synchronous case in this paper, that is, all the agents carry out their iterations at the
same time intervals.
Problems of the form (1.1) that require decentralized computation are found widely
in various scientific and engineering areas including sensor network information processing,
∗ This work is supported by Chinese Scholarship Council (CSC) grants 201306340046 and 2011634506,

NSFC grant 61004137, MOF/MIIT/MOST grant BB2100100015, and NSF grants DMS-0748839 and DMS1317602.
1

multiple-agent control and coordination, as well as distributed machine learning. Examples
and works include decentralized averaging [7,15,34], learning [9,22,26], estimation [1,2,16,18,
29], sparse optimization [19, 35], and low-rank matrix completion [20] problems. Functions
fi can take forms of least squares [7, 15, 34], regularized least squares [1, 2, 9, 18, 22], as
well as more general ones [26]. The solution x can represent, for example, the average
temperature of a room [7,34], frequency-domain occupancy of spectra [1,2], states of a smart
grid system [10, 16], sparse vectors [19, 35], and a matrix factor [20] and so on. In general,
decentralized optimization fits the scenarios in which the data is collected and/or stored in a
distributed network, a fusion center is either infeasible or not economical, and/or computing
is required to be performed in a decentralized and collaborative manner by multiple agents.
1.1. Related Methods. Existing first-order decentralized methods for solving (1.1)
include the (sub)gradient method [21, 25, 36], the (sub)gradient-push method [23, 24], the
fast (sub)gradient method [5,14], and the dual averaging method [8]. Compared to classical
centralized algorithms, decentralized algorithms encounter more restrictive assumptions and
typically worse convergence rates. Most of the above algorithms are analyzed under the
assumption of bounded (sub)gradients. Work [21] assumes bounded Hessian for strongly
convex functions. Recent work [36] relaxes such assumptions for decentralized gradient
descent. When (1.1) has additional constraints that force x in a bounded set, which also
leads to bounded (sub)gradients and Hessian, projected first-order algorithms are applicable
[27, 37].
When using a fixed step size, these algorithms do not converge to a solution x∗ of
problem (1.1) but a point in its neighborhood no matter whether fi ’s are differentiable or
not [36]. This motivates the use of certain diminishing step sizes in [5, 8, 14] to guarantee
convergence to x∗ . The rates of convergence are generally weaker than their analogues in
centralized computation. For the general convex case and under the bounded (sub)gradient
(or Lipschitz–continuous objective) assumption,
[5] shows that diminishing step sizes αk =
 
1
ln
k
√ lead to a convergence rate of O √
in terms of the running best of objective error,
k
k
 
√k
and [8] shows that the dual averaging method has a rate of O ln
in the ergodic sense
k
in terms of objective error. For the general convex case, under assumptions of fixed step
size and Lipschitz continuous, bounded gradient, [14] shows an outer–loop convergence rate

of O k12 in terms of objective error, utilizing Nesterov’s acceleration, provided that the
inner loop performs substantial consensus computation, without which diminishing step

1
sizes αk = k1/3
lead to a reduced rate of O lnkk . The (sub)gradient-push method [23] can
be implemented in a dynamic digraph
 and, under the bounded
  (sub)gradient assumption
1
√ k in the ergodic sense in terms
and diminishing step sizes αk = O √k , has a rate of O ln
k

of objective error. A better rate of O lnkk is proved for the (sub)gradient-push method
in [24] under the strong convexity and Lipschitz gradient assumptions, in terms of expected
objective error plus squared consensus residual.
Some of other related algorithms are as follows. For general convex functions and
assuming closed and bounded feasible sets, the decentralized asynchronous ADMM [32] is
2


proved to have a rate of O k1 in terms of expected objective error and feasibility violation.
The augmented Lagrangian based primal-dual methods have linear convergence under strong
convexity and Lipschitz gradient assumptions [4, 30] or under the positive-definite bounded
Hessian assumption [12, 13].
Our proposed algorithm is a synchronous gradient-based algorithm that has a rate of

1
O k for general convex objectives with Lipschitz differentials and has a linear rate once
the sum of, rather than individual, functions fi is also (restricted) strongly convex.
1.2. Notation. Throughout the paper, we let agent i hold a local copy of the global
variable x, which is denoted by x(i) ∈ Rp ; its value at iteration k is denoted by xk(i) . We
introduce an aggregate objective function of the local variables
f (x) ,

n
X

fi (x(i) ),

i=1

where


—

xT
(1)

 — xT

(2)
x,
..


.
— xT
(n)
The gradient of f (x) is defined by


—



— 

 ∈ Rn×p .


—

— ∇T f1 (x(1) )

 — ∇T f2 (x(2) )
∇f (x) , 
..


.
T
— ∇ fn (x(n) )


—

— 
 ∈ Rn×p .


—

Each row i of x and ∇f (x) is associated with agent i. We say that x is consensual if all
of its rows are identical, i.e., x(1) = · · · = x(n) . The analysis and results of this paper hold
for all p ≥ 1. The reader can assume p = 1 for convenience (so x and ∇f become vectors)
without missing any major point.
Finally, for given matrix A and symmetric positive semidefinite matrix G, we define
p
the G-matrix norm kAkG , trace(AT GA). The largest singular value of a matrix A
is denoted as σmax (A). The largest and smallest eigenvalues of a symmetric matrix B
are denoted as λmax (B) and λmin (B), respectively. The smallest nonzero eigenvalue of
a symmetric positive semidefinite matrix B 6= 0 is denoted as λ̃min (B), which is strictly
positive. For a matrix A ∈ Rm×n , null{A} , {x ∈ Rn Ax = 0} is the null space of A and
span{A} , {y ∈ Rm y = Ax, ∀x ∈ Rn } is the linear span of all the columns of A.
1.3. Summary of Contributions. This paper introduces a novel gradient-based decentralized algorithm EXTRA, establishes its convergence conditions and rates, and presents
numerical results in comparison to decentralized gradient descent. EXTRA can use a fixed
3

step size independent of the network size and quickly converges to the solution to (1.1). It

has a rate of convergence O k1 in terms of best running violation to the first-order optimality condition when f¯ is Lipschitz differentiable, and has a linear rate of convergence if
f¯ is also (restricted) strongly convex. Numerical simulations verify the theoretical results
and demonstrate its competitive performance.
1.4. Paper Organization. The rest of this paper is organized as follows. Section 2
develops and interprets EXTRA. Section 3 presents its convergence results. Then, Section
4 presents three sets of numerical results. Finally, Section 5 concludes this paper.
2. Algorithm Development. This section derives the proposed algorithm EXTRA.
We start by briefly reviewing decentralized gradient descent (DGD) and discussing the
dilemma that DGD converges slowly to an exact solution when it uses a sequence of diminishing step sizes, yet it converges faster using a fixed step size but stalls at an inaccurate
solution. We then obtain the update formula of EXTRA by taking the difference of two
formulas of the DGD update. Provided that the sequence generated by the new update
formula with a fixed step size converges to a point, we argue that the point is consensual
and optimal. Finally, we briefly discuss the choice of mixing matrices in EXTRA. Formal
convergence results and proofs are left to Section 3.
2.1. Review of Decentralized Gradient Descent and Its Limitation. DGD
carries out the following iteration
xk+1
(i) =

n
P
j=1

wij xk(j) − αk ∇fi (xk(i) ),

for agent i = 1, . . . , n.

(2.1)

Recall that xk(i) ∈ Rp is the local copy of x held by agent i at iteration k, W = [wij ] ∈ Rn×n
is a symmetric mixing matrix satisfying null{I − W } = span{1} and σmax (W − n1 11T ) < 1,
and αk > 0 is a step size for iteration k. If two agents i and j are neither neighbors nor
identical, then wij = 0. This way, the computation of (2.1) involves only local and neighbor
information, and hence the iteration is decentralized.
Following our notation, we rewrite (2.1) for all the agents together as
xk+1 = W xk − αk ∇f (xk ).

(2.2)

With a fixed step size αk ≡ α, DGD has inexact convergence. For each agent i, xk(i) converges
to a point in the O(α)-neighborhood of a solution to (1.1), and these points for different
agents can be different. On the other hand, properly reducing αk enables exact convergence,
namely, that each xk(i) converges to the same exact solution. However, reducing αk causes
slower convergence, both in theory and in practice.
Paper [36] assumes that ∇fi ’s are Lipschitz continuous, and studies DGD with a constant αk ≡ α. Before the iterates reach the O(α)-neighborhood, the objective value reduces

at the rate O k1 , and this rate improves to linear if fi ’s are also (restricted) strongly con1
vex. In comparison, paper [14] studies DGD with diminishing αk = k1/3
and assumes that
4

∇fi ’s are Lipschitz continuous and bounded. The objective convergence rate slows down

1
1
to O k2/3
. Paper [5] studies DGD with diminishing αk = k1/2
and assumes that fi ’s are
 
ln
k
√
is proved. A simple example of decentralized
Lipschitz continuous; a slower rate O
k
least squares in Section 4.1 gives a rough comparison of these three schemes (and how they
compare to the proposed algorithm).
To see the cause of inexact convergence with a fixed step size, let x∞ be the limit of xk
(assuming the step size is small enough to ensure convergence). Taking the limit over k on
both sides of iteration (2.2) gives us
x∞ = W x∞ − α∇f (x∞ ).
When α is fixed and nonzero, assuming the consensus of x∞ (namely, it has identical rows
∞
x∞
= W x∞ , as a result of W 1 = 1, and thus ∇f (x∞ ) = 0, which is
(i) ) will mean x
∞
equivalent to ∇fi (x∞
(i) ) = 0, ∀i, i.e., the same point x(i) simultaneously minimizes fi for all
agents i. This is impossible in general and is different from our objective to find a point
Pn
that minimizes i=1 fi .
2.2. Development of EXTRA. The next proposition provides simple conditions for
the consensus and optimality for problem (1.1).
Proposition 2.1. Assume null{I − W } = span{1}. If


x∗T
(1)

—

 — x∗T

(2)
x ,
..


.
— x∗T
(n)
∗

—



— 




—

(2.3)

satisfies conditions:
1. x∗ = W x∗ (consensus),
2. 1T ∇f (x∗ ) = 0 (optimality),
then x∗ = x∗(i) , for any i, is a solution to the consensus optimization problem (1.1).
Proof. Since null{I − W } = span{1}, x is consensual if and only if condition 1 holds,
Pn
i.e., x∗ = W x∗ . Since x∗ is consensual, we have 1T ∇f (x∗ ) = i=1 ∇fi (x∗ ), so condition 2
means optimality.
Next, we construct the update formula of EXTRA, following which the iterate sequence
will converge to a point satisfying the two conditions in Proposition 2.1.
Consider the DGD update (2.2) written at iterations k + 1 and k as follows
xk+2 = W xk+1 − α∇f (xk+1 ),
x

k+1

k

k

= W̃ x − α∇f (x ),

where the former uses the mixing matrix W and the latter uses
W̃ =

I +W
.
2
5

(2.4)
(2.5)

The choice of W̃ will be generalized later. The update formula of EXTRA is simply their
difference, subtracting (2.5) from (2.4):
xk+2 − xk+1 = W xk+1 − W̃ xk − α∇f (xk+1 ) + α∇f (xk ).

(2.6)

Given xk and xk+1 , the next iterate xk+2 is generated by (2.6).
Let us assume that {xk } converges for now and let x∗ = limk→∞ xk . Let us also assume
that ∇f is continuous. We first establish condition 1 of Proposition 2.1. Taking k → ∞ in
(2.6) gives us
x∗ − x∗ = (W − W̃ )x∗ − α∇f (x∗ ) + α∇f (x∗ ),

(2.7)

from which it follows that
W x∗ − x∗ = 2(W − W̃ )x∗ = 0.

(2.8)

Therefore, x∗ is consensual.
Provided that 1T (W − W̃ ) = 0, we show that x∗ also satisfies condition 2 of Proposition
2.1. To see this, adding the first update x1 = W x0 − α∇f (x0 ) to the subsequent updates
following the formulas of (x2 − x1 ), (x3 − x2 ), . . . , (xk+2 − xk+1 ) given by (2.6) and then
applying telescopic cancellation, we obtain
xk+2 = W̃ xk+1 − α∇f (xk+1 ) +

k+1
X

(W − W̃ )xt ,

(2.9)

(W − W̃ )xt .

(2.10)

t=0

or equivalently,
xk+2 = W xk+1 − α∇f (xk+1 ) +

k
X
t=0

Taking k → ∞, from x∗ = limk→∞ xk and x∗ = W̃ x∗ = W x∗ , it follows that
α∇f (x∗ ) =

∞
X

(W − W̃ )xt .

(2.11)

t=0

Left-multiplying 1T on both sides of (2.11), in light of 1T (W − W̃ ) = 0, we obtain the
condition 2 of Proposition 2.1:
1T ∇f (x∗ ) = 0.

(2.12)

T
To summarize, provided that null{I − W } = span{1}, W̃ = I+W
2 , 1 (W − W̃ ) = 0, and
the continuity of ∇f , if a sequence following EXTRA (2.6) converges to a point x∗ , then by
Proposition 2.1, x∗ is consensual and any of its identical row vectors solves problem (1.1).

6

2.3. The Algorithm EXTRA and its Assumptions. We present EXTRA — an
exact first-order algorithm for decentralized consensus optimization — in Algorithm 1.
Algorithm 1: EXTRA
Choose α > 0 and mixing matrices W ∈ Rn×n and W̃ ∈ Rn×n ;
Pick any x0 ∈ Rn×p ;
1. x1 ← W x0 − α∇f (x0 );
2. for k = 0, 1, · · · do


xk+2 ← (I + W )xk+1 − W̃ xk − α ∇f (xk+1 ) − ∇f (xk ) ;
end for
Breaking to the individual agents, Step 1 of EXTRA performs updates
x1(i) =

n
X

wij x0(j) − α∇fi (x0(i) ),

i = 1, . . . , n,

j=1

and Step 2 at each iteration k performs updates
k+1
xk+2
(i) = x(i) +

n
X
j=1

wij xk+1
(j) −

n
X

h
i
k
w̃ij xk(j) − α ∇fi (xk+1
)
−
∇f
(x
)
,
i
(i)
(i)

i = 1, . . . , n.

j=1

Each agent computes ∇fi (xk(i) ) once for each k and uses it twice for xk+1
and xk+2
(i) . For our
Pn (i) k
recommended choice of W̃ = (W + I)/2, each agent computes j=1 wij x(j) once as well.
Here we formally give the assumptions on the mixing matrices W and W̃ for EXTRA.
All of them will be used in the convergence analysis in the next section.
Assumption 1 (Mixing matrix). Consider a connected network G = {V, E} consisting
of a set of agents V = {1, 2, · · · , n} and a set of undirected edges E. The mixing matrices
W = [wij ] ∈ Rn×n and W̃ = [w̃ij ] ∈ Rn×n satisfy
1. (Decentralized property) If i 6= j and (i, j) 6∈ E, then wij = w̃ij = 0.
2. (Symmetry) W = W T , W̃ = W̃ T .
3. (Null space property) null{W − W̃ } = span{1}, null{I − W̃ } ⊇ span{1}.
4. (Spectral property) W̃  0 and I+W
< W̃ < W .
2
We claim that Parts 2–4 of Assumption 1 imply null{I − W } = span{1} and the eigenvalues of W lie in (−1, 1], which are commonly assumed for DGD. Therefore, the additional
assumptions are merely on W̃ . In fact, EXTRA can use the same W used in DGD and
simply take W̃ = I+W
2 , which satisfies Part 4. It is also worth noting that the recent work
push-DGD [23] relaxes the symmetry condition, yet such relaxation for EXTRA is not trivial
and is our future work.
Proposition 2.2. Parts 2–4 of Assumption 1 imply null{I − W } = span{1} and that
the eigenvalues of W lie in (−1, 1].
Proof. From part 4, we have I+W
< W̃  0 and thus W  −I and λmin (W ) > −1.
2
I+W
Also from part 4, we have 2 < W and thus I  W , which means λmax (W ) ≤ 1. Hence,
all eigenvalues of W (and those of W̃ ) lie in (−1, 1].
7

Now, we show null{I − W } = span{1}. Consider a nonzero vector v ∈ null{I − W },
which satisfies (I − W )v = 0 and thus vT (I − W )v = 0 and vT v = vT W v. From I+W
< W̃
2
T
)v
≥
v
W̃
v,
while
from
W̃
<
W
(part
4)
we
also
get
(part 4), we get vT v = vT ( I+W
2
T
T
T
T
T
v W̃ v ≥ v W v = v v. Therefore, we have v W̃ v = v v or equivalently (W̃ − I)v = 0,
adding which to (I − W )v = 0 yields (W̃ − W )v = 0. In light of null{W − W̃ } = span{1}
(part 3), we must have v ∈ span{1} and thus null{I − W } = span{1}.
2.4. Mixing Matrices. In EXTRA, the mixing matrices W and W̃ diffuse information
throughout the network.
The role of W is the similar as that in DGD [5, 31, 36] and average consensus [33]. It
has a few common choices, which can significantly affect performance.
(i) Symmetric doubly stochastic matrix [5, 31, 36]: W = W T , W 1 = 1, and wij ≥ 0.
Special cases of such matrices include parts (ii) and (iii) below.
(ii) Laplacian-based constant edge weight matrix [28, 33],
W =I−

L
,
τ

where L is the Laplacian matrix of the graph G and τ > 12 λmax (L) is a scaling
parameter. Denote deg(i) as the degree of agent i. When λmax (L) is not available,
τ = maxi∈V {deg(i)} +  for some small  > 0, say  = 1, can be used.
(iii) Metropolis constant edge weight matrix [3, 34],

1
, if (i, j) ∈ E,


 max{deg(i),deg(j)}+
0,
if (i, j) ∈
/ E and i 6= j,
wij =
P


1−
wik ,
if i = j,

k∈V

for some small positive  > 0.
(iv) Symmetric fastest distributed linear averaging (FDLA) matrix. It is a symmetric
W that achieves fastest information diffusion and can be obtained by a semidefinite
program [33].
It is worth noting that the optimal choice for average consensus, FDLA, no longer
appears optimal in decentralized consensus optimization, which is more general.
is found to be very efficient.
When W is chosen following any strategy above, W̃ = I+W
2
2.5. EXTRA as Corrected DGD. We rewrite (2.10) as
k−1
X
xk+1 = W xk − α∇f (xk ) +
(W − W̃ )xt ,
|
{z
} t=0
DGD
|
{z
}

k = 0, 1, · · · .

(2.13)

correction

An EXTRA update is, therefore, a DGD update with a cumulative correction term. In
subsection 2.1, we have argued that the DGD update cannot reach consensus asymptotically
unless α asymptotically vanishes. Since α∇f (xk ) with a fixed α > 0 cannot vanish in general,
it must be corrected, or otherwise xk+1 − W xk does not vanish, preventing xk from being
8

asymptotically consensual. Provided that (2.13) converges, the role of the cumulative term
Pk−1
t
k
⊥
t=0 (W − W̃ )x is to neutralize −α∇f (x ) in (span{1}) , the subspace orthogonal to 1.
T
If a vector v obeys v (W − W̃ ) = 0, then the convergence of (2.13) means the vanishing
of vT ∇f (xk ) in the limit. We need 1T ∇f (xk ) = 0 for consensus optimality. The correction
term in (2.13) is the simplest that we could find so far. In particular, the summation is
necessary since each individual term (W − W̃ )xt is asymptotically vanishing. The terms
must work cumulatively.
3. Convergence Analysis. To establish convergence of EXTRA, this paper makes
two additional but common assumptions as follows. Unless otherwise stated, the results in
this section are given under Assumptions 1–3.
Assumption 2. (Convex objective with Lipschitz continuous gradient) Objective functions fi are proper closed convex and Lipschitz differentiable:
k∇fi (xa ) − ∇fi (xb )k2 ≤ Lfi kxa − xb k2 ,

∀xa , xb ∈ Rp ,

where Lfi ≥ 0 are constant.
Pn
Following Assumption 2, function f (x) = i=1 fi (x(i) ) is proper closed convex, and ∇f
is Lipschitz continuous
k∇f (xa ) − ∇f (xb )kF ≤ Lf kxa − xb kF ,

∀xa , xb ∈ Rn×p ,

with constant Lf = maxi {Lfi }.
Assumption 3. (Solution existence) Problem (1.1) has a nonempty set of optimal
solutions: X ∗ 6= ∅.
3.1. Preliminaries. We first state a lemma that gives the first-order optimality conditions of (1.1).
Lemma 3.1 (First-order optimality conditions). Given mixing matrices W and W̃ ,
define U = (W̃ − W )1/2 by letting U , V S 1/2 V T ∈ Rn×n where V SV T = W̃ − W is
the economical-form singular value decomposition. Then, under Assumptions 1–3, x∗ is
consensual and x∗(1) ≡ x∗(2) ≡ · · · ≡ x∗(n) is optimal to problem (1.1) if and only if there
exists q∗ = U p for some p ∈ Rn×p such that
(
U q∗ + α∇f (x∗ ) = 0,
(3.1)
U x∗ = 0.

(3.2)

Proof. According to Assumption 1 and the definition of U , we have
null{U } = null{V T } = null{W̃ − W } = span{1}.
Hence from Proposition 2.1, condition 1, x∗ is consensual if and only if (3.2) holds.
Next, following Proposition 2.1, condition 2, x is optimal if and only if 1T ∇f (x∗ ) = 0.
Since U is symmetric and U T 1 = 0, (3.1) gives 1T ∇f (x∗ ) = 0. Conversely, if 1T ∇f (x∗ ) = 0,
9

⊥

then ∇f (x∗ ) ∈ span{U } follows from null{U } = (span{1}) and thus α∇f (x∗ ) = −U q for
some q. Let q∗ = ProjU q. Then, U q∗ = U q and (3.1) holds.
Let x∗ and q∗ satisfy the optimality conditions (3.1) and (3.2). Introduce auxiliary
sequence
qk =

k
X

U xt

t=0

and for each k,
qk
xk

zk =

!

q∗
x∗

z∗ =

,

!
,

G=

I 0
0 W̃

!
.

(3.3)

The next lemma establishes the relations among xk , qk , x∗ , and q∗ .
Lemma 3.2. In EXTRA, the quadruple sequence {xk , qk , x∗ , q∗ } obeys

=

(I + W − 2W̃ )(xk+1 − x∗ ) + W̃ (xk+1 − xk )
−U (qk+1 − q∗ ) − α[∇f (xk ) − ∇f (x∗ )],

(3.4)

for any k = 0, 1, · · · .
Proof. Similar to how (2.9) is derived, summing EXTRA iterations 1 through k + 1
x1 = W x0 − α∇f (x0 ),
x2 = (I + W )x1 − W̃ x0 − α∇f (x1 ) + α∇f (x0 ),
··· ,
k+1
k
k−1
x
= (I + W )x − W̃ x
− α∇f (xk ) + α∇f (xk−1 ),
we get
xk+1 = W̃ xk −

k
P

(W̃ − W )xt − α∇f (xk ).

(3.5)

t=0

Using qk+1 =

Pk+1

t=0 U x

t

and the decomposition W̃ − W = U 2 , it follows from (3.5) that

(I + W − 2W̃ )xk+1 + W̃ (xk+1 − xk ) = −U qk+1 − α∇f (xk ).

(3.6)

Since (I + W − 2W̃ )1 = 0, we have
(I + W − 2W̃ )x∗ = 0.

(3.7)

Subtracting (3.7) from (3.6) and adding 0 = U q∗ + α∇f (x∗ ) to (3.6), we obtain (3.4).
The convergence analysis is based on the recursion (3.4). Below we will show that xk

converges to a solution x∗ ∈ X ∗ and kzk+1 − zk k2W̃ converges to 0 at a rate of O k1 in
an ergodic sense. Further assuming (restricted) strong convexity, we obtain the Q-linear
convergence of kzk − z∗ k2G to 0, which implies the R-linear convergence of xk to x∗ .
10

3.2. Convergence and Rate. Let us first interpret the step size condition
α<

2λmin (W̃ )
,
Lf

(3.8)

which is assumed by Theorem 3.3 below. First of all, let W satisfy Assumption 1. It is easy
to ensure λmin (W ) ≥ 0 since otherwise, we can replace W by I+W
2 . In light of part 4 of
I+W
1
Assumption 1, if we let W̃ = 2 , then we have λmin (W̃ ) ≥ 2 , which simplifies the bound
(3.8) to
α<

1
,
Lf

which is independent of any network property (size, diameter, etc.). Furthermore, if Lfi
(i = 1, . . . , n) are in the same order, the bound L1f has the same order as the bound
Pn
1/( n1 i=1 Lfi ), which is used in the (centralized) gradient descent method. In other words,
a fixed and rather large step size is permitted by EXTRA.
(W̃ )
, then
Theorem 3.3. Under Assumptions 1–3, if α satisfies 0 < α < 2λmin
Lf
kzk − z∗ k2G − kzk+1 − z∗ k2G ≥ ζkzk − zk+1 k2G ,

k = 0, 1, . . . ,

(3.9)

where ζ = 1 − 2λ αL(fW̃ ) . Furthermore, zk converges to an optimal z∗ .
min
Proof. Following Assumption 2, ∇f is Lipschitz continuous and thus we have
2α
k
∗ 2
Lf k∇f (x ) − ∇f (x )kF
k
∗
k

≤ 2αhx − x , ∇f (x ) − ∇f (x∗ )i
= 2hxk+1 − x∗ , α[∇f (xk ) − ∇f (x∗ )]i + 2αhxk − xk+1 , ∇f (xk ) − ∇f (x∗ )i.

(3.10)

Substituting (3.4) from Lemma 3.2 for α[∇f (xk ) − ∇f (x∗ )], it follows from (3.10) that
2α
k
∗ 2
Lf k∇f (x ) − ∇f (x )kF
k+1
∗
∗
k+1

≤ 2hx
− x , U (q − q
)i + 2hxk+1 − x∗ , W̃ (xk − xk+1 )i
−2kxk+1 − x∗ k2I+W −2W̃ + 2αhxk − xk+1 , ∇f (xk ) − ∇f (x∗ )i.

(3.11)

For the terms on the right-hand-side of (3.11), we have
2hxk+1 − x∗ , U (q∗ − qk+1 )i = 2hU (xk+1 − x∗ ), q∗ − qk+1 i
(∵ U x∗ = 0)
= 2hU xk+1 , q∗ − qk+1 i
= 2hqk+1 − qk , q∗ − qk+1 i,

(3.12)

2hxk+1 − x∗ , W̃ (xk − xk+1 )i = 2hxk+1 − xk , W̃ (x∗ − xk+1 )i,

(3.13)

and

≤

2αhxk − xk+1 , ∇f (xk ) − ∇f (x∗ )i
αLf
k
k+1 2
k
∗ 2
kF + 2α
2 kx − x
Lf k∇f (x ) − ∇f (x )kF .
11

(3.14)

Plugging (3.12)–(3.14) into (3.11) and recalling the definitions of zk , z∗ , and G, we have
2α
k
∗ 2
Lf k∇f (x ) − ∇f (x )kF
k+1
k
∗
k+1

≤ 2hq
− q ,q − q
i + 2hxk+1 − xk , W̃ (x∗ − xk+1 )i
k+1
∗ 2
k
k+1 2
k
∗ 2
f
−2kx
− x kI+W −2W̃ + αL
kF + 2α
2 kx − x
Lf k∇f (x ) − ∇f (x )kF ,

(3.15)

that is
k
k+1 2
f
0 ≤ 2hzk+1 − zk , G(z∗ − zk+1 )i − 2kxk+1 − x∗ k2I+W −2W̃ + αL
kF .
2 kx − x

(3.16)

Apply the basic equality 2hzk+1 −zk , G(z∗ −zk+1 )i = kzk −z∗ k2G −kzk+1 −z∗ k2G −kzk −zk+1 k2G
to (3.16), we have
0

≤

kzk − z∗ k2G − kzk+1 − z∗ k2G − kzk − zk+1 k2G
k
k+1 2
f
−2kxk+1 − x∗ k2I+W −2W̃ + αL
kF .
2 kx − x

(3.17)

Define
I
0
f
0 W̃ − αL
2 I

0

G =

!
.

By Assumption 1, in particular, I + W − 2W̃ < 0, we have kxk+1 − x∗ k2I+W −2W̃ ≥ 0 and
thus
kzk − z∗ k2G − kzk+1 − z∗ k2G ≥ kzk − zk+1 kG −

αLf k
kx − xk+1 k2F = kzk − zk+1 kG0 .
2

(W̃ )
, we have G0  0 and
Since α < 2λmin
Lf

kzk − zk+1 k2G0 ≥ ζkzk − zk+1 k2G ,

(3.18)

which gives (3.9).
It shows from (3.9) that for any optimal z∗ , kzk − z∗ k2G is bounded and contractive,
so kzk − z∗ k2G is converging as kzk − zk+1 k2G → 0. The convergence of zk to a solution z∗
follows from the standard analysis for contraction methods; see, for example, Theorem 3
in [11].
To estimate the rate of convergence, we need the following result.
P∞
Proposition 3.4. If a sequence {ak } ⊂ R obeys: ak ≥ 0 and t=1 at < ∞, then we


P
k
have1 : (i) limk→∞ ak = 0; (ii) k1 t=1 at = O k1 ; (iii) mint≤k {at } = o k1 .
Pk
Proof. Part (i) is obvious. Let bk , k1 t=1 at . By the assumptions, kbk is uniformly
bounded and obeys
lim kbk < ∞,

k→∞
1 Part (iii) is due to [6].

12

from which part (ii) follows. Since ck , min{at } is monotonically non-increasing, we have
t≤k

kc2k = k × min {at } ≤
t≤2k

2k
X

at .

t=k+1


P2k
This and the fact that limk→∞ t=k+1 at → 0 give us ck = o k1 or part (iii).
Theorem 3.5. In the same setting of Theorem 3.3, the following rates hold:
(1) Running-average progress:
k

1X t
kz − zt+1 k2G = O
k t=1

 
1
;
k


min kzt − zt+1 k2G = o

 
1
;
k

(2) Running-best progress:

t≤k

(3) Running-average optimality residuals:
k

1X
kU qt + α∇f (xt )k2W̃ = O
k t=1

 
 
k
1
1
1X
t 2
kU x kF = O
and
;
k
k t=1
k

(4) Running-best optimality residuals:


t

min kU q
t≤k

t

+ α∇f (x )k2W̃

 
 

1
1
t 2
and min kU x kF = o
;
=o
t≤k
k
k

Proof. Parts (1) and (2): Since the individual terms kzk − z∗ k2G converge to 0, we are
able to sum (3.9) in Theorem 3.3 over k = 0 through ∞ and apply the telescopic cancellation,
i.e.,
∞
P
t=0

kzt − zt+1 k2G = 1δ

∞
P
t=0

 kz −z∗ k2
kzt − z∗ k2G − kzt+1 − z∗ k2G = 0 δ G < ∞.

(3.19)

Then, the results follow from Proposition 3.4 immediately.
Parts (3) and (4): The progress kzk − zk+1 k2G can be interpreted as the residual to the
first-order optimality condition. In light of the first-order optimality conditions (3.1) and
(3.2) in Lemma 3.1, the optimality residuals are defined as kU qk +α∇f (xk )k2W̃ and kU xk k2F .
Furthermore, k α1 1T (U qk + α∇f (xk ))k22 = k∇f1 (xk(1) ) + ... + ∇fn (xk(n) )k22 is the violation to
the first-order optimality of (1.1), while kU xk k2F is the violation of consensus. Below we
obtain the convergence rates of the optimality residuals.
1
Using the basic inequality ka + bk2F ≥ ρ1 kak2F − ρ−1
kbk2F which holds for any ρ > 1 and
any matrices a and b of the same size, it follows that
kzk − zk+1 k2G

= kqk − qk+1 k2F + kxk − xk+1 k2W̃
= kxk+1 k2W̃ −W + k(I − W̃ )xk + U qk + α∇f (xk )k2W̃
≥

1
kxk+1 k2W̃ −W + ρ1 kU qk + α∇f (xk )k2W̃ − ρ−1
k(I − W̃ )xk k2W̃ .

13

(3.20)

Since W̃ − W and (I − W̃ )W̃ (I − W̃ ) are symmetric and
null{W̃ − W } ⊆ null{(I − W̃ )W̃ (I − W̃ )},
there exists a bounded υ > 0 such that k(I − W̃ )xk k2W̃ = kxk k2(I−W̃ )W̃ (I−W̃ ) ≤ υkxk k2W̃ −W .
It follows from (3.20) that
1
k

≥

=

1
k

k
P

kzt − zt+1 k2G + k1 kx1 k2W̃ −W

t=1
k 
P


υ
kxt+1 k2W̃ −W − ρ−1
kxt k2W̃ −W + k1 kx1 k2W̃ −W

t=1
k
P
1
t
t 2
+ k1
ρ kU q + α∇f (x )kW̃ (Set ρ > υ + 1)
t=1
k
k
P
P
1
υ
1
t
t 2
(1 − ρ−1
)kU xt k2F + k1 kxk+1 k2W̃ −W + k1
k
ρ kU q + α∇f (x )kW̃ .
t=1
t=1

(3.21)


Pk
Pk
As part (1) shows that k1 t=1 kzt −zt+1 k2G = O k1 , we have k1 t=1 kU qt +α∇f (xt )k2W̃ =


Pk
O k1 and k1 t=1 kU xt k2F = O k1 .
From (3.21) and (3.19), we see that both kU qt +α∇f (xt )k2W̃ and kU xt k2F are summable.

Again, by Proposition 3.4, we have part (4), the o k1 rate of running best first-order
optimality residuals.
It is open whether kzk − zk+1 k2G is monotonic or not. If one can show its monotonicity,
then the convergence rates will hold for the last point in the running sequence.
3.3. Linear Convergence under Restricted Strong Convexity. In this subsection we prove that EXTRA with a proper step size reaches linear convergence if the original
objective f¯ is restricted strongly convex.
A convex function h : Rp → R is strongly convex if there exists µ > 0 such that
h∇h(xa ) − ∇h(xb ), xa − xb i ≥ µkxa − xb k2 ,

∀xa , xb ∈ Rp .

h is restricted strongly convex 2 with respect to point x̃ if there exists µ > 0 such that
h∇h(x) − ∇h(x̃), x − x̃i ≥ µkx − x̃k22 ,

∀x ∈ Rp .

For proof convenience, we introduce function
g(x) , f (x) +

1
kxk2W̃ −W
4α

and claim that f¯ is restricted strongly convex with respect to its solution x∗ if, and only if,
g is so with respect to x∗ = 1(x∗ )T .
Proposition 3.6. Under Assumptions 1 and 2, the following two statements are equivalent:
Pn
(i) The original objective f¯(x) = n1 i=1 fi (x) is restricted strongly convex with respect
to x∗ ;
2 There are different definitions of restricted strong convexity. Ours is derived from [17].

14

1
(ii) The penalized function g(x) = f (x) + 4α
kxk2W̃ −W is restricted strongly convex with
respect to x∗ .
In addition, the strong convexity constant of g is no less than that of f¯.
See Appendix A for its proof.
1
Theorem 3.7. If g(x) , f (x) + 4α
kxk2W̃ −W is restricted strongly convex with respect

to x∗ with constant µg > 0, then with proper step size α <
such that the sequence {zk } generated by EXTRA satisfies

2µg λmin (W̃ )
, there exists δ > 0
Lf 2

kzk − z∗ k2G ≥ (1 + δ)kzk+1 − z∗ k2G .

(3.22)

That is, kzk − z∗ k2G converges to 0 at the Q-linear rate O (1 + δ)−k . Consequently, kxk −

x∗ k2W̃ converges to 0 at the R-linear rate O (1 + δ)−k .
Proof. Toward a lower bound of kzk − z∗ k2G − kzk+1 − z∗ k2G : From the definition of
g and its restricted strong convexity, we have
2αµg kxk+1 − x∗ k2F

2αhxk+1 − x∗ , ∇g(xk+1 ) − ∇g(x∗ )i
kxk+1 − x∗ k2W̃ −W + 2αhxk+1 − x∗ , ∇f (xk+1 ) − ∇f (xk )i

≤
=

+2hxk+1 − x∗ , α[∇f (xk ) − ∇f (x∗ )]i.
(3.23)
∗

k

Using Lemma 3.2 for α[∇f (x ) − ∇f (x )] in (3.23), we get
2αµg kxk+1 − x∗ k2F
≤ kxk+1 − x∗ k2W̃ −W + 2αhxk+1 − x∗ , ∇f (xk+1 ) − ∇f (xk )i − 2kxk+1 − x∗ k2I+W −2W̃
+2hxk+1 − x∗ , U (q∗ − qk+1 )i + 2hxk+1 − x∗ , W̃ (xk − xk+1 )i
= kxk+1 − x∗ k2(W̃ −W )−2(I+W −2W̃ ) + 2αhxk+1 − x∗ , ∇f (xk+1 ) − ∇f (xk )i
+2hxk+1 − x∗ , U (q∗ − qk+1 )i + 2hxk+1 − x∗ , W̃ (xk − xk+1 )i.
(3.24)
For the last three terms on the right-hand side of (3.24), we have from Young’s inequality
2αhxk+1 − x∗ , ∇f (xk+1 ) − ∇f (xk )i
≤ αηkxk+1 − x∗ k2F + αη k∇f (xk+1 ) − ∇f (xk )k2F
≤ αηkx

k+1

− x∗ k2F + αLηf

2

kx

k+1

(3.25)

− xk k2F ,

where η > 0 is a tunable parameter and
2hxk+1 − x∗ , U (q∗ − qk+1 )i = 2hqk+1 − qk , q∗ − qk+1 i,

(3.26)

2hxk+1 − x∗ , W̃ (xk − xk+1 )i = 2hxk+1 − xk , W̃ (x∗ − xk+1 )i.

(3.27)

and

Plugging (3.25)–(3.27) into (3.24) and recalling the definition of zk , z∗ , and G, we obtain
2αµg kxk+1 − x∗ k2F
≤ kxk+1 − x∗ k2(W̃ −W )−2(I+W −2W̃ ) + αηkxk+1 − x∗ k2F
2

+ αLηf kxk+1 − xk k2F + 2hzk+1 − zk , G(z∗ − zk+1 )i.
15

(3.28)

By 2hzk+1 − zk , G(z∗ − zk+1 ) = kzk − z∗ k2G − kzk+1 − z∗ k2G − kzk − zk+1 k2G , (3.28) turns
into
kzk − z∗ k2G − kzk+1 − z∗ k2G

≥

kxk+1 − x∗ k2α(2µ −η)I−(W̃ −W )+2(I+W −2W̃ )
g

2

+kzk − zk+1 k2G − αLηf kxk − xk+1 k2F .

(3.29)

A critical inequality: In order to establish (3.22), in light of (3.29), it remains to show
2

kxk+1 − x∗ k2α(2µ −η)I−(W̃ −W )+2(I+W −2W̃ ) + kzk − zk+1 k2G − αLηf kxk − xk+1 k2F
g

≥

δkzk+1 − z∗ k2G .
(3.30)
∗

k

k+1

With the terms of z , z in (3.30) expanded and from kx
kU xk+1 k2F = kqk+1 − qk k2F , (3.30) is equivalent to

−x∗ k2W̃ −W = kU (xk+1 −x∗ )k2F =

kxk+1 − x∗ k2α(2µ −η)I+2(I+W −2W̃ )−δW̃ + kxk − xk+1 k2

W̃ −

g

αLf 2
I
η

≥ δkqk+1 − q∗ k2F ,

(3.31)
which is what remains to be shown below. That is, we must find a upper bound for kqk+1 −
q∗ k2F in terms of kxk+1 − x∗ k2F and kxk − xk+1 k2F .
Establishing (3.31), Step 1: From Lemma 3.2 we have
kU (qk+1 − q∗ )k2F
= k(I + W − 2W̃ )(xk+1 − x∗ ) + W̃ (xk+1 − xk ) + α[∇f (xk ) − ∇f (x∗ )]k2F
(3.32)
= k(I + W − 2W̃ )(xk+1 − x∗ ) + α[∇f (xk+1 ) − ∇f (x∗ )]
+W̃ (xk+1 − xk ) + α[∇f (xk ) − ∇f (xk+1 )]k2F .




β
γ
θ
2
2
From the inequality ka + b + c + dk2F ≤ θ β−1
kak2F + βkbk2F + θ−1
γ−1 kckF + γkdkF ,
which holds for any θ > 1, β > 1, γ > 1 and any matrices a, b, c, d of the same dimensions,
it follows that
kqk+1 − q∗ k2W̃ −W


β
≤ θ β−1
kxk+1 − x∗ k2(I+W −2W̃ )2 + βα2 k∇f (xk+1 ) − ∇f (x∗ )k2F


γ
θ
k
k+1 2
+ θ−1
kW̃ 2 + γα2 k∇f (xk ) − ∇f (xk+1 )k2F .
γ−1 kx − x

(3.33)

By Lemma 3.1 and the definition of qk , all the columns of q∗ and qk+1 lie in the column
space of W̃ − W . This together with the Lipschitz continuity of ∇f (x) turns (3.33) into

≤

kqk+1 − q∗ k2F

2



βλmax ((I+W −2W̃ ) )
θ
+ βα2 Lf 2 kxk+1 − x∗ k2F
λ̃min (W̃ −W )
 β−1

(W̃ 2 )
2
2
+ (θ−1)λ̃ θ (W̃ −W ) γλmax
+
γα
L
kxk − xk+1 k2F ,
f
γ−1
min

(3.34)

where λ̃min (·) gives the smallest nonzero eigenvalue. To make a rather tight bound, we
(W̃ )
−2W̃ )
choose γ = 1 + σmax
and β = 1 + σmax (I+W
in (3.34) and obtain
αLf
αLf
kqk+1 − q∗ k2F
≤

2
θ(σmax (I+W −2W̃ )+αLf )2
max (W̃ )+αLf )
kxk+1 − x∗ k2F + θ(σ
kxk − xk+1 k2F .
λ̃min (W̃ −W )
(θ−1)λ̃min (W̃ −W )

16

(3.35)

Establishing (3.31), Step 2: In order to establish (3.31), with (3.35), it only remains to
show
kxk+1 − x∗ k2α(2µ −η)I+2(I+W −2W̃ )−δW̃ + kxk − xk+1 k2 αLf 2
g
W̃ − η I


θ(σmax (W̃ )+αLf )2
θ(σmax (I+W −2W̃ )+αLf )2
k+1
∗ 2
k
k+1 2
kx
−
x
k
+
kx
−
x
k
.
≥ δ
F
F
λ̃
(W̃ −W )
(θ−1)λ̃
(W̃ −W )
min

(3.36)

min

To validate (3.36), we need

2
 α(2µg − η)I + 2(I + W − 2W̃ ) − δ W̃ < δθ(σmax (I+W −2W̃ )+αLf ) I,
λ̃
(W̃ −W )
2

2

min

max (W̃ )+αLf )
W̃ − αLηf I < δθ(σ
I,
(θ−1)λ̃
(W̃ −W )



(3.37)

min

which holds as long as
n
α(2µg −η)λ̃min (W̃ −W )
δ ≤ min θ(σ (I+W −2W̃ )+αL
)2 +λ
(W̃ )λ̃

o
(θ−1)(ηλmin (W̃ )−αLf 2 )λ̃min (W̃ −W )
.
,
2
θη(σmax (W̃ )+αLf )
max
max
min (W̃ −W )
f
(3.38)
To ensure δ > 0, the following conditions are what we finally need:


(W̃ )
η ∈ (0, 2µg ) and α ∈ 0, ηλmin
, S.
(3.39)
2
Lf
Obviously set S is nonempty. Therefore, with a proper step size α ∈ S, the sequences

kzk − z∗ k2G is Q-linearly convergent to 0 at the rate O (1 + δ)−k . Since the definition of
G-norm implies kxk − x∗ k2W̃ ≤ kzk − z∗ k2G , kxk − x∗ k2W̃ is R-linearly convergent to 0 at the
same rate.
Remark 1 (Strong convexity condition for linear convergence). The restricted strong
1
convexity assumption in Theorem 3.7 is imposed on g(x) = f (x) + 4α
kxk2W̃ −W , not on f (x).
In other words, the linear convergence of EXTRA does not require all fi to be individually
(restricted) strongly convex.
Remark 2 (Acceleration by overshooting W̃ ). For conciseness, we used Assumption
1 for both Theorems 3.5 and 3.7. In fact, for Theorem 3.7, the condition I+W
< W̃ in
2
I+W
part 4 of Assumption 1 can be relaxed, thanks to µg in (3.37). Certain W̃ < 2 , such as
W̃ = 1.5I+W
, can still give linear convergence. In fact, we observed that such an “overshot”
2.5
choice of W̃ can slightly accelerate the convergence of EXTRA.
Remark 3 (Step size optimization). We tried deriving an optimal step size and corresponding explicit linear convergence rate by optimizing certain quantities that appear in
the proof, but it becomes quite tricky and messy. For the special case W̃ = I+W
2 , by taking
µg (1+λmin (W ))
η → µg , we get a satisfactory step size α →
.
4Lf 2
Remark 4 (Step size for
linear convergence). Interestingly, the critical step
 ensuring

2µ λ
(W̃ )
µg
=
O
,
in
(3.39)
for ensuring the linear convergence, and the
size, α < g Lmin
2
Lf 2
f


µ
λ̃min (W̃ −W )
parameter α = 2(1+
= O Lfg2 in (A.7) for ensuring the restricted strong
1
)(µ ¯−2L γ)
γ2

f

f

convexity with O(µg ) = O(µf¯), have the same order.
 
On the other hand, we numerically observed that a step size as large as O L1f still
leads to linear convergence, and EXTRA becomes faster with this larger step size. It remains
an open question to prove linear convergence under this larger step size.
17

3.4. Decentralized implementation. We shall explain how to perform EXTRA with
only local computation and neighbor communication. EXTRA’s formula is formed by ∇f (x),
W x and W̃ x, and α. By definition ∇f (x) is local computation. Assumption 1 part 1 ensures
that W x and W̃ x can be computed with local and neighbor information. Following our
convergence theorems above, determining α requires the bounds on Lf and λmin (W̃ ), as well
as that of µg in the (restricted) strongly convex case. As we have argued at the beginning
of Subsection 3.2, it is easy to ensure λmin (W̃ ) ≥ 21 , so λmin (W̃ ) can be conservatively set
as 12 . To obtain Lf = maxi {Lfi }, a maximum consensus algorithm is needed. On the other
hand, it is tricky to determine µg or its lower bound µf¯, except in the case that each fi is
(restricted) strongly convex, we can conservatively use mini {µfi }. When no bound µg is
available in the (restricted) strongly convex case, setting α according to the general convex
case (subsection 3.2) often still leads to linear convergence.

4. Numerical Experiments.

4.1. Decentralized Least Squares. Consider a decentralized sensing problem: each
agent i ∈ {1, · · · , n} holds its own measurement equation, y(i) = M(i) x + e(i) , where y(i) ∈
Rmi and M(i) ∈ Rmi ×p are measured data, x ∈ Rp is unknown signal, and e(i) ∈ Rmi is
unknown noise. The goal is to estimate x. We apply the least squares loss and try to solve
n

minimize f¯(x) =
x

1X1
kM(i) x − y(i) k22 .
n i=1 2

The network in this experiment is randomly generated with connectivity ratio r = 0.5, where
r is defined as the number of edges divided by L(L−1)
, the number of all possible ones. We
2
set n = 10, mi = 1, ∀i, p = 5. Data y(i) and M(i) , as well as noise e(i) , ∀i, are generated
following the standard normal distribution. We normalize the data so that Lf = 1. The
algorithm starts from x0(i) = 0, ∀i, and kx∗ − x0(i) k = 300.
We use the same matrix W by strategy (iv) in Section 2.4 for both DGD and EXTRA.
For EXTRA, we simply use the aforementioned matrix W̃ = I+W
2 . We run DGD with a
0
α
fixed step size α, a diminishing one k1/3 [14], a diminishing one kα1/3 with hand-optimized
0
α
α0 , a diminishing one k1/2
[5], and a diminishing one kα1/2 with hand-optimized α0 , where
α is the theoretical critical step size given in [36]. We let EXTRA use the same fixed step
size α.
The numerical results are illustrated in Fig. 4.1. In this experiment, we observe that
both DGD with the fixed step size and EXTRA show similar linear convergence in the first
200 iterations. Then DGD with the fixed step size begins to slow down and eventually stall,
and EXTRA continues its progress.
18

0

10

10
3

−2

10

1

Residual

8

4

5

−4

10

−6

10
2

DGD with fixed α
DGD with αk = α/k1/3
DGD with αk = 3α/k1/3
DGD with αk = α/k1/2
DGD with αk = 5α/k1/2
EXTRA with fixed α

−8

10
6
7

−10

10

9

0

500

1000

1500

2000

2500

3000

k
kxk −x∗ k

Fig. 4.1. Plot of residuals kx0 −x∗ kF . Constant α = 0.5276 is the theoretical critical step size given
F

for DGD in [36]. For DGD with diminishing step sizes O(1/k1/3 ) and O(1/k1/2 ), we have hand-optimized
their initial step sizes as 3α and 5α, respectively.

4.2. Decentralized Robust Least Squares. Consider the same decentralized sensing setting and network as in Section 4.1. In this experiment, we use the Huber loss, which
is known to be robust to outliers, and it allows us to observe both sublinear and linear
convergence. We call the problem as decentralized robust least squares:



mi
n X

X
1
Hξ (M(i)j x − y(i)j ) ,
minimize f¯(x) =
x

n i=1  j=1

where M(i)j is the j-th row of matrix M(i) and y(i)j is the j-th entry of vector y(i) . The
Huber loss function Hξ is defined as

(
Hξ (a) =

1 2
2a ,
ξ(|a| − 12 ξ),

for |a| ≤ ξ,
otherwise,

(`22 zone),
(`1 zone).

We set ξ = 2. The optimal solution x∗ is artificially set in the `22 zone while x0(i) is set in
the `1 zone at all agents i.
Except for new hand-optimized initial step sizes for DGD’s diminishing step sizes, all
other algorithmic parameters remain unchanged from the last test.
The numerical results are illustrated in Fig. 4.2. EXTRA has sublinear convergence for
the fist 1000 iterations and then begins linear convergence, as xk(i) for most i enter the `22
zone.
19

0

10

−2

Residual

10

−4

10

−6

10

DGD with fixed α
DGD with αk = α/k1/3
DGD with αk = 10α/k1/3
DGD with αk = α/k1/2
DGD with αk = 20α/k1/2
EXTRA with fixed α

−8

10

−10

10

0

500

1000

1500

2000

2500

3000

k
kxk −x∗ k

Fig. 4.2. Plot of residuals kx0 −x∗ kF . Constant α = 0.5276 is the theoretical critical step size given
F

for DGD in [36]. For DGD with diminishing step sizes O(1/k1/3 ) and O(1/k1/2 ), we have hand-optimized
their initial step sizes as 10α and 20α, respectively. The initial large step sizes have helped them (the red
and purple curves) realize faster convergence initially.

4.3. Decentralized Logistic Regression. Consider the decentralized logistic regression problem:

minimize f¯(x) =
x


mi
n 
X
1 X
1
n i=1  mi j=1



ln 1 + exp −(M(i)j x)y(i)j
,



where every agent i holds its training date M(i)j , y(i)j ∈ Rp × {−1, +1}, j = 1, · · · , mi ,
including explanatory/feature variables M(i)j and binary output/outcome y(i)j . To simplify
the notation, we set the last entry of every M(i)j to 1 thus the last entry of x will yield the
offset parameter of the logistic regression model.
We show a decentralized logistic regression problem solved by DGD and EXTRA over
a medium-scale network. The settings are as follows. The connected network is randomly
generated with n = 200 agents and connectivity ratio r = 0.2. Each agent holds 10 samples,
i.e., mi = 10, ∀i. The agents shall collaboratively obtain p = 20 coefficients via logistic
regression. All the 2000 samples are randomly generated, and the reference (ground true)
logistic classifier x∗ is pre-computed with a centralized method. As it is easy to implement
in practice, we use the Metropolis constant edge weight matrix W , which is mentioned by
strategy (iii) in Section 2.4, with  = 1, and we use W̃ = I+W
2 . The numerical results are
illustrated in Fig. 4.3. EXTRA outperforms DGD, showing linear and exact convergence
to the reference logistic classifier x∗ .
20

0

10

−2

Residual

10

−4

10

−6

10

DGD with fixed α
DGD with αk = α/k 1/3
DGD with αk = 10α/k 1/3
DGD with αk = α/k 1/2
DGD with αk = 20α/k 1/2
EXTRA with fixed α

−8

10

−10

10

0

2000

4000

6000

8000

10000

k
kxk −x∗ k

Fig. 4.3. Plot of residuals kx0 −x∗ kF . Constant α = 0.0059 is the theoretical critical step size given
F

for DGD in [36]. For DGD with diminishing step sizes O(1/k1/3 ) and O(1/k1/2 ), we have hand-optimized
their initial step sizes as 10α and 20α, respectively.

5. Conclusion. As one of the fundamental method, gradient descent has been adapted
to decentralized optimization, giving rise to simple and elegant iterations. In this paper, we
attempted to address a dilemma or deficiency of the current decentralized gradient descent
method: to obtain an accurate solution, it works slowly as it must use a small step size
or iteratively diminish the step size; a large step size will lead to faster convergence to,
however, an inaccurate solution. Our solution is an exact first-order algorithm, EXTRA,
which uses a fixed large step size and quickly returns an accurate solution. The claim is
supported by both theoretical convergence and preliminary numerical results. On the other
hand, EXTRA is far from perfect, and more work is needed to adapt it to the asynchronous
and dynamic network settings. They are interesting open questions for future work.
Appendix A. Proof of Proposition 3.6.
Proof.
“(ii) ⇒ (i)”: By definition of restricted strong convexity, there exists µg > 0 so that for
any x,
µg kx − x∗ k2F

≤ h∇g(x) − ∇g(x∗ ), x − x∗ i
1
= h∇f (x) − ∇f (x∗ ), x − x∗ i + 2α
kx − x∗ k2W̃ −W .

(A.1)

For any x ∈ Rp , set x = 1xT , and from the above inequality, we get
µg kx − x∗ k22

≤

1
n

n
P

h∇fi (x) − ∇fi (x∗ ), x − x∗ i

i=1

= h∇f¯(x) − ∇f¯(x∗ ), x − x∗ i.
Therefore, f¯(x) is restricted strongly convex with a constant µf¯ , µg .
“(i) ⇒ (ii)”: For any x ∈ Rn×p , decompose
x=u+v
21

(A.2)

so that every column of u belongs to span{1} (i.e., u is consensual) while that of v belongs to
⊥
span{1} . Such an orthogonal decomposition obviously satisfies kxk2F = kuk2F +kvk2F . Since
solution x∗ is consensual and thus hu−x∗ , vi = 0, we also have kx−x∗ k2F = ku−x∗ k2F +kvk2F .
In addition, being consensual, u = 1uT for some u ∈ Rp . From the inequalities
n

h∇f (u) − ∇f (x∗ ), u − x∗ i = n

1X
h∇fi (u) − ∇fi (x∗ ), u − x∗ i
n i=1

≥ nµf¯ku − x∗ k22 = µf¯ku − x∗ k2F ,
h∇f (x) − ∇f (u), x − ui ≥ 0,
h∇f (u) − ∇f (x∗ ), x − ui ≥ −Lf ku − x∗ kF kvkF ,
h∇f (x) − ∇f (u), u − x∗ i ≥ −Lf kvkF ku − x∗ kF ,
we get
h∇f (x) − ∇f (x∗ ), x − x∗ i
= h∇f (u) − ∇f (x∗ ), u − x∗ i + h∇f (x) − ∇f (u), x − ui
+h∇f (u) − ∇f (x∗ ), x − ui + h∇f (x) − ∇f (u), u − x∗ i
≥ µf¯ku − x∗ k2F − 2Lf ku − x∗ kF kvkF .

(A.3)

In addition, from the fact that u − x∗ ∈ null{W̃ − W } and v ∈ span{W̃ − W }, it follows
that
λ̃min (W̃ −W )
1
1
∗ 2
2
kvk2F ,
2α kx − x kW̃ −W = 2α kvkW̃ −W ≥
2α

(A.4)

where λ̃min (·) gives the smallest nonzero eigenvalue of a positive semidefinite matrix.
Pick any γ > 0. When kvkF ≤ γku − x∗ kF , it follows that
=

h∇g(x) − ∇g(x∗ ), x − x∗ i
1
h∇f (x) − ∇f (x∗ ), x − x∗ i + 2α
kx − x∗ k2W̃ −W

≥

W̃ −W )
kvk2F
µf¯ku − x∗ k2F − 2Lf ku − x∗ kF kvkF + λ̃min (2α

≥

λ̃min (W̃ −W )
∗ 2
(µf¯ −
kvk2F
n 2Lf γ)ku − x kF +
o2α
W̃ −W )
min µf¯ − 2Lf γ, λ̃min (2α
kx − x∗ k2F .

≥

(by (A.3) and (A.4))

(A.5)
When kvkF ≥ γku − x∗ kF , it follows that
h∇g(x) − ∇g(x∗ ), x − x∗ i
1
= h∇f (x) − ∇f (x∗ ), x − x∗ i + 2α
kx − x∗ k2W̃ −W
W̃ −W )
≥ 0 + λ̃min (2α
kvk2F

(applied convexity of f and (A.4))

≥

λ̃min (W̃ −W )
(W̃ −W )
kvk2F + λ̃min
ku − x∗ k2F
2α(1+ γ12 )
2α(1+ γ12 )

=

λ̃min (W̃ −W )
kx − x∗ k2F .
2α(1+ γ12 )

(A.6)

Finally, in all conditions,

≥

∗
∗
h∇g(x)
 − ∇g(x ), x − x i 
(W̃ −W )
min µf¯ − 2Lf γ, λ̃min
kx − x∗ k2F , µg kx − x∗ k2F .
2α(1+ 1 )
γ2

22

(A.7)

µ¯

By, for example, setting γ = 4Lff , we have µg > 0. Hence, function g is restricted strongly
convex for any α > 0 as long as function f¯ is restricted strongly convex.
In the direction of “(ii) ⇒ (i)”, we find µg < µf¯, unlike the more pleasant µf¯ = µg in
the other direction. However, from (A.7), we have
sup µg = lim µg
γ,α

γ→0+

α=

λ̃min (W̃ −W )
2(1+ 12 )(µ ¯−2Lf γ)
f
γ

= µf¯,

which means that µg can be arbitrarily close to µf¯ as α goes to zero. On the other hand, just
 


µ¯
µf¯
λ̃min (W̃ −W )
to have O(µg ) = O(µf¯), we can set γ = O Lff and α = 2(1+
=
O
=
2
1
Lf
)(µf¯−2Lf γ)
γ2


µ
O Lfg2 . This order of α coincides, in terms of order of magnitude, with the critical step
size for ensuring the linear convergence.
REFERENCES
[1] J. Bazerque and G. Giannakis, Distributed Spectrum Sensing for Cognitive Radio Networks by
Exploiting Sparsity, IEEE Transactions on Signal Processing, 58 (2010), pp. 1847–1862. 1
[2] J. Bazerque, G. Mateos, and G. Giannakis, Group-Lasso on Splines for Spectrum Cartography,
IEEE Transactions on Signal Processing, 59 (2011), pp. 4648–4663. 1
[3] S. Boyd, P. Diaconis, and L. Xiao, Fastest Mixing Markov Chain on a Graph, SIAM Review, 46
(2004), pp. 667–689. 2.4
[4] T. Chang, M. Hong, and X. Wang, Multi-Agent Distributed Optimization via Inexact Consensus
ADMM, arXiv preprint arXiv:1402.6065, (2014). 1.1
[5] I. Chen, Fast Distributed First-Order Methods, master’s thesis, Department of Electrical Engineering
and Computer Science, Massachusetts Institute of Technology, 2012. 1.1, 2.1, 2.4, 4.1
[6] D. Davis and W. Yin, Convergence Rates of Splitting Algorithms for Optimization. arXiv preprint
arXiv:1406.4834, 2014. 1
[7] A. Dimakis, S. Kar, M. R. J. Moura, and A. Scaglione, Gossip Algorithms for Distributed Signal
Processing, Proceedings of the IEEE, 98 (2010), pp. 1847–1864. 1
[8] J. Duchi, A. Agarwal, and M. Wainwright, Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling, IEEE Transactions on Automatic Control, 57 (2012),
pp. 592–606. 1.1
[9] P. Forero, A. Cano, and G. Giannakis, Consensus-Based Distributed Support Vector Machines,
Journal of Machine Learning Research, 59 (2010), pp. 1663–1707. 1
[10] L. Gan, U. Topcu, and S. Low, Optimal Decentralized Protocol for Electric Vehicle Charging, IEEE
Transactions on Power Systems, 28 (2013), pp. 940–951. 1
[11] B. He, A New Method for A Class of Linear Variational Inequalities, Mathematical Programming,
66 (1994), pp. 137–144. 3.2
[12] F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem, Explicit Convergence Rate of a Distributed
Alternating Direction Method of Multipliers, arXiv preprint arXiv:1312.1085, (2013). 1.1
[13] D. Jakovetic, J. Moura, and J. Xavier, Linear Convergence Rate of Class of Distributed Augmented
Lagrangian Algorithms, arXiv preprint arXiv:1307.2482, (2013). 1.1
[14] D. Jakovetic, J. Xavier, and J. Moura, Fast Distributed Gradient Methods, IEEE Transactions on
Automatic Control, 59 (2014), pp. 1131–1146. 1.1, 2.1, 4.1
[15] B. Johansson, On Distributed Optimization in Networked Systems, PhD thesis, KTH, 2008. 1
[16] V. Kekatos and G. Giannakis, Distributed Robust Power System State Estimation, IEEE Transactions on Power Systems, 28 (2013), pp. 1617–1626. 1
[17] M. Lai and W. Yin, Augmented `1 and Nuclear-Norm Models with a Globally Linearly Convergent
Algorithm, SIAM Journal on Imaging Sciences, 6 (2013), pp. 1059–1091. 2
23

[18] Q. Ling and Z. Tian, Decentralized Sparse Signal Recovery for Compressive Sleeping Wireless Sensor
Networks, IEEE Transactions on Signal Processing, 58 (2010), pp. 3816–3827. 1
[19] Q. Ling, Z. Wen, and W. Yin, Decentralized Jointly Sparse Recovery by Reweighted `q Minimization,
IEEE Transactions on Signal Processing, 61 (2013), pp. 1165–70. 1
[20] Q. Ling, Y. Xu, W. Yin, and Z. Wen, Decentralized Low-rank Matrix Completion, in Proceedings
of the 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, 2012,
pp. 2925–2928. 1
[21] I. Matei and J. Baras, Performance Evaluation of the Consensus-Based Distributed Subgradient
Method under Random Communication Topologies, IEEE Journal of Selected Topics in Signal
Processing, 5 (2011), pp. 754–771. 1.1
[22] G. Mateos, J. Bazerque, and G. Giannakis, Distributed Sparse Linear Regression, IEEE Transactions on Signal Processing, 58 (2010), pp. 5262–5276. 1
[23] A. Nedic and A. Olshevsky, Distributed Optimization over Time-Varying Directed Graphs, in The
52nd IEEE Annual Conference on Decision and Control, 2013, pp. 6855–6860. 1.1, 2.3
[24]
, Stochastic Gradient-Push for Strongly Convex Functions on Time-Varying Directed Graphs,
arXiv preprint arXiv:1406.2075, (2014). 1.1
[25] A. Nedic and A. Ozdaglar, Distributed Subgradient Methods for Multi-agent Optimization, IEEE
Transactions on Automatic Control, 54 (2009), pp. 48–61. 1.1
[26] J. Predd, S. Kulkarni, and H. Poor, A Collaborative Training Algorithm for Distributed Learning,
IEEE Transactions on Information Theory, 55 (2009), pp. 1856–1871. 1
[27] S. Ram, A. Nedic, and V. Veeravalli, Distributed Stochastic Subgradient Projection Algorithms for
Convex Optimization, Journal of Optimization Theory and Applications, 147 (2010), pp. 516–545.
1.1
[28] A. Sayed, Diffusion Adaptation over Networks, arXiv preprint arXiv:1205.4220, (2012). 2.4
[29] I. Schizas, A. Ribeiro, and G. Giannakis, Consensus in Ad Hoc WSNs with Noisy Links–Part
I: Distributed Estimation of Deterministic Signals, IEEE Transactions on Signal Processing, 56
(2008), pp. 350–364. 1
[30] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, On the Linear Convergence of the ADMM in Decentralized Consensus Optimization, IEEE Transactions on Signal Processing, 62 (2014), pp. 1750–
1761. 1.1
[31] J. Tsitsiklis, Problems in Decentralized Decision Making and Computation, PhD thesis, Department
of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1984. 2.4
[32] E. Wei and A. Ozdaglar, On the O(1/k) Convergence of Asynchronous Distributed Alternating
Direction Method of Multipliers, arXiv preprint arXiv:1307.8254, (2013). 1.1
[33] L. Xiao and S. Boyd, Fast Linear Iterations for Distributed Averaging, Systems and Control Letters,
53 (2004), pp. 65–78. 2.4
[34] L. Xiao, S. Boyd, and S. Kim, Distributed Average Consensus with Least-mean-square Deviation,
Journal of Parallel and Distributed Computing, 67 (2007), pp. 33–46. 1, 2.4
[35] K. Yuan, Q. Ling, A. Ribeiro, and W. Yin, A Linearized Bregman Algorithm for Decentralized
Basis Pursuit, in Proceedings of the 21st European Signal Processing Conference, 2013, pp. 1–5.
1
[36] K. Yuan, Q. Ling, and W. Yin, On the Convergence of Decentralized Gradient Descent, arXiv
preprint arXiv:1310.7063, (2013). 1.1, 2.1, 2.4, 4.1, 4.1, 4.2, 4.3
[37] M. Zhu and S. Martinez, On Distributed Convex Optimization under Inequality and Equality Constraints, IEEE Transactions on Automatic Control, 57 (2012), pp. 151–164. 1.1

24

