1

Distributed Dual Gradient Tracking for Resource
Allocation in Unbalanced Networks

arXiv:1909.09937v3 [eess.SP] 23 Aug 2020

Jiaqi Zhang, Keyou You, Senior Member, IEEE, and Kai Cai, Senior Member, IEEE

Abstract—This paper proposes a distributed dual gradient
tracking algorithm (DDGT) to solve resource allocation problems
over an unbalanced network, where each node in the network
holds a private cost function and computes the optimal resource
by interacting only with its neighboring nodes. Our key idea is the
novel use of the distributed push-pull gradient algorithm (PPG)
to solve the dual problem of the resource allocation problem.
To study the convergence of the DDGT, we first establish the
sublinear convergence rate of PPG for non-convex objective
functions, which advances the existing results on PPG as they
require the strong-convexity of objective functions. Then we
show that the DDGT converges linearly for strongly convex and
Lipschitz smooth cost functions, and sublinearly without the
Lipschitz smoothness. Finally, experimental results suggest that
DDGT outperforms existing algorithms.
Index Terms—distributed resource allocation, unbalanced
graphs, dual problem, distributed optimization, push-pull gradient.

I. I NTRODUCTION
Distributed resource allocation problems (DRAPs) are concerned with optimally allocating resources to multiple nodes
that are connected via a directed peer-to-peer network. Each
node is associated with a local private objective function to
measure the cost of its allocated resource, and the global goal
is to jointly minimize the total cost. The key feature of the
DRAPs is that each node computes its optimal amount of
resources by interacting only with its neighboring nodes in
the network. A typical application is the economic dispatch
problem, where the local cost function is often quadratic [1].
See [2]–[5] for other applications.
A. Literature review
Existing works on DRAPs can be categorized depending on
whether the underlying network is balanced or not. A balanced
network means that the “amount” of information to any node is
equal to that from this node, which is crucial to the algorithm
design. Most of early works on DRAPs focus on balanced
networks and the recent interest is shifted to the unbalanced
case.
The central-free algorithm (CFA) in [2] is the first documented result on DRAPs in balanced networks where at
* This work was supported by the National Natural Science Foundation
of China under Grant 61722308 and Dong Guan Innovative Research Team
Program under Grant 2018607202007. (Corresponding author: Keyou You).
J. Zhang and K. You are with the Department of Automation, and BNRist, Tsinghua University, Beijing 100084, China. E-mail:
zjq16@mails.tsinghua.edu.cn, youky@tsinghua.edu.cn.
K. Cai is with Department of Electrical and Information Engineering, Osaka
City University, Osaka 558-8585, Japan. E-mail: kai.cai@eng.osaka-cu.ac.jp.

each iteration every node updates its decision variables using
the weighted error between the gradient of its local objective
function and those of its neighbors, and it can be accelerated
by designing an optimal weighting matrix [3]. It is proved
that the CFA achieves a linear convergence rate for strongly
convex and Lipschitz smooth cost functions. For time-varying
networks, the CFA is shown to converge sublinearly in the
absence of strong convexity [4]. This rate is further improved
in [6] by optimizing its dependence on the number of nodes.
In addition, there are also several ADMM-based methods that
only work for balanced networks [7]–[9]. By exploiting the
mirror relationship between the distributed optimization and
distributed resource allocation, several accelerated distributed
algorithms are proposed in [10], [11]. Moreover, [12] and
[13] study continuous-time algorithms for DRAPs by using
the machinery of control theory.
For unbalanced networks, the algorithm design for DRAPs
is much more complicated, which has been widely acknowledged in the distributed optimization literature [14], [15].
In this case, a consensus based algorithm that adopts the
celebrated surplus idea [15] is proposed in [1] and [16].
However, their convergence results are only for quadratic
cost functions where the analyses rely on the linear system
theory. The extension to general convex functions is performed
in [17] by adopting the nonnegative surplus method, at the
expense of a slower convergence rate. The ADMM-based
algorithms are developed in [18], [19], and algorithms that
aim to handle communication delay in time-varying networks
and perform event-triggered updates are respectively studied in
[20] and [21]. We note that all the above-mentioned works [1],
[16]–[21] do not provide explicit convergence rates for their
algorithms. In contrast, the algorithm proposed in this work is
proved to achieve a linear convergence rate for strongly convex
and Lipschitz smooth cost functions, and has a sublinear
convergence rate without the Lipschitz smoothness.
There are several recent works with convergence rate analyses of their algorithms over unbalanced networks. Most
of them leverage the dual relationship between DRAPs and
distributed optimization problems. For example, the algorithms
in [22] and [23] use stochastic gradients and diminishing
stepsize to solve the dual problem of DRAPs, and thus their
√
convergence rates are limited to an order of O(ln(k)/ k)
for Lipschitz smooth cost functions. [23] also shows a rate
of O(ln(k)/k) if the cost function is strongly convex. An
algorithm with linear convergence rate is recently proposed in
[24] for strongly convex and Lipschitz smooth cost functions.
However, its convergence rate is unclear if either the strongly
convexity or the Lipschitz smoothness is removed. In [9], a

2

push-sum-based algorithm is proposed by incorporating the alternating direction method of multipliers (ADMM). Although
it can handle time-varying networks, the convergence rate
is O(1/k) even for strongly convex and Lipschitz smooth
functions.
B. Our contributions
In this work, we propose a distributed dual gradient tracking
algorithm (DDGT) to solve DRAPs over unbalanced networks.
The DDGT exploits the duality of DRAPs and distributed
optimization problems, and takes advantage of the distributed
push-pull gradient algorithm (PPG) [25], which is also called
AB algorithm in [26]. If the cost function is strongly convex
and Lipschitz smooth, we show that the DDGT converges at
a linear rate O(λk ), λ ∈ (0, 1). If the Lipschitz smoothness
is not satisfied, we show the convergence of the DDGT and
establish an convergence rate O(1/k). To our best knowledge,
these convergence results are only reported for undirected or
balanced networks in [10]. Although a distributed algorithm
for directed networks is also proposed in [10], there is no convergence analysis. The advantages of the DDGT over existing
algorithms are also validated by numerical experiments.
To characterize the sublinear convergence of the DDGT,
we first show that PPG converges sublinearly to a stationary
point even for non-convex objective functions. Clearly, this
advances existing works [25]–[27] as their convergence results
are only for strongly-convex objective functions. In fact, the
convergence proofs for PPG in [25]–[27] require constructing
a complicated 3-dimensional matrix and then derive the linear
convergence rate O(λk ) where λ ∈ (0, 1) is the spectral radius
of this matrix. This approach is no longer applicable since a
linear convergence rate is usually not attainable for general
non-convex functions [28] and hence the spectral radius of
such a matrix cannot be strictly less than one.
C. Paper organization and notations
The rest of this paper is organized as follows. In Section
II, we formulate the constrained DRAPs with some standard
assumptions. Section III firstly derives the dual problem of
DRAPs which is amenable to distributed optimization, and
then introduces the PPG. The DDGT is then obtained by
applying PPG to the dual problem and improving the initialization. In Section IV, the convergence result of the DDGT
is derived by establishing the convergence of PPG for nonconvex objective functions. Section V performs numerical
experiments to validate the effectiveness of the DDGT. Finally,
we draw conclusive remarks in Section VI.
We use a lowercase x, bold letter x and uppercase X to
denote a scalar, vector, and matrix, respectively. xT denotes
the transpose of the vector x. [X]ij denotes the element in the
i-th row and j-th column of the matrix X. For vectors we use
k · k to denote the l2 -norm. For matrices we use k · k and k · kF
to denote respectively the spectral norm and the Frobenius
norm. |X | denotes the cardinality of set X . Rn denotes the set
of n-dimensional real vectors. 1 denotes the vector with all
ones, the dimension of which depends on the context. ∇f (x)
denotes the gradient of a differentiable function f at x. We

say a nonnegative matrix X is row-stochastic if X1 = 1, and
column-stochastic if X T is row-stochastic. O(·) denotes the
big-O notation.
II. P ROBLEM FORMULATION
Consider the distributed resource allocation problems
(DRAPs) with n nodes, where each node i has a local private
cost function Fi : Rm → R. The goal is to solve the following
optimization problem in a distributed manner:
minimizem

w1 ,··· ,wn ∈R

subject to

n
X

Fi (wi )

i=1

wi ∈ Wi ,

n
X
i=1

wi =

n
X

(1)
di

i=1

where wi ∈ Rm is the local decision vector of node i,
representing the resources allocated to i. Wi is a local convex
and closed constraint set. di denotes the resource demand
of node
i and di are only known to node i. Let
Pn i. Both WP
n
d , i=1 di , then i=1 wi = d represents the constraint on
total available resources, showing the coupling among nodes.
Remark 1: Problem (1) covers many forms of DRAPs
considered in the literature. For example, the standard local
constraint Wi = [wi , wi ] for some constants wi and wi is a
one-dimensional special case of (1), see e.g. [1], [16], [17],
[20], [24]. Moreover,
Pn the coupling constraint can be given in
a weighted form i=1 Ai wi = d, which can be transformed
into (1) by defining a new variable wi0 = Ai wi and a local
constraint set Wi0 = {Ai wi |wi ∈ Wi }. In addition, many
works only consider quadratic cost functions [1], [16].
Solving (1) in a distributed manner means that each node
can only communicate and exchange information with a subset
of nodes via a communication network, which is modeled by
a directed graph G = (V, E). Here V = {1, · · · , n} denotes
the set of nodes, E ⊆ V × V denotes the set of edges, and
(i, j) ∈ E if node i can send information to node j. Note
that (i, j) ∈ E does not necessarily imply that (j, i) ∈ E.
Define Niin = {j|(j, i) ∈ E} ∪ {i} and Niout = {j|(i, j) ∈
E} ∪ {i} as the set of in-neighbors and out-neighbors of node
i, respectively. That is, node i can only receive messages from
its in-neighbors and send messages to its out-neighbors. Let
aij > 0 be P
the weight associated
to edge (j, i) ∈ E. G is
P
balanced if
j∈Niin aij =
j∈Niout aji for all i ∈ V. Note
that balancedness is a relatively strong condition, since it can
be difficult or even impossible to find weights satisfying it for
a general directed graph [29].
The following assumptions are made throughout the paper.
Assumption 1 (Strong convexity and Slater’s condition):
1) The local cost function Fi is µ-strongly convex for all
i ∈ V, i.e., for any w, w0 ∈ Rm and θ ∈ [0, 1],
Fi (θw + (1 − θ)w0 )
≤ θFi (w) + (1 − θ)Fi (w0 ) −

µ
θ(1 − θ)kw − w0 k2 .
2

Pn
2) The constraint i=1 wi = d is satisfied for some point
in the relative interior of the Cartesian product W :=
W1 × · · · × Wn .

3

Assumption 2 (Strongly connected network): G is strongly
connected, i.e., there exists a directed path from any node i to
any node j.
Assumption 1 is common in the literature. Note that we do
not assume the differentiability of Fi . Under Assumption 1, the
optimal point of (1) is unique. Let F ? and wi? , i ∈ V denote
respectively
its optimal value and optimal point, i.e., F ? =
Pn
?
i=1 Fi (wi ). Assumption 2 is also common and necessary
for the information mixing over a network.

A. The dual problem of (1) and PPG
Define the Lagrange function of (1) as
n
X
Fi (wi ) + xT (
wi − d)

i=1

(2)

i=1

where W = [w1 , · · · , wn ] ∈ Rm×n and x is the Lagrange
multiplier. Then, the dual problem of (1) is given by
maximize
m

inf L(W, x).

W ∈W

x∈R

(3)

W ∈W

W ∈W

=

n
X
i=1

=

n
X

n
X

The dual form (6) allows us to take advantage of recent
advances in distributed optimization to solve DRAPs over
unbalanced networks. For example, distributed algorithms
are proposed in [32, gradient-push], [33, Push-DIGing], [34,
ExtraPush], [35, DEXTRA], [36] to solve (6) over general
directed and unbalanced graphs. Asynchronous algorithms are
also studied in [37]–[40]. In particular, [25] and [26] propose
PPG algorithm (or called AB in [26]) by using the idea of
gradient tracking, which achieves a linear convergence rate
if the objective function fi is strongly convex and Lipschitz
smooth for all i. Moreover, PPG has an empirically faster
convergence speed than its competitors (e.g. [33]), and its
linear update rule is an advantage for implementation. The
compact form of PPG is given as
X
(i)
(j)
(j)
xk+1 =
aij (xk − αyk )
j∈Niin
(i)

yk+1 =

(j)

X

(i)

(i)

bij yk + ∇fi (xk+1 ) − ∇fi (xk )

(8)

j∈Niin

Under Assumption 1, the strong duality holds [30], [31,
Exercise 5.2.2]. The objective function in (3) is written as
inf L(W, x) = inf

(7)

w∈Wi

This section introduces our distributed dual gradient tracking algorithm (DDGT) to solve (1) over an unbalanced network. We start with the dual problem of (1) and transform
it as a form of distributed optimization. Then, the DDGT is
obtained by using the push-pull gradient method (PPG [25],
[26]) on the dual problem, which is an efficient distributed
optimization algorithm over unbalanced networks.

n
X

∇fi (x) = −∇Fi∗ (−x) + di
= − argmin{xT w + Fi (w)} + di .

III. T HE D ISTRIBUTED D UAL G RADIENT T RACKING
A LGORITHM

L(W, x) =

(6) is f ? = −F ? and the optimal point x?1 = · · · = x?n = x?
of (6) satisfies Fi (wi? ) + Fi∗ (−x? ) = −(wi? )T x? . Hence, we
can simply focus on solving the dual problem (6).
The strong convexity of Fi implies that Fi∗ is differentiable
with Lipschitz continuous gradients [30], and the supremum
in (4) is attainable. By Danskin’s theorem [31], the gradient
of Fi∗ is given by ∇Fi∗ (x) = argmaxw∈Wi {xT w − Fi (w)}.
Thus, it follows from (5) that

P
where aij > 0 for any j ∈ Niin and
j∈Niin aij = 1,
P
bij > 0 for any i ∈ Njout and
b
= 1, α is a
out
ij
i∈N
j

(i)

(Fi (wi ) + xT wi ) − xT d

(i)

positive stepsize, and x0 and y0 are initialized such that
(i)
(i)
(i)
y0 = ∇fi (x0 ), ∀i ∈ V. Intuitively, the update for yk aims
to asymptotically track the global gradient ∇f (x̄k ) and the
(i)
update for xk enforces it to converge to x̄k while performing
Pn
(i)
an inexact gradient descent step, where x̄k = n1 i=1 xk is
the mean of nodes’ states. We refer interested readers to [25],
[26] for more discussions on PPG.

i=1

inf {Fi (wi ) + xT wi } − xT d

wi ∈Wi

−Fi∗ (−x) − xT d

i=1

where
Fi∗ (x) , sup {wiT x − Fi (wi )}

(4)

wi ∈Wi

is the convex conjugate function corresponding to the pair
(Fi , Wi ) [31, Section 5.4]. Thus, the dual problem (3) can
be rewritten as a convex optimization problem
minimize
f (x) ,
m
x∈R

n
X

fi (x), fi (x) , Fi∗ (−x) + xT di (5)

i=1

or equivalently,
minimizem

x1 ,··· ,xn ∈R

subject to

n
X

fi (xi )

i=1

(6)

x1 = · · · = xn .

Recall that strong duality holds, and therefore problem (6) is
equivalent to problem (1) in the sense that the optimal value of

B. The DDGT algorithm
We are ready to present the DDGT algorithm. Plugging the
gradient (7) into (8) and noticing that the di term is cancelled
(i)
(i)
in ∇fi (xk+1 ) − ∇fi (xk ), we have
X
(i)
(j)
(j)
wk+1 =
aij (wk + αsk ),
(9a)
j∈Niin
(i)
(i)
wk+1 = argmin{Fi (w) − wT wk+1 },
w∈Wi
X
(i)
(j)
(i)
(i)
sk+1 =
bij sk − (wk+1 − wk ).
j∈Niin

(9b)
(9c)

where notations have been changed to keep consistency with
(i)
(i)
(i)
(i)
the primal problem (1), i.e., xk = −wk and yk = sk .
The DDGT is summarized in Algorithm 1 and we now
elaborate on it. After initialization, each node i iteratively

4

(i)

(i)

(i)

updates three vectors wk , wk and sk . In particular, at
(j)
(j)
(j)
e k := wk + αsk and
each iteration node i receives w
(ji)
(j)
e
sk := bij sk from each of its in-neighbors j, and updates
(i)
wk+1 according
to (9a), where aij is positive for any j ∈ Niin
P
such that
j∈N in aij = 1 as with (8), and α is a positive
i

(i)

stepsize. The update ofPsk in (9c) is similar, where bij > 0
for any i ∈ Njout and i∈N out bij = 1. This process repeats
j
until terminated. We set aij = bij = 0 for any (j, i) ∈
/ E for
convenience. Define two matrices [A]ij = aij and [B]ij = bij ,
then A is a row-stochastic matrix and B is a column-stochastic
matrix. Clearly, the directed network associated with A and B
can be unbalanced.
Remark 2: In practice, one can simply set aij = |Niin |−1
and bij = |Njout |−1 , and then all conditions are satisfied. Note
that this setting requires each node to know the number of
its in-neighbors and out-neighbors, which is common in the
literature of distributed optimization over directed networks
[32]–[35].
Notably, the initialization for DDGT exploits the structure
(i)
of the DRAPs and improves that of PPG. By PPG, w0 and
(i)
(i)
(i)
?
e i and s0 = di −
s0 should be exactly set as w0 = w
e i? , where w
e i? = argminw∈Wi Fi (w) is a local minimizer. In
w
e i? is actually not necessary since
DDGT, the computation of w
(i)
(i)
e i? in w0 and s0 and the update with
the update without w
it become equivalent after the first iteration due to the special
form of ∇fi (x). Clearly, the former is simpler and is adopted
in DDGT.
(i)
The update of wk in (9b) requires finding an optimal
point of an auxiliary local optimization problem, which can be
obtained by standard algorithms, e.g., projected (sub)gradient
method or Newton’s method, and can even be given in an
explicit form for some special cases. Note that solving subproblems per iteration is common in many duality-based
optimization algorithms, including the dual ascent method and
proximal method [41].
Remark 3: Consider two special cases. The first one is that
the local constraint set Wi = Rm and Fi is differentiable as
in [4]. Then, (9b) becomes
(i)

(i)

wk+1 = ∇−1 Fi (wk+1 )

(9b0 )

where ∇−1 Fi denotes the inverse function of ∇Fi , i.e.,
∇−1 Fi (∇Fi (x)) = x for any x ∈ Rm .
The second case is that the decision variable is a scalar, Wi
is an interval [wi , wi ], and Fi is differentiable as in [1], [17],
[20]. Then, (9b) becomes

(i)

if ∇−1 F (wk+1 ) > wi
 wi ,
(i)
(i)
wk+1 =
wi ,
if ∇−1 F (wk+1 ) < wi (9b00 )

 −1
(i)
∇ F (wk+1 ), otherwise
which is in fact adopted in [1], [17], [20]. Hence, (9b) can be
seen as an extension of their methods.
An interesting feature of DDGT lies in the way to handle
Pn
(i)
the coupling constraint i=1 wk = d. Notice that DDGT
(i)
is simply initialized such that w0 = 0, ∀i ∈ V and
Pn (i)
s = d. By summing (9c) over i = 1, · · · , n, we obtain
i=1
Pn0
Pn
(i)
(i)
(i)
(i)
that i=1 (wk + sk ) = i=1 (w0 + s0 ) = d. Thus, if

Algorithm 1 The Distributed Dual Gradient Tracking Algorithm (DDGT) — from the view of node i
(i)

(i)

(i)

Initialization: Let w0 = 0, w0 = 0, s0 = di .a
• For k = 0, 1, · · · , K, repeat
(j)
(j)
(ji)
(j)
(j)
e k := wk + αsk and e
sk := bij sk from
1: Receive w
its in-neighbor j.
(i)
(i)
(i)
2: Compute wk+1 , wk+1 and sk+1 as (9).
(i)
(i)
e k+1 and e
3: Broadcast w
sk+1 to each of out-neighbors.

•

•

(i)

Return wK .

a If only the total resource demand d is known to all nodes, then we can
(i)
1
simply set s0 = n
d, which can be done in a distributed manner [17].

(i)

sk converges to 0, the constraint is satisfied asymptotically,
which is essential to the convergence proof of the DDGT.
By strong duality, the convergence of DDGT can be established by showing the convergence of PPG. However, existing
results, e.g., [25]–[27], [42] for the convergence of PPG are
established only if fi is strongly convex and Lipschitz smooth.
Note that fi in (6) is often not strongly convex due to the
introduction of convex conjugate function Fi∗ , though Fi
in (1) is strongly convex [30]. This is indeed the case if
Fi includes exponential term [43] or logarithmic term [44].
Without Lipschitz smoothness for Fi , we can only obtain
that fi is differentiable and µ1 -Lipschitz smooth [45, Theorem
4.2.1], i.e.,
k∇fi (x) − ∇fi (y)k ≤

1
kx − yk, ∀i ∈ V, x, y ∈ Rn . (10)
µ

Thus, we still need to prove the convergence of PPG for
non-strongly convex objective functions fi . Particularly, a
crucial step in the convergence proof of PPG in [25], [26] uses
a complicated 3-dimensional matrix whose spectral radius is
strictly less than one for a sufficiently small stepsize. Then,
PPG converges at a linear rate. This does not work here since
the spectral radius of such a matrix cannot be strictly less than
one if fi is not strongly convex. In fact, we cannot expect a
linear convergence rate for the non-strongly convex case [28].
Next, we shall prove that PPG converges to a stationary
point at a rate of O(1/k) even for non-convex objective
functions, based on which we show the convergence and
evaluate the convergence rate of DDGT.
IV. C ONVERGENCE A NALYSIS
In this section, we first establish the convergence result of
PPG in (8) for non-convex fi , which is of independent interest
as the existing results on PPG only apply to the strongly
convex case. Then, we show the convergence of the DDGT
and evaluate the convergence rate for a special case.

A. Convergence analysis of PPG without convexity
Consider PPG given in (8). With a slight abuse of notation,
let fi be a general differentiable function in the rest of this

5

subsection. Denote

B. Convergence of the DDGT

(1)
(n)
Xk = [xk , · · · , xk ]T ∈ Rn×m
(1)
(n)
Yk = [yk , · · · , yk ]T ∈ Rn×m
(1)
(n)
∇fk = [∇f1 (xk ), · · · , ∇fn (xk )]T ∈ Rn×m

(11)

and

[A]ij =

aij , if (j, i) ∈ E
0, otherwise,


[B]ij =

bij , if (j, i) ∈ E
0, otherwise.

Note that A is row-stochastic and B is column-stochastic. The
starting points of all nodes are set to the same point x0 for
simplicity.
Then, (8) can be written in the following compact form
Xk+1 = A(Xk − αYk )

(12a)

Yk+1 = BYk + ∇fk+1 − ∇fk

(12b)

We now establish the convergence and quantify the convergence rate of the DDGT.
Theorem 2 (Convergence of the DDGT): Suppose Assumptions 1 and 2 hold. If the stepsize α > 0 is smaller than an
upper bound given in (28) and (43) with L replaced by 1/µ,
(i)
then {wk }, i ∈ V in Algorithm 1 converges to an optimal
(i)
point of (1), i.e., limk→∞ wk = wi? , ∀i ∈ V.
Proof: Under Assumption 1, the strong duality holds
between the original problem (1) and its dual problem (6).
Recall the relation between the DDGT (9) and PPG (8). We
obtain that f (x̄k ) converges to f ? by the convexity of the dual
problem and Theorem 1, and f ? = −F ? = −L(W ? , x) for
any x ∈ Rm . Moreover,
f (x̄k ) − f ? = L(W ? , x̄k ) − inf L(W, x̄k )
W ∈W

The convergence result of PPG for non-strongly convex or
even non-convex functions are given in the following result.
Theorem 1 (Convergence of PPG without convexity): Suppose Assumption 2 holds and fi , i ∈ V in (6) is differentiable
and L-Lipschitz smooth (c.f. (10)). If the stepsize α is suffi(i)
ciently small, i.e., α satisfies (28) and (43), then {xk }, i ∈ V
generated by (8) satisfies that
k

1X
f (x0 ) − f ?
3Lα2 (L2 c20 + c22 )
k∇f (x̄t )k2 ≤
+
k t=1
γk
γ(1 − θ)2 k
(13)
Pk0
√
α( nLc0 + c2 )(1 + t=1
k∇f (x̄t )k2 )
+
γ(1 − θ)2 k
Pn
(i) (i)
where x̄k = i=1 πA xk , πA is the normalized left Perron
vector of A, and θ, c0 , c2 , γ, k0 are positive constants given in
(27), (31), (44), (45) of Appendix, respectively.
Moreover, it holds that
k

k

2c20
c2 α 2 X
1X
2
kXt − 1x̄T
k∇f (x̄t )k2
+ 1
t kF ≤
2
k t=1
(1 − θ) k
k t=1
(14)
and if f is convex, f (x̄k ) converges to f ? .
The proof of Theorem 1 is deferred to the Appendix.
Theorem 1 shows that PPG converges to a stationary point
of f at a rate of O(1/k) for non-convex functions. The order
of convergence rate is consistent with the centralized gradient
descent algorithm [41]. Generally, the network size n affects
the convergence rate in a complicated way since it closely
relates to the network topology and the two weighting matrices
A and B. If σA , σB , δAF and δBF in Lemmas 2 and 3 of
Appendix do not vary with n, which holds, e.g., by setting
A = B in some undirected graphs such as complete graphs
and star graphs, then it follows
√ from (27),√(28), (31) and (44)
that θ = O(1), α = O(1/ n), c0 ≈ O( n) and γ ≈ O(α).
Then, (13) ensures a convergence rate O(n/k), which is
reasonable since the Lipschitz constant L is defined in terms
of local objective functions, and the global Lipschitz constant
generally increases linearly with n, implying a convergence
rate O(n/k) even for the centralized gradient descent method
[41, Section 6.1].

= L(W ? , x̄k ) − L(Wk , x̄k )
µ
≥ ∂W L(Wk , x̄k )T (W ? − Wk ) + kWk − W ? k2F
2
µ
? 2
≥ kWk − W kF
2
(15)
where L(W, x) is the Lagrange function in (2), W ? =
(i)
(i)
[w1? , · · · , wn? ] and Wk = [wk , · · · , wk ]. The first inequality
follows from the strong convexity of F by Assumption 1 and
the second inequality uses the first-order necessary condition
for a constrained minimization problem. The convergence of
(i)
wk is obtained immediately from (15).
The stepsize condition follows from Theorem 1 and the
Lipschitz smoothness of the dual function (c.f. (10)).
Remark 4: We note that it is possible to extend the DDGT
to time-varying networks [17], since the convergence of the
DDGT essentially depends on that of PPG, and a recent work
[27] shows the feasibility of PPG over time-varying networks
for strongly convex functions.

C. Convergence rate of the DDGT
As in [4] and [10], this subsection focuses on the special
case that Wi = Rm and Fi is differentiable for all i ∈ V for
the convergence rate characterization, since the constrained
case involves more complicated concepts and notations such
as subdifferential.
Under Assumption 1, it follows from [30] that the KarushKuhn-Tucker (KKT) condition of (1)
∇F1 (w1? ) = · · · = ∇Fn (wn? ),
n
X
wi? = d

(16a)
(16b)

i=1

is a necessary and sufficient condition for optimality. The
convergence rate of the DDGT is in terms of (16).
Theorem 3 (Convergence rate of the DDGT): Suppose that
Wi = Rm , Fi is differentiable for all i, and the conditions in

6

Pn
(i)
Theorem 2 are satisfied. Let ∇Fk = n1 i=1 ∇Fi (wk ), then
(i)
{wk } generated by the DDGT satisfies that
k
n
1 XX

k t=1
≤

(i)
k∇Fi (wt ) − ∇Ft k2 + k

i=1

n
X

n
X
(i)
wt − dk2


i=1

i=1

2(f (x0 ) − f ? ) 6Lα2 (L2 c20 + c22 ) 4nc20 (µ2 + 1)
+
+ 2
+
γk
γ(1 − θ)k
µ (1 − θ)k
Pk0
√
2α( nLc0 + c2 )(1 + t=1 k∇f (x̄t )k2 )
1
+
+ O( 2 )
γ(1 − θ)k
k

where all the constants are defined in Theorem 1.
Moreover, if Fi , i ∈ V has Lipschitz continuous gradients,
Pn
Pn
(i)
(i)
then i=1 kwk −wi? k2 converges linearly, i.e., i=1 kwk −
k
? 2
wi k ≤ O(λ ) for some λ ∈ (0, 1).
(i)
Proof: Since Wi = Rm , it follows from (9b) that xk+1 =
(i)
−∇Fi (wk+1 ). Thus,
n
X

n
X

(i)

xk −

i=1

2
2
1
1
(i)
= (I − 11T )Xk
xk
n i=1
n
F
2
1 T
11 )Xk
n
F

2
1 T
T
T
2
= 2kXk − 1x̄T
k kF + 2 ( 11 − 1πA )(Xk − 1x̄k )
n
F
2
≤ 2nkXk − 1x̄T
k kF
(17)
where Xk is defined in (11), x̄k and πA are defined in Theorem
1. The first inequality uses the relation ka + bk2 ≤ 2kak2 +
T 2
kF ≤
2kbk2 , and the last inequality follows from k n1 11T −1πA
n − 1.
On the other hand, it follows from (7) and (9b) that

(i)

wk − d = −

i=1

n
X

(i)

∇fi (xk )

i=1



= − ∇f (x̄k ) +

n

X
(i)
(∇fi (xk ) − ∇fi (x̄k ))
i=1

Taking the norm on both sides yields that
k

n
X

(i)

wk − dk2

i=1

2n(1 + µ2 )
2
kXk − 1x̄T
≤ 2k∇f (x̄k )k2 +
k kF
µ2
The desired result then follows from Theorem 1.
The linear convergence rate in the presence of Lipschitz
smoothness can be similarly obtained by following the linear
convergence of PPG for strongly convex and Lipschitz smooth
objective functions ( [26, Theorem 1] or [25, Theorem 1]),
which is omitted to save space.
Theorem 3 shows that the DDGT converges at a sublinear
rate O(1/k) for strongly convex objective functions, and
achieves a linear convergence rate if Lipschitz smoothness is
further satisfied. In view of Theorem 1, the explicit form of the
term corresponding to O(1/k 2 ) in Theorem 3 can be obtained
after tedious computations.
V. N UMERICAL E XPERIMENTS

n
X

T
T
≤ 2k(I − 1πA
)Xk k2F + 2 (1πA
−

n
X

(i)

k∇F (wk ) − ∇Fk k2 + k

(i)

k∇Fi (wk ) − ∇Fk k2

i=1

=

the second inequality follows from (10). Combining (17) and
(18) implies that

n
X

(i)

wk − dk2

(18)

i=1

≤ 2k∇f (x̄k )k2 + 2n

n
X

(i)

k∇fi (xk ) − ∇fi (x̄k )k2

i=1
n

2n X (i)
≤ 2k∇f (x̄k )k2 + 2
kx − x̄k k2
µ i=1 k
= 2k∇f (x̄k )k2 +

2n
2
kXk − 1x̄T
k kF
µ2
2

2

2

where we use ka + bk ≤ 2kak + 2kbk again and the
Cauchy-Schwarz inequality to obtain the first inequality, and

This section validates our theoretical results and compares
the DDGT with existing algorithms via simulation. More
precisely, we compare the DDGT with the algorithms in [17],
[24] and [10, Mirror-Push-DIGing]. Note that [10] does not
provide convergence guarantee for Mirror-Push-DIGing, [17]
has no convergence rate results, and [24] only shows the convergence rate for strongly convex and Lipschitz smooth cost
functions. Moreover, the algorithm in [24] involves solving
a subproblem similar to (9b) per iteration and [17] adopts
the update in (9b00 ) which is a special case of (9b), and
hence the computational complexities of the two algorithms
are similar to DDGT per iteration. In contrast, Mirror-PushDIGing [10] requires computing a proximal operator, which
may have higher computational costs.
We test these algorithms over 126 nodes connected via a
directed network, which is a real Email network [46], [47].
Each node i is associated with a local quadratic cost function
Fi (wi ) = ai (wi − bi )2 where ai ∼ U(0, 1) and bi ∼ N (0, 4)
are randomly sampled. Note that the quadratic cost function
is commonly P
used in the literature [10], [17], [24]. The global
126
constraint is i=1 wi = 50.
We first test the case without local constraints by setting
Wi = Rm . The stepsize used for each algorithm is tuned
via a grid search1 , and all initial conditions are randomly set.
(i)
Fig. 2 depicts the decay of distance between wk and the
optimal solution with respect to the number of iterations. It
clearly shows that the DDGT has a linear convergence rate
and converges faster than algorithms in [17], [24] and [10].
To validate the theoretical result for strongly convex cost
functions without Lipschitz smoothness, we test the algorithms
with a quartic local cost function Fi (wi ) = ai (wi − bi )2 +
ci (wi − di )4 , where ci ∼ U(0, 10) and di ∼ N (0, 4) are
1 The grid search scheme works as follows. For each algorithm, we select
a “good” stepsize by inspection, and then gradually increase and decrease
stepsizes around the selected one with an equal grid size, respectively. Then,
we find the fastest one among all the tried stepsizes.

7

DDGT
Algorithm in [24]
Algorithm in [17]
100/k

102

100

10-2

0

50

100

150

200

250

300

350

400

450

500

Number of iterates k

Fig. 3. Convergence rate w.r.t the number of iterations of different algorithms
with quartic cost function Fi (wi ) = ai (wi − bi )2 + ci (wi − di )4 .

Fig. 1. The communication network in [46], [47].

DDGT
Algorithm in [24]
Algorithm in [17]
Mirror-Push-DIGing [10]

105

DDGT
Algorithm in [24]
Algorithm in [17]
100/k

102

100

100
10-5

10-2
10-10

0

50

100

150

200

250

300

350

400

450

500

0

50

100

150

200

250

300

350

400

450

500

Number of iterates k

Number of iterates

Fig. 2. Convergence rate w.r.t the number of iterations of different algorithms
with quadratic cost function Fi (wi ) = ai (wi − bi )2 .

randomly sampled. Clearly, this function is strongly convex but
not Lipschitz smooth. All other settings remain the same and
the result is plotted in Fig. 3, where the Mirror-Push-DIGing
[10] is not included because its proximal operator is very
time-consuming, and an approximate solution for the proximal
operator often leads to a poor performance of the algorithm.
The dotted line in Fig. 3 is the sequence {100/k} with k the
number of iterations. We can observe that the convergence
rates of all algorithms are slower than that in Fig. 2, but the
DDGT still outperforms the other two algorithms. Moreover,
it is interesting to observe that the DDGT and the algorithm in
[24] have near-linear convergence rate, though the theoretical
convergence rate for the DDGT is O(1/k).
Finally, we study the effect of local constraints on the
convergence rate. To this end, we assign each node a local
constraint −2 ≤ wi ≤ 2, and test all algorithms with the
setting of Fig. 3. The result is depicted in Fig. 4, which shows
that the convergence of the DDGT is essentially not affected,
while the algorithm in [24] is heavily slowed compared with
that in Fig. 3.
VI. C ONCLUSION
We proposed the DDGT for distributed resource allocation
problems (DRAPs) over directed unbalanced networks. Convergence results are provided by exploiting the strong duality
of DRAPs and distributed optimization problems, and taking
advantage of the PPG algorithm. We studied the convergence
and convergence rate of PPG for non-convex problems and
obtained that the DDGT converges linearly for strongly convex
and Lipschitz smooth objective functions, and sub-linearly
without the Lipschitz smoothness. Future works are to provide

Fig. 4. Convergence rate w.r.t the number of iterations of different algorithms
with quartic cost function Fi (wi ) = ai (wi − bi )2 + ci (wi − di )4 and local
constraint −2 ≤ wi ≤ 2, ∀i.

tighter bounds for the convergence rate, design asynchronous
versions [37], [38], study quantized communication [48], and
design accelerated algorithms [49]. In particular, an interesting
idea to accelerate the DDGT is to add a vanishing strongly
convex regularization term to the dual problems of DRAPs,
which may allow a larger stepsize in the early stage and hence
possibly lead to faster convergence.
ACKNOWLEDGMENT
The authors would like to thank the Associate Editor and
anonymous reviewers for their very constructive comments,
which greatly improved the quality of this work.
A PPENDIX
A. Preliminary results on stochastic matrices
We first introduce three lemmas are from [25], [26].
Lemma 1 ( [26], [42]): Suppose Assumption 2 holds. The
matrix A has a unique unit nonnegative left eigenvector πA
T
T
T
w.r.t. eigenvalue 1, i.e., πA
A = πA
and πA
1 = 1. The matrix
B has a unique unit right eigenvector πB w.r.t. eigenvalue 1,
T
i.e., BπB = πB and πB
1 = 1.
The proof of Lemma 1 follows from the Perron-Frobenius
theorem and can be found in [25], [26].
Lemma 2 ( [50], [25], [26]): Suppose Assumption 2 holds.
There exist matrix norms k · kA and k · kB such that σA ,
T
kA − 1πA
kA < 1 and σB , kB − πB 1T kB < 1. Moreover, σA
and σB can be arbitrarily close to the second largest absolute
value of the eigenvalues of A and B, respectively.
A method to construct such matrix norms can be found in
the proof of Lemma 5.6.10 in [50].

8

Lemma 3 ( [25], [26]): There exist constants δFA , δAF , δFB
and δBF such that for any X ∈ Rn×n , we have
kXkF ≤ δFA kXkA , kXkF ≤ δFB kXkB
kXkA ≤ δAF kXkF , kXkB ≤ δBF kXkF
Lemma 3 is a direct result of the norm equivalence theorem.
If A and B are symmetric, which means the √
network is
undirected, then δAF = δBF = 1 and δFA = δFB = n.
Note that the norm k · kA defined in Lemma 2 is only for
matrices in Rn×n . To facilitate presentation, we slightly abuse
the notation and define a vector norm kxkA , k √1n x1T kA for
any x ∈ Rn , where the norm in the right-hand-side is the
matrix norm defined in Lemma 2. Then, we have
T

x1
1
kM xkA = k √ M x1T kA ≤ kM kA √
n
n

A

= kM kA kxkA

where the first equality is by definition and the inequality
follows from the sub-multiplicativity of matrix norms. Moreover, for any matrix X p
=P
[x1 , · · · , xm ] ∈ Rn×m , define the
m
2
matrix norm kXkA =
i=1 kxi kA . Recall that n × m is
the dimension of X and hence the definition is distinguished
from that in Lemma 2. We have
r
Xm
kM XkA = k[M x1 , · · · , M xm ]kA =
kM xi k2A
i=1
r
Xm
≤
kM k2A kxi k2A = kM kA kXkA .
i=1

Therefore, for any M ∈ Rn×n , X ∈ Rn×m , and x ∈ Rn , the
following relation holds
kM XkA ≤ kM kA kXkA , kM xkA ≤ kM kA kxkA .

(19)

Similarly, we can obtain such a relation based on the matrix
norm k · kB defined in Lemma 2.
Next, we define three important auxiliary variables:
x̄k , XkT πA ,

ȳk , YkT πA ,

(12b)
ŷk , YkT 1 = ∇fkT 1

(20)

(i)

where x̄k is a weighted average of xk that is identical to the
(i)
one defined in Theorem 1, ȳk is a weighted average of yk ,
(i)
and ŷk is the sum of yk .
Finally, for any X = [x(1) , · · · , x(n) ]T ∈ Rn×m , let
∇f (X) = [∇f1 (x(1) ), · · · , ∇fn (x(n) )]T ∈ Rn×m ,
and let ρ(X) denote the spectral radius of matrix X.
B. Proof of Theorem 1
T
Step 1: Bound kXk − 1x̄T
k kA and kYk − πB ŷk kB
It follows from (9) that

kXk+1 − 1x̄T
k+1 kA

(21)

T
= kAXk − 1x̄T
k − α(A − 1πA )Yk kA
T
T
T
= (A − 1πA
)[(Xk − 1x̄T
k ) − α(Yk − πB ŷk ) − απB ŷk ] A
T
T
≤ σA kXk − 1x̄T
k kA + ασA kYk − πB ŷk kA + ασA kπB ŷk kA
T
≤ σA kXk − 1x̄T
k kA + ασA δAF δFB kYk − πB ŷk kB

T
T
+ ασA δAF k1T (∇f (Xk ) − ∇f (1x̄T
k ) + 1 ∇f (1x̄k ))k

≤ ασA δAF δFB kYk − πB ŷkT kB + ασA δAF k∇f (x̄k )k
√
+ (σA + ασA δAF δFA L n)kXk − 1x̄T
k kA
where we use Lemma 2 and (19) to obtain the first inequality,
the second inequality is from Lemma 3 and (20), and the last
inequality follows from the L-Lipschitz smoothness.
Now we bound kYk − πB ŷkT kB . From (12) we have
T
kYk+1 − πB ŷk+1
kB
T
= kBYk − πB ŷkT + (∇fk+1 − ∇fk ) − (πB ŷk+1
− πB ŷkT )kB

= k(B − πB 1T )(Yk − πB ŷkT ) + (I − πB 1T )(∇fk+1 − ∇fk )kB
≤ σB kYk − πB ŷkT kB + LδBF kI − πB 1T kB kXk+1 − Xk kF
≤ σB kYk − πB ŷkT kB + LδBF kXk+1 − Xk kF .

(22)
where the last inequality follows from kI − πB 1T kB = 1,
which can be readily obtained from the construction of the
norm k · kB [50, Lemma 5.6.10]. Moreover, it follows from
(12a) that
kXk+1 − Xk kF = kAXk − Xk − αAYk kF
= k(A − I)(Xk − 1x̄T
k ) − αAYk kF
T
T
≤ kA − Ik kXk − 1x̄T
k kF + αkA(Yk − πB ŷk + πB ŷk )kF
√
T
T
≤ 2 nkXk − 1x̄T
k kF + αkAk(kYk − πB ŷk kF + kπB ŷk kF )
√
√
T
T
≤ 2 nδFA kXk − 1x̄T
k kA + α n(δFB kYk − πB ŷk kB + kŷk k)
√
√
T
≤ 2 nδFA kXk − 1x̄k kA + α nδFB kYk − πB ŷk kB
√
T
T
+ α nk1T (∇f (Xk ) − ∇f (1x̄T
k ) + 1 ∇f (1x̄k ))k
√
≤ (αLnδFA + 2 nδFA )kXk − 1x̄T
kA
√
√k
+ α nδFB kYk − πB ŷk kB + α nk∇f (x̄k )k
√
where we used kAk ≤ n. The above relation combined with
(22) yields
T
kYk+1 − πB ŷk+1
kB
√
≤ (σB + Lα nδBF δFB )kYk − πB ŷkT kB
√
√
+ nLδBF δFA (2 + nLα)kXk − 1x̄T
k kA
√
+ α nLδBF k∇f (x̄k )k.

(23)

Combing (21) and (23) implies the following linear matrix
inequality

 


kXk+1 − 1x̄T
P11 P12 kXk − 1x̄T
k+1 kA
k kA
4
T
P21 P22 kYk − πB ŷkT kB
kYk+1 − πB ŷk+1
kB
{z
}|
{z
}
{z
} |
|
,P
, zk
, zk+1


ασ
δ
k∇f
(x̄
)k
A
AF
k
+ √
α nLδBF k∇f (x̄k )k
|
{z
}
, uk
(24)
where 4 denotes the element-wise less than or equal sign and
√
P11 = σA + ασA δAF δFA L n,
P12 = ασA δAF δFB
√
√
√
P21 = nLδBF δFA (2 + nLα), P22 = σB + Lα nδBF δFB
Note that ρ(P ) < 1 for sufficiently small α, since


0
√ σA
lim P =
2L nδBF δFA σB
α→0

9

where c0 , c1 , c2 and c3 are constants given as follows

has spectral radius smaller than 1.
The linear matrix inequality (24) implies that

n

c0 =
zk 4 P k−1 z1 +

k−1
X

P t−1 uk−t .

(25)

t=1

Let θ1 and θ2 be the two eigenvalues of P such that |θ2 | > |θ1 |,
and θ , ρ(P ) = |θ2 |, then P can be diagonalized as
P = T ΛT

−1


θ
, Λ= 1
0

(26)

p
Let Ψ = (P11 − P22 )2 + 4P12 P21 . Note that the analysis
so far holds if σA is replaced by any value in (σA , 1) (similar
for σB ), and hence we assume without loss of generality that
σA 6= σB to simplify presentation. In that case, Ψ is lower
bounded by some positive value that is independent of α, say
Ψ. With some tedious calculations, we have
P11 + P22 − Ψ
θ1 =
2
P11 + P22 + Ψ
(27)
θ = θ2 =
2
√
1
= (σA + σB + Lα n(δBF δFB + σA δAF δFA ) + Ψ).
2

δBF δFB
nLΨ

c2 = kY1 − πB ŷ1T kB ≤ δBF

n
X

2

k∇fi (x1 )k

1/2

(31)

i=1

√
3 nLσA δBF δFA δAF √
+ nLδBF .
Ψ

Step 2: Bound kȳk k2
From (12) and the L-Lipschitz smoothness, we have
f (x̄k+1 ) ≤ f (x̄k ) − α∇f (x̄k )T ȳk +

Lα2
kȳk k2 .
2

(32)

Note that
ȳk = YkT πA = (Yk − πB ŷkT + πB ŷkT )T πA
T
= (Yk − πB ŷkT )T πA + YkT 1πB
πA
T
T
= (Yk − πB ŷkT )T πA + (∇f (Xk ) − ∇f (1x̄T
k )) 1πB πA
T
+ πB
πA ∇f (x̄k )

where we used the relation YkT 1 =
T
∇f (1x̄T
k ) 1 = ∇f (x̄k ). Then, we have

(33)
∇f (Xk )T 1 and

−∇f (x̄k )T ȳk
T
T
= −∇f (x̄k )T (∇f (Xk ) − ∇f (1x̄T
k )) 1πB πA

To let θ = θ2 < 1, it is sufficient for α to satisfy
α<

c1 = σA δAF +

c3 =


0
.
θ2

1/2
δBF  X
kY1 − πB ŷ1T kB
≤
k∇fi (x1 )k2
2
2
nL Ψ
nL Ψ i=1

(1 − σA )(1 − σB )
√
√
.
2( nLσA δAF δFA + 1)( nLδBF δFB + 1)

(28)

T
− ∇f (x̄k )T (Yk − πB ŷkT )T πA − πB
πA k∇f (x̄k )k2
√
T
2
≤ −πB πA k∇f (x̄k )k + L nk∇f (x̄k )kkXk − 1x̄T
k kF

+ k∇f (x̄k )kkYk − πB ŷkT kF

(34)
where we used kπA k ≤ 1, and the Lipschitz smoothness
T
k∇f (Xk ) − ∇f (1x̄T
k )kF ≤ LkXk − 1x̄k kF to obtain the last
 inequality.
P11 −P22 +Ψ
Moreover, it follows from (33) and the relation ka + b +
2Ψ
2
P22 −P11 +Ψ
ck
≤ 3kak2 + 3kbk2 + 3kck2 that
2Ψ

Moreover, T and T −1 in (26) can be expressed in an explicit
form
 P11 −P22 −Ψ
T =

2P21

P11 −P22 +Ψ
2P21

1

1



, T −1 =

 P21
− Ψ
P21
Ψ

T
kȳk k2 ≤ 3k(Yk − πB ŷkT )T πA k2 + 3kπB
πA ∇f (x̄k )k2

It then follows from (26) that
0 2 P k = T Λk T −1
" k k
#
θ1 +θ2
(P11 −P22 )(θ2k −θ1k )
P12
k
k
+
(θ
−
θ
)
2
1
2
2Ψ
Ψ
=
θ1k +θ2k
(P −P )(θ1k −θ2k )
P21
k
k
+ 11 22
Ψ (θ2 − θ1 )
2
2Ψ

−1 
2
1
(nL Ψ)
2 θk √
3 nLδBF δFA /Ψ
1
(29)
where we used |P11 − P22 | ≤ Ψ, Ψ ≥ Ψ, and the bound (28)
to obtain the inequality.
Combining (24), (25) and (29) yields that

k−1
kXk − 1x̄T
+ c1 α
k k F ≤ c0 θ

kYk − πB ŷkT kF ≤ c2 θk−1 + c3 α

k−1
X
t=1
k−1
X
t=1

θ

t−1

T
T
2
+ 3k(∇f (Xk ) − ∇f (1x̄T
k )) 1πB πA k
T
≤ 3kYk − πB ŷkT k2 + 3(πB
πA )2 k∇f (x̄k )k2

2
+ 3L2 nkXk − 1x̄T
kk .
Pk
T
Step 3: Bound
and
t=1 k∇f (x̄t )kkXt − 1x̄t kF
Pk
T 2
kX
−
1x̄
k
t
t F
t=1
We first bound the summation of the terms k∇f (x̄t )kkXt −
T
1x̄T
t kF and k∇f (x̄t )kkYt − πB ŷt kF in (34) over t =
1, · · · , k. It follows from (30) that

k∇f (x̄k )kkXk − 1x̄T
k kF
≤ c0 θk−1 k∇f (x̄k )k + c1 αk∇f (x̄k )k

k∇f (x̄k−t )k

θt−1 k∇f (x̄k−t )k

k−1
X

θt−1 k∇f (x̄k−t )k

t=1

(30)

(35)

Then, define
ϑt = [θt−2 , θt−3 , · · · , θ, 1, 0, · · · , 0]T ∈ Rk

(36)

10

ϑ̃t = [0, · · · , 0, 1, 0, · · · , 0]T ∈ Rk
| {z }

where the elements are defined in (27) and (31). Clearly, Θk
is nonnegative and positive semi-definite. We have from (30)
T
that kXt − 1x̄T
t kF ≤ νk φt , and hence

t−1

υk = [k∇f (x̄1 )k, · · · , k∇f (x̄k )k]T ∈ Rk


0 1 θ · · · θk−2

0 1 · · · θk−1 
k


X

.. 
T
..
..
Θ̃k =
ϑt ϑ̃t = 
.
.
. 


t=1

0
1 
0

k
X

To bound kΘk k, let [Θk ]ij be the element in the i-th row and
j-th column of Θk . For any 0 < i ≤ j ≤ k, we have

t=1

≤ c0

θ

t−1

2

(1 + k∇f (x̄t )k ) + c1 α

t=1

≤

k
X

k
k
X
X
c0
+ c0
θt−1 k∇f (x̄t )k2 + c1 α
υkT ϑ̃t ϑT
t υk
1−θ
t=1
t=1

k
X
c0
+ c0
θt−1 k∇f (x̄t )k2 + c1 αυkT Θ̃k υk .
≤
1−θ
t=1

=

θ

t−i+1 t−j+1

θ

θ

2j−2

(37)

1−θ

2

k
X

where the last inequality follows from ρ(Θ̃k + Θ̃T
k ) ≤ kΘ̃k +
2
Θ̃T
k
≤
k
Θ̃
k
+
k
Θ̃
k
≤
.
Thus,
we
have
from (37)
1
k
1
k
∞
k
1−θ
that

i=1

=

j
X

(1 − θ
1 − θ2

[Θk ]ij +

=

i=j+1
k
X

j
X
θj−i (1 − θ2(k−j+2) )

1 − θ2

≤

k∇f (x̄t )kkYt − πB ŷtT kF

t=1
k
k
X
c2
c3 α X
+ c2
θt−1 k∇f (x̄t )k2 +
k∇f (x̄t )k2 .
1−θ
1
−
θ
t=1
t=1
Pk
Pk
T 2
Next, we bound
t − 1x̄t kF and
t=1 kXP
t=1 kYt −
k
T 2
πB ŷt k2F . We first consider
kX
−
1x̄
k
.
For any
t
t
F
t=1
k ∈ N, define

≤

νk = [c0 , c1 αk∇f (x̄1 )k, · · · , c1 αk∇f (x̄k−1 )k]T ∈ Rk
φt = [θt−1 , θt−2 , · · · , θ, 1, 0, · · · , 0]T ∈ Rk
k
X
t=1

k×k
φt φT
t ∈R

k
X
θi−j (1 − θ2(k−i+2) )
1 − θ2
i=j+1

θ
1
1
+
≤
(1 − θ)(1 − θ2 ) (1 − θ)(1 − θ2 )
(1 − θ)2

and we have from the Gershgorin circle theorem that

j

k
c1 α X

+

j
k
X
X
θi−j
θj−i
+
≤
1 − θ2 i=j+1 1 − θ2
i=1

kΘk k ≤ max

c0
+ c0
θt−1 k∇f (x̄t )k2 +
k∇f (x̄t )k2
1−θ
1
−
θ
t=1
t=1
(38)
Pk
Similarly, we can bound t=1 k∇f (x̄t )kkYt − πB ŷtT kF as
follows,

Θk =

[Θk ]ji

i=j+1

t=1

k
X

θj−i (1 − θ2(k−j+2) )
) 2−i−j
θ
=
.
1 − θ2

j
k
X
X
[Θk ]ij +
[Θk ]ij
i=1

k∇f (x̄t )kkXt − 1x̄T
t kF

≤

θ2t−i−j+2

t=j−1
2(k−j+2)

[Θk ]ij =

i=1

Θ̃k + Θ̃T
1
kυk k2
2
k
υkT Θ̃k υk = υkT
υk ≤ ρ(Θ̃k + Θ̃T
)kυ
k
≤
k
k

k
X

k
X

=

t=j−1

i=1

The last term υkT Θ̃k υk in (37) can be bounded by

k
X

k
X

Since Θk is symmetric, it holds that
k∇f (x̄t )kϑT
t υk

t=1

2

[Θk ]ij

=

k∇f (x̄t )kkXt − 1x̄T
t kF
k
X

(39)

t=1

where
θ is defined in (27). Note that k∇f (x̄t )k = υkT ϑ̃t
Pt−1
and l=1 θl−1 k∇f (x̄t−l )k = υkT ϑt , ∀t ≤ k, which combined
with the relation k∇f (x̄k )k ≤ 1 + k∇f (x̄k )k2 and (36) yields
k
X

2
T
2
kXt − 1x̄T
t kF ≤ νk Θk νk ≤ kΘk kkνk k .

k
X
i=1

[Θk ]ij ≤

1
.
(1 − θ)2

It then follows from (39) that
k
X
t=1
k
X

2
kXt − 1x̄T
t kF ≤

k−1
h
i
X
2
2
2 2
c
+
c
α
k∇f (x̄t )k2
0
1
2
(1 − θ)
t=1

k−1
h
i
X
2
2
2 2
2
c
+
c
α
k∇f
(x̄
)k
.
t
3
(1 − θ)2 2
t=1
t=1
(40)
Pk
Step 4: Bound t=1 k∇f (x̄t )k2
Combining (32), (34) and (35) implies that

kYt − πB ŷt k2F ≤

f (x̄k+1 )
Lα2
≤ f (x̄k ) − α∇f (x̄k )T ȳk +
kȳk k2
2


T
3LαπB
πA
T
≤ f (x̄k ) − απB πA 1 −
k∇f (x̄k )k2
2
3Lα2
+ αk∇f (x̄k )kkYk − πB ŷkT kF +
kYk − πB ŷkT k2F
2
√
3L3 α2
2
T
+
kXk − 1x̄T
k kF + Lα nk∇f (x̄k )kkXk − 1x̄k kF .
2
(41)

11

Summing both sides of (41) over 1, · · · , k, we have

 k
T
3LαπB
πA X
T
απB πA 1 −
k∇f (x̄t )k2
2
t=1
≤ f (x0 ) − f (x̄k ) + α

k
X

k∇f (x̄t )kkYt − πB ŷtT kF

t=1

3Lα
+
2
+

k
X

k 
2 X

(42)

2
kYt − πB ŷt k2F + L2 kXt − 1x̄T
t kF



R EFERENCES

t=1

√

Lα nk∇f (x̄t )kkXt − 1x̄T
t kF

t=1

√
3Lα2 (L2 c20 + c22 ) α( nLc0 + c2 )
+
(1 − θ)2
(1 − θ)2
k
3Lα4 (L2 c21 + c23 ) X
+
k∇f (x̄t )k2
(1 − θ)2
t=1
√
k
α2 ( nLc1 + c3 ) X
k∇f (x̄t )k2
+
(1 − θ)2
t=1

≤ f (x0 ) − f ? +

k
X
√
+ α( nLc0 + c2 )
θt−1 k∇f (x̄t )k2
t=1

where the last inequality follows from (38)
Pkand (40).
2
We can move the terms related to
t=1 k∇f (x̄t )k in
the
right-hand-side
of
(42)
to
the
left-hand-side
to
bound
Pk
2
t=1 k∇f (x̄t )k . To this end, the stepsize α should satisfy
α<
−1

√
T
3L3 c21 + 3Lc23 + L n(c0 + c1 ) + c2 + c3
πA
3LπB
+
Tπ
2
(1 − θ)2 πB
A
(43)
which is followed by

T
πA
3LαπB
T
γ , απB πA 1 −
2

√
α(3L3 c21 + 3Lc23 + L n(c0 + c1 ) + c2 + c3 )
−
> 0.
Tπ
(1 − θ)2 πB
A
(44)
α
, i.e.,
If θk ≤ 1−θ
k ≥ k0 ,

ln(α) − ln(1 − θ)
,
ln(θ)

(45)

then it follows from (42) that
γ

k
X
t=1

+

k∇f (x̄t )k2 ≤ f (x0 ) − f ? +

3Lα2 (L2 c20 + c22 )
(1 − θ)2

√
k0
X
√
α( nLc0 + c2 )
+
α(
nLc
+
c
)
k∇f (x̄t )k2
0
2
(1 − θ)2
t=1

Thus, we have
k

which is (13) in Theorem 1. The inequality (14) follows from
(40) immediately.
Now we look back at (41). Jointly with (13), (38), (40) and
(41), it follows from the supermartingale convergence theorem
[41, Proposition A.4.4] that f (x̄k ) converges.
PkIf f is further
convex, it follows from the convergence of t=1 k∇f (x̄t )k2
that f (x̄k ) converges to f ? .

1X
f (x0 ) − f ?
3Lα2 (L2 c20 + c22 )
k∇f (x̄t )k2 ≤
+
k t=1
γk
γ(1 − θ)2 k
Pk0
√
α( nLc0 + c2 )(1 + t=1
k∇f (x̄t )k2 )
+
2
γ(1 − θ) k

[1] S. Yang, S. Tan, and J.-X. Xu, “Consensus based approach for economic
dispatch problem in a smart grid,” IEEE Transactions on Power Systems,
vol. 28, no. 4, pp. 4416–4426, 2013.
[2] Y. Ho, L. Servi, and R. Suri, “A class of center-free resource allocation
algorithms,” IFAC Proceedings Volumes, vol. 13, no. 6, pp. 475–482,
1980.
[3] L. Xiao and S. Boyd, “Optimal scaling of a gradient method for
distributed resource allocation,” Journal of optimization theory and
applications, vol. 129, no. 3, pp. 469–488, 2006.
[4] H. Lakshmanan and D. P. De Farias, “Decentralized resource allocation
in dynamic networks of agents,” SIAM Journal on Optimization, vol. 19,
no. 2, pp. 911–940, 2008.
[5] C. Zhao, J. Chen, J. He, and P. Cheng, “Privacy-preserving consensusbased energy management in smart grids,” IEEE Transactions on Signal
Processing, vol. 66, no. 23, pp. 6162–6176, 2018.
[6] T. T. Doan and A. Olshevsky, “Distributed resource allocation on
dynamic networks in quadratic time,” Systems & Control Letters, vol. 99,
pp. 57–63, 2017.
[7] T.-H. Chang, M. Hong, and X. Wang, “Multi-agent distributed optimization via inexact consensus ADMM,” IEEE Transactions on Signal
Processing, vol. 63, no. 2, pp. 482–497, 2014.
[8] T.-H. Chang, “A proximal dual consensus ADMM method for multiagent constrained optimization,” IEEE Transactions on Signal Processing, vol. 64, no. 14, pp. 3719–3734, 2016.
[9] N. S. Aybat and E. Y. Hamedani, “A Distributed ADMM-like Method
for Resource Sharing over Time-Varying Networks,” SIAM Journal on
Optimization, vol. 29, no. 4, pp. 3036–3068, 2019.
[10] A. Nedić, A. Olshevsky, and W. Shi, “Improved convergence rates for
distributed resource allocation,” in 2018 IEEE Conference on Decision
and Control (CDC). IEEE, 2018, pp. 172–177.
[11] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “A dual splitting approach for
distributed resource allocation with regularization,” IEEE Transactions
on Control of Network Systems, vol. 6, no. 1, pp. 403–414, 2018.
[12] S. Liang, X. Zeng, G. Chen, and Y. Hong, “Distributed sub-optimal
resource allocation via a projected form of singular perturbation,” arXiv
preprint arXiv:1906.03628, 2019.
[13] Y. Zhu, W. Ren, W. Yu, and G. Wen, “Distributed resource allocation
over directed graphs via continuous-time algorithms,” IEEE Transactions
on Systems, Man, and Cybernetics: Systems, 2019.
[14] P. Xie, K. You, R. Tempo, S. Song, and C. Wu, “Distributed convex
optimization with inequality constraints over time-varying unbalanced
digraphs,” IEEE Transactions on Automatic Control, vol. 63, no. 12, pp.
4331–4337, 2018.
[15] K. Cai and H. Ishii, “Average consensus on general strongly connected
digraphs,” Automatica, vol. 48, no. 11, pp. 2750–2761, 2012.
[16] Y. Xu, K. Cai, T. Han, and Z. Lin, “A fully distributed approach to
resource allocation problem under directed and switching topologies,”
in 2015 10th Asian Control Conference (ASCC). IEEE, 2015, pp. 1–6.
[17] Y. Xu, T. Han, K. Cai, Z. Lin, G. Yan, and M. Fu, “A distributed algorithm for resource allocation over dynamic digraphs,” IEEE Transactions
on Signal Processing, vol. 65, no. 10, pp. 2600–2612, 2017.
[18] P. Li and J. Hu, “An ADMM based distributed finite-time algorithm for
economic dispatch problems,” IEEE Access, vol. 6, pp. 30 969–30 976,
2018.
[19] A. Falsone, I. Notarnicola, G. Notarstefano, and M. Prandini, “TrackingADMM for distributed constraint-coupled optimization,” arXiv preprint
arXiv:1907.10860, 2019.
[20] T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, and K. H. Johansson, “A
distributed algorithm for economic dispatch over time-varying directed
networks with delays,” IEEE Transactions on Industrial Electronics,
vol. 64, no. 6, pp. 5095–5106, 2017.
[21] X. Shi, Y. Wang, S. Song, and G. Yan, “Distributed optimisation for
resource allocation with event-triggered communication over general
directed topology,” International Journal of Systems Science, vol. 49,
no. 6, pp. 1119–1130, 2018.

12

[22] H. Zhang, H. Li, Y. Zhu, Z. Wang, and D. Xia, “A distributed stochastic
gradient algorithm for economic dispatch over directed network with
communication delays,” International Journal of Electrical Power &
Energy Systems, vol. 110, pp. 759–771, 2019.
[23] Y. Yuan, H. Li, J. Hu, and Z. Wang, “Stochastic gradient-push for
economic dispatch on time-varying directed networks with delays,”
International Journal of Electrical Power & Energy Systems, vol. 113,
pp. 564–572, 2019.
[24] H. Li, Q. L, and T. Huang, “Convergence analysis of a distributed
optimization algorithm with a general unbalanced directed communication network,” IEEE Transactions on Network Science and Engineering,
vol. 6, no. 3, pp. 237–248, 2019.
[25] S. Pu, W. Shi, J. Xu, and A. Nedic, “Push-pull gradient methods for
distributed optimization in networks,” IEEE Transactions on Automatic
Control, pp. 1–1, 2020.
[26] R. Xin and U. A. Khan, “A linear algorithm for optimization over
directed graphs with geometric convergence,” IEEE Control Systems
Letters, vol. 2, no. 3, pp. 315–320, July 2018.
[27] F. Saadatniaki, R. Xin, and U. A. Khan, “Decentralized optimization
over time-varying directed graphs with row and column-stochastic
matrices,” IEEE Transactions on Automatic Control, pp. 1–1, 2020.
[28] Y. Nesterov, Introductory lectures on convex optimization: A basic
course. Springer Science & Business Media, 2013, vol. 87.
[29] B. Gharesifard and J. Cortés, “When does a digraph admit a doubly
stochastic adjacency matrix?” in American Control Conference, 2010.
IEEE, 2010, pp. 2440–2445.
[30] S. Boyd and L. Vandenberghe, Convex optimization.
Cambridge
university press, 2004.
[31] D. Bertsekas, Nonlinear programming.
Belmont, Massachusetts:
Athena Scientific, 2016.
[32] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying
directed graphs,” IEEE Transactions on Automatic Control, vol. 60,
no. 3, pp. 601–615, 2015.
[33] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence
for distributed optimization over time-varying graphs,” SIAM Journal on
Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
[34] J. Zeng and W. Yin, “Extrapush for convex smooth decentralized
optimization over directed networks,” Journal of Computational Mathematics, vol. 35, no. 4, pp. 383–396, 2017.
[35] C. Xi and U. A. Khan, “DEXTRA: A fast algorithm for optimization
over directed graphs,” IEEE Transactions on Automatic Control, vol. 62,
no. 10, pp. 4980–4993, 2017.
[36] V. S. Mai and E. H. Abed, “Distributed optimization over weighted
directed graphs using row stochastic matrix,” in 2016 American Control
Conference (ACC). IEEE, 2016, pp. 7165–7170.
[37] J. Zhang and K. You, “AsySPA: An exact asynchronous algorithm for
convex optimization over digraphs,” IEEE Transactions on Automatic
Control, vol. 65, no. 6, pp. 2494–2509, 2020.
[38] J. Zhang and K. You, “Asynchronous decentralized optimization in
directed networks,” arXiv preprint arXiv:1901.08215, 2019.
[39] X. Zhao and A. H. Sayed, “Asynchronous adaptation and learning over
networks-Part I: Modeling and stability analysis,” IEEE Transactions on
Signal Processing, vol. 63, no. 4, pp. 811–826, 2015.
[40] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed, “Decentralized
consensus optimization with asynchrony and delays,” IEEE Transactions
on Signal and Information Processing over Networks, vol. 4, no. 2, pp.
293–307, 2018.
[41] D. P. Bertsekas, Convex Optimization Algorithms. Athena Scientific
Belmont, 2015.
[42] S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,”
in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018,
pp. 963–968.
[43] Y. Zhao, X. He, and L. Chen, “A distributed strategy based on ADMM
for dynamic economic dispatch problems considering environmental
cost function with exponential term,” in IECON 2017-43rd Annual
Conference of the IEEE Industrial Electronics Society. IEEE, 2017,
pp. 7387–7392.
[44] L. Zheng and C. W. Tan, “Optimal algorithms in wireless utility maximization: Proportional fairness decomposition and nonlinear perronfrobenius theory framework,” IEEE Transactions on Wireless Communications, vol. 13, no. 4, pp. 2086–2095, 2014.
[45] J.-B. Hiriart-Urruty and C. Lemaréchal, Conjugacy in Convex Analysis.
Berlin, Heidelberg: Springer Berlin Heidelberg, 1993, pp. 35–90.
[Online]. Available: https://doi.org/10.1007/978-3-662-06409-2 2
[46] “Manufacturing emails network dataset – KONECT,” Sep. 2016.
[Online]. Available: http://konect.uni-koblenz.de/networks/radoslaw
email

[47] J. Kunegis, “KONECT – The Koblenz Network Collection,” in
Proc. Int. Conf. on World Wide Web Companion, 2013, pp. 1343–
1350. [Online]. Available: http://userpages.uni-koblenz.de/∼kunegis/
paper/kunegis-koblenz-network-collection.pdf
[48] J. Zhang, K. You, and T. Başar, “Distributed discrete-time optimization
in multiagent networks using only sign of relative state,” IEEE Transactions on Automatic Control, vol. 64, no. 6, pp. 2352–2367, 2018.
[49] G. Qu and N. Li, “Accelerated distributed Nesterov gradient descent,”
IEEE Transactions on Automatic Control, pp. 1–1, 2019.
[50] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge university
press, 2012.

Jiaqi Zhang received the B.S. degree in electronic and information engineering from the School
of Electronic and Information Engineering, Beijing
Jiaotong University, Beijing, China, in 2016. He is
currently pursuing the Ph.D. degree at the Department of Automation, Tsinghua University, Beijing,
China. His research interests include networked control systems, distributed or decentralized optimization and their applications.

Keyou You (SM’17) received the B.S. degree in
Statistical Science from Sun Yat-sen University,
Guangzhou, China, in 2007 and the Ph.D. degree in Electrical and Electronic Engineering from
Nanyang Technological University (NTU), Singapore, in 2012. After briefly working as a Research
Fellow at NTU, he joined Tsinghua University in
Beijing, China where he is now a tenured Associate
Professor in the Department of Automation. He held
visiting positions at Politecnico di Torino, Hong
Kong University of Science and Technology, University of Melbourne and etc. His current research interests include networked
control systems, distributed optimization and learning, and their applications.
Dr. You received the Guan Zhaozhi award at the 29th Chinese Control
Conference in 2010 and the ACA (Asian Control Association) Temasek
Young Educator Award in 2019. He was selected to the National 1000-Youth
Talent Program of China in 2014 and received the National Science Fund for
Excellent Young Scholars in 2017. He is serving as an Associate Editor for
the IEEE Transactions on Cybernetics, IEEE Control Systems Letters(L-CSS),
Systems & Control Letters.

Kai Cai (S’08-M’12-SM’17) received the B.Eng.
degree in Electrical Engineering from Zhejiang University, Hangzhou, China, in 2006; the M.A.Sc.
degree in Electrical and Computer Engineering from
the University of Toronto, Toronto, ON, Canada, in
2008; and the Ph.D. degree in Systems Science from
the Tokyo Institute of Technology, Tokyo, Japan,
in 2011. He is currently an Associate Professor
at Osaka City University. Previously, he was an
Assistant Professor at the University of Tokyo (20132014), and a postdoctoral Fellow at the University of
Toronto (2011-2013). Dr. Cai’s research interests include distributed control
of discrete-event systems and cooperative control of networked multi-agent
systems. He is the co-author (with W.M. Wonham) of Supervisory Control
of Discrete-Event Systems (Springer 2019) and Supervisor Localization
(Springer 2016). He is serving as an Associate Editor for the IEEE Transactions on Automatic Control.

