Noname manuscript No.
(will be inserted by the editor)

Distributed Nonconvex Constrained Optimization
over Time-Varying Digraphs
Gesualdo Scutari · Ying Sun

arXiv:1809.01106v1 [math.OC] 4 Sep 2018

Received: June 3, 2017; Revised June 5, 2018.

Abstract This paper considers nonconvex distributed constrained optimization over networks, modeled as directed (possibly time-varying) graphs. We
introduce the first algorithmic framework for the minimization of the sum of
a smooth nonconvex (nonseparable) function–the agent’s sum-utility–plus a
Difference-of-Convex (DC) function (with nonsmooth convex part). This general formulation arises in many applications, from statistical machine learning
to engineering. The proposed distributed method combines successive convex approximation techniques with a judiciously designed perturbed push-sum
consensus mechanism that aims to track locally the gradient of the (smooth
part of the) sum-utility. Sublinear convergence rate is proved when a fixed stepsize (possibly different among the agents) is employed whereas asymptotic
convergence to stationary solutions is proved using a diminishing step-size.
Numerical results show that our algorithms compare favorably with current
schemes on both convex and nonconvex problems.
1 Introduction
This paper focuses on the following (possibly) nonconvex multiagent composite
optimization problem:
I
X
(P)
fi (x) + G+ (x) − G− (x),
min U (x) ,
x∈K
|
{z
}
i=1
G(x)
| {z }
F (x)

m

where fi : R → R is the cost function of agent i, assumed to be smooth (possibly) nonconvex ; G : Rm → R is a DC function, whose concave part −G− is
smooth; and K is a closed convex subset of Rm . The function G is generally
used to promote some extra structure on the solution, like sparsity. Note that,
differently from most of the papers in the literature, we do not require the
(sub)gradient of fi , G− or G+ to be (uniformly) bounded on K. Agents are
connected through a communication network, modeled as a directed graph,
possibly time-varying. Moreover, each agent i knows only its own function fi
(as well as G and K). In this setting, the agents want to cooperatively solve
Problem(P) leveraging local communications with their immediate neighbors.
Distributed nonconvex optimization in the form (P) has found a wide range
of applications in several areas, including network information processing,
telecommunications, multi-agent control, and machine learning. In particular, Problem (P) is a key enabler of many emerging nonconvex “big data”
Scutari and Ying are with the School of Industrial Engineering, Purdue University, WestLafayette, IN, USA. Emails: <gscutari,sun578>@purdue.edu. This work was supported by
the USA National Science Foundation (NSF) under Grants CIF 1564044 and CAREER
Award No. 1555850; and the Office of Naval Research (ONR) Grant N00014-16-1-2244.
Part of this work has been presented at the 2016 Asilomar Conference on System, Signal,
and Computers [41] and the 2017 IEEE ICASSP Conference [40].

2

Gesualdo Scutari, Ying Sun

analytic tasks, including nonlinear least squares, dictionary learning, principal/canonical component analysis, low-rank approximation, and matrix completion [18], just to name a few. Moreover, the DC structure of G allows to
accommodate in an unified fashion convex and nonconvex sparsity-inducing
surrogates of the `0 cardinality function (cf. Sec. 2). Time-varying communications arise, for instance, in mobile wireless networks (e.g., ad-hoc networks),
wherein nodes are mobile and/or communicate throughout fading channels.
Moreover, since nodes generally transmit at different power and/or communication channels are not symmetric, directed links are a natural assumption.
In most of the above scenarios, data processing and optimization need to be
performed in a distributed but collaborative manner by the agents within the
network. For instance, this is the case in data-intensive (e.g., sensor-network)
applications wherein the sheer volume and spatial/temporal disparity of scattered data render centralized processing and storage infeasible or inefficient.
While distributed methods for convex optimization have been widely studied in the literature, there are no such schemes for (P) (cf. Sec. 1.1). We propose
the first family of distributed algorithms that converge to stationary solutions
of (P) over time-varying (directed) graphs. Asymptotic convergence is proved,
under the use of either constant uncoordinate step-sizes from the agents or
diminishing ones. When a constant step-size is employed, the algorithms are
showed to achieve sublinear convergence rate. Furthermore, the technical tools
we introduce are of independent interest. Our analysis hinges on a descent technique technique valid for nonconvex, nonsmooth, constrained problems based
on a novel Lyapunov-like function (see Sec. 1.2 for the list of contributions).
1.1 Related works
The design of distributed algorithms for (P) faces the following challenges: (i)
U is nonconvex and nonseparable; (ii) G+ is nonsmooth; (iii) there are constraints; (iv) the graph is directed and time-varying, with no specific structure;
and (v) the (sub)gradient of U is not assumed to be bounded on K. We are not
aware of any distributed design addressing (even a subset of) challenges (i)-(v),
as documented next. Since the focus of this work is on distributed algorithms
working on general network architectures, we omit to discuss the vast literature
of schemes implementable on specific topologies, such as hierarchical networks
(e.g., master-slave or shared memory systems); see, e.g., [6,15,16,32,36,37,47]
and references therein for an entry point of this literature.
Distributed convex optimization: Although the focus of this paper is
mainly on nonconvex optimization, we begin overviewing the much abundant
literature of distributed algorithms for convex problems. We show in fact that,
even in this simpler setting, some of the challenges (ii)-(v) remain unaddressed.
−Primal methods: While substantially different, primal methods can be generically abstracted as a combination of a local (sub)gradient-like step and a
subsequent consensus-like update (or multiple consensus updates); examples
include [23, 27, 31, 38, 39]. Algorithms for adaptation and learning tasks based
on in-network diffusion techniques were proposed in [8, 11, 35]. Schemes in
[8, 11, 23, 31, 35, 38, 39] are applicable only to undirected graphs; [8, 31] require

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

3

the consensus matrices to be double-stochastic whereas [11, 35] use only rowstochastic matrices but are applicable only to strongly convex agents’ cost
functions having a common minimizer. When the graph is directed, doublestochastic weight matrices compliant to the graph might not exist or are not
easy to be constructed in a distributed way [20]. This requirement was removed in [27] where the authors combined the sub-gradient algorithm [31] with
push-sum consensus [24]. Other schemes applicable to digraphs are [48, 49].
However, [31, 48, 49] cannot handle constraints. In fact, up until this work
(and the associated conference papers [40, 41]) it was not clear how to leverage push-sum-like protocols to deal with constraints over digraphs. Finally, as
far as challenge (v) is concerned, only recent proposals [30, 33, 38, 39, 48, 49, 51]
removed the assumption that the (sub-)gradient of U has to be bounded; however [30, 33, 38, 48, 49, 51] can handle only smooth and unconstrained problems
while [33, 38, 39, 49, 51] are not implementable over digraphs.
−Dual-based methods: This class of algorithms is based on a different approach:
slack variables are first introduced to decouple the sum-utility function while
forcing consistency among these local copies by adding consensus equality
constraints (compliant with the graph topology). Lagrangian dual variables
are then introduced to deal with such coupling constraints. The resulting
algorithms build on primal-dual updates, aiming at converging to a saddle
point of the (augmented) Lagrangian function. Examples of such algorithms
include ADMM-like methods [9, 22, 39, 45] as well as inexact primal-dual instances [10, 25, 26]. All these algorithms can handle only static and undirected
graphs. Their extensions to time-varying graphs or digraphs seem not possible, because it is not clear how to enforce consensus via equality constraints
over time-varying or directed networks. Furthermore, all the above schemes
but [9, 39] require U to be smooth and (P) to be unconstrained.
In summary, even restricting to convex instances of (P), there exists no
distributed algorithm in the literature that can deal with either constraints
[issue (iii)] or nonsmooth U [issue (ii)] with nonbounded (sub-)gradient [issue
(v)] over (time-varying) digraphs. Also, it is not clear how to extend the convergence analysis developed in the above papers when U is no longer convex.
Distributed nonconvex optimization: Distributed algorithms dealing with
special instances of Problem (P) are scarce; they include primal methods
[4, 12, 42, 44] and dual-based schemes [21, 53]. The key features of these algorithms are summarized in Table 1 and discussed next.
−Primal methods: The scheme in [4] combines the distributed stochastic projection algorithm, employing a diminishing step-size, with the random gossip protocol. It can handle smooth objective functions over undirected static
graphs; no rate analysis of the scheme is known. In [42], the authors showed
that the (randomly perturbed) push-sum gradient algorithm with diminishing
(square summable) step-size, earlier proposed for convex objectives in [27],
converges also when applied to nonconvex smooth unconstrained problems.
Asymptotic convergence and a sublinear convergence rate were proved (the
latter under the assumption that the set of stationary points of U is finite).
The first, to our knowledge, provably convergent distributed scheme for (P),

4

Gesualdo Scutari, Ying Sun
ProjDGM
[4]
nonsmooth G

+

NEXT
[12]

Push-sum

Prox-

DGM

PDA

[42]

[21]

X

constraints

X

step-size:
complexity

X
time-varying

X

X

digraph

restricted

X

constant
diminishing

K compact

X

This work

X
X
X
X

X
X

SONATA

X

X

unbounded gradient
network
:
topology

DeFW [44]

X

X

X

X

X

X

X

X

X

Table 1 Distributed nonconvex optimization: Current works and contribution of this paper.

with G+ 6= 0 and constraints K, over time-varying graphs is NEXT [12]. The
algorithm requires the consensus matrices to be doubly-stochastic. Asymptotic convergence was proved, when a diminishing step-size is employed; no
rate analysis was provided. A special instance of NEXT was studied in [44],
where the authors considered smooth (possibly nonconvex) U over undirected
static graphs. Under a diminishing step-size (and further technical assumptions
on the set of stationary solutions), a sublinear convergence rate is proved. Finally, all the algorithms discussed above require that the (sub)gradient of U is
bounded on K (or Rm ). This is a key assumption to prove convergence: in the
analysis of descent, it permits to treat the optimization and consensus steps
separately, with the consensus error being a summable perturbation.
−Dual-based methods: In [53] a distributed approximate dual subgradient algorithm, coupled with a consensus scheme (using double-stochastic weight
matrices), is introduced to solve (P) over time-varying graphs. Assuming zeroduality gap, the algorithm is proved to asymptotically find a pair of primaldual solutions of an auxiliary problem, which however might not be stationary
for the original problem; also, consensus is not guaranteed. No rate analysis
is provided. In [21], a proximal primal-dual algorithm is proposed to solve
an unconstrained, smooth instance of (P) over undirected static graphs. The
algorithm employs either a constant or increasing penalty parameter (which
plays the role of the step-size); a global sublinear convergence rate is proved.
The algorithm can also deal with nonsmooth convex regularizes and norm constraints when it is applied to some distributed matrix factorization problems.
Gradient-tracking: The proposed algorithmic framework leverages the idea
of gradient tracking: each agent updatesP
its own local variables along a direcI
tion that is a proxy of the sum-gradient i=1 ∇fi at the current iteration, an
information that is not locally available. The idea of tracking the gradient averages through the use of consensus coupled with distributed optimization was
independently introduced in [12–14] (NEXT framework) for constrained, nonsmooth, nonconvex instances of (P) over time-varying graphs and in [51] for the
case of strongly convex, unconstrained, smooth optimization over static undirected graphs. This tracking protocol was extended to arbitrary (time-varying)
digraphs (without requiring doubly-stochastic weight matrices) in our conference work [41]. A convergence rate analysis of the scheme in [51] was later
developed in [30, 33], with [30] considering (time-varying) directed graphs. We
refer the reader to Sec. 3 for a more detailed discussion on this topic.

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

5

1.2 Summary of contributions
We summarize our major contributions as follows; see also Table 1.
1. Novel algorithmic framework: We propose the first provably convergent distributed algorithmic framework for the general class of Problem (P),
addressing all challenges (i)-(iv). The proposed approach hinges on Successive Convex Approximation (SCA) techniques, coupled with a judiciously
designed perturbed push-sum consensus mechanism that aims to track locally the gradient of F . Both communication and tracking protocols are
implementable on arbitrary time-varying undirected or directed graphs,
and in the latter case only column-stochasticity of the weight matrices is
required. Also, feasibility of the iterates is preserved at each iteration. Either constant or diminishing step-size rules can be used in the same scheme,
and convergence to stationary solutions of Problem (P) is established.
2. Iteration complexity: We prove that the proposed scheme has sublinear
convergence rate as long as the positive step-size is smaller than an explicit upper bound; different step-sizes among the agents can also be used.
To the best of our knowledge, this is the first convergence/complexity result of distributed algorithms employing a constant step-size for nonconvex
(constrained) optimization over (time-varying) digraphs.
3. New Lyapunov-like function and descent technique: We improve
upon existing convergence techniques and introduce new ones. Current
analysis of distributed algorithms has trouble handling nonconvex, nonsmooth, constrained optimization. Moreover, in the presence of unbounded
(sub-)gradients of the objective function, descent on the objective function while treating optimization and consensus errors separately no longer
works. A new convergence analysis is introduced to overcome this difficulty
based on a novel “Lyapunov”-like function that properly combines suitably
defined weighted average dynamics, consensus and tracking disagreements.
4. Broader class of problems and convergence results: The proposed
algorithmic framework and convergence results are applicable to a significantly larger class of (constrained) optimization problems and network
topology than current distributed schemes, including several instances arising from machine learning, signal processing, and data analytic applications (cf. Sec.2.1). Moreover, we contribute to the theory of distributed
algorithms also for convex problems, being our schemes the first able to
provably deal with either constraints [issue (iii)] or nonsmooth U [issue (ii)]
with nonbounded (sub-)gradient [issue (v)] over (time-varying) digraphs.
Finally, our algorithm contains as special cases several recently gradientbased algorithms whose convergence was proved under more restrictive assumptions on the optimization problem and network topology (cf. Sec. 5).
Finally, preliminary numerical results show that the proposed schemes compare favorably with state-of-the-art algorithms.
The rest of the paper is organized as follows. The problem setting is discussed in Sec. 2 along with some motivating applications. Some preliminary
results, including a perturbed push-sum consensus scheme over time-varying

6

Gesualdo Scutari, Ying Sun

digraphs, are introduced in Sec. 3. Sec. 4 describes the proposed algorithmic
framework along with its convergence properties, whose proofs are given in
Sec. 6. Finally, some numerical results are presented in Sec. 7.
Notation. The set of nonegative (resp. positive) natural number is denoted
by N+ (resp. N++ ). A vector x is viewed as a column vector; matrices are
denoted by bold letters. We work with the space Rm , equipped with the standard Euclidean norm, which is denoted by k • k; when the argument of k • k
is a matrix, the default norm is the spectral norm. When some other (vector
or matrix) norms are used, such as `1 -norm, or infinity-norm, we will use the
notation k • kp with the corresponding value of p. The transpose of a vector x
is denoted by x> . The Kronecker product is denoted by ⊗. We use 1 to denote
a vector with all entries equal to 1, and I to denote the identity matrix; With
some abuse of notation, the dimensions of 1 and I will not be given explicitly
but understood within the context. Given I ∈ N++ , we define [I] , {1, . . . , I}.
2 Problem Setup and Motivating Examples
We study Problem (P) under the following assumptions.
Assumption A (On Problem (P)) Given Problem (P), suppose that
A.1 The set K ⊆ Rm is (nonempty) closed and convex;
A.2 Each fi : O → R is C 1 , where O ⊇ K is an open set, and ∇fi is Li Lipschitz on K;
A.3 G+ : K → R is convex (possibly nonsmooth), and G− : O → R is C 1 with
∇G− being LG -Lipschitz on K;
A.4 U is lower bounded on K.
We also made the blanket assumption that each agent i knows only its on
function fi and the regularizer G but not the functions of the other agents.
Assumptions A.1 A.2 and A.4 are quite standard and satisfied by several
problems of practical interest. We remark that, as a major departure from
most of the literature on distributed algorithms, we do not assume that the
gradient of F (and G− ) is bounded on the feasible set K. This, together with
the nonconvexity of G as stated in A.3, opens the way to design for the first
time distributed algorithms for a gamut of new applications, including several
big-data problems in statistical learning; see Sec. 2.1 for details.
On the network topology: Agents communicate through a (possibly) timevarying network. Specifically, time is slotted with n denoting the iteration
index (time-slot); in each time-slot n, the communication network of agents
is modeled as a (possibly) time-varying digraph G n = ([I], E n ), where [I] =
{1, . . . , I} denotes the set of agents–the vertices of the graph–and the set of
edges E n represents the agents’ communication links; we use (i, j) ∈ E n to
indicate that the link is directed from node i to node j. The in-neighborhood
of agent i at time n is defined as Niin [n] = {j | (j, i) ∈ E n } ∪ {i} (we included in
the set node i itself, for notational simplicity); it represents the set of agents
which node i can receive information from. The out-neighborhood of agent
i is Niout [n] = {j | (i, j) ∈ E n } ∪ {i}–the set of agents receiving information

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

7

from node i (including node i itself). The out-degree of agent i is defined as
dni , |Niout [n]|. To let information propagate over the network, we assume
that the graph sequence {G n }n∈N+ possesses some “long-term” connectivity
property, as formally stated next.
Assumption B (On graph connectivity) The graph sequence {G n }n∈N+
is B-strongly connected, i.e., there exists a finite integer B > 0 such that the
k+B−1 t
graph with edge set ∪t=k
E is strongly connected, for all k ≥ 0.

We conclude this section discussing some instances of Problem (P) in the
context of statistical learning.
2.1 Distributed sparse statistical learning

We consider two distributed nonconvex problems in statistical learning, namely:
i) a nonconvex sparse linear regression problem; and ii) the sparse Principal
Component Analysis (PCA) problem.
Nonconvex Sparse Linear Regression. Consider the problem of retrieving
a sparse signal x ∈ Rm from the observations {bi }Ii=1 , where each bi = Ai x is
a linear measurement of the signal acquired by agent i. A mainstream approach
in the literature is to solve the following optimization problem
I
X
2
(1)
min
kbi − Ai xk + λ · G (x) ,
x

i=1

where the quadratic term measures the model fitness whereas the regularizer
G is used to promote sparsity in the solution, and λ > 0 is chosen to balance
the trade-off between the model fitness and solution sparsity. Problem (1) is
clearly an instance of (P). Note that each agent knows only its own function
fi (since bi is own only by agent i). Also, ∇fi is not bounded on Rm .
To promote sparsity on the solution, the ideal choice for G would be the cardinality of x (a.k.a. `0 “norm” of x). However, its combinatorial nature makes
the resulting optimization problem numerically intractable as the variable dimension m becomes large. Several convex and, more recently, also nonconvex
surrogates of the `0 function have been proposed in the literature. The structure of G, as stated in Assumption A.3, captures either choices. For instance,
one can choose as regularizer in (1), the `2 or `1 norm of x (and thus G− = 0),
which leads to the ridge and LASSO regression problems, respectively. Moreover, a vast class of nonconvex surrogates can also be considered, including
the SCAD [17], the “transformed” `1 [52], the logarithmic [46], and the exponential [7]; see Table 2. It is well documented that nonconvex regularizers
outperform the `1 norm in enhancing solution sparsity. Quite interestingly,
all the widely used nonconvex surrogates listed in Table 2 enjoy the following
separable DC structure (see, e.g., [1, 43] and references therein)
m
X

(2)
G(x) =
g(xi ), with g (xi ) = η (θ) |xi | − η (θ) |xi | − g (xi ) ,
| {z } |
{z
}
i=1

,g + (xi )

,g − (xi )

where the expression of g : R → R is given in Table 2; and η (θ) is a fixed given
function, defined in Table 3 for each of the surrogate g listed in Table 2. The

8

Gesualdo Scutari, Ying Sun

Table 2 Examples of nonconvex surrogates of the `0 function having a DC structure [cf. (2)]

Penalty function

Expression

Exp [7]

gexp (x) = 1 − e−θ|x|

`p (0 < p < 1) [19]

g`+
(x) = (|x| + )1/θ ,
p

`p (p < 0) [34]

g`−
(x) = 1 − (θ|x| + 1)p
p

2θ
0 ≤ |x| ≤ θ1

 a+1 |x|,

2
2
gscad (x) = −θ |x|a+2aθ|x|−1
, θ1 < |x| ≤ aθ
2 −1



1,
|x| > aθ

SCAD [17]

glog (x) = log(1+θ|x|)
log(1+θ)

Log [46]

parameter θ controls the tightness of the approximation of the `0 function: in
fact, it holds that limθ→+∞ g(xi ) = 1 if xi 6= 0, otherwise limθ→+∞ g(xi ) = 0.
Note that g − is convex and has Lipschitz continuous first derivative dg − /dx
[43], whose closed form is given in Table 3.
It is not difficult to check that Problem (1), with any of the regularizers
discussed above, is an instance of (P) and satisfies Assumption A. Also, note
that the gradient of the smooth part is not bounded on Rm .
Sparse PCA. Consider finding the sparse principal component of a distributed data set given by the rows of a set of matrices Di ’s (each Di is
own by agent i). The problem can be formulated as
I
X
2
(3)
max
kDi xk − λ · G (x) ,
kxk2 ≤1

i=1

where G can be any of the sparse-promoting regularizers discussed in the
previous example. Clearly, Problem (3) is another (nonconvex) instance of
Problem (P) (satisfying Assumption A).
3 Preliminaries: The perturbed condensed push-sum algorithm
The proposed algorithmic framework combines local optimization based on
SCA with constrained consensus and tracking of gradient averages over digraphs. The consensus problem over graphs has been widely studied in the
literature; a renowned distributed scheme solving this problem over (possibly time-varying) digraphs is the so-called push-sum algorithm [24]. A perturbed version of the push-sum scheme has been introduced in [27] to solve
unconstrained optimization problems over (time-varying) digraphs. However,
it is not clear how to leverage the push-sum update and extend these optimization schemes to deal with constraints. In this section, we introduce a
reformulation of the perturbed push-sum protocol [27]–termed perturbed condensed push-sum–that is more suitable for the integration with constrained

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

9

Table 3 Explicit expression of η(θ) and dg − /dx [cf. (2)]

g

η(θ)

dgθ− /dx

gexp

θ

sign(x) · θ · (1 − e−θ|x| )

g`+
p

1 1/θ−1
θ

1
1
1
θ −1 − (|x| + ) θ −1 ]
θ sign(x) · [

g`−
p

−p · θ

gscad

2θ
a+1

−sign(x) · p · θ · [1 − (1 + θ|x|)p−1 ]


0,
|x| ≤ θ1


2θ(θ|x|−1)
sign(x) · a2 −1 , θ1 < |x| ≤ aθ



2θ
sign(x) · a+1
,
otherwise

glog

θ
log(1+θ)

2

θ |x|
sign(x) · log(1+θ)(1+θ|x|)

optimization. This scheme will be then used to build the gradient tracking
and constrained consensus mechanisms embedded in the proposed algorithmic
framework (cf. Sec. 4).
Consider a network of I agents, as introduced in Sec. 2, communicating over
a time-varying digraph (cf. Assumption B). Each agent i controls a vector
of variables x(i) ∈ Rm as well as a scalar φi that are iteratively updated,
based upon the information received from its immediate neighbors. Let xn(i)
and φni denote the values of x(i) and φi at iteration n ∈ N+ . We let agents’
updates be subject to a(n adversarial) perturbation; we denote by δ ni ∈ Rm
the perturbation injected in the update of agent i at iteration n. Given xn(i)
and φni , the perturbed condensed consensus algorithm reads:
I
X
anij φnj ,
φn+1
=
(4a)
i
j=1

xn+1
(i) =

1

I
X

φn+1
i
j=1

anij φnj xn(j) + δ n+1
,
i

(4b)

for all n ∈ N+ and i ∈ [I], where x0(i) are arbitrarily chosen and φ0i are positive
PI
scalars such that i=1 φ0i = I; and An , (anij )Ii,j=1 is a (possibly) time-varying
matrix of weights whose nonzero pattern is compliant with the topology of the
graph G n , in the sense of the assumption below.

Assumption C (On the weight matrix An ) Each An , (anij )Ii,j=1 is compliant with G n , that is,
C1. anii ≥ κ > 0, for all i ∈ [I];
C2. anij ≥ κ > 0, if (j, i) ∈ E n ; and anij = 0 otherwise.

Under Assumption C, the protocol (4) is implementable in a distributed fashion: each agent i updates its own variables using only the information φnj xn(j)

10

Gesualdo Scutari, Ying Sun

and φnj received from its current in-neighbors (and its own). We study convergence of (4) under the following further (standard) assumption on An .
Assumption D (Column stochasticity) Each matrix An is column stochastic, that is, 1> An = 1> .
The role of the extra variables φi is to dynamically rebuild the row stochasticity of the equivalent weight matrix governing variables’ updates, which is a
key condition to lock consensus. This can be easily seen rewriting the dynamics
(4b) in terms of the equivalent weights Wn , (wij )Ii,j=1 :
xn+1
(i) =

I
X

n n
wij
x(j) ,

n
wij
,

j=1

anij φnj
φn+1
i

.

(5)

It is not difficult to check that, under Assumption D, Wn is row-stochastic.
To state the main convergence result in compact form, we introduce the
following notation. Let
> >
] ,
xn , [xn(1)> , . . . , x(I)

(6a)

>
φ , [φn1 , . . . , φnI ] ,
δ n , [δ n1 > , . . . , δ nI > ]> .

(6b)

n

(6c)

Noting that, in the absence of perturbation (i.e., δ n = 0), the weighed sum
PI
PI
PI
n+1 n+1
n n
x(i) = · · · = i=1 φ0i x0(i) ,
i=1 φi x(i) is an invariant of (4), that is,
i=1 φi
we define the consensus disagreementPat iteration n as the deviation of each
I
xn(i) from the weighted average (1/I) i=1 φni xn(i) :
I

enx , xn − 1 ⊗

1X n n
φ x .
I i=1 i (i)

(7)

The dynamics of the error enx are studied in the following proposition (whose
proof is postponed to Sec. 3.2).
Proposition 1 Let {G n }n∈N+ be a sequence of digraphs satisfying Assumption B, and let {(φn , xn )}n∈N+ be the sequence generated by the perturbed
condensed push-sum protocol (4), for a given perturbation sequence {δ n }n∈N+
and weight matrices {An }n∈N+ satisfying Assumptions C-D. Then, there hold:
(i) [Bounded {φn }n∈N+ ]:
inf min φni ≥ φlb ,

φlb , κ2(I−1)B ,

sup max φni ≤ φub ,

φub , I − κ2(I−1)B ,

n∈N+ i∈[I]
n∈N+ i∈[I]

(8)

with B ≥ 1 and κ ∈ (0, 1) defined in Assumption B and Assumption C,
respectively;

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

11

(ii) [Error decay]: For all n, k ∈ N+ , n ≥ k,
kenx k ≤ λk ken−k
k + λt ·
x
where
λt , min

n√

k−1
X
t=0

kδ n−t k,

(9)


o
t
2 I, 2 c0 I (ρ) (I−1)B ,

and


c0 , 2 1 + κ̃−(I−1)B ,

ρ , 1 − κ̃(I−1)B ,

κ̃ , κ2(I−1)B+1 /I.

(10)



B̄
Furtheremore, there exists a finite B̄ ∈ N+ such that ρB̄ , 2c0 I(ρ) (I−1)B < 1.
Remark 1 The perturbed consensus algorithm (4) was mainly designed for
digraphs. However, when the graph is undirected, one can choose the weight
matrix An to be double stochastic and get rid of the auxiliary variables φn ,
just setting in (4) φ0 = 1. As a consequence, φn ≡ 1 and Wn ≡ An , for all
t
n ∈ N+ . In this case, using [29, Lemma 9], the expression of λp
in Proposition 1
t
bt/Bc
can be tightened by letting λ , min{1, (ρ)
}, with ρ , 1 − κ/ (2I 2 ).
3.1 Discussion
Proposition 1 provides a unified set of convergence conditions of the perturbed
condensed push-sum scheme that are applicable to any given perturbation sequence {δ n }n∈N+ . We discuss next two special cases, namely: the plain average
consensus problem and the distributed tracking of time-varying signals.
1. (Weighted) average consensus: Setting in (4) δ n = 0, for all n ∈
N+ , (4) reduces to the plain (condensed) push-sum scheme. whose geometric
PI
convergence to the (weighted) average of the initial values, (1/I) i=1 φ0i x0(i) ,
PI
n+1 n+1
follows readily from Proposition 1. More specifically, using
x(i)
i=1 φi
PI
0 0
= · · · = i=1 φi x(i) , (9) yields
I

x

n+1

n+1
1X 0 0
≤ 2 c0 I (ρ)b (I−1)B c e0x ,
φ x
−1⊗
I i=1 i (i)

n ∈ N+ .

(11)

Note that, since the weight matrix Wn in (5) is row stochastic, if the initial
values x0(i) all belong to a common set K, then xn(i) ∈ K, for all n ∈ N++ ; that
is feasibility of the iterates is preserved.
2. Tracking of time-varying signals’ averages: Consider the problem of
tracking distributively the average of time-varying signals. At each iteration
n ∈ N+ , each agent i evaluates (or generates) a signal sample uni ∈ Rm from
the (time-varying) sequence {uni }n∈N+ . The goal is to design a distributed
algorithm obeying the communication structure of the graphs G n that tracks
the average of the signals {uni }n∈N+ , that is,
I
1X n
u .
(12)
lim kxn − 1 ⊗ ūn k = 0, ūn ,
n→∞
I i=1 i

12

Gesualdo Scutari, Ying Sun

The perturbed condensed push-sum algorithm (4) can be readily used to
accomplish this task by setting

1
δ n+1
= n+1 un+1
− uni ,
i ∈ [I], n ∈ N+ ,
(13)
i
i
φi
and x0i = u0i , i ∈ [I]. Convergence of this scheme is stated next.

Corollary 2 Let {uni }n∈N+ be a given sequence such that limn→∞ kun+1
−
i
uni k = 0, for all i ∈ [I]. Consider the perturbed condensed push-sum protocol
(4), under the assumptions of Proposition 1; and set δ n+1
as in (13) and
i
x0i = u0i , for all i ∈ [I]. Then, (12) holds.
Proof The
readily from Proposition 1 and the following two facts:
PIproof follows
n+1
xn+1
; and ii) [28, Lemma 7]
i) (1/I) i=1 φn+1
i
(i) = ū
lim kδ n k = 0 ⇒ lim

n→∞

n→∞

n−1
X
t=0

t

(ρ)b (I−1)B c kδ n−t k = 0.


3.2 Proof of Proposition 1
To prove Proposition 1, it is convenient to rewrite the perturbed consensus
protocol (4) in a vector-matrix form. To do so, let us introduce the following
quantities: given the weight matrix An compliant with G n (cf. Assumption C)
and Wn defined in (5), let
Dφn , Diag (φn ) ,
(14a)
b φn , Dφn ⊗ I,
D
(14b)
b n , An ⊗ I,
A

(14c)

W , W ⊗ I,

(14d)

cn

n

where Diag(•) denotes a diagonal matrix whose diagonal entries are the elements of the vector argument, and I is the m × m identity matrix. Under the
column stochasticity of An , it is not difficult to check that the following holds:

−1
−1 n
b φn+1
bn D
b φn .
cn = D
Wn = Dφn+1
A Dφn and W
A
(15)
Using the above notation and (6), the perturbed push-sum protocol (4)
can be rewritten in matrix-vector form as
φn+1 = An φn

c n xn + δ n+1 .
and xn+1 = W

(16)

To study convergence of (16), it is convenient to introduce the following matrix
products: given n, k ∈ N+ , with n ≥ k,
(
An An−1 · · · At , if n > k,
n:k
A ,
An ,
if n = k,
(
(17)
n
n−1
k
W
W
·
·
·
W
,
if
n
>
k,
Wn:t ,
Wn ,
if n = k,

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

13

and
b n:k , An:k ⊗ I,
A

c n:k , Wn:k ⊗ I.
W

(18)

Define the weight-averaging matrix
Jφn ,
so that Jφn xn = 1 ⊗ I1


1
1 (φn )> ⊗ I,
I

(19)

PI

n n
i=1 φi x(i) . Also, it is not difficult to check the
c n:t , and A
b n:t : for n, k ∈ N+ ,
following chain of equalities hold among Jnφ , W

with n ≥ k,

b φk = Jφk (b)
c n:k Jφk ,
c n:k (a)
Jφn+1 W
= J1 D
=W

(20)

c n [cf. (15)], Jφn+1 [cd. (20)], and the
where in (a) we used the definition of W
b n ; and (b) is due to the row stochasticity of W
c n:k .
column stochasticity of A
n
n
n
The consensus error ex in (7) can be rewritten as ex = (I − Jφn )x .
To study the evolution of enx , we apply the x-update (16) recursively and
obtain
k−1
X
c n−1:n−k xn−k +
c n−1:n−t δ n−t + δ n .
xn = W
W
(21)
t=1

Using (20) and (21), the weighted average Jφn xn can be written as
Jφn xn = Jφn−k xn−k +

k−1
X

Jφn−t δ n−t + Jφn δ n .

(22)

t=1


c n−1:n−k − Jφn−k Jφn−k = 0
Subtracting (22) from (21) and using W
[cf. (20)], we can bound the consensus error en+1
as
x
c n−1:n−k − Jφn−k ken−k k +
kenx k ≤ W
x

k−1
X
t=1

c n−1:n−t − Jφn−t kδ n−t k
W

n

+ I − Jφ n kδ k.

(23)
Convergence of the perturbed consensus protocol reduces to studying the
c n:k − Jφk k, as done in the lemma below.
dynamics of the matrix product kW
Lemma 3 Let {G n }n∈N+ be a sequence of digraphs satisfying Assumption B;
let {An }n∈N+ be a sequence of weight matrices satisfying Assumptions CD; and let {Wn }n∈N+ be the sequence of row stochastic matrices related to
{An }n∈N+ by (15). There holds:
o

n√
n−k+1
c n:k − Jφk ≤ min
2 I, 2 c0 I(ρ) (I−1)B , n, k ∈ N+ , n ≥ k, (24)
W
where c0 and ρ are defined in Proposition 1.
Proof See Appendix A.



14

Gesualdo Scutari, Ying Sun

The error decay law√(9) comes readily from√(23), Lemma 3, and the following
fact: I − Jφn ≤ 2 I ≤ λ0 , min{2c0 I, 2I}, which is proved below. Let
> >
z ∈ RI·m be an arbitrary vector; let us partition z as z = [z>
1 , . . . , zI ] , with
m
each zi ∈ R . Then,
√
I
I
X
(a)
I X
n
n
zi −
φni zi
k(I − Jφ ) zk ≤ kz − J1 zk + kJ1 z − Jφ zk ≤ kzk +
I i=1
i=1
√
√
Ip 2
I − I kzk ≤ 2 I kzk ,
≤ kzk +
I
(25)
where in (a) we used kI − J1 k = 1.



4 Algorithmic Design
We are ready to introduce the proposed distributed algorithm for Problem (P).
To shed light on the core idea of the novel framework, we begin introducing an
informal and constructive description of the algorithm (cf. Sec. 4.1), followed
by its formal statement along with its convergence properties (cf. Sec. 4.2).
4.1 SONATA at-a-glance
Each agent i maintains and updates iteratively a local copy x(i) of the global
n
variable x, along with an auxiliary variable y(i) ∈ Rm ; let xn(i) and y(i)
denote
the values of x(i) and y(i) at iteration n, respectively. Roughly speaking, the
update of these variables is designed so that all the xn(i) will be asymptotically
consensual, converging to a stationary solution of (P); and each y(i) tracks
PI
locally the average of the gradients (1/I) · i=1 ∇fi , an information that is
not available at the agent’s side. More specifically, the following two steps are
performed iteratively and in parallel across the agents.
Step 1:P
Local SCA. The nonconvexity of fi together with the lack of knowledge of j6=i fj in F , prevent agent i to solve Problem (P) directly. To cope
with these issues, we leverage SCA techniques: at each iteration n, given the
n
current iterate xn(i) and y(i)
, agent i solves instead a convexification of (P),
having the following form:



en , argmin Fei x(i) ; xn , yn + G+ x(i) ,
x
(26)
(i)

(i)

x(i) ∈K

(i)

and updates its x(i) according to
n+1/2

x(i)



en(i) − xn(i) ,
= xn(i) + αn x

(27)

n
where αn ∈ (0, 1) is a step-size (to be properly chosen). In (26), Fei (•; xn(i) , y(i)
)
is chosen as:

>


n
, fei x(i) ; xn(i) − ∇G− xn(i)
x(i) − xn(i)
Fei x(i) ; xn(i) , y(i)
(28)
>

n
+ I · y(i)
− ∇fi xn(i)
x(i) − xn(i) ,

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

15

where fei (•; xn(i) ) is a strongly convex approximation of fi at the current iterate xn(i) (see Assumption E below); the second term is the linearization
n
of the smooth nonconvex function −G− ; and y(i)
, as anticipated, aims at
PI
n
n
tracking the gradient average (1/I) j=1 ∇fj (x(i) ), that is, limn→∞ ky(i)
−
PI
n
(1/I) j=1 ∇fj (x(i) )k = 0. This sheds light on the role of the last term in
n
(28): under the claimed tracking properties of y(i)
, there would hold:
lim

n→∞

n
− ∇fi xn(i)
I · y(i)



−

X
j6=i

∇fj (xn(i) ) = 0.

(29)

Therefore,
the last term in (28) can be seen as a proxy of the gradient sum
P
n
∇f
(x
j
j6=i
(i) ), which is not available at agent i’s site. Building on the perturbed condensed push-sum protocol introduced in Sec. 3 we will show in Step
n
2 below how to update y(i)
so that (29) holds, using only local information.
e
The surrogate function fi satisfies the following assumption.
Assumption E (On surrogate function fei ) Let fei : K × K → R be a C 1
function with respect to its first argument, and such that
D1. ∇fei (x; x) = ∇fi (x), for all x ∈ K;
D2. fei (•; y) is uniformly strongly convex on K, with constant τi ;
D3. ∇fei (x; •) is uniformly Lipschitz continuous on K, with constant L̃i ;
where ∇fei (x; y) denotes the partial gradient of fei with respect to the first
argument, evaluated at (x, y).
Conditions D1-D3 are quite natural: fei should be regarded as a (simple)
convex, local, approximation of fi at x that preserves the first order properties
of fi . A gamut of choices for fei satisfying Assumption E are available; some
representative examples are discussed in Sec. 4.4.
Step 2: Information mixing and gradient tracking. To complete the
description of the algorithm, we need to introduce a mechanism to ensure that
i) the local estimates xn(i) ’s asymptotically converge to a common value; and
P
n
ii) each y(i)
tracks the gradient sum j6=i ∇fj (xn(i) ). To this end, we leverage
the perturbed condensed push-sum protocol introduced in Sec. 3. Specifically,
n+1/2
given x(j) ’s, each x(i) is updated according to [cf. (4)]
φn+1
=
i

I
X
j=1

anij φnj ,

xn+1
(i) =

I
X

n+1/2

anij x(j)

,

(30)

j=1

where the anij are chosen to satisfy Assumption C. Note that, the updates in
(30) can be performed in a distributed way: each agent j only needs to select
n+1/2
the set of weights {anij }Ii=1 and send anij φnj and anij φnj x(j)
to its out-neighbors
while summing up the information received from its in-neighbors.
n
To update the y(i)
’s we leverage again the perturbed condensed push-sum

n
scheme (4), with with n+1
= (1/φn+1
) ∇fi (xn+1
i
i
(i) ) − ∇fi (x(i) ) [cf. (13)]. The
resulting gradient tracking mechanism reads

16

Gesualdo Scutari, Ying Sun

n+1
y(i)
=

1

I
X

1
n
anij φnj y(j)
+ n+1
φn+1
φ
i
i
j=1





∇fi xn+1
− ∇fi xn(i) ,
(i)

(31)

0
n
with y(i)
= ∇fi (x0(i) ). Note that the update of y(i)
can be performed locally
by agent i, with the same signaling of that of (30).

4.2 The SONATA algorithm
We can now formally introduce the proposed algorithm, SONATA, just combining steps (26),(27),(30), and (34)–see Algorithm 1.
Algorithm 1 SONATA
Data: x0(i) ∈ K, for all i; φ0 = 1; y0 = g0 . Set n = 0.

[S.1] If xn satisfies termination criterion: STOP;
[S.2] [Distributed Local Optimization] Each agent i
en(i) solving problem (26);
Compute locally x
n+1/2

Update its local variable x(i)

, xn(i) + αn (e
xn(i) − xn(i) );

[S.3] [Information Mixing] Each agent i compute
(a) Consensus
φn+1
=
i

I
X

anij φnj

(32)

j=1

xn+1
(i) =

1

I
X

φn+1
i
j=1

n+1/2

anij φnj x(j)

;

(33)



− ∇fi xn(i) ;
∇fi xn+1
(i)

(34)

(b) Gradient tracking
n+1
y(i)
=

1

I
X

1
n
anij φnj y(j)
+ n+1
φn+1
φ
i
i
j=1



[S.4] n ←− n + 1, go to [S.1]
Note that the algorithm is distributed. Indeed, in Step 2, the optimization
en(i) . To do so,
(26) is performed locally by each agent i, computing its own x
n
n
agent i needs to know the current x(i) and y(i) , which are both available locally.
There are then two consensus steps (Step 3) whereby agents transmit/receive
information only to/from their out/in neighbors: one is on the optimization
variables xn(i) (and the auxiliary scalars φni )–see (32)-(33)–and one is on the
n
variables y(i)
–see (34).
4.3 Convergence and complexity analysis of SONATA
To prove convergence, in addition to Assumptions A-E, one needs some conditions on the step-size αn . Since line-search methods are not practical in a

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

17

distributed environment, there are two other options, namely: i) a fixed (sufficiently small) step-size; and ii) a diminishing step-size. We prove convergence
using either choices. Recalling the definition of the network parameters c0 , B̄,
ρB̄ , φlb , and φub as given in Proposition 3 [see also (10)] and introducing the
problem parameters [cf. Assumptions A]
I
X
L,
Li , L̃mx , max L̃i + LG , Lmx , max Li ,
1≤i≤I
1≤i≤I
i=1
(35)

 √
cτ , min τi , cL , L I + Lmx + L̃mx /I,
1≤i≤I

the step-size can be chosen as follows.
Assumption F The step-size {αn }n∈N+ satisfies either one of the following
conditions:
P∞
F1. (diminishing): (0, 1] 3 αn ↓ 0 and n=0 αn = ∞;
F2. (fixed): αn ≡ α, for all n ∈ N+ , with

(1 − ρB̄ ) σ
√
,
α ≤ min
2 c B̄
!−1 
r
r

−1 2 2
12Lmx φlb B̄ c
2cτ φlb L + LG
2cL B̄c
2
1
+
+
, (36)

Iφub
I
1 − ρB̄ 1 − σ 2
(1 − ρB̄ )2
1 − σ2

√
where σ is an arbitrary constant σ ∈ (0, 1) and c = I 2I.
In addition, if all An are double stochastic, the upper bound in (36) holds with
1/2
.
c = 1, B̄ = B, φlb = φub = 1, and ρB̄ = 1 − κ/(2 I 2 )
We can now state the convergence results of the proposed algorithm, postponing all the proofs to Sec. 6. Given {xn , (xn(i) )Ii=1 }n∈N+ generated by Algorithm 1, convergence
is stated measuring the distance of the average sequence
PI
x̄n , (1/I) · i=1 xn(i) from optimality and well as the consensus disagreement
among the local variables xn(i) ’s. Distance from stationarity is measured by the
following function:
J(x̄n ) ,


 >
1
n
n
n
−
n
n 2
+
(z − x̄ ) + kz − x̄ k + G(z)
∇F (x̄ ) − ∇G x̄
x̄ − argmin
.
2
z∈K
(37)
Note that J is a valid measure of stationarity because it is continuous and
J(x̄∞ ) = 0 if and only if x̄∞ is a d-stationary solution of Problem (P) [16].
The consensus disagreement at iteration n is defined as
D(xn ) , kxn − 1I ⊗ x̄n k.

Note that D is equal to 0 if and only if all the xn(i) ’s are consensual. We combine
the metrics J and D in a single merit function, defined as

M (xn ) , max J(x̄n )2 , D(xn )2 .
We are now ready to state the main convergence results for Algorithm 1.

18

Gesualdo Scutari, Ying Sun

Theorem 4 (asymptotic convergence) Given Problem (P) and Algorithm 1,
suppose that Assumptions A-F are satisfied; and let {xn }n∈N+ be the sequence
generated by the algorithm. Then, there holds limn→∞ M (xn ) = 0.
Under a constant step-size (Assumption F.2), the next theorem provides
an upper bound on the number of iterations needed to decrease M (xn ) below
a given accuracy  > 0.
Theorem 5 (complexity) Suppose that Assumptions A-E are satisfied; and
let {xn }n∈N+ be the sequence generated by Algorithm 1, with a constant stepsize αn = α, satisfying Assumption F.2. Given  > 0, let T be the first
iteration n such that M (xn ) ≤ . Then T = O(1/).
Remark 6 (generalizations) Theorems 4 and 5 can be established with minor modifications under the setting wherein each agent i uses different constant
step-size αi . Also the assumption on the strongly convexity of the surrogate
function fei (Assumption E.2) can be weakened to just convexity, if the feasible
set K is compact. With mild additional assumptions on G+ –see [12]–we can
extend convergence results in Theorem 4 to the case wherein agents solve their
subproblems (26) inexactly. We omit further details because of space limitation.
4.4 Discussion
Theorem 4 (resp. Theorem 5) provides the first convergence (resp. complexity)
result of distributed algorithms for constrained and/or composite optimization
problems over time-varying (undirected or directed) graphs, which significantly
enlarges the class of convex and nonconvex problems which distributed algorithms can be applied to with convergence guarantees.
SONATA represents a gamut of algorithms, each of them corresponding to
a specific choice of the surrogate function fei , step-size αn , and matrices An .
Convergence is guaranteed under several choices of the free parameters of the
algorithms, some of which are briefly discussed next.
• On the choice of fei . Examples of fei satisfying Assumption E are
− Linearization: Linearize fi and add a proximal regularization (to make f˜i
strongly convex), which leads to


>
 τi
2
x(i) − xn(i) 2 ;
fei x(i) ; xn(i) = fi xn(i) + ∇fi xn(i)
x(i) − xn(i) +
2
− Partial Linearization: Consider the case where fi can be decomposed as
(1)
(2)
(1)
(2)
fi (x(i) ) = fi (x(i) )+fi (x(i) ), where fi is convex and fi is nonconvex
with Lipschitz continuous gradient. Preserving the convex part of fi while
(2)
linearizing fi leads to the following valid surrogate
τi
(1)
(2)
fei (x(i) ; xn(i) ) = fi (x(i) ) + fi (xn(i) ) + kxi − xn(i) k2
2
(2)
+ ∇fi (xn(i) )> (xi − xn(i) );

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

19

− Partial Convexification: Consider the case where x(i) is partitioned as
(x(i,1) , x(i,2) ), and fi is convex in x(i,1) but not in x(i,2) . Then, one can
convexify only the nonconvex part of fi , which leads to the surrogate:
τi
fei (x(i) ; xn(i) ) = fi (x(i,1) , xn(i,2) ) + kx(i,2) − xn(i,2) k2
2
(2)
n >
+ ∇ fi (x(i) ) (x(i,2) − xn(i,2) ),
where ∇(2) fi denotes the gradient of fi with respect to x(i,2) . Other choices
of surrogates can be obtained hinging on [15, 16, 36].
• On the choice of the step-size. Several options are possible for the stepsize sequence {αn }n satisfying the diminishing-rule in Assumption F.1; see,
e.g., [2]. Two instances we found to be effective in our experiments are: i) αn =

β
α0 / (n + 1) , with α0 > 0 and 0.5 < β ≤ 1; and ii) αn = αn−1 1 − µαn−1 ,
with α0 ∈ (0, 1], and µ ∈ (0, 1).

• On the choice of matrix An . When dealing with digraphs, the key requirement of Assumption B is that each An is column stochastic. Such matrices
can be built locally by P
the agents: each agent j can simply choose weight anij
out
for i ∈ Nj [n] so that i∈N out [n] anij = 1. As a special case, An can be set to
j
be the following push-sum matrix [24]: anij = 1/dnj , if (j, i) ∈ E n ; and anij = 0,
otherwise; where dni is the out-degree of agent i. In this case, the information
mixing process in Step 2 becomes a broadcasting protocol, which requires from
each agent only the knowledge of its out-degree.
When the digraphs G n admit a double-stochastic matrix (e.g., they are
undirected ), as already observed in Sec. 3 (cf. Remark 1), one can choose An
as double-stochastic; and the consensus and tracking protocols in Step 3 reduce
respectively to
I
X

xn+1
=
anij xn(j) + αn (e
xn(j) − xn(j) )
(i)
j=1

n+1
y(i)
=

I
X
j=1

(38)
n
n
anij y(j)
+ ∇fi (xn+1
(i) ) − ∇fi (x(i) ).

Several choices have been proposed in the literature to build in a distributed
way a double stochastic matrix An , including: the Laplacian, MetropolisHastings, and maximum-degree weights; see, e.g., [50].
• ATC/CAA updates. In the case of unconstrained optimization, the information mixing step in Algorithm 1 can be performed following two alternative
protocols, namely: i) the Adapt-Then-Combine-based (ATC) scheme; and ii)
the Combine-And-Adapt-based (CAA) approach (termed “consensus strategy”
in [35]). The former is the one used in (30)–each agent i first updates its local
en(i) −xn(i) , and then combines its new update with
copy xn(i) along the direction x
that of its neighbors via consensus. Alternatively, in the CAA update, agent
i first mixes its own local copy xn(i) with that of its neighbors via consensus,

20

Gesualdo Scutari, Ying Sun

en(i) − xn(i) , that
and then performs its local optimization-based update using x
is
I
φni
1 X
n n
n n
x(i) − xn(i) ).
a
φ
x
+
xn+1
=
ij
j
(j)
(i)
n+1 · α (e
φn+1
φ
i
i
j=1
It is not difficult to check that SONATA based on CAA updates converges
under the same conditions as in Theorem 4.
5 SONATA and special cases
In this section, we contrast SONATA with related algorithms proposed in the
literature [12–14, 51] and very recent proposals [30, 33, 49] for special instances
of Problem (P). We show that algorithms in [30,33,49,51] are all special cases
of SONATA and NEXT, proposed in our earlier works [12–14, 41].
We preliminarily rewrite Algorithm 1 in a matrix-vector form. Similarly
to xn , define the concatenated vectors
>
en>
en , [e
x
xn>
(1) , . . . , x
(I) ] ,

(39)

n>
n> >
yn , [y(1)
, . . . , y(I)
] ,

g

n

(40)

, [g1n> , . . . , gIn> ]> ,

n

n

gin , ∇fi (xn(i) ),

n

e −x ,
∆x , x

(41)
(42)

n
en(i) and y(i)
where x
are defined in (26) and (34), respectively. Using the above
notation and the matrices introduced in (14a), SONATA [cf. (32)-(34)] can be
written in compact form as

φn+1 = An φn
x

n+1

cn

(43)
n

n

n

= W (x + α ∆x )

(44)

c n y n + (D
b φn+1 )−1 gn+1 − g
yn+1 = W


n

.

(45)

5.1 Preliminaries: SONATA-NEXT and SONATA-L
Since [30, 33, 49, 51] are applicable only to unconstrained (K = Rm ), smooth
(G = 0) and convex (each fi is convex) multiagent problems, in the following,
we consider only such an instance of Problem (P). Choose each fei as first order
approximation of fi plus a proximal term, that is,
τi
fei (x(i) ; xn(i) ) = fi (xn(i) ) + ∇fi (xn(i) )> (x(i) − xn(i) ) + kx(i) − xn(i) k2 ,
2
en(i) can be computed in closed form [cf. (26)]:
and set τi = I. Then, x
n >
en(i) = argmin (I · y(i)
x
) (x(i) − xn(i) ) +
x(i)

= argmin
x(i)

I
kx(i) − xn(i) k2
2

I
n 2
n
x(i) − xn(i) + y(i)
= xn(i) − y(i)
.
2

n
en(i) − xn(i) = y(i)
Therefore, ∆xn(i) = x
.

(46)

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

21

Substituting (46) into (44) and using either ATC or CAA mixing protocols,
Algorithm 1 reduces to
φn+1 = An φn

c n (xn − αn yn )
W
n+1
−1

x
=
c n xn − α n D
b φn+1
b φn yn
W
D

−1

c n yn + D
b φn+1
yn+1 = W
gn+1 − gn ;

(ATC-based update)
(CAA-based update)

(47)

which we will refer to as (ATC/CAA-)SONATA-L (L stands for “linearized”).
When the digraph G n admits a double-stochastic matrix An , and An in
(43) is chosen so, the iterates (47) can be further simplified as reduces to
(
c n (xn − αn yn ) (ATC-based update)
W
n+1
x
=
(48)
c n xn − αn yn
W
(CAA-based update)
c n yn + gn+1 − gn ,
yn+1 = W
c n = Wn ⊗ Im . The ATC-based updates coincide
where Wn = An and thus W
with our previous algorithm NEXT [based on the surrogate (46)], introduced
in [12–14]. We will refer to (48) as (ATC/CAA-)NEXT-L.
5.2 Connection with current algorithms
We can now show that the algorithms recently studied in [30, 33, 49, 51] are all
special cases of SONATA and NEXT, earlier proposed in [12–14].
Aug-DGM [51] and Algorithm in [33]. Introduced in [51] for undirected,
time-invariant graphs, the Aug-DGM algorithm reads
c (xn − Diag (α ⊗ 1m ) yn )
xn+1 = W

c yn + gn+1 − gn
yn+1 = W

(49)

c , W ⊗ Im ; W is a double stochastic matrix satisfying Assumpwhere W
tion C, and α is the vector of agents’ step-sizes αi ’s.
A similar algorithm was proposed independently in [33] (in the same networking setting of [51]), which reads
c (xn − αyn )
xn+1 = W
(50)
c n + gn+1 − gn .
yn+1 = Wy
Clearly Aug-DGM [51] in (49) with the αi ’s equal, and Algorithm [33] in
(50) coincide with (ATC-)NEXT-L [cf. (48)].
(Push-)DIGing [30]. Appeared in [30] and applicable to B-strongly connected undirected graphs, the DIGing Algorithm reads
c n xn − αyn
xn+1 = W
(51)
c n yn + gn+1 − gn ,
yn+1 = W

22

Gesualdo Scutari, Ying Sun

where Wn is a double-stochastic matrix satisfying Assumption C. Clearly,
DIGing coincides with (CAA-)NEXT-L [12–14][cf. (48)]. The push-DIGing
algorithm, studied in the same paper [30], extends DIGing to B-strongly
connected digraphs. It turns out that push-DIGing coincides with (ATC)SONATA-L [cf. Eq. (47)] when anij = 1/dnj .
ADD-OPT [49]. Finally, we mention the ADD-OPT algorithm, proposed
in [49] for strongly connected static digraphs, which takes the following form:
b n − αe
zn+1 = Az
yn
φn+1 = A φn

−1
b φn+1
xn+1 = D
zn+1

(52)

by
e n+1 = A
e n + gn+1 − gn ,
y
b =
where A is a column stochastic matrix satisfying Assumption C, and A
n
−1 e n
b
A ⊗ Im . Defining y = (Dφn+1 ) y , it is not difficult to check that (52) can
be rewritten as

−1
b φn+1
bD
b φn
φn+1 = Aφn , W = D
A

−1
c n−α D
b φn+1
b φn yn
xn+1 = Wx
D

−1

b φn+1
c n+ D
gn+1 − gn .
yn+1 = Wy

(53)

Comparing Eq. (47) and (53), one can see that ADD-OPT coincides with
(CAA-)SONATA-L.
We summarize the connections between the different versions of SONATA(NEXT) and its special cases in Table 4.
6 Convergence Proof of SONATA
In this section, we prove convergence of SONATA; because of space limitation
we prove only Theorem 4. The proof consists in studying the dynamics of a
suitably chosen Lyapunov function along the weighted average of the agents’
local copies, and of the consensus disagreement and tracking errors. We begin
introducing some convenient notation along with some preliminary results. For
the sake of simplicity, all the results of the forthcoming subsections are stated
under the blanket Assumptions A-F.
6.1 Notations and preliminaries
The weighted average and associated consensus disagreement are denoted by

1  n>
x̄φn ,
φ ⊗ Im xn and enx , xn − Jφn xn ,
(54)
I
n
respectively. Similar quantities are defined for the tracking variables y(i)
:

ȳφn ,


1  n>
φ ⊗ Im y n
I

and eny , yn − Jφn yn .

(55)

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

23

Table 4 Connection of SONATA with current algorithms

Algorithms

Connection with

Instance of

Graph topology/

SONATA

Problem (P)

Weight matrix

NEXT

special case of

[12]

SONATA (38)

Aug-DGM

ATC-NEXT-L

[33, 51]

(α = α1I ) (48)

DIGing

CAA-

[30]

push-DIGing
[30]

ADD-OPT
[49]

F nonconvex
G 6= 0

K ⊆ Rm
F convex
G=0
K = Rm
F convex

NEXT-L (48)

G=0
K = Rm
F convex

ATCSONATA-L (47)

G=0
K = Rm
F convex

ATCSONATA-L (47)

G=0
K = Rm

time-varying digraph/
doubly-stochastic weights

static undirected graph/
doubly-stochastic weights

time-varying digraph/
doubly-stochastic weights

time-varying digraph/
column-stochastic weights

static digraph/
column-stochastic weights

en(i) of each agent
Recalling (39), define the deviation of the local solution x
from the weighted average as
en(i) − x̄φn ,
∆e
xn(i),φ , x

(56)

and the associated stacked vector
e n − Jφ n x n .
∆e
xnφ , x

(57)

Note that ∆xn [cf. (39)] can be rewritten as
∆xn = ∆e
xnφ − enx .

(58)

Using the above notation, the dynamics of x̄φn and ȳφn generated by
Algorithm 1 are given by [cf. (44) and (45)]:

αn
(φn )> ⊗ Im ∆e
xnφ
I
ȳφn+1 = ȳφn + ḡn+1 − ḡn .

x̄φn+1 = x̄φn +

(59a)
(59b)

Note that, since y0 = g0 and φ0i = 1, we have ȳφn = ḡn , for all n ∈ N+ .
Finally, we introduce the error-free local solution map of each agent i,
b(i) : K → K: Given z ∈ K and i = 1, . . . , I, let
denoted by x

24

Gesualdo Scutari, Ying Sun

n


b(i) (z) , argmin fei x(i) ; z − ∇G− (z)> x(i) − z
x

x(i) ∈K
P
>

+
+
x(i) − z + G (z) .
j6=i ∇fj (z)

(60)

b(i) (•) enjoys the following properties (the proof
It is not difficult to check that x
of the next lemma follows similar steps as in [16, Prop. 8] and thus is omitted).
b(i) (•) satisfies:
Lemma 7 Each x
b(i) (•) is L̂-Lipschitz continuous on K, that is,
i) [Lipschitz continuity]: x
there exits a finite L̂ > 0 such that
b(i) (z) − x
b(i) (w) ≤ L̂ kz − wk ,
x

∀z, w ∈ K;

(61)

b(i) (•) coincides with the set of
ii) [Fixed-points]: The set of fixed points of x
d-stationary solutions of Problem (P).
The next result shows that, as expected, the disagreement between agent i’s
en(i) and its error-free counterpart x
b(i) (xn(i) ) asymptotically vanishes
solution x
n
if both the consensus error ex and the tracking error eny do so.
en(i) [cf. (26)] and x
bi (xn(i) ) [cf. (60)] satisfy:
Lemma 8 x
2I L n
I
bi (xn(i) ) − x
en(i) ≤
eny +
kex k.
x
τi
τi
en(i) k −→ 0.
xi (xn(i) ) − x
Therefore, kenx k, keny k −→ 0 ⇒ kb
n→∞

(62)

n→∞

The last result of this section is a standard martingale-like result; the proof
follows similar to that of [3, Lemma 1] and thus is omitted.
Lemma 9 Let {X n }n∈N+ , {Y n }n∈N+ and {Z n }n∈N+ be three sequences such
that X n and Z n are nonnegative, for all n ∈ N+ . Suppose that
B̄−1
X
k=0

Y n+B̄+k ≤

B̄−1
X
k=0

Y n+k −

B̄−1
X

X n+k +

k=0

B̄−1
X

Z n+k ,

n = 0, 1, . . . ,

(63)

k=0

P∞
PB̄−1
PB̄−1
and that n=0 Z n < +∞. Then,Peither k=0 Y n+k → −∞, or else k=0 Y n+k
∞
converges to a finite value and n=0 X n < +∞.
6.2 Average descent
We begin our analysis studying the dynamics of U along the trajectory of x̄φn .
We define the total energy of the optimization input αn ∆e
xnφ and consensus
errors enx and eny in B̄ consecutive iterations [B̄ is defined in Lemma 3]:
n
E∆e
x ,

B̄−1
X
t=0

αn+t

2

∆e
xn+t
φ

2

,

Exn⊥ ,

B̄−1
X
t=0

en+t
x

2

,

Eyn⊥ ,

B̄−1
X

en+t
y

2

t=0

(64)
Recalling the definitions of cτ , φlb , and φub [see (8) and (35)], we have the
following.

.

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

25

Lemma 10 Let {(xn , yn )}n∈N+ be the sequence generated by Algorithm 1.
Then, there holds
B̄−1
X



U x̄φn+B̄+k

k=0

≤

B̄−1
X
k=0

+

X B̄−1
X
2
 cτ φlb B̄−1
·
xn+k+t
U x̄φn+k −
αn+k+t ∆e
φ
I
t=0

φub
2

k=0



L + LG
φub + cL x + y
I

 B̄−1
X
k=0

n+k
E∆e
x +

B̄−1

φub X
n+k
−1 n+k
cL −1
,
x Ex⊥ + y Ey⊥
2
k=0
|
{z
}
term iv

(65)
where x > 0 and y > 0 are arbitrary, finite constants.
Proof Denote for simplicity F̄ , F − G− . Since fei is strongly convex and G+
en(i) , we have
is convex, by the first order optimality of x
> 


n
I · y(i)
+ ∇fei (x̄φn ; xn(i) ) − ∇G− (xn(i) ) − ∇fi (xn(i) )
∆e
xn(i),φ
+ G+ (e
xn(i) ) − G+ (x̄φn ) ≤ −τi k∆e
xn(i),φ k2 . (66)
Since ∇fi and ∇G− are Li and LG -Lipschitz, respectively, ∇F is (L+LG )PI
Lipschitz, where L , i=1 Li [cf. def. (35)]. Applying the descent lemma to
F̄ and using (59a) yields

F̄ x̄φn+1

αn
>
∇F̄ (x̄φn ) (φn )> ⊗ Im ∆e
≤ F̄ (x̄φn ) +
xnφ
I
2

L + LG (αn )
2
·
(φn )> ⊗ Im ∆e
xnφ
+
2
I
2
(a)
L + LG (αn ) 2
2
·
φub ∆e
≤ F̄ (x̄φn ) +
xnφ
2
I


I
2
αn X n
φi τi ∆e
xn(i),φ + G+ (e
−
xn(i) ) − G+ (x̄φn )
I i=1
I

+

>
αn X n 
n
φi ∇F̄ (x̄φn ) + ∇G− (xn(i) ) − I · ȳφn + I · ȳφn − I · y(i)
∆e
xn(i),φ
I i=1

+

>
αn X n 
φi ∇fi (xn(i) ) − ∇fi (x̄φn ) + ∇fei (x̄φn ; x̄φn ) − ∇fei (x̄φn ; xn(i) ) ∆e
xn(i),φ
I i=1

I

(b)

2

L + LG (αn ) 2
·
φub k∆e
xnφ k2
2
I
I

αn X n 
−
φi τi k∆e
xn(i),φ k2 + G+ (e
xn(i) ) − G+ (x̄φn )
I i=1

≤ F̄ (x̄φn ) +

26

Gesualdo Scutari, Ying Sun
I

I

+

X

αn X n
∇fj (xn(j) ) − ∇G− (xn(i) )
φi ∇F̄ (x̄φn ) −
I i=1
j=1

+ αn

I
X
i=1

∆e
xn(i),φ

n
φni kȳφn − y(i)
k k∆e
xn(i),φ k

I

+

αn X n
xn(i),φ k
φ ∇fi (xn(i) ) − ∇fi (x̄φn ) k∆e
I i=1 i

+

αn X n
xn(i),φ k
φ ∇fei (x̄φn ; x̄φn ) − ∇fei (x̄φn ; xn(i) ) k∆e
I i=1 i

I

2

L + LG (αn ) 2
·
φub k∆e
xnφ k2
2
I
I

αn X n 
xn(i),φ k2 + G+ (e
xn(i) ) − G+ (x̄φn )
−
φi τi k∆e
I i=1


I
I
αn X n X
+
φ
Lj kx̄φn − xn(j) k + LG kx̄φn − xn(i) k k∆e
xn(i),φ k
I i=1 i j=1

(c)

≤ F̄ (x̄φn ) +

+ αn

I
X
i=1

+
(d)

n
φni kȳφn − y(i)
kk∆e
xn(i),φ k

I
n X

α
I

i=1



φni Li kxn(i) − x̄φn k k∆e
xn(i),φ k + L̃i kxn(i) − x̄φn k k∆e
xn(i),φ k
2

L + LG (αn ) 2
2
·
φub ∆e
xnφ
2
I
I

αn X n 
φi τi k∆e
xn(i),φ k2 + G+ (e
−
xn(i) )−G+ (x̄φn )
I i=1

≤ F̄ (x̄φn ) +

xnφ + αn φub eny
+ αn cL φub kenx k ∆e

∆e
xnφ ,

(67)

where in (a) we used (66), Assumption E.1, and the bound (107) (along with
some basic manipulations); in (b) we used ȳφn = ḡn [cf. (59b)]; (c) follows
from the Li -Lipschitz continuity of ∇fi , LG -Lipschitz continuity of ∇G− , and
the uniformly L̃i -Lipschitz
continuity of ∇fei (x; •); and in (d) we used the
√
inequality kxk1 ≤ n kxk, and the definition of cL [cf. (35)].
Invoking the convexity of G+ and using (59a), we can write
I

αn X n + n
φ G (e
x(i) ),
G+ x̄φn+1 ≤ (1 − αn ) G+ (x̄φn ) +
I i=1 i
which combined with (67) yields

U x̄φn+1
≤ U (x̄φn ) −

αn
φlb cτ
I

∆e
xnφ

2

2

+

L + LG (αn ) 2
2
·
φub ∆e
xnφ
2
I

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

+ αn cL φub kenx k2 ∆e
xnφ + αn φub eny

27

∆e
xnφ
2

L + LG (αn ) 2
2
xnφ
·
φub ∆e
2
I
φub
φub
φub −1 n 2
2
2
n 2
+
ey ,
(cL x + y ) (αn ) ∆e
xnφ +
cL −1

x kex k +
2
2
2 y
where the last inequality follows from the Young’s inequality, with x > 0 and
y > 0. Applying the above inequality recursively for B̄ steps, with B̄ defined
in Lemma 3, yields
B̄−1


2
cτ φlb X n+t
U x̄φn+B̄ ≤ U (x̄φn ) −
∆e
xn+t
α
φ
I t=0



φub L + LG
φub
n
−1 n
n
cL −1
+
· φub + cL x + y E∆e
x Ex⊥ + y Ey⊥ .
x+
2
I
2
(68)
≤ U (x̄φn ) −

αn
φlb cτ
I

∆e
xnφ

2

+

Summing up (68) over B̄ consecutive iterations leads to the desired result. t
u
Since, for sufficiently small αn , the negative term on the RHS of (65) dominates the positive third term, to prove convergence of {U (x̄φn+B̄+k ))}n∈N+ ,
descent-based techniques used in the literature of distributed gradient-based
algorithms would call for the summability of the consensus error {Exn⊥ }n∈N+
and tracking error {Eyn⊥ }n∈N+ sequences. However, under constant step-size
or unbounded (sub-)gradient of U , it seems not possible to infer such a result
by just studying the dynamics of {Exn⊥ }n∈N+ and {Eyn⊥ }n∈N+ independently
from the optimization error ∆e
xnφ . Therefore, exploring the interplay between
these quantities, we put forth a new analysis, based on the following steps:
− Step 1: We first bound Exn⊥ and Eyn⊥ [specifically, term iv in (65)] as a
n
function of E∆e
xnφ )–see Proposition 12 [cf. Sec. 6.3.1]. Usx (and thus ∆e
ing Proposition 12, we then prove that {Exn⊥ }n∈N+ and {Eyn⊥ }n∈N+ are
n
summable, if {E∆e
x }n∈N+ is so−see Proposition 14 [cf. Sec. 6.3.2].
− Step 2: Using Propositions 12 and 14, we build a new Lyapunov function
n
[cf. Sec. 6.4], whose convergence implies the summability of {E∆e
x }n∈N+ and
thus convergence of all error sequences [cf. Sec. 6.5], as stated in Theorem 4.
n
6.3 Interplay among Exn⊥ , Eyn⊥ and E∆e
x

6.3.1 Bounding Exn⊥ and Eyn⊥
We first study the dynamics of kenx k and keny k.

Lemma 11 The disagreements kenx k and keny k satisfy
B̄
ken+
k ≤ ρB̄ kenx k + c
x
B̄
ken+
k ≤ ρB̄ eny + c Lmx φ−1
y
lb

B̄−1
X

B̄−1
X
t=0

αn+t ∆xn+t ,

(69)

t=0


2 kexn+t k + αn+t k∆xn+t k ,

(70)

28

Gesualdo Scutari, Ying Sun

√
n
where c = I 2I. Furthermore,
p if all A are double stochastic, then (69) and
(70) hold with B̄ = B, ρB̄ = 1 − κ/(2I 2 ) and c = 1.
Proof See Appendix B.



Using Lemma 11, we now study the dynamics of the weighted sum of the
disagreements kenx k and keny k over B̄ consecutive iterations.

Proposition 12 The sequences {kenx k2 }n∈N+ and {keny k2 }n∈N+ satisfy
B̄−1
X
k=0

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

2

B̄+k
en+
x

≤

B̄−1
X
k=0

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

en+k
x

2


X
X
 2B̄c2 n 2 B̄−1
 2B̄c2 B̄−1
n+k
−1
−1
− 1 −  + B̄
(αmx )
Exn+k
+

+
B̄
E∆e
x ,
⊥
1 − ρ̃
1 − ρ̃
|
{z
} k=0
|
{z
} k=0
c∆

µn

(71)
B̄−1
X
k=0

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃
−

2

B̄+k
en+
y

B̄−1
X

≤

B̄−1
X
k=0

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

en+k
y

2

B̄−1
X
 2B̄c 2 −2
−1
n 2
Eyn+k
+

+
B̄
L
φ
Exn+k
(2
+
α
)
mx
mx
lb
⊥
⊥
1 − ρ̃
k=0
k=0
|
{z
}
2

c⊥

X
 2B̄c2 2 −2 B̄−1
n+k
+ −1 + B̄
Lmx φlb
E∆e
x , (72)
1 − ρ̃
k=0
|
{z
}
c⊥

n
,
where αmx

max
k=0,...,2B̄−2


αn+k ; ρ̃ , ρ2B̄ 1 + B̄ ; and  > 0 is any constant

such that ρ̃ < 1.
Proof We prove only (71); (72) can be proved using similar steps. Squaring
both sides of the inequality (69) leads to
B̄
en+
x

2


2
≤ ρ2B̄ kenx k + c ·

B̄−1
X

2
α

n+t

∆x

t=0

2

≤ ρ̃ kenx k +

t=0

B̄−1
X
t=0

B̄−1
X

2
≤ ρ2B̄ 1 + B̄ kenx k +

(b)

 +2

t=0

(a)

B̄−1
X

n+t

1
+ B̄






c ρB̄ αn+t kenx k ∆xn+t


1
2
+ B̄ c2 (αn+t )2 ∆xn+t


2 c2 (αn+t )2



∆e
xn+t
φ

2

+ en+t
x

2


,
(73)

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

29

where (a) follows from the Young’s inequality, with  > 0, and the Jensen’s

inequality; and in (b) we used (58). Note that, since ρB̄ < 1, ρ̃ = ρ2B̄ 1 + B̄ <


1, for all  ∈ 0, 1 − ρ2B̄ /(ρ2B̄ B̄) .
n
Denote α̃mx
, max αn+k . Multiplying (73) by 1/(1 − ρ̃) [resp. ρ̃/(1 − ρ̃)],
k=0,...,B̄−1

B̄ 2
n
adding kenx k2 (resp. ken+
k ) to both sides, and using the definitions of E∆e
x
x
n
and Ex⊥ [cf. (64)], yield

1
ken+B̄ k2 + kenx k2
1 − ρ̃ x



2 c2
1
ρ̃
n 2
n 2
n
n 2 n
kex k + kex k +
+ B̄
E∆e
≤
x + (α̃mx ) Ex⊥
1 − ρ̃
1 − ρ̃ 



2
1
2c
1
n 2
n
n 2 n
=
kex k +
+ B̄
E∆e
x + (α̃mx ) Ex⊥
1 − ρ̃
1 − ρ̃ 

(74)

and
2
2
2
1
ρ̃
B̄
B̄
=
en+B̄ + en+
en+
x
x
1 − ρ̃ x
1 − ρ̃



ρ̃
2 c2
1
n 2 n
n 2
n
≤
kex k +
+ B̄
E∆e
x + (α̃mx ) Ex⊥ ,
1 − ρ̃
1 − ρ̃ 

(75)

respectively.
PB̄−1
We write now k=0 Exn+k
as
⊥
B̄−1
X

Exn+k
=
⊥



B̄−2
en+2
x

2

+2

k=0



B̄−1
+ B̄ en+
x

2

B̄−3
en+2
x

2

+ · · · + (B̄ − 1)

B̄−2
+ (B̄ − 1) en+
x

2

2

B̄
en+
x


(76)

2
+ · · · + kenx k


.

Using (74) and (75) on the two terms in (76), we obtain the following bounds:


2
2
2
ρ̃
n+2B̄−3
n+B̄
n+2B̄−2
+ 2 ex
+ · · · + (B̄ − 1) ex
ex
1 − ρ̃


2
2
2
n+2B̄−2
n+2B̄−3
n+B̄
+
ex
+ 2 ex
+ · · · + (B̄ − 1) ex


2
2
ρ̃
n+B̄−2
n+B̄−3
n 2
≤
+ 2 ex
+ · · · + (B̄ − 1) kex k
(77)
ex
1 − ρ̃





2
2 c2
1
n+B̄−2
n+B̄−2
B̄−2
+
+ B̄
E∆e
+
α̃
Exn+
+ ···
mx
x
⊥
1 − ρ̃ 

i

n
n 2 n
+ B̄ − 1 E∆e
,
x + (α̃mx ) Ex⊥
and
1
1 − ρ̃



B̄−1
B̄ en+2
x

2

B̄−2
+ (B̄ − 1) en+2
x

2

B̄
+ · · · + en+
x

2



30

Gesualdo Scutari, Ying Sun



2
2
n+B̄−2
n 2
B̄−1
e
+
·
·
·
+
ke
k
+
(
B̄
−
1)
+ B̄ en+
x
x
x


2
2
1
n+B̄−2
n 2
B̄−1
+
·
·
·
+
ke
e
k
+
(
B̄
−
1)
B̄ en+
≤
x
x
x
1 − ρ̃





2

2 c2
1
n+B̄−1
n+B̄−1
n+B̄−1
+
+ B̄ B̄ E∆ex
Ex⊥
+ ···
+ α̃mx
1 − ρ̃ 

i
n
n 2 n
+ E∆e
.
x + (α̃mx ) Ex⊥

(78)

Summing (77) and (78) and rearranging terms while using (76), it is not
difficult to check that
B̄−1

B̄−1
X

k + 1 + (B̄ − k − 1)ρ̃ n+B̄+k 2 X n+k
ex
+
Ex⊥
1 − ρ̃
k=0

k=0

≤

B̄−1
X
k=0

k + 1 + (B̄ − k − 1)ρ̃ n+k 2
ex
1 − ρ̃


+

1
+ B̄




B̄−1

2B̄c2 X n+k
E∆ex +
1 − ρ̃



k=0

1
+ B̄


(79)


B̄−1

2B̄c2 n 2 X n+k
(α )
Ex⊥ ,
1 − ρ̃ mx

which leads to the desired result (71).

k=0

t
u

We use now Proposition 12 in conjunction with Lemma 9 to prove the
n
summability of {Exn⊥ }n∈N+ and {Eyn⊥ }n∈N+ , under that of {E∆e
x }n∈N+ . Let
s
αmx , σ ·

1 − ρ̃

2B̄ B̄ + −1 c2

(80)

with σ ∈ (0, 1). This implies [recall the definition of µn in (71)]


 2B̄c2 2
n
−1
n
α
µ ≥ µmin , 1 −  + B̄
= 1 − σ 2 > 0, ∀αmx
≤ αmx . (81)
1 − ρ̃ mx
P∞
Proposition 13 Suppose that i) n=0 (αn )2 k∆e
xnφ k2 < ∞; and ii) αn ≤ αmx ,
for all P
but finite n ∈ N+ . Then,
P∞ the consensus and tracking disagreements
∞
satisfy n=0 kenx k2 < ∞ and n=0 keny k2 < ∞, respectively.
P∞
Proof
from (64)
that it is sufficient P
to prove n=0 Exn⊥ < ∞ (for
P∞ It nfollows
P
∞
∞
2
n
n 2
n=0 kex k < ∞) and
n=0 Ey⊥ < ∞ (for
n=0 key k < ∞). We prove
next only the former result.
By Assumption F and (81), there exists a sufficiently large n, say n̄, such
that µn ≥ µmin > 0, for all n ≥ n̄. We assume, without loss of generality,
9 to (71) [cf. Proposition 12], we have
P∞thatn n̄ = 0. Applying
P∞ Lemma
n
E
<
+∞
=⇒
E
<
+∞. It is then sufficient to prove that
n=0 ∆e
n=0 x⊥
x

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

P∞

) k∆e
xnφ k2 < ∞ =⇒
the following chain of inequalities:
n=0 (α

n 2

n
X

k
E∆e
x =

n B̄−1
X
X

αk+t

2

P∞

n
n=0 E∆e
x < +∞. This comes readily from

2

k+t
∆e
xφ

≤ B̄

k=0 t=0

k=0

31

n+
B̄−1
X

αk

2

∆e
xkφ

2

.

k=0

t
u
6.3.2 Bounding term iv in (65)
We are now ready to bound term iv in (65), as stated next.
Proposition 14 Suppose that αn ≤ αmx , then
−1
y

+

B̄−1
X
k=0

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

1



µmin

≤ −1
y

B̄−1
X
k=0

2

B̄+k
en+
y

2

−1
cL −1
x + y c⊥ (2 + αmx )

 B̄−1
X k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

k=0

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

en+k
y

2

B̄−1
X

−

n+k
−1 n+k
cL −1
x Ex⊥ + y Ey⊥

+

+



µmin


2
−1
cL −1
x + y c⊥ (2 + αmx )

2



k=0

{z

|
1

exn+B̄+k

}

term iv

 B̄−1
X k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

k=0

en+k
x

 B̄−1
 c
X
∆
2
n+k
−1
−1
−1
y c⊥ (2 + αmx ) + cL x
+ y c⊥
E∆e
x
µmin

2

(82)

k=0

Proof Multiplying (72) by −1
y on both sides we have
−1
y

B̄−1
X
k=0

≤ −1
y

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

B̄−1
X
k=0

− −1
y
= −1
y

en+k
y
2

n
Eyn+k
+ −1
y c⊥ (2 + αmx )
⊥

k=0

B̄−1
X
k=0

− −1
y

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

B̄−1
X

k=0

2

B̄−1
X
k=0

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

B̄−1
X

2

eyn+B̄+k

− cL −1
Eyn+k
x
⊥

B̄−1
X
k=0

en+k
y

Exn+k
⊥

2

Exn+k
+ −1
y c⊥
⊥

B̄−1
X

n+k
E∆e
x

k=0

(83)

32

Gesualdo Scutari, Ying Sun
B̄−1
 B̄−1

X
X
n+k
n 2
−1
−1
+ −1
Exn+k
+

c
E∆e
⊥
y c⊥ (2 + αmx ) + cL x
y
x .
⊥
k=0

k=0

n
Since αn ≤ αmx , we have αmx
≤ αmx and µn ≥ µmin . Eq. (71) then implies
B̄−1
X
k=0

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

B̄−1
X

≤

k=0

2

B̄+k
en+
x

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

en+k
x

2

− µmin

B̄−1
X

Exn+k
+ c∆
⊥

k=0

B̄−1
X

n+k
E∆e
x .

k=0
2

−1
Multiplying both sides of the above inequality by (−1
y c⊥ (2 + αmx ) +cL x )/µmin
n
and using the fact that αmx ≤ αmx , we have

1



µmin
1

≤

 B̄−1
X k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

k=0



µmin

2

−1
cL −1
x + y c⊥ (2 + αmx )

2

−1
cL −1
x + y c⊥ (2 + αmx )

 B̄−1
X k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

k=0


2
−1
− cL −1
x + y c⊥ (2 + αmx )

 B̄−1
X

exn+B̄+k
en+k
x

2

2

Exn+k
⊥

k=0

 c B̄−1
X
∆
n+k
E∆e
x
µmin

(84)

Adding (84) to (83) leads to the desired result.

t
u

+



2
−1
cL −1
x + y c⊥ (2 + αmx )

k=0

6.4 Lyapunov-like function and its descent properties
We are now in the position to construct a function whose descent properties
(every B̄ iterations) will used to prove Theorem 4. Because of that, we will
refer to such a function as Lyapunov-like function.
Adding (65) and (82) (multiplied by φub /2), yields
B̄−1
X
k=0

B̄−1

 φ
X k + 1 + (B̄ − k − 1)ρ̃
ub −1
U x̄φn+B̄+k +
y
2
1 − ρ̃

eyn+B̄+k

2

k=0

B̄−1

+

 X k + 1 + (B̄ − k − 1)ρ̃
φub  −1
2
cL x + −1
y c⊥ (2 + αmx )
2 µmin
1 − ρ̃
k=0

≤

B̄−1
X
k=0

 φub −1
U x̄φn+k +

2 y

B̄−1
X
k=0

k + 1 + (B̄ − k − 1)ρ̃
1 − ρ̃

en+k
y

2

exn+B̄+k

2

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

33

B̄−1

+

 X k + 1 + (B̄ − k − 1)ρ̃
φub  −1
2
cL x + −1
y c⊥ (2 + αmx )
2 µmin
1 − ρ̃

en+k
x

2

k=0

−

cτ
φlb
I

φub
+
2
+



B̄−1
X B̄−1
X

xn+k+t
αn+k+t ∆e
φ

2

k=0 t=0

L + LG
· φub + cL x + y + −1
y c⊥
I

φub
2
−1
cL −1
c∆
x + y c⊥ (2 + αmx )
2µmin




 B̄−1
X

n+k
E∆e
x

k=0

B̄−1
X

n+k
E∆e
x .

(85)

k=0

Define
Vn ,

B̄−1
X
k=0

X k + 1 + (B̄ − k − 1)ρ̃
 φub −1 B̄−1
y
U x̄φn+k +
2
1 − ρ̃

2

en+k
y

k=0

B̄−1

+

 X k + 1 + (B̄ − k − 1)ρ̃
φub  −1
2
cL x + −1
y c⊥ (2 + αmx )
2µmin
1 − ρ̃

en+k
x

k=0

and
cτ
φub n
β , φlb −
α
I
2
n



(86)

L + LG
· φub + cL x + y + −1
y c⊥
I

c∆  −1
2
−1
+
cL x + y c⊥ (2 + αmx )
. (87)
µmin

Substituting (86) and (87) in (85), we obtain the desired descent property of
V n : for sufficiently large n, it holds

V n+B̄ ≤ V n −

B̄−1
X
X B̄−1

n+k+t
β n+k+t αn+k+t ∆e
xφ

2

.

(88)

k=0 t=0

6.5 Proof of Theorem 4
The proof consists in two steps, namely:
− Step 1: Leveraging the descent property of the Lyapunov-like function,
we first show that lim k∆e
xnφ k = 0, either using a diminishing or constant
n→∞

step-size αn (satisfying Assumption F); and
− Step 2: Using the results in Step 1, we conclude the proof showing that i)
limn→∞ D(xn ) = 0 and ii) limn→∞ J(x̄n ) = 0

2

,

34

Gesualdo Scutari, Ying Sun

6.5.1 Step 1: lim k∆e
xnφ k = 0
n→∞

Let us distinguish the two choices of step-size, namely: αn is constant (satisfying Assumption F.1); or αn is diminishing (satisfying Assumption F.2).
Case 1: constant step-size. Set αn ≡ α for all n ∈ N+ . To obtain the desired descent on V n [cf. (88)], α has to be chosen so that β n = β > 0 [cf. (87)].
We show next that if α satisfies (36) [cf. Assumption F.2], then β > 0.
Recall that (88) holds under the assumption that α ≤ αmx , with αmx
defined in (80). Substituting the expressions of αmx and µmin = 1−σ 2 [cf. (81)]
in (87) and using the definitions of c∆ and c⊥ [cf. Proposition 12], one can
check that β n = β > 0 [cf. (87)] if, in addition to α ≤ αmx , α satisfies also
2 cτ φlb
α≤
Iφub




L + LG
c∆
−1
· φub + cL x + y +
cL −1
x + 9 c⊥ y
2
I
1−σ

−1
,

(89)
where x , y > 0 are free parameters. The above upperbound is maximized by
r
x =
r
y =

c∆
=c
1 − σ2

s


2B̄ −1 + B̄
(1 − ρ̃)(1 − σ 2 )

9 c⊥ c∆
B̄c2
−1 −1
=
6
L
φ
(
+
B̄)
mx
lb
1 − σ2
1 − ρ̃

r

1
1 − σ2

Combining α ≤ αmx and (89), we get the following bound for α:
 s

s


2B̄ −1 + B̄
1 − ρ̃
2cτ φlb  L + LG
,
α ≤ min σ
· φub + 2c · cL

Iφub
I
(1 − ρ̃)(1 − σ 2 )
2c2 B̄ B̄ + −1
!−1 
r

2
B̄c
1
−1
+12 Lmx φ−1
(
+
B̄)
, (90)
lb

1 − ρ̃ 1 − σ 2
where recall that  < (1 − ρ2B̄ )/(ρ2B̄ B̄) [cf. Proposition 12]. Since (1 − ρ̃)/(−1 + B̄)
is maximized by  = (1 − ρB̄ )/(ρB̄ · B̄) with the corresponding value being
2
(1 − ρB̄ ) /B̄, we obtain from (90) the final bound (36).
Under (36), using (88) and Lemma 9 (recall that lim inf V n > −∞, since U
n→∞

is bounded from below on K) we get lim k∆e
xnφ k = 0 and, by Proposition 13,
lim kenx k = 0 and lim keny k = 0.

n→∞

n→∞

n→∞

Case 2: diminishing step-size. Since αn is diminishing, there exists a sufficiently large n2 so that β n ≥ β > 0 for all n ≥ n2 , implying
∞ B̄−1
X
X
n=0 t=0

n+t
αn+t ∆e
xφ

2

< ∞,

(91)

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

which together with

P∞

n=0 α

n

35

= ∞ and Proposition 13 yield

xnφ = 0;
lim inf ∆e

(92)

n→∞

lim kenx k = 0

n→∞

and

lim

n→∞

eny = 0.

(93)

We prove next that lim sup k∆e
xnφ k = 0, which together with (92) implies
n→∞

limn→∞ k∆e
xnφ k = 0. Suppose that lim sup k∆e
xnφ k > 0. This, together with
n→∞

lim inf k∆e
xnφ k = 0, implies that there exists an infinite set of indices N such
n→∞
that for all n ∈ N , one can find an integer in > n such that:
∆e
xnφ < η,

k∆e
xiφn k > 2η

η ≤ k∆e
xjφ k ≤ 2η,

(94)

n < j < in .

(95)

>
bn(i) , x
bi (xn(i) ) and x
bn , [b
bn>
Denote x
xn>
(1) , . . . , x
(I) ] . We have:

xiφn − ∆e
xnφ
xnφ ≤ ∆e
xiφn − ∆e
η ≤ ∆e
ein − x
en + Jφin xin − Jφn xn
≤ x

e in − x
bn + x
bin − x
bn k + Jφin xin − Jφn xn
bin + ke
xn − x
≤ x
|
{z
}
en
1

(a)

≤ L̂ xin − xn + Jφin xin − Jφn xn + en1


√
≤ L̂ xin − Jφin xin + kxn − Jφn xn k + I x̄φin − x̄φn
√
+ I x̄φin − x̄φn + en1

√

≤ L̂ + 1
I x̄φin − x̄φn + L̂ eixn + kenx k + en1
|
{z
}
en
2

(b) 

√



√

≤ L̂ + 1

≤ L̂ + 1

I

iX
n −1

αt

t=n

I


1
(φt )> ⊗ Im ∆e
xtφ + en2 + en1
I

iX
n −1

√
αt ∆e
xtφ + (L̂ + 1) Iαn ∆e
xnφ +en2 + en1
{z
}
|
t=n+1
en
3

iX
n −1
√
2
≤ (L̂ + 1) I η −1
αt ∆e
xtφ + en3 + en2 + en1 ,

(c)

(96)

t=n+1

where in (a) we used (61) [cf. Lemma 7]; (b) follows from (59a); and in (c) we
used the lower bound in (95).
bn k = 0
Since i) lim kenx k = 0 and lim keny k = 0 [cf. (93)]; ii) lim ke
xn − x
n→∞
n→∞
n→∞
P∞ PB̄−1
n+t 2
[cf. Lemma 8]; iii) and n=0 t=0 αn+t k∆e
xφ
k < ∞ [cf. (91)], there exists
a sufficiently large n3 such that the right-hand-side of (96) is less than η, for
all n > n3 , which leads to a contradiction. Therefore, lim sup k∆e
xnφ k = 0.
n→∞

36

Gesualdo Scutari, Ying Sun

6.5.2 Step 2: lim M (xn ) = 0
n→∞

Recall that in the previous subsection we proved that i) lim k∆e
xnφ k = 0; ii)
n→∞

lim kenx k = 0; and iii) lim keny k = 0, using either a constant step-size αn ≡

n→∞

n→∞

α, with α satisfying (36), or a diminishing one. The statement lim D(xn ) = 0
n→∞

follows readily from point ii) and
lim kxn(i) − x̄n k ≤ lim kxn(i) − x̄φn k + lim kx̄φn − x̄n k
n→∞

n→∞

n→∞

I
(97)
1X n
kx(j) − x̄φn k = 0.
n→∞ I
j=1

≤ lim kxn(i) − x̄φn k + lim
n→∞

Next we show lim J(x̄n ) = 0. Recall the definition J(x̄n ) , kx̄(x̄n ) − x̄n k,
n→∞
where for notation simplicity, we set


 >
1
x̄(x̄n ) , argmin
∇F (x̄n ) − ∇G− x̄n
(z − x̄n ) + kz − x̄n k2 + G(z)+ .
2
z∈K
(98)
Since
bi (x̄n )k,
J(x̄n ) ≤ kb
xi (x̄n ) − x̄n k + kx̄(x̄n ) − x

(99)

it is sufficient to show that the two terms on the right hand side are asymptotically vanishing, which is proved below.
• lim kb
xi (x̄n ) − x̄n k = 0. We bound kb
xi (x̄n ) − x̄n k as
n→∞

bi (x̄φn )k
kb
xi (x̄n ) − x̄n k ≤ kb
xi (x̄φn ) − x̄φn k + kx̄φn − x̄n k + kb
xi (x̄n ) − x
(a)

≤ kb
xi (x̄φn ) − x̄φn k + (1 + L̂)kx̄φn − x̄n k,

(100)

where (a) follows from Lemma 7. From (97) we know lim kx̄φn − x̄n k = 0.
n→∞

To show kb
xi (x̄φn ) − x̄φn k is asymptotically vanishing, we bound it as
bi (xn(i) )k + kb
en(i) k
kb
xi (x̄φn ) − x̄φn k ≤ kb
xi (x̄φn ) − x
xi (xn(i) ) − x
+ ke
xn(i) − x̄φn k.

(101)

The result lim kb
xi (x̄φn ) − x̄φn k = 0 follows from Lemma 7, Lemma 8 and
n→∞

points i)-iii).
From (100) and (101) we conclude

lim kb
xi (x̄n ) − x̄n k = 0.

n→∞

(102)

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

37

bi (x̄n )k = 0. Using the first order optimality
• We prove lim kx̄(x̄n ) − x
n→∞

bi (x̄n ), we can bound their difference as
conditions of x̄(x̄n ) and x

bi (x̄n )k ≤ k∇fei x
bi (x̄n ); x̄n − ∇fi (x̄n ) − x
bi (x̄n ) + x̄n k
kx̄(x̄n ) − x

bi (x̄n ); x̄n − ∇fi (b
≤ k∇fei x
xi (x̄n ))k

+ k∇fi (b
xi (x̄n )) − ∇fi (x̄n )k + kb
xi (x̄n ) − x̄n k

(103)

≤ (L̃i + Li + 1)kb
xi (x̄n ) − x̄n k.
Using (102) we have

bi (x̄n )k = 0.
lim kx̄(x̄n ) − x

n→∞

(104)

The proof is completed just combining (99), (102) and (104).

7 Numerical results
7.1 Sparse regression
In this section, we test the performance of SONATA on the sparse linear
regression problem (1) [cf. Sec. 2.1]. We generated the data set as follows. The
ground truth signal x? ∈ R500 is built by first drawing randomly a vector
from the normal distribution N (0, I), then thresholding the smallest 80% of
its elements to zero. The underlying linear model is bi = Ai x? + ni , where the
observation matrix Ai ∈ R20×500 is generated by first drawing i.i.d. elements
from the distribution N (0, 1), and then normalizing the rows to unit norm; and
ni is the additive noise, with i.i.d. entries from N (0, 0.1). We simulated 100
Monte Carlo trials, generating in each trial new Ai ’s and ni ’s. We considered
a time-varying digraph, composed of I = 30 agents. In every time slot, a new
digraph is generated according to the following procedure: each agent i has
two out-neighbors, one of them belonging to a chain connecting all the agents
and the other one picked uniformly at random.
To promote sparsity we use
P
the (nonconvex) log function G(x) = λ · i log(1 + θ |xi |)/ log(1 + θ), where
the parameter θ controls the tightness of the approximation of the `0 function.
We set λ = 0.1 and θ = 2. It is convenient to rewrite G(x) in the DC form
G(x) = G+ (x) − G− (x), with G+ (x) = kxk1 · (θ/ log(1 + θ)). It is not difficult
to check that such G+ and G− satisfy Assumption A.3; see, e.g., [1].
We run SONATA considering two alternative choices of fei , namely:
• SONATA-PL (PL stands for partial linearization): Since fi = kbi − Ai xk2
is convex, one can keep fi unalterated and set in (28) fei (x(i) ) = fi (x(i) ) +
en(i) of the
(τP L /2) · kx(i) − xn(i) k2 . We set τP L = 1.5. The unique solution x
resulting subproblem (26) is computed using the FLEXA algorithm, with the
following tuning (see [16] for details): the initial point is selected randomly;
the proximal parameter in the subproblems solved by FLEXA is set to be
2; and the step-size of FLEXA is chosen according to the diminishing rule
γ r = γ r−1 1 − µγ r−1 , with γ 0 = 0.5 and µ = 0.01, with r denoting the

38

Gesualdo Scutari, Ying Sun

r
r
(inner) iteration index. We terminate FLEXA when J(i)
≤ 10−8 , with J(i)
,
n
n,r
n,r
n,r
n,r
n
−
n
>
e i +λ∇G (x(i) ))k∞ ,
kx(i) −Sηλ (x(i) −2Ai (Ai x(i) −bi )−τP L ·(x(i) −x(i) )+ π
denotes
the
value
of
x
at
the
n-th
outer
and the r-th inner
where xn,r
(i)
(i)

iteration, and Sβ (x) , sign (x) · max{|x| − λ1, 0} is the soft-thresholding
operator (intended to be applied to x component-wise).
• SONATA-L (L stands for linearization): To obtain a closed form expression for
en(i) in (28), one can choose fei as linearization of fi (plus the proximal term),
x
that is, fei (x(i) ) = 2A> (Ai xn − bi + (τL /2) · kx(i) − xn k2 . We set τL = 1.5.
i

(i)

(i)

en(i) of the resulting subproblem (28) has the following closed form
The solution x
n
e ni − λ∇G− (xn(i) ))).
en(i) = Sηλ/τL (xn(i) − τ1L (2A>
expression x
i (Ai x(i) − bi ) + π
As benchmark, we also simulated the subgradient-push algorithm [27] with
diminishing step-size. Note that there is no proof of convergence for such a
scheme, when applied to the nonconvex, nonsmooth problem (1). For all the
algorithms, we use the same step-size rule: αn = αn−1 1 − µαn−1 , with α0 =
0.5 and µ = 0.01. Also, for all algorithms, we set x0(i) = 0, for all i.
We monitor the progresses of the algorithms towards stationarity and consensus using respectively the following two functions: i) J n , kx̄n − Sηλ (x̄n −
P
n
−
n
n
n
n
2 i A>
i (Ai x̄ − bi ) + λ∇G (x̄ )) k∞ ; and ii) D , kx − Jx k∞ . It is not
n
difficult to check that J is a valid distance of the average iterates Jxn from
stationarity: it is continuous and zero if and only if its argument is a stationary solution of (1). We also use the normalized mean squared error (NMSE),
defined as NMSEn , kxn − (1 ⊗ I) x? k2 /(I · kx? k2 ).
In Fig. 1, we plot log10 J n and log10 Dn [subplot (a)] and the NMSE [subplot (b)] versus the number of agents’ message exchanges, averaged over 100
Monte-Carlo trials (we applied the log10 transform to J n and Dn so that their
distribution is closer to the normal one). The figures show that both versions of
SONATA are much faster than the distributed gradient algorithm. This seems
mainly due to the gradient tracking mechanism put forth by the proposed
scheme. Under the same tuning, SONATA-PL converges faster than SONATAL. According to our intensive simulations (not reported here), SONATA-PL
becomes up to one order of magnitude faster than SONATA-L when τP L is
reduced whereas reducing τL slows down SONATA-L.
7.2 Distributed PCA
Our second application is the distributed PCA problem
min

kxk2 ≤1

F (x) , −

I
X
i=1

2

kDi xk ,

(105)

with I = 30.
Each agent i locally owns a data matrix Di ∈ Rdi ×m and communicate
via a time-varying digraph generated in the same way as the previous sparse
regression example (cf. Sec. 7.1).
2
Since fi (x) , − kDi xk is concave, to apply SONATA we construct fei by
n >
linearizing fi , which leads to Fei (x(i) ; xn(i) ) = (I · y(i)
) (x(i) − xn(i) ) + (τ /2) ·

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs
101

SONATA-L
SONATA-PL
Gradient

Jn

100

Dn

0.4

J n)

10−3

0.2

Dn

10−4
10

NMSE

0.6

10−2

−5

SONATA-L
SONATA-PL
Gradient

0.8

10−1

39

200

400
600
Message exchanges

(a)

NMSE = 0.026
800

0

200

400
600
Message exchanges

800

(b)

Fig. 1 Sparse regression problem (1) with log regularizer: SONATA-PL, SONATA-L, and
subgradient-push; average of log10 J n and log10 Dn vs. agent’s message exchange [subplot
(a)]; average of NMSE vs. agent’s message exchange [subplot (b)].

en(i) of the resulting subproblem has the closed
kx(i) − xn(i) k2 . The solution x
n
n
e(i) = Pkx(i) k≤1 (xn(i) − I · y(i)
form solution x
/τ ), where P denotes the Euclidean
projection onto the set {x(i) : kx(i) k ≤ 1}. As benchmark, we implemented
also the gradient projection algorithm [4], adapted to time-varying network.
Note that there is no formal proof of this algorithm in the simulated setting.
The performance of the algorithms is tested on both synthetic and real data
sets, as detailed next.
7.2.1 Synthetic data
Each agent i locally owns a data matrix Di ∈ R30×500 , whose rows are i.i.d.,
drawn by the N (0, Σ). The covariance matrix Σ, whose eigendecomposition
is Σ = UΛUT , is generated as follows: we synthesize U by first generating
a square matrix whose entries follow the i.i.d. standard normal distribution,
then perform the QR decomposition to obtain its orthonormal basis; and the
eigenvalues diag(Λ) are i.i.d. uniformly distributed in [0, 1].
The algorithms are tuned as follows: x0(i) is generated with i.i.d elements
drawn by the standard Normal distribution. The step-size αn is chosen according to the diminishing rule used in the previous example, where we set
α0 = 1 and µ = 10−3 for SONATA and α0 = 1 and µ = 10−2 for the gradient
algorithm. The proximal parameter τ for SONATA is set to be 1. The distance
of x̄n from stationarity is measured by J n , kx̄n −Pkx(i) k≤1 (x̄n −∇F (x̄n ))k∞ ,
while the consensus disagreement Dn and the NMSEn are defined as in the
previous example; in the definition of P
NMSEn the ground truth signal x? is
I
now the leading eigenvector of matrix i=1 D>
i Di .
n
In Fig. 2, we plot log10 J and log10 Dn [subplot (a)] and the NMSE [subplot (b)] versus the number of agents’ message exchanges, averaged over 100
Monte-Carlo trials. In each trial, Σ is fixed while the Di ’s are randomly generated. Fig. 2(a) clearly shows that SONATA can find a stationary point efficiently while the gradient algorithm progresses very slowly. More interestingly,
Fig. 2(b) shows that SONATA always find the leading eigenvector whereas the
gradient algorithm fails to achieve a small NMSE value.

40

Gesualdo Scutari, Ying Sun
100

2

SONATA-L
Gradient

Dn
10−1

SONATA-L
Gradient

1.5

10−2

NMSE

Jn
Dn

10−3

0.5
J

10−4

1

100

n

200
300
400
500
number of message exchanges

00

600

100

(a)

200
300
400
500
number of message exchanges

600

(b)

Fig. 2 Distributed PCA Problem (1) on synthetic data set: SONATA and gradient projection algorithm; average of log10 J n and log10 Dn vs. agent’s message exchange [subplot (a)];
average of NMSE vs. agent’s message exchange [subplot (b)].
100

Dn

1.5

SONATA-L
Gradient

SONATA-L
Gradient

10−1
Jn
MSE

1

10−2
Dn

0.5

10−3
Jn
10

−4

20

40
60
80
number of message exchanges

(a)

100

0

20

40
60
80
number of message exchanges

100

(b)

Fig. 3 Distributed PCA Problem (1) on gene expression data set: SONATA and gradient
projection algorithm; average of log10 J n and log10 Dn vs. agent’s message exchange [subplot
(a)]; average of NMSE vs. agent’s message exchange [subplot (b)].

7.2.2 Gene expression data
This second experiment tests SONATA on a real-world data set. Specifically,
we used the breast cancer gene expression data set [5], which consists of d =
158 samples and m = 12625 genes per sample. We first uniformly randomly
permute the order the samples and then equally divided the samples among
the I = 30 agents. To avoid the issue that d is not divisible by I, we let the
first I − 1 agents owing di = bd/Ic samples each, while the I-th agent owning
the remaining samples. The samples are preprocessed by subtracting the mean
from all of them. Note that this can be achieved distributively by running an
average consensus algorithm beforehand.
The rest of the setting and tuning of the algorithms are the same of those
described in Sec. 7.2.1. In Fig. 3, we plot log10 J n and log10 Dn [subplot (a)]
and the NMSE [subplot (b)] versus the number of agents’ message exchanges,
averaged over 100 Monte-Carlo trials. In each trial, samples are randomly
partitioned among the agents. From the figure we can see that the behavior of
the algorithms on the gene expression data set is similar to that on synthetic

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

41

data set. Moreover, SONATA converges quite fast even though the variable
dimension of the real data set we adopted is massive.
8 Appendix
A Proof of Lemma 3
We begin introducing the following intermediate result.
Lemma 15 In the setting of Lemma 3, the following hold:
(i) The elements of An:0 , n ∈ N+ , can be bounded as



min At:0 1 i ≥ φlb ,
inf
t∈N+

(106)

1≤i≤I


sup
t∈N+

max

1≤i≤I


At:0 1 i


≤ φub ,

(107)

where φlb and φub are defined in (8);
(ii) For any given n, k ∈ N+ , n ≥ k, there exists a stochastic vector ξ k , [ξ1k , . . . ξIk ]> (i.e.,
ξ k > 0 and 1> ξ k = 1) such that
 n−k+1 
n:k
Wij
− ξjk ≤ c0 (ρ) (I−1)B ,
∀i, j ∈ [I],
(108)
where c0 and ρ are defined in (10).
The proof Lemma 15 follows similar steps as those in [31, Lemma 2, Lemma 4] and thus is
omitted, although the results in [31] are established under a stronger condition on G n than
Assumption B.
We prove now Lemma 3. Let z ∈ RI·m be an arbitrary vector. For each ` = 1, . . . , m,
define z` , (II ⊗ e>
` ) z, where e` is the `-th canonical vector; we denote by z`,j the j-th
component of z` , with j ∈ [I]. We have
v
u


m
2


u X
1
c n:k − J k z ≤ tI ·
W
.
(109)
Wn:k − 1 (φk )> z`
φ
2
I
∞
`=1
We bound next the above term. Given ξ k as in Lemma 15 [cf. (108)], define En:k , Wn:k −
n:k . We have
1(ξ k )> , whose ij-th element is denoted by Eij



>
1
1
(15)
Wn:k − 1 φn+1
=
1(φk )> z`
Wn:k z`
I
I
∞
∞



1
n+1 >
n:k
I− 1 φ
=
E
z`
I
∞


! I
I
I φn+1 X
X
X
φn+1
j0
n:k
n:k
i

≤ max
1−
Eij
z`,j +
E 0
z`,j 
1≤i≤I
I
I j=1 j j
j=1
j 0 6=i
 n−k+1 
 n−k+1 
√
≤ 2 c0 (ρ) (I−1)B kz` k1 ≤ 2 c0 (ρ) (I−1)B
I kz` k2 .


Wn:k −

(110)

Combining (109) and (110) we obtain
c n:k − J k
W
φ

2

 n−k+1 
≤ 2c0 I(ρ) (I−1)B .

(111)

Moreover, the matrix difference above can be alternatively uniformly bounded as follows:
c n:k − J k
W
φ

(a) √

c n:k k ≤ kI − J n+1 kkW
c n:k k ≤
= k(I − Jφn+1 )W
φ

c n:k k ≤
where (a) follows from (25) and kW

√

I. This completes the proof.

2I ·

√
I,


42

Gesualdo Scutari, Ying Sun

B Proof of Lemma 11
Recall the SONATA update written in vector-matrix form in (43)-(45). Note that the xupdate therein is a special case of the perturbed condensed push-sum algorithm (16), with
c n xn . We can then apply Proposition 1 and readily obtain (69).
perturbation δ n+1 = αn W
To prove (70), we follow a similar approach: noticing that the y-update in (45) is a

b n+1 )−1 gn+1 − gn , we can write
special case of (16), with perturbation δ n+1 = (D
φ
B̄
ken+
k ≤ ρB̄ ken
y
yk +

√

2I

B̄−1
X


b n+t+1 )−1 gn+t+1 − gn+t k
k(D
φ

t=0

≤ ρB̄ en
y +

√

2I Lmx φ−1
lb

B̄−1
X

c n+t (xn+t + αn+t ∆xn+t ) − xn+t
W

t=0

≤ ρB̄ en
y +

√

2I Lmx φ−1
lb

B̄−1
X

c n+t en+t + en+t + αn+t W
c n+t ∆xn+t
W
x
x



t=0

≤ ρB̄ en
y +

√

2I Lmx φ−1
lb

B̄−1
X √


√
+ αn+t I ∆xn+t
( I + 1) en+t
x

t=0
B̄−1
X
√

−1
≤ ρB̄ en
2 en+t
+ αn+t ∆xn+t .
y + I 2I Lmx φlb
x
t=0

This completes the proof.

References
1. Ahn, M., Pang, J.S., Xin, J.: Difference-of-convex learning I: Directional stationarity,
optimality, and sparsity (submitted for publication, 2016)
2. Bertsekas, D.P.: Nonlinear programming. Athena Scientific, 2 ed. (1999)
3. Bertsekas, D.P., Tsitsiklis, J.N.: Gradient convergence in gradient methods with errors.
SIAM Journal on Optimization 10(3), 627–642 (2000)
4. Bianchi, P., Jakubowicz, J.: Convergence of a multi-agent projected stochastic gradient
algorithm for non-convex optimization. IEEE Trans. on Automatic Control 58(2), 391–
405 (2013)
5. Bild, A.H., et al.: Oncogenic pathway signatures in human cancers as a guide to targeted
therapies. Nature 439(7074), 353 (2006)
6. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine
learning. SIAM Review 60(2), 223–311 (2018)
7. Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and support vector machines. In: Proc. of the Fifteenth International Conference on Machine
Learning (ICML 1998), vol. 98, pp. 82–90 (1998)
8. Cattivelli, F.S., Sayed, A.H.: Diffusion LMS strategies for distributed estimation. IEEE
Trans. on Signal Processing 58(3), 1035–1048 (2010)
9. Chang, T.H.: A proximal dual consensus ADMM method for multi-agent constrained
optimization. IEEE Trans. on Signal Processing 64(14), 3719–3734 (2014)
10. Chang, T.H., Hong, M., Wang, X.: Multi-agent distributed optimization via inexact
consensus ADMM. IEEE Trans. on Signal Processing 63(2), 482–497 (2015)
11. Chen, J., Sayed, A.H.: Diffusion adaptation strategies for distributed optimization and
learning over networks. IEEE Trans. on Signal Processing 60(8), 4289–4305 (2012)
12. Di Lorenzo, Scutari, G.: NEXT: In-network nonconvex optimization. IEEE Trans. on
Signal and Information Processing over Networks 2(2), 120–136 (2016). Appeared on
arXiv on Feb. 1, 2016
13. Di Lorenzo, P., Scutari, G.: Distributed nonconvex optimization over networks. In: Proc.
of the IEEE 6th International Workshop on Computational Advances in Multi-Sensor
Adaptive Processing (CAMSAP 2015). Cancun, Mexico (2015)

Distributed Nonconvex Constrained Optimization over Time-Varying Digraphs

43

14. Di Lorenzo, P., Scutari, G.: Distributed nonconvex optimization over time-varying networks. In: Proc. of the IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP 16). Shanghai (2016)
15. Facchinei, F., Lampariello, L., Scutari, G.: Feasible methods for nonconvex nonsmooth
problems with applications in green communications. Mathematical Programming
164(1–2), 55–90 (2017)
16. Facchinei, F., Scutari, G., Sagratella, S.: Parallel selective algorithms for nonconvex big
data optimization. IEEE Trans. on Signal Processing 63(7), 1874–1889 (2015)
17. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle
properties. J. of the American statistical Association 96(456), 1348–1360 (2001)
18. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning : Data
Mining, Inference, and Prediction, Springer Series in Statistics, vol. 1. New York, NY
: Springer-Verlag New York (2009)
19. Fu, W.J.: Penalized regressions: the bridge versus the lasso. J. of Computational and
Graphical Statistics 7(3), 397–416 (1998)
20. Gharesifard, B., Cortés, J.: When does a digraph admit a doubly stochastic adjacency
matrix? In: Proceedings of the 2010 American Control Conference, pp. 2440–2445 (2010)
21. Hong, M., Hajinezhad, D., Zhao, M.: Prox-PDA: The proximal primal-dual algorithm
for fast distributed nonconvex optimization and learning over networks. In: Proceedings
of the 34th International Conference on Machine Learning (ICML 2017), vol. 70, pp.
1529–1538 (2017)
22. Jakovetic, D., Xavier, J., Moura, J.M.: Cooperative convex optimization in networked
systems: Augmented Lagrangian algorithms with directed gossip communication. IEEE
Trans. on Signal Processing 59(8), 3889–3902 (2011)
23. Jakovetić, D., Xavier, J., Moura, J.M.: Fast distributed gradient methods. IEEE Trans.
on Automatic Control 59(5), 1131–1146 (2014)
24. Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregate information.
In: Proc. of the 44th Annual IEEE Symposium on Foundations of Computer Science,
pp. 482–491. Cambridge, MA, USA, (2003)
25. Mokhtari, A., Shi, W., Ling, Q., Ribeiro, A.: DQM: Decentralized quadratically approximated alternating direction method of multipliers. arXiv:1508.02073 (2015)
26. Mokhtari, A., Shi, W., Ling, Q., Ribeiro, A.: A decentralized second-order method with
exact linear convergence rate for consensus optimization. IEEE Trans. on Signal and
Information Processing over Networks 2(4), 507–522 (2016)
27. Nedic, A., Olshevsky, A.: Distributed optimization over time-varying directed graphs.
IEEE Trans. on Automatic Control 60(3), 601–615 (2015)
28. Nedić, A., Ozdaglar, A., Parrilo, P.A.: Constrained consensus and optimization in multiagent networks. IEEE Trans. on Automatic Control 55(4), 922–938 (2010)
29. Nedich, A., Olshevsky, A., Ozdaglar, A., Tsitsiklis, J.N.: On distributed averaging algorithms and quantization effects. IEEE Trans. on Automatic Control 54(11), 2506–2517
(2009)
30. Nedich, A., Olshevsky, A., Shi, W.: Achieving geometric convergence for distributed
optimization over time-varying graphs. SIAM J. on Optimization 27(4), 2597–2633
(2017)
31. Nedich, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization.
IEEE Trans. on Automatic Control 54(1), 48–61 (2009)
32. Palomar, D.P., Chiang, M.: Alternative distributed algorithms for network utility maximization: Framework and applications. IEEE Trans. on Automatic Control 52(12),
2254–2269 (2007)
33. Qu, G., Li, N.: Harnessing smoothness to accelerate distributed optimization.
arXiv:1605.07112 (2016)
34. Rao, B.D., Kreutz-Delgado, K.: An affine scaling methodology for best basis selection.
IEEE Trans. on Signal Processing 47(1), 187–200 (1999)
35. Sayed, A.H., et al.: Adaptation, learning, and optimization over networks. Foundations
and Trends in Machine Learning 7(4-5), 311–801 (2014)
36. Scutari, G., Facchinei, F., Lampariello, L.: Parallel and distributed methods for constrained nonconvex optimization–Part I: Theory. IEEE Trans. on Signal Processing
65(8), 1929–1944 (2017)

44

Gesualdo Scutari, Ying Sun

37. Scutari, G., Facchinei, F., Song, P., Palomar, D.P., Pang, J.S.: Decomposition by partial linearization: Parallel optimization of multi-agent systems. IEEE Trans. on Signal
Processing 62(3), 641–656 (2014)
38. Shi, W., Ling, Q., Wu, G., Yin, W.: EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM J. on Optimization 25(2), 944–966 (2015)
39. Shi, W., Ling, Q., Wu, G., Yin, W.: A proximal gradient algorithm for decentralized
composite optimization. IEEE Trans. on Signal Processing 63(22), 6013–6023 (2015)
40. Sun, Y., Scutari, G.: Distributed nonconvex optimization for sparse representation. in
Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) pp. 4044–4048 (2017)
41. Sun, Y., Scutari, G., Palomar, D.: Distributed nonconvex multiagent optimization over
time-varying networks. in Proc. of the Asilomar Conference on Signals, Systems, and
Computers (2016). Appeared on arXiv on July 1, 2016
42. Tatarenko, T., Touri, B.: Non-convex distributed optimization. arXiv:1512.00895 (2016)
43. Thi, H.L., Dinh, T.P., Le, H., Vo, X.: DC approximation approaches for sparse optimization. European J. of Operational Research 244(1), 26–46 (2015)
44. Wai, H.T., Lafond, J., Scaglione, A., Moulines, E.: Decentralized frank-wolfe algorithm
for convex and non-convex problems. arXiv:1612.01216 (2017)
45. Wei, E., Ozdaglar, A.: On the o(1/k) convergence of asynchronous distributed alternating direction method of multipliers. In: Proc. of the IEEE Global Conference on Signal
and Information Processing (GlobalSIP 2013), pp. 551–554. Austin, TX, USA (2013)
46. Weston, J., Elisseeff, A., Schölkopf, B., Tipping, M.: Use of the zero-norm with linear
models and kernel methods. J. of machine learning research 3(Mar), 1439–1461 (2003)
47. Wright, S.J.: Coordinate descent algorithms. Mathematical Programming 151(1), 3–34
(2015)
48. Xi, C., Khan, U.A.: On the linear convergence of distributed optimization over directed
graphs. arXiv:1510.02149 (2015)
49. Xi, C., Khan, U.A.: ADD-OPT: Accelerated distributed directed optimization.
arXiv:1607.04757 (2016). Appeared on arXiv on July 16, 2016
50. Xiao, L., Boyd, S., Lall, S.: A scheme for robust distributed sensor fusion based on average consensus. In: Proc. of the 4th international symposium on Information processing
in sensor networks, pp. 63–70. Los Angeles, CA (2005)
51. Xu, J., Zhu, S., Soh, Y.C., Xie, L.: Augmented distributed gradient methods for multiagent optimization under uncoordinated constant stepsizes. In: Proc. of the 54th IEEE
Conference on Decision and Control (CDC 2015), pp. 2055–2060. Osaka, Japan (2015)
52. Zhang, S., Xin, J.: Minimization of transformed L1 penalty: Theory, difference of convex
function algorithm, and robust application in compressed sensing (2016). URL ftp:
//ftp.math.ucla.edu/pub/camreport/cam14-86.pdf
53. Zhu, M., Martı́nez, S.: An approximate dual subgradient algorithm for multi-agent nonconvex optimization. IEEE Trans. on Automatic Control 58(6), 1534–1539 (2013)

