arXiv:1809.00710v3 [math.OC] 16 Mar 2020

A Dual Approach for Optimal Algorithms in Distributed
Optimization over Networks
César A. Uribea∗ , Soomin Leeb , Alexander Gasnikovc , and Angelia Nedićd
a

Laboratory for Information and Decision Systems, and the Institute for Data, Systems, and
Society, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; b Yahoo!
Research, Sunnyvale, California, USA; c Moscow Institute of Physics and Technology, and
Institute for Information Transmission, Moscow Oblast, Russia; d School of Electrical, Computer
and Energy Engineering, Arizona State University, Tempe, Arizona, USA
We study dual-based algorithms for distributed
Pmconvex optimization problems over networks,
where the objective is to minimize a sum
i=1 fi (z) of functions over in a network. We
provide complexity bounds for four different cases, namely: each function fi is strongly convex and smooth, each function is either strongly convex or smooth, and when it is convex
but neither strongly convex nor smooth. Our approach is based on the dual of an appropriately formulated primal problem, which includes a graph that models the communication
restrictions. We propose distributed algorithms that achieve the same optimal rates as their
centralized counterparts (up to constant and logarithmic factors), with an additional optimal cost related to the spectral properties of the network. Initially, we focus on functions
for which we can explicitly minimize its LegendreFenchel conjugate, i.e., admissible or dual
friendly functions. Then, we study distributed optimization algorithms for non-dual friendly
functions, as well as a method to improve the dependency on the parameters of the functions
involved. Numerical analysis of the proposed algorithms is also provided.
Keywords: Distributed optimization; optimal rates; optimization over networks; convex
optimization; primal-dual algorithms
AMS Subject Classification: 90C25 ; 90C30 ; 90C60 ; 90C35

1.

Introduction

This paper studies the design of efficient distributed algorithms for the solution of convex
optimization problems over networks. The study of distributed algorithms can be traced
back to classic papers from the 70s and 80s [9, 16, 86]. The adoption of distributed
optimization algorithms on several fronts of applied and theoretical machine learning,
robotics, and resource allocation has increased the attention on such methods in recent
years [37, 38, 53, 72, 88]. The particular flexibilities induced by the distributed setup
make them suitable for large-scale learning problems involving large quantities of data
[1, 10, 11, 51, 54].
∗ Corresponding author. Email: cauribe@mit.edu

We consider the following optimization problem
min

z∈Rn

m
X

fi (z),

(1)

i=1

where the each fi : Rn → R ∪ {+∞} is a closed convex function known by an agent i
only, that represents a node in an arbitrary communication network. Problem (1) is to
be solved in a distributed manner by repeated interactions of a set of agents over a static
network.
Initial algorithms for distributed optimization, such as distributed subgradient methods, were shown successful for solving optimization problems in a distributed manner
over networks [50, 56, 57, 74]. Nevertheless, these algorithms are particularly slow compared with their centralized counterparts. Recently, distributed methods that achieve
linear convergence rates for minimizing a sum of strongly convex and smooth (network)
objective functions have been proposed. For example, [7] studies a distributed resource
allocation problem but there the objective functions are decoupled, and the agents couple
through the links in a different manner (sharing resources). The agents’ coupling in the
problem we consider is quite different from that in [7], and consequently, our methods
and their analysis require more information about the graph structure. Also, unlike [47],
we explore the explicit dependency on the graph topology. One can identify three main
approaches to the study of distributed algorithms.
p
• In [52], a new method was proposed where it was shown that O((m2 + L/µm) log ε−1 )
iterations are required to find an ε solution to the optimization problem when the function is µ-strongly convex and L-smooth, and m is the number of nodes in the network.
In [33], a unifying approach was proposed, that recovers rate results from several existing algorithms such as those in [71, 79]. This newly proposed
p general method is
able to recover existing rates and achieves an ε precision in O( L/(µλ2 ) log ε−1 ) iterations, where λ2 is the second largest eigenvalue of the interaction matrix. These
results require some minimal information about the topology of the network and provide explicit statements about the dependency of the convergence rate on the problem
parameters. Specifically, polynomial scalability is shown with the network parameter
for particular choices of small enough step-sizes and even uncoordinated step-sizes are
allowed [58]. One particular advantage of this approach is that can handle time-varying
and directed graphs. Nevertheless, optimal dependencies on the problem parameters
and tight convergence rate bounds are far less understood.
• Secondly, in [81], a new analysis technique for the convergence rate of distributed optimization algorithms via a semidefinite programming characterization was proposed.
This approach provides an innovative procedure to numerically certify worst-case rates
of a plethora of distributed algorithms, which can be useful to fine-tune parameters
in existing algorithms based on feasibility conditions of a semidefinite program.
• A third approach was recently introduced in [77], where the first optimal algorithm for
distributed optimization
problems was proposed. This new method achieves an ε prep
√
cision in O( L/µ(1 + τ / γ) log ε−1 ) iterations for µ-strongly convex and L-smooth
problems, where τ is the diameter of the network and γ is the normalized eigengap
of the interaction matrix. Even though extra information about the topology of the
network is required, the work in [77] provides a coherent understanding of the optimal convergence rates and its dependencies on the communication network. The work
in [77] is based on the representation of the communication structure as an additional
set of linear constraints on the distributed problem to guarantee consensus on the
2

Table 1. Iteration complexity of distributed optimization algorithms. All estimates are presented up to
logarithmic factors, i.e. of the order Õ.

Approach

Reference

Centralized

[59]
[70]b

Gradient
Computations

µ-strongly
convex and
M -Lipschitz

q

L
µ

m3

[67]
[22]
[20]
[39]
[46]
[33]c

Communication
Rounds

µ-strongly
convex and
L-smooth

[77]
[41]

This paper

L-smooth

M -Lipschitz

M2
µε

q

L
ε

M2
ε2


L 5/7
µ

−

1
ε5/7

−

−

−

−

2

m M2

ε
2
m2 M2
ε

−
m2 L
µ
−
m4 L
qµ
m2 L
µ

−
−
−
−

−
−
m3 L
ε
m4 L
ε

−

−

−

q

−
q

−

−

−
q
m L
ε

m2 M
ε

m

L
µ

M2

m2
q µε
2
m M
µε

−
q
m L
µ

−
−
−

mM
ε

a Additionally, it is assumed functions are proximal friendly. No explicit dependence on L, M

or m is

provided.
b An iteration complexity of Õ(

p
1/ε) is shown if the objective is the composition of a linear map and a
strongly convex and smooth function. Moreover, no explicit dependence on L and m is provided.
c A linear dependence on m is achieved if L is sufficiently close to µ.

solution, from which a primal-dual method can be applied [36, 45, 69, 80, 90, 91]. For
example in [43], the authors develop a new primal-dual algorithm that uses the Laplacian of the communication graph as a set of linear constraints to induce coordination.
Moreover, with additional metric subregularity conditions a linear convergence rate is
shown. Recently in [78], the authors extended the optimality lower bounds from [77]
to the non-smooth case using the randomized regularization approach [21].
In this paper, we follow the approach in [77] by formulating a dual problem and exploit
recent results in the study of convex optimization problems with affine constraints [3, 14,
28] to develop algorithms with provably optimal convergence rates for the cases where
each of the objective functions fi has one the following properties: 1) it is strongly convex
and with Lipschitz continuous gradients; 2) it is strongly convex and Lipschitz continuous
(but not necessarily smooth); 3) it is convex with Lipschitz continuous gradients, and
4) it is convex and Lipschitz continuous (not necessarily smooth), see Table 1. We say
that a function f is M -Lipschitz if k∇f (x)k ≤ M , a function f is L-smooth if k∇f (x) −
∇f (y)k2 ≤ Lkx − yk2 , a function f is µ-strongly convex (µ-s.c.) if, for all x, y ∈ Rn ,
f (y) ≥ f (x) + h∇f (x), y − xi + µ2 kx − yk2 .
Again, our goal is to investigate whether distributed algorithms, that use local agents
objective functions and local agents’ interactions in a given communication graph, can
match the performance of their centralized counterparts. With respect to this goal, the
contributions of our work is in the development of the algorithms and their analysis. The
major novelty is in the establishment of the results showing that the performance of our
distributed algorithms can match the performance of their centralized counterparts, up
to a logarithmic factor. These rate results are established for two classes of functions,
namely, the functions that are dual friendly and those that are not dual friendly. However,
to accomplish these optimal rates (up to a logarithmic scaling) of centralized methods,

3

some common knowledge of certain quantities about the underlying graph is needed.
These quantities may require additional information preprocessing in the graph, which we
do not investigate here; it is beyond the scope of the paper. Additionally, our convergence
rate analysis is useful for the design of the communication graphs, as it indicates how
the performance of our methods will depend on various quantities of the graph and the
agents’ objective functions.
We provide a sequence of algorithms, to minimize functions from the problem classes
described above, that achieve an ε-approximate solution (c.f. Definition 1) for any fixed,
connected and undirected graph according to Table 1, where universal constants, logarithmic terms, and dependencies on the initial conditions are hidden for simplicity. The
resulting iteration complexities are given both for the optimality of the solution and
the violation of the consensus constraints. Note that for distributed algorithms based on
primal iterations these estimates translate to computations of gradients of the local functions for each of the agents. On the other side, in dual based algorithms, the complexity
refers to computations of the gradients of the Lagrangian dual function, which translates to the number of communication rounds in the network. Initially, we consider dual
friendly or admissible functions. Our results match known optimal complexity bounds
for centralized convex optimization (obtained by classical methods such as Nesterov’s
fast gradient method FGM [60]), with an additional cost induced by the network of
communication constraints. This extra cost appears in the form of a multiplicative term
proportional to the square root of the spectral gap of the interaction matrix. Later we
study non-dual friendly functions where the number of gradient computations has to be
taken into account. Therefore, we provide complexity bounds in terms of oracle calls, i.e.,
gradient computations, as well as communication rounds. Moreover, we provide a new
algorithm for smooth functions that improves the dependency on the strong convexity
parameters of the local functions.
This paper is organized as follows: we start with a couple of preliminary definitions
and auxiliary results in Section 2. Section 3 introduces the problem of distributed optimization over a network. Section 4 presents a general form of primal-dual analysis of
convex problems with linear constraints, this will serve as the basis for our main results.
Section 5 provides our main results on the optimal convergence rates for distributed convex minimization problems with a dual friendly structure. Section 6 provides our main
results on the optimal convergence rates for distributed convex minimization problems
where an exact solution to the dual subproblem is not available. Section 7 presents a
method to improve the dependency of the convergence rate on the condition number
of the function F . Discussions are provided in Section 8. Section 9 provides numerical
experiments of the proposed algorithms. Finally, Section 10 provides some conclusions
and points out future work.
Notation: We will assume that the nodes in the network, also referred to as agents,
are indexed from 1 through m (no actual enumeration is needed in the execution of the
proposed algorithms; it is only used in our analysis). We use the superscripts i or j to
denote agent indices and the subscript k to denote the iteration index of an algorithm.
We denote by [A]ij the entry of the matrix A in its i-th row and j-th column, and
write In for the identity matrix of size n. For a positive semi-definite matrix W , we let
λmax (W ) be its largest eigenvalue and λ+
min (W ) be its smallest positive eigenvalue, and
we denote its condition number by χ(W ) = λmax (W )/λ+
min (W ). Given a matrix A, define
+
T
σmax (A) , λmax (AT A) and σmin
(A) , λ+
(A
A).
We
use 1 to denote a vector with all
min
entries equal to 1. We write Õ to denote a complexity bound that ignores logarithmic
factors. We will work in the standard Euclidean norm, denoted by k · k2 .

4

Algorithm 1 Nesterov’s Constant Step Scheme II.
µ
n
1: Choose x0 ∈ R and α0 ∈ (0, 1). Set y0 = x0 and q = L .
2: kth iteration (k ≥ 0).
(a) Compute ∇f (yk ). Set xk+1 = yk − L1 ∇f (yk ).
2
(b) Compute αk+1 ∈ (0, 1) from equation αk+1
= (1 − αk+1 )αk2 + qαk+1 ,
k)
and set βk = ααk2 (1−α
, yk+1 = xk + βk (xk+1 − xk ).
+αk+1
k

2.

Preliminaries

In this section, we provide some definitions and preliminary information that we will use
in the forthcoming sections.
If each function fi in (4) is µi -strongly convex in xi , then F in (4) is µ-strongly convex
in x with µ = min1≤i≤m µi . Also, if each fi is Li -smooth, then F is L-smooth with
L = max1≤i≤m Li . In Section 7 we explore a method to improve the dependency on the
condition number µ from the worst case parameter to the average strong convexity.
We will build the proposed algorithms based on Nesterov’s fast gradient method
(FGM) [63]. For example, a version of the FGM for a µ-strongly convex and L-smooth
function f is shown in Algorithm 1. Note that Theorem 2.1 is precisely Theorem 2.2.3
in [63] Other variants of this method can be found in [5, 42, 63].
Theorem 2.1 (Adapted from Theorem 2.2.3 in [63]) If in Algorithm 1 one sets α0 such
that L(1 − α0 ) = α0 (α0 L − µ), then Algorithm 1 generates a sequence {xk }∞
k=0 such that
r k
µ
f (xk ) − f ≤ L 1 −
kx0 − x∗ k22 ,
L
∗



(2)

where f ∗ denotes the minimum value of the function f over Rn and x∗ is its minimizer.
Moreover, Algorithm 1 is optimal for unconstrained minimization of strongly convex and
smooth functions.

3.

Problem Statement

Defining the stacked column vector x = [xT1 , xT2 , . . . , xTm ]T ∈ Rmn allows us to rewrite
problem (1) in an equivalent form as follows:
min

x1 =...=xm

F (x)

where

F (x) ,

m
X

fi (xi ).

(3)

i=1

The distributed optimization framework assumes we want to solve problem (3) in
a distributed manner over a network. We model such a network as a fixed connected
undirected graph G = (V, E), where V is the set of m nodes, and E is the set of edges.
We assume that the graph G does not have self-loops. The network structure imposes
information constraints; specifically, each node i has access to the function fi only and
a node can exchange information only with its immediate neighbors, i.e., a node i can
communicate with node j if and only if (i, j) ∈ E.

5

The communication constraints imposed by the network can be represented as a set of
equivalent constraints via the Laplacian W̄ ∈ Rm×m of the graph G, defined as: [W̄ ]ij =
−1 if (i, j) ∈ E, [W̄ ]ij = deg(i) if i = j, and [W̄ ]ij = 0 otherwise, where deg(i) is
the degree of the node i, i.e., the number of neighbors of the node. Moreover, define
the communication matrix (also referred to as an interaction matrix) by W , W̄ ⊗ In ,
where ⊗ indicates the Kronecker product.
Throughout this paper, we assume the graph G = (V, E) is connected and undirected.
Under this assumption, the Laplacian matrix W̄ is symmetric and positive semi-definite.
Furthermore, the vector 1 is the unique (up to a scaling factor) eigenvector associated
with the eigenvalue λ = 0. One can verify that W inherits all the properties of W̄ , i.e.,
it is a symmetric positive semi-definite matrix
and it satisfies the following relations:
√
W x =√ 0 if and only if x1 = . . . = xm , W x = 0 if and only if x1 = . . . = xm , and
σmax ( W ) = λmax (W ). Therefore, one can equivalently rewrite problem (3) as follows:
√min F (x)
W x=0

where

F (x) ,

m
X

fi (xi ).

(4)

i=1

√
Note that the√constraint sets {x | W x = 0} and {x | x1 = . . . = xm } are equal. This
is because ker( W ) = span(1) due to G being connected. A similar idea of writing the
Laplacian as a product of a matrix B and its transpose has been employed in [43] using
the incidence matrix.
Our results provide convergence rate estimates for the solution of the problem in (1)
for four P
different cases in terms of the properties of the function
F (x) = m
i=1 fi (xi ).
P
Assumption 3.1 Consider a function F (x) = m
i=1 fi (xi ), and assume:
(a) Each fi is µi -strongly convex and Li -smooth, thus F is µ-strongly convex and
L-smooth;
(b) each fi is µi -strongly convex and Mi -Lipschitz on a ball around the optimal point
x∗ with radius kx∗ −x0 k2 , where x0 is an initialization point, thus F is µ-strongly
convex and M -Lipschitz on the same bounded set;
(c) each fi is convex and Li -smooth, thus F is convex and L-smooth;
(d) each fi is convex and Mi -Lipschitz on a ball around the optimal point x∗ with
radius kx∗ − x0 k2 , where x0 is an initialization point, thus F is convex and
M -Lipschitz on the same bounded set;
where µ = min1≤i≤m µi , L = max1≤i≤m Li and M = max1≤i≤m Mi .
Example 3.2 Consider a network of agents as shown in Figure 1, where the agents in
the network seek to cooperatively solve the following regularized linear regression problem
min

1

z∈Rn 2ml

1
kHz − bk22 + ckzk22 ,
2

(5)

where b ∈ Rml , H ∈ Rml×n and c > 0 is some constant. Furthermore, assume the data
in b and H are distributed over the network, where no single agent has full access to the
complete information, i.e., each agent has access to a subset of points such that
bT = [ bT1 | bT2 | · · · | bTm ],
|{z}
|{z}
|{z}
Agent 1

Agent 2

and

Agent m

T
H T = [ H1T | H2T | · · · | Hm
],
|{z}
|{z}
|{z}
Agent 1

6

Agent 2

Agent m

1

5

2

4

3

(a) Cycle graph.

(b) Erdős-Rényi random graph.

Figure 1. Two examples of networks of agents. (a) A cycle graph with 5 agents. (b) An Erdős-Rényi random graph
with 160 agents.

where bi ∈ Rl and Hi ∈ Rl×n for each i. Therefore, problem (5) is equivalent to
√min
W x=0

m 
X
1 1
i=1

2 ml

1 c
kbi − Hi xi k22 +
kxi k22
2m


,

(6)

where W = W̄ ⊗In . Particularly, for the cycle graph network of 5 agents shown in Figure
1(a), agent 1 can share information with agents 2 and 5, agent 5 shares information with
agents 1 and 4, and similarly for the other agents.

4.

Primal-Dual Analysis for Convex Problems with Linear Constraints

In this section, we consider a generic optimization problem with linear constraints. We
will use the convergence results in (2) to obtain some fundamental insights for the distributed optimization problem. Moreover, we will derive the results for a corresponding
distributed algorithm for solving problem (4). The main idea of our analysis will be to
explore the case when
√ the linear constraints Ax = 0 represent the network communication constraints as W x = 0 and the function f (x) corresponds to the network function
F (x) as defined in (4).
Initially, consider a µ-strongly convex and an L-smooth function f to be minimized
over a set of linear constraints, i.e.,
min f (x).

Ax=0

(7)

Assume that problem (7) is feasible, in which case a unique solution exists, denoted by
x∗ . However, we will be interested in finding approximate solutions of (7) that attain a
function value arbitrarily close to the optimal value and have arbitrarily small feasibility
violation of the linear constraints. For this, we introduce the following definition.
Definition 1 [41] A point x ∈ Rmn is called an (ε, ε̃)-solution of (7) if the following
conditions are satisfied: f (x) − f ∗ ≤ ε, and kAxk2 ≤ ε̃, where f ∗ = f (x∗ ) denotes the
optimal value for the primal problem in (7).

7

The Lagrangian dual of (7) is given by
n

min f (x) = max min f (x) − AT y, x
y

Ax=0

o

x

.

(8)

Moreover, (8) can be re-formulated as an equivalent minimization problem, as follows:
min ϕ(y)
y

where

ϕ(y) , max Ψ(x, y),
x

(9)

and Ψ(x, y) , AT y, x − f (x).
T
The function ϕ(y) is µϕ -strongly convex on ker(AT )⊥ with µϕ = λ+
min (A A)/L. MoreT
over, it has Lϕ -Lipschitz continuous gradients with Lϕ = λmax (A A)/µ, see Lemma 3.1
in [6], Proposition 12.60 in [75], Theorem 1 in [61], Theorem 6 in [35]. Additionally, from
Demyanov–Danskin’s theorem (see, for example, Proposition 4.5.1 in [8]), it follows that
∇ϕ(y) = Ax∗ (AT y),

(10)

where x∗ (AT y) denotes the unique solution of the inner maximization problem
x∗ (AT y) = arg max Ψ(x, y).

(11)

x

Also known as the sharp-operator [89].
We will call x∗ the minimizer of (7). On the other hand, we denote x∗ (AT y) as the
solution of (11) for a given value AT y. Particularly, x∗ (0) = arg maxx {−f (x)}. Moreover,
there is no duality gap between the primal problem in (7) and its dual problem in (9),
and the dual problem has a solution y ∗ (see for example, Proposition 6.4.2 in [8]). Thus,
it holds that x∗ = x∗ (AT y ∗ ). In general, the dual problem in (9) can have multiple
solutions of the form y ∗ + ker(AT ) when the matrix A does not have a full row rank,
for example when A is the Laplacian of a graph. If the solution is not unique, then y ∗
denotes the smallest norm solution, and we let R be its norm, i.e. R = ky ∗ k2 .
Note that we typically do not know R or Rx . Thus, we require a method to estimate the strong convexity parameter, which is challenging [62, 66]. Some recent work
have explored restarting techniques to reach optimal convergence rates when the strong
convexity parameters are unknown [34, 66]. Similarly, a generalization of the FGM algorithm can be proposed when the smoothness parameter is unknown [29]. However, the
effect of restarting in the distributed setup requires further study and is out of the scope
of this paper. Additionally, parameters such as λ+
min (W ), λmax (W ) can be efficiently
computed in a distributed manner [82]. Moreover, the values of µ and L can be shared
with simple max-consensus which is guaranteed to converge in finite time [32].
In the next Lemma, we provide an auxiliary condition to check whether a point x is
an (ε, ε̃)-solution in terms of the properties of the dual function ϕ.
Lemma 4.1 (Lemma 1 in [29]) Let hy, ∇ϕ(y)i ≤ ε and k∇ϕ(y)k2 ≤ ε̃. Then, x∗ (AT y)
is an (ε, ε̃)-solution of (7).
Remark 1 From a technical perspective, dual smoothing with linear constraints is not
new. However, the main technical contribution of our approach is the elimination of
the assumption that the target function has bounded variation on the admissible set
8

Algorithm 2 Nesterov’s Constant Step Scheme II for the dual problem.
n
1: Choose y0 ∈ R and α0 ∈ (0, 1). Set ỹ0 = y0 and q = µϕ /Lϕ .
2: kth iteration (k ≥ 0).
(a) Compute ∇ϕ(yk ). Set yk+1 = ỹk − L1ϕ ∇ϕ(ỹk ).
2
(b) Compute αk+1 ∈ (0, 1) from equation αk+1
= (1 − αk+1 )αk2 + qαk+1 ,
αk (1−αk )
and set βk = α2 +αk+1 , ỹk+1 = yk + βk (yk+1 − yk ).
k

(that is typically unbounded). That is, for a function f (·) defined on a set Q, it holds
that maxx,y∈Q {f (x) − f (y)} ≤ ∆ < ∞ [3, 29], or a weaker condition where if we define
∆(f ) = maxx,y∈Q {∇f (x)[y − x]}, where ∇f (x)[y − x] is the directional derivative of f
at a point x in the direction y − x [17], then kx∗ k ≤ ∆(f )/r, with r is the radius of a
ball inside Q, see [17, Theorem 6.1]. One can verify that this assumption significantly
required in the previous dual smoothing (regularized) works. We show how to obtain all
the results without such a restrictive assumption. As far as we known this is the first
time such strong condition is removed.
In what follows, we will apply the bound for the FGM algorithm in (2) on the dual
problem (9), which is not strongly convex in the ordinary sense (on the whole space), see
Algorithm 2. However, by choosing y0 = x0 = 0 in Algorithm 1 as the initial condition,
the algorithm applied to the dual problem will produce iterates that lie in the linear
space of gradients ∇ϕ(y), which are of the form Ax for x = x∗ (AT y). In this case,
the dual function ϕ(y) will be strongly convex when y is restricted to the linear space
spanned by the range of the matrix A.
Note we are now required to compute explicitly x∗ (AT y), which is the solution of the
inner maximization problem (11). Section 5 will present the proposed algorithms and
convergence rates, for the different convexity and smoothness assumptions on the functions fi expressed in Assumption 3.1, when an explicit solution to the inner maximization
problem (11) is available. We will denote this scenario as dual-friendly functions. Particularly, to find x∗ (AT y) one can use optimal (randomized) numerical methods [12, 63, 65].
Later in Section 6, we extend the results of Section 5 to the case where only approximate
solutions to (11) can be computed.
Definition 2 A function f (x) is dual-friendly if, for any y, a solution x∗ (AT y) in (11)
can be computed “efficiently” (e.g., by a closed form or by polynomial time algorithms).
Sometimes this function are also called admissible [31, 73], or Fenchel tractable [23].
Several authors have previously studied the primal-dual approach we will follow in this
paper. In [89], the authors propose a universal framework for the study of constrained
optimization problems. Leveraging on the Fenchel operator, the authors provide an algorithm that can optimally adapt to an unknown Hölder continuity parameters. In [23],
the authors provide an algorithm-free generic framework that provides convergence rate
certificates for primal-dual algorithms. In [83, 84], the authors study a model-based gap
and split-gap reduction techniques that simultaneously updates the primal and dual
smoothness parameters for the convergence to the duality gap at optimal rates. In contrast, we focus our analysis on the recent results in accelerated primal-dual approaches
that allows us to obtain optimal rates for larger classes of functions and quantify the
constraint violation and distance to optimality [14, 29].
Examples of optimization problems for which Definition 2 holds can be found in the
9

literature, e.g., the entropy-regularized optimal transport problem [15], the entropy linear
programming problem [29] or the ridge regression. Note that by definition, finding a
solution for the problem (11) corresponds to finding a maximizing point of the Legendre
transformation f ∗ of the function f , where f ∗ is defined as
f ∗ (y) = max {hx, yi − f (x)} .
x

Moreover, if the conjugate dual function is available, then the maximizing argument is
x∗ (AT y) = ∇f ∗ (y), see Proposition 8.1.1 in [8]. For example, for the ridge regression
problem (6), the maximizing argument in (11) can be explicitly computed as x∗ (AT y) =
(H T H)−1 (AT y + H T b).
Another example where one can find an explicit solution to the auxiliary dual problem
is the Entropy Linear Program (ELP) [2], i.e.,
min

x∈Sn (1),Ax=0

n
X


xj log

j=1

xj
qj


,

(12)

P
where Sn (1) = {x ∈ Rn : xj ≥ 0; j = 1, 2, . . . , n; nj=1 xj = 1} is a unit simplex in
Rn and q ∈ Sn (1). The maximizing argument (11) for problem (12) can be explicitly
computed as
qi exp([AT y]i )
[x∗ (AT y)]i = Pn
.
T
j=1 qj exp([A y]j )

(13)

Additional examples related to optimal transport problems of computation of Wasserstein barycenter can be found in [15, 24, 87].
Remark 2 Note that the particular example of Entropy Linear Program in (12), the
function is not defined on Rn as in (1). In general, we can define the function fi : E → R,
and the optimization space to be constrained to x ∈ Q, where E is some finite-dimensional
vector space and Q is a simple closed convex set. In that case, the matrix A is defined as
a linear operator from E to another finite-dimensional real vector space H, with 0 ∈ H.
Moreover, strong convexity should be defined on Q with respect to some chosen norm
k · kE , which implies that any subgradient of fi is an element of the dual space E ∗ , with
dual norm k · kE ∗ .
kgkE ∗ = max hg, xi.
kxkE ≤1

Similarly, for the linear operator A : E1 → E2 , λmax and should be appropriately defined
as
λmax (W ) = kAkE1 →E2 =

max



x∈E1 ,u∈E2

hu, Axi : kxkE1 = 1, kukE2∗ = 1 ,

and similarly for λ+
min . However, for simplicity of exposition, we will provide our results
in the Euclidean space.
Remark 3 Without loss of generality, we assume that the optimal points x∗ 6= 0 and
y ∗ 6= 0. This facilitates notation, and simplifies analysis as we initialize our algorithms
10

Algorithm 3 Distributed FGM for strongly convex and smooth problems
λ+ (W )

α2 −q

min
0
All agents set z0i = z̃0i = 0 ∈ Rn , q = Lµ λmax
(W ) , α0 solves 1−α0 = 1 and N .
2: For each agent i
3: for k = 0, 1, 2, · · · , N −1 do

4:
x∗i (z̃ki ) = arg max z̃ki , xi −fi (xi )

1:

xi

5:
6:
7:

Share x∗i (z̃ki ) with neighbors, i.e. {j | (i, j) ∈ E}.
P
∗ j
i
zk+1
= z̃ki − λmaxµ(W ) m
j=1 Wij xj (z̃k )
k)
2
Compute αk+1 ∈ (0, 1) from αk+1
= (1−αk+1 )αk2 +qαk+1 and set βk = ααk2 (1−α
+αk+1
k

i
i
i
z̃k+1
= zk+1
+βk (zk+1
−zki )
9: end for

8:

at the 0 point, and our bounds will depend on the distance between the initial point and
the optimal point.

5.

Dual Friendly Functions: Algorithms and Iteration Complexity Analysis

In this section, we provide our main results about the communication complexity considering each function class in Assumption 3.1. For each case, we present an algorithm and
the minimum number of iterations required fo the algorithm to reach an approximate
solution of (4).
5.1

Sums of Strongly Convex and Smooth Functions

In this subsection, we analyze the distributed optimization problem in (3) when Assumption 3.1(a) holds, i.e., each function fi is strongly convex√and smooth.
Initially, consider step (a) in Algorithm 2, and replace A = W , thus
yk+1 = ỹk −

√ T
1 √
1
∇ϕ(ỹk ) =ỹk −
W x∗ ( W ỹk ).
Lϕ
Lϕ

(14)

Unfortunately, (14) cannot be executed
√ over a network in a distributed manner because
the sparsity pattern of the matrix W need not be compliant with the graph G in
the
√ same way the√matrix W is. Therefore, we make the following change of variables:
W yk = zk and W ỹk = z̃k , resulting in an algorithm that can be executed in a distributed manner. Interaction between agents is dictated by the term W x∗ (z̃k ) which
depends only on local information. As a result, each agent i in the network has its local
variables zki and z̃ki , and to compute their value at the next iteration, it only requires the
information sent by the neighbors defined by the communication graph G. Additionally,
the dual subproblem can be computed in a distributed manner at node i as
x∗i (z̃ki ) = arg max



xi

z̃ki , xi −fi (xi ) .

(15)

Next, we formally state the FGM algorithm applied to the dual of problem (3) with
the change of variable that allows a distributed execution.
Remark 4 Note that the networked communication between agents is executed in Line 5
11

of Algorithm 3. In order for an agent i to execute line 6, it needs to have access to
x∗j (z̃kj ) for the non-zero entries of Wij . Sharing with neighbors refers to the action of
an agent i sending a value it holds in its local memory, to the set of neighbors it can
communicate with according to the network structure, i.e., {j | (i, j) ∈ E}. Also, note
that the sequences {αk } and {βk } can be computed independently by each agent and
coordination is not needed.
Algorithm 3 requires the number of iterations N , which effectively corresponds to the
number of communication rounds. We define a communication round as an iteration of
Algorithm 3 where every node shares its local estimates with its neighbors and updates
its local variables. Thus, we are interested in finding a lower bound on N such that we
can guarantee certain optimality of the local solutions in the sense of Definition 1. Next,
we state our main result regarding the number of iterations required by Algorithm 3 to
reach an approximate solution of problem (3).
Theorem 5.1 Let F (x) be dual friendly and Assumption 3.1(a) hold. For any ε > 0,
the output x∗ (zN ) of Algorithm 3 is an (ε, ε/R)-solution of (4) for
s
N ≥2

!
√
2 2λmax (W )R2
,
µ·ε

L
χ(W ) log
µ

where R = ky ∗ k2 , and χ(W ) = λmax (W )/λ+
min (W ).
Proof. Algorithm 3 √
follows from Algorithm
1 applied to the dual problem (9) with the
√
change of variables W yk = zk and W ỹk = z̃k . In other words, we apply Algorithm
2
√
to dual
√ problem, i.e., Algorithm 3 is just Algorithm 2 in new variables zk = W yk and
z̃k = W ỹk . Therefore, we are going to use the convergence results of the FGM for the
dual problem in terms of the dual variables yk and ỹk and provide an estimate of the
convergence rate of in terms of the primal variables.
Initially, it follows from Theorem 2.1 that the sequence of estimates generated by the
iterations in (14) has the following property:
∗

2



ϕ(yk )−ϕ ≤ Lϕ R exp −k

r

µϕ
Lϕ


.

(16)

Moreover, it holds that
k∇ϕ(yk )k22 ≤ 2Lϕ (ϕ(yk )−ϕ∗ ) .
Hence from (10), (11), (16) with A =

√
W (W = W T )

 r 
√
√
µϕ
∗
2
2 2
k W x ( W yk )k2 ≤ 2Lϕ R exp −k
.
Lϕ
q
√

√
2Lϕ R2
We can conclude that k W x∗ (zk )k2 ≤ ε/R if k ≥ 2 Lµϕϕ log
. Now, by using
ε
the Cauchy--Schwarz inequality, it follows that
√
√
√
√
|hyk , W x∗ ( W yk )i|2 ≤ kyk k22 k W x∗ ( W yk )k22 .

12

The authors [17] showed that the iterates generated by Nesterov’s fast gradient method
remain in a ball around the optimal point, that can be bounded by the distance between
the initial point and the optimal point. We exploit this property to bound kyk k2 as [17]
kyk −y ∗ k2 ≤ ky0 −y ∗ k2 .
Thus, since we assume y0 = 0, it holds that kyk k2 ≤ 2ky ∗ k2 ≤ 2R, then
√

 r 
√
√
√
µϕ
2
2
∗
2
4 2
|hyk , W x ( W yk )i| ≤ 4R k W x ( W yk )k2 ≤8R Lϕ exp −k
.
Lϕ
∗

√
√
∗
∗
∗
∗
Therefore
from
Lemma
4.1
f
(x
(z
))−f
=
|hz
,
x
(z
)i|
≤
|hy
,
W
x
(
W yk )i| ≤ ε
k
k
k
k
q
 √ 2 3
2 2Lϕ R
Lϕ
if k ≥ 2 µϕ log
. Finally, based on Lemma 4.1, Algorithm 3 will produce an
ε
(ε, ε/R)-solution if
s
N ≥2

Lϕ
log max
µϕ

( √
)!
√
2 2Lϕ R2
2Lϕ R2
,
.
ε
ε

Following the definitions of Lϕ , µϕ , and χ(W ), we obtain the desired result.



Theorem 5.1 states that in order to obtain an (ε, ε/R)-solution of (4), when
each
convex and smooth, the communication complexity is
p function fi is strongly

O χ(W )L/µ log(1/ε) .
5.2

Sums of Strongly Convex and M -Lipschitz Functions on a Bounded
set

In this subsection, we propose a distributed algorithm for sum of strongly convex functions that are Lipschitz on a bounded set to be specified later. Moreover, we show the
convergence rates of the proposed algorithm.
We will build our results by using Nesterov smoothing [17, 61]. This approach has
been previously used, for example in [48]. We use such approach to guarantee optimal
dependency on the network structure. Furthermore, as pointed before, we remove the
assumption of bounded variation on the admissible set. Particularly, we will use the
following result
Proposition 5.2 (Lemma 3 in [29]) Consider a convex function ϕ, the strongly convex
term (µ̂/2)kyk22 , and define ϕ̂(y) = ϕ(y)+(µ̂/2)kyk22 . Then, ϕ̂(y) is µ̂-strongly convex.
Moreover, if µ̂ ≤ ε/R2 , where R = ky ∗ k2 and assume that there exists yN such that
ϕ̂(yN )−ϕ̂∗ ≤ ε/2, it holds that ϕ(yN )−ϕ∗ ≤ ε, where ϕ∗ and ϕ̂∗ are the optimal values
of the function ϕ and ϕ̂ respectively. Moreover, if ϕ is defined in (9), then ∇ϕ̂(y) =
Ax∗ (AT y)+µ̂y.
The main difference between Algorithm 3 and Algorithm 4 is that we have an additional term inside the parenthesis in Line 6. Moreover, the corresponding strong convexity
constant of the dual function is ε/R2 , where R can be an upper bound on the norm of
optimal solution of the dual problem y ∗ , i.e., ky ∗ k2 ≤ R. Next, we state our main result
regarding the number of iterations required by Algorithm 4 to reach an approximate
solution of problem (3).
13

Algorithm 4 Distributed FGM for strongly convex and M -Lipschitz problems
α2 −q

2

)
0
All agents set z0i = z̃0i = 0 ∈ Rn , q = λmax (Wε/(4R
)/µ+ε/(4R2 ) , α0 solves 1−α0 = 1 and N .
2: For each agent i
3: for k = 0, 1, 2, · · · , N −1 do

4:
x∗i (z̃ki ) = arg max z̃ki , xi −fi (xi )

1:

xi

5:
6:
7:
8:
9:

Share x∗i (z̃ki ) with neighbors, i.e. {j | (i, j) ∈ E}. !
m
P
1
i
zk+1
= z̃ki − λmax (W )/µ+ε/(4R
Wij x∗j (z̃kj )+ 4Rε 2 z̃ki
2)

j=1
k)
2
Compute αk+1 ∈ (0, 1) from αk+1 = (1−αk+1 )αk2 +qαk+1 and set βk = ααk2 (1−α
k +αk+1
i
i
i
z̃k+1
= zk+1
+βk (zk+1
−zki )

end for

Theorem 5.3 Let F (x) be dual friendly and Assumption 3.1(b) hold. Moreover, assume
k∇F (x∗ )k2 ≤ M . For any ε > 0, the output x∗ (zN ) of Algorithm 4 is an (ε, ε/R)-solution
of (4) for
s



M2
M2
N ≥ 2 4χ(W )
+1 log 4χ(W )
+1 ,
µ·ε
µ·ε
where χ(W ) = λmax (W )/λ+
min (W ).
Proof. Initially, consider the regularized dual function ϕ̂ with µ̂ = ε/(4R2 ), which is µϕ̂ strongly convex with µϕ̂ = ε/(4R2 ), and Lϕ̂ -smooth with Lϕ̂ = λmax (W )/µ + ε/(4R2 ).
Thus, similarly as in (16)
∗

2



ϕ̂(yk )−ϕ̂ ≤ Lϕ̂ R̂ exp −k

r

µϕ̂
Lϕ̂



2



≤ Lϕ̂ R exp −k

r

µϕ̂
Lϕ̂


,

where R̂ = kŷ ∗ k2 , and ŷ ∗ is the smallest norm solution of the regularized dual problem.
Note that by definition R̂ = kŷ ∗ k2 ≤ ky ∗ k2 = R. Note also, that kyk − ŷ ∗ k2 ≤ ky0 − ŷ ∗ k2 =
kŷ ∗ k2 (see formula (2.5) from [17]). Therefore, kyk k2 ≤ 2kŷ ∗ k2 ≤ 2ky ∗ k2 .
Next, we provide a relation between the distance to optimality of the non-regularized
primal problem and the regularized dual problem. Note that for any y it holds that
ϕ̂(y)−ϕ̂∗ ≥

k∇ϕ̂(y)k22
k∇ϕ(y)+µ̂yk22
µ̂ hy, ∇ϕ(y)i
=
≥
.
2Lϕ̂
2Lϕ̂
Lϕ̂

 p

Therefore, hy, ∇ϕ(y)i ≤ (Lϕ̂ /µϕ̂ ) (ϕ̂(y)−ϕ̂∗ ) ≤ (4/ε)L2ϕ̂ R4 exp −k µϕ̂ /Lϕ̂ . Consep

quently, if k ≥ 2 Lϕ̂ /µϕ̂ log 2Lϕ̂ R2 /ε , then hy, ∇ϕ(y)i ≤ ε.
Moreover, it follows from the definition of the regularized dual function that
q
k∇ϕ(yk )k2 ≤ k∇ϕ̂(yk )k2 +µ̂kyk k2 ≤ 2Lϕ̂ (ϕ̂(y)−ϕ̂∗ )+µ̂kyk k2


r 
r 
√
√
k µϕ̂
k µϕ̂
ε
≤ 2Lϕ̂ R exp −
+2µ̂R̂ ≤ 2Lϕ̂ R exp −
+ .
2 Lϕ̂
2 Lϕ̂
2R

14

function then
√ definition of the gradient
√Using the
p of the dual √
 we have that
k W x∗ ( W ỹk )k2 ≤ ε/R, for k ≥ 2 Lϕ̂ /µϕ̂ log 2Lϕ̂ R2 /ε .
We conclude, from Lemma 4.1, that we will have an (ε, ε/R) solution of (4) if
)!
√
2Lϕ̂ R2
k≥2
ε



s
ε
2 λmax (W )
2R
+
µ
4R2
λmax (W )/µ+ε/(4R2 )

log 
≥2
2
ε/(4R )
ε
s
 2

4R2 λmax (W )
4R λmax (W )
=2
+1 log
+1 .
µ·ε
µ·ε
s

Lϕ̂
log max
µϕ̂

(

2Lϕ̂ R2
,
ε

Now, we focus our attention to find a bound on the value R such that we can provide
an explicit dependency on the minimum non zero eigenvalue of the graph Laplacian.
This will allow us to provide an explicit iteration complexity in terms of the condition
number of the graph Laplacian.
Theorem 3 in [41] provides a bound that relates R with the magnitude of the gradient
of F (x) at the optimal point x = x∗ . Particularly, it is shown that
+
R2 = ky ∗ k22 ≤ k∇F (x∗ )k22 /σmin
(A).

(17)

+
Thus, it follows from [41] that R2 ≤ M 2 /σmin
(A), where M = k∇F (x∗ )k2 . Therefore to
have an (ε, ε/R)-solution it is necessary that

s



M2
M2
+1 log 4χ(W )
+1 .
k ≥ 2 4χ(W )
µ·ε
µ·ε


Theorem 5.3 states the communication complexity of Algorithm 4. Particularly, the
total number of communication
prounds required by
 each agent to find an (ε, ε/R)-solution
2
of (4) can be bounded by Õ χ(W )M /(µ · ε) .
5.3

Sums of Smooth and Convex Functions

In this subsection, we propose a distributed optimization algorithm to find a solution
of the problem 3, when the function F (x) is smooth, i.e., F (x) has Lipschitz gradients.
We follow the idea of regularization to induce strong convexity, this time on the primal
problem. Moreover, we show the minimum number of iterations required to compute an
approximate solution to the problem.
Theorem 5.4 Let F (x) be dual friendly and Assumption 3.1(c) hold. For any ε > 0,
the output x∗ (zN ) of Algorithm 5 is an (ε, ε/R)-solution of (4) for
s
N ≥2


2LRx2
+1 χ(W ) log
ε

15

!
√
8 2λmax (W )R2 Rx2
.
ε2

Algorithm 5 Distributed FGM for smooth convex problems
2

λ+ (W )

α2 −q

ε/Rx
min
0
= 1 and N .
, α0 solves 1−α
All agents set z0i = z̃0i = 0 ∈ Rn , q = L+ε/R
2
0
x λmax (W )
2: For each agent i
3: for k = 0, 1, 2, · · · , N −1 do
4:
Set x̂∗i (z̃ki ) = arg maxxi { z̃ki , xi −fi (xi )− 2Rε 2 kxi −x∗i (0)k22 }
x
5:
Share x̂∗i (z̃ki ) with neighbors, i.e. {j | (i, j) ∈ E}.
Pm
2
ε/Rx
∗ j
i
6:
zk+1
= z̃ki − λmax
j=1 Wij x̂j (z̃k )
(W )
1:

7:

k)
2
Compute αk+1 ∈ (0, 1) from αk+1
= (1−αk+1 )αk2 +qαk+1 and set βk = ααk2 (1−α
+αk+1
k

i
i
i
z̃k+1
= zk+1
+βk (zk+1
−zki )
9: end for

8:

∗
∗
where χ(W ) = λmax (W )/λ+
min (W ) and Rx = kx −x (0)k2 .

Proof. Initially, consider the regularized problem
√min F̂ (x)
W x=0

where

F̂ (x) , F (x)+

ε
kx−x∗ (0)k22 ,
2Rx2

(18)

where F (x) is defined in (4). The function F̂ (x) is µ̂-strongly convex with µ̂ = ε/(2Rx2 )
and L̂-smooth with L̂ = L+µ̂. Given that the regularized primal function is strongly
convex and smooth, we can use the results from Theorem 5.1. Particularly, in order to
have an (ε/2, ε/(2R))-solution of problem (18), one can use Algorithm 3 with
!
√
L̂
4 2λmax (W )R2
N ≥2
χ(W ) log
µ̂
µ̂ · ε
s
!
√

2LRx2
8 2λmax (W )R2 Rx2
=2
+1 χ(W ) log
.
ε
ε2
s

Having an (ε/2, ε/(2R))-solution of problem (18), guarantees that x̂∗N is an approximate
(ε, ε/R)-solution of problem (4), and the desired result follows.

Theorem 5.4 states the communication complexity of Algorithm 5. Particularly, the
total number of communication rounds required
p by each agent
 to find an
(ε, ε/R)-solution of (4) can be bounded by Õ χ(W )LRx2 /ε .
5.4

Sums of Convex and M -Lipschitz Functions

In this subsection, we present the distributed algorithm for optimization of convex function when no strong convexity or smoothness is guaranteed. The main idea is to use
regularization both in the primal and the dual problem. Therefore, we can build our
algorithm and its analysis from the results in Theorem 5.3 and Theorem 5.4. Next, we
present the proposed algorithm and their convergence analysis.
Theorem 5.5 Let F (x) be dual friendly and Assumption 3.1(d) hold. For any ε > 0,

16

Algorithm 6 Distributed FGM for M -Lipschitz functions
2

α2 −q

ε/(4R )
0
All agents set z0i = z̃0i = 0 ∈ Rn , q = λmax (W )/(ε/R
= 1 and
2
2 , α0 solves 1−α
0
x )+ε/(4R )
N.
2: For each agent i
3: for k = 0, 1, 2, · · · , N −1 do
4:
Set x̂∗i (z̃ki ) = arg maxxi { z̃ki , xi −fi (xi )− 2Rε 2 kxi −x∗i (0)k22 }
x
5:
Share x̂∗i (z̃ki ) with neighbors, i.e. {j | (i, j) ∈ E}.
!
m
P
j
1
i
Wij x̂∗j (z̃k )+ 4Rε 2 z̃ki
6:
zk+1
= z̃ki − λmax (W )/(ε/R
2 )+ε/(4R2 )
1:

x

7:
8:
9:

j=1
k)
2
Compute αk+1 ∈ (0, 1) from αk+1 = (1−αk+1 )αk2 +qαk+1 and set βk = ααk2 (1−α
k +αk+1
i
i
i
z̃k+1
= zk+1
+βk (zk+1
−zki )

end for

the output x∗ (zN ) of Algorithm 6 is an (ε, ε/R)-solution of (4) for
r



M 2 Rx2
M 2 Rx2
N ≥ 2 16χ(W ) 2 +1 log 16χ(W ) 2 +1 ,
ε
ε
∗
∗
∗
where χ(W ) = λmax (W )/λ+
min (W ), R = ky k2 , and Rx = kx −x (0)k2 .

Proof. Consider again, as in Theorem 5.4, the regularized problem (18) where F (x) is
defined in (4). The function F̂ (x) is µ̂-strongly convex with µ̂ = ε/(2Rx2 ). However, we
have assumed now that F (x) is not smooth. Nevertheless, from Theorem 5.3, we have
that Algorithm 4 will generate an (ε/2, ε/(2R))-solution of (18), namely x∗N , for
s



M2
M2
+1 log 8χ(W )
+1
N ≥ 2 8χ(W )
µ̂ · ε
µ̂ · ε
r


M 2 Rx2
M 2 Rx2
= 2 16χ(W ) 2 +1 log 16χ(W ) 2 +1 .
ε
ε
Therefore, by Proposition 5.2, x∗ (zN ) is an (ε, ε/R)-solution for problem (4).



Theorem 5.5 states the communication complexity of Algorithm 6. Particularly, the
total number of communication
prounds requiredby each agent to find an (ε, ε/R)-solution
of (4) can be bounded by Õ χ(W )Rx2 M 2 /ε2 .
6.

Non-dual Friendly Functions: Algorithms and Iteration Complexity
Analysis

The results in Section 5 assume F (x) is dual-friendly. In this section, we explore the
case when no exact solution to the dual problem is available. We state the algorithms
and their convergence rates for the distributed optimization of sums of non-dual friendly
convex functions.
We will build on the results in [18, 19] on the analysis of first-order methods with
inexact oracle, and provide a set of distributed algorithms and their respective iterations

17

complexities. Initially for completeness, we recall the definition of an inexact oracle for
a smooth strongly convex function, and the corresponding iteration complexity of FGM
with an inexact oracle.
Definition 3 (Definition 1 in [19]) Let function f be convex on a convex set Q. We
say that it is equipped with a first-order (δ, L, µ)-oracle if for any y ∈ Q we can compute
a pair (fδ,L,µ (y), gδ,L,µ (y)) ∈ R × Rn such that
µ
L
kx − yk22 ≤ f (x) − (fδ,L,µ (y) + hgδ,L,µ (y), x − yi) ≤ kx − yk22 + δ,
2
2
for all x ∈ Q where δ ≥ 0 and L ≥ µ ≥ 0.
Theorem 6.1 (Theorem 7 in [19]) Nesterov’s fast gradient method applied to a function
f endowed with a (δ, L, µ)-oracle generates a sequence {yk }k>1 satisfying:


k
f (yk ) − f ∗ ≤ LR2 exp −
2

r 
µ
+
L

s !
L
1+
δ,
µ

where R = ky ∗ k2 .
Now, we recall an auxiliary result that shows that an approximate solution to the
auxiliary inner maximization problem (11) defines a (δ, L, µ)-oracle. Furthermore, we
describe the distributed algorithms and their iterations complexities when we remove
the assumption of the function F being dual friendly.
Theorem 6.2 (Theorem 2 in [19]) Assume that f is µ-strongly convex and L-smooth.
For a given y ∈ Rn , assume that instead of computing x∗ (AT y), i.e., the unique solution
of the subproblem (11), for an ξ > 0, we compute w(AT y) such that:
Ψ(y, x∗ (AT y)) − Ψ(y, w(AT y)) ≤ ξ.

(19)

Then,
(Ψ(y, w(AT y)) − ξ = f (w(AT y)) + hAw(AT y), yi − ξ, Aw(AT y))
is a (δ, Lϕ,δ , µϕ,δ )-oracle for ϕ with δ = 3ξ, Lϕ,δ = 2Lϕ and µϕ,δ = (1/2)µϕ .
6.1

Sum of Non-dual Friendly Strongly Convex and Smooth Functions

In this subsection, we introduce a distributed algorithm for the minimization of sums
of strongly convex and smooth functions removing the assumption of dual friendliness.
Moreover, we provide its iteration complexity.
Remark 5 Algorithm 7 has two loops: one inner loop that runs for T iterations, and
seeks to compute an approximate solution to the dual auxiliary problem; and one outer
loop that runs for N iterations that applies FGM on the dual problem. Note that Lines
5 − 7 compute an approximate gradient of the dual function, and Lines 11 − 13 execute
FGM with the inexact gradient computed in the inner loop.

18

Algorithm 7 Distributed FGM for non-dual friendly strongly convex and smooth problems
1:

λ+ (W )

α2 −q

min
0
All agents set z0i = z̃0i = 0 ∈ Rn , q̃ = Lµ , q = q̃ λmax
(W ) , α0 solves 1−α0 = 1, α̃0 solves

α̃20 −q̃
1−α̃0 = 1, T and N .

For each agent i
for k = 0, 1, 2, · · · , N − 1 do
4:
w0i = w̃0i = 0 ∈ Rn
5:
for t = 0, 1, 2, · · · , T − 1 do
i
6:
wt+1
= w̃ti + L1 (z̃ki − ∇fi (w̃ti ))
α̃t )
2
7:
Compute α̃t+1 ∈ (0, 1) from α̃t+1
= (1 − α̃t+1 )α̃t2 + q̃ α̃t+1 and set β̃t = α̃α̃t2(1−
+α̃t+1
2:
3:

t

8:
9:
10:
11:
12:

i
i
i
w̃t+1
= wt+1
+ β̃t (wt+1
− wti )
end for
Share wTi with neighbors, i.e. {j | (i, j) ∈ E}.
P
j
i
zk+1
= z̃ki − λmaxµ(W ) m
j=1 Wij wT
k)
2
Compute αk+1 ∈ (0, 1) from αk+1
= (1 − αk+1 )αk2 + qαk+1 and set βk = ααk2 (1−α
+αk+1
k

i
i
i
z̃k+1
= zk+1
+ βk (zk+1
− zki )
14: end for

13:

Theorem 6.3 Let F (x) be a function such that Assumption 3.1(a) hold. For any ε > 0,
the output x∗ (zN ) of Algorithm 7 is an (ε, ε/R)-solution of (4) for
s

L
χ(W ) log
µ

s

L
log
µ

N ≥8

!
√
2 2λmax (W )R2
µ·ε

and
T ≥

2
6LR2 Rw
ε2

s

!
L
χ(W ) ,
µ

where R = ky ∗ k2 , Rx = kx∗ −x∗ (0)k2 , Rw = Rx +kx∗ k2 and χ(W ) = λmax (W )/λ+
min (W ).
Proof. Lines 5 − 7 in Algorithm 7 are the FGM on the inner problem (11). Therefore,
it followspfrom classical analysis
of FGM [63] that Ψ(y, x∗ (AT y)) − Ψ(y, wT (AT y)) ≤ ξ

2
for T ≥ L/µ log LRw /ξ . Note that at the beginning of iteration k, Rw = kw0 − w∗ k,
w0 = 0, and w∗ = x∗ (z̃k ). Therefore, Rw = kx∗ (z̃k )k2 ≤ kx∗ (z̃k ) − x∗ k2 + kx∗ k2 ≤
kx∗ (0) − x∗ k2 + kx∗ k2 = Rx + kx∗ k2 .
Moreover, Theorem 6.2 shows us that we have endowed the function ϕ with an
(3ξ, 2Lϕ , 12 µϕ )-oracle. Thus, from Theorem 6.1 it holds that


k
ϕ(yk ) − ϕ ≤ Lϕ R exp −
4
∗

2

r

19

µϕ
Lϕ

s


+

1+2

Lϕ
µϕ

!
3ξ.

Now, recall that
√
√
k W x∗ ( W ỹk )k22 ≤ 2Lϕ (ϕ(yk ) − ϕ∗ ) ,

r 
k µϕ
2 2
+
≤ 2Lϕ R exp −
4 Lϕ

s
1+2

Lϕ
µϕ

!

!
3ξ

.

!

!

√
√
Therefore, in order to have k W x∗ ( W ỹk )k2 ≤ ε/R it is necessary that
s
N ≥8

Lϕ
log
µϕ

√

6Lϕ R2
ε

!
and ξ ≤

ε2
6R2

r

µϕ
.
Lϕ

Moreover,
√
√
√
√
|hyk , W x∗ ( W yk )i|2 ≤ 4R2 k W x∗ ( W yk )k22 ,

r 
k µϕ
4 2
+
≤ 8R Lϕ exp −
4 Lϕ

s
1+2

Lϕ
µϕ

3ξ

.

Therefore, in order to have f (zN ) − f ∗ ≤ ε it is necessary that
s
N ≥8

Lϕ
log
µϕ

!
√
8Lϕ R2
ε

ε2
and ξ ≤
6

r

µϕ
.
Lϕ

Finally, we can conclude that to obtain an (ε, ε/R)-solution of (4) we require
s
N ≥8

Lϕ
log
µϕ

!
√
2 2Lϕ R2
ε

s
and T ≥

L
log
µ

2
6LR2 Rw
ε2

The desired result follows from the definitions of Lϕ and µϕ .

s

Lϕ
µϕ

!
.


Theorem 6.3 shows that if no-dual solution is explicitly available for (11), then
one can use FGM to find an approximate solution. This approximate solution is itself an inexact oracle. Particularly, the number of total number of communication
rounds
p required by each
 agent to find an (ε, ε/R)-solution of (4) can be bounded by
O χ(W )L/µ log(1/ε) . Moreover, at each communication round the number of local
p

oracle calls for each agent can be bounded by O L/µ log(1/ε) . Unfortunately, the
total number
calls
p of oracle
 for each agent at all communication rounds is bounded by
2
O (L/µ) χ(W ) log (1/ε) , which is not optimal compared with their centralized FGM
p

where the number of oracle calls of the function F is bounded by O L/µ log(1/ε) .
However, in the centralized case, one oracle call corresponds to the gradient computation
of F (x) which is composed by m functions fi for 1 ≤ i ≤ m, whereas in the distributed
case, local oracle calls are computed in parallel by all agents at the p
same time. Therefore,
one can argue that if the number of agent m is of the order of L/µ, then provided
estimates are optimal given that the oracle calls of all agents in the network are done in
parallel.

20

Algorithm 8 Distributed FGM for non-dual friendly smooth problems
λ+ (W )

2

1:

α2 −q

ε/Rx
min
0
= 1, α̃0
, α0 solves 1−α
All agents set z0i = z̃0i = 0 ∈ Rn , q̃ = L+ε/R
2 , q = q̃ λ
max (W )
0
x

α̃2 −q̃

solves 1−0 α̃0 = 1, T and N .
2: For each agent i
3: for k = 0, 1, 2, · · · , N − 1 do
4:
w0i = w̃0i = 0 ∈ Rn
5:
for t = 0, 1, 2, · · · , T −
 1 do

1
ε
i
i
i
i
i
∗
6:
wt+1 = w̃t + L+ε/R2 z̃k − ∇fi (w̃t ) − R2 w̃t − xi (0)
x

7:

x

α̃t )
2
Compute α̃t+1 ∈ (0, 1) from α̃t+1
= (1 − α̃t+1 )α̃t2 + q̃ α̃t+1 and set β̃t = α̃α̃t2(1−
+α̃t+1
t

8:
9:
10:
11:
12:

i
i
i
w̃t+1
= wt+1
+ β̃t (wt+1
− wti )
end for
Share wTi with neighbors, i.e. {j | (i, j) ∈ E}.
Pm
2
ε/Rx
j
i
zk+1
= z̃ki − λmax
j=1 Wij wT
(W )
k)
2
= (1 − αk+1 )αk2 + qαk+1 and set βk = ααk2 (1−α
Compute αk+1 ∈ (0, 1) from αk+1
+αk+1
k

i
i
i
z̃k+1
= zk+1
+ βk (zk+1
− zki )
14: end for

13:

6.2

Sums of Non-dual Friendly Smooth Convex Functions

In this subsection, we propose a distributed algorithm for the distributed minimization of
sums of non-dual friendly smooth convex functions and provide its iteration complexity.
Theorem 6.4 Let F (x) be a function such that Assumption 3.1(c) hold. For any ε > 0,
the output x∗ (zN ) of Algorithm 8 is an (ε, ε/R)-solution of (4) for
s
N ≥8


2LRx2
+ 1 χ(W ) log
ε

!
√
8 2λmax (W )Rx2 R2
,
ε2

and
r
T ≥

!
√

 s

2
2 6R2 Rw
L
1
2LRx2
+
+ 1 χ(W ) ,
ε
ε
2Rx2
ε

2LRx2
+ 1 log
ε

where R = ky ∗ k2 , Rx
χ(W ) = λmax (W )/λ+
min (W ).

=

kx∗ − x∗ (0)k2 , Rw

=

Rx + kx∗ k2 , and

Proof. Similarly as in the dual friendly case, we consider the regularized primal problem,

√min F̂ (x)
W x=0

where

F̂ (x) , F (x) +

ε
kx − x∗ (0)k22 ,
2Rx2

(20)

Therefore, the auxiliary inner maximization problem seeks to maximize an µ̂-strongly
convex function, with µ̂ = ε/(2Rx2 ), that is also L̂-smooth, with L̂ = L + µ̂. It follows

21

from Theorem 6.3 that Algorithm 8 will generate obtain an (ε/2, ε/(2R))-solution for
!
√
4 2λmax (W )R2
N ≥8
µ̂ · ε
s
!
√
L + 2Rε 2
4 2λmax (W )R2
x
=8
χ(W ) log
ε
ε
2
2 · ε
2Rx
2Rx
s
!
√

8 2λmax (W )Rx2 R2
2LRx2
+ 1 χ(W ) log
,
=8
ε
ε2
s

L̂
χ(W ) log
µ̂

and
s


s
√
2 2
L̂
2 6L̂R Rw L̂
log 
χ(W )
µ̂
ε2
µ̂

r

2LRx2
+ 1 log
ε

T ≥

=



!
√

 s

2
L
2 6R2 Rw
1
2LRx2
+
+ 1 χ(W ) ,
ε
ε
2Rx2
ε

and the desired result follows.





Theorem 6.4 shows that the total number of communication rounds required by each
agent to find an (ε, ε/R)-solution
of (4),
p
 when te functions are not strongly convex,
2
can be bounded by Õ χ(W )(LRx )/ε . Moreover, at each communication round the
p

number of local oracle calls for each agent can be bounded by Õ (LRx2 )/ε . Similarly
as in Theorem
p 6.3,the total number or oracle calls of each agent can be bounded by
O (LRx2 /ε) χ(W ) .
6.3

Distributed Optimization of Sums of Non-smooth Functions

In this subsection, we present an approach for developing distributed algorithms for nonsmooth optimization, i.e., either Assumption 3.1(b) or Assumption 3.1(d) hold. These
scenarios has been recently studied in [41], where
similar convergence rates have been
√
derived. However, our particular selection of W instead of W allows for a better dependency in terms of the graph condition number.
Initially, following [27], consider a convex function f that is also M -Lipschitz, i.e.,
Assumption 3.1(d), and consider the auxiliary problem
min Fε (x) = f (x) +
x

R2
kAxk22 .
ε

Note that if one is able to find an x̄ such that
Fε (x̄) − min Fε (x) ≤ ε,
x

then it also holds that
f (x̄) − min f (x) ≤ ε,
Ax=0

22

and kAx̄k2 ≤

2ε
R

Note that the term (R2 /ε)kAxk22 is Lε -smooth, with Lε = λmax (AT A)R2 /ε, and f is M Lipschitz [30]. For this class of composite problems, one can use the accelerated gradient
sliding method proposed in [40]. As a result, it follows from Corollary 2 in [40], that in
order to find an (ε, ε/R)-solution for problem (4), the total number of communication
rounds and oracle calls can be bounded by
!
!
r
r
Lε Rx2
M 2 Rx2
=O
χ(W ) ,
O
ε
ε2
and
M 2 Rx2
+
ε2

O

r

Lε Rx2
ε

!
=O

M 2 Rx2
+
ε2

r

!
M 2 Rx2
χ(W ) ,
ε2

respectively.
Similarly, if we additionally assume that the function f is µ-strongly convex, i.e.,
Assumption 3.1(b). Then, from Theorem 3 in [40] it follows that the total number of
oracle calls for Fε and f required by the multi-phase gradient sliding algorithm to find
an (ε)-solution for problem (4) can be bounded by
s
O

Lε
log
µ



Rx
ε

s

!
=O

M2
χ(W ) log
εµ



µRx2
ε

!

and
O

M2
+
εµ

s

Lε
log
µ



Rx
ε

!

M2
+
εµ

=O

s

M2
χ(W ) log
εµ



µRx2
ε

!
,

respectively.
If follows from [40] that the above estimates are optimal up to logarithmic factors.
Moreover, one can extend these results to stochastic optimization problems and the
estimations will not change [41].

7.

Improving on the Dependence of the Strong Convexity Parameter:
Computation-Communication Trade-Off

Considering the general problem in (1), the condition number L/µ can be large if one
of the µi is small or even zero. It follows from Section 6 that the iteration complexity of
the proposed algorithms can be very large, even if only one of the functions has a small
strong convexity. In this section, we propose a reformulation of the original Problem (1)
such that the dependency on the individual strong convexity constants can be improved.
However, we will see that the improvement on the dependency of the condition number
of the function F comes at a price in terms of the communication rounds.
Consider the following problem:
m
X
α
α
hx, W xi =
fi (xi ) + hx, W xi .
√min Fα (x) = F (x) +
2
2
W x=0
i=1

23

(21)

Algorithm 9 Augmented Distributed FGM for strongly convex and smooth problems
Pm
i
i
n
1: All agents set z0 = z̃0 = 0 ∈ R , µα =
i=1 µi , α = µα /λmin (W ), Lα = L +
λ+ (W )

α2 −q

α̃2 −q̃

min
0
0
αλmax (W ), q̃ = Lµαα , q = q̃ λmax
(W ) , α0 solves 1−α0 = 1, α̃0 solves 1−α̃0 = 1, T and N .
2: For each agent i
3: for k = 0, 1, 2, · · · , N − 1 do
4:
w0i = w̃0i = 0 ∈ Rn
5:
for t = 0, 1, 2, · · · , T − 1 do
6:
Share w̃ti with neighbors, i.e. {j | (i, j) ∈ E}.
P
j
i
7:
wt+1
= w̃ti + L1α (z̃ki − ∇fi (w̃ti ) − α m
j=1 Wij w̃t )

8:

α̃t )
2
Compute α̃t+1 ∈ (0, 1) from α̃t+1
= (1 − α̃t+1 )α̃t2 + q̃ α̃t+1 and set β̃t = α̃α̃t2(1−
+α̃t+1
t

9:
10:
11:
12:
13:

i
i
i
w̃t+1
= wt+1
+ β̃t (wt+1
− wti )
end for
Share wTi with neighbors, i.e. {j | (i, j) ∈ E}.
Pm
j
µα
i
zk+1
= z̃ki − λmax
j=1 Wij wT
(W )
k)
2
Compute αk+1 ∈ (0, 1) from αk+1
= (1 − αk+1 )αk2 + qαk+1 and set βk = ααk2 (1−α
+αk+1
k

i
i
i
z̃k+1
= zk+1
+ βk (zk+1
− zki )
15: end for

14:

A solution to (21) is clearly a solution to (1).
P
The function Fα is µα -strongly convex with µα = min { m
i=1 µi , αλmin (W )} and
has
L
-Lipschitz
continuous
gradients
with
L
=
L
+
αλ
α
max (W ). Choose α =
Pm α
µ
/λ
(W
)
and
the
function
F
will
have
a
condition
number
min
α
i=1 i
maxi Li
maxi Li
Lα
λmax (W )
= Pm
= Pm
+
+ χ(W ).
µα
λ
(W
)
µ
min
i=1 i
i=1 µi
Unfortunately, the structure of the function Fα does not allow a decentralized computation of a solution for the inner problem (11), i.e., each agent can compute the solution
x∗i using local information only. Nevertheless, the additional term in (21) has a gradient
with a network structure and can be computed in a distributed manner using information shared from the neighbors of each agent. Particularly, consider the auxiliary dual
problem
ϕα (y) = max Ψα (x, y)
x

where

√ T
α
Ψα (x, y) = hx, W yi − F (x) − hx, W xi .
2

(22)

√ T
√ T √
Then, we have that ∇x Ψα (x, y) = W y − ∇F (x) − αW x. W = W ...
In this case, we can use the FGM to obtain an approximate solution to the inner
problem using Nesterov fast gradient method. In Algorithm 9, we propose a modification
of Algorithm 7 to take into account the new structure for the solution of the inner
auxiliary problem.
Corollary
fi is Li -smooth for 1 ≤ i ≤ m
P 7.1 Let F (x) be defined in (3), and assume
∗
and µ̄ = m
µ
>
0.
For
any
ε
>
0,
the
output
x
(z
)
of
Algorithm 9 is an (ε, ε/R)N
i=1 i

24

solution of (4) for
s
N ≥8


L
+ χ(W ) χ(W ) log
µ̄

!
√
2 2λmax (W )R2
,
µ̄ · ε

and
s
T ≥

L
+ χ(W ) log
µ̄

2
6Lα R2 Rw
ε2

s

!

L
+ χ(W ) χ(W ) ,
µ̄

where R = ky ∗ k2 , Rx = kx∗ −x∗ (0)k2 , Rw = Rx +kx∗ k2 , and χ(W ) = λmax (W )/λ+
min (W ).
Corollary 7.1 implies that at each of the outer iterations, required to obtain an approximate solution to the inner maximization problem, the number of oracle calls for f
and communication rounds between agents can be bounded by
s
Õ

Lα
µα

s

!
= Õ

!
L
+ χ(W ) .
µ̄

Moreover, the number of outer communication rounds can be bounded by
s
Õ

Lα
χ(W )
µα

s

!
= Õ

!

L
+ χ(W ) χ(W ) .
µ̄

The total number of communications rounds p
and local
 oracle calls, taking into account
the inner and outer loops is Õ (L/µ̄ + χ(W )) χ(W ) .
This estimate shows that we can replace the smallest strong convexity constant for the
sum among all of them, but we have to pay an additive price proportional to the condition
number of the graph and additional communication rounds in the inner maximization
problem proportional to the number of oracle calls for f .
This result can be extended to the case when F (x) is just smooth by using the regularization technique with µi = ε/(Rx2 ). Particularly, consider the regularized function
F̂α (x) = F (x) +
=

m
X
i=1

ε
α
kx − x∗ (0)k22 + hx, W xi
2
Rx
2

fi (xi ) +

ε
α
kx − x∗ (0)k22 + hx, W xi .
2
Rx
2

(23)

n
o
The function F̂α is µ̂α -strongly convex with µ̂α = min m Rε2 , αλmin (W ) and has
x

L̂α -Lipschitz continuous gradients with L̂α = L + αλmax (W ) + mε/Rx2 . Choose α =
m(ε/Rx2 )/λmin (W ). Under this specific choice of α, the function Fα will have a condition
number
L̂α
L
λmax (W )
Rx2 L
=
+
+
1
=
+ χ(W ) + 1.
µ̂α
m Rε2
λmin (W )
mε
x

25

Algorithm 10 Augmented Distributed FGM for smooth problems
1:

All agents set z0i = z̃0i = 0 ∈ Rn , µ̂α = m Rε2 , α = µ̂α /λmin (W ), L̂α = L+αλmax (W )+
x

λ+ (W )

α̃2 −q̃

α2 −q

min
0
0
m Rε2 , q̃ = L̂µ̂α , q = q̃ λmax
(W ) , α0 solves 1−α0 = 1, α̃0 solves 1−α̃0 = 1, T and N .
x
α
2: For each agent i
3: for k = 0, 1, 2, · · · , N − 1 do
4:
w0i = w̃0i = 0 ∈ Rn
5:
for t = 0, 1, 2, · · · , T − 1 do
6:
Share w̃ti with neighbors, i.e. {j | (i, j) ∈ E}.
P
j
ε
i
∗
7:
wt+1
= w̃ti + L1α (z̃ki − ∇fi (w̃ti ) − α m
j=1 Wij w̃t − R2 (wi − xi (0)))
x

8:

α̃t )
2
Compute α̃t+1 ∈ (0, 1) from α̃t+1
= (1 − α̃t+1 )α̃t2 + q̃ α̃t+1 and set β̃t = α̃α̃t2(1−
+α̃t+1
t

9:
10:
11:
12:
13:

i
i
i
w̃t+1
= wt+1
+ β̃t (wt+1
− wti )
end for
Share wTi with neighbors, i.e. {j | (i, j) ∈ E}.
Pm
j
µ̂α
i
zk+1
= z̃ki − λmax
j=1 Wij wT
(W )
k)
2
Compute αk+1 ∈ (0, 1) from αk+1
= (1 − αk+1 )αk2 + qαk+1 and set βk = ααk2 (1−α
+αk+1
k

i
i
i
z̃k+1
= zk+1
+ βk (zk+1
− zki )
15: end for

14:

The next Corollary shows the complexity of the proposed distributed augmented algorithm for the solution of sums of smooth convex functions.
Corollary 7.2 Let F (x) be a function such that Assumption 3.1(c) hold. Then, for
any ε > 0, the output x∗ (zN ) of Algorithm 9 is an (ε, ε/R)-solution of (4) for
s
N ≥8


2Rx2 L
+ χ(W ) + 1 χ(W ) log C1 ,
mε

and
r
T ≥

2Rx2 L
+ χ(W ) + 1 log C2 ,
mε

where
√
8 2λmax (W )Rx2 R2
C1 =
m · ε2
s

2
24(L + αλmax (W ) + m Rε2 )R2 Rw
2Rx2 L
x
C2 =
+ χ(W ) + 1 χ(W ),
ε2
mε
R = ky ∗ k2 , Rx = kx∗ − x∗ (0)k2 , Rw = Rx + kx∗ k2 , and χ(W ) = λmax (W )/λ+
min (W ).
The number of inner communication rounds and local oracle calls required
by pAlgorithm 10 to obtain an (ε, ε/R)-solution of (4) can be bounded by
Õ LRx2 /(mε) + χ(W ) . On the other hand the number of outer communication
p

rounds can be bounded by Õ (LRx2 /(mε) + χ(W )) χ(W ) . Therefore, this approach
is useful for large but well-connected networks, where m  1 and χ(W ) = O(1) or
26

Table 2. A summary of algorithmic performance.
Property of F (x)

Iterations Required

µ-strongly convex and L-smooth
µ-strongly convex and M -Lipschitz
L-smooth
M -Lipschitz

O

q

L
χ(W )log
µ



λmax (W )R2
µε




q

χ(W )M 2
M2
χ(W
)log
O
µε
µε
q
√

2
LRx
λmax (W )RRx
O
χ(W
)log
ε
ε
r
√
!
2
M 2 Rx
χ(W )M Rx
O
χ(W
)log
2
ε
ε

χ(W ) = O(log(m)).
A similar method to improve the definition of the global strong convexity parameter
was proposedPin [77]. In [77], the authors propose to introduce the proxy function fi (x)−
2
(µi − (1/m) m
i=1 µi )kxk2 . With this new function, the condition number of F improves
to
maxi Li − µi
Pm
− 1.
1
i=1 µi
m
8.

Discussion

Table 2 presents a summary of the results presented in Section 5. In particular, it shows
the number of communication rounds required to obtain an (ε, ε/R)-solution for each
function class in Assumption 3.1.
The estimates in Table 2 are optimal up to logarithmic factors. In the smooth cases,
where L < ∞, these estimates follow from classical centralized complexity
p estimation of
the FGM algorithm. In the distributed setting, one has to perform O( χ(W ) log(1/ε))
additional consensus steps at each iteration. This corresponds to the number of iterations
needed to solve the consensus problem
min
x

1
hx, W xi ,
2

(24)

where W is a communication matrix as defined in Section 3. FGM provides a direct
estimate
√ on the number of iterations required to reach√consensus, given that (24) is
σmin ( W )-strongly convex in x0 + ker(W ) and has σmax ( W )-Lipschitz continuous gradients, and this estimate cannot be improved up to constant factors.
The specific value of χ(W ), and its dependency on the number of nodes m has been
extensively studied in the literature of distributed optimization [54]. In [49], Proposition 5 provides an extensive list of worst-case dependencies of the spectral gap for
large classes of graphs. Particularly, for fixed undirected graphs, in the worst case
we have χ(W ) = O(m2 ) [67]. This matches the best upper bound found in the literature of consensus and distributed
optimization [44, 68]. Thus, the consensus set
√
described by the constraint
W x = 0 should be preferred over the description as
W x = 0, even though both representations correctly
describe the consensus subspace
√
x1 = . . . = xm . Particularly, when we pick A = W , we have χ(AT A) = χ(W ) instead
of χ(W T W ) = χ(W 2 )  χ(W ).
The cases when F (x) is convex or strongly convex can be generalized to p-norms, with

27

p ≥ 1, see [3]. The definitions of the condition number χ needs to be defined accordingly.
Let’s introduce a norm kxk2p = kx1 k2p + ... + kxm k2p for p ≥ 1 and assume that F (x) is
µ-strongly convex and L-Lipschitz continuous gradient in this (new) norm k·kp (in Rmn ),
see [64] (Lemma 1), [25] (Lemma 1) and [61] (Theorem 1). Thus,
hh, W hi
χ(W ) = max
µ
khk=1



hh, W hi
.
L
khk=1,h⊥ker(W )
min

Note that we typically do not know R or Rx . Thus, we require a method to estimate the strong convexity parameter, which is challenging [62, 66]. Some recent work
have explored restarting techniques to reach optimal convergence rates when the strong
convexity parameters are unknown [34, 66]. Similarly, a generalization of the FGM algorithm can be proposed when the smoothness parameter is unknown [29]. However, the
effect of restarting in the distributed setup requires further study and is out of the scope
of this paper.
Additionally, parameters such as λ+
min (W ), λmax (W ) can be efficiently computed in
a distributed manner [82]. Moreover, the values of µ and L can be shared with simple
max-consensus which is guaranteed to converge in finite time [32].

9.

Experimental results

In this section, we will provide experimental results that show the performance of the
optimal distributed algorithm presented in Sections 5. We will consider two different
graph topologies; the cycle graph and Erdős-Rényi random graph, of various sizes. We
choose the cycle graph (χ(W ) = O(m2 )) and the Erdős-Rényi random graph (χ(W ) =
O(log(m))) to show the scalability properties of the algorithms. Generally, we do not
have access to the spectral properties of the graphs. For the cases where the network has
a simple structure, like a path graph, star graph, cycle graph, we use its explicit values
since we have access to them. For the case of random graphs, we use its lower bounds
which are generally found in the literature [55, Table 1].
Initially, consider the ridge regression (strongly convex and smooth) problem
min

1

z∈Rn 2ml

1
kb − Hzk22 + ckzk22 ,
2

(25)

to be solved distributedly over a network. Each entry of the data matrix H ∈ Rml×n is
generated as an independent identically distributed random variable Hij ∼ N (0, 1), the
vector of associated values b ∈ Rml is generated as a vector of random variables where
b = Hx∗ +  for some predefined x∗ ∈ Rn and  ∼ N (0, 0.1). The columns of the data
matrix H and the output vector b are evenly distributed among the agents with a total
of l data points per agent. The regularization constant is set to c = 0.1.
Figure 2 shows experimental results for the ridge regression problem for a cycle graph
and an Erdős-Rényi random graph. For each type of graph we show the distance to
optimality as well as the distance to consensus for a fixed graph with m = 100, n = 10
and l = 100. Additionally, the scalability of the algorithm is shown by plotting the
required number of steps to reach an accuracy of  = 1 · 10−10 versus the number
of nodes in the graph. We compare the performance of the proposed algorithm with
some of the state of the art methods for distributed optimization. Dist-Opt refers
to Algorithm 3. NonAcc-Dist refers to the non-accelerated version of Algorithm 3.
28

Algorithm 2

FGM

DIGing

EXTRA

min

10−3

F (xk ) − F ∗

k≥0;
F (xk )−F ∗ ≤ε

105

Acc-DNGD

NonAcc-Dist

k
log10 10 = 1kW xk k2

105

10−1

104
10−7

103

102

104

102

106

k

60

80

104

10−9

10−15
100

min

k≥0;
kW xk k2 ≤ε

20

40

60

80

10−13
100

100

Number of Agents

Iterations

103

102

104

106

102

20

40

100

Number of Agents

Iterations

(a) Cycle Graph

min

10−3

F (xk ) − F ∗

k≥0;
F (xk )−F ∗ ≤ε

105

k
log10 10 = 1kW xk k2

105

10−1

104
10−7

103

102

104

102

Iterations

k

60

80

104

10−9

10−15
100

min

k≥0;
kW xk k2 ≤ε

20

40

60

80

103

10−13
100

100

Number of Agents

102

104

Iterations

102

20

40

100

Number of Agents

(b) Erdős-Rényi random graphs.
Figure 2. Distance to optimality and consensus, and network scalability for a strongly convex and smooth problem.
(a) Results for a cycle graph. (b) Results for an Erdős-Rényi random graph.

FGM is the centralized FGM. Acc-DNGD refers to the algorithm proposed in [70]
√
with parameter η = 0.1 and α = µη. EXTRA refers to the algorithm proposed in [79]
with parameter α = 1. DIGing refers to the algorithm proposed in [51] with parameter
α = 0.1. Figure 2 shows linear convergence rate with faster performance than other
algorithms and linear scalability with respect to the size of the cycle graphs. When
available, we have used the suggested parameter selection for each of the algorithms
compared.
As a second example, consider the Kullback-Leibler (KL) barycenter computation
problem (strongly convex and M -Lipschitz)
min

z∈Sn (1)

m
X
i=1

DKL (zkqi ) ,

m X
n
X

zi log (zi /[qi ]j ) ,

i=1 j=1

P
where Sn (1) = {z ∈ Rn : zj ≥ 0; j = 1, 2, . . . , n; nj=1 zj = 1} is a unit simplex in Rn
and qi ∈ Sn (1) for all i. Each agent has a private probability distribution q i and seek to
compute the a probability distribution that minimizes the average KL distance to the
distributions {qi }i=1,...,m . Figure 3 shows the results for the KL barycenter problem for
a cycle graph with m = 100, n = 10 and various values of the regularization parameter
when Algorithm 4 is used. We show the distance to optimality as well as the distance to
consensus and the scalability of the algorithm.
In (25), if we assume c = 0 and Hi is a wide matrix where n  l (i.e., the dimension of
the data points is much larger than the number of data points per agent), then the resulting problem is smooth but no longer strongly convex. Figure 4 shows the performance
of Algorithm 5 over a cycle graph and an Erdős-Rényi random graph, where m = 50,
n = 20 and l = 10, for different values of the regularization parameter. As expected,
smaller values of the regularization parameter increase the precision of the algorithm
29

µ̂=10-6

µ̂=10-8

µ̂=10-12

NonAcc-Dist

Algorithm 3

µ̂=10-10

NonAcc-Dist

min

F (xk ) − F ∗
101

101

10−4

10−5

10−9

101

103

105

10−11

Iterations

kW xk k2

101

k≥0;
F (xk )−F ∗ ≤ε

106

103

106

105

105

104

104

103

105

k

50

100

150

200

103

Number of Agents

Iterations

min

k

100

150

k≥0;
kW xk k2 ≤ε

50

200

Number of Agents

Figure 3. Distance to optimality and consensus, and network scalability for a strongly convex and M -Lipschitz problem
over a cycle graph with m = 100, n = 10 and various values of the regularization parameter µ̂ for Algorithm 4. The
brown line shows the performance for the non-accelerated distributed gradient descent of the dual problem.

µ̂=10-5

µ̂=10-6

µ̂=10-7

F (xk ) − F ∗

µ̂=10-8

Acc-DNGD-NSC

kW xk k2

100

10−4
10−6

10−7

10−10

0

0.2 0.4 0.6 0.8

10−12

1

0

0.2 0.4 0.6 0.8

Iterations ·106

1

Iterations ·106

(a) Cycle graph

F (xk ) − F ∗
10−4

10−6

10−8

10−12

kW xk k2

100

0

0.2 0.4 0.6 0.8

10−12

1

0

0.2 0.4 0.6 0.8

Iterations ·106

1

Iterations ·106

(b) Erdős-Rényi random graph
Figure 4. Distance to optimality and consensus for a smooth problem over a Erdős-Rényi random graph with m = 100,
n = 50, l = 10 and various values of the regularization parameter ε for Algorithm 5.

but hinder its convergence rate. We compare the performance of Algorithm 5 with the
distributed accelerated method proposed in [70] for non-strongly convex functions (AccDNGD-NSC) for a fixed regularization value µ̂ = 1 · 10−6 . As presented in Table 1, the
algorithms have similar convergence rates, as shown by the intersection of the curves
around the accuracy point corresponding to the regularization parameter. Nevertheless,
as seen in Figure 2, Acc-DNDG-NSC has a worst scalability with respect to the number
of nodes, which is particularly evident for the cycle graph.
Next, we present a number of numerical experiments and compare the performance
of the proposed methods. Consider the logistic regression problem for training linear

30

classifiers. We seek to solve the following optimization problem:
minn

x∈R

ml
 1
1 X
log 1 + exp −yi · ATi x + ckxk22 ,
2ml i=1
2

(26)

where Ai ∈ Rd is a data point with yi ∈ {−1, 1} as its corresponding class assignment.
We assume there is a total of ml data points distributed evenly among m agents, where
each agent holds l data points. For our experiments, initially we generate a random
vector xtrue ∈ Rn where each entry is chosen uniformly at random on [−1, 1], we fixed
c = 0.1, the data points Ai are generated uniformly at random on [−1, 1]n , and each
label is computed as yi = sign(ATi xtrue ). Note that each of the agents in the network will
have a local function
l

fi (x) =


1
1 X
ckxk22 ,
log 1 + exp −[y i ]j · [Ai ]Tj x +
2ml j=1
2m

(27)

where Aj ∈ Rl×n and y j ∈ {−1, 1}l are the data points held by agent j and their
corresponding class assignments. Moreover, (27) is not dual friendly. Therefore, we will
use Algorithm 7 for our next set of experimental results.
Figure 5 shows the distance to optimality and the distance to consensus of the output
of Algorithm 7 for the problem of logistic regression. We use cycle graphs and ErdősRényi random graph for a problem with 10000 data points of dimension 10. For each
class of graphs, we explore three different scenarios for the distribution of the data among
agents. We present the results for networks of 10, 100, and 1000 agents; where each agent
holds 1000, 100 and 10 data points respectively. We compare the results of the Acc√
DNGD algorithm in [70] with parameter η = 0.1 and α = µη, the EXTRA algorithm
in [79] with parameter α = 1, and the DIGing algorithm in [51] with parameter α = 0.1.
Figure 5 shows a faster geometric convergence rate of Algorithm 7 with respect to AccDNGD, EXTRA and DIGing. Nonetheless, we point out that those algorithms could be
subject to improved convergence rates if the particular parameters of each algorithm are
carefully selected. In the presented results, we do not claim to have selected the optimal
step sizes for the algorithms we are comparing our proposed method. For the cycle graph
in Figure 5(a), as the size of the network increases and the number of points per agent
decreases the convergence rates slows down. The EXTRA algorithms seem to have a
near-optimal scaling on its convergence rate with respect to the size of the network. The
Acc-DNGD and DIGing algorithms rapidly decrease their convergence rate with the
size of the network. Due to the better condition number of the Erdős-Rényi random
graphs, the Figure 5(b) shows a better scaling with the size of the network for all the
analyzes algorithms. For this class of networks, the Acc-DNGD algorithms outperforms
EXTRA and DIGing.
In Figure 6, we use datasets from the library LibSVM [13] to compare the performance
of Algorithm 7 as in Figure 5. We seek to distributedly solve the logistic regression
problem over the following datasets: a9a, mushrooms, ijcnn1 and phishing. Table 3
shoes a brief description of the four datasets used. For each problem, we created an ErdősRényi random graph with 100 agents and evenly distributed the data points among all
agents. Algorithm 7 outperforms the other compared algorithms where the Acc-DNGD
having the second best performance following the same scaling patterns as in Figure 5
for Erdős-Rényi random graphs. The EXTRA and DIGing algorithms have a worst
scaling of their convergence rate as the size of the network increases.
31

F (xk ) − F ∗

Algorithm 6

DIGing

Acc-DNGD

EXTRA

m = 1000, n = 10, l = 10

m = 10, n = 10, l = 1000

m = 100, n = 10, l = 100

100

100

100

10−8

10−8

10−4

10−16

0

2,000

10−16

4,000

0

2

10−8

4

0

2

4
·104

kW xk k2

·104

100

100

100

10−8

10−8

10−5

10−16

0

2,000

10−16

4,000

0

2

10−10

4

0

Iterations ·104

Iterations

2

4

Iterations ·104

F (xk ) − F ∗

(a) Cycle Graph

m = 10, n = 10, l = 1000

m = 100, n = 10, l = 100

m = 1000, n = 10, l = 10

100

100

100

10−8

10−8

10−8

10−16

0

2,000

10−16

4,000

0

0.5

1

1.5

2

10−16

0

2

4

·104

kW xk k2

100

100

100

10−8

10−16

·104

10−8

10−8

0

2,000

10−16

4,000

0

0.5

Iterations

1

1.5

Iterations

2

10−16

0

2

4

Iterations ·104

·104

(b) Erdős-Rényi Random Graph
Figure 5. Logistic regression on synthetic data over a cycle graph and an Erdős-Rényi random graph for a total of
10000 data points and various graph sizes evenly distributing the data points among the agents.

Table 3. Real Datasets from the LibSVM Library.
Name

Classes

Data points

Features

a9a
mushrooms
ijcnn1
phishing

2
2
2
2

32561
8214
49990
11055

123
112
22
68

32

Algorithm 6

DIGing

Acc-DNGD

EXTRA

100

100

100

100

10−8

10−8

10−8

10−8

10−16

0

0.5

1

A9A

1.5

2

10−16

0

1

2

3

10−16

0

1

IJCNN 1

MUSHROOMS
·104

·104

2

3

10−16

0

1

2

3

PHISHING ·104

·104

(a) Distance to Optimality
100

100

100

100

10−8

10−8

10−8

10−8

10−16

0

0.5

1

A9A

1.5

2

·104

10−16

0

1

2

3

10−16

0

1

2

IJCNN 1

MUSHROOMS
·104

3
·104

10−16

0

1

2

3

PHISHING ·104

(b) Distance to Consensus
Figure 6. Logistic regression results with data from the LibSVM Library on an Erdős-Rényi random graph with 100
agents.

10.

Conclusions

We have provided convergence rate estimates for the solution of convex optimization
problems in a distributed manner for dual-friendly functions. The provided complexity
bounds depend explicitly on the properties of the function to be optimized. If F (x) is
smooth, then our estimates are optimal up to logarithmic factors otherwise our estimates
are optimal up to constant factors (in terms of t thep
number of communication rounds).
The inclusion of the graph properties in terms of χ(W ) shows the additional price
to be paid in contrast with classical (centralized/non-distributed) optimal estimates.
The authors recognize that the proposed algorithms required, to some extent, global
knowledge about the graph properties and the condition number of the network function.
Nevertheless, our aim was to provide a theoretical foundation for the performance limits
of the distributed algorithms. The cases where global information is not available require
additional study.
The proposed algorithms and their corresponding analysis use Nesterov’s smoothing to
induce strong-convexity either in the primal or dual problems. However, as noted in [85],
Nesterov’s smoothing has some practical limitations, e.g., it requires explicit knowledge
of the norms R or Rx , and the strong convexity term depends on the desired accuracy
ε, which in turn may induce slow convergence rates. The study of distributed adaptive
methods [85] requires further work.
One can further extend our results and obtain the same rates of converge when the
graphs change with time by using restarting techniques [4, 26]. Nevertheless, we require
additional assumptions. Particularly, the network changes should not happen often and
nodes must be able to detect when these changes occur. The condition number of the
sequence of graphs χ(Wk ) then is the worst one among all the graphs in the execution
of the algorithm [76]. Additionally, it is still an open research question whether these
optimal convergence rates can be achieved over directed networks [45].

33

11.

Acknowledgments

The authors would like to thank the Lund Center for Control of Complex Engineering
Systems (LCCC) at Lund University, particularly Anders Rantzer and Pontus Giselsson
for organizing the 2017 LCCC Focus Period on Large-Scale and Distributed Optimization
from which most of the ideas contained in this paper were initially discussed. We’d also
like to thank Thinh T. Doan, who made a lot very useful comments on the initial version
of this text. This comments allows to repairs significant misprints.
12.

Funding

The work of A. Nedić and C.A. Uribe is supported by the National Science Foundation
under grant no. CPS 15-44953. The work of A. Gasnikov was supported by RFBR 18-2903071 mk. The work of C.A.Uribe, S. Lee, and A. Gasnikov was also partially supported
by the Yahoo! Research Faculty Engagement Program.
References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean,
M. Devin, et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.
CoRR, abs/1603.04467, in Conference on Language Resources and Evaluation (LREC08), 2016, pp.
3243–3249.
[2] A. Anikin, P. Dvurechensky, A. Gasnikov, A. Golov, A. Gornov, Y. Maximov, M. Mendel, and
V. Spokoiny, Efficient numerical algorithms for regularized regression problem with applications to
traffic matrix estimations, arXiv preprint arXiv:1508.00858 (2015).
[3] A. Anikin, A. Gasnikov, P. Dvurechensky, A. Tyurin, and A. Chernov, Dual approaches to the
minimization of strongly convex functionals with a simple structure under affine constraints, Computational Mathematics and Mathematical Physics 57 (2017), pp. 1262–1276.
[4] N. Bansal and A. Gupta, Potential-function proofs for first-order methods, arXiv preprint
arXiv:1712.04581 (2017).
[5] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2 (2009), pp. 183–202.
[6] A. Beck and M. Teboulle, A fast dual proximal gradient algorithm for convex minimization and
applications, Operations Research Letters 42 (2014), pp. 1–6.
[7] A. Beck, A. Nedic, A. Ozdaglar, and M. Teboulle, Optimal distributed gradient methods for network
resource allocation problems, IEEE Transactions on Control of Network Systems 1 (2014), pp. 64–74.
[8] D.P. Bertsekas, A. Nedić, and A.E. Ozdaglar, Convex Analysis and Optimization, Athena Scientific,
2003.
[9] V. Borkar and P.P. Varaiya, Asymptotic agreement in distributed estimation, IEEE Transactions on
Automatic Control 27 (1982), pp. 650–655.
[10] L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.
[11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical
learning via the alternating direction method of multipliers, Foundations and Trends R in Machine
Learning 3 (2011), pp. 1–122.
[12] S. Bubeck, Convex optimization: Algorithms and complexity, Found. Trends Mach. Learn. 8 (2015),
pp. 231–357.
[13] C.C. Chang and C.J. Lin, Libsvm: a library for support vector machines, ACM transactions on
intelligent systems and technology (TIST) 2 (2011), p. 27.
[14] A. Chernov, P. Dvurechensky, and A. Gasnikov, Fast Primal-Dual Gradient Method for Strongly
Convex Minimization Problems with Linear Constraints, in Discrete Optimization and Operations
Research, Springer International Publishing, Cham, 2016, pp. 391–403.
[15] M. Cuturi and G. Peyré, A smoothed dual approach for variational Wasserstein problems, SIAM
Journal on Imaging Sciences 9 (2016), pp. 320–343.

34

[16] M.H. DeGroot, Reaching a consensus, Journal of the American Statistical Association 69 (1974),
pp. 118–121.
[17] O. Devolder, F. Glineur, and Y. Nesterov, Double smoothing technique for large-scale linearly constrained convex optimization, SIAM Journal on Optimization 22 (2012), pp. 702–727.
[18] O. Devolder, F. Glineur, and Y. Nesterov, First-order methods of smooth convex optimization with
inexact oracle, Mathematical Programming 146 (2014), pp. 37–75.
[19] O. Devolder, F. Glineur, Y. Nesterov, et al., First-order methods with inexact oracle: the strongly
convex case, CORE Discussion Papers 2013016 (2013).
[20] T.T. Doan and A. Olshevsky, Distributed resource allocation on dynamic networks in quadratic time,
Systems & Control Letters 99 (2017), pp. 57–63.
[21] J. Duchi, P. Bartlett, and M. Wainwright, Randomized smoothing for stochastic optimization, SIAM
Journal on Optimization 22 (2012), pp. 674–701.
[22] J.C. Duchi, A. Agarwal, and M.J. Wainwright, Dual averaging for distributed optimization: Convergence analysis and network scaling, IEEE Transactions on Automatic Control 57 (2012), pp.
592–606.
[23] C. Dünner, S. Forte, M. Takáč, and M. Jaggi, Primal-dual Rates and Certificates, in Proceedings of
the 33rd International Conference on International Conference on Machine Learning - Volume 48,
ICML’16, New York, NY, USA, JMLR.org, 2016, pp. 783–792.
[24] P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich, Decentralize and randomize:
Faster algorithm for wasserstein barycenters, in Advances in Neural Information Processing Systems
31, 2018, pp. 10760–10770.
[25] P. Dvurechensky, Gradient method with inexact oracle for composite non-convex optimization, arXiv
preprint arXiv:1703.09180 (2017).
[26] O. Fercoq and Z. Qu, Restarting accelerated gradient methods with a rough strong convexity estimate,
arXiv preprint arXiv:1609.07358 (2016).
[27] A. Gasnikov, Universal gradient descent, arXiv preprint arXiv:1711.00394 (2017).
[28] A. Gasnikov, S. Kabanikhin, A. Mohamed, and M. Shishlenin, Convex optimization in hilbert space
with applications to inverse problems, arXiv preprint arXiv:1703.00267 (2017).
[29] A.V. Gasnikov, E. Gasnikova, Y.E. Nesterov, and A. Chernov, Efficient numerical methods for
entropy-linear programming problems, Computational Mathematics and Mathematical Physics 56
(2016), pp. 514–524.
[30] E. Gorbunov, D. Dvinskikh, and A. Gasnikov, Optimal decentralized distributed algorithms for
stochastic convex optimization, arXiv preprint arXiv:1911.07363 (2019).
[31] J.B. Hiriart-Urruty and C. Lemaréchal, Fundamentals of convex analysis, Springer Science & Business Media, 2012.
[32] F. Iutzeler, P. Ciblat, and J. Jakubowicz, Analysis of max-consensus algorithms in wireless channels,
IEEE Transactions on Signal Processing 60 (2012), pp. 6103–6107.
[33] D. Jakoveti, A unification and generalization of exact distributed first-order methods, IEEE Transactions on Signal and Information Processing over Networks 5 (2019), pp. 31–46.
[34] A. Juditsky and Y. Nesterov, Deterministic and stochastic primal-dual subgradient algorithms for
uniformly convex minimization, Stochastic Systems 4 (2014), pp. 44–80.
[35] S. Kakade, S. Shalev-Shwartz, and A. Tewari, Applications of strong convexity–strong smoothness
duality to learning with matrices, CoRR, abs/0910.0610 (2009).
[36] M.B. Khuzani and N. Li, Distributed regularized primal-dual method: Convergence analysis and
trade-offs, arXiv preprint arXiv:1609.08262 (2016).
[37] J. Konečnỳ, B. McMahan, and D. Ramage, Federated optimization: Distributed optimization beyond
the datacenter, arXiv preprint arXiv:1511.03575 (2015).
[38] T. Kraska, A. Talwalkar, J.C. Duchi, R. Griffith, M.J. Franklin, and M.I. Jordan, MLbase: A Distributed Machine-learning System., in CIDR, Vol. 1, 2013, pp. 2–1.
[39] H. Lakshmanan and D.P. De Farias, Decentralized resource allocation in dynamic networks of agents,
SIAM Journal on Optimization 19 (2008), pp. 911–940.
[40] G. Lan, Gradient sliding for composite optimization, Mathematical Programming 159 (2016), pp.
201–235.
[41] G. Lan, S. Lee, and Y. Zhou, Communication-efficient algorithms for decentralized and stochastic
optimization, arXiv preprint arXiv:1701.03961 (2017).
[42] G. Lan, Z. Lu, and R.D.C. Monteiro, Primal-dual first-order methods with o(1/ε) iteration-complexity
for cone programming, Mathematical Programming 126 (2011), pp. 1–29.
[43] P. Latafat, L. Stella, and P. Patrinos, New primal-dual proximal algorithm for distributed optimization, in Proc. IEEE 55th Conf. Decision and Control (CDC), Dec., 2016, pp. 1959–1964.

35

[44] J. Liu, B.D. Anderson, M. Cao, and A.S. Morse, Analysis of accelerated gossip algorithms, Automatica 49 (2013), pp. 873–883.
[45] M. Maros and J. Jaldn, PANDA: A Dual Linearly Converging Method for Distributed Optimization
Over Time-Varying Undirected Graphs, in 2018 IEEE Conference on Decision and Control (CDC),
Dec, 2018, pp. 6520–6525.
[46] I. Necoara, Random coordinate descent algorithms for multi-agent convex optimization over networks,
IEEE Transactions on Automatic Control 58 (2013), pp. 2001–2012.
[47] I. Necoara and V. Nedelcu, On linear convergence of a distributed dual gradient algorithm for linearly
constrained separable convex problems, Automatica 55 (2015), pp. 209 – 216.
[48] I. Necoara and J.A.K. Suykens, Application of a smoothing technique to decomposition in convex
optimization, IEEE Transactions on Automatic Control 53 (2008), pp. 2674–2679.
[49] A. Nedi, A. Olshevsky, and M.G. Rabbat, Network topology and communication-computation tradeoffs in decentralized optimization, Proceedings of the IEEE 106 (2018), pp. 953–976.
[50] A. Nedić and A. Olshevsky, Distributed optimization over time-varying directed graphs, IEEE Transactions on Automatic Control 60 (2015), pp. 601–615.
[51] A. Nedić, A. Olshevsky, and W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs, SIAM Journal on Optimization 27 (2017), pp. 2597–2633.
[52] A. Nedić, A. Olshevsky, and W. Shi, Improved convergence rates for distributed resource allocation,
arXiv preprint arXiv:1706.05441 (2017).
[53] A. Nedić, A. Olshevsky, and C.A. Uribe, Distributed learning for cooperative inference, arXiv preprint
arXiv:1704.02718 (2017).
[54] A. Nedić, A. Olshevsky, and C.A. Uribe, Fast convergence rates for distributed non-Bayesian learning, IEEE Transactions on Automatic Control 62 (2017), pp. 5538–5553.
[55] A. Nedić, A. Olshevsky, and C.A. Uribe, Graph-Theoretic Analysis of Belief System Dynamics under
Logic Constraints, Scientific Reports 9 (2019), p. 8843, Available at https://doi.org/10.1038/s41598019-45076-4.
[56] A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE
Transactions on Automatic Control 54 (2009), pp. 48–61.
[57] A. Nedić, A. Olshevsky, A. Ozdaglar, and J.N. Tsitsiklis, On distributed averaging algorithms and
quantization effects, IEEE Transactions on Automatic Control 54 (2009), pp. 2506–2517.
[58] A. Nedić, A. Olshevsky, W. Shi, and C.A. Uribe, Geometrically convergent distributed optimization
with uncoordinated step-sizes, in American Control Conference (ACC), 2017, 2017, pp. 3950–3955.
[59] A. Nemirovskii and Yudin, Problem Complexity and Method Efficiency in Optimization, Wiley, 1983.
[60] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k2 ), in
Soviet Mathematics Doklady, Vol. 27, 1983, pp. 372–376.
[61] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming 103 (2005),
pp. 127–152.
[62] Y. Nesterov, Gradient methods for minimizing composite functions, Mathematical Programming 140
(2013), pp. 125–161.
[63] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Vol. 87, Springer
Science & Business Media, 2013.
[64] Y. Nesterov, Universal gradient methods for convex optimization problems, Mathematical Programming 152 (2015), pp. 381–404.
[65] Y. Nesterov and A. Nemirovskii, Interior-Point Polynomial Algorithms in Convex Programming,
SIAM, 1994.
[66] B. O’Donoghue and E. Candès, Adaptive restart for accelerated gradient schemes, Foundations of
Computational Mathematics 15 (2015), pp. 715–732.
[67] A. Olshevsky, Linear time average consensus on fixed graphs and implications for decentralized
optimization and multi-agent control, preprint arXiv:1411.4186 (2014).
[68] B.N. Oreshkin, M.J. Coates, and M.G. Rabbat, Optimization and analysis of distributed averaging
with short node memory, IEEE Transactions on Signal Processing 58 (2010), pp. 2850–2865.
[69] N. Parikh and S. Boyd, Block splitting for distributed optimization, Mathematical Programming
Computation 6 (2014), pp. 77–102.
[70] G. Qu and N. Li, Accelerated distributed Nesterov gradient descent, arXiv preprint arXiv:1705.07176
(2017).
[71] G. Qu and N. Li, Harnessing smoothness to accelerate distributed optimization, IEEE Transactions
on Control of Network Systems 5 (2018), pp. 1245–1260.
[72] M. Rabbat and R. Nowak, Decentralized source localization and tracking wireless sensor networks,
in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing,

36

Vol. 3, 2004, pp. 921–924.
[73] M. Raginsky and J. Bouvrie, Continuous-time stochastic Mirror Descent on a network: Variance
reduction, consensus, convergence, in Proc. IEEE 51st IEEE Conf. Decision and Control (CDC),
Dec., 2012, pp. 6793–6800.
[74] S.S. Ram, A. Nedić, and V.V. Veeravalli, Distributed stochastic subgradient projection algorithms
for convex optimization, Journal of Optimization Theory and Applications 147 (2010), pp. 516–545.
[75] R. Rockafellar and R. Wets, Variational analysis, Vol. 317, Springer, 2011.
[76] A. Rogozin, C.A. Uribe, A. Gasnikov, N. Malkovsky, and A. Nedić, Optimal distributed optimization
on slowly time-varying graphs, arXiv preprint arXiv:1805.06045 (2018).
[77] K. Scaman, F. Bach, S. Bubeck, Y.T. Lee, and L. Massoulié, Optimal Algorithms for Smooth and
Strongly Convex Distributed Optimization in Networks, in International Conference on Machine
Learning, 2017, pp. 3027–3036.
[78] K. Scaman, F. Bach, S. Bubeck, Y.T. Lee, and L. Massoulié, Optimal algorithms for non-smooth
distributed optimization in networks, arXiv preprint arXiv:1806.00291 (2018).
[79] W. Shi, Q. Ling, G. Wu, and W. Yin, Extra: An exact first-order algorithm for decentralized consensus
optimization, SIAM Journal on Optimization 25 (2015), pp. 944–966.
[80] A. Simonetto and H. Jamali-Rad, Primal recovery from consensus-based dual decomposition for
distributed convex optimization, Journal of Optimization Theory and Applications 168 (2016), p.
172.
[81] A. Sundararajan, B. Hu, and L. Lessard, Robust convergence analysis of distributed optimization
algorithms, in 2017 55th Annual Allerton Conference on Communication, Control, and Computing
(Allerton), Oct, 2017, pp. 1206–1212.
[82] T.M.D. Tran and A.Y. Kibangou, Distributed estimation of graph laplacian eigenvalues by the alternating direction of multipliers method, IFAC Proceedings Volumes 47 (2014), pp. 5526 – 5531, 19th
IFAC World Congress.
[83] Q. Tran-Dinh and V. Cevher, Constrained convex minimization via model-based excessive gap, in
Advances in Neural Information Processing Systems, 2014, pp. 721–729.
[84] Q. Tran Dinh and V. Cevher, Splitting the smoothed primal-dual gap: Optimal alternating direction
methods, Tech. Rep., Tech. Report. LIONS-EPFL (2015), 2015.
[85] Q. Tran-Dinh, O. Fercoq, and V. Cevher, A smooth primal-dual optimization framework for nonsmooth composite convex minimization, arXiv preprint arXiv:1507.06243 (2015).
[86] J.N. Tsitsiklis and M. Athans, Convergence and asymptotic agreement in distributed decision problems, IEEE Transactions on Automatic Control 29 (1984), pp. 42–50.
[87] C.A. Uribe, D. Dvinskikh, P. Dvurechensky, A. Gasnikov, and A. Nedi, Distributed Computation of
Wasserstein Barycenters Over Networks, in 2018 IEEE Conference on Decision and Control (CDC),
Dec, 2018, pp. 6544–6549.
[88] L. Xiao and S. Boyd, Optimal scaling of a gradient method for distributed resource allocation, Journal
of Optimization Theory and Applications 129 (2006), pp. 469–488.
[89] A. Yurtsever, Q.T. Dinh, and V. Cevher, A universal primal-dual convex optimization framework,
in Advances in Neural Information Processing Systems, 2015, pp. 3150–3158.
[90] G. Zhang and R. Heusdens, Distributed optimization using the primal-dual method of multipliers,
IEEE Transactions on Signal and Information Processing over Networks 4 (2018), pp. 173–187.
[91] M. Zhu and S. Martı́nez, On distributed optimization under inequality and equality constraints via
penalty primal-dual methods, in American Control Conference (ACC), 2010, 2010, pp. 2434–2439.

37

