Distributed Subgradient Methods for
Multi-agent Optimization
Asu Ozdaglar
February 2009
Department of Electrical Engineering & Computer Science
Massachusetts Institute of Technology, USA

MIT Laboratory for Information and Decision Systems

1

Motivation
• Increasing interest in distributed control and coordination of networks
consisting of multiple autonomous agents
• Motivated by many emerging networking applications, such as ad hoc wireless
communication networks and sensor networks, characterized by:
– Lack of centralized control and access to information
– Time-varying connectivity
• Control and optimization algorithms for such networks should be:
– Completely distributed relying on local information
– Robust against changes in the network topology

MIT Laboratory for Information and Decision Systems

2

Multi-Agent Optimization Problem
Goal: Develop a general computational model
for cooperatively optimizing a global system
objective through local interactions and computations in a multi-agent system
• Global objective is a combination of
individual agent performance measures

Examples:
• Consensus problems: Alignment of estimates maintained by different agents
– Control of moving vehicles (UAVs), computing averages of initial values
• Parameter estimation in distributed sensor networks:
– Regression-based estimates using local sensor measurements
• Congestion control in data networks with heterogeneous users
MIT Laboratory for Information and Decision Systems

3

Related Literature
• Parallel and Distributed Optimization Algorithms:
– General computational model for distributed asynchronous optimization
∗ Tsitsiklis 84, Bertsekas and Tsitsiklis 95
• Consensus and Cooperative Control:
– Analysis of group behavior (flocking) in dynamical-biological systems
∗ Vicsek 95, Reynolds 87, Toner and Tu 98
– Mathematical models of consensus and averaging
∗ Jadbabaie et al. 03, Olfati-Saber Murray 04, Boyd et al. 05
• Game Theory/Mechanism Design for Distributed Cooperative Control:
– Assign each agent a local utility function such that:
∗ The equilibrium of the resulting game is the same as (or close to) the
global optimum
∗ Derive learning algorithms that reach equilibrium
– Marden, Arslan, Shamma 07 used this approach for the consensus problem
to deal with constraints
MIT Laboratory for Information and Decision Systems

4

This Talk
• Development of a distributed subgradient method for multi-agent optimization
[Nedic, Ozdaglar 08]
– Convergence analysis and performance bounds for time-varying topologies
under general connectivity assumptions
• Effects of local constraints [Nedic, Ozdaglar, Parrilo 08]
• Effects of networked-system constraints: quantization, delay, asynchronism

MIT Laboratory for Information and Decision Systems

5

Model
• We consider a network of m agents with node set V = {1, . . . , m}
– Agents want to cooperatively solve
min

x∈Rn

m
X

f1(x1, . . . , xn)
f2(x1, . . . , xn)

fi (x)

i=1

– Function fi (x) : Rn → R is a convex objective function known only by node i

fm (x1, . . . , xn)

• Agents update and send their information at discrete times t0 , t1 , t2 . . .
• We use xi (k) ∈ Rn to denote the estimate of agent i at time tk
Agent Update Rule:
• Agent i locally minimizes fi according to:
m
X
aij (k)xj (k) − αi (k)di (k)
xi (k + 1) =
j=1

aij (k) are weights, αi (k) is stepsize, di (k) is subgradient of fi at xi (k)
• ai (k) = (ai1 (k), . . . , aim (k)) represents i’s time-varying neighbors at slot k
• The model includes consensus as a special case (fi (x) = 0 for all i)
MIT Laboratory for Information and Decision Systems

6

Linear Dynamics and Transition Matrices
• We let A(s) denote the matrix whose ith column is the vector ai (s) and
introduce the transition matrices
Φ(k, s) = A(s)A(s + 1) · · · A(k − 1)A(k)

for all k ≥ s

• We use these matrices to relate xi (k + 1) to xj (s) at time s ≤ k:
Ãm
!
k−1
m
X
X X
i
i j
x (k+1) =
[Φ(k, s)]j x (s)−
[Φ(k, r + 1)]ij αj (r)dj (r) −αi (k)di (k).
j=1

r=s

j=1

• We analyze convergence properties of the distributed method by establishing:
– Convergence of transition matrices Φ(k, s) (consensus part)
– Convergence of an approximate subgradient method (effect of optimization)

MIT Laboratory for Information and Decision Systems

7

Assumptions
Assumption (Weights) At all times k, we have
(a) The weights aij (k) are nonnegative for all agents i, j.
(b) There exists a scalar η ∈ (0, 1) such that for all agents i,
aii (k) ≥ η

and

aij (k) ≥ η

for agents j communicating with agent i at time k.
(c) The agent weight vectors ai (k) = [ai1 (k), . . . , aim (k)]T are stochastic, i.e.,
Pm i
j=1 aj (k) = 1 for all i and k.
Example: Equal neighbor weights
aij (k) =

1
ni (k) + 1

where ni (k) is the number of agents communicating with i at time k

MIT Laboratory for Information and Decision Systems

8

Information Exchange
• Agent i influences any other agent infinitely
often - connectivity
• Agent j send his information to neighboring agent i within a bounded time interval
- bounded intercommunication interval
At slot k, information exchange may be represented by a directed graph (V, Ek ) with
Ek = {(j, i) | aij (k) > 0}
Assumption (Connectivity) The graph (V, E∞ ) is connected, where
E∞ = {(j, i) | (j, i) ∈ Ek for infinitely many indices k}.
Assumption (Bounded Intercommunication Interval) There is some B ≥ 1 s.t.
(j, i) ∈ Ek ∪ Ek+1 ∪ · · · ∪ Ek+B−1

MIT Laboratory for Information and Decision Systems

for all (j, i) ∈ E∞ and k ≥ 0.

9

Properties of Transition Matrices
Lemma: Let Weights and Information Exchange assumptions hold. We then have
[Φ(s + (m − 1)B − 1, s)]ij ≥ η (m−1)B

for all s, i, and j,

where η is the lower bound on weights and B is the intercommunication interval
bound.
• We introduce the matrices Dk (s) as follows: for a fixed s ≥ 0,
Dk (s) = Φ0 (s + kB0 − 1, s + (k − 1)B0 )

for k = 1, 2, . . . ,

where B0 = (m − 1)B.
• By the previous lemma, all entries of Dk (s) are positive.

MIT Laboratory for Information and Decision Systems

10

Convergence of Transition Matrices
Lemma: Let Weights and Information Exchange assumptions hold. For each s ≥ 0,
we have:
(a) The limit D̄(s) = limk→∞ Dk (s) · · · D1 (s) exists.
(b) The limit D̄(s) has identical rows and the rows are stochastic.
(c) For every j, the entries [Dk (s) · · · D1 (s)]ji , i = 1, . . . , m, converge to the same
limit φj (s) as k → ∞ with a geometric rate:
¯
¯
³
´³
´k
¯
¯
−B0
B0
j
1−η
¯[Dk (s) · · · D1 (s)]i − φj (s)¯ ≤ 2 1 + η
where η is the lower bound on weights, B is the intercommunication interval
bound, and B0 = (m − 1)B.

MIT Laboratory for Information and Decision Systems

11

Proof Outline
• We show that the sequence {(Dk · · · D1 )x} converges for every x ∈ Rm
• Consider the sequence {xk } with xk = Dk · · · D1 x and write xk as
xk = zk + ck e,

where ck = min [xk ]i
1≤i≤m

• Using the property that each entry of the matrix Dk is positive, we show
³
´k
B0
kzk k∞ ≤ 1 − η
kz0 k∞
for all k.
Hence zk → 0 with a geometric rate.
• We then show that the sequence {ck } converges to some c̄ ∈ R and use the
contraction constant to establish the rate estimate
• The final relation follows by picking x = ej , the j th unit vector

MIT Laboratory for Information and Decision Systems

12

Convergence of Transition Matrices
Proposition: Let Weights and Information Exchange assumptions hold. For each
s ≥ 0, we have:
(a) The limit Φ̄(s) = limk→∞ Φ(k, s) exists.
(b) The limit matrix Φ̄(s) has identical columns and the columns are stochastic, i.e.,
Φ̄(s) = φ(s)e0 ,
where φ(s) ∈ Rm is a stochastic vector.
(c) For every i, [Φ(k, s)]ji , j = 1, ..., m, converge to the same limit φi (s) as k → ∞
with a geometric rate, i.e., for all i, j and all k ≥ s,
¯
¯
´ k−s
1 + η −B0 ³
B0
¯
¯
j
B0
1
−
η
¯[Φ(k, s)]i − φi (s)¯ ≤ 2
1 − η B0
where η is the lower bound on weights, B is the intercommunication interval
bound, and B0 = (m − 1)B.
• The rate estimate in part (c) recently improved in [Nedić, Olshevsky, Ozdaglar,
Tsitsiklis 08]
MIT Laboratory for Information and Decision Systems

13

Convergence Analysis
• Recall the evolution of the estimates (with αi (s) = α):
!
Ãm
k−1
m
X
X
X
[Φ(k, s)]ij xj (s) − α
[Φ(k, r + 1)]ij dj (r) − αdi (k).
xi (k + 1) =
r=s

j=1

j=1

• Proof method: Consider a “stopped process”, where after time k̄, agents stop
computing subgradients but keep exchanging their estimates: di (k) = 0 for all
k ≥ k̄ and all i.
• It can be seen that the stopped process takes the form
!
Ã m
m
k̄
X
X
X
[Φ(k, r)]ij dj (r − 1)
x̄i (k + 1) =
[Φ(k, 0)]ij xj (0) − α
r=1

j=1

j=1

• Using limk→∞ [Φ(k, s)]ij = φj (s) for all i, we see that the limit vector
limk→∞ x̄i (k) exists, independent of i, but depends on k̄:
lim x̄i (k) = y(k̄)

k→∞

MIT Laboratory for Information and Decision Systems

14

Behavior of “Stopped Process”
• Stopped process is described by:
y(k + 1) = y(k) − α

m
X

φj (k)dj (k)

j=1

– Subgradients dj (k) of fj are computed at xj (k) instead of y(k)
• This would correspond to an approximate subgradient method for minimizing
P
j fj (x) provided that the values φj (k) are the same for all j
• Assumption (Doubly Stochastic Weights) The matrices A(k) are doubly
Pm i
stochastic, i.e., i=1 aj (k) = 1 for all j and k.
– Can be ensured when the agents exchange their information simultaneously
and coordinate the selection of the weights aij (k)
• In this case the stopped process reduces to
m

α X j
y(k + 1) = y(k) −
d (k)
m j=1
MIT Laboratory for Information and Decision Systems

15

Main Convergence Result
Proposition: Let
• Weights, Doubly Stochastic Weights, and Information Exchange assumptions hold
• The subgradients of fi be uniformly bounded by a constant L, and
max1≤j≤m kxj (0)k ≤ αL.
Then, for the averages x̂i (k) of estimates xi (0), . . . , xi (k − 1) we have
µ
¶
2
∗
m dist (y(0), X )
C
f (x̂i (k)) ≤ f ∗ +
+ αL2
+ 2mC1
2αk
2
P
∗
where f is the optimal value of f = i fi , X ∗ is the optimal set,
1 X i
y(0) =
x (0),
m i
C1 = 1 +

m
1 − (1 − η B0 )

1
B0

C = 1 + 8mC1

1 + η −B0
,
1 − η B0

B0 = (m − 1)B

• Estimates are per iteration
• Captures tradeoff between accuracy and computational complexity
MIT Laboratory for Information and Decision Systems

16

Proof Outline
• We analyze the stopped process:
m

α X j
y(k + 1) = y(k) −
d (k)
m j=1
• Establish approximate convergence relation for the running averages
k−1
1X
ŷ(k) =
y(h)
k
h=0

• Using the convergence rate estimate for the transition matrices Φ(k, s), we show
Pk−1 i
i
1
that ŷ(k) are close to the agents’ averages x̂ (k) = k h=0 x (h)
kŷ(k) − x̂i (k)k ≤ 2αLC1

for all i and k

• Infer the result for the running averages x̂(k)

MIT Laboratory for Information and Decision Systems

17

Constrained Consensus Problem
• Estimates of agent i restricted to lie in a closed convex constraint set Xi
• We assume that the intersection set X = ∩m
i=1 Xi is nonempty
• Examples where constraints important:
– Motion planning and alignment problems, where each agent’s position is
limited to a certain region or range
– Distributed constrained multi-agent optimization
• This talk: Pure consensus problem in the presence of constraints

MIT Laboratory for Information and Decision Systems

18

Projected Consensus Algorithm
• For the constrained consensus problem, we develop a consensus algorithm based
on projections.
• We use xi (k) ∈ Xi to denote the estimate of agent i at time tk
• Given a closed convex set X ⊂ Rn , and a vector y ∈ Rn , we define:
dist(y, X) = min ky − xk,
x∈X

PX (y) = argminx∈X ky − xk

Agent Update Rule:
• Agent i updates his estimates subject to his constraint set:
i

x (k + 1) = PXi

m
hX

i

aij (k)xj (k)

j=1

where ai (k) = (ai1 (k), . . . , aim (k))0 is the weight vector

MIT Laboratory for Information and Decision Systems

19

Connection to Alternating Projections
• Update rule similar to classical alternating projection method
Alternating/Cyclic Projection Methods:
Given closed convex sets X1 , X2 ⊂ Rn , find a
point in X = X1 ∩ X2 :
x(k + 1) = PX1 (x(k))
x(k + 2) = PX2 (x(k + 1))
Convergence analysis:
Xi affine [Von Neumann; Aronszajn 50],
Xi convex [Gubin, Polyak, Raik 67]
Constrained Consensus Algorithm:
Given closed convex sets X1 , X2
P 1
1
w (k) = j aj (k)xj (k)
P 2
2
w (k) = j aj (k)xj (k)
x1 (k + 1) = PX1 (w1 (k))
x2 (k + 1) = PX2 (w2 (k))

⊂

MIT Laboratory for Information and Decision Systems

Rn :

20

Convergence Analysis
• Analysis of the impact of constraints
• Recall the update rule
i

x (k + 1) =

m
X

aij (k)xj (k) + ei (k),

j=1

where the projection error ei (k) is given by
ei (k) = PXi

m
hX

m
i X
aij (k)xj (k) −
aij (k)xj (k) = xi (k + 1) − wi (k).

j=1

j=1

Proposition: Assume that the intersection X = ∩m
i=1 Xi is nonempty. Let Doubly
Stochastic Weights assumption hold. Then:
lim ei (k) = 0

k→∞

for all i.

• Implication: The analysis translates into unconstrained case
• The proof relies on two lemmas
MIT Laboratory for Information and Decision Systems

21

Convergence of Projection Error
Lemma: Let w ∈ Rn and X ⊂ Rn closed and
convex. For all x ∈ X,
kw − PX (w)k2 ≤ kw − xk2 − kPX (w) − xk2 .

Lemma: Let Doubly Stochastic Weights assumption hold. Then:
• kxi (k + 1) − xk ≤ kwi (k) − xk ∀ x ∈ Xi (nonexpansiveness of projection)
Pm
Pm
i
2
i
2
•
kw
(k)
−
xk
≤
kx
(k)
−
xk
∀ x (by doubly stochasticity)
i=1
i=1
– Of independent interest in convergence analysis of doubly stochastic matrices
Implication:
Pm
limk→∞ i=1 kwi (k) − x̄k2 − kxi (k + 1) − x̄k2 = 0, for all x̄ ∈ X

MIT Laboratory for Information and Decision Systems

22

Convergence of the Estimates
• Recall that the transition matrices are defined as follows:
Φ(k, s) = A(s)A(s + 1) · · · A(k − 1)A(k)

for all s and k with k ≥ s,

• Using the transition matrices and the decomposition of the estimate evolution,
Ãm
!
k
m
X X
X
i j
i
[Φ(k, s)]j x (s) +
[Φ(k, r)]ij ej (r − 1) + ei (k).
x (k + 1) =
r=s+1

j=1

j=1

• Use a two-time scale analysis, where we define a similar “stopped process”
Ãm
!
m
k
1 X j
1 X X j
y(k) =
x (s) +
e (r − 1)
m j=1
m r=s+1 j=1
• It can be seen that the agent estimates reach a consensus:
lim kxi (k) − y(k)k = 0

k→∞

for all i

Proposition: (Convergence) Let Weights, Doubly Stochastic Weights, and
Information Exchange assumptions hold. We have for some x̃ ∈ X,
lim kxi (k) − x̃k = 0

k→∞

MIT Laboratory for Information and Decision Systems

for all i.
23

Rate Analysis
Assumption (Interior Point): There exists
a vector x̄ such that
´
³
m
x̄ ∈ int(X) = int ∩i=1 Xi ,
i.e., there exists some scalar δ > 0 such
that {z | kz − x̄k ≤ δ} ⊂ X.
Proposition: Let Interior Point Assumption hold. Let the weight vectors ai (k) be
given by ai (k) = (1/m, . . . , 1/m)0 for all i and k. Then,
µ
¶k X
m
m
X
1
i
2
kxi (k) − x̃k2 ≤ 1 −
kx
(0)
−
x̃k
for all k ≥ 0,
2
4R
i=1
i=1
where x̃ ∈ X is the limit of the sequence {xi (k)}, and R = 1δ
i.e., the convergence rate is linear.

Pm

i
kx
(0) − x̄k,
i=1

• Linear convergence rate extends to time-varying weights with Xi = X
• Convergence rate for time-varying weights and different local constraints open
MIT Laboratory for Information and Decision Systems

24

Summary
• We presented a distributed subgradient method for multi-agent optimization
• The method can operate over networks with time-varying connectivity
• We proposed a constrained consensus policy for the case when agents have local
constraints on their estimates
• This policy has connections to the classical alternating projection method
• We analyzed the convergence and rate of convergence of the algorithms

MIT Laboratory for Information and Decision Systems

25

Putting Things Together: Constrained Distributed
Optimization
• Each agent has a convex closed local constraint set Xi [Nedić, Ozdaglar,
Parrilo 08]
• Agent i updates his estimate by
v i (k) =

m
X

aij (k)xj (k)

j=1
i

x (k + 1) = PXi

h

i
v (k) − α(k)d (k) ,
i

i

α(k) > 0 is a diminishing stepsize sequence and di (k) is a subgradient of fi (x)
at x = v i (k).
Results:
• Agent estimates generated by this algorithm converge to the same optimal
solution for the cases when the weights are constant and equal, and when the
weights are time-varying but Xi = X for all i.
• Convergence analysis in the general case open!
MIT Laboratory for Information and Decision Systems

26

Optimization over Random Networks
• Existing work focuses on deterministic models of network connectivity (i.e.,
worst-case assumptions about intercommunication intervals)
• Time-varying connectivity modeled probabilistically [Lobel, Ozdaglar 08]
• Each component l of agent estimates evolve according to
xl (k + 1) = A(k)xl (k) − α(k)dl (k)
• We assume that the matrix A(k) is a random matrix drawn independently over
time from a probability space of stochastic matrices.
• This allows edges at any time k to be correlated.
Results:
• We establish properties of random transition matrices by constructing positive
probability events in which information propagates from every node to every
other node in the network
• We provide convergence analysis of a stochastic subgradient method under
different stepsize rules
MIT Laboratory for Information and Decision Systems

27

Limits on Communication
Quantization Effects:
• Agents have access to quantized estimates due to storage and communication
bandwidth constraints [Nedić, Olshevsky, Ozdaglar, Tsitsiklis 07]
• Agent i updates his estimate by
i

x (k + 1) =

m
X

aij (k)xjQ (k) − αdi (k)

j=1

xiQ (k + 1) = bxi (k + 1)c
b·c represents rounding down to the nearest integer multiple of 1/Q
Delay Effects:
• Agents have access to outdated estimates due to communication delays
[Bliman, Nedić, Ozdaglar 08]
• In the presence of delays, agent i updates his estimate by
xi (k + 1) =

m
X

aij (k)xj (k − tij (k)) − αdi (k)

j=1

tij (k) is the delay in passing information from j to i
MIT Laboratory for Information and Decision Systems

28

Applications to Social Networks
• Growing interest in dynamics in a social network of communicating agents
• A specific example is learning and information aggregation over networks
• Consensus policies can be used to develop and analyze myopic/quasi-myopic
learning models [Golub, Jackson 07], [Acemoglu, Ozdaglar, ParandehGheibi
08]
• In the context of social networks, these rules may be too myopic:
– Alternative approach: Bayesian learning over social networks [Acemoglu,
Dahleh, Lobel, Ozdaglar 08]

MIT Laboratory for Information and Decision Systems

29

References
• A. Nedić and A. Ozdaglar, “Distributed Subgradient Methods for Multi-agent
Optimization,” IEEE Transactions on Automatic Control, 2008.
• A. Ozdaglar, “Constrained Consensus and Alternating Projections,” Proc. of
Allerton Conference, 2007.
• A. Nedić, A. Olshevsky, A. Ozdaglar, and J.N. Tsitsiklis, “On Distributed
Averaging Algorithms and Quantization Effects,” IEEE Transactions on
Automatic Control, 2008.
• A. Nedić, A. Ozdaglar, and P.A. Parrilo, “Constrained Consensus and
Optimization in Multi-agent Networks,” submitted for publication, 2008.
• I. Lobel and A. Ozdaglar, “Distributed Subgradient Methods over Random
Networks,” Proc. of Allerton Conference, 2008.
• D. Acemoglu, M. Dahleh, I. Lobel, and A. Ozdaglar, “Bayesian Learning in
Social Networks,” submitted for publication, 2008.
• D. Acemoglu, A. Ozdaglar, and A. ParandehGheibi “Spread of (Mis)Information
over Social Networks,” working paper, 2008.
All papers can be downloaded from web.mit.edu/asuman/www
MIT Laboratory for Information and Decision Systems

30

