Algorithms for build a BreadthFirstSearch tree in a network. All assume that there is a designated initiator node that starts the algorithm. At end of algorithm each node except the initiator has a parent pointer and every node has a list of children. These are consistent and define a BFS tree i.e. nodes at distance k from the initiator appear at level k of the tree.

1. Synchronous algorithms

In a synchronous network, Flooding solves BFS; see AttiyaWelch, Lemma 2.8, page 21, or the notes below, based on LynchBook Section 4.2.

1.1. Easy case: network is undirected

Here for every edge uv there is a reverse edge vu. So we can send acknowledgments.

State of the algorithm qi consists of

Initiator has same algorithm as other nodes, and same state except that its parent pointer starts out holding its own ID and its active field starts out true.

Send function:

Update function:

1.1.1. Application: broadcast

If we rip out the child messages and parent pointers, we get a protocol for broadcast. Formally, the states look like (msg, state) where state is one of (waiting, active, done). Initial state of initiator is (msg, active), everybody else is (null, waiting).

Send function is

Update function is

To prove correctness, adapt BFS proof. Note that since we aren't sending back child acknowledgments, we don't need to assume an undirected network.

If we are broadcasting many messages, it may be more efficient to build a BFS tree first and then use a modified broadcast algorithm that only sends to children in the BFS tree (this gets the number of messages down from |E|+|V|-1 to just |V|-1). Note that we can also start a new broadcast each round without overloading the network, since the wave of messages carrying broadcast k+1 is always one step behind broadcast k.

1.2. Hard case: directed network

For a directed network, we can't send child messages back to the parent. So instead we use broadcast on every child message to spam it to the entire system (with an extra To field to identify the intended recipient). This requires substantial expansion in the state space and the message length, since we have to track every message we've ever seen to know whether to forward a new copy. However, the time cost doesn't go up by much: since both the initial propagation of the recruit messages and the subsequent child broadcasts take rounds linear in the diameter, the total is just 2*diameter in the worst case.

1.2.1. Termination testing

One problem with a simple modification of the BFS algorithm using broadcast child messages is that a node can't tell if it has any unacknowledged children out there it hasn't heard about yet. So an additional modification is to have each node broadcast a no-child message for each rejected potential parent. A node can then declare itself done precisely when it's received either child or no-child from every neighbor it sent recruit to.

This doesn't quite allow the initiator to detect when when the tree is done being built, since it may be that its own children can acknowledge it fairly quickly. But this can be handled by adding a neighbor count to the child messages (since the initiator gets spammed with all child messages just like anybody else); when the total number of child or no-child messages received equals the total neighbor count (including the initiator's own neighbor count), the initiator can conclude that all recruit messages have been processed and acknowledged.

1.2.2. Application: leader election

We deferred the question of doing LeaderElection in a directed graph. Here's a simple algorithm: treat each node as the initiator of a separate BFS protocol, where a node only participates in protocols initiated by nodes with IDs >= the max ID it has seen. The only node that finishes its BFS protocol wins. Time is O(diameter), message complexity is O(E*diameter). Message size is about the same as BFS---a given node is effectively only participating in one BFS protocol in each round, so there is not much blow-up from the parallel protocols. (It's still bad---concentrating cascades of child and no-child messages could easily yield messages of size O(E log V) assuming node IDs are O(log V).)

2. Asynchronous algorithms

Here the complication is that we can no longer rely on synchronous communication to reach all nodes at distance d at the same time. So instead we need to keep track of distancs explicitly, or possibly enforce some approximation to synchrony in the algorithm. (A general version of this last approach is to apply a synchronizer to one of the synchronous algorithms: see Synchronizers).

To keep things simple, we'll drop the requirement that a parent learn the IDs of its children, since this can be tacked on as a separate notification protocol as in the "hard case" above.

2.1. A simple algorithm using explicit distances

This is the AsynchBFS automaton from LynchBook §15.4. It's a very simple algorithm, closely related to Dijkstra's algorithm for shortest paths, but there is otherwise no particular reason to use it; it is dominated by the O(D) time and O(DE) message complexity synchronizer-based algorithm described later.

The idea is to run an AsynchronousBroadcast with distances attached. Each node sets its distance to 1 plus the smallest distance sent by its neighbors and its parent to the neighbor supplying that smallest distance. A node notifies all its neighbors of its new distance whenever its distance changes.

In pseudocode:

States: distance, initially 0 for initiator and inf for all other nodes, internal send buffers

Initiator initialization code:
    send distance to all neighbors

All processes:
    upon receiving d from p:
        if d+1 < distance:
            distance := d+1
            parent := p
            send distance to all neighbors

(See LynchBook for a precondition-effect description, which also includes code for buffering outgoing messages.)

The claim is that after at most O(VE) messages and O(D) time, all distance values are equal to the length of the shortest path from the initiator to the appropriate node. The proof is by showing the following

Invariant

distancep is always the length of some path from initiator to p, and any message sent by p is also the length of some path from initiator to p.

Proof
The second part follows from the first; any message sent equals p's current value of distance. For the first part, suppose p updates its distance; then it sets it to 1+the length of some path from initiator to p', which is the length of that same path extended by adding the pp' edge.

We also need a liveness argument that says that distancep = d(initiator, p) no later than time d(initiator, p). Note that we can't detect this condition occurring without a lot of additional work.

In LynchBook, there's an extra |V| term in the time complexity that comes from message pile-ups, since the model used there only allows one incoming message to be processed per time unit.1 The trick to arranging this to happen often is to build a graph where node 1 is connected to nodes 2 and 3, node 2 to 3 and 4, node 3 to 4 and 5, etc. This allows us to quickly generate many paths of distinct distance from 1 to node k, which produce k outgoing messages from node k. (It may be that a more clever analysis can avoid this blowup, by showing that it only happens in a few places.)

2.2. Layering

Here we run up to |V| instances of the simple algorithm with a distance bound on each: instead of sending out just 0, the initiator sends out (0, bound), where bound is initially 1 and increases at each phase. A process only sends out its improved distance if it is less than the bound. When the bound increases, notification of the increase is distributed only through the partial BFS tree constructed so far. With some effort, it is possible to prove that in a bidirectional network that this approach guarantees that each edge is only probed once with a new distance (since distance 1 nodes are recruited before distance 2 nodes) etc, and the bound-update and acknowledgement messages contribute at most |V| messages per phase. So we get O(E + V*diam) total messages. But the time complexity is bad: O(diam2) in the worst case.

2.3. An O(diam) time algorithm for a bidirectional network

The reason the layered algorithm takes so long is that at each phase we have to phone all the way back up the tree to the initiator to get permission to go on to the next phase. We need to do this to make sure that a node is only recruited into the tree once: otherwise we can get pile-ups on the channels as in the simple algorithm. But we don't necessarily need to do this globally. Instead, we'll require each node at distance d to delay sending out a recruiting message until it has confirmed that none of its neighbors will be sending it a smaller distance. We do this by having two classes of messages:

distance(d)
"I know that my distance is d."
not-distance(d)

"I know that my distance is > d."

The rules for sending these messages for a non-initiator are:

  1. I can send distance(d) as soon as I have received distance(d-1) from at least one neighbor and not-distance(d-2) from all neighbors.
  2. I can send not-distance(d) if d = 0 or as soon as I have received not-distance(d-1) from all neighbors.

The initiator send distance(0) to all neighbors at the start of the protocol (these are the only messages the initiator sends).

My distance will be the unique distance that I am allowed to send in a distance(d) messages. Note that this algorithm terminates in the sense that every node learns its distance at some finite time.

Comment regarding Synchronizers: The algorithm essentially corresponds to building the alpha synchronizer into the synchronous BFS algorithm, just as the layered model builds in the beta synchronizer. See AttiyaWelch §11.3.2 for a discussion of BFS using synchronizers. The original approach of applying synchronizers to get BFS is due to Awerbuch (JACM, 1985).

Proof of correctness

Under the assumption that local computation takes zero time and message delivery takes at most 1 time unit, we'll show that if distance(initiator, p) = d, (a) p sends not-distance(d') for any d' < d by time d', (b) p sends distance(d) by time d, (c) p never sends not-distance(d') for any d' >= d, and (d) p never sends distance(d') for any d' > d. For parts (c) and (d) we use induction on distance; for (a) and (b), induction on time. This is not terribly surprising: (c) and (d) are safety properties, so we don't need to talk about time. But (a) and (b) are liveness properties so time comes in.

  • Let's start with (c) and (d). The base case is that the initiator (distance 0) never sends anything but distance(0). Now consider a node p at distance d > 0, and let p' be a neighbor of p at distance d-1. By the induction hypothesis we have (c) p' never sends not-distance(d'-1) for any d' >= d. Since a precondition for p sending not-distance(d') is receiving not-distance(d'-1) from p', p never sends not-distance(d') for any d' >= d and (d) holds for p. Similarly, (c) holds for p, because to send distance(d') with d' > d it must first receive not-distance(d'-2) from p', but then d'-2 > d-2 >= d-1 so not-distance(d'-2) is not sent by p'.

  • Now for (a) and (b). The base case is that the initiator sends distance(0) to all nodes at time 0, giving (a), and there is no not-distance(d') with d' < 0 for it to send, giving (b) vacuously; and any non-initiator sends not-distance(0) immediately. At time t+1, we have that (a) not-distance(t) was sent by any node at distance t+1 or greater by time t and (b) distance(t) was sent by any node at distance t by time t; so for any node at distance t+2 we send not-distance(t+1) no later than time t+1 (because we already received not-distance(t) from all our neighbors) and for any node at distance t+1 we send distance(t+1) no later than time t+1 (because we received all the preconditions for doing so by this time).

Message complexity

A node at distance d sends not-distance(d') for all d' < d and distance d and no other messages. So we have message complexity bounded by |E|*diam in the worst case.

Time complexity
It's immediate from (a) and (b) that all messages that are sent are sent by time diam, and indeed that a node learns its distance at time d(initiator, node). So we have optimal time complexity.

Comment: Our time proof assumes that messages don't pile up on edges, or that such pile-ups don't affect delivery time (this is the default assumption used in AttiyaWelch). A more sophisticated proof could remove this assumption.

One downside of this algorithm is that it has to be started simultaneously at all nodes. Alternatively, we could trigger "time 0" at each node by a broadcast from the initiator, using the usual asynchronous broadcast algorithm; this would give us a BFS tree in O(|E|*diam) messages (since the O(|E|) messages of the broadcast disappear into the constant) and 2*diam time. The analysis of time goes through as before, except that the starting time 0 becomes the time at which the last node in the system is woken up by the broadcast. Further optimizations are possible.


CategoryDistributedComputingNotes

  1. The model in AttiyaWelch doesn't have this restriction. (1)

DistributedBreadthFirstSearch (last edited 2008-01-23 16:11:30 by JamesAspnes)