Copyright © 1978 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org. The definitive version of this paper can be found at ACM's Digital Library.
The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
The concept of time is fundamental to our way of thinking. It is derived from the more basic concept of the order in which events occur. We say that something happened at 3:15 if it occurred after our clock read 3:15 and before it read 3:16. The concept of the temporal ordering of events pervades our thinking about systems. For example, in an airline reservation system we specify that a request for a reservation should be granted if it is made before the flight is filled. However, we will see that this concept must be carefully reexamined when considering events in a distributed system.
A distributed system consists of a collection of distinct processes which are spatially separated, and which communicate with one another by exchanging messages.
A network of interconnected computers, such as the ARPA net, is a distributed system. A single computer can also be viewed as a distributed system in which the central control unit, the memory units, and the input-output channels are separate processes. A system is distributed if the message transmission delay is not negligible compared to the time between events in a single process.
We will concern ourselves primarily with systems of spatially separated computers. However, many of our remarks will apply more generally. In particular, a multiprocessing system on a single computer involves problems similar to those of a distributed system because of the unpredictable order in which certain events can occur.
In a distributed system, it is sometimes impossible to say that one of two events occurred first. The relation "happened before" is therefore only a partial ordering of the events in the system. We have found that problems often arise because people are not fully aware of this fact and its implications.
In this paper, we discuss the partial ordering defined by the "happened before" relation, and give a distributed algorithm for extending it to a consistent total ordering of all the events. This algorithm can provide a useful mechanism for implementing a distributed system. We illustrate its use with a simple method for solving synchronization problems. Unexpected, anomalous behavior can occur if the ordering obtained by this algorithm differs from that perceived by the user. This can be avoided by introducing real, physical clocks. We describe a simple method for synchronizing these clocks, and derive an upper bound on how far out of synchrony they can drift.
Most people would probably say that an event a happened before an event b if a happened at an earlier time than b. They might justify this definition in terms of physical theories of time. However, if a system is to meet a specification correctly, then that specification must be given in terms of events observable within the system. If the specification is in terms of physical time, then the system must contain real clocks. Even if it does contain real clocks, there is still the problem that such clocks are not perfectly accurate and do not keep precise physical time. We will therefore define the "happened before" relation without using physical clocks.
We begin by defining our system more precisely. We assume that the system is composed of a collection of processes. Each process consists of a sequence of events. Depending upon the application, the execution of a subprogram on a computer could be one event, or the execution of a single machine instruction could be one event. We are assuming that the events of a process form a sequence, where a occurs before b in this sequence if a happens before b. In other words, a single process is defined to be a set of events with an a priori total ordering. This seems to be what is generally meant by a process.
We assume that sending or receiving a message is an event in a process. We can then define the "happened before" relation, denoted by "→", as follows.
Definition. The relation "→" on the set of events of a system is the smallest relation satisfying the following three conditions:

(1) If a and b are events in the same process, and a comes before b, then a → b.

(2) If a is the sending of a message by one process and b is the receipt of the same message by another process, then a → b.

(3) If a → b and b → c then a → c.

Two distinct events a and b are said to be concurrent if a ↛ b and b ↛ a.
We assume that a ↛ a for any event a. (Systems in which an event can happen before itself do not seem to be physically meaningful.) This implies that → is an irreflexive partial ordering on the set of all events in the system.
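To make the definition concrete, here is a small sketch (ours, not part of the original paper; all event names are illustrative) that computes the happened-before relation for a finite execution from its two primitive sources of ordering, the event sequence within each process and the send/receive pairing of messages, by taking a transitive closure:

```python
# Each process is the sequence of its events, in the order they occur.
processes = {
    "P": ["p1", "p2", "p3"],
    "Q": ["q1", "q2", "q3"],
}
# Each message is a (send_event, receive_event) pair.
messages = [("p1", "q2"), ("q3", "p3")]

# Conditions (1) and (2): edges from process order and from messages.
edges = set(messages)
for events in processes.values():
    edges |= set(zip(events, events[1:]))

# Condition (3): the transitive closure gives the full "happened before" relation.
happened_before = set(edges)
grew = True
while grew:
    new = {(a, c)
           for (a, b) in happened_before
           for (b2, c) in happened_before if b == b2}
    grew = not new <= happened_before
    happened_before |= new

def concurrent(a, b):
    """Two distinct events are concurrent if neither happened before the other."""
    return (a, b) not in happened_before and (b, a) not in happened_before

print(("p1", "p3") in happened_before)  # True: same process, p1 comes first
print(("p1", "q3") in happened_before)  # True: via the message p1 -> q2
print(concurrent("p2", "q1"))           # True: no chain connects them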
It is helpful to view this definition in terms of a "space-time diagram" (figure omitted) in which the vertical lines denote processes, the dots denote events, and the wavy lines denote messages. In such a diagram, a → b means that one can go from a to b by moving forward in time along process and message lines.

Another way of viewing the definition is to say that a → b means that it is possible for event a to causally affect event b. Two events are concurrent if neither can causally affect the other. For example, two events in different processes connected by no chain of messages are concurrent, even if one of them occurs at an earlier physical time than the other.

This definition will appear quite natural to the reader familiar with the invariant space-time formulation of special relativity. In relativity, the ordering of events is defined in terms of messages that could be sent; here we have taken the more pragmatic approach of considering only messages that actually are sent.
We now introduce clocks into the system. We begin with an abstract point of view in which a clock is just a way of assigning a number to an event, where the number is thought of as the time at which the event occurred. More precisely, we define a clock C_i for each process P_i to be a function which assigns a number C_i⟨a⟩ to any event a in that process. The entire system of clocks is represented by the function C which assigns to any event b the number C⟨b⟩, where C⟨b⟩ = C_j⟨b⟩ if b is an event in process P_j. For now, we make no assumption about the relation of the numbers C⟨b⟩ to physical time, so we can think of the clocks as logical rather than physical clocks. They may be implemented by counters with no actual timing mechanism.
We now consider what it means for such a system of clocks to be correct. We cannot base our definition of correctness on physical time, since that would require introducing clocks which keep physical time. Our definition must be based on the order in which events occur. The strongest reasonable condition is that if an event a occurs before another event b, then a should happen at an earlier time than b. We state this condition more formally as follows.

Clock Condition. For any events a, b: if a → b then C⟨a⟩ < C⟨b⟩.
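In this abstract view, a system of clocks is nothing more than a map from events to numbers, and checking the Clock Condition for a finite execution is a one-line test. A minimal sketch (ours; the events, values, and relation are made up):

```python
# A system of clocks in the abstract sense: an assignment of numbers to events.
C = {"p1": 1, "p2": 2, "q1": 1, "q2": 3}

# A hypothetical happened-before relation, given as a set of ordered pairs.
happened_before = {("p1", "p2"), ("p1", "q2"), ("q1", "q2")}

def satisfies_clock_condition(C, happened_before):
    """Clock Condition: for any events a, b, if a -> b then C<a> < C<b>."""
    return all(C[a] < C[b] for a, b in happened_before)

print(satisfies_clock_condition(C, happened_before))  # True
```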
Note that we cannot expect the converse condition to hold as well, since that would imply that any two concurrent events must occur at the same time. In general this is impossible: an event can be concurrent with each of two events which are themselves ordered by →, and those two events cannot both be assigned the same time as it.

It is easy to see from our definition of the relation "→" that the Clock Condition is satisfied if the following two conditions hold.

C1. If a and b are events in process P_i, and a comes before b, then C_i⟨a⟩ < C_i⟨b⟩.

C2. If a is the sending of a message by process P_i and b is the receipt of that message by process P_j, then C_i⟨a⟩ < C_j⟨b⟩.
Let us consider the clocks in terms of a space-time diagram. We imagine that a process' clock "ticks" through every number, with the ticks occurring between the process' events. For example, if a and b are consecutive events in process P_i with C_i⟨a⟩ = 4 and C_i⟨b⟩ = 7, then clock ticks 5, 6, and 7 occur between the two events. We draw a dashed "tick line" through all the like-numbered ticks of the different processes (figure omitted). Condition C1 then means that there must be a tick line between any two events on a process line, and condition C2 means that every message line must cross a tick line.

We can consider the tick lines to be the time coordinate lines of some Cartesian coordinate system on space-time. We can redraw the diagram to straighten these coordinate lines without changing the system of events it represents (figure omitted).
The reader may find it helpful to visualize a two-dimensional spatial network of processes, which yields a three-dimensional space-time diagram. Processes and messages are still represented by lines, but tick lines become two-dimensional surfaces.
Let us now assume that the processes are algorithms, and the events represent certain actions during their execution. We will show how to introduce clocks into the processes which satisfy the Clock Condition. Process P_i's clock is represented by a register C_i, so that C_i⟨a⟩ is the value contained by C_i during the event a. The value of C_i will change between events, so changing C_i does not itself constitute an event.

To guarantee that the system of clocks satisfies the Clock Condition, we will insure that it satisfies conditions C1 and C2. Condition C1 is simple; the processes need only obey the following implementation rule:

IR1. Each process P_i increments C_i between any two successive events.

To meet condition C2, we require that each message m contain a timestamp T_m which equals the time at which the message was sent. Upon receiving a message timestamped T_m, a process must advance its clock to be later than T_m. More precisely, we have the following rule.

IR2. (a) If event a is the sending of a message m by process P_i, then the message m contains a timestamp T_m = C_i⟨a⟩. (b) Upon receiving a message m, process P_j sets C_j greater than or equal to its present value and greater than T_m.

In terms of a space-time diagram, IR1 insures that there is a tick line between any two events on a process line, and IR2 insures that every message line crosses a tick line. Hence, these rules guarantee that the Clock Condition is satisfied.
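Rules IR1 and IR2 translate directly into code. The following sketch (ours; the class and method names are not from the paper) implements a process whose counter obeys both rules:

```python
class Process:
    """A process whose logical clock obeys rules IR1 and IR2."""

    def __init__(self, name):
        self.name = name
        self.clock = 0  # the register C_i

    def local_event(self):
        self.clock += 1            # IR1: increment between successive events
        return self.clock

    def send(self):
        self.clock += 1            # IR1
        return self.clock          # IR2(a): the timestamp T_m = C_i<a>

    def receive(self, t_m):
        # IR2(b): advance C_j past both its present value and T_m.
        self.clock = max(self.clock, t_m) + 1
        return self.clock

p, q = Process("P"), Process("Q")
t_m = p.send()          # P sends message m with timestamp T_m = 1
q.local_event()         # Q's clock reads 1
print(q.receive(t_m))   # 2: strictly greater than both T_m and Q's old value
```

Note that receive sets the clock strictly greater than both its present value and T_m, which is exactly what condition C2 requires.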
We can use a system of clocks satisfying the Clock Condition to place a total ordering on the set of all system events. We simply order the events by the times at which they occur. To break ties, we use any arbitrary total ordering ≺ of the processes. More precisely, we define a relation ⇒ as follows: if a is an event in process P_i and b is an event in process P_j, then a ⇒ b if and only if either

(i) C_i⟨a⟩ < C_j⟨b⟩, or

(ii) C_i⟨a⟩ = C_j⟨b⟩ and P_i ≺ P_j.

It is easy to see that this defines a total ordering, and that the Clock Condition implies that if a → b then a ⇒ b. In other words, the relation ⇒ is a way of completing the "happened before" partial ordering to a total ordering.
The ordering ⇒ depends upon the system of clocks C_i, and is not unique. Different choices of clocks which satisfy the Clock Condition yield different relations ⇒. It is only the partial ordering → which is uniquely determined by the system of events.
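In an implementation, the relation ⇒ is simply a lexicographic comparison of (timestamp, process identifier) pairs, with the ordering of identifiers playing the role of the arbitrary ordering ≺. A sketch (ours; the events are made up):

```python
# Events as (timestamp, process_id, label) triples; ids break timestamp ties.
events = [(2, "B", "b1"), (1, "A", "a1"), (2, "A", "a2")]

# Sorting by (C<e>, process) realizes the total ordering "=>".
for t, p, label in sorted(events, key=lambda e: (e[0], e[1])):
    print(t, p, label)
# 1 A a1
# 2 A a2
# 2 B b1
```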
Being able to totally order the events can be very useful in implementing a distributed system. In fact, the reason for implementing a correct system of logical clocks is to obtain such a total ordering. We will illustrate the use of this total ordering of events by solving the following version of the mutual exclusion problem. Consider a system composed of a fixed collection of processes which share a single resource. Only one process can use the resource at a time, so the processes must synchronize themselves to avoid conflict. We wish to find an algorithm for granting the resource to a process which satisfies the following three conditions:

(I) A process which has been granted the resource must release it before it can be granted to another process.

(II) Different requests for the resource must be granted in the order in which they are made.

(III) If every process which is granted the resource eventually releases it, then every request is eventually granted.

We assume that the resource is initially granted to exactly one process.
These are perfectly natural requirements. They precisely
specify what it means for a solution to
be correct.
It is important to realize that this is a nontrivial problem. Using a central scheduling process which grants requests in the order they are received will not work, unless additional assumptions are made.
To see this, let P_0 be the scheduling process. Suppose P_1 sends a request to P_0 and then sends a message to P_2. Upon receiving the latter message, P_2 sends a request to P_0. It is possible for P_2's request to reach P_0 before P_1's request does. Condition (II) is then violated if P_2's request is granted first.
To solve the problem, we implement a system of clocks with rules IR1 and IR2, and use them to define a total ordering ⇒ of all events. This provides a total ordering of all request and release operations. With this ordering, finding a solution becomes a straightforward exercise; it just involves making sure that each process learns about all other processes' operations.
To simplify the problem, we make some assumptions. They are not essential, but they are introduced to avoid distracting implementation details. We assume first of all that for any two processes P_i and P_j, the messages sent from P_i to P_j are received in the same order as they are sent. Moreover, we assume that every message is eventually received. (These assumptions can be avoided by introducing message numbers and message acknowledgement protocols, as sketched below.) We also assume that a process can send messages directly to every other process.
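As the parenthetical remark suggests, the ordered-delivery assumption can be discharged with per-sender message numbers; a minimal sketch of such reassembly (ours, with acknowledgments omitted):

```python
class FifoChannel:
    """Reassemble a per-sender FIFO stream from out-of-order arrivals."""

    def __init__(self):
        self.next_seq = 0     # sequence number expected next
        self.buffer = {}      # out-of-order messages, keyed by sequence number

    def arrive(self, seq, msg):
        """Buffer msg and return the messages now deliverable in order."""
        self.buffer[seq] = msg
        delivered = []
        while self.next_seq in self.buffer:
            delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return delivered

ch = FifoChannel()
print(ch.arrive(1, "b"))  # []: message 0 has not arrived yet
print(ch.arrive(0, "a"))  # ['a', 'b']: delivered in sending order
```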
Each process maintains its own request queue which is never seen by any other process. We assume that the request queues initially contain the single message T_0:P_0 requests resource, where P_0 is the process initially granted the resource and T_0 is less than the initial value of any clock.
The algorithm is then defined by the following five rules. For convenience, the actions defined by each rule are assumed to form a single event.
1. To request the resource, process P_i sends the message T_m:P_i requests resource to every other process, and puts that message on its request queue, where T_m is the timestamp of the message.

2. When process P_j receives the message T_m:P_i requests resource, it places it on its request queue and sends a (timestamped) acknowledgment message to P_i.

3. To release the resource, process P_i removes any T_m:P_i requests resource message from its request queue and sends a (timestamped) P_i releases resource message to every other process.

4. When process P_j receives a P_i releases resource message, it removes any T_m:P_i requests resource message from its request queue.

5. Process P_i is granted the resource when the following two conditions are satisfied: (i) There is a T_m:P_i requests resource message in its request queue which is ordered before any other request in its queue by the relation ⇒. (ii) P_i has received a message from every other process timestamped later than T_m.

Note that conditions (i) and (ii) of rule 5 are tested locally by P_i.
It is easy to verify that the algorithm defined by these rules satisfies conditions (I)-(III).
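To make the five rules concrete, here is a sketch of one participant (ours; the bookkeeping is simplified, and rule 5(ii)'s "message from every other process" test is implemented by tracking the latest timestamp received from each peer):

```python
import heapq

class MutexProcess:
    """One participant in the distributed mutual-exclusion algorithm (a sketch)."""

    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers                    # ids of all other processes
        self.clock = 0                        # logical clock, rules IR1/IR2
        self.queue = []                       # request queue of (T_m, pid) pairs
        self.latest = {p: 0 for p in peers}   # latest timestamp from each peer

    def tick(self, t_m=0):
        self.clock = max(self.clock, t_m) + 1
        return self.clock

    def request(self):
        """Rule 1: timestamp the request, enqueue it, broadcast it."""
        t = self.tick()
        heapq.heappush(self.queue, (t, self.pid))
        return [(p, ("request", self.pid, t)) for p in self.peers]

    def release(self):
        """Rule 3: remove own request, broadcast a release message."""
        self.queue = [e for e in self.queue if e[1] != self.pid]
        heapq.heapify(self.queue)
        t = self.tick()
        return [(p, ("release", self.pid, t)) for p in self.peers]

    def deliver(self, kind, sender, t_m):
        """Rules 2 and 4; acknowledgments carry nothing but a timestamp."""
        self.tick(t_m)
        self.latest[sender] = max(self.latest[sender], t_m)
        if kind == "request":
            heapq.heappush(self.queue, (t_m, sender))
            return [(sender, ("ack", self.pid, self.tick()))]  # rule 2
        if kind == "release":
            self.queue = [e for e in self.queue if e[1] != sender]
            heapq.heapify(self.queue)
        return []

    def granted(self):
        """Rule 5: own request heads the queue under =>, and every peer has
        sent some message timestamped later than that request."""
        return (bool(self.queue) and self.queue[0][1] == self.pid
                and all(self.latest[p] > self.queue[0][0] for p in self.peers))
```

Messages returned by request, release, and deliver must be handed to the destination's deliver method; the (T_m, pid) pairs in the queue compare exactly by the relation ⇒.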
This is a distributed algorithm. Each process independently follows these rules, and there is no central synchronizing process or central storage. This approach can be generalized to implement any desired synchronization for such a distributed multiprocess system. The synchronization is specified in terms of a State Machine, consisting of a set C of possible commands, a set S of possible states, and a function e: C × S → S. The relation e(C, S) = S' means that executing the command C with the machine in state S causes the machine state to change to S'. In our example, the set C consists of all the commands P_i requests resource and P_i releases resource, and the state consists of a queue of waiting request commands, where the request at the head of the queue is the currently granted one. Executing a request command adds the request to the tail of the queue, and executing a release command removes a command from the queue.
Each process independently simulates the execution of the State Machine, using the commands issued by all the processes. Synchronization is achieved because all processes order the commands according to their timestamps (using the relation ⇒), so each process uses the same sequence of commands. A process can execute a command timestamped T when it has learned of all commands issued by all other processes with timestamps less than or equal to T. The precise algorithm is straightforward, and we will not bother to describe it.
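A sketch of the execution discipline (ours): every process applies the same commands in the same timestamp order, so all replicas of the state agree.

```python
def executable(latest_heard, t):
    """A command timestamped t may be executed once every process has been
    heard from at a timestamp >= t: with FIFO channels, no command with a
    smaller timestamp can still be in transit."""
    return all(h >= t for h in latest_heard.values())

def run_state_machine(commands):
    """The State Machine of the example: requests join the tail of the
    queue, releases remove the issuer's request; the head is granted."""
    state = []                               # queue of waiting requests
    for t, pid, kind in sorted(commands):    # sorted() realizes =>
        if kind == "request":
            state.append(pid)
        else:                                # "release"
            state.remove(pid)
    return state

print(executable({"A": 3, "B": 2}, t=2))              # True
print(run_state_machine([(1, "A", "request"),
                         (2, "B", "request"),
                         (3, "A", "release")]))       # ['B'] holds the resource
```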
This method allows one to implement any desired form of multiprocess synchronization in a distributed system. However, the resulting algorithm requires the active participation of all the processes. A process must know all the commands issued by other processes, so that the failure of a single process will make it impossible for any other process to execute State Machine commands, thereby halting the system.
The problem of failure is a difficult one, and it is
beyond the scope of this paper to discuss it in any detail.
We will just observe that the entire concept of failure is
only meaningful in the context of physical time. Without
physical time, there is no way to distinguish a failed
process from one which is just pausing between events.
A user can tell that a system has "crashed" only because
he has been waiting too long for a response. A method
which works despite the failure of individual processes
or communication lines is described elsewhere.
Our resource scheduling algorithm ordered the requests according to the total ordering ⇒. This permits the following type of "anomalous behavior." Consider a nationwide system of interconnected computers. Suppose a person issues a request a on a computer A, and then telephones a friend in another city to have him issue a request b on a different computer B. It is quite possible for request b to receive a lower timestamp and be ordered before request a. This can happen because the system has no way of knowing that a actually preceded b, since that precedence information is based on messages external to the system.
Let us examine the source of the problem more closely. Let S be the set of all system events. Let us introduce a set S̄ of events which contains the events in S together with all other relevant external events, such as the phone calls in our example. Let "⇢" denote the "happened before" relation for S̄. In our example, we had a ⇢ b, but a ↛ b. It is obvious that no algorithm based entirely upon events in S, and which does not relate those events in any way with the other events in S̄, can guarantee that request a is ordered before request b.
There are two possible ways to avoid such anomalous behavior. The first way is to explicitly introduce into the system the necessary information about the ordering ⇢. In our example, the person issuing request a could receive the timestamp T_a of that request from the system. When issuing request b, his friend could specify that b be given a timestamp later than T_a. This gives the user the responsibility for avoiding anomalous behavior.
The second approach is to construct a system of clocks which satisfies the following condition.

Strong Clock Condition. For any events a, b in S̄: if a ⇢ b then C⟨a⟩ < C⟨b⟩.

This is stronger than the ordinary Clock Condition because ⇢ is a stronger relation than →.
Let us identify S̄ with some set of "real" events in physical space-time, and let ⇢ be the partial ordering of events defined by special relativity. One of the mysteries of the universe is that it is possible to construct a system of physical clocks which, running quite independently of one another, will satisfy the Strong Clock Condition. We can therefore use physical clocks to eliminate anomalous behavior.
Let us introduce a physical time coordinate into our space-time picture, and let C_i(t) denote the reading of the clock C_i at physical time t.
In order for the clock C_i to be a true physical clock, it must run at approximately the correct rate. That is, we must have dC_i(t)/dt ≈ 1 for all t. More precisely, we will assume that the following condition is satisfied:

PC1. There exists a constant κ ≪ 1 such that for all i: |dC_i(t)/dt − 1| < κ.
For typical crystal controlled clocks, κ ≤ 10⁻⁶.
It is not enough for the clocks individually to run at approximately the correct rate. They must be synchronized so that C_i(t) ≈ C_j(t) for all i, j, and t. More precisely, there must be a sufficiently small constant ε so that the following condition holds:

PC2. For all i, j: |C_i(t) − C_j(t)| < ε.
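PC1 and PC2 together dictate how often clocks must be resynchronized: two clocks obeying PC1 can drift apart at a rate of at most 2κ, so clocks that start in agreement can violate a bound of ε after roughly ε/2κ seconds. A back-of-the-envelope sketch (our numbers are illustrative):

```python
kappa = 1e-6      # PC1 bound on clock rate error (typical crystal clock)
epsilon = 1e-4    # desired PC2 bound: clocks agree to within 100 microseconds

# Two clocks can drift apart at a rate of at most 2*kappa, so clocks that
# start synchronized may violate PC2 after about epsilon / (2*kappa) seconds.
print(epsilon / (2 * kappa))  # 50.0 seconds between resynchronizations
```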
If we consider vertical distance in a space-time diagram to represent physical time, then PC2 states that the variation in height of a single tick line is less than ε.
Since two different clocks will never run at exactly the same rate, they will tend to drift further and further apart. We must therefore devise an algorithm to insure that PC2 always holds. First, however, let us examine how small κ and ε must be to prevent anomalous behavior.
Let μ be a number such that if event a occurs at physical time t and event b in another process satisfies a ⇢ b, then b occurs later than physical time t + μ.
In other words, μ is less than the shortest transmission time for interprocess messages. We can always choose μ equal to the shortest distance between processes divided by the speed of light. However, depending upon how messages in S̄ are transmitted, μ could be significantly larger.
To avoid anomalous behavior, we must make sure that for any i, j, and t: C_i(t + μ) − C_j(t) > 0. Combining this with PC1 and PC2 relates κ and ε to μ: from PC1, C_i(t + μ) − C_i(t) > (1 − κ)μ, and from PC2, C_i(t) − C_j(t) > −ε; hence C_i(t + μ) − C_j(t) > (1 − κ)μ − ε, so it suffices to have ε/(1 − κ) ≤ μ.
We now describe our algorithm for insuring that PC2 holds. Let m be a message which is sent at physical time t and received at physical time t'. We define ν_m = t' − t to be the total delay of the message m. This delay will, of course, not be known to the process which receives m. However, we assume that the receiving process knows some minimum delay μ_m ≥ 0 such that μ_m ≤ ν_m. We call ξ_m = ν_m − μ_m the unpredictable delay of the message.

We now specialize rules IR1 and IR2 for our physical clocks as follows:

IR1'. For each i, if P_i does not receive a message at physical time t, then C_i is differentiable at t and dC_i(t)/dt > 0.

IR2'. (a) If P_i sends a message m at physical time t, then m contains a timestamp T_m = C_i(t). (b) Upon receiving a message m at time t', process P_j sets C_j(t') equal to maximum(C_j(t' − 0), T_m + μ_m).
Although the rules are formally specified in terms of the physical time parameter, a process only needs to know its own clock reading and the timestamps of messages it receives. For mathematical convenience, we are assuming that each event occurs at a precise instant of physical time, and different events in the same process occur at different times. These rules are then specializations of rules IR1 and IR2, so our system of clocks satisfies the Clock Condition.
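A sketch of rules IR1' and IR2' (ours; a real clock would also exhibit the rate error bounded by PC1, which this toy omits). Because the adjustment only ever moves the clock forward, the clock never runs backward:

```python
class PhysicalClock:
    """A continuously running clock adjusted by rule IR2'."""

    def __init__(self):
        self.offset = 0.0

    def read(self, t):
        # IR1': between receptions the clock runs continuously (here, at
        # exactly the hardware rate; PC1 allows a small rate error).
        return t + self.offset

    def on_receive(self, t, t_m, mu_m):
        # IR2'(b): set the clock to max(current reading, T_m + mu_m),
        # where mu_m is the known minimum transmission delay.
        self.offset = max(self.offset, t_m + mu_m - t)

clk = PhysicalClock()
clk.on_receive(t=10.0, t_m=10.5, mu_m=0.1)  # message timestamped ahead of us
print(clk.read(10.0))  # 10.6: advanced past T_m + mu_m
```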
We now show that this clock synchronizing algorithm can be used to satisfy condition PC2. We assume that the system of processes is described by a directed graph in which an arc from process P_i to process P_j represents a communication line over which messages are sent directly from P_i to P_j. We say that a message is sent over this arc every τ seconds if for any t, P_i sends at least one message to P_j between physical times t and t + τ. The diameter of the directed graph is the smallest number d such that for any pair of distinct processes P_j, P_k, there is a path from P_j to P_k having at most d arcs.
In addition to establishing PC2, the following theorem bounds the length of time it can take the clocks to become synchronized when the system is first started.
THEOREM. Assume a strongly connected graph of processes with diameter d which always obeys rules IR1' and IR2'. Assume that for any message m, μ_m ≤ μ for some constant μ, and that for all t ≥ t₀:

(a) PC1 holds.

(b) There are constants τ and ξ such that every τ seconds a message with an unpredictable delay less than ξ is sent over every arc.

Then PC2 is satisfied with ε ≈ d(2κτ + ξ) for all t ≳ t₀ + τd, where the approximations assume μ + ξ ≪ τ.
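Plugging illustrative numbers into the theorem's bound (our values, chosen only to show the scale):

```python
d = 3          # diameter of the communication graph
kappa = 1e-6   # PC1 rate-error bound
tau = 10.0     # a message crosses every arc at least every tau seconds
xi = 1e-3      # bound on the unpredictable part of message delay

epsilon = d * (2 * kappa * tau + xi)   # the theorem's synchronization bound
print(epsilon)  # 3.06e-03 seconds; roughly d * xi when kappa*tau << xi
```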
The proof of this theorem is surprisingly difficult, and is given in the Appendix.
There has been a great deal of work done on the problem of synchronizing physical clocks. We refer the reader to the literature for an introduction to the subject.
We have seen that the concept of "happening before" defines an invariant partial ordering of the events in a distributed multiprocess system. We described an algorithm for extending that partial ordering to a somewhat arbitrary total ordering, and showed how this total ordering can be used to solve a simple synchronization problem. A future paper will show how this approach can be extended to solve any synchronization problem.
The total ordering defined by the algorithm is somewhat arbitrary. It can produce anomalous behavior if it disagrees with the ordering perceived by the system's users. This can be prevented by the use of properly synchronized physical clocks. Our theorem showed how closely the clocks can be synchronized.
In a distributed system, it is important to realize that the order in which events occur is only a partial ordering. We believe that this idea is useful in understanding any multiprocess system. It should help one to understand the basic problems of multiprocessing independently of the mechanisms used to solve them.
Proof of the Theorem. For any i and t, let us define C_i^t to be a clock which is set equal to C_i at time t and runs at the same rate as C_i, but is never reset. In other words,

C_i^t(t') = C_i(t) + ∫ from t to t' of (dC_i(τ)/dτ) dτ, for all t' ≥ t.

Note that C_i(t') ≥ C_i^t(t') for all t' ≥ t. (1)
Suppose process P₁ at time t₁ sends a message m to process P₂ which is received at time t₂ with an unpredictable delay ξ_m, where t₂ = t₁ + μ_m + ξ_m. Then for all t ≥ t₂ we have:

C₂^{t₂}(t) ≥ C₂(t₂) + (1 − κ)(t − t₂)   [by (1) and PC1]
          ≥ T_m + μ_m + (1 − κ)(t − t₂)   [by IR2'(b)]
          = C₁(t₁) + μ_m + (1 − κ)(t − t₂).

Hence, with these assumptions, for all t ≥ t₂ we have:

C₂^{t₂}(t) ≥ C₁(t₁) + μ_m + (1 − κ)(t − t₂). (2)
Now suppose that for i = 1, …, n we have t_{i+1} ≥ t_i + μ, and that at time t_i process P_i sends a message to process P_{i+1} which is received at time t_{i+1} with an unpredictable delay less than ξ. Then repeated application of the inequality (2) yields the following result for all t ≥ t_{n+1}:

C_{n+1}^{t_{n+1}}(t) ≥ C₁(t₁) + (1 − κ)(t − t₁) − nξ. (3)

From PC1 we have C₁(t₁) ≥ C₁(t) − (1 + κ)(t − t₁). Combining this with (3) yields

C_{n+1}^{t_{n+1}}(t) ≥ C₁(t) − 2κ(t − t₁) − nξ (4)

for t ≥ t_{n+1}.
For any two processes P and P', we can find a sequence of processes P = P₀, P₁, …, P_n = P' with n ≤ d and communication arcs from each P_i to P_{i+1}. By hypothesis (b), a message with unpredictable delay less than ξ is sent over each of these arcs at least every τ seconds, so inequality (4) holds with n ≤ d and t − t₁ at most about d(τ + μ + ξ).
Now let m be any message timestamped T_m, and suppose it is sent at time t and received at time t'. We pretend that m has a clock C_m which runs at a constant rate such that C_m(t) = T_m and C_m(t') = T_m + μ_m. Then μ_m ≤ t' − t implies that dC_m/dt ≤ 1.
For any time t_x, let C_x be the clock (process clock or message clock) having the largest value at time t_x. Since all clocks run at a rate less than 1 + κ, we have for all i and all t ≥ t_x:

C_i(t) ≤ C_x(t_x) + (1 + κ)(t − t_x). (5)
We now consider the following two cases: (i) C_x is the clock C_q of a process P_q; (ii) C_x is the clock C_m of a message m sent at time t_q with timestamp T_m by a process P_q.

In case (i), (5) directly bounds every clock at time t in terms of a process clock at time t_x. In case (ii), since dC_m/dt ≤ 1, we get:

C_x(t_x) = C_m(t_x)
         ≤ T_m + (t_x − t_q)        [by choice of m and the definition of C_m]
         = C_q(t_q) + (t_x − t_q)   [by IR2'(a)]
         ≤ C_q(t_x) + κ(t_x − t_q)  [by PC1].

Hence C_x(t_x) exceeds the value of the process clock C_q at time t_x by at most κ(t_x − t_q), so case (ii) reduces to case (i) with this small additional error.
Letting t₁ = t_x in case (i), we can combine (4) and (5) to bound the difference between any two clocks in terms of t − t_x. Choosing t and t_x with t − t_x at most about d(τ + ν), where ν = μ + ξ, we can combine these bounds with n ≤ d to obtain

|C_i(t) − C_j(t)| ≲ 2κd(τ + ν) + dξ.

Using the hypothesis that ν = μ + ξ ≪ τ, we can rewrite this as the approximate inequality |C_i(t) − C_j(t)| ≲ d(2κτ + ξ). Since this holds for all i and j, we get PC2 with ε ≈ d(2κτ + ξ), and this holds for all t ≳ t₀ + τd. ∎
Acknowledgement. The use of timestamps to order operations, and the concept of anomalous behavior are due to Paul Johnson and Robert Thomas.