The Design and Implementation of Locksmith, a Dynamic Race Condition Debugger

The Implementation of Linux Locksmith, a Dynamic Race Condition Detector

Even for experienced programmers, programming with threads introduces a set of challenges that do not arise in the sequential programming paradigm. The main difference between debugging sequential programs and debugging concurrent programs is that most bugs in sequential programs are repeatable. In multi-threaded programs, bugs are much harder to reproduce, and therefore to diagnose, because of pseudo-random variations in the way the operating system switches between threads. A common bug that plagues multi-threaded programs is the race condition, in which shared data is accessed without proper synchronization, making the outcome of the “race” almost entirely dependent on the timing of context switches. A tool to aid in debugging multi-threaded programs developed on Linux would make the platform even more attractive, especially in the server environment.

We have developed a dynamic race condition detector called Locksmith, which is based on a modified version of the Lockset algorithm used in Eraser, a dynamic race detector developed for Digital Equipment Corporation’s Digital Unix [1]. When a multi-threaded executable written with the standard pthreads package [2] is instrumented by Locksmith, it issues error messages during execution that pinpoint the exact source code lines involved in a race condition. With this information, the cause of the bug can be quickly identified.

Locksmith identifies the source code lines involved in a race condition by tracking the memory reads and writes performed by each thread and noting the locks held at the time of each access. If a lock protects a region of memory, then every thread should hold that lock before it accesses that region. If a thread ever accesses the data without the correct locks, a potential race condition exists. However, the source code contains no specification of which locks the programmer intends to protect which data. Therefore, the Lockset algorithm infers, for each memory address range, the set of locks that could potentially be protecting it. If this set ever becomes empty, Locksmith reports an error.

The pseudo-code for the simplest version of the Lockset algorithm is as follows.

Let locks_held(t) be the set of locks held by thread t, and let possible_locks(m) be the set of candidate locks that could be protecting memory address m. Initially, possible_locks(m) is in a virgin (uninitialized) state.

On each access to memory address m by thread t,

If possible_locks(m) is virgin then possible_locks(m) := locks_held(t);

Else possible_locks(m) := possible_locks(m) intersection locks_held(t);

If possible_locks(m) = Empty Set then report a possible race condition.

Several refinements to this simple algorithm prevent common false alarms. First, the simple algorithm would report unprotected accesses to data even when that data is used exclusively by a single thread. Second, programmers may safely initialize data without holding any locks before multi-threaded processing begins, yet the simple algorithm would flag these initializing accesses. Finally, unprotected accesses to read-only data would cause a false alarm, even though concurrent reads cannot race with one another.

Locksmith prevents these false alarms by tracking an additional state for each memory address: exclusive, read-only, or modified. Memory addresses begin in the exclusive state, and no errors are reported until at least two threads have accessed the data. When a memory address leaves the exclusive state, it is placed in the read-only state until some thread writes to it. In the read-only state, possible_locks(m) is updated, but no errors are reported. Once a write occurs, the memory address is placed in the modified state and the simple Lockset algorithm described above is followed.

The current implementation of Linux Locksmith consists of two parts. The first is the Locksmith runtime library, which contains the data structures and stubs needed to implement the Lockset algorithm. The runtime library is implemented entirely in C. Internally, each thread is identified by the pthread_t value returned by pthread_self, and each lock is represented by its address, since all pthread calls reference a lock through the address of its pthread_mutex_t. All source-level debugging information is obtained from the output of gcc -g, the same information normally used by gdb.

The second part is a program parser that determines where the code accesses memory and where it acquires and releases locks, and inserts calls to the Locksmith library at those points. The parser operates on the program's assembly code. Its output is assembly code that has been modified to call the library stubs on memory accesses and lock operations, but is otherwise identical to the original. The only changes required to the source code are two function calls: one to create_locksmith at the beginning of the program and one to destroy_locksmith at the end.

We have used Linux Locksmith to correctly detect race conditions and report their locations in the source code. With some additional testing and program “packaging”, we believe it would be an extremely useful tool for debugging multi-threaded applications on the Linux platform.

[1] S. Savage et al., “Eraser: A Dynamic Race Detector for Multithreaded Programs,” Proc. of the Sixteenth ACM Symposium on Operating Systems Principles, October 1997.

[2] B. Nichols et al., Pthreads Programming: A POSIX Standard for Better Multiprocessing, O’Reilly, September 1996.