15-618 Project: Simulating Migratory Cache Coherence

Proposal

Summary

Migratory shared data is data that is read and written by a single processor a single time, and then accessed by another processor. Despite being a common access pattern, it is handled inefficiently by traditional cache coherence mechanisms. This project is a simulation and evaluation of Cox and Fowler's1 adaptive cache coherence protocol, which is designed to lower the cost of migratory access without negatively affecting other data access patterns.

Background

Cox and Fowler's mechanism involves two components: a heuristic for detecting migratory blocks, and a sub-protocol for such blocks. A block is identified as migratory if, when written to, there are two cached copies, and the writing processor was not the previous writer. The sub-protocol is an extension to MESI that allows certain redundant bus transactions to be elided for migratory blocks.
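
As a rough illustration (not taken from the paper; the field and function names are hypothetical), the detection condition reduces to a simple check on per-block metadata at each write:

    // Hypothetical per-block metadata consulted by the detection heuristic.
    struct BlockMeta {
        int  cached_copies;  // number of caches currently holding the block
        int  last_writer;    // id of the processor that performed the last write
        bool migratory;      // whether the block is currently treated as migratory
    };

    // Called when processor `writer` writes the block: flag it as migratory
    // if exactly two copies are cached and the writer is not the last writer.
    void on_write(BlockMeta &b, int writer) {
        if (b.cached_copies == 2 && b.last_writer != writer)
            b.migratory = true;
        b.last_writer = writer;
    }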

The Challenge

Despite presenting the same external interface as those of years past, modern CPUs have grown exponentially more complex with the increase in transistor density. Furthermore, the emergence of multicore/multiprocessor systems introduces intricate issues: correctness and progress are extremely difficult to guarantee. Even a simulation, therefore, requires special care and analysis to function properly. Finally, given the years since the protocol was published, and the emergence of more complex interconnect designs and memory hierarchies, it is not a given that the protocol will work or produce desirable results.

Resources

The simulation will be built on top of gem5, a CPU simulator targeted at architecture research. Of particular interest is the "Event-driven memory system", which is designed for modeling RAM and cache hierarchies.

Goals and Deliverables

Plan to Achieve

Hope to Achieve

Platform Choice

Although I have no experience with it, gem5 was suggested by Professor Mowry, and, from its description, it appears to be a suitable basis for this project.

Original Schedule


Milestone

After exploring Gem5, Ruby, and SLICC, it became clear that despite the advantages they offered in depth and flexibility, the complexity of the system would require far more time to implement the new coherence protocol than was available. The decision was therefore made to switch to implementing a cache and cache coherence simulator as a Contech backend. While Contech provides the traces and an interface to them, all elements of the simulator itself are to be written from scratch. Already implemented is a bus/snooping-based MSI protocol for an arbitrary number of single-level caches.

Despite the setbacks, all of the "Plan to Achieve" goals (aside from learning Gem5) are expected to be accomplished. Barring extraordinary successes, however, the "Hope to Achieve" goals appear unlikely to be completed. One new "Hope to Achieve" goal is to produce a simulator with an interface simple and straightforward enough that it can be used by others wishing to explore cache and cache coherence designs. Given the inherently complex nature of the simulated device and the tight integration between the components, it is not clear if this is a reasonable goal.

The poster is expected to show how the two protocols (MESI and migratory MESI) perform on a variety of metrics for an assortment of workloads, as well as a simple analysis determining whether the protocol is successful.

One particular concern in the simulation architecture is the trade-off between implementation speed and flexibility/complexity. While a simple implementation will be faster to write (at least initially), it will be less flexible and perhaps less accurate. Furthermore, given the complexity of the system, it is not clear how to provide abstractions between components that allow for flexibility while maintaining simplicity and completing the project in the allotted time.

Updated Schedule

Report

Summary

Migratory access is a common memory access pattern in which an object in memory is read and written by a succession of processors in sequence. In contrast to more "typical" sharing, once a processor has completed its operations, it is not expected to access the object again in the near future. Common replicate-on-read-miss coherence mechanisms are optimized for shared objects, and the additional overhead required to allow multiple readers of an object is not needed for migratory data. Similarly, migrate-on-read-miss policies perform suboptimally on shared data. Cox and Fowler1 developed an adaptive extension to MESI which heuristically identifies migratory blocks and switches to a migrate-on-miss sub-protocol for these blocks only. Their simulations, run on SPLASH-2, suggested that this protocol could provide a 5%-40% reduction in bus messages, depending on the particular implementation, the size of the cache, and the nature of the workload. The current project attempts to adapt Cox and Fowler's protocol to the capabilities of modern processors and assess its performance.

Background

Cox and Fowler heuristically identify a block as migratory if (a) it is in two caches and (b) the processor writing to it is not the previous writer. Migratory blocks are invalidated not only in response to a BusRdX, but also a BusRd; the new processor then does not have to enter a Shared state and perform another bus request before writing (as it is expected to do), halving the number of bus requests. This is accomplished by the addition of three states to MESI (Figure 1): S2 (shared-2), MC (migratory-clean), and MD (migratory-dirty) (Figure 2). Note that M (modified) is additionally renamed to D (dirty) to reduce confusion with migratory. S2 is similar to S, but once a processor is in this state, there are exactly two cached copies of the block (because S2 can only be entered from D or E due to a bus read, and is exited after another bus read). Note, however, that the new (reading) processor enters S. From this point, if the new processor writes, the block is identified as migratory; the new processor will enter MD instead of D, and further reads from other processors will enter MC instead of E. In addition to MD and MC having permissions similar to D and E respectively, an upgrade from MC to MD can also be done without a bus request.
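
The responder-side behavior can be sketched as follows. This is a minimal illustration of the snooping transitions for a BusRd under the adaptive protocol, assuming a bus-based implementation; the state and function names are illustrative, not taken from the paper or the present code:

    // Illustrative line states for the adaptive protocol (D replaces M).
    enum class State { I, S, S2, E, D, MC, MD };

    // Responder-side transitions for an incoming BusRd, showing where the
    // migratory states diverge from plain MESI: a block held in MD or MC is
    // invalidated rather than shared, so the requester can take it with the
    // expectation of writing and later upgrade without a second bus request.
    State on_bus_read(State s, bool &assert_migratory) {
        switch (s) {
        case State::D:
        case State::E:  return State::S2;         // exactly two copies now exist
        case State::S2: return State::S;          // a third reader: not migratory
        case State::MD:
        case State::MC: assert_migratory = true;  // signal the requester
                        return State::I;          // invalidate instead of sharing
        default:        return s;                 // S and I are unchanged by BusRd
        }
    }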

Figure 1: MESI State Machine. "+" means the signal was asserted in response; "-" means it was not asserted in response; "#" means the signal is asserted with the request.

Figure 2: Migratory Adaptive State Machine. "+" means the signal was asserted in response; "-" means it was not asserted in response; "#" means the signal is asserted with the request.

Approach

After studying Gem5, Ruby, and SLICC, attempting to get the system to compile, and running sample programs under various configurations, it became clear that despite the advantages they offered in depth and flexibility, the complexity of the system would require far more time to implement the new coherence protocol than was available. The decision was therefore made to switch to implementing a cache and cache coherence simulator as a Contech backend. While Contech provides the traces (including several SPLASH-2 workloads) and an interface to them, all elements of the simulator itself were written from scratch. Thus, while all knowledge is useful, and the weeks spent studying Gem5 were not wasted, they did not directly contribute to the project.

The approach taken to architecture design was to initially construct a simplistic, monolithic simulator, roughly equivalent to that from CacheLab in 213. Then, as features were added and complexity grew, components were abstracted into self-contained subsystems as needed.

The system is a cycle-level simulator composed of several components: a task manager, a cache, a coherence manager, a request table, a bus, and memory (Figure 3). The task manager provides a processor-level interface to the Contech trace API. Contech traces take the form of a dependency graph of tasks, where each task belongs to a single processor and contains an array of memory operations. If two tasks are not mutually dependent, any interleaving of their operations is legal. The task manager keeps track of completed and pending tasks, and allows a processor to query it for its next operation; tasks are always issued in the same order they appeared in the original run (by sorting on start timestamp), even though other orders are possible.
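
A sketch of the processor-facing interface is shown below. The names are hypothetical (they are not Contech's API), and the sketch ignores the dependency graph, assuming tasks have already been sorted by start timestamp:

    #include <cstdint>
    #include <deque>
    #include <map>
    #include <vector>

    struct MemOp { uint64_t addr; bool is_write; };

    class TaskManager {
    public:
        // Append the operations of a task belonging to processor `cpu`.
        // Callers are expected to add tasks in start-timestamp order.
        void add_task(int cpu, const std::vector<MemOp> &ops) {
            auto &q = pending_[cpu];
            q.insert(q.end(), ops.begin(), ops.end());
        }

        // Fill `op` with processor `cpu`'s next memory operation; returns
        // false once the processor's trace is exhausted.
        bool next_op(int cpu, MemOp &op) {
            auto it = pending_.find(cpu);
            if (it == pending_.end() || it->second.empty()) return false;
            op = it->second.front();
            it->second.pop_front();
            return true;
        }

    private:
        std::map<int, std::deque<MemOp>> pending_;  // per-processor operation queues
    };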

Figure 3: Simulator Architecture

The cache is responsible for managing the organization of its data and simulating the interface to the processor. It models the number of cycles required for a cache hit and, upon a miss, queries its coherence manager and blocks until a response is tendered. Because only a single-level cache was simulated, an Intel i7 L2 cache was chosen as the model. The cache therefore has 64-byte blocks, 512 sets, is 8-way associative, and has a delay of 11 cycles. Replacement is LRU.
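
For reference, this geometry implies the following address decomposition (64-byte blocks give 6 offset bits, 512 sets give 9 index bits, for a total capacity of 64 × 512 × 8 = 256 KiB); the constant and function names below are illustrative:

    #include <cstdint>

    constexpr unsigned kBlockBits = 6;   // 64-byte blocks
    constexpr unsigned kSetBits   = 9;   // 512 sets
    constexpr unsigned kWays      = 8;   // 8-way associative
    constexpr unsigned kHitDelay  = 11;  // cycles for a hit

    inline uint64_t block_offset(uint64_t addr) { return addr & ((1ULL << kBlockBits) - 1); }
    inline uint64_t set_index(uint64_t addr)    { return (addr >> kBlockBits) & ((1ULL << kSetBits) - 1); }
    inline uint64_t tag(uint64_t addr)          { return addr >> (kBlockBits + kSetBits); }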

The coherence manager is responsible for implementing the cache coherence mechanism (described above), moderating all communications between caches, as well as between caches and memory, via the bus. The request table is shared among the coherence managers, and is used to prevent multiple requests on the same cache line from being issued simultaneously. By keeping the coherence mechanism separate from the cache itself, code duplication is reduced. Although it was originally intended that the memory system would be completely coherence-manager-agnostic, time limitations have prevented this from being the case; nonetheless, the differences have been kept to a bare minimum. As it stands, the only components that are coherence-protocol-specific are the coherence manager, the memory, and the consistency checker (which is not strictly necessary, but is an aid to debugging and increases confidence in correctness). Finally, while the current implementation only simulates a single-level cache, the hierarchical nature of the interfaces should allow transitioning to multi-level caches with minimal reconfiguration. Considerable time and effort was allotted to maintaining this flexibility, in the hope that the system would be usable in a wider variety of circumstances than the current study alone.
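
The request table's role can be illustrated with a minimal sketch (the class and method names are hypothetical): before placing a request on the bus, a coherence manager must claim the line; a conflicting claim forces the second requester to retry later.

    #include <cstdint>
    #include <unordered_map>

    class RequestTable {
    public:
        // Returns true if processor `cpu` now owns the outstanding request
        // for `line_addr`; false if a conflicting request is already in flight.
        bool try_claim(uint64_t line_addr, int cpu) {
            auto it = in_flight_.find(line_addr);
            if (it != in_flight_.end()) return it->second == cpu;
            in_flight_[line_addr] = cpu;
            return true;
        }

        // Called when the request completes, freeing the line for others.
        void release(uint64_t line_addr) { in_flight_.erase(line_addr); }

    private:
        std::unordered_map<uint64_t, int> in_flight_;  // line address -> requester
    };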

The implemented coherence protocol includes additional complexities not present in the above diagrams or standard MESI, such as cache-to-cache transfers. When such a transfer is required, it is tendered by the cache holding the most recently allocated copy of the line, similar to MESIF (although this is maintained as a per-line flag rather than a state). If the most recently allocated line has been invalidated or evicted, memory responds to the request. This is one of the key adaptations made from the original protocol to more closely match the behavior of modern processors.

The bus simulates a split-transaction bus, another optimization. Requests are served first-come-first-served, with a bus cycle taking 10 CPU cycles. Acks, along with shared and migratory asserts, take place synchronously after a request (i.e., they do not need to wait on the queue), while data must be sent as an additional message. The memory latency is 100 CPU cycles, beginning from the cycle the request is received.
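
The timing parameters and message kinds described above might be captured as follows; the names are illustrative and the structure is a sketch, not the actual implementation:

    #include <cstdint>

    constexpr unsigned kBusCycle      = 10;   // CPU cycles per bus cycle
    constexpr unsigned kMemoryLatency = 100;  // CPU cycles from request receipt

    enum class MsgKind {
        BusRd,   // read request
        BusRdX,  // read-for-ownership request
        Data     // data reply, sent as a separate message on the split bus
    };

    struct BusMsg {
        MsgKind  kind;
        uint64_t line_addr;
        int      sender;      // requesting processor, or -1 for memory
        unsigned delay_left;  // remaining CPU cycles before broadcast
    };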

In each simulated cycle, every component updates its state at the granularity of one CPU cycle. First, all caches do so, in numeric order. If a cache hit is pending, its timer is decremented, and if the delay is complete, the hit is completed. After this, if there are no pending hits or coherence requests, the task manager is queried for a new action. If such an action exists, it is processed: the cache searches for the line; a hit is handled locally, while a miss is passed to the coherence manager.

Next, each coherence manager updates its state, in numeric order. It begins by reading all buffered messages; if an ack is required for a line not currently servicing a hit, it is sent immediately (and the line invalidated if required); otherwise, the ack is saved to be reexamined later. Next, if a coherence request is pending, it is examined. If a retransmission was required due to a conflicting request in the request table, and the conflict no longer exists, the request is retransmitted. Otherwise, if a pending request has completed, the cache is updated with the new data. Note that communication between the cache and coherence manager takes the form of function calls, not messages; updates to each other are done during the requester's update, not the requestee's.

After the coherence managers, the bus processes its state. If a message is currently pending, its timer is decremented, and if the delay is complete, it is broadcast to all coherence managers and memory, and the bus begins waiting for acks. Acks are a discretized simulation of the response lines, and allow the requester to know when other processors have finished snooping, and whether the block is shared or migratory. As such, they occur synchronously after a broadcast and do not require moderation or a delay. However, if a cache or memory must send data, it takes the form of an additional message, requiring both. Note, however, that bus messages are not pipelined.

Lastly, memory performs its updates. It walks its pending memory request table, decrements timers, and, if any request is complete, broadcasts the data. Next, it examines incoming bus messages. If a new request has been sent, it adds the request to the pending memory request table and begins the delay. If a cache has committed to sending the data, the request is removed. Although memory does not enforce a limit on the number of pending requests, each processor will not issue more than one request at a time.
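
Putting the above together, the per-cycle update order can be sketched as a single driver function (component and method names are illustrative stubs, not the actual code):

    #include <vector>

    // Stub component types; each is assumed to expose a cycle() method that
    // advances its state by exactly one CPU cycle.
    struct Cache            { void cycle() { /* hit timers, issue next op    */ } };
    struct CoherenceManager { void cycle() { /* acks, pending requests       */ } };
    struct Bus              { void cycle() { /* broadcasts, ack collection   */ } };
    struct Memory           { void cycle() { /* pending memory request table */ } };

    // One simulated CPU cycle: caches first (in numeric order), then the
    // coherence managers, then the bus, then memory.
    void tick(std::vector<Cache> &caches, std::vector<CoherenceManager> &mgrs,
              Bus &bus, Memory &mem) {
        for (auto &c : caches) c.cycle();
        for (auto &m : mgrs)   m.cycle();
        bus.cycle();
        mem.cycle();
    }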

Experimental Setup

The simulator was run on a number of traces representing the memory accesses of a variety of workloads, with the adaptive sub-protocol enabled and disabled. Statistics were collected on the number of write hits and misses, read hits and misses, evictions, bus messages, and the total number of cycles. As the primary motivation was to reduce the number of bus messages, and time for analysis was limited, the analysis focused on these data.
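
For illustration, the collected counters amount to a small per-run record such as the following (a sketch; field names are hypothetical):

    #include <cstdint>

    struct Stats {
        uint64_t read_hits    = 0;
        uint64_t read_misses  = 0;
        uint64_t write_hits   = 0;
        uint64_t write_misses = 0;
        uint64_t evictions    = 0;
        uint64_t bus_messages = 0;
        uint64_t total_cycles = 0;
    };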

In contrast to Cox and Fowler's results, the adaptive protocol had a minimal effect on the number of bus messages (Figure 4).

Figure 4: Percent change in number of bus messages by workload and coherence protocol (lower is better)

Precisely why this is the case is not clear. While the possibility of a bug in the present implementation cannot be disregarded, other factors may also have an effect. According to an analysis by Barrow-Williams et al.2, several of the tested workloads, including ffm, volrend, and blackscholes, had moderate levels of migratory access. Curiously, Cox and Fowler found an improvement of 10-30% for Cholesky, but Barrow-Williams et al. found Cholesky to have extremely low levels of migratory access. It is clear that migratory access is not a boolean condition, and that other aspects of data access can have a strong effect on efficiency.

While Cox and Fowler presented both bus-based and directory-based implementations of the protocol, they primarily reported results for the directory-based implementation, and the results for the bus-based implementation were less impressive. Furthermore, they did not report results for any workload we were able to run.

Finally, there are differences between the original protocol and the adaptation implemented here. As previously mentioned, in the current implementation, the cache with the most recently allocated line services a miss, in a mechanism similar to MESIF. Additionally, the original implementation used an additional bus message, bus invalidate, which was removed in the process of adapting Cox and Fowler's protocol to the MESI protocol we learned in class (which does not have bus invalidate). Simple tests early on suggested that eliding this message would not significantly affect performance, but the tests were extremely limited in scope; perhaps the message was elided too hastily.

Surprises, Lessons Learned, and Conclusion

While it perhaps should not have been surprising, the increase in complexity with the introduction of simple features outstripped expectations. This increase also interacted poorly with the attempt to maintain generality and functional separation along interfaces. In particular, the subsystems were more tightly coupled than expected, such that small changes in one module often required modifications to another, even when they should not have been logically connected. When such connections were discovered, interfaces were redesigned to eliminate them; the ability to do this is one advantage of the iterative architecture design previously described. However, modifying the interfaces invariably both requires extensive effort (not least because all coherence protocols must be updated) and increases complexity. While the ability to run multiple protocols was useful (particularly for regression testing while making changes), and flexibility is always desirable, particularly for a system designed with the potential for long-term use, it is not clear that the extra time and effort was worth the trade-off for a project that had to be completed in such a limited amount of time.

Building the system from scratch was a tremendously instructive experience. It is without a doubt the most complex program we have ever written. It gave much deeper insight into the issues involved in designing both coherence protocols and hardware systems more generally than could be gained by modifying an extant simulator, as we were responsible for all details of operation. While we discussed many such facets in class, they were examined separately, almost as self-contained units, whereas in the real world they invariably all interact with each other. These interactions were almost wholly unexpected.

With regard to the results, it is moderately surprising that they did not reflect Cox and Fowler's. While it was of course hoped that they would be similar, it would be unrealistic to require it after only several weeks of work, which is why the proposal only required the construction of the simulator and a comparison, not the reproduction of similar results. Nevertheless, it would have been desirable for more time to have been available so that a more thorough analysis could have been made.

As it stands, definite conclusions are difficult to draw. It would certainly be premature to judge the efficacy of the protocol, given the nascence of the simulation and the rudimentary analysis. However, it can be concluded that the interactions between memory access pattern, coherence protocol, and cache implementation are extremely complex, and that many factors beyond the high-level features of each influence the effectiveness of a protocol and its implementation.

There are additional limitations to the faithfulness of the simulation. Each cache can only process one access at a time: if a coherence request is pending, for example, no further accesses can be made until it has completed, even if they are hits. Furthermore, buffers are of unlimited size, which simplifies away factors limiting forward progress but reduces accuracy.

Future work should include extending the consistency checker, microbenchmarks on "idealized" traces representing various memory access patterns, and further characterization of the tested workloads.


  1. Cox, A. L., & Fowler, R. J. (1993, June). Adaptive cache coherency for detecting migratory shared data. In ACM SIGARCH Computer Architecture News (Vol. 21, No. 2, pp. 98-108). ACM.

  2. Barrow-Williams, N., Fensch, C., & Moore, S. (2009, October). A communication characterisation of Splash-2 and Parsec. In 2009 IEEE International Symposium on Workload Characterization (IISWC) (pp. 86-97). IEEE.