A REVIEW OF SHARED MEMORY SYSTEMS
INTRODUCTION
Shared memory is one of the most popular parallel processing models because of the ready availability of bus-based shared-memory systems. Bus-based systems are relatively easy to build but, because of their limited bus bandwidth, do not perform well with fast processors, large numbers of processors, or algorithms that have a poor cache hit ratio. In fact, most shared memory systems are limited to 10-20 processors or to slow processors, and their main application is executing a nonparallel multi-user Unix load. On the other hand, the system we are building, called PLUS, is aimed at efficiently executing a single multithreaded process by using distributed memories and hardware-supported memory coherence and synchronization mechanisms.
In order to maintain reasonable memory performance with a large number of fast processors, it is necessary to distribute the memory among the processors and connect them with a scalable communication mechanism. The implementation of such a physically distributed, but logically shared, memory system is difficult, because communication latency hinders fast access to remote memories. The operations that cause performance degradation are remote memory access and synchronization.
SHARED-MEMORY SYSTEM
A shared-memory system consists of at least one multi-core CPU sharing the memory available on the system, i.e., all CPUs can access the same physical address space. Examples of such systems are modern single- or multi-socket multi-core workstations. Although these systems provide a single memory address space to the programmer, the supported memory access patterns vary fundamentally. The most commonly used patterns are uniform memory access (UMA) [100] and cache-coherent (cc) non-uniform memory access (NUMA) [131]. UMA is typical of single multi-core workstations, whereas ccNUMA is widely used by multi-socket systems. Basically, UMA offers the same access performance to all participating cores and memory locations via a single bus.
Therefore, UMA systems are also frequently referred to as symmetric multiprocessing systems. Such a symmetric approach does not scale reasonably with the introduction of additional CPUs, as the central bus is quickly saturated. This fact gave rise to the NUMA architecture. On NUMA systems memory is physically distributed, but logically shared. A NUMA system consists of several so-called NUMA nodes, typically representing a CPU and its local memory. The NUMA nodes are connected by interconnect links, such as AMD's HyperTransport or Intel's QuickPath technology. Consequently, latency and possibly bandwidth between the cores vary depending on the physical location (also called NUMA effects).
From a software point of view, thread-based approaches are usually - but not exclusively - used on shared-memory systems. Several options are available, such as OpenMP, Intel Cilk Plus, and Pthreads. Although thread-based programming is considered rather intuitive and easy to pick up, this very convenience makes it hard to achieve reasonable scaling at high core counts. One challenge is the shared-memory model itself: all threads can access the same memory address space. Although convenient, it does not force the developer to handle memory locality, so NUMA effects easily arise and significantly reduce scalability. Another typical problem with shared-memory approaches is over-utilization, meaning that significantly more computation threads run concurrently than CPU cores are available. Severe over-utilization, for instance more than three times as many threads as cores, leads to considerably reduced execution speed and thus needs to be avoided.
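To make the threading model concrete, the sketch below uses OpenMP, one of the approaches named above, to parallelize a simple array update in C. The array size, the static schedule, and the choice to match the thread count to the number of available cores (to avoid over-utilization) are illustrative assumptions; the first-touch initialization loop is one common way to keep pages local to the NUMA node of the thread that later uses them.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 10000000   /* illustrative problem size */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        if (!a) return 1;

        /* Avoid over-utilization: run one thread per available core. */
        omp_set_num_threads(omp_get_num_procs());

        /* First-touch initialization: each thread touches the pages it
           will later work on, so a NUMA-aware OS allocates them locally. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* The computation reuses the same static schedule, so each thread
           mostly accesses memory on its own NUMA node. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[N-1] = %f\n", a[N - 1]);
        free(a);
        return 0;
    }

Built with an OpenMP-capable compiler (for example, cc -fopenmp), the two loops keep each thread working mostly on memory it touched first, which limits the NUMA effects discussed above.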
SHARED MEMORY PROCESS
In its simplest form, shared memory is a low-level interprocess communication mechanism on a single server that enables clients and servers to exchange data and instructions through main memory. Performance is much faster than using system services such as operating system data buffers.
For example, a client needs to exchange data with the server for modification and return. Without shared memory, both client and server use operating system buffers to accomplish the modification and exchange.
The client writes to an output file in the buffer, and the server writes the file to its workspace. When the server completes the modification, the process reverses. Each time this occurs, the system generates two reads and two writes between client and server.
With shared memory, the client writes its data directly into RAM and sets a semaphore value to flag it for the server's attention. The server makes the modifications directly in main memory and alerts the client by changing the semaphore value. There is only one read and one write per communication, and the read/write is considerably faster than using system services.
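The handshake just described can be sketched with POSIX shared memory and process-shared semaphores stored inside the shared segment. The segment name /demo_shm, the struct layout, and the assumption that the server has already created the segment and initialized both semaphores with sem_init(..., 1, 0) are illustrative details, not part of the description above; only the client side is shown, with minimal error handling.

    /* Client side of the shared-memory handshake (illustrative sketch). */
    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct region {
        sem_t to_server;   /* client posts: "data is ready for you" */
        sem_t to_client;   /* server posts: "modification is done"  */
        char  data[4096];  /* payload modified in place, no copies  */
    };

    int main(void)
    {
        int fd = shm_open("/demo_shm", O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }

        struct region *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        if (r == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(r->data, "request");   /* one write, directly into RAM  */
        sem_post(&r->to_server);      /* flag the server's attention   */

        sem_wait(&r->to_client);      /* wait for the modification     */
        printf("server replied: %s\n", r->data);   /* one read        */

        munmap(r, sizeof *r);
        close(fd);
        return 0;
    }

Linking typically requires -pthread (and -lrt on older C libraries) for the shared memory and semaphore calls.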
Shared Memory and Single Microprocessor General Flow
- Server starts.
- Server uses a system call to request a shared memory key and stores the returned ID.
- Server issues another system call to attach shared memory to the server's address space.
- Server initializes the shared memory.
- Client starts.
- Client requests shared memory.
- Server provides the unique memory ID to the client.
- Client attaches the shared memory segment to its address space and uses the memory.
- When complete, the client detaches all shared memory segments and exits.
- Using two more system calls, the server detaches and removes the shared memory.
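A minimal sketch of the server side of this flow, using the System V calls (shmget, shmat, shmdt, shmctl) that the steps above describe; the key value 0x1234 and the 4 KiB segment size are illustrative assumptions.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* Request a shared memory segment for a key and store the ID. */
        int shmid = shmget((key_t)0x1234, 4096, IPC_CREAT | 0600);
        if (shmid < 0) { perror("shmget"); return 1; }

        /* Attach the segment to the server's address space. */
        char *mem = shmat(shmid, NULL, 0);
        if (mem == (char *)-1) { perror("shmat"); return 1; }

        /* Initialize the shared memory; a client that obtains the same
           ID and attaches it will see this data. */
        strcpy(mem, "hello from server");

        /* ... serve clients here ... */

        /* Two more system calls: detach, then remove the segment. */
        shmdt(mem);
        shmctl(shmid, IPC_RMID, NULL);
        return 0;
    }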
MULTI-PROCESSOR SHARED MEMORY
This simplified scheme works for single microprocessors, but memory sharing among multiple microprocessors is more complex, especially when each microprocessor has its own memory cache. Popular approaches include uniform memory access (UMA) and non-uniform memory access (NUMA). Distributed memory sharing is also possible, although it uses different sharing technology.
UMA: Shared memory in parallel computing environments
In parallel computing, multiprocessors use the same physical memory and access it in parallel, although the processors may have private memory caches as well. Shared memory accelerates parallel execution of large applications where processing time is critical.
NUMA: Shared memory in symmetric multiprocessing systems (SMP)
NUMA configures SMP systems to use shared memory. SMP is a clustered architecture that tightly couples multiple processors in a share-everything single-server environment with a single OS. Because each processor uses the same bus, intensive operations slow down performance and increase latency.
NUMA replaces the single system bus by grouping CPU and memory resources into configurations called NUMA nodes. Multiple high-performing nodes operate efficiently within clusters, allowing each CPU to treat its assigned node as a local shared memory resource. This relieves the load on the bus by distributing it across flexible, high-performance memory nodes.
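On Linux, node-local placement can also be requested explicitly. The hedged sketch below uses the libnuma library, a platform assumption rather than something the text prescribes, to allocate a buffer on a specific NUMA node so that the CPUs grouped into that node treat it as local memory.

    /* Node-local allocation with Linux libnuma (compile with -lnuma). */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        printf("NUMA nodes: %d\n", numa_max_node() + 1);

        /* Allocate 1 MiB on node 0: the pages are local to that node's
           CPUs, so their accesses avoid the interconnect. */
        size_t size = 1 << 20;
        void *buf = numa_alloc_onnode(size, 0);
        if (!buf) { fprintf(stderr, "allocation failed\n"); return 1; }

        /* ... use buf from threads pinned to node 0 ... */

        numa_free(buf, size);
        return 0;
    }

Threads that work on the buffer would typically also be pinned to the same node so that the locality is actually exploited.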
Shared memory in distributed systems
Distributed shared memory uses a different technology but has the same result: separate computers share memory for better performance and scalability. Distributed shared memory enables separate computer systems to access each other's memory by abstracting it from the server level into a logically shared address space.
The architecture can either separate memory and distribute the parts among the nodes and main memory, or can distribute all memory between the nodes. Distributed memory sharing uses either hardware (network interfaces and cache coherence circuits) or software. Unlike single or multiprocessor shared memory, distributed memory sharing scales efficiently and supports intensive processing tasks such as large complex databases.
CHALLENGES OF SHARED MEMORY
Shared memory programming is straightforward on a single CPU or tightly clustered CPUs: all processors share the same view of the data, communication between them is very fast, and programming is a relatively simple affair.
However, most multiprocessor systems assign individual cache memory to their processors in addition to main memory. Cache access is considerably faster than using RAM, but can cause conflicts and data degradation if the same system is also using shared memory. There are three main issues for shared memory in cache memory architectures: degraded access times, data incoherence, and false sharing.
Degraded access time: Several processors accessing the same memory location at the same time cause contention and performance slowdowns. For this reason, non-distributed shared memory systems do not scale efficiently beyond roughly ten processors.
Data incoherence: Multiple processors with memory sharing typically have individual memory caches to speed up performance. In such a system, two or more processors may hold cached copies of the same memory location. If both processors modify the data without being aware of the other cache's modifications, the data that should be identical, i.e., coherent, becomes incoherent, which can lead to corruption when it is written back to main memory.
Cache coherence: Cache coherence protocols manage these conflicts by synchronizing data values across the multiple caches. Whenever a cache propagates modified data back to the shared memory location, the data remains coherent. Cache coherence preserves the benefit of high-performance cache memory while supporting memory sharing.
False sharing: This memory usage pattern degrades performance and occurs in multiprocessor systems with shared memory and individual processor caches. Caching works by reading data from the requested memory location plus nearby locations, in units of cache lines (typically 64 bytes). The problem arises when processors access a shared cache line that contains modifiable data, or variables. Even if each processor modifies only its own variable and never touches the other's data, every write invalidates the line in the other caches, which must then reload the entire block. The cache coherency protocol does not distinguish this case from true sharing, so the affected processes bear the overhead. Every write to such shared memory locations forces traffic onto the main bus, degrading performance and wasting bandwidth.
Programming is the solution: cache padding inserts meaningless bytes between a variable and its neighbors, so that each 64-byte cache line holds only the data one processor actually writes. Cache coherency still performs the synchronization, but other caches are no longer forced to reload their blocks.
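The padding idea can be sketched in C11 with two threads that each increment their own counter. The counter layout, the assumed 64-byte line size, and the use of Pthreads are illustrative assumptions; aligning each counter to its own cache line keeps the two writers from invalidating each other's lines.

    /* False-sharing avoidance by padding/aligning counters to cache lines
       (illustrative sketch; compile with -pthread). */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64   /* assumed line size */

    struct padded_counter {
        _Alignas(CACHE_LINE) volatile long value;  /* one counter per line */
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg)
    {
        struct padded_counter *c = arg;
        for (long i = 0; i < 100000000L; i++)
            c->value++;        /* writes stay within this counter's line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }

Without the alignment, the two counters would share one 64-byte line, and every increment by one thread would force the other thread's cache to reload the line, which is exactly the false-sharing pattern described above.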
ADVANTAGES OF SHARED MEMORY
Multiple applications share memory for more efficient processing.
- Efficiently passes data between programs to improve communication.
- Works in single microprocessor systems, multiprocessor parallel or symmetric systems, and distributed servers.
- Avoids redundant data copies by managing shared data in main memory and in caches.
- Minimizes input/output (I/O) processes by enabling programs to access a single data copy already in memory.
- For programmers, the main advantage of shared memory is that there is no need to write explicit code for processor interaction and communication.
- Cache coherence protocols protect shared memory against data incoherence and performance slow-downs.
REFERENCES
- Bisiani, R., Nowatzyk, A., and Ravishankar, M. Coherent Shared Memory on a Message Passing Machine. Tech. Rept. CMU-CS-88-204, School of Computer Science, Carnegie Mellon University, December 1988.
- Gottlieb, A. The NYU Ultracomputer - Designing an MIMD Shared Memory Parallel Computer. IEEE Trans. on Computers C-32, 2 (February 1983), 175-189.
- https://www.enterprisestorageforum.com/storage-hardware/shared-memory.html