GASNet inter-Process SHared Memory (PSHM) design
------------------------------------------------
Document by:       Dan Bonachea
                   Paul H. Hargrove
                   Filip Blagojevic
Implementation by: Jason Duell
                   Filip Blagojevic
                   Paul H. Hargrove

===============================================================
WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING

The contents of this file are no longer maintained, and user
documentation has moved into the top-level GASNet README.
For current information on using GASNet's PSHM support, please
consult the main GASNet README.

WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
===============================================================

Goal:
----
Provide GASNet with a mechanism to communicate through shared memory among
processes on the same compute node.  This is expected to be more robust than
pthreads (which greatly complicates the Berkeley UPC runtime, and prevents
linking to any numeric libraries that are not thread-safe).  It is also
expected to display lower latency than use of a network API's loopback
capabilities (though the network hardware might provide other benefits, such
as asynchronous bulk memory copy without cache pollution).

We appreciate your feedback related to PSHM (both positive and negative) and
would be happy to work with you to improve PSHM.

Scope:
-----
* GASNet segment via PSHM only supported for SEGMENT_FAST or SEGMENT_LARGE
  (not meaningful for SEGMENT_EVERYTHING mode)
* May eventually support AM-over-PSHM for SEGMENT_EVERYTHING (but not yet)
* Applicable both w/ and w/o pthreads

Terminology:
-----------
* node: each UNIX process running GASNet
* supernode: 1 or more nodes with cross-mapped segments using PSHM support
* supernode peers: nodes which share a supernode

Interface notes:
---------------
* All node processes call gasnet_init(); each is a separate GASNet node
* PSHM is enabled/disabled at configure time and GASNETI_PSHM_ENABLED is
  #defined to either 1 or 0.  Each conduit can then #define GASNET_PSHM to 1
  if it implements PSHM support.
* gasnetc_init() performs supernode discovery, using OS-appropriate (or
  conduit-specific) mechanisms to figure out which nodes are capable of
  sharing memory with which other nodes:
  - unconditionally calls gasneti_nodemapInit() (to drive "discovery")
  - calls gasneti_pshm_init() only if PSHM support is enabled (to set up data)
* MaxLocal/Global return values reflecting the amount of segment space
  divided evenly among the supernode peers, and each node passes a size to
  gasnet_attach reflecting the per-node segment size it wants.
* gasnet_attach takes care of mapping each processor's segments as usual,
  but also maps the segments of supernode peers into each node's VM space
  using OS-appropriate mechanisms (shm_open()+mmap(), shmget()+shmat(), etc.).
* Nodes in a supernode typically have different virtual address mappings of
  the segments on that supernode, and those segments are typically not
  contiguous either.
* Client calls gasnet_getSegmentInfo() to get the location of its segment
  and those of other nodes (as always)
* Client calls gasnet_getNodeInfo() to get the compute node rank and the
  supernode rank (which need not be the same when GASNET_SUPERNODE_MAXSIZE
  is non-zero) for each node.  Only nodes with the same supernode rank as
  the caller are addressable via PSHM.  When PSHM is enabled the "offset"
  field gives the difference between X's address for its own segment and
  the address (if any) at which that segment is mapped in the caller's
  address space.  (See the sketch immediately below.)
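  For illustration only, here is a minimal sketch of how a client might
  combine these two calls to test PSHM addressability and compute its own
  view of a peer's segment.  It assumes a PSHM-enabled build, that the
  gasnet_nodeinfo_t fields are named host/supernode/offset as described
  above, and the sign convention that adding the offset to node X's own
  segment address yields the caller's mapping (check both points against
  the GASNet specification for your release).  peer_segment_base() is a
  hypothetical helper, not a GASNet entry point:

      #include <gasnet.h>
      #include <stdint.h>
      #include <stdlib.h>

      /* Return the caller's mapping of node X's segment base, or NULL if X
       * is not a supernode peer of the caller.  Error handling is elided,
       * and a PSHM-enabled build (GASNET_PSHM nonzero) is assumed. */
      static void *peer_segment_base(gasnet_node_t X) {
        gasnet_node_t nnodes = gasnet_nodes();
        gasnet_nodeinfo_t *ni = malloc(nnodes * sizeof(*ni));
        gasnet_seginfo_t  *si = malloc(nnodes * sizeof(*si));
        void *result = NULL;

        if (ni && si &&
            gasnet_getNodeInfo(ni, nnodes) == GASNET_OK &&
            gasnet_getSegmentInfo(si, nnodes) == GASNET_OK &&
            ni[X].supernode == ni[gasnet_mynode()].supernode) {
          /* ASSUMPTION: caller's view = X's own segment address + offset */
          result = (void *)((uintptr_t)si[X].addr + (uintptr_t)ni[X].offset);
        }
        free(ni); free(si);
        return result;
      }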
* Client may directly load/store into the segments of any node sharing its
  supernode (currently implemented in the Berkeley UPC runtime library)
* Remotely-addressable segment restrictions on gasnet_put/get/AMLong apply
  to the individual segments - i.e. gasnet_put() to an address in the
  segment of node X must give node X as the target node, not some other
  supernode peer

Restrictions:
------------
* gasnet_hsl_t's are node-local, and while they might reside in the segment,
  they may not be accessed by more than one node in a supernode
  - we can/should add a debug-mode check for this (also applies to
    shmem-conduit)
* Use of GASNet atomics in the segment is allowed, but they must not be weak
  atomics (which means using the explicitly "strong" ones in client code).

Closed (previously "Open") questions:
------------------------------------
Q1) Do we need a separate build or separate configure of libgasnet and/or
    libupcr with PSHM enabled/disabled?
A1) Since the set of conduits supported by PSHM was initially a small subset
    of the total list, we chose not to complicate the UPC compiler with this.
    Thus we've chosen to configure everything (UPCR+gasnet) w/ --enable-pshm
    or w/o.  The number of conduits supporting PSHM is now irrelevant since
    in a PSHM-enabled build of GASNet any conduits not supporting PSHM are
    simply built w/o it (as opposed to not built at all, as was once the
    case).

Q2) If we want to use the same build, then how should the
    GASNET_ALIGNED_SEGMENTS definition behave?  It is never true when any
    supernode contains more than one node, but we don't know that until
    runtime.
A2) We assume that you don't use PSHM unless also using > 1 proc/node.
    May also revisit if we don't configure PSHM as a distinct build.

Q3) Can we get away with always connecting segments after all processes are
    created, or do we need to fork after setting up shared memory segments?
    Will drivers & spawners even allow that?  If we decide that a fork is
    required after job launch, then it should definitely be done by the
    conduit, not the client code.  But how would the interface look?
    (this would very likely break MPI interoperability)
A3) All supported conduits are attaching to segments in gasnet_attach().
    We don't need to worry about fork() at all (except that smp-conduit now
    has a fork-based spawner inside gasnetc_init()).

Q4) Does the client code between init/attach need to know the supernode
    associations?  (e.g. to make segsize decisions)
A4) So far we have not seen a need for this (though internal to GASNet we
    do).

Q5) Can/do we still get allocate-on-first-write mapping for the segment?
    - If so, who's responsible for establishing processor/memory affinity
      with first touch?  (probably the client)
A5) We have each node mmap() its own segment before any cross-mapping is
    done, which should ensure locality if the OS does allocation at mmap()
    time.  We currently have the client doing first-touch to deal with the
    case where the OS does page frame allocation on touch rather than at
    mmap() time (see the sketch following these questions).

Q6) Do we ever want to allow multiple supernodes to share a physical node?
    (e.g. to increase segment size or to leverage NUMA affinity)
    - if so, we need an interface to specify this (probably environment
      variables)
A6) The GASNET_SUPERNODE_MAXSIZE env var bounds the number of processes
    which will be joined into a single supernode (with 0 meaning no bound).
    When there are more than this number of processes per node, multiple
    supernodes will be constructed.
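For illustration of A5, here is a minimal sketch of client-side first-touch
of the node's own segment.  It assumes a PSHM-enabled build, that it runs
after gasnet_attach() and before the segment is populated, and that it runs
on the thread/CPU that will primarily use the segment.
first_touch_my_segment() is a hypothetical helper, not part of GASNet:

    #include <gasnet.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Touch one byte per page of this node's own segment so that an OS
     * which defers page-frame allocation until first write allocates the
     * frames with affinity to the touching process. */
    static void first_touch_my_segment(void) {
      gasnet_node_t nnodes = gasnet_nodes();
      gasnet_seginfo_t *si = malloc(nnodes * sizeof(*si));
      if (si && gasnet_getSegmentInfo(si, nnodes) == GASNET_OK) {
        volatile char *base = si[gasnet_mynode()].addr;
        size_t len  = si[gasnet_mynode()].size;
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        for (size_t off = 0; off < len; off += page)
          base[off] = 0;  /* write, not read: a read may map a zero page */
      }
      free(si);
    }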
Open questions:
--------------
* How do we handle 8- or 16-way SMPs on 32-bit platforms where VM space is
  already tight, or OSes where the limit on sharable memory is small?  This
  design would make our per-node segsizes rather small.  Do we want a mode
  where segments are not cross-mapped, but gasnet_put/get can bypass the NIC
  using a two-copy scheme through bounce buffers?  (A purely hypothetical
  sketch of such a scheme appears at the end of this file.)
  - This bounce-buffer mode could potentially also help for EVERYTHING mode
    (without pshm segments), although due to attentiveness issues it may be
    slower than using loopback RDMA
  - Is this mode just the extended-ref using AM-over-PSHM?
* Will there be contention with MPI for resources (and should we care)?

Known Problems / To do:
----------------------
* The mechanism we are using to probe for the maximum segment size works
  fine on a system with plenty of memory, but dies on systems with less.
  The workaround is to set GASNET_MAX_SEGSIZE small enough for a given
  system.
* GASNet conduits known NOT to work:
  - SHMEM conduit does not support PSHM, and there is no reason to think
    that doing so would be constructive.
  Keep in mind that if you use one of these conduits on a platform with the
  necessary support for PSHM, you may still configure with --enable-pshm to
  get PSHM support in other conduits (e.g. SMP and MPI), and non-PSHM
  conduits will still build (they will simply be missing PSHM support).
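Purely to illustrate the two-copy bounce-buffer idea raised under "Open
questions" above (it is NOT implemented): a put to a supernode peer could be
staged through a buffer in shared memory, copied in by the initiator and
copied out by the target (e.g. from an AM handler or progress loop).  The
bounce_buf_t layout and both functions below are hypothetical, and real code
would need memory barriers, flow control, and handling of payloads larger
than one buffer:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical per-peer bounce buffer living in shared memory. */
    typedef struct {
      volatile int full;     /* 0 = empty, 1 = holds a pending put        */
      void        *dst;      /* destination address in the target segment */
      size_t       len;      /* payload length                            */
      char         payload[4096];
    } bounce_buf_t;

    /* Copy #1: initiator stages the data and records the destination. */
    static void bounce_put(bounce_buf_t *bb, void *dst,
                           const void *src, size_t len) {
      while (bb->full) /* spin until the buffer is free (sketch only) */ ;
      memcpy(bb->payload, src, len);
      bb->dst = dst;
      bb->len = len;
      bb->full = 1;          /* real code needs a memory barrier here */
    }

    /* Copy #2: target (AM handler or progress loop) drains the buffer. */
    static void bounce_drain(bounce_buf_t *bb) {
      if (bb->full) {
        memcpy(bb->dst, bb->payload, bb->len);
        bb->full = 0;        /* real code needs a memory barrier here */
      }
    }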