# Preface

```
NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE
NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE
NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE

This file documents the "Memory Kinds" feature of GASNet-EX, intended only
for use by developers with a specific interest in this feature.  Other
client developers should limit themselves to the interfaces and behaviors
given in docs/GASNet-EX.txt and the GASNet-1 specification.

While it is intended that features and capabilities described here will
make their way into the GASNet-EX specification, the APIs in this file and
their implementation are only a prototype.  All aspects of the APIs and
capabilities first introduced in this file are subject to non-trivial
changes before the prototype stage ends.

NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE
NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE
NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE
```

# Introduction

This document provides a detailed status of the Memory Kinds feature implementation and is updated as that status changes.

This document makes references to an external document, which is available on request from gasnet-staff@lbl.gov:

+ GASNet-EX API Proposal: Memory Kinds, Revision 2020.11.0

For brevity, this will be referenced as simply "the API Proposal".

# General Usage

By default, the `configure` script does not enable support for any non-host memory kinds.  Use of the new configure option `--enable-memory-kinds` enables probes for the necessary headers and libraries for all available device "kinds" (presently only "CUDA_UVA") and enables the prototype implementation of memory kinds if such support is found.  This is the recommended mechanism to enable memory kinds support, since it will enable additional kinds as they are added.

For more detailed control over a given kind (such as "cuda-uva") the following take precedence over `--(en|dis)able-memory-kinds`:

+ `--disable-kind-[name]` disables probing for support for the named kind.
+ `--enable-kind-[name]` probes for support for the named kind, with failure of the probe being a fatal `configure` error.
+ `--enable-kind-[name]=probe` probes for support for the named kind, with failure of the probe being non-fatal (the same behavior requested for all kinds by using `--enable-memory-kinds`).

On our main development platforms, the logic in `configure` is sufficient to locate the required headers and libraries with no additional options.  However, the following options (and environment variable equivalents) are available to guide the probe if needed:

+ `--with-cuda-home=...` or `CUDA_HOME`
+ `--with-cuda-cflags=...` or `CUDA_CFLAGS`
+ `--with-cuda-libs=...` or `CUDA_LIBS`
+ `--with-cuda-ldflags=...` or `CUDA_LDFLAGS`

Generally, it is sufficient to provide the installation prefix of the CUDA toolkit using either `--with-cuda-home=...` or `CUDA_HOME`, since the others all have sensible defaults once the installation prefix is known.

Clients can use the preprocessor identifier `GASNET_HAVE_MK_CLASS_CUDA_UVA` (defined to `1` or undefined) to determine whether support for the CUDA_UVA memory kind was detected at configure time.  The preprocessor identifier `GASNET_HAVE_MK_CLASS_MULTIPLE` is more general, indicating whether configure detected support for *any* memory kind other than host memory.
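As a hedged illustration (not taken from the GASNet-EX sources; the `MY_APP_USE_GPU_SEGMENTS` macro is hypothetical), a client might select its device-segment code path using these identifiers roughly as follows:

```
#include <gasnetex.h>

#if GASNET_HAVE_MK_CLASS_CUDA_UVA
  /* This build can create CUDA_UVA device segments (on a supported conduit). */
# define MY_APP_USE_GPU_SEGMENTS 1
#elif GASNET_HAVE_MK_CLASS_MULTIPLE
  /* Some non-host memory kind other than CUDA_UVA is available. */
# define MY_APP_USE_GPU_SEGMENTS 0
#else
  /* Host memory (GEX_MK_HOST) is the only kind available in this build. */
# define MY_APP_USE_GPU_SEGMENTS 0
#endif
```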
# Supported Configurations

All current memory kinds implementation work is limited to devices with the CUDA Driver API and Unified Virtual Addressing (UVA).  This should include all modern NVIDIA GPUs and CUDA Toolkit versions.

Support is further limited to ibv-conduit on Linux, and only when using Mellanox InfiniBand hardware and drivers with support for "GPUDirect RDMA" (GDR).  In some cases additional optional software, such as "nvidia_peer_memory", must be installed.  Please consult Mellanox documentation for assistance in determining what driver software is needed for your specific hardware and Linux distribution.  The Open MPI and MVAPICH projects also have some documentation regarding deployment of GPUDirect RDMA for their respective MPI implementations.

Furthermore, only the `GASNET_SEGMENT_FAST` segment mode is supported.  This is the default segment mode, but it can be specified explicitly at configure time using the `--enable-segment-fast` option.  To be clear: `--enable-segment-large` and `--enable-segment-everything` configurations of ibv-conduit do not support the memory kinds work in the current implementation.

To the best of our knowledge, Mellanox currently disclaims support for GPUDirect RDMA on aarch64 (aka ARM64 or ARMv8), and NVIDIA does not support UVA on ILP32 platforms.  Therefore, this work currently supports only x86-64 and ppc64le.

For any configuration that does not meet all of the configure-testable requirements outlined above, the memory kinds support in the current prototype implementation will be disabled (or `configure` will fail if it was passed `--enable-kind-cuda-uva`).  Specifically, `GASNET_HAVE_MK_CLASS_CUDA_UVA` will be undefined and attempts to create device segments will fail at runtime.  Future releases are expected to eventually include a "reference implementation" that will allow creation of device segments on a wider range of platforms and transparently stage transfers through host memory bounce buffers, but that is not yet present.

If support is enabled at configure time, then the implementation will attempt to determine at runtime whether GDR support is present.  If it is not, then `gex_MK_Create()` will fail with an appropriate message.

## Limits on GPU segment size

Modern NVIDIA GPUs with GPUDirect utilize a "Base Address Register" (BAR) mechanism to map device memory into the PCIe address space.  The amount of memory which can be mapped by GASNet-EX as a GPU segment is limited by the `BAR1` capacity of your GPU, which can also be severely limited by the motherboard and/or BIOS.  To query the BAR1 capacity of your GPU, run `nvidia-smi -q` and look for the "Total" value in the "BAR1 Memory Usage" section for an (optimistic) maximum on the amount of memory which can be mapped.

This limit is per GPU, across all processes and runtimes on a given node.  Thus a portion is consumed by each GASNet-EX segment created on a given node, as well as by other uses of GPUDirect RDMA such as an MPI implementation.  Typically a few tens of MB are also reserved by the driver itself.

## GDR and Multi-rail

Though our test and development systems have multi-rail InfiniBand networks, there are currently unresolved issues with respect to use of multiple rails within a given process.  Consequently, support is currently limited to single-rail configurations, which can be achieved either at configure time by using the option `--disable-ibv-multirail`, or at runtime by setting the environment variable `GASNET_IBV_PORTS` to name a single valid port.
Both mechanisms are documented in `ibv-conduit/README`.  For the most up-to-date information on this issue see [bug 4148](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4148).

Additionally, the BAR1 usage (described earlier in this document) has been observed to be *per-HCA*, and thus use of multiple rails may limit the size of GPU segments.

## Loopback

The current implementation does not handle RMA operations between combinations of host and GPU memory within the same process (loopback), though this will be supported in the future.  It should be noted that this temporary limitation precludes use of GASNet only for transfers which could alternatively be performed using `cudaMemcpy()` or `cuMemcpy{DtoH,HtoD,DtoD}()` (possibly with some CUDA calls to enable peer-to-peer access).  Therefore, it is recommended practice (and will *continue* to be so when this limitation is removed) that clients bypass GASNet-EX to perform such transfers.

For the most up-to-date information on this issue see [bug 4149](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4149).
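As a hedged illustration of the recommended bypass (this is plain CUDA Driver API usage, not a GASNet-EX interface; it assumes the calling thread has a current CUDA context, and the function name is hypothetical):

```
#include <assert.h>
#include <stddef.h>
#include <cuda.h>

/* Copy nbytes between two device buffers owned by this process, rather than
 * issuing a (currently unsupported) loopback GASNet-EX RMA operation. */
static void local_device_copy(CUdeviceptr dst, CUdeviceptr src, size_t nbytes) {
  CUresult rc = cuMemcpyDtoD(dst, src, nbytes);  /* enqueued on the default stream */
  assert(rc == CUDA_SUCCESS);
  rc = cuCtxSynchronize();                       /* wait for the copy to complete */
  assert(rc == CUDA_SUCCESS);
}
```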
## GDR and PSHM

Currently the implementation is sufficient (when using supported hardware, drivers and libraries) to perform RMA operations between combinations of host and GPU memory in which the two involved endpoints are in distinct "nbrhds".  There is a temporary limitation (in addition to the no-loopback limitation, above) which prohibits intra-nbrhd RMA operations.  In other words, there is no support for RMA operations in which one or both endpoints has a GPU memory segment and the two processes are in the same shared memory domain (aka "nbrhd" in GASNet-EX documentation).  Currently, RMA transfers involving GPU memory between processes in the same compute node are supported only when PSHM is "inactive" (meaning either `--disable-pshm` at configure time, or `GASNET_SUPERNODE_MAXSIZE=1` in one's environment at runtime).

For the most up-to-date information on this issue see [bug 4148](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4148).

## Premature local completion of GDR Puts from device memory

In addition to the multi-path issues described above (under the "GDR and Multi-rail" and "GDR and PSHM" sub-headings), Put operations with their source in device memory have been observed to signal local completion prior to the actual transfer of the data (as can be demonstrated by writing to the source buffer after sync and observing the new data arrive in the destination buffer).  This is believed to be an issue with how ibv and GDR interact, and we are hopeful that a workaround can be implemented in a future release.

For the most up-to-date information on this issue see [bug 4150](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4150).

## GDR and small Gets into device memory

As documented [here](https://github.com/linux-rdma/rdma-core/blob/master/providers/mlx5/man/mlx5dv_create_qp.3.md), current versions of libibverbs for "mlx5" generation HCAs may default to delivering small RDMA Gets into unused space in the work request structure, to minimize the number of PCI bus crossings required.  Such Get operations are eventually completed by a `memcpy()` to the original destination when the completion queue entry is reaped.  This `memcpy()` fails (with a `SIGSEGV`) when the destination is device memory.  As noted in Mellanox's documentation, setting `MLX5_SCATTER_TO_CQE=0` in the environment disables this undesired behavior.  We hope to be able to provide a better solution (automatic and specific to device memory Gets) in a future release.

For the most up-to-date information on this issue see [bug 4151](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4151).

## CUDA Multi-Process Service (MPS)

Testing conducted to date is insufficient to establish whether the current implementation is compatible with the use of CUDA MPS.  Reports (positive or negative) regarding such compatibility are welcome.

# Tested Configurations

We have yet to establish the minimum required versions of hardware, drivers or libraries.  However, the following two platforms are our primary development systems and represent "known good" combinations (modulo the caveats listed elsewhere in this document, such as with regard to multi-rail):

Summit:

+ ppc64le (IBM POWER9)
+ CUDA 10.1.243
+ Mellanox ConnectX-5 HCAs
+ NVIDIA Volta-class GPUs

Dirac:

+ x86_64 (Intel Nehalem)
+ CUDA 11.1
+ Mellanox ConnectX-5 HCAs
+ NVIDIA Maxwell-class GPUs

Eventual minimum requirements may be lower than those of either platform, or possibly higher.

# Implementation Status Summary

Currently the implementation is sufficient (when the hardware, software and configuration constraints are met) to use the pseudo-code below to perform RMA operations between combinations of host and GPU memory (subject to the previously noted temporary prohibitions on loopback and intra-nbrhd transfers).

Please note that all error checking has been elided from this example.  Proper checking of return codes, if any, is especially important when using this prototype.

```
// Bootstrap and establish a host memory segment for the primordial endpoints
gex_Client_Init(&myClient, &myEP, &myTM, "MK example", &argc, &argv, 0);
gex_Segment_Attach(...);

// Create a memory kind object for CUDA device 0
gex_MK_t dev0Kind;
gex_MK_Create_args_t args;
args.gex_flags = 0;
args.gex_class = GEX_MK_CLASS_CUDA_UVA;
args.gex_args.gex_class_cuda_uva.gex_CUdevice = 0;
gex_MK_Create(&dev0Kind, myClient, args, 0);

// Ask GASNet to allocate a 1GB CUDA UVA segment
gex_Segment_t dev0Segment;
gex_Segment_Create(&dev0Segment, myClient, NULL, 1024*1024*1024, dev0Kind, 0);

// Create an RMA-only endpoint, bind dev0Segment and publish RMA credentials
gex_EP_t dev0EP;
gex_EP_Create(&dev0EP, myClient, GEX_EP_CAPABILITY_RMA, 0);
gex_EP_BindSegment(dev0EP, dev0Segment, 0);
gex_EP_PublishBoundSegment(myTM, &dev0EP, 1, 0);

// Note assumptions made in the remainder of this example
assert( gex_EP_QueryIndex(myEP) == 0 );
assert( gex_EP_QueryIndex(dev0EP) == 1 );

// Query the device addresses needed for RMA
void *loc_dev0, *rem_dev0;
loc_dev0 = gex_Segment_QueryAddr(dev0Segment);
gex_Event_t ev = gex_EP_QueryBoundSegmentNB(gex_TM_Pair(myEP,1), peer_rank,
                                            &rem_dev0, NULL, NULL, 0);
gex_Event_Wait(ev);

// [ do something that places data in GPU memory ]

// Perform a blocking 4MB GPU-to-GPU Get
gex_RMA_GetBlocking(gex_TM_Pair(dev0EP,1), loc_dev0, peer_rank, rem_dev0,
                    4*1024*1024, 0);
```
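Because error checking was elided above, the sketch below (reusing the names and calling convention from the example, and therefore equally hedged) shows the kind of return-code check that is recommended; `GASNET_OK` and `gasnet_exit()` are standard GASNet identifiers:

```
// Check the result of kind creation rather than ignoring it.  Per the notes
// later in this document, GASNET_ERR_BAD_ARG is returned when the requested
// class is unsupported, and the implementation also reports a failure at
// runtime when GDR support is absent.
int rc = gex_MK_Create(&dev0Kind, myClient, args, 0);
if (rc != GASNET_OK) gasnet_exit(1);
```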
# Implementation Status by GASNet-EX API

This section describes the known limitations of each of the APIs introduced recently in order to support memory kinds.  Due to interaction among APIs, it is impossible to completely avoid forward references.

## Additions:

The preprocessor identifier `GASNET_HAVE_MK_CLASS_MULTIPLE` is defined to `1` if support has been compiled in for any memory kind other than host memory.

## Renames:

Some types, constants and functions have been renamed relative to their first appearances in the API Proposal (when it was known as "GASNet API Proposal: Multi-EP. Revision 2020.6.1"):

+ `gex_Segment_EP_Bind()` is replaced by `gex_EP_BindSegment()`
+ `gex_MemKind_Create()` is replaced by `gex_MK_Create()`
+ `gex_MemKind_Destroy()` is replaced by `gex_MK_Destroy()`
+ `gex_MemKind_t` is replaced by `gex_MK_t`
  - With the constant `GEX_MEMKIND_HOST` replaced by `GEX_MK_HOST`
+ `gex_MemKind_Class_t` is replaced by `gex_MK_Class_t`
  - With `GEX_MEMKIND_CLASS_` shortened to `GEX_MK_CLASS_` in the naming of the enum values
+ `gex_MemKind_Create_args_t` is replaced by `gex_MK_Create_args_t`
  - With `gex_mk_` shortened to `gex_` in the naming of struct and union members

The current (2020.11.0) revision of the API Proposal uses the names above, and the remainder of this section will use the new names exclusively.

## `gex_Segment_Attach()`

The `gex_Segment_Attach()` call remains the only supported means by which to create and bind a segment to the primordial endpoint.  This is true despite the recent introduction of APIs which, when used in the proper sequence, would appear to provide a suitable replacement.  While this is technically a limitation of the alternative APIs, it is documented here for clarity.

## `gex_Segment_Create()`

This API, along with all types and constants required to specify its arguments, is defined and will link in any conduit.  However, it is useful only when multi-EP support exists (see `gex_EP_Create()` for the current scope of multi-EP support).

On ibv-conduit, specifically, the implementation of this API is believed to be complete with respect to the API Proposal.  In particular, it is capable of creating segments of both client-allocated and GASNet-allocated memory, using either the predefined `kind` value `GEX_MK_HOST` or a kind created using `gex_MK_Create()` with a class of `GEX_MK_CLASS_CUDA_UVA`.

Notably lacking from both the specification and the implementation is a means to request or demand allocation of memory suitable for intra-nbrhd cross-mapping via PSHM.

## `gex_Segment_Destroy()`

Not currently implemented.

## `gex_EP_Create()`

This API, along with all types and constants required to specify its arguments, is defined and will link in any conduit.  However, it is useful only when multi-EP support exists, as described in the following paragraphs.

Current builds of GASNet-EX define a preprocessor macro `GASNET_MAXEPS` which advertises the optimistic maximum number of endpoints per process, inclusive of the primordial endpoint created by `gex_Client_Init()`.  Any call to `gex_EP_Create()` which would exceed this limit will fail with a return of `GASNET_ERR_RESOURCE`.  Currently, only ibv-conduit in FAST segment mode has a value of `GASNET_MAXEPS` larger than 1 (it is currently 33).

Additionally, ibv-conduit only supports the `GEX_EP_CAPABILITY_RMA` capability for non-primordial endpoints.

The `GEX_FLAG_HINT_ACCEL_*` values are currently defined, but ignored.

## `gex_EP_BindSegment()`

This API, along with all types and constants required to specify its arguments, is defined and will link in any conduit.  However, it is useful only when multi-EP support exists (see `gex_EP_Create()` for the current scope of multi-EP support).

## `gex_EP_PublishBoundSegment()`

This API does not appear in the API Proposal, nor in related documents which preceded it.  Complete semantics are documented in `docs/GASNet-EX.txt`.

This call is currently necessary as the only means to actively distribute the RMA credentials required by some conduits (ibv among them).
While this task is performed within `gex_Segment_Attach()` for primordial endpoints, use of this API is required prior to use of the `gex_RMA_*()` APIs with non-primordial endpoints.  It is hoped that this call can become optional in the future.

This API, along with all types and constants required to specify its arguments, is defined and will link in any conduit.  However, it is useful only when multi-EP support exists (see `gex_EP_Create()` for the current scope of multi-EP support).

## `gex_TM_Pair()`

This API is believed to be fully implemented in all conduits and accepted by all APIs required to do so by the API Proposal (notably the `gex_RMA_*()`, `gex_AM_*()` and `gex_VIS_*()` API families).  Since multi-EP support is currently exclusive to ibv-conduit in FAST segment mode, the use in other conduits is effectively limited to aliasing of the primordial team.

## `gex_TM_Create()`

At this point in time, there is no support for non-primordial endpoints as members of teams; more specifically, they are not accepted in the `args` passed to `gex_TM_Create()`.  This means all RMA communication involving non-primordial endpoints must currently utilize a `gex_TM_t` returned by `gex_TM_Pair()`.

## `gex_TM_Dup()`

Not implemented.

## `gex_MK_Create()`

This API is implemented as described in the API Proposal (with some renames relative to their first appearance, as detailed earlier in this document).  This includes the conditional definition (defined to `1` or undefined) of `GASNET_HAVE_MK_CLASS_CUDA_UVA`, which is currently defined only when the necessary headers and libraries were located at configure time *and* one is using ibv-conduit in FAST segment mode.  In all other circumstances `GASNET_HAVE_MK_CLASS_CUDA_UVA` will be undefined.

While `GASNET_HAVE_MK_CLASS_CUDA_UVA` has only a conditional definition, the enum value `GEX_MK_CLASS_CUDA_UVA` is defined unconditionally in `gasnet_mk.h`.  Any call to `gex_MK_Create()` specifying this class when it is *not* supported will return `GASNET_ERR_BAD_ARG`, as documented in the API Proposal.

## `gex_MK_Destroy()`

This API is implemented, but is only legal for a kind not currently in use by any segment.  Given the lack of a `gex_Segment_Destroy()` implementation, this API is therefore only useful for a kind which has been created but never used.
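For illustration only, the create-then-destroy pairing that is currently useful might look like the sketch below; it assumes the `gex_MK_Destroy()` calling convention from the API Proposal (the kind plus a flags argument) and reuses the hypothetical `myClient` from the earlier example:

```
// Create a CUDA_UVA kind and destroy it without ever creating a segment from it,
// which is the only pattern in which gex_MK_Destroy() is currently useful.
gex_MK_t tmpKind;
gex_MK_Create_args_t args;
args.gex_flags = 0;
args.gex_class = GEX_MK_CLASS_CUDA_UVA;
args.gex_args.gex_class_cuda_uva.gex_CUdevice = 0;
if (gex_MK_Create(&tmpKind, myClient, args, 0) == GASNET_OK) {
  gex_MK_Destroy(tmpKind, 0);  // legal: no segment was ever created using tmpKind
}
```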