# bassi poe

NAME

poe - Invokes the Parallel Operating Environment (POE) for loading and executing programs on remote processor nodes.

SYNOPSIS

poe [-h] [program] [program_options]...
[-adapter_use adapter_specifier]
[-buffer_mem {buffer_size | preallocated_buffer_size,maximum_buffer_size}]
[-bulk_min_msg_size message_size]
[-bulk_xfer_chunk_size]
[-bulk_xfer_recv_conn]
[-cc_scratch_buf {yes | no}]
[-clock_source {aix | switch}]
[-cmdfile commands_file]
[-coredir directory_prefix_string | none]
[-corefile_format {lightweight_corefile_name | STDERR}]
[-corefile_sigterm {yes | no}]
[-cpu_use cpu_specifier]
[-css_interrupt {yes | no}]
[-debug_notimeout non-null string of characters]
[-eager_limit size_limit]
[-euidevelop {yes | no | deb | min | nor}]
[-euidevice device_specifier]
[-euilib {ip | us}]
[-euilibpath path_specifier]
[-hints_filtered {yes | no}]
[{-hostfile | -hfile} host_file_name]
[{-infolevel | -ilevel} message_level]
[-io_buffer_size buffer_size]
[-io_errlog {yes | no}]
[-ionodefile io_node_file_name]
[-instances number_of_instances]
[-labelio {yes | no}]
[-llfile loadleveler_job_command_file_name]
[-msg_api {MPI | LAPI | MPI_LAPI | MPI, LAPI | LAPI, MPI}]
[-msg_envelope_buf envelope_buffer_size]
[-newjob {yes | no}]
[-nodes number_of_nodes]
[-pgmmodel {spmd | mpmd}]
[-pmdlog {yes | no}]
[-polling_interval interval]
[-printenv {yes | no | script_name}]
[-procs partition_size]
[-priority_log {yes | no}]
[-priority_ntp {yes | no}]
[-pulse interval]
[-rdma_count {rCxt block value | MPI rCxt block value, LAPI rCxt block value}]
[-resd {yes | no}]
[-retransmit_interval interval]
[-retry retry_interval | wait]
[-retrycount retry_count]
[-rexmit_buf_cnt number of buffers]
[-rexmit_buf_size buffer_size]
[-rmpool pool_ID]
[-savehostfile output_file_name]
[-save_llfile output_file_name]
[-shared_memory {yes | no}]
[-single_thread {no | yes}]
[-statistics {yes | no | print}]
[-stdinmode {all | none | task_ID}]
[-stdoutmode {unordered | ordered | task_ID}]
[-task_affinity {SNI | MCM | mcm_list}]
[-tasks_per_node number_of_tasks per node]
[-thread_stacksize stacksize]
[-udp_packet_size {packet_size}]
[-use_bulk_xfer {yes | no}]
[-wait_mode {nopoll | poll | sleep | yield}]

The poe command invokes the Parallel Operating Environment for loading and executing programs on remote processor nodes. The operation of POE is influenced by a number of POE environment variables. The flag options on this command are each used to temporarily override one of these environment variables. User program_options can be freely interspersed with the flag options. If no program is specified, POE will either prompt you for programs to load, or, if the MP_CMDFILE environment variable is set, will load the partition using the specified commands file.

FLAGS

The -h flag, when used, must appear immediately after poe, and causes the poe man page, if it exists, to be printed to stdout.

The remaining flags you can specify on this command are used to temporarily override POE environment variables. For more information on valid values, and on what a particular flag sets, refer to the description of its associated environment variable in the ENVIRONMENT VARIABLES section. The following flags are grouped by function.

The following Partition Manager control flags override the associated environment variables.

-adapter_use            MP_ADAPTER_USE
-cpu_use                MP_CPU_USE
-euidevice              MP_EUIDEVICE
-euilib                 MP_EUILIB
-euilibpath             MP_EUILIBPATH
-hostfile or -hfile     MP_HOSTFILE
-procs                  MP_PROCS
-pulse                  MP_PULSE
-rdma_count             MP_RDMA_COUNT
-resd                   MP_RESD
-retry                  MP_RETRY
-retrycount             MP_RETRYCOUNT
-msg_api                MP_MSG_API
-rmpool                 MP_RMPOOL
-nodes                  MP_NODES
-tasks_per_node         MP_TASKS_PER_NODE
-savehostfile           MP_SAVEHOSTFILE

The following Job Specification flags override the associated environment variables.

-cmdfile                MP_CMDFILE
-instances              MP_INSTANCES
-llfile                 MP_LLFILE
-newjob                 MP_NEWJOB
-pgmmodel               MP_PGMMODEL
-save_llfile            MP_SAVE_LLFILE
-task_affinity          MP_TASK_AFFINITY

The following I/O Control flags override the associated environment variables.

-labelio                MP_LABELIO
-stdinmode              MP_STDINMODE
-stdoutmode             MP_STDOUTMODE

The following flags for the generation of diagnostic information override the associated environment variables.

-infolevel or -ilevel   MP_INFOLEVEL
-pmdlog                 MP_PMDLOG
-debug_notimeout        MP_DEBUG_NOTIMEOUT

The following Message Passing flags override the associated environment variables.

-buffer_mem             MP_BUFFER_MEM
-cc_scratch_buf         MP_CC_SCRATCH_BUF
-clock_source           MP_CLOCK_SOURCE
-css_interrupt          MP_CSS_INTERRUPT
-eager_limit            MP_EAGER_LIMIT
-hints_filtered         MP_HINTS_FILTERED
-ionodefile             MP_IONODEFILE
-msg_envelope_buf       MP_MSG_ENVELOPE_BUF
-shared_memory          MP_SHARED_MEMORY
-udp_packet_size        MP_UDP_PACKET_SIZE
-thread_stacksize       MP_THREAD_STACKSIZE
-single_thread          MP_SINGLE_THREAD
-wait_mode              MP_WAIT_MODE
-polling_interval       MP_POLLING_INTERVAL
-retransmit_interval    MP_RETRANSMIT_INTERVAL
-statistics             MP_STATISTICS
-io_buffer_size         MP_IO_BUFFER_SIZE
-io_errlog              MP_IO_ERRLOG
-use_bulk_xfer          MP_USE_BULK_XFER
-bulk_min_msg_size      MP_BULK_MIN_MSG_SIZE
-bulk_xfer_chunk_size   MP_BULK_XFER_CHUNK_SIZE
-bulk_xfer_recv_conn    MP_BULK_XFER_RECV_CONN
-rexmit_buf_size        MP_REXMIT_BUF_SIZE
-rexmit_buf_cnt         MP_REXMIT_BUF_CNT

The following corefile generation flags override the associated environment variables.

-coredir                MP_COREDIR
-corefile_format        MP_COREFILE_FORMAT
-corefile_sigterm       MP_COREFILE_SIGTERM

The following are miscellaneous flags:

-euidevelop             MP_EUIDEVELOP
-printenv               MP_PRINTENV
-statistics             MP_STATISTICS
-priority_log           MP_PRIORITY_LOG
-priority_ntp           MP_PRIORITY_NTP

DESCRIPTION

The poe command invokes the Parallel Operating Environment for loading and executing programs on remote nodes. You can enter it at your home node to:

* load and execute an SPMD program on all nodes of your partition.
* individually load the nodes of your partition with an MPMD job.
* load and execute a series of SPMD and MPMD programs, in individual job steps, on the same partition.
* run nonparallel programs on remote nodes.

The operation of POE is influenced by a number of POE environment variables. The flag options on this command are each used to temporarily override one of these environment variables. User program_options can be freely interspersed with the flag options, and additional_options not to be parsed by POE can be placed after a fence_string defined by the MP_FENCE environment variable. If no program is specified, POE will either prompt you for programs to load, or, if the MP_CMDFILE environment variable is set, will load the partition using the specified commands file.

The environment variables and flags that influence the operation of this command fall into distinct categories of function. They are:

* Partition Manager control. The environment variables and flags in this category determine the method of node allocation, message passing mechanism, and the PULSE monitor function.
* Job specification. The environment variables and flags in this category determine whether or not the Partition Manager should maintain the partition for multiple job steps, whether commands should be read from a file or STDIN, and how the partition should be loaded.
* I/O control. The environment variables and flags in this category determine how I/O from the parallel tasks should be handled. These environment variables and flags set the input and output modes, and determine whether or not output is labeled by task id.
* Generation of diagnostic information. The environment variables and flags in this category enable you to generate diagnostic information that may be required by the IBM Support Center in resolving PE-related problems.
* Message Passing Interface. The environment variables and flags in this category enable you to specify values for tuning message passing applications.
* Corefile generation. The environment variables and flags in this category govern aspects of corefile generation, including the directory name into which corefiles will be saved and the corefile format (standard AIX or lightweight).
* Miscellaneous. The additional environment variables and flags in this category enable additional error checking, and set a dispatch priority class for execution.

ENVIRONMENT VARIABLES

The environment variable descriptions in this section are grouped by function.

The following environment variables are associated with Partition Manager control.

MP_ADAPTER_USE
Determines how the node's adapter should be used. The US communication subsystem library does not require dedicated use of the high performance interconnect switch on the node. Adapter use will be defaulted, as in Table 7, but shared usage may be specified. Valid values are dedicated and shared. If not set, the default is dedicated for US jobs, or shared for IP jobs. The value of this environment variable can be overridden using the -adapter_use flag.

MP_CPU_USE
Determines how the node's CPUs should be used. The US communication subsystem library does not require unique CPU use on the node. CPU use will be defaulted, as in Table 7, but multiple use may be specified. Valid values are multiple and unique. If not set, the default is unique for US jobs, or multiple for IP jobs. The value of this environment variable can be overridden using the -cpu_use flag.

MP_EUIDEVICE
Determines the adapter set to use for message passing. Valid values are en0 (for Ethernet), fi0 (for FDDI), tr0 (for token-ring), css0 (for the pSeries High Performance Switch feature and SP Switch2), csss (for the SP switch 2 high performance adapter), and sn_all and sn_single (for the pSeries High Performance Switch).

MP_EUILIB
Determines the communication subsystem implementation to use for communication: either the IP communication subsystem or the US communication subsystem.
To use the US communication subsystem, you must have a system configured with its high performance switch feature. Valid values, which are case sensitive, are ip (for the IP communication subsystem) or us (for the US communication subsystem). The value of this environment variable can be overridden using the -euilib flag.

MP_EUILIBPATH
Determines the path to the message passing and communication subsystem libraries. This only needs to be set if an alternate library path is desired. Valid values are any path specifier. The value of this environment variable can be overridden using the -euilibpath flag.

MP_HOSTFILE
Determines the name of a host list file for node allocation. Valid values are any file specifier. If not set, the default is host.list in your current directory. The value of this environment variable can be overridden using the -hostfile or -hfile flags.

MP_PROCS
Determines the number of program tasks. Valid values are any number from 1 to 8192. If not set, the default is 1. The value of this environment variable can be overridden using the -procs flag.

MP_PULSE
The interval (in seconds) at which POE checks the remote nodes to ensure that they are communicating with the home node. The default interval is 600 seconds (10 minutes). To disable the pulse function, specify an interval of 0 (zero) seconds. The pulse function is automatically disabled when running the pdbx debugger. You can override the value of this environment variable with the -pulse flag.

MP_RDMA_COUNT
Specifies the number of user rCxt blocks. It supports the specification of multiple values when multiple protocols are involved. The format can be one of the following:

* MP_RDMA_COUNT=m for a single protocol.
* MP_RDMA_COUNT=m,n for multiple protocols. This form applies only when MP_MSG_API="mpi,lapi". The values are positional: m is for MPI, n is for LAPI.

Note that the MP_RDMA_COUNT/-rdma_count option signifies the number of rCxt blocks the user has requested for the job, and it is up to LoadLeveler to determine the actual number of rCxt blocks that will be allocated for the job. POE will use the value of MP_RDMA_COUNT to specify the number of rCxt blocks requested on the LoadLeveler MPI and/or LAPI network information when the job is submitted.

The MP_RDMA_COUNT specification only has meaning for LAPI applications. When MP_RDMA_COUNT is specified for MPI applications (either when MP_MSG_API is explicitly set or defaults to "mpi"), POE will issue a warning message that the MP_RDMA_COUNT specification is unnecessary.

The use of the MP_RDMA_COUNT specification applies to PE Version 4 Release 2 in AIX Version 5 Release 3 environments only.

MP_REMOTEDIR
Specifies the name of a script which echoes the name of the current directory to be used on the remote nodes. By default, the current directory is the current directory at the time that POE is run. You may need to specify this if the AutoMount Daemon is used to mount user file systems, and the user is not using the Korn shell. The script mpamddir is provided for mapping the C shell directory name to an AutoMount Daemon name.

MP_RESD
Determines whether or not the Partition Manager should connect to LoadLeveler to allocate nodes. Valid values are either yes or no, and there is no default. The value of this environment variable can be overridden using the -resd flag.

MP_RETRY
The period of time (in seconds) between processor node allocation retries by POE if there are not enough processor nodes immediately available to run a program. This is valid only if you are using LoadLeveler. If the (case insensitive) character string wait is specified instead of a number, no retries are attempted by POE, and the job remains enqueued in LoadLeveler until LoadLeveler either schedules the job or cancels it.
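
The two forms of MP_RETRY described above (a number of seconds, or the case-insensitive string wait) can be told apart with a few lines of Python. This is only an illustrative sketch for scripts that prepare a POE environment; the function name parse_mp_retry is hypothetical and is not part of POE, and rejecting negative numbers is an assumption, since the text does not state a lower bound.

```python
def parse_mp_retry(value):
    """Classify an MP_RETRY setting.

    Returns ("wait", None) when the string wait is given in any letter case,
    otherwise ("interval", seconds) for a numeric retry interval.
    """
    text = value.strip()
    if text.lower() == "wait":
        # No retries by POE; the job stays enqueued in LoadLeveler.
        return ("wait", None)
    seconds = int(text)  # raises ValueError for non-numeric input
    if seconds < 0:
        # Assumption: a negative interval is treated as invalid here.
        raise ValueError("MP_RETRY interval must not be negative")
    return ("interval", seconds)


if __name__ == "__main__":
    print(parse_mp_retry("Wait"))
    print(parse_mp_retry("30"))
```

A wrapper script could call this before exporting MP_RETRY, so that a mistyped value is caught before the job is submitted.
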

MP_RETRYCOUNT
The number of times (at the interval set by MP_RETRY) that the partition manager should attempt to allocate processor nodes. This value is ignored if MP_RETRY is set to the character string wait.

MP_MSG_API
Indicates to POE which message-passing API is being used by the parallel tasks. Valid values are:

* MPI - use the MPI protocol only.
* LAPI - use the LAPI protocol only.
* MPI_LAPI - both protocols are used, sharing the same set of communication resources (windows, IP addresses).
* MPI, LAPI - both protocols are used, with dedicated resources assigned to each of them.
* LAPI, MPI - has a meaning identical to MPI, LAPI.

MP_RMPOOL
Determines the name or number of the pool that should be used for nonspecific node allocation. This environment variable/command-line flag only applies to LoadLeveler. Valid values are any identifying pool name or number. There is no default. The value of this environment variable can be overridden using the -rmpool flag.

MP_NODES
Specifies the number of physical nodes on which to run the parallel tasks. It may be used alone or in conjunction with MP_TASKS_PER_NODE and/or MP_PROCS, as described in Table 9. The value of this environment variable can be overridden using the -nodes flag.

MP_TASKS_PER_NODE
Specifies the number of tasks to be run on each of the physical nodes. It may be used in conjunction with MP_NODES and/or MP_PROCS, as described in Table 9, but may not be used alone. The value of this environment variable can be overridden using the -tasks_per_node flag.

MP_SAVEHOSTFILE
The name of an output host list file to be generated by the Partition Manager. Valid values are any relative or full path name. The value of this environment variable can be overridden using the -savehostfile flag.

MP_TIMEOUT
Controls the length of time POE waits before abandoning an attempt to connect to the remote nodes. The default is 150 seconds.
MP_TIMEOUT also changes the length of time the communication subsystem will wait for a connection to be established during message passing initialization. If the SP security method is "dce and compatibility", you may need to increase the MP_TIMEOUT value to allow POE to wait for the DCE servers to respond (or to time out if the servers are down).

MP_CKPTDIR
Defines the directory where the checkpoint file will reside when checkpointing a program. See 4.2.7, "Checkpointing and restarting programs" for more information.

MP_CKPTDIR_PERTASK
Specifies whether the checkpoint files of the parallel tasks should be written to separate subdirectories under the directory that is specified by MP_CKPTDIR. The default is no. The subdirectories must exist prior to invoking the parallel checkpoint. Using separate subdirectories may provide better performance when using a shared/parallel file system (for example, GPFS) for checkpointing from more than 128 nodes, depending on the specifics of the file system, checkpoint file size, and other factors. The subdirectory name used for each task is its task number.

The following environment variables are associated with Job Specification.

MP_CMDFILE
Determines the name of a POE commands file used to load the nodes of your partition. If set, POE will read the commands file rather than STDIN. Valid values are any file specifier. The value of this environment variable can be overridden using the -cmdfile flag.

MP_INSTANCES
The number of instances of User Space windows or IP addresses to be assigned per task per protocol per network. This value is expressed as an integer, or the string max. If the value specified exceeds the maximum allowed number of instances, as determined by LoadLeveler, the true maximum number determined is substituted.

MP_LLFILE
Determines the name of a LoadLeveler job command file for node allocation.
If you are performing specific node allocation, you can use a LoadLeveler job command file in conjunction with a host list file. If you do, the specific nodes listed in the host list file will be requested from LoadLeveler. Valid values are any relative or full path name. The value of this environment variable can be overridden using the -llfile flag.

MP_NEWJOB
Determines whether or not the Partition Manager maintains your partition for multiple job steps. Valid values are yes or no. If not set, the default is no. The value of this environment variable can be overridden using the -newjob flag.

MP_PGMMODEL
Determines the programming model you are using. Valid values are spmd or mpmd. If not set, the default is spmd. The value of this environment variable can be overridden using the -pgmmodel flag.

MP_SAVE_LLFILE
When using LoadLeveler for node allocation, the name of the output LoadLeveler job command file to be generated by the Partition Manager. The output LoadLeveler job command file will show the LoadLeveler settings that result from the POE environment variables and/or command-line options for the current invocation of POE. If you use the MP_SAVE_LLFILE environment variable for a batch job, or when the MP_LLFILE environment variable is set (indicating that a LoadLeveler job command file should participate in node allocation), POE will show a warning and will not save the output job command file. Valid values are any relative or full path name. The value of this environment variable can be overridden using the -save_llfile flag.

MP_TASK_AFFINITY
Setting this environment variable causes the PMD to attach each task of a parallel job to one of the system resource sets at the MCM level. This constrains the task, and all its threads, to run within that MCM. If the task has an inherited resource set, the attach honors the constraints of the inherited resource set.
When POE is run under LoadLeveler 3.3.1 or later (which includes all User Space jobs), POE relies on LoadLeveler to handle scheduling affinity, based on LoadLeveler job control file keywords that POE sets up in submitting the job. The possible MP_TASK_AFFINITY values are:

* MP_TASK_AFFINITY=MCM - the tasks are allocated in a round-robin fashion among the MCMs attached to the job by WLM. By default, the tasks are allocated to all the MCMs in the node. When run under LoadLeveler 3.3.1 or later, POE will set the LoadLeveler MCM_AFFINITY_OPTIONS to MCM_MEM_PREF, MCM_SNI_NONE, and MCM_DISTRIBUTE, allowing LoadLeveler to handle scheduling affinity.
* MP_TASK_AFFINITY=SNI - the tasks are allocated to the MCM in common with the first adapter assigned to the task by LoadLeveler. This applies only to User Space MPI jobs. MP_TASK_AFFINITY=SNI should not be specified for IP jobs. When run under LoadLeveler 3.3.1 or later, POE will set the LoadLeveler MCM_AFFINITY_OPTIONS to MCM_SNI_PREF and MCM_DISTRIBUTE, allowing LoadLeveler to handle scheduling affinity.
* MP_TASK_AFFINITY=mcm-list - tasks will be assigned on a round-robin basis to this set, within the constraint of an inherited rset, if any. 'mcm-list' specifies a set of system level (LPAR) logical MCMs that can be attached to. Attaching to any MCMs outside the constraint set will be attempted, but will fail. If a single MCM number is specified as the list, all tasks are assigned to that MCM. This option is only valid when running either without LoadLeveler, or with LoadLeveler Version 3.2 (or earlier) that does not support scheduling affinity.
* When a value of -1 is specified, no affinity request will be made (effectively this disables task affinity).

The following environment variables are associated with I/O Control.

MP_LABELIO
Determines whether or not output from the parallel tasks is labeled by task id. Valid values are yes or no. If not set, the default is no.
The value of this environment variable can be overridden using the -labelio flag.

MP_STDINMODE
Determines the input mode, that is, how STDIN is managed for the parallel tasks. Valid values are:

* all - all tasks receive the same input data from STDIN.
* none - no tasks receive input data from STDIN; STDIN will be used by the home node only.
* n - STDIN is sent only to the task identified (n).

If not set, the default is all. The value of this environment variable can be overridden using the -stdinmode flag.

MP_STDOUTMODE
Determines the output mode, that is, how STDOUT is handled by the parallel tasks. Valid values are:

* unordered - all tasks write output data to STDOUT asynchronously.
* ordered - output data from each parallel task is written to its own buffer. Later, all buffers are flushed, in task order, to STDOUT.
* a task id - only the task indicated writes output data to STDOUT.

If not set, the default is unordered. The value of this environment variable can be overridden using the -stdoutmode flag.

The following environment variables are associated with the generation of diagnostic information.

MP_INFOLEVEL
Determines the level of message reporting. Valid values are:

* 0 - error.
* 1 - warning and error.
* 2 - informational, warning, and error.
* 3 - informational, warning, and error. Also reports diagnostic messages for use by the IBM Support Center.
* 4, 5, 6 - informational, warning, and error. Also reports high- and low-level diagnostic messages for use by the IBM Support Center.

If not set, the default is 1 (warning and error). The value of this environment variable can be overridden using the -infolevel or -ilevel flag.

MP_PMDLOG
Determines whether or not diagnostic messages should be logged to a file in /tmp on each of the remote nodes. Typically, this environment variable/command-line flag is only used under the direction of the IBM Support Center in resolving a PE-related problem. Valid values are yes or no. If not set, the default is no.
The value of this environment variable can be overridden using the -pmdlog flag.

MP_PRINTENV
Use this environment variable to generate a report on the parallel environment setup for the MPI job at hand. The report is printed to STDOUT. Printing this report has no adverse effect on the performance of the MPI program. The value can also be a user-specified script name, the output of which is appended to the end of the normal environment setup report. The allowable values for MP_PRINTENV are:

* no - Do not produce a report of environment variable settings. This is the default value.
* yes - Produce a report of MPI environment variable settings. This report is generated when MPI job initialization is complete.
* script_name - Produce the report (same as yes), then append the output of the script specified here.

MP_STATISTICS
Provides the ability to gather MPCI and LAPI communication statistics for MPI user space jobs. Valid values are yes, no and print. If not set, the default is no. The values are not case sensitive.

The MPCI statistical information can be used to get a summary of the network usage at the end of the MPI job and to check the progress of inter-job message passing during the execution of an MPI program. To get a summary of the network usage, use print. A list of MPCI statistical information will be printed when MPI_Finalize is called. To check the progress of inter-job message passing, use yes and insert the MPCI functions 'mpci_statistics_write' and 'mpci_statistics_zero' strategically into the MPI program. The 'mpci_statistics_write' function prints out the current counters, and the 'mpci_statistics_zero' function zeroes the counters. These function prototypes are:

int mpci_statistics_zero(void)
int mpci_statistics_write(FILE *fptr)

If ppe.poe is installed, these prototypes and a list of all MPCI statistical variables, with their explanations, can be found in: /usr/lpp/ppe.poe/include/x_mpci_statistics.h.

Note: Activating MPCI statistics may have a slight impact on performance of the MPI program.

MP_DEBUG_INITIAL_STOP
Determines the initial breakpoint in the application where pdbx will get control. MP_DEBUG_INITIAL_STOP should be specified as file_name:line_number. The line_number is the number of the line within the source file file_name, where file_name has been compiled with -g. The line number has to be one that defines executable code. In general, this is a line of code for which the compiler generates machine level code. Another way to view this is that the line number is one for which debuggers will accept a breakpoint. Another valid string for MP_DEBUG_INITIAL_STOP would be the function_name of the desired initial stopping point in the debugger. If this variable is not specified, the default is to stop at the first executable source line in the main routine. This environment variable has no associated command-line flag.

MP_DEBUG_NOTIMEOUT
A debugging aid that allows programmers to attach to one or more of their tasks without the concern that some other task may reach the LAPI timeout. Such a timeout would normally occur if one of the job tasks was continuing to run, and tried to communicate with a task to which the programmer has attached using a debugger. With this flag set, LAPI will never time out and will continue retransmitting message packets forever. The default setting is false, allowing LAPI to time out.

The following environment variables are associated with the Message Passing Interface.

MP_UDP_PACKET_SIZE
Allows the user to control the LAPI UDP datagram size. Specify a positive integer.

MP_ACK_THRESH
Allows the user to control the LAPI packet acknowledgement threshold. Specify a positive integer, no greater than 31. The default is 30.

MP_BUFFER_MEM
Specifies the size of the Early Arrival (EA) buffer that is used by the communication subsystem to buffer eager send messages that arrive before there is a matching receive posted.
This value can also be specified with the -buffer_mem command line flag. The command line flag will override a value set with the environment variable. This environment variable can be used in one of two ways:

* Specify the size of a pre-allocated EA buffer and have PE/MPI guarantee that no valid MPI application can require more EA buffer space than is pre-allocated. For applications without very large task counts or with modest memory demand per task, this form is almost always sufficient.
* Specify the size of a pre-allocated EA buffer and the maximum size that PE/MPI will guarantee the buffer can never exceed. Aggressive use of EA space is rare in real MPI applications but when task counts are large, the need for PE/MPI to enforce an absolute guarantee may compromise performance. Specifying a pre-allocated EA buffer that is big enough for the application's real needs but an upper bound that loosens enforcement may provide better performance in some cases, but those cases will not be common.

The default values for pre-allocated EA space are 64 MB when running with User Space and 2.8 MB when running IP. To evaluate whether overriding MP_BUFFER_MEM defaults for a particular application is worthwhile, use MP_STATISTICS. This tells you whether there is significantly more EA buffer space allocated than is used, or whether EA space limits are creating potential performance impacts by forcing some messages that are smaller than the eager limit to use the rendezvous protocol because EA buffer space cannot be guaranteed.

The value of MP_BUFFER_MEM can be overridden with the -buffer_mem command line flag. For more information about MP_BUFFER_MEM see 5.2, "Using MP_BUFFER_MEM". For information about buffering eager send messages, see IBM Parallel Environment for AIX: MPI Programming Guide.

MP_CC_SCRATCH_BUF
Specifies whether MPI should always use the fastest collective communication algorithm when there are alternatives, even if a larger scratch buffer is required.
In some cases, the faster algorithm needs to allocate more scratch buffers and therefore consumes more memory than a slower algorithm. The default value is yes, which means that you want MPI to choose an algorithm that has the shortest execution time, even though it may consume extra memory. A value of no specifies that MPI should choose the algorithm that uses less memory. Note that restricting MPI to the algorithm that uses the least memory normally sacrifices performance in exchange for that memory savings, so a value of no should be specified only when limiting memory usage is critical. The value of MP_CC_SCRATCH_BUF can be overridden with the -cc_scratch_buf command line flag.

MP_CLOCK_SOURCE
Determines whether or not to use the switch clock as a time source. Valid values are AIX and switch. There is no default value. The value of this environment variable can be overridden using the -clock_source flag.

MP_CSS_INTERRUPT
Determines whether or not arriving message packets cause interrupts. This may provide better performance for certain applications. Valid values are yes and no. If not set, the default is no.

MP_EAGER_LIMIT
Changes the threshold value for message size, above which rendezvous protocol is used. If the MP_EAGER_LIMIT environment variable is not set during initialization, MPI automatically chooses a default eager limit value, based on the number of tasks, as follows:

Number of Tasks    MP_EAGER_LIMIT
---------------    --------------
1 to 256           32768
257 to 512         16384
513 to 1024        8192
1025 to 2048       4096
2049 to 4096       2048
4097 to 8192       1024

Consider running a new application once with the eager limit set to 0 (zero), because this is useful for confirming that the application is safe. Normally, however, a higher eager limit gives better performance. Note that a safe application, as defined by the MPI standard, is one that does not depend on some minimum of MPI buffer space to avoid deadlock. The maximum value for MP_EAGER_LIMIT is 256K (262144 bytes).
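
As an illustration only, the default selection in the table above can be expressed as a small Python lookup. The function name default_eager_limit and the two list names are hypothetical; the thresholds and values are copied from the table, and the 1 to 8192 task range is the range the table covers.

```python
import bisect

# Upper bound of each task-count range, and the default eager limit for it,
# copied from the table of defaults.
_TASK_UPPER_BOUNDS = [256, 512, 1024, 2048, 4096, 8192]
_DEFAULT_EAGER_LIMITS = [32768, 16384, 8192, 4096, 2048, 1024]


def default_eager_limit(num_tasks):
    """Return the default eager limit (in bytes) for a given number of tasks."""
    if not 1 <= num_tasks <= _TASK_UPPER_BOUNDS[-1]:
        raise ValueError("number of tasks must be between 1 and 8192")
    # bisect_left finds the first range whose upper bound is >= num_tasks.
    index = bisect.bisect_left(_TASK_UPPER_BOUNDS, num_tasks)
    return _DEFAULT_EAGER_LIMITS[index]


if __name__ == "__main__":
    for tasks in (1, 256, 257, 1024, 8192):
        print(tasks, default_eager_limit(tasks))
```

Each doubling of the task-count range halves the default, which keeps the total eager buffering demand per task roughly constant as the job grows.
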

Any value that is less than 64 bytes but greater than zero bytes is automatically increased to 64 bytes. A non-power of 2 value will be rounded up to the nearest power of 2. A value may be adjusted if the early arrival buffer (MP_BUFFER_MEM) is too small. For information about buffering eager send messages and eager limit, see IBM Parallel Environment for AIX: MPI Programming Guide.

MP_HINTS_FILTERED
Determines whether MPI info objects reject hints (key/value pairs) which are not meaningful to the MPI implementation. In filtered mode, an MPI_INFO_SET call which provides a key/value pair that the implementation does not understand will behave as a no-op. A subsequent MPI_INFO_GET call will find that the hint does not exist in the info object. In unfiltered mode, any key/value pair is stored and may be retrieved. Applications which wish to use MPI info objects to cache and retrieve key/value pairs other than those actually understood by the MPI implementation must use unfiltered mode. The option has no effect on the way MPI uses the hints it does understand. In unfiltered mode, there is no way for a program to discover which hints are valid to MPI and which are simply being carried as uninterpreted key/value pairs. Providing an unrecognized hint is not an error in either mode.

Valid values for this environment variable are yes and no. If set to yes, unrecognized hints are filtered. If set to no, they are not. If this environment variable is not set, the default is yes. The value of this environment variable can be overridden using the -hints_filtered command-line flag.

MP_IONODEFILE
The name of a parallel I/O node file -- a text file that lists the nodes that should be handling parallel I/O. This enables you to limit the number of nodes that participate in parallel I/O, guarantee that all I/O operations are performed on the same node, and so on. Valid values are any relative or full path name.
If not specified, all nodes will participate in parallel I/O operations. The value of this environment variable can be overridden using the -ionodefile command-line flag.

MP_MSG_ENVELOPE_BUF
Changes the size of the message envelope buffer. You can specify any positive number. There is no upper limit, but any value less than 1 MB is ignored. MPI pre-allocates the message envelope buffer with a default size of 8 MB. The MPI statistics function prints out the message envelope buffer usage which you can use to determine the best envelope buffer size for a particular MPI program. The envelope buffer is used for storing both send and receive descriptors. An MPI_Isend or unmatched MPI_Irecv posting creates a descriptor that lives until the MPI_Wait completes. When a message arrives and finds no match, an early arrival descriptor is created that lives until a matching receive is posted and that receive completes. For any message at the destination, there will be only one descriptor, either the one created at the receive call or the one created at the early arrival. The more uncompleted MPI_Irecv and MPI_Isend operations an application maintains, the higher the envelope buffer requirement. Most applications will have no reason to adjust the size of this buffer. The value of MP_MSG_ENVELOPE_BUF can be overridden with the -msg_envelope_buf command line flag.

MP_POLLING_INTERVAL
Changes the polling interval, in microseconds. This is expressed as an integer between 1 and 2 billion, with defaults of 400000 (US) and 180000 (IP).

MP_RETRANSMIT_INTERVAL
Controls how often the communication subsystem library checks to see if it should retransmit packets that have not been acknowledged. This value is the number of polling loops between checks. The acceptable range is 1000 to INT_MAX. The default is 10000 for UDP and 400000 for User Space.

MP_LAPI_TRACE_LEVEL
Used in conjunction with AIX tracing for debug purposes. Levels 0-6 are supported.
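The MP_EAGER_LIMIT rules given earlier in this section (a default chosen from the task count, a 64-byte floor for small nonzero values, rounding up to a power of 2, and a 256K maximum) can be sketched in Python as follows. This is only an illustration of the documented rules, not PE MPI code. The function names are hypothetical, the handling of values outside the documented ranges (raising an error) is an assumption, and the further adjustment that may occur when MP_BUFFER_MEM is too small is not modelled.

```python
# Default eager limits from the MP_EAGER_LIMIT table: (largest task count, bytes).
DEFAULT_EAGER_LIMITS = [
    (256, 32768),
    (512, 16384),
    (1024, 8192),
    (2048, 4096),
    (4096, 2048),
    (8192, 1024),
]
MAX_EAGER_LIMIT = 262144  # 256K, the documented maximum


def default_eager_limit(tasks):
    """Return the table default for a given number of tasks."""
    if tasks < 1:
        raise ValueError("task count must be at least 1")
    for largest, limit in DEFAULT_EAGER_LIMITS:
        if tasks <= largest:
            return limit
    # Assumption: the table stops at 8192 tasks, so larger counts are rejected here.
    raise ValueError("the documented table covers at most 8192 tasks")


def normalize_eager_limit(requested):
    """Apply the documented floor and power-of-2 rounding to a requested value."""
    if requested < 0 or requested > MAX_EAGER_LIMIT:
        # Assumption: out-of-range requests are treated as errors in this sketch.
        raise ValueError("eager limit must be between 0 and 262144")
    if requested == 0:
        return 0  # zero is kept as is (useful for checking application safety)
    if requested < 64:
        return 64  # small nonzero values are raised to 64 bytes
    # Round up to the nearest power of 2 (exact powers of 2 are unchanged).
    return 1 << (requested - 1).bit_length()
```

For example, under these assumptions a 300-task job with MP_EAGER_LIMIT unset would get the 257-to-512 row of the table, and a request of 5000 bytes would be rounded up to the next power of 2.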
MP_SHARED_MEMORY
Specifies the use of shared memory (instead of the network) for message passing between tasks running on the same node. The default value is yes.

Note: In past releases, the MP_SHM_CC environment variable was used to enable or disable the use of shared memory for certain 64-bit MPI collective communication operations. Beginning with the PE 4.2 release, this environment variable has been removed. You should now use MP_SHARED_MEMORY to enable shared memory for both collective communication and point-to-point routines. The default setting for MP_SHARED_MEMORY is yes (enable shared memory).

MP_USE_BULK_XFER
Exploit the high performance switch bulk data transfer mechanism. This variable does not have any meaning and is ignored in other environments. Before you can use MP_USE_BULK_XFER, the system administrator must first enable Remote Direct Memory Access (RDMA). For more information, see IBM Parallel Environment for AIX: Installation. Valid values are yes and no. If not set, the default is no. Note that when you use MP_USE_BULK_XFER, you also need to consider the value of the MP_BULK_MIN_MSG_SIZE environment variable. Messages with data lengths that are greater than the value specified for MP_BULK_MIN_MSG_SIZE will use the bulk transfer path, if it is available. See the description of MP_BULK_MIN_MSG_SIZE for more information.

MP_BULK_MIN_MSG_SIZE
Set the minimum message length for bulk transfer. Contiguous messages with data lengths greater than or equal to the value you specify for this environment variable will use the bulk transfer path, if it is available. Messages with data lengths that are smaller than the value you specify for this environment variable, or are noncontiguous, will use packet mode transfer. The valid range of values is from 4096 to 2147483647 (INT_MAX).
The size can be expressed in one of the following ways:

* As a number of bytes
* As a number of KB (1024 bytes), using the letter k as a suffix
* As a number of MB (1024 * 1024 bytes), using the letter m as a suffix
* As a number of GB (1024 * 1024 * 1024 bytes), using the letter g as a suffix.

The default value is 153600.

MP_BULK_XFER_CHUNK_SIZE
Controls the size of the internal bulk transfer requests. A message larger than this size will be split into multiple requests that are serialized. The valid range is 32K to 32MB and the default is 32MB. This variable will only affect the performance of bulk transfer messages and will not affect program behavior.

MP_BULK_XFER_RECV_CONN
Controls the number of active bulk transfer messages being received at any one time. The valid range is 1 through 64 and the default is 64. If the value is set to one, this effectively serializes the reception of bulk transfer requests. This variable will only affect the performance of bulk transfer messages and will not affect program behavior.

MP_THREAD_STACKSIZE
Determines the additional stacksize allocated for user programs executing on an MPI service thread. If you allocate insufficient space, the program may encounter a SIGSEGV exception.

MP_SINGLE_THREAD
Avoids mutex lock overheads in a single threaded user program. This is an optimization flag, with values of no and yes. The default value is no, which means the potential for multiple user message passing threads is assumed.

Note: MPI-IO and MPI-1SC (MPI One Sided Communication) cannot be used when MP_SINGLE_THREAD is set to yes. An application that tries to use nonstandard MPE_I nonblocking collective communications, MPI-IO, or MPI-1SC with MP_SINGLE_THREAD=yes will be terminated. MPI calls from multiple user threads cannot be detected and will lead to unpredictable results.

MP_WAIT_MODE
Specifies how a thread or task behaves when it discovers it is blocked, waiting for a message to arrive.
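The size notation and threshold rule described under MP_BULK_MIN_MSG_SIZE can be illustrated with a short Python sketch. It is a model of the documented behavior only, not POE or LAPI code: the function names are hypothetical, accepting uppercase suffixes is an assumption (the MP_IO_BUFFER_SIZE example in this page writes 16M), and the comparison follows the "greater than or equal to" wording of the MP_BULK_MIN_MSG_SIZE entry.

```python
# Suffix multipliers as documented for MP_BULK_MIN_MSG_SIZE.
SUFFIXES = {"k": 1024, "m": 1024 * 1024, "g": 1024 * 1024 * 1024}

BULK_MIN = 4096            # smallest valid MP_BULK_MIN_MSG_SIZE
BULK_MAX = 2147483647      # largest valid value (INT_MAX)
DEFAULT_BULK_MIN_MSG_SIZE = 153600


def parse_size(text):
    """Convert '150k', '2m', '1g' or a plain byte count to bytes."""
    text = text.strip().lower()  # assumption: suffix case is ignored
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)


def parse_bulk_min_msg_size(text):
    """Parse a size and check it against the documented valid range."""
    size = parse_size(text)
    if not BULK_MIN <= size <= BULK_MAX:
        raise ValueError("value must be between 4096 and 2147483647 bytes")
    return size


def uses_bulk_path(length, contiguous, threshold=DEFAULT_BULK_MIN_MSG_SIZE):
    """True when a message would qualify for the bulk transfer path."""
    # Noncontiguous or smaller messages use packet mode transfer instead.
    return contiguous and length >= threshold
```

Note that the documented default of 153600 bytes is the same as 150k in this notation, and that 2g falls just outside the valid range because it exceeds INT_MAX by one byte.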
MP_POLLING_INTERVAL
Defines the polling interval in microseconds. The maximum interval is approximately 2 billion microseconds (2000 seconds). The default is 180000 microseconds for IP, and 400000 microseconds for US.

MP_RETRANSMIT_INTERVAL
MP_RETRANSMIT_INTERVAL=nnnnn and its command line equivalent, -retransmit_interval=nnnnn, control how often the communication subsystem library checks to see if it should retransmit packets that have not been acknowledged. The value nnnnn is the number of polling loops between checks. The acceptable range is 1000 to 400000. The default is 10000 for UDP and 400000 for User Space.

MP_IO_BUFFER_SIZE
Indicates the default size of the data buffer used by MPI-IO agents. For example:

    export MP_IO_BUFFER_SIZE=16M

sets the default size of the MPI-IO data buffer to 16MB. The default value of the environment variable is the number of bytes corresponding to 16 file blocks. This value depends on the block size associated with the file system storing the file. Valid values are any positive size up to 128MB. The size can be expressed as a number of bytes, as a number of KB (1024 bytes), using the letter k as a suffix, or as a number of MB (1024 * 1024 bytes), using the letter m as a suffix.

MP_IO_ERRLOG
Indicates whether to turn on error logging for I/O operations. For example:

    export MP_IO_ERRLOG=yes

turns on error logging. When an error occurs, a line of information will be logged into file /tmp/mpi_io_errdump.app_name.userid.taskid, recording the time the error occurs, the POSIX file system call involved, the file descriptor, and the returned error number.

MP_REXMIT_BUF_SIZE
The maximum message size which LAPI will store in its local buffers so as to more quickly free up the user buffer containing message data. This size indicates the size of the local buffers LAPI will allocate to store such messages, and will impact memory usage, while potentially improving performance.
Messages larger than this size will continue to be transmitted by LAPI; the only difference is that user buffers will not become available for the user to reuse until the message data has been acknowledged as received by the target. The default user message size is 16352 bytes.

MP_REXMIT_BUF_CNT
The number of buffers that LAPI must allocate for each target job, each buffer being of the size defined by MP_REXMIT_BUF_SIZE * MP_REXMIT_BUF_CNT. This count indicates the number of in-flight messages that LAPI can store in its local buffers so as to free up the user's message buffers. If there are no more message buffers left, LAPI will still continue transmission of messages; the only difference is that user buffers will not become available for the user to reuse until the message data has been acknowledged as received by the target. The default number of buffers is 128.

The following are corefile generation environment variables:

MP_COREDIR
Creates a separate directory for each task's core file. The value of this environment variable can be overridden using the -coredir flag. A value of "none" bypasses creating a new directory, so core files are written to /tmp.

MP_COREFILE_FORMAT
Determines the format of corefiles generated when processes terminate abnormally. If not set, POE will generate standard AIX corefiles. If set to the string "STDERR", output will go to standard error. If set to any other string, POE will generate a lightweight corefile (conforming to the Parallel Tools Consortium's Standardized Lightweight Corefile Format) for each process in your partition. The string you specify is the name you want to assign to each lightweight corefile. By default, these lightweight corefiles will be saved to subdirectories prefixed by the string coredir and suffixed by the task id (as in coredir.0, coredir.1, and so on). You can specify a prefix other than the default coredir by setting the MP_COREDIR environment variable.
The value of this environment variable can be overridden using the -corefile_format flag.

MP_COREFILE_SIGTERM
Determines if POE should generate a corefile when a SIGTERM signal is received. Valid values are yes and no. If not set, the default is no.

The following are miscellaneous environment variables:

MP_EUIDEVELOP
Determines whether PE MPI performs less, normal, or more detailed checking during execution. The additional checking is intended for developing applications, and can significantly slow performance. Valid values are yes or no, deb (for "debug"), nor (for "normal"), and min (for "minimum"). The min value shuts off parameter checking for all send and receive operations, and may improve performance, but should be used only with applications that are very well-validated. If not set, the default is no. The value of this environment variable can be overridden using the -euidevelop flag.

MP_FENCE
Determines a fence_string to be used for separating options you want parsed by POE from those you do not. Valid values are any string, and there is no default. Once set, you can then use the fence_string followed by additional_options on the poe command line. The additional_options will not be parsed by POE. This environment variable has no associated command-line flag.

MP_NOARGLIST
Determines whether or not POE ignores the argument list. Valid values are yes and no. If set to yes, POE will not attempt to remove POE command-line flags before passing the argument list to the user's program. This environment variable has no associated command-line flag.

MP_PRIORITY
Determines a co-scheduler dispatch parameter set for execution. See 5.15, "Improving Application Scalability Performance" for more information on co-scheduler parameters. Valid values are any of the dispatch priority classes set up by the system administrator in the file /etc/poe.priority, or a string of threshold values, as controlled by the /etc/poe.priority file contents.
This environment variable has no associated command-line flag.

MP_PRIORITY_LOG
Determines whether diagnostic messages should be logged to the POE priority adjustment co-scheduler log file in /tmp/pmadjpri.log on each of the remote nodes. This variable should only be used in conjunction with the POE co-scheduler MP_PRIORITY variable. Valid values are yes or no. If not set, the default is yes. The value of this environment variable can be overridden using the -priority_log flag. See the section on improving application scalability performance in IBM Parallel Environment: Operation and Use, Volume 1 for more information on the POE co-scheduler.

MP_PRIORITY_NTP
Determines whether or not the POE priority adjustment co-scheduler will turn NTP off during the priority adjustment period, or leave it running. Valid values are "yes" and "no". The value of "no" (which is the default) will instruct the POE co-scheduler to turn the NTP daemon off (if it was running) and later restart NTP after the co-scheduler completes. Specify a value of "yes" to inform the co-scheduler to keep NTP running during the priority adjustment cycles (if NTP was not running, NTP will not be started). If not set, the default is "no". The value of this environment variable can be overridden using the -priority_ntp flag. See the section on improving application scalability performance in IBM Parallel Environment: Operation and Use, Volume 1 for more information on the POE co-scheduler.

EXAMPLES

1. Assume the MP_PGMMODEL environment variable is set to spmd, and MP_PROCS is set to 6. To load and execute the SPMD program sample on the six remote nodes of your partition, enter:

    poe sample

2. Assume you have an MPMD application consisting of two programs: master and workers. These programs are designed to run together and communicate via calls to message passing subroutines. The program master is designed to run on one processor node.
The workers program is designed to run as separate tasks on any number of other nodes. The MP_PGMMODEL environment variable is set to mpmd, and MP_PROCS is set to 6. To individually load the six remote nodes with your MPMD application, enter:

    poe

Once the partition is established, the poe command responds with the prompt:

    0:host1_name>

To load the master program as task 0 on host1_name, enter:

    master

The poe command responds with a prompt for the next node to load. When you have loaded the last node of your partition, the poe command displays the message Partition loaded... and begins execution.

3. Assume you want to run three SPMD programs (setup, computation, and cleanup) as job steps on the same partition of nodes. The MP_PGMMODEL environment variable is set to spmd, and MP_NEWJOB is set to yes. You enter:

    poe

Once the partition is established, the poe command responds with the prompt:

    Enter program name (or quit):

To load the program setup, enter:

    setup

The program setup executes on all nodes of your partition. When execution completes, the poe command again prompts you for a program name. Enter the program names in turn. To release the partition, enter:

    quit

4. To check the process status (using the nonparallel command ps) for all remote nodes in your partition, enter:

    poe ps

FILES

host.list (Default host list file)

RELATED INFORMATION

Commands: mpcc_r(1), mpCC_r(1), mpxlf_r(1), pdbx(1)
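As a supplement to the examples, the relationship between environment variables and the flags that temporarily override them can be sketched in Python. The naming rule used here (drop the MP_ prefix, lowercase the rest, and prepend a hyphen) is inferred from the variable and flag pairs listed in this page, so treat it as an assumption rather than a guarantee for every variable. The helper names are hypothetical, and the three exceptions are the variables this page states have no associated command-line flag.

```python
# Variables this page documents as having no associated command-line flag.
NO_FLAG = {"MP_FENCE", "MP_NOARGLIST", "MP_PRIORITY"}


def flag_for(env_var):
    """Return the poe flag assumed to override env_var, or None if there is none."""
    if not env_var.startswith("MP_"):
        raise ValueError("expected a POE variable name starting with MP_")
    if env_var in NO_FLAG:
        return None
    # Inferred convention: MP_PROCS -> -procs, MP_PGMMODEL -> -pgmmodel.
    return "-" + env_var[3:].lower()


def poe_command(program, overrides):
    """Build an argument list for poe with per-run overrides as flags."""
    argv = ["poe", program]
    for name, value in overrides.items():
        flag = flag_for(name)
        if flag is None:
            raise ValueError(name + " can only be set in the environment")
        argv += [flag, str(value)]
    return argv


if __name__ == "__main__":
    # Example 1 expressed with flags instead of preset environment variables.
    print(" ".join(poe_command("sample", {"MP_PGMMODEL": "spmd", "MP_PROCS": 6})))
```

The sketch only builds the argument list and does not run poe. Aliases such as -hfile for -hostfile are not covered by the inferred rule.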