Intel® Trace Analyzer and Collector User and Reference Guide

ID 767272
Date 3/31/2023
Public

Checking Collective Operations

(GLOBAL:COLLECTIVE)

Checking the correct usage of collective operations is easier than checking messages. At the beginning of each operation, Intel® Trace Collector broadcasts a fixed set of data from rank #0 of the communicator. This data includes:

  • Type of the operation

  • Root (zero if not applicable)

  • Reduction type (predefined types only)

All involved processes then check these parameters against their own and report an error if they do not match. If the operation type is the same, the root rank is also checked for collective operations that have a root process, and the reduction operation is also checked for reduce operations. The GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH error can only be detected for predefined reduction operations, because it is impossible to verify whether the program code associated with a custom reduction operation has the same semantics on all processes. After this step, further parameters that depend on the operation are shared between the processes and checked.
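The comparison described above can be sketched in plain C. This is only an illustration of the order of the checks, not the Intel® Trace Collector implementation: the types op_type, red_type, coll_header and check_result and the function check_header are hypothetical names, and skipping the reduction check whenever either side uses a custom operation is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; none of them are Intel Trace Collector APIs. */
typedef enum { OP_BARRIER, OP_BCAST, OP_REDUCE, OP_ALLREDUCE } op_type;
typedef enum { RED_NONE, RED_SUM, RED_MAX, RED_CUSTOM } red_type;

/* The data broadcast from rank #0 at the start of a collective operation. */
typedef struct {
    op_type  type;  /* type of the operation */
    int      root;  /* root rank, zero if not applicable */
    red_type red;   /* reduction type, RED_CUSTOM if user-defined */
} coll_header;

/* Illustrative result codes, not the tool's error names. */
typedef enum { CHECK_OK, TYPE_MISMATCH, ROOT_MISMATCH, REDUCTION_MISMATCH } check_result;

static bool has_root(op_type t)     { return t == OP_BCAST || t == OP_REDUCE; }
static bool is_reduction(op_type t) { return t == OP_REDUCE || t == OP_ALLREDUCE; }

/* Compare the header received from rank #0 with the local parameters. */
check_result check_header(const coll_header *from_rank0, const coll_header *local)
{
    assert(from_rank0 != NULL && local != NULL);

    /* The root and the reduction are only compared if the types agree. */
    if (from_rank0->type != local->type)
        return TYPE_MISMATCH;

    if (has_root(local->type) && from_rank0->root != local->root)
        return ROOT_MISMATCH;

    /* Assumption: a custom operation on either side cannot be verified. */
    if (is_reduction(local->type) &&
        from_rank0->red != RED_CUSTOM && local->red != RED_CUSTOM &&
        from_rank0->red != local->red)
        return REDUCTION_MISMATCH;

    return CHECK_OK;
}
```

In this sketch, a rank that calls a reduce operation with RED_MAX while rank #0 uses RED_SUM gets REDUCTION_MISMATCH, whereas a rank that uses RED_CUSTOM passes the check because the two operations cannot be compared.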

Invalid parameters, such as MPI_DATATYPE_NULL where a valid data type is required, are detected while checking the parameters. They are reported as a single GLOBAL:COLLECTIVE:INVALID_PARAMETER error that describes the invalid parameter in each process. This produces less output than printing one error for each process.
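The effect of grouping the findings of all processes into one error can be sketched as follows. The function report_invalid_parameters, the helper advance, and the findings array are hypothetical, and the text layout is invented for illustration; it is not the actual output format of Intel® Trace Collector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Clamp the write position so it never moves past the terminating NUL. */
static size_t advance(size_t used, int written, size_t buflen)
{
    if (written > 0)
        used += (size_t)written;
    return used < buflen ? used : buflen - 1;
}

/*
 * findings[rank] describes the invalid parameter seen on that rank, or is
 * NULL if that rank found nothing wrong. One combined report is written to
 * buf (illustrative layout only). Returns the number of affected ranks.
 */
int report_invalid_parameters(const char *const findings[], int nranks,
                              char *buf, size_t buflen)
{
    int invalid = 0;
    size_t used = 0;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';

    for (int rank = 0; rank < nranks; rank++) {
        if (findings[rank] == NULL)
            continue;
        if (invalid == 0)  /* emit the error header only once */
            used = advance(used,
                           snprintf(buf, buflen,
                                    "GLOBAL:COLLECTIVE:INVALID_PARAMETER\n"),
                           buflen);
        /* one description line per affected rank */
        used = advance(used,
                       snprintf(buf + used, buflen - used,
                                "  [%d] %s\n", rank, findings[rank]),
                       buflen);
        invalid++;
    }
    return invalid;
}
```

With three ranks of which two pass an invalid data type, the sketch produces one header followed by two description lines, instead of two complete error reports.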

If any of these checks fails, the original operation is not executed on any process. The application can therefore proceed, but its semantics will be affected.
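Why a skipped operation changes the behavior of the application can be seen in a small simulation. The function checked_sum_reduce is hypothetical and models a sum reduction over an array of per-rank contributions. That the result buffer keeps its previous content when the operation is skipped is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simulated sum reduction guarded by the per-rank check results.
 * If any rank reports a failed check, the reduction is skipped for all
 * ranks and *result keeps whatever value it had before the call.
 */
bool checked_sum_reduce(const int contributions[], const bool check_passed[],
                        int nranks, int *result)
{
    assert(result != NULL);

    for (int rank = 0; rank < nranks; rank++)
        if (!check_passed[rank])
            return false;  /* skipped everywhere, *result is untouched */

    int sum = 0;
    for (int rank = 0; rank < nranks; rank++)
        sum += contributions[rank];
    *result = sum;
    return true;
}
```

The program continues after a skipped reduction, but any later code that reads the result works with stale data, which is how the application semantics end up being affected.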