Intel® Trace Analyzer and Collector User and Reference Guide

ID 767272
Date 3/31/2023
Public

How the Collection Works

Understanding how Intel® Trace Collector detects the various supported errors is important: it clarifies what the different configuration options mean, what the collector can and cannot do, and how to interpret the results.

Just as for performance analysis, Intel Trace Collector intercepts all MPI calls using the MPI profiling interface. It provides a separate wrapper for each MPI call, and in these wrappers it can execute additional checks that the MPI implementation itself does not normally perform.
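The interception pattern can be illustrated with a toy model. In real MPI the mechanism lives at the C level: the MPI standard makes every routine MPI_X also reachable under the name PMPI_X, so a tool can define its own MPI_X and forward to PMPI_X. The Python sketch below only mimics that shape. The names `pmpi_send` and `mpi_send` and the particular checks are hypothetical stand-ins chosen for illustration, not Intel Trace Collector's code.

```python
# Toy model of the profiling-interface pattern. Nothing here is real MPI;
# all names and checks are hypothetical stand-ins for illustration.

def pmpi_send(buf, count, dest):
    """Stand-in for the implementation's entry point (PMPI_Send in real MPI)."""
    return {"dest": dest, "payload": bytes(buf[:count])}


def mpi_send(buf, count, dest, world_size, reports):
    """Stand-in for the wrapper that the application calls (MPI_Send)."""
    problems = []
    # Extra argument checks performed before the real call.
    if count < 0:
        problems.append("negative count")
    elif count > len(buf):
        problems.append("count exceeds buffer size")
    if not 0 <= dest < world_size:
        problems.append("invalid destination rank")
    reports.extend(problems)
    if problems:
        return None  # this sketch refuses to forward a call it flagged
    return pmpi_send(buf, count, dest)
```

The application keeps calling the usual name and never sees the wrapper. Whether a flagged call is still forwarded is a design choice of this sketch, not something the text above specifies.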

For global checks, Intel Trace Collector uses two different methods for transmitting the additional information. In collective operations, it executes another collective operation before or after the original operation, using the same communicator. For point-to-point communication, it sends one additional message over a shadow communicator for each message sent by the application.
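The point-to-point case can be sketched as follows. This is a minimal in-process model, assuming two queues stand in for the application's communicator and the shadow communicator. The `Link` class, the function names, and the two checks shown are hypothetical. The collective case, where an extra collective runs on the same communicator, is not modelled here.

```python
import queue


class Link:
    """Hypothetical pair of channels between one sender and one receiver."""

    def __init__(self):
        self.app = queue.Queue()     # carries the application's payload
        self.shadow = queue.Queue()  # carries the tool's description of it


def checked_send(link, payload, type_name, count):
    """Send the payload, plus exactly one shadow message describing it."""
    link.app.put(payload)
    link.shadow.put((type_name, count))


def checked_recv(link, expected_type, max_count):
    """Receive the payload and compare the sender's description with ours."""
    payload = link.app.get()
    sent_type, sent_count = link.shadow.get()
    problems = []
    if sent_type != expected_type:
        problems.append(f"type mismatch: sent {sent_type}, expected {expected_type}")
    if sent_count > max_count:
        problems.append(f"truncation: sent {sent_count}, room for {max_count}")
    return payload, problems
```

Keeping the description on a separate channel means the application's own message stream is left unchanged, while the receiver can still see what the sender believed it was sending.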

In addition to exchanging this extra data through MPI itself, Intel Trace Collector also creates one background thread per process. These threads are connected to each other through TCP sockets and thus can communicate with each other even while MPI is being used by the main application thread.
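A side channel of this kind can be sketched with the Python standard library. In this minimal model, a helper thread listens on a local TCP port and answers a query while the main thread is free to do something else, or to be blocked. The payload (the name of the call the main thread is currently in) and all function names are assumptions made for illustration. The text above does not state what the real threads exchange.

```python
import socket
import threading


def start_helper(state):
    """Start a background thread that answers one status query over TCP."""
    server = socket.create_server(("127.0.0.1", 0))  # port 0: the OS picks one
    port = server.getsockname()[1]

    def serve_one():
        conn, _ = server.accept()
        with conn:
            while conn.recv(64):  # drain the query until the peer stops sending
                pass
            conn.sendall(state["current_call"].encode())
        server.close()

    thread = threading.Thread(target=serve_one, daemon=True)
    thread.start()
    return port, thread


def query_peer(port):
    """Ask a peer's helper thread what its main thread is doing."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
        conn.sendall(b"status?")
        conn.shutdown(socket.SHUT_WR)  # signal that the query is complete
        chunks = []
        while True:
            chunk = conn.recv(64)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode()
```

Because the helper never touches the main thread's communication path, it can still respond when the main thread is stuck inside a blocking call.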

For distributed memory checking and for locking memory that the application should not access, Intel Trace Collector interacts with Valgrind* through Valgrind's client request mechanism. Valgrind tracks the definedness of memory (that is, whether it was initialized or not) within a process. Intel Trace Collector extends that mechanism to the whole application: it transmits this additional information between processes, using the same methods that also transmit the additional data type information, and restores the correct Valgrind state at the recipient.
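The idea of carrying definedness across a process boundary can be modelled in a few lines. The sketch below is a toy analogue that assumes one "defined" flag per byte. `TrackedBuffer`, `pack_message`, and `unpack_message` are hypothetical names. For reference, Valgrind's Memcheck tool offers client requests such as VALGRIND_GET_VBITS and VALGRIND_SET_VBITS for reading and writing validity bits; whether Intel Trace Collector uses these particular requests is not stated here.

```python
class TrackedBuffer:
    """Bytes plus a per-byte 'defined' flag, a toy analogue of shadow state."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.defined = [False] * size  # nothing has been initialized yet

    def write(self, offset, values):
        """Store values and mark exactly those bytes as defined."""
        self.data[offset:offset + len(values)] = values
        for i in range(offset, offset + len(values)):
            self.defined[i] = True


def pack_message(buf):
    """Sender side: ship the definedness flags along with the payload."""
    return bytes(buf.data), list(buf.defined)


def unpack_message(message):
    """Receiver side: restore the sender's flags instead of 'all defined'."""
    data, defined = message
    buf = TrackedBuffer(len(data))
    buf.data[:] = data
    buf.defined = list(defined)
    return buf
```

Without the restore step, every received byte would look initialized to the receiver, and a send of uninitialized data would go unnoticed on the other side.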

Without Valgrind, the LOCAL:MEMORY:ILLEGAL_MODIFICATION check is limited to reporting write accesses that modified buffers; typically such a modification is detected long after the fact. With Valgrind, Intel Trace Collector sets memory that the application hands over to MPI to "inaccessible" in Valgrind and restores accessibility when ownership is transferred back. In between, any access by the application is flagged by Valgrind right at the point where it occurs. Suppressions are used to avoid reports for the required accesses to the locked memory by the Intel MPI Library itself.
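The ownership hand-over can be illustrated with a small model in which a buffer refuses access while it is lent out, for example between the start and the completion of a nonblocking operation. `GuardedBuffer`, `IllegalAccess`, and the method names are hypothetical. For reference, Memcheck provides client requests such as VALGRIND_MAKE_MEM_NOACCESS and VALGRIND_MAKE_MEM_DEFINED for changing accessibility; the sketch does not claim these are the exact requests Intel Trace Collector issues.

```python
class IllegalAccess(Exception):
    """Raised when the application touches memory it has handed to MPI."""


class GuardedBuffer:
    """Toy buffer that is inaccessible while ownership lies with MPI."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._owned_by_mpi = False

    def hand_over(self):
        """Model the start of an operation that lends the buffer to MPI."""
        self._owned_by_mpi = True

    def hand_back(self):
        """Model completion: ownership returns to the application."""
        self._owned_by_mpi = False

    def write(self, index, value):
        if self._owned_by_mpi:
            # Flagged at the point of access, not discovered later.
            raise IllegalAccess(f"write at index {index} while MPI owns the buffer")
        self._data[index] = value

    def read(self, index):
        if self._owned_by_mpi:
            raise IllegalAccess(f"read at index {index} while MPI owns the buffer")
        return self._data[index]
```

The exception fires at the offending access itself, so the report points at the faulty line instead of at a later consistency check.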