Reference: |
OpenMP
With "#pragma omp parallel" you declare the beginning of a region that should be parallelized. To compile such source code, pass the OpenMP option to your compiler:
-openmp (Intel compiler; newer releases use -qopenmp)
gcc -fopenmp -o openmp omp.c (GCC, since version 4.2)
#pragma omp parallel
{
}
tells OpenMP that the following block should be executed in parallel by a team of threads
#pragma omp parallel
{
#pragma omp for
}
or in short form:
#pragma omp parallel for
for(...)
The iterations of the for loop are divided among the parallel threads. For example, if the loop has 100,000 iterations and there are 5 threads, then with a static schedule the first thread takes iterations 1 to 20,000, the second takes 20,001 to 40,000, and so on.
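As a minimal sketch of the short form, the function below scales an array in parallel. The name scale_array and its parameters are made up for illustration; if the file is compiled without -fopenmp, the pragma is ignored and the loop simply runs serially.

```c
/* Hypothetical helper: multiplies every element of a[0..n-1] by factor. */
void scale_array(double *a, long n, double factor)
{
    long i;

    /* The iterations are independent of each other, so OpenMP can hand
       each thread its own share of the index range. The loop variable i
       is made private to each thread automatically. */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] *= factor;
    }
}
```

A call such as scale_array(data, 100000, 2.0) would let each thread work on its own block of the 100,000 elements.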
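Variables
Variables are defined in two groups: private variables, of which every thread gets its own copy, and shared variables, which all threads access in common.
#pragma omp parallel for private(i,j,workingprogress) shared(sum,overallprogress)
for(...)
critical sections
Prevent race conditions when several threads update the same shared variable.
#pragma omp atomic
critVar += 1;
#pragma omp atomic protects a single update of one variable; a larger block of statements can be protected with #pragma omp critical.
reduction
Brings the data the threads have processed together.
#pragma omp parallel for reduction(operator:variable)
At the end of the for loop OpenMP combines the threads' private copies of the variable using the given operator. Each private copy is initialized with the neutral value of the operator (+: 0, *: 1).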
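The two mechanisms above can be sketched side by side. The function names dot and count_negative are hypothetical, chosen only for this illustration: the first uses a reduction, the second an atomic update of a shared counter.

```c
/* Dot product with a reduction: every thread accumulates into its own
   private copy of sum (initialized to 0), and OpenMP adds the copies
   together when the loop ends. */
double dot(const double *x, const double *y, long n)
{
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Counts negative entries: count is shared by all threads, so each
   increment is protected with an atomic update. */
long count_negative(const double *x, long n)
{
    long count = 0;
    long i;

    #pragma omp parallel for shared(count)
    for (i = 0; i < n; i++) {
        if (x[i] < 0.0) {
            #pragma omp atomic
            count += 1;
        }
    }
    return count;
}
```

For a plain counter like this a reduction would usually be the cheaper choice; the atomic version is shown only to illustrate the directive.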
chunks
Divide the loop iterations into distinct parts (chunks) that are handed out to the threads: schedule(dynamic,CHUNK)
single, master
#pragma omp single
only one thread (not necessarily thread 0) runs through the part
#pragma omp master
only thread 0 runs through
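A small sketch of the schedule clause and of single and master follows. The chunk size of 4 and the function names sum_dynamic and run_once are assumptions made for this example, not part of the original notes.

```c
#define CHUNK 4   /* hypothetical chunk size */

/* Sums the integers 0..n-1. Idle threads fetch the next CHUNK
   iterations at run time instead of getting one fixed block each. */
long sum_dynamic(long n)
{
    long total = 0;
    long i;

    #pragma omp parallel for schedule(dynamic, CHUNK) reduction(+:total)
    for (i = 0; i < n; i++) {
        total += i;
    }
    return total;
}

/* Reports how often the single block and the master block were
   executed inside one parallel region. */
void run_once(int *single_runs, int *master_runs)
{
    int s = 0;
    int m = 0;

    #pragma omp parallel shared(s, m)
    {
        #pragma omp single
        {
            s += 1;   /* executed by exactly one thread of the team */
        }
        /* single ends with an implicit barrier; master does not */
        #pragma omp master
        {
            m += 1;   /* executed by thread 0 only */
        }
    }
    *single_runs = s;
    *master_runs = m;
}
```

Both counters end up at 1 no matter how many threads the region uses.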
barriers
Set a point in the code that all parallel threads must reach before any of them continues.
#pragma omp barrier
MPI (Message Passing Interface)
Intel offers products and a library for MPI development. Debuggers for parallel applications include TotalView, DDT and IDB. To analyse MPI events and communication, use Intel Trace Analyzer and Collector.
Start an executable with TCP/IP communication over Ethernet:
mpiexec -n 2 -env I_MPI_DEVICE sock a.out
Start the same application on InfiniBand (high-performance network):
mpiexec -n 2 -env I_MPI_DEVICE rdma a.out
The device shm is for shared memory. Systems with multiple cores use the ssm device.
Compiling MPI programs:
mpicc -o prg prg.c
Running MPI programs:
mpirun -np 1 prg
With the -np flag you indicate the number of processes to create.
You can run this command on a single machine, but when I tried 200 processes the MPI_Init call crashed. To run on a cluster, use the -machinefile parameter:
mpirun -n $(number of processors) -machinefile $(file with host names) program
int MPI_Bcast (
void *buffer, address of first broadcast element
int count, number of elements
MPI_Datatype datatype, type of elements to broadcast
int root, ID of process doing broadcast
MPI_Comm comm communicator, i.e. the group of processes taking part in the broadcast
)
int MPI_Send(
void *message, starting address of the data to be transmitted
int count, number of data items
MPI_Datatype datatype, type of the data
int dest, rank or id of the receiver process
int tag, label of the message for different purposes
MPI_Comm comm communicator in which this message is being sent
)
int MPI_Recv(
void *message, starting address of the buffer where the received data is to be stored
int count, number of data items the process can receive at a time which is
bounded through the size of the buffer
MPI_Datatype datatype, type of the data
int source, rank or id of the sender process
int tag, desired label of the message
MPI_Comm comm communicator in which this message is being passed
MPI_Status *status pointer to a MPI_Status data structure
)
Function MPI_Recv blocks until the message has been received or until an error condition causes the function to return. After it returns, the status record contains information about the just-completed call, such as the rank of the actual sender (MPI_SOURCE) and the tag of the received message (MPI_TAG).
To receive a message from any sender, pass MPI_ANY_SOURCE as the source argument of MPI_Recv. To accept a message with any tag, pass MPI_ANY_TAG as the tag argument.
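The three calls above might be combined as in the following sketch. The value 7, the tag RESULT_TAG and the output text are invented for illustration. The program needs an MPI installation and is meant to be built with mpicc and started with mpirun or mpiexec as described earlier.

```c
#include <mpi.h>
#include <stdio.h>

#define RESULT_TAG 0   /* hypothetical message label */

int main(int argc, char *argv[])
{
    int rank, size;
    int factor = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Process 0 picks a value; the broadcast copies it to everyone. */
    if (rank == 0) {
        factor = 7;
    }
    MPI_Bcast(&factor, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank != 0) {
        /* Every other process sends one int back to process 0. */
        int result = rank * factor;
        MPI_Send(&result, 1, MPI_INT, 0, RESULT_TAG, MPI_COMM_WORLD);
    } else {
        int i;
        for (i = 1; i < size; i++) {
            int result;
            MPI_Status status;

            /* Accept the messages in whatever order they arrive and
               read the actual sender and tag from the status record. */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("received %d from process %d (tag %d)\n",
                   result, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}
```

The order of the output lines may differ from run to run, because process 0 accepts the messages as they come in.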
Function MPI_Scatterv enables a single root process to distribute a contiguous group of elements to all of the processes in a communicator, including itself. It is a collective communication function, and all of the processes in a communicator participate in its execution. The function requires that each process has previously initialized two arrays: one that indicates the number of elements the root process should send to each of the processes, and one that indicates the displacement of each block of elements in the array being scattered. In this case we want to scatter the blocks in process order: process 0 gets the first block, process 1 gets the second block, and so on.
Function MPI_Gatherv allows a single MPI process to gather together data elements stored on all processes in a communicator. If every process contributes the same number of data elements, the simpler function MPI_Gather is appropriate.
Function MPI_Alltoallv allows every MPI process to gather data items from all the processes in the communicator. The simpler function MPI_Alltoall should be used when all of the groups of data items being transferred from one process to another have the same number of elements.
Example: Matrix-vector Multiplication
Phases of the parallel matrix-vector multiplication algorithm based on a checkerboard block decomposition of the matrix elements: First, vector b is distributed among the tasks. Second, each task performs matrix-vector multiplication on its block of matrix A and its portion of vector b. Third, each row of tasks performs a sum-reduction of the result vectors, creating vector c.
Redistributing vector b
The algorithm is simpler when the process grid is square. Processes in the first column send their blocks of b to processes in the first row. Then each process in the first row broadcasts its block of b to the other processes in its column.
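To make the three phases of the checkerboard algorithm concrete, the sketch below simulates a grid of tasks serially inside one process. It makes no MPI calls. The helper names block_low, block_size, block_counts and checkerboard_matvec are hypothetical, and the matrix is assumed to be square and stored row by row. The same block arithmetic is one way to fill the count and displacement arrays that MPI_Scatterv and MPI_Gatherv expect.

```c
/* Hypothetical helpers: first index and length of block id when n
   elements are split into p nearly equal contiguous blocks. */
static int block_low(int id, int p, int n)
{
    return (int)((long)id * n / p);
}

static int block_size(int id, int p, int n)
{
    return block_low(id + 1, p, n) - block_low(id, p, n);
}

/* Fills counts[] and displs[] for scattering n elements to p
   processes in process order. */
void block_counts(int p, int n, int *counts, int *displs)
{
    int id;
    for (id = 0; id < p; id++) {
        counts[id] = block_size(id, p, n);
        displs[id] = block_low(id, p, n);
    }
}

/* Serial simulation of c = A*b (A is n x n, row-major) on a
   rows x cols grid of tasks. */
void checkerboard_matvec(const double *A, const double *b, double *c,
                         int n, int rows, int cols)
{
    int r, q, i, j;

    for (i = 0; i < n; i++) {
        c[i] = 0.0;
    }
    for (r = 0; r < rows; r++) {          /* grid row of the task */
        for (q = 0; q < cols; q++) {      /* grid column of the task */
            int i0 = block_low(r, rows, n);
            int i1 = i0 + block_size(r, rows, n);
            int j0 = block_low(q, cols, n);
            int j1 = j0 + block_size(q, cols, n);

            /* Phase 1: task (r,q) holds the portion b[j0..j1-1].
               Phase 2: it multiplies its block of A by that portion. */
            for (i = i0; i < i1; i++) {
                double partial = 0.0;
                for (j = j0; j < j1; j++) {
                    partial += A[i * n + j] * b[j];
                }
                /* Phase 3: the partial results of one grid row are
                   summed into c. */
                c[i] += partial;
            }
        }
    }
}
```

In a real MPI program each (r,q) iteration would run in its own process, and the final summation would become a sum-reduction within the communicator of each grid row.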
When the process grid is not square, the processes in the first column first gather vector b onto the process at grid position (0,0). Next, process (0,0) scatters b to the processes in the first row. Finally, each process in the first row broadcasts its block of b to the other processes in its column.
Creating a Communicator
The default communicator is MPI_COMM_WORLD, which is the set of all processes executing the MPI program.
Function MPI_Dims_create() is used to create a virtual mesh of processes that is as close to square as possible, which results in an algorithm having maximum scalability. When you pass it the total number of nodes desired for a Cartesian grid and the number of grid dimensions, the function returns an array of integers specifying the number of nodes in each dimension of the grid, so that the sizes of the dimensions are as balanced as possible.
Function MPI_Cart_create creates a communicator with a Cartesian topology. The output parameter cart_comm returns the address of the newly created Cartesian communicator.
Function MPI_Cart_rank
In order to send a matrix row to the first process in the appropriate row of the process grid, process 0 needs to know the rank of that process. This function, when passed the coordinates of a process in the grid, returns its rank.
Function MPI_Cart_coords does the reverse: given the rank of a process, it returns the coordinates of that process in the grid, which are not the same as the process ranks.
Function MPI_Comm_split
In order to scatter an input row among only the processes in a single row of the process grid, we must divide the Cartesian communicator into separate communicators for every row in the process grid. The collective function MPI_Comm_split partitions the processes in an existing communicator into one or more subgroups and constructs a communicator for each of these new subgroups.
document classification |