HPCC Application Note
Step 1 – Overview
This guide is intended to help current HPCC users get better benchmark performance by utilizing Intel® Math Kernel Library (Intel® MKL).
HPCC stands for the High Performance Computing Challenge benchmark. It is a suite of benchmarks that measure the performance of the CPU, memory subsystem, and interconnect. It consists of seven tests: HPL (High Performance LINPACK), DGEMM (Double-precision GEneral Matrix-Matrix multiply), STREAM, PTRANS (Parallel TRANSpose), RandomAccess, FFT (Fast Fourier Transform), and communication bandwidth/latency.
More information on HPCC is available at: http://icl.cs.utk.edu/hpcc/*.
Version Information
This application note was created to help users who benchmark clusters using HPCC to make use of the latest versions of Intel MKL on Linux platforms on Xeon systems. Specifically we'll address Intel MKL version 9.1.
Step 2 – Downloading HPCC Source Code
The HPCC source code can be downloaded from: http://icl.cs.utk.edu/hpcc/software/index.html*.
Prerequisites
- Intel MKL, which contains highly optimized FFTs as well as wrappers for FFTW. It can be obtained through either of the following options:
  - Download a FREE evaluation version of the Intel MKL product.
  - Download the FREE non-commercial* version of the Intel MKL product.
  Both can be obtained at: http://www.intel.com/software/products/mkl.
- Intel MPI can be obtained from http://www.intel.com/software/products/cluster.
Open source MPI (MPICH2) can be obtained from http://www-unix.mcs.anl.gov/mpi/mpich/*.
- Download the modified MPI FFTW wrapper interfaces, which use 64-bit long parameters for the MKL DFT.
In these modified wrappers, a few parameters have been changed from 32-bit to 64-bit. The standard MKL wrappers match the FFTW interfaces exactly, including FFTW's 32-bit parameters, so without this modification they will not work for the HPCC FFT component.
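To illustrate why 32-bit parameters are a problem here, the sketch below checks whether the MPIFFT vector length reported in Appendix A of this note (8589934592) fits in a signed 32-bit integer. It is only an arithmetic illustration: it assumes that the vector length is what passes through the wrapper's integer parameters, and the variable names are hypothetical.

```shell
# MPIFFT N as reported in Appendix A of this note (2^33 elements).
fft_n=8589934592
# Largest value a signed 32-bit integer can hold (2^31 - 1).
int32_max=2147483647

# Common shells (bash, dash) evaluate integers with at least 64 bits,
# so this comparison itself does not overflow.
if [ "$fft_n" -gt "$int32_max" ]; then
    fits="no"
else
    fits="yes"
fi

# What would remain if only the low 32 bits of N were kept.
truncated=$(( fft_n & 0xFFFFFFFF ))

echo "N=$fft_n fits in a 32-bit int: $fits (low 32 bits: $truncated)"
```

For this value the check reports that N does not fit, and the low 32 bits are 0, so a 32-bit parameter would lose the problem size entirely.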
Step 3 - Configuration
Use the following commands to extract the HPCC tar files from the downloaded hpcc-x.x.x.tar.gz and fftw2x_cdft.tar.gz files:
$gunzip hpcc-x.x.x.tar.gz
$tar -xvf hpcc-x.x.x.tar
This will create a directory hpcc-x.x.x.
Extract fftw2x_cdft.tar.gz in the same way:
$gunzip fftw2x_cdft.tar.gz
$tar -xvf fftw2x_cdft.tar
Make sure that the MPI, C++, and Fortran compilers are installed and that they are in your PATH. Also add the compiler (C++ and Fortran), MPI, and MKL library directories to LD_LIBRARY_PATH.
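The environment check described above could be sketched as follows. The MKL directory is a hypothetical default based on the 9.1.021 install path used later in this note, and the tool names (icc, ifort, mpirun) are assumptions; substitute the compilers and MPI launcher you actually use.

```shell
# Hypothetical MKL library directory; override MKL_LIB_DIR for your system.
MKL_LIB_DIR="${MKL_LIB_DIR:-/opt/intel/cmkl/9.1.021/lib/em64t}"

# Prepend the MKL directory, keeping any existing LD_LIBRARY_PATH entries.
LD_LIBRARY_PATH="$MKL_LIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH

# Print the name of every given tool that cannot be found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool names: Intel C and Fortran compilers plus the MPI launcher.
missing=$(missing_tools icc ifort mpirun)
if [ -n "$missing" ]; then
    echo "Not found in PATH:" $missing
fi
```

Running this before the build gives an early warning about a missing compiler or launcher instead of a failure partway through make.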
Step 4 – Building HPCC
- Build MPI MKL FFTW library.
From the fftw2x_cdft directory, run the following command:
$make libem64t mpi=intel3 comp=intel PRECISION=DOUBLE
Here we are building for the EM64T architecture with Intel MPI version 3.0, the Intel compilers, and DOUBLE precision. This will create the MKL FFTW interface library libfftw2x_cdft_DOUBLE.a in the lib/em64t directory.
Note: Run $make without arguments to see the available options.
- Build FFTW C wrapper library
Change to the <your mkl installation>/interfaces/fftw2xc directory and run the following command:
$make libem64t PRECISION=MKL_DOUBLE
This will create the libfftw2xc_intel.a library in the <your mkl installation>/lib/em64t directory.
- Build HPCC
Change directory to hpcc-x.x.x/hpl.
Create a new architecture makefile, for example Make.mkl, by copying an existing one. You can reuse one from the hpl/setup directory.
Edit Make.mkl and modify the LAdir and LAlib lines as shown below so that they point to the MKL libraries. This example assumes that the double-precision MPI fftw2x_cdft wrapper library was built in the $HOME/fftw2x_cdft/lib/em64t directory and that MKL version 9.1.021 is installed.
LAdir = /opt/intel/cmkl/9.1.021/lib/em64t
LAlib = $(LAdir)/libmkl_em64t.a $(HOME)/fftw2x_cdft/lib/em64t/libfftw2x_cdft_DOUBLE.a $(LAdir)/libfftw2xc_intel.a $(LAdir)/libmkl_blacs.a $(LAdir)/libmkl_cdft.a $(LAdir)/libguide.a -lpthread -lm
Build HPCC with the following command:
$make all arch=mkl
This will create an executable named hpcc in the hpcc-x.x.x directory, along with a file _hpccinf.txt, which is a template input file for hpcc. Rename this file to hpccinf.txt.
Step 5 - Running HPCC
Modify the configuration parameters in the hpccinf.txt file.
Run hpcc by executing the following command:
$mpirun -np 4 hpcc
hpccinf.txt is the same as the standard HPL input file, with a few additional lines. Please refer to our HPL application note for guidance on tuning the parameters in the configuration file.
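As a rough sanity check when choosing the HPL problem size and process grid, the sketch below uses the values from the record in Appendix A of this note (N = 320000, a 16 x 32 grid, 1024 GiB of memory). It computes the share of memory taken by the N x N double-precision matrix and the number of processes the grid implies. The variable names are hypothetical, and the idea that the matrix should fill most, but not all, of memory is a common rule of thumb rather than something this note prescribes.

```shell
# Values taken from the Appendix A record in this note.
hpl_n=320000          # HPL N
total_mem_gib=1024    # Total Memory in GiB
nprow=16              # HPL NProw
npcol=32              # HPL NPcol

# The HPL matrix holds N*N double-precision elements of 8 bytes each.
mem_fraction=$(awk -v n="$hpl_n" -v gib="$total_mem_gib" \
    'BEGIN { printf "%.3f", (n * n * 8) / (gib * 1024 * 1024 * 1024) }')

# The process grid must account for every MPI process used by HPL.
grid_procs=$(( nprow * npcol ))

echo "matrix fraction of memory: $mem_fraction"
echo "processes in grid: $grid_procs"
```

For the Appendix A run this gives a fraction of about 0.745 and a grid of 512 processes, which matches the reported number of HPL processes.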
Appendix A - Performance Results
Below are the HPCC benchmark results for the Intel Atlantis cluster, which can also be found on the HPCC website*.
HPC Challenge Benchmark Record
| System Information |
| Affiliation: | Intel Corporation | URL: | http://www.intel.com/ |
| Location: | USA, Washington, DuPont | System Use: | Vendor |
| System Manufacturer: | Intel | System Name: | Intel Atlantis cluster |
| Interconnect Manufacturer: | Mellanox | Interconnect Type: | Infiniband |
| Operating System: | RedHat EL4 Update 4 | MPI: | Intel MPI 3.1 beta |
| MPI Wtick: | 1e-06 | BLAS: | Intel Cluster MKL 9.1.023 |
| Language: | C | Compiler: | Intel C/C++ Compiler 10.0.023 |
| Compiler Flags: | -O2 -xT -ansi-alias -ip | Processor Type: | Intel Xeon 5355 |
| Processor Speed: | 2.66 GHz | Total Processors: | 512 |
| Processors Entered: | 512 | Processors Determined: | 512 |
| Cores Per Chip: | 4 | HPL Processes: | 512 |
| MPI Processes: | 512 | Threads Entered: | 1 |
| Threads Determined: | 1 | XXFLOPs Per Cycle: | |
| Theoretical Peak: | 5.44768 TFlop/s | Total Memory: | 1024 GiB |
| FFT Library: | Intel Cluster MKL 9.1.023 | | |

| HPL |
| HPL: | 4.25904 TFlop/s | HPL time: | 5129.21 |
| HPL eps: | 2.22045e-16 | HPL Rnorm1: | 2.54184e-08 |
| HPL Anorm1: | 80376.3 | HPL AnormI: | 81257.5 |
| HPL Xnorm1: | 322111 | HPL XnormI: | 5.78706 |
| HPL N: | 320000 | HPL NB: | 168 |
| HPL NProw: | 16 | HPL NPcol: | 32 |
| HPL depth: | 0 | HPL NBdiv: | 2 |
| HPL NBmin: | 4 | HPL CPfact: | R |
| HPL CRfact: | C | HPL CPtop: | 1 |
| HPL order: | R | | |
| HPL dMach EPS: | 2.220446e-16 | HPL sMach EPS: | 1.192093e-07 |
| HPL dMach sfMin: | 0 | HPL sMach sfMin: | 1.175494e-38 |
| HPL dMach Base: | 2 | HPL sMach Base: | 2 |
| HPL dMach Prec: | 4.440892e-16 | HPL sMach Prec: | 2.384186e-07 |
| HPL dMach mLen: | 53 | HPL sMach mLen: | 24 |
| HPL dMach Rnd: | 0 | HPL sMach Rnd: | 0 |
| HPL dMach eMin: | -1021 | HPL sMach eMin: | -125 |
| HPL dMach rMin: | 0 | HPL sMach rMin: | 1.175494e-38 |
| HPL dMach eMax: | 1025 | HPL sMach eMax: | 129 |
| HPL dMach rMax: | 0 | HPL sMach rMax: | 0 |
| dweps: | 1.110223e-16 | sweps: | 5.960464e-08 |

| PTRANS |
| PTRANS: | 32.0329 GB/s | PTRANS time: | 6.1632 seconds |
| PTRANS residual: | 0 | PTRANS N: | 160000 |
| PTRANS NB: | | PTRANS NProw: | 16 |
| PTRANS NPcol: | 32 | | |

| STREAM |
| S-STREAM Copy: | 3.87384 GB/s | S-STREAM Scale: | 3.89985 GB/s |
| S-STREAM Add: | 3.82254 GB/s | S-STREAM Triad: | 3.82804 GB/s |
| EP-STREAM Copy: | 0.740711 GB/s | EP-STREAM Scale: | 0.736493 GB/s |
| EP-STREAM Add: | 0.74627 GB/s | EP-STREAM Triad: | 0.747415 GB/s |
| STREAM Vector Size: | 66666666 | STREAM Threads: | 1 |

| RandomAccess |
| S-RandomAccess: | 0.0149887 Gup/s | EP-RandomAccess: | 0.00531302 Gup/s |
| G-RandomAccess: | 0.939308 Gup/s | G-RandomAccess N: | 68719476736 |
| G-RandomAccess time: | 292.639 seconds | G-RandomAccess Check Time: | 291.03 seconds |
| G-RandomAccess Errors: | 58355 | G-RandomAccess Errors Fraction: | 8.49177e-07 |
| G-RandomAccess TimeBound: | -1 | G-RandomAccess ExeUpdates: | 274877906944 |
| RandomAccess N: | 134217728 | | |

| FFT |
| S-FFT: | 1.28699 GFlop/s | EP-FFT: | 0.447491 GFlop/s |
| MPIFFT: | 69.9066 GFlop/s | MPIFFT N: | 8589934592 |
| MPIFFT Max Error: | 2.64655e-15 | MPIFFT time0: | 0 seconds |
| MPIFFT time1: | 0 seconds | MPIFFT time2: | 0 seconds |
| MPIFFT time3: | 0 seconds | MPIFFT time4: | 0 seconds |
| MPIFFT time5: | 0 seconds | MPIFFT time6: | 0 seconds |
| FFTEnblk: | 16 | FFTEnp: | 8 |
| FFTEl2size: | 1048576 | | |

| DGEMM |
| S-DGEMM: | 9.67426 GFlop/s | EP-DGEMM: | 9.09138 GFlop/s |
| DGEMM N: | 8164 | | |

| RandomRing Latency/Bandwidth |
| RandomRing Latency: | 16.7534 usec | RandomRing Bandwidth: | 0.0899317 GB/s |

| NaturalRing Latency/Bandwidth |
| NaturalRing Latency: | 6.10352 usec | NaturalRing Bandwidth: | 0.228898 GB/s |

| PingPong Latency/Bandwidth |
| Maximum PingPong Latency: | 7.06315 usec | Maximum PingPong Bandwidth: | 1.52673 GB/s |
| Minimum PingPong Latency: | 2.5034 usec | Minimum PingPong Bandwidth: | 1.01116 GB/s |
| Average PingPong Latency: | 6.2425 usec | Average PingPong Bandwidth: | 1.5118 GB/s |

| Size of Data Types |
| char: | 1 byte | short: | 2 bytes |
| int: | 4 bytes | long: | 8 bytes |
| void ptr: | 8 bytes | float: | 4 bytes |
| double: | 8 bytes | size_t: | 8 bytes |
| s64Int: | 8 bytes | u64Int: | 8 bytes |

| OpenMP |
| M OpenMP: | -1 | OpenMP Num Threads: | 0 |
| OpenMP Num Procs: | 0 | OpenMP Max Threads: | 0 |

| Memory |
| MemProc: | -1 | MemSpec: | -1 |
| MemVal: | -1 | | |

| CPS |
| CPS_HPCC_FFT_235: | 0 | CPS_HPCC_FFTW_ESTIMATE: | 0 |
| CPS_HPCC_MEMALLCTR: | 0 | CPS_HPL_USE_GETPROCESSTIMES: | 0 |
| CPS_RA_SANDIA_NOPT: | 0 | CPS_RA_SANDIA_OPT2: | 1 |
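The headline figures in this record can be cross-checked with a little arithmetic. The sketch below recomputes the theoretical peak from the process count and clock speed, then derives the HPL efficiency. The value of 4 double-precision floating-point operations per cycle is an assumption, since the corresponding field in the record is blank; it was chosen because it reproduces the reported peak. Variable names are hypothetical.

```shell
# Values taken from the record above.
procs=512             # Total Processors
ghz=2.66              # Processor Speed in GHz
hpl_tflops=4.25904    # Measured HPL result in TFlop/s

# Assumption: 4 double-precision flops per cycle per process
# (the "FLOPs Per Cycle" field in the record is empty).
flops_per_cycle=4

# Peak in TFlop/s = processes * GHz * flops per cycle / 1000.
peak_tflops=$(awk -v p="$procs" -v g="$ghz" -v f="$flops_per_cycle" \
    'BEGIN { printf "%.5f", p * g * f / 1000 }')

# HPL efficiency as a percentage of the theoretical peak.
efficiency=$(awk -v h="$hpl_tflops" -v pk="$peak_tflops" \
    'BEGIN { printf "%.1f", 100 * h / pk }')

echo "theoretical peak: $peak_tflops TFlop/s"
echo "HPL efficiency: $efficiency %"
```

Under that assumption the computed peak equals the reported 5.44768 TFlop/s, and the HPL result corresponds to roughly 78.2% of peak.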
Appendix B - Known Issues and Limitations
Appendix C – References