Below are the differences between two revisions of the page.
documentation:tools:testspgi [2015/04/28 13:21] – [Functional and performance tests pgi and cuda fortran] cicaluga | documentation:tools:testspgi [2015/04/29 05:49] – [Performance tests] cicaluga | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Functional and performance tests | + | ====== Functional and performance tests |
{{INLINETOC}} | {{INLINETOC}} | ||
- | ===== Benchmarks | + | ===== PGI 15.1 on Debian systems |
- | Several functional and performance tests of these cards are presented: | + | The version |
- | ==== Hardware and software detection tests | + | ==== Environment |
- | With the Linux command lspci (which lists PCI devices, including GPU cards): | + | To use PGI 15.1, you must |
<code bash> | <code bash> | ||
- | c82gpgpu34: | + | e5-2670comp3: |
- | 05:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20m] (rev a1) | + | |
- | Subsystem: NVIDIA Corporation Device 1015 | + | |
- | Kernel driver in use: nvidia | + | |
- | 83:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20m] (rev a1) | + | |
- | Subsystem: NVIDIA Corporation Device 1015 | + | |
- | Kernel driver in use: nvidia | + | |
</ | </ | ||
- | This command | + | This command |
- | The output above was obtained on a compute node equipped with GPU cards (in this example, node c82gpgpu34, which has 2 Tesla K20 cards). | + | To check that this step |
- | + | ||
- | The command | + | |
<code bash> | <code bash> | ||
- | c82gpgpu34: | + | e5-2670comp3: |
- | nvidia_uvm | + | / |
- | nvidia | + | |
- | i2c_core | + | |
- | </code> | + | |
- | To display the version of the installed CUDA driver: | + | e5-2670comp3: |
- | <code bash> | + | pgfortran 15.1-0 64-bit target on x86-64 Linux -tp sandybridge |
- | c82gpgpu34: | + | The Portland Group - PGI Compilers and Tools |
- | NVRM version: NVIDIA UNIX x86_64 Kernel Module | + | Copyright |
- | GCC version: | + | |
</ | </ | ||
- | To display the version of the installed CUDA Toolkit | + | ==== Compilers and other tools ==== |
+ | |||
+ | The compilers and other binaries provided by PGI are located in the bin directory of the | ||
<code bash> | <code bash> | ||
- | c82gpgpu34: | + | e5-2670comp3: |
- | nvcc: NVIDIA (R) Cuda compiler driver | + | acc1rc |
- | Copyright (c) 2005-2013 NVIDIA Corporation | + | acclin8664rc |
- | Built on Wed_Jul_17_18: | + | CcffReader.jar |
- | Cuda compilation tools, release 5.5, V5.5.0 | + | ccrc |
+ | change-pgi-hostid | ||
+ | cppcurc | ||
+ | cpprc | ||
+ | c++rc | ||
+ | fnativerc | ||
+ | ganymed-ssh2-build251.jar | ||
+ | iparc | ||
+ | jide-common.jar | ||
+ | jide-dock.jar | ||
+ | jpgdbg.jar | ||
+ | Jpgprof.jar | ||
+ | libamdocl64.so | ||
+ | lin8664rc | ||
+ | lin86rc | ||
+ | llvm-as | ||
+ | llvm-link | ||
+ | lmborrow | ||
+ | lmgrd | ||
+ | lmgrd.rc | ||
+ | lmutil | ||
+ | localrc | ||
+ | makelocalrc | ||
+ | mpirun_dbg.pgdbg | ||
+ | nativerc | ||
+ | optopgprof | ||
+ | pgaccelerror | ||
+ | pgaccelinfo | ||
</ | </ | ||
- | nvcc is the compiler shipped with the CUDA Toolkit for compiling CUDA programs (it calls the gcc compiler to compile the host C code) | + | Note the following binaries: |
+ | ^ Binary ^ Description ^ | ||
+ | | pgfortran | Fortran 2003 compiler with OpenMP and auto-parallelization support | | ||
+ | | pgcc | ANSI C compiler with OpenMP and auto-parallelization support | | ||
+ | | pgc++ | ANSI C++ compiler with OpenMP and auto-parallelization support | | ||
+ | | pgprof | graphical profiler for MPI, OpenMP and multi-threaded programs | | ||
+ | | pgdbg | graphical debugger for MPI, OpenMP and multi-threaded programs | | ||
- | **Another option** (besides Linux commands) for detecting the presence and type of NVIDIA GPUs is to run the deviceQuery program, whose .cpp source is included in the NVIDIA_GPU_Computing_SDK suite (renamed NVIDIA_CUDA-x.y_Samples in recent x.y versions). After | + | ==== Options |
- | <code bash> | + | Among the compiler options, |
- | c82gpgpu34: | + | |
- | c82gpgpu34: | + | |
- | ./ | + | ^ Option ^ Description ^ |
+ | | -c | Generates an intermediate object file but does not attempt to link | | ||
+ | | -g | Adds information for debugging to the object file and/or executable | | ||
+ | | -I < | ||
+ | | -L < | ||
+ | | -r8 | Promotes REALs from the default size of 4 bytes to 8 bytes | | ||
+ | | -i8 | Promotes INTEGERs from the default size of 4 bytes to 8 bytes | | ||
+ | | -O3 | Higher level of optimization than -O2 (the default optimization level) | | ||
+ | | -fast | Higher optimization level than -O3 | | ||
+ | | -Mipa | Tells the compiler to perform interprocedural analysis. Can be very time-consuming. This flag should be used in both the compilation and linking steps | | ||
+ | | -Mconcur | Enables autoparallelization. Additional options can be used with -Mconcur to provide more fine-grained control of autoparallelization | | ||
+ | | -Minfo | Instructs the compiler to report optimizations that are made | | ||
+ | | -Mneginfo | Instructs the compiler to report optimizations that are not made | | ||
+ | | -mp | Enables parallelization via OpenMP directives | | ||
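As an illustration of how the options in the table combine, the following shell sketch prints (without running) a pgfortran command line for either auto-parallelization or OpenMP. The helper name build_cmd and the file names prog.f90, prog_auto and prog_mp are hypothetical; only the flags -fast, -Mconcur, -mp and -Minfo are taken from the table.

```shell
#!/bin/sh
# Hypothetical helper (not part of PGI): prints a pgfortran command line
# for the chosen parallelization mode. It only echoes the command, so it
# can be tried on a machine where PGI is not installed.
build_cmd() {
    mode=$1   # "auto" (-Mconcur) or "openmp" (-mp)
    src=$2    # source file (hypothetical name in the calls below)
    out=$3    # executable name
    case "$mode" in
        auto)   par="-Mconcur" ;;
        openmp) par="-mp" ;;
        *)      echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # -fast: aggressive optimization, -Minfo: report optimizations made
    echo "pgfortran -fast $par -Minfo $src -o $out"
}

build_cmd auto   prog.f90 prog_auto
build_cmd openmp prog.f90 prog_mp
```

To actually compile, the printed line can be pasted into a shell where the PGI environment has been set up.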
- | CUDA Device Query (Runtime API) version (CUDART static linking) | ||
- | Found 2 CUDA Capable device(s) | + | ==== Performance tests ==== |
- | Device 0: "Tesla K20m" | + | Typical source code examples for evaluating the capabilities of the PGI compiler are available in the directory |
- | CUDA Driver Version | + | |
- | CUDA Capability Major/Minor version number: | + | |
- | Total amount of global memory: | + | |
- | MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)! | + | |
- | MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)! | + | |
- | (13) Multiprocessors x (-1) CUDA Cores/MP: -13 CUDA Cores | + | |
- | GPU Clock Speed: | + | |
- | Memory Clock rate: | + | |
- | Memory Bus Width: | + | |
- | L2 Cache Size: | + | |
- | Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, | + | |
- | Max Layered Texture Size (dim) x layers | + | |
- | Total amount of constant memory: 65536 bytes | + | |
- | Total amount of shared memory per block: | + | |
- | Total number of registers available per block: 65536 | + | |
- | Warp size: 32 | + | |
- | Maximum number of threads per block: | + | |
- | Maximum sizes of each dimension of a block: | + | |
- | Maximum sizes of each dimension of a grid: | + | |
- | Maximum memory pitch: | + | |
- | Texture alignment: | + | |
- | Concurrent copy and execution: | + | |
- | Run time limit on kernels: | + | |
- | Integrated GPU sharing Host Memory: | + | |
- | Support host page-locked memory mapping: | + | |
- | Concurrent kernel execution: | + | |
- | Alignment requirement for Surfaces: | + | |
- | Device has ECC support enabled: | + | |
- | Device is using TCC driver mode: No | + | |
- | Device supports Unified Addressing (UVA): | + | |
- | Device PCI Bus ID / PCI location ID: 5 / 0 | + | |
- | Compute Mode: | + | |
- | < Exclusive Process (many threads in one process is able to use :: | + | |
- | Device 1: "Tesla K20m" | + | <code bash> |
- | CUDA Driver Version | + | e5-2670comp3: |
- | CUDA Capability Major/Minor version number: | + | dr-xr-xr-x 3 root root 4096 mars |
- | Total amount of global memory: | + | dr-xr-xr-x 4 root root 4096 mars |
- | MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)! | + | dr-xr-xr-x 2 root root 4096 mars |
- | MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)! | + | dr-xr-xr-x 8 root root 4096 mars |
- | (13) Multiprocessors x (-1) CUDA Cores/ | + | -r--r--r-- 1 root root 659 mars |
- | GPU Clock Speed: | + | </code> |
- | | + | |
- | Memory Bus Width: | + | |
- | L2 Cache Size: | + | |
- | Max Texture Dimension Size (x,y,z) | + | |
- | Max Layered Texture Size (dim) x layers | + | |
- | Total amount of constant memory: | + | |
- | Total amount of shared memory per block: | + | |
- | Total number of registers available per block: 65536 | + | |
- | Warp size: 32 | + | |
- | Maximum number of threads per block: | + | |
- | Maximum sizes of each dimension of a block: | + | |
- | Maximum sizes of each dimension of a grid: | + | |
- | Maximum memory pitch: | + | |
- | Texture alignment: | + | |
- | Concurrent copy and execution: | + | |
- | Run time limit on kernels: | + | |
- | | + | |
- | | + | |
- | Concurrent kernel execution: | + | |
- | Alignment requirement for Surfaces: | + | |
- | | + | |
- | Device is using TCC driver mode: No | + | |
- | Device supports Unified Addressing (UVA): | + | |
- | Device PCI Bus ID / PCI location ID: 131 / 0 | + | |
- | Compute Mode: | + | |
- | < Exclusive Process (many threads in one process is able to use :: | + | |
- | deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 5.5, NumDevs = 2, Device = Tesla K20m, Device = Tesla K20m | + | To try these benchmarks, copy these directories into your user account. |
+ | As an example, consider here the | ||
+ | |||
+ | <code bash> | ||
+ | e5-2670comp3: | ||
+ | -r-xr-xr-x 1 root root 54 mars | ||
+ | -r-xr-xr-x 1 root root 51 mars | ||
+ | -r-xr-xr-x 1 root root 62 mars | ||
+ | -r-xr-xr-x 1 root root 59 mars | ||
+ | -r-xr-xr-x 1 root root 64 mars | ||
+ | -r--r--r-- 1 root root 2556 mars | ||
+ | -r--r--r-- 1 root root 6464 mars | ||
</ | </ | ||
- | ==== Bandwidth test ==== | + | The Fortran source code is in the file matmul.F, while the build* scripts contain |
- | Another test provided with NVIDIA_GPU_Computing_SDK is the bandwidthTest program. After compiling the source program | + | <code bash> |
- | * transfer from the CPU to the GPU | + | e5-2670comp3: |
- | * transfer from the GPU to the CPU | + | pgf77 -fast -Mconcur -Minfo matmul.F -o matmul_f77 -V |
- | * transfer from the GPU to the GPU (intra GPU) | + | e5-2670comp3: |
- | Below is the complete output | + | pgf90 -fast -Mconcur -Minfo -DPGF90 matmul.F -o matmul_f90 -V |
+ | e5-2670comp3:~> cat / | ||
+ | pgf90 -fast -mp -Minfo -DPGF90 matmul.F -o matmul_f90mp -V | ||
+ | e5-2670comp3: | ||
+ | pgf77 -fast -mp -Minfo matmul.F -o matmul_f77mp -V | ||
+ | e5-2670comp3: | ||
+ | pghpf -fast -Mautopar -Minfo -DPGF90 | ||
+ | </ | ||
+ | |||
+ | Running | ||
+ | |||
+ | They can then be executed. By default, a single processor is used. The number of processors used can be changed as follows: | ||
+ | - for the | ||
<code bash> | <code bash> | ||
- | c82gpgpu34: | + | setenv NCPUS 2 |
- | c82gpgpu34: | + | </code> |
- | ./C/ | + | - pour la parallelisation avec OpenMP (matmul_f77mp, |
+ | <code bash> | ||
+ | setenv OMP_NUM_THREADS 2 | ||
+ | </code> | ||
- | Running on... | + | - for HPF: add -pghpf -np 2 at the time of |
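The setenv lines above use csh syntax. In a Bourne-type shell (sh, bash) the same variables can be set for a single command instead. The sketch below assumes that NCPUS is read by the -Mconcur builds and OMP_NUM_THREADS by the -mp builds; the helper run_with_threads is hypothetical, and a small sh -c command stands in for the matmul binaries so the sketch runs without PGI.

```shell
#!/bin/sh
# Hypothetical helper: run a command with both thread-count variables set
# for that command only, leaving the calling shell's environment untouched.
run_with_threads() {
    n=$1
    shift
    env NCPUS="$n" OMP_NUM_THREADS="$n" "$@"
}

# Stand-in for ./matmul_f90 or ./matmul_f90mp: show what the program would see.
for n in 1 2; do
    run_with_threads "$n" sh -c 'echo "NCPUS=$NCPUS OMP_NUM_THREADS=$OMP_NUM_THREADS"'
done
```

With a real build, the call would look like run_with_threads 2 ./matmul_f90mp, which avoids having to unset the variables between runs.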
- | Device | + | Here are the results of these runs for 1 and 2 processors: |
- | Quick Mode | + | |
+ | <code bash> | ||
+ | e5-2670comp3: | ||
+ | | ||
+ | M = 200, N = 200, P = 200 | ||
+ | | ||
+ | | ||
+ | e5-2670comp3:~/ | ||
+ | e5-2670comp3: | ||
+ | | ||
+ | M = 200, N = 200, P = 200 | ||
+ | | ||
+ | c(1,1) = 200.0000000000000 | ||
- | Host to Device Bandwidth, | + | e5-2670comp3: |
- | Transfer Size (Bytes) Bandwidth(MB/s) | + | e5-2670comp3: |
- | 33554432 3819.7 | + | 2.0000001E-03 |
+ | M = 200 , N = 200 , P = 200 | ||
+ | | ||
+ | c(1,1) = | ||
+ | e5-2670comp3: | ||
+ | e5-2670comp3: | ||
+ | 4.0000002E-03 | ||
+ | M = 200 , N = 200 , P = 200 | ||
+ | | ||
+ | | ||
- | Device to Host Bandwidth, 1 Device(s), Paged memory | + | e5-2670comp3: |
- | Transfer Size (Bytes) Bandwidth(MB/s) | + | e5-2670comp3: |
- | 33554432 3381.9 | + | |
+ | M = 200, N = 200, P = 200 | ||
+ | MFLOPS = 7980.000 | ||
+ | c(1,1) = 200.0000000000000 | ||
+ | e5-2670comp3: | ||
+ | e5-2670comp3: | ||
+ | 4.0000002E-03 | ||
+ | M = 200, N = 200, P = 200 | ||
+ | | ||
+ | | ||
- | | + | e5-2670comp3: |
- | Transfer Size (Bytes) Bandwidth(MB/s) | + | e5-2670comp3: |
- | 33554432 143586.3 | + | 2.0000001E-03 |
+ | M = 200 , N = 200 , P = 200 | ||
+ | | ||
+ | c(1,1) = | ||
+ | e5-2670comp3: | ||
+ | e5-2670comp3: | ||
+ | 2.0000001E-03 | ||
+ | M = 200 , N = 200 , P = 200 | ||
+ | | ||
+ | | ||
+ | e5-2670comp3: | ||
+ | e5-2670comp3: | ||
+ | e5-2670comp3: | ||
+ | | ||
+ | M = 200 , N = 200 , P = 200 | ||
+ | | ||
+ | | ||
+ | e5-2670comp3: | ||
+ | | ||
+ | M = 200 , N = 200 , P = 200 | ||
+ | | ||
+ | | ||
</ | </ |
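To relate the printed times to the MFLOPS figure, here is a small arithmetic sketch. It assumes an operation count of (2P - 1) x M x N for the M x P by P x N product (P multiplications and P - 1 additions per result element); matmul.F itself is not shown on this page, so this count is an assumption, chosen because it reproduces the 7980.000 value in the listing when the elapsed time is taken as 0.002 s. The function name mflops is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: MFLOPS for an (M x P) * (P x N) matrix product,
# assuming (2P - 1) * M * N floating-point operations in total.
# Arguments: M N P elapsed_seconds
mflops() {
    LC_ALL=C awk -v m="$1" -v n="$2" -v p="$3" -v t="$4" \
        'BEGIN { printf "%.3f\n", (2 * p - 1) * m * n / t / 1e6 }'
}

mflops 200 200 200 0.002   # time of about 2 ms, as in several runs above
mflops 200 200 200 0.004   # time of about 4 ms
```

The times in the listing are all close to 2 ms or 4 ms, which may indicate that the timer resolution is coarse relative to a 200 x 200 x 200 problem; if so, the MFLOPS values should be read as rough indications only.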