[cig-commits] r22997 - seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda
lefebvre at geodynamics.org
Wed Feb 12 07:08:13 PST 2014
Author: lefebvre
Date: 2014-02-12 07:08:13 -0800 (Wed, 12 Feb 2014)
New Revision: 22997
Modified:
seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda/initialize_cuda.cu
Log:
Vicious bug fix. With multiple GPUs per node, checking for errors after the call to cudaGetDeviceCount caused the same device to be synchronized multiple times.
Modified: seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda/initialize_cuda.cu
===================================================================
--- seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda/initialize_cuda.cu 2014-02-11 15:18:25 UTC (rev 22996)
+++ seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda/initialize_cuda.cu 2014-02-12 15:08:13 UTC (rev 22997)
@@ -88,10 +88,13 @@
// Gets number of GPU devices
device_count = 0;
cudaGetDeviceCount(&device_count);
+ // Do not check whether the command failed:
+ // `exit_on_cuda_error` calls cudaDevice/ThreadSynchronize. If multiple
+ // MPI tasks access multiple GPUs per node, they will all try to synchronize
+ // GPU 0 and, depending on the order of the calls, an error will be raised
+ // when setting the device number. If MPS is enabled, some GPUs will silently
+ // not be used.
- // checks if command failed
- exit_on_cuda_error("CUDA runtime error: cudaGetDeviceCount failed\n\nplease check if driver and runtime libraries work together\nor on titan enable environment: CRAY_CUDA_PROXY=1 to use single GPU with multiple MPI processes\n\nexiting...\n");
-
// returns device count to fortran
if (device_count == 0) exit_on_error("CUDA runtime error: there is no device supporting CUDA\n");
*ncuda_devices = device_count;
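
The change above drops the error check entirely because `exit_on_cuda_error` synchronizes the device. As an illustration only, and not part of this commit, the status returned by `cudaGetDeviceCount` could be inspected directly without any synchronization call. The helper name `get_device_count_no_sync` is hypothetical and does not exist in SPECFEM3D_GLOBE; this is a minimal sketch under that assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not in SPECFEM3D_GLOBE): queries the number of
// devices and checks the status returned by cudaGetDeviceCount itself,
// instead of calling a helper that synchronizes a device.
// Returns 0 on success, 1 on failure; *device_count is 0 on failure.
static int get_device_count_no_sync(int *device_count) {
  *device_count = 0;
  cudaError_t err = cudaGetDeviceCount(device_count);
  if (err != cudaSuccess) {
    // Report the failure without touching any device.
    fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    *device_count = 0;
    return 1;
  }
  return 0;
}
```

A caller would treat a non-zero return value, or a count of zero, the same way the existing code treats `device_count == 0`.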