[cig-commits] r22997 - seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda

lefebvre at geodynamics.org
Wed Feb 12 07:08:13 PST 2014


Author: lefebvre
Date: 2014-02-12 07:08:13 -0800 (Wed, 12 Feb 2014)
New Revision: 22997

Modified:
   seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda/initialize_cuda.cu
Log:
Vicious bug fix. With multiple GPUs per node, checking for errors after the call to cudaGetDeviceCount caused multiple synchronizations of the same device.

Modified: seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda/initialize_cuda.cu
===================================================================
--- seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda/initialize_cuda.cu	2014-02-11 15:18:25 UTC (rev 22996)
+++ seismo/3D/SPECFEM3D_GLOBE/trunk/src/cuda/initialize_cuda.cu	2014-02-12 15:08:13 UTC (rev 22997)
@@ -88,10 +88,13 @@
   // Gets number of GPU devices
   device_count = 0;
   cudaGetDeviceCount(&device_count);
+  // Do not check whether the command failed:
+  // `exit_on_cuda_error` calls cudaDeviceSynchronize/cudaThreadSynchronize.
+  // If multiple MPI tasks access multiple GPUs per node, they will all try to
+  // synchronize GPU 0 and, depending on the order of the calls, an error will
+  // be raised when setting the device number. If MPS is enabled, some GPUs
+  // will silently not be used.
 
-  // checks if command failed
-  exit_on_cuda_error("CUDA runtime error: cudaGetDeviceCount failed\n\nplease check if driver and runtime libraries work together\nor on titan enable environment: CRAY_CUDA_PROXY=1 to use single GPU with multiple MPI processes\n\nexiting...\n");
-
   // returns device count to fortran
   if (device_count == 0) exit_on_error("CUDA runtime error: there is no device supporting CUDA\n");
   *ncuda_devices = device_count;
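
Note on the change above: the commit drops the error check entirely because `exit_on_cuda_error` synchronizes the device. One possible alternative, shown here only as a sketch and not as what the commit does, is to inspect the status returned by `cudaGetDeviceCount` itself, which needs no synchronize call. The helper name `query_device_count` is hypothetical, and the `#else` branch is a stand-in (a pretend node with 2 GPUs) so that the sketch also compiles without the CUDA toolkit.

```c
#include <stdio.h>

#ifdef __CUDACC__
#include <cuda_runtime.h>
#else
/* Stand-ins for illustration only (assumption): they let the sketch
   compile without the CUDA toolkit and pretend the node has 2 GPUs. */
typedef int cudaError_t;
enum { cudaSuccess = 0 };
static cudaError_t cudaGetDeviceCount(int *count) {
  *count = 2;
  return cudaSuccess;
}
static const char *cudaGetErrorString(cudaError_t err) {
  (void)err;
  return "stub error";
}
#endif

/* Hypothetical helper: returns the number of devices, or -1 if the
   query itself failed. It looks only at the status returned by
   cudaGetDeviceCount and never calls a synchronize function. */
static int query_device_count(void) {
  int device_count = 0;
  cudaError_t err = cudaGetDeviceCount(&device_count);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
            cudaGetErrorString(err));
    return -1;
  }
  return device_count;
}
```

A caller could treat a return value of -1 or 0 as fatal, in the same way the existing `device_count == 0` check does, before any MPI task has selected its device.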


