[cig-commits] [commit] devel: Vicious bug fix. Multi GPU per nodes implied multiple synchronizations of the same device when checking error after the call to deviceCount. (1d2434c)
cig_noreply at geodynamics.org
Wed Feb 12 07:12:18 PST 2014
Repository : ssh://geoshell/specfem3d
On branch : devel
Link : https://github.com/geodynamics/specfem3d/compare/cc878e6a5c1692b8aaeaca1803d4685e56b20e41...1d2434c01aa85bb8e6d5f2e1c4897e5a23651615
>---------------------------------------------------------------
commit 1d2434c01aa85bb8e6d5f2e1c4897e5a23651615
Author: Matthieu Lefebvre <ml15 at princeton.edu>
Date: Wed Feb 12 09:51:18 2014 -0500
Vicious bug fix. Multi GPU per nodes implied multiple synchronizations of the same device when checking error after the call to deviceCount.
>---------------------------------------------------------------
1d2434c01aa85bb8e6d5f2e1c4897e5a23651615
src/cuda/initialize_cuda.cu | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/src/cuda/initialize_cuda.cu b/src/cuda/initialize_cuda.cu
index ef53f7a..79a7d2d 100644
--- a/src/cuda/initialize_cuda.cu
+++ b/src/cuda/initialize_cuda.cu
@@ -93,9 +93,12 @@ void FC_FUNC_(initialize_cuda_device,
// Gets number of GPU devices
device_count = 0;
cudaGetDeviceCount(&device_count);
-
- // checks if command failed
- exit_on_cuda_error("CUDA runtime error: cudaGetDeviceCount failed\ncheck if driver and runtime libraries work together\nexiting...\n");
+ // Do not check whether the command failed:
+ // `exit_on_cuda_error` calls cudaDeviceSynchronize/cudaThreadSynchronize. If multiple
+ // MPI tasks access multiple GPUs per node, they will all try to synchronize
+ // GPU 0 and, depending on the order of the calls, an error will be raised
+ // when setting the device number. If MPS is enabled, some GPUs will silently
+ // not be used.
// returns device count to fortran
if (device_count == 0) exit_on_error("CUDA runtime error: there is no device supporting CUDA\n");
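For illustration only, and not part of this commit: one possible way to keep an error check on the device query without any synchronization is to inspect the status that cudaGetDeviceCount itself returns. The helper name get_device_count_or_exit below is hypothetical and does not exist in SPECFEM3D; the sketch assumes only the standard CUDA runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not in SPECFEM3D): queries the number of devices and
// checks the status returned by the call itself. No cudaDeviceSynchronize or
// cudaThreadSynchronize is issued, so no device is synchronized here.
static int get_device_count_or_exit(void) {
  int device_count = 0;
  cudaError_t err = cudaGetDeviceCount(&device_count);

  if (err != cudaSuccess) {
    // cudaGetErrorString turns the status code into a readable message
    fprintf(stderr, "CUDA runtime error: cudaGetDeviceCount failed: %s\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
  if (device_count == 0) {
    fprintf(stderr, "CUDA runtime error: there is no device supporting CUDA\n");
    exit(EXIT_FAILURE);
  }
  return device_count;
}

int main(void) {
  int n = get_device_count_or_exit();
  printf("found %d CUDA device(s)\n", n);
  return 0;
}
```

Whether this fits the surrounding code depends on how the rest of initialize_cuda_device selects a device per MPI task, which this sketch does not attempt to reproduce.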