
EulerP CUDA Optimizations

The machine: 2× Intel Xeon Gold 6326 @ 2.90 GHz, 2× NVIDIA A100 80 GB PCIe, no peer-to-peer (P2P) GPU access.

AMD CPU server: 2× AMD EPYC 9754.


Periodic shock box

Test solver: 2nd-order reconstruction + Barth limiter + SSPRK3, CFL = 0.5.

Metric: cell iterations per second (CI/s), i.e. number of cells × time-step iterations per second. Efficiency below is performance divided by total estimated power, e.g. 7.5 MCI/s at 320 W ≈ 23.4 MCI/kJ.

Grid: 1024^2 cells.

| Machine | Performance (CI/s) | Software-estimated power (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 7.5M | 120 (GPU) + 175 (CPU package) + 25 (RAM) = 320 | 23.4 |
| 2 A100 | 14.7M | 120 (GPU) × 2 + 175 (CPU package) + 25 (RAM) = 440 | 33.4 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |

Using 16 OpenMP threads × 2 ranks performs nearly the same as (slightly worse than) 32 ranks.

Profiling results:

  • GPU Python profile: cProfile_GPU
  • CPU Python profile: cProfile_CPU
  • Nsight Systems (nsys) trace

Huge RX traffic (host-to-device memcpy) at 6-8 GB/s. Why?

GPU occupancy (nvtop): ~60-70%!

Problem: unintended to_device() calls when initializing (re-checking) the face buffer.


Fixing extra to_device
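The fix is to keep the face buffer resident on the device and guard against repeat uploads. The solver itself is driven from Python (hence cProfile), but the pattern is the same either way; below is a minimal host-side CUDA sketch, with illustrative names like d_face_buf, not the actual EulerP code:

#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical guard: allocate and upload the face buffer once, then reuse
// the device pointer on every step instead of re-copying it.
static double* d_face_buf = nullptr;

void init_face_buffer(const double* h_face_buf, std::size_t n_faces)
{
    if (d_face_buf == nullptr) {
        cudaMalloc(&d_face_buf, n_faces * sizeof(double));
        // One-time upload; repeating this copy every step was the source
        // of the 6-8 GB/s RX traffic.
        cudaMemcpy(d_face_buf, h_face_buf, n_faces * sizeof(double),
                   cudaMemcpyHostToDevice);
    }
    // Subsequent calls are no-ops: the data already lives on the device.
}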

Fixed:

| Machine | Performance (CI/s) | Software-estimated power (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 22.7M | 195 (GPU) + 165 (CPU package) + 20 (RAM) = 380 | 59.7 |
| 2 A100 | 39.8M | 180 (GPU) × 2 + 170 (CPU package) + 20 (RAM) = 550 | 72.3 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |

GPU occupancy (nvtop): ~90%; RX/TX down to a few MB/s.

2 GPUs vs. 1 GPU: 39.8M / (2 × 22.7M) ≈ 88% strong scaling efficiency.


Optimized RecGradient and RecFace2nd

Primary optimization: local cache.
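The idea: stage the per-cell values a thread needs in registers once, then reuse them for every gradient term instead of re-reading global memory. A minimal sketch with illustrative names and a 4-neighbor structured stencil, not the actual EulerP kernel:

__global__ void rec_gradient_sketch(const double* __restrict__ q,    // cell values
                                    const int*    __restrict__ nbr,  // 4 neighbors per cell
                                    const double* __restrict__ wx,   // x gradient weights
                                    const double* __restrict__ wy,   // y gradient weights
                                    double*       __restrict__ grad, // [2 * n_cells]
                                    int n_cells)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n_cells) return;

    double qc = q[i];              // center value cached in a register once
    double gx = 0.0, gy = 0.0;
    for (int k = 0; k < 4; ++k) {
        double dq = q[nbr[4 * i + k]] - qc;   // qc reused, no repeated load
        gx += wx[k] * dq;
        gy += wy[k] * dq;
    }
    grad[2 * i]     = gx;
    grad[2 * i + 1] = gy;
}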

When optimizing only RecGradient, consider the effect of staging the writes through shared memory ("shared shuffle write") versus writing directly to global memory.

Shared write (to the 3×5 gradient) vs. direct write:

  • total: 36.4 MCI/s vs. 35.8 MCI/s
  • RecGradient, 2677 iterations: 8.79 s vs. 9.99 s

Around a 10% improvement in the kernel itself.
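One plausible reading of the staged write, as a sketch (BLOCK, NCOMP and the kernel body are illustrative, not the actual EulerP code): each thread parks its results in shared memory, then the whole block writes them back so that consecutive threads store to consecutive global addresses. Note that every thread, in-bounds or not, reaches the barrier, which is exactly the pitfall discussed next.

#define BLOCK 128
#define NCOMP 15   // e.g. the 3×5 gradient components per cell

__global__ void staged_write_sketch(double* __restrict__ grad, int n_cells)
{
    __shared__ double stage[BLOCK * NCOMP];
    int tid  = threadIdx.x;
    int base = blockIdx.x * BLOCK;            // first cell owned by this block
    int i    = base + tid;

    double val[NCOMP] = {};                   // zero-filled for OOB threads
    if (i < n_cells) {
        // ... compute this cell's gradient components into val[] ...
    }

    // Stage per-thread results; no early return, so all threads sync.
    for (int c = 0; c < NCOMP; ++c)
        stage[tid * NCOMP + c] = val[c];
    __syncthreads();

    // Coalesced write-back: thread k stores element k, k + BLOCK, ...
    int count = min(BLOCK, n_cells - base);   // cells this block actually owns
    for (int k = tid; k < count * NCOMP; k += blockDim.x)
        grad[(size_t)base * NCOMP + k] = stage[k];
}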

Pitfall

If the buffer-write function uses __shared__ memory plus __syncthreads() internally, you might be tempted to handle out-of-bounds (OOB) threads like this:

int tid_global = blockDim.x * blockIdx.x + threadIdx.x;
if (tid_global >= max)
{
  write_data(dummy_data);  // OOB threads push dummy data, then bail out
  return;
}
do_calculation....
write_data(real_data);     // in-bounds threads write from a second call site

But if write_data is templated or inlined, the two call sites can instantiate separate copies of the __shared__ buffer and of the internal __syncthreads(); threads then wait at different barriers and the buffer contents diverge, which is undefined behavior.

Safe pattern:

int tid_global = blockDim.x * blockIdx.x + threadIdx.x;
t_buffer real_data;         // holds dummy contents for OOB threads
if (tid_global < max)
{
  do_calculation....        // only in-bounds threads fill real_data
}
write_data(real_data);      // single call site: every thread reaches the same barrier

With both RecGradient and RecFace2nd optimized:

| Machine | Performance (CI/s) | Software-estimated power (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 58.1M | 237 (GPU) + 170 (CPU package) + 21 (RAM) = 428 | 136 |
| 2 A100 | 84.5M | 185 (GPU) × 2 + 175 (CPU package) + 21 (RAM) = 566 | 149 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |

Occupancy: 86% (1 GPU) / 71% (2 GPUs)


Write coalescing optimized
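Here "write coalescing" means making consecutive threads store to consecutive addresses. Besides the shared-memory staging above, a common way to get there is switching the output layout from array-of-structs (AoS) to struct-of-arrays (SoA); the sketch below uses illustrative names and is not necessarily how EulerP lays out its buffers. Results with the coalesced writes follow.

__global__ void soa_write_sketch(double* __restrict__ out_soa, int n_cells)
{
    const int NCOMP = 5;                  // e.g. 5 conserved variables
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n_cells) return;

    double v[NCOMP] = {};                 // ... computed values would go here ...

    // AoS store, out_aos[i * NCOMP + c]: threads in a warp are NCOMP
    // doubles apart, costing several memory transactions per store.
    // SoA store, out_soa[c * n_cells + i]: for each component c,
    // consecutive threads hit consecutive doubles -> coalesced.
    for (int c = 0; c < NCOMP; ++c)
        out_soa[(size_t)c * n_cells + i] = v[c];
}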

| Machine | Performance (CI/s) | Software-estimated power (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 72.3M | 245 (GPU) + 170 (CPU package) + 21 (RAM) = 436 | 166 |
| 2 A100 | 98.2M | 195 (GPU) × 2 + 175 (CPU package) + 21 (RAM) = 586 | 168 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |
| 64 CPU cores (AMD) | 10.9M | 384 (CPU package) + 45? (RAM) = 429? | 25.41 |
| 128 CPU cores (AMD) | 17.2M | 524 (CPU package) + 45? (RAM) = 569? | 30.23 |
| 256 CPU cores (AMD) | 19.6M | 570 (CPU package) + 45? (RAM) = 615? | 31.87 |

Occupancy: 83% (1 GPU) / 66% (2 GPUs)

Larger case: 256^3 (3D, more work per cell)

| Machine | Performance (CI/s) | Software-estimated power (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 60.9M | 298 (GPU) + 170 (CPU package) + 21 (RAM) = 489 | 124 |
| 2 A100 | 78.1M | 230 (GPU) × 2 + 178 (CPU package) + 22 (RAM) = 660 | 118 |
| 32 CPU cores | 3.29M | 370 (CPU package) + 45 (RAM) = 415 | 7.93 |

Occupancy: 99% (1 GPU) / 90% (2 GPUs)

by Harry