[LU-15883] Lustre 2.15 GPUDirect Testing fullperf crash Created: 24/May/22  Updated: 03/Jun/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Oleg Kulachenko (Inactive) Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None
Environment:

NVIDIA DGX A100


Attachments: PNG File Remote KVM [10.36.11.67] - [800 x 600 ] 2022-05-24 09-35-48.png     PNG File Remote KVM [10.36.11.67] - [800 x 600 ] 2022-05-24 09-38-15.png    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A kernel crash occurs while running the GPUDirect Storage fullperf tests.

The crash happens after the following run:

root@a100-01:/usr/local/gds/docker# ./gds_docker.sh -p /lustre/ai400x2/client -v 1.2.0 -c 11.7.0 -m -t fullperf
SKIP DRIVER INSTALL 0
Available space in /lustre/ai400x2/client = 268740285
CONFIG_MOFED_VERSION
Found MOFED version 5.6-1.0.3.3
using nvidia driver version 515.43.04 on kernel 5.4.0-109-generic
f87d047a1632feeb1bd51a5544ac541ea91fd58910ce5d358540cc2b7da08fc5
Started container fullperf_135939
check output: docker container logs --follow fullperf_135939
root@a100-01:/usr/local/gds/docker# docker container logs --follow fullperf_135939
UserSpace RDMA Support Ok
logs file in /results/build_7-20220523_2228.log, /results/gds_7-20220523_2228.log
downloading dependencies for nvidia-fs
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease [1581 B]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages [557 kB]
Hit:5 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease
Get:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
...skipping...
Reading package lists... Done
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-compute-repo.list:1
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-compute-repo.list:1
nvidia-fs driver build success
cat: /sys/kernel/mm/memory_peers/nvidia-fs/version: No such file or directory
GDS verfication passed
GDS check passed
Max allowed GPUS: 8
path /data/0 is not a mount point
mount /data/0 not found, creating directory /data/GPU0
mount /data/0 not found, creating directory /data/GPU1
mount /data/0 not found, creating directory /data/GPU2
mount /data/0 not found, creating directory /data/GPU3
mount /data/0 not found, creating directory /data/GPU4
mount /data/0 not found, creating directory /data/GPU5
mount /data/0 not found, creating directory /data/GPU6
mount /data/0 not found, creating directory /data/GPU7
mount path:  /data/GPU0  -> GPU device: 0
mount path:  /data/GPU1  -> GPU device: 1
mount path:  /data/GPU2  -> GPU device: 2
mount path:  /data/GPU3  -> GPU device: 3
mount path:  /data/GPU4  -> GPU device: 4
mount path:  /data/GPU5  -> GPU device: 5
mount path:  /data/GPU6  -> GPU device: 6
mount path:  /data/GPU7  -> GPU device: 7
populating files:
/usr/local/gds/tools/gdsio -s 4096M -V -I 1 -x 0 -D /data/GPU0/gds -w 128 -d 0 -n 3 -D /data/GPU1/gds -w 128 -d 1 -n 3 -D /data/GPU2/gds -w 128 -d 2 -n 1 -D /data/GPU3/gds -w 128 -d 3 -n 1 -D /data/GPU4/gds -w 128 -d 4 -n 7 -D /data/GPU5/gds -w 128 -d 5 -n 7 -D /data/GPU6/gds -w 128 -d 6 -n 5 -D /data/GPU7/gds -w 128 -d 7 -n 5 -i 1M
Done populating
Running iter 1 for IOTYPE: 0 for XFERTYPE: -x 0 IOSIZE: 4 kb with threads: 128
/usr/local/gds/tools/gdsio -T 45 -s 512M -I 0 -x 0 -D /data/GPU0/gds -w 128 -d 0 -n 3 -D /data/GPU1/gds -w 128 -d 1 -n 3 -D /data/GPU2/gds -w 128 -d 2 -n 1 -D /data/GPU3/gds -w 128 -d 3 -n 1 -D /data/GPU4/gds -w 128 -d 4 -n 7 -D /data/GPU5/gds -w 128 -d 5 -n 7 -D /data/GPU6/gds -w 128 -d 6 -n 5 -D /data/GPU7/gds -w 128 -d 7 -n 5 -i 4k

ddn@a100-01:~$ lctl get_param version
version=2.15.50_13_gc524079_dirty

NVIDIA-SMI 515.43.04    

Driver Version: 515.43.04    

CUDA Version: 11.7

Kernel: 5.4.0-109-generic



 Comments   
Comment by Colin Faber [ 24/May/22 ]

Hi ssmirnov 

I believe the GPUDirect work was originally handled by ashehata. Can you please take a look and assign it to someone on your team?

Thank you!

 

Comment by Serguei Smirnov [ 24/May/22 ]

From the provided traces, it doesn't look to me like any of the Lustre code is at fault.

After checking with Amir, I'd like to request a manual gdsio run:

  • Ensure that the GDS driver is installed.
    • You can verify with: lsmod | grep nvidia_fs
  • Verify that nvidia_fs is working properly by using gdscheck -p.
  • Mount Lustre.
  • Tune the Lustre clients.
  • Run a quick gdsio test to ensure it works properly, for example:

     ./gdsio -f /mnt/ai400/test -d 4 -n 0 -w 8 -s 1G -i 4M -x 0 -I 0
     IoType: READ XferType: GPUD Threads: 8 DataSetSize: 809500672/1073741824 IOSize: 4096(KB), Throughput: 1.209208 GB/sec, Avg_Latency: 24341.183046 usecs ops: 193 total_time 623471.000000 usecs
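For reference, the manual verification steps above can be sketched as a small shell script. This is only a sketch: the GDS_TOOLS path is an assumption (gdsio/gdscheck typically live under the CUDA GDS tools directory), and the test file path /mnt/ai400/test plus the gdsio flags are taken from the example command above; adjust both for the local install.

```shell
#!/bin/sh
# Sketch of the manual GDS verification steps. GDS_TOOLS is an assumption;
# override it to point at wherever gdscheck/gdsio are installed.
GDS_TOOLS=${GDS_TOOLS:-/usr/local/cuda/gds/tools}

# Step 1: check whether the GDS driver (nvidia_fs) is loaded.
if lsmod 2>/dev/null | grep -q '^nvidia_fs'; then
    status=loaded
else
    status=missing
fi
echo "nvidia_fs driver: $status"

if [ "$status" = loaded ]; then
    # Step 2: verify that nvidia_fs is working end to end.
    "$GDS_TOOLS/gdscheck" -p

    # Steps 3-4: mount and tune the Lustre client (site-specific, not shown).

    # Step 5: quick sanity run matching the example above -- GPU 4, 8 threads,
    # a 1 GiB file, 4 MiB IOs, GPUDirect transfer (-x 0), read test (-I 0).
    "$GDS_TOOLS/gdsio" -f /mnt/ai400/test -d 4 -n 0 -w 8 -s 1G -i 4M -x 0 -I 0
fi
```

If the quick gdsio run completes without crashing the kernel, the problem is more likely in the fullperf harness or the multi-GPU load than in the basic GDS path.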

Do you see the same problem then?

Thanks,

Serguei.

Generated at Sat Feb 10 03:22:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.