[LU-15883] Lustre 2.15 GPUDirect Testing fullperf crash Created: 24/May/22 Updated: 03/Jun/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Kulachenko (Inactive) | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
NVIDIA DGX A100 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The kernel crashes after running the fullperf tests. The crash occurs after the following run:

root@a100-01:/usr/local/gds/docker# ./gds_docker.sh -p /lustre/ai400x2/client -v 1.2.0 -c 11.7.0 -m -t fullperf
SKIP DRIVER INSTALL 0
Available space in /lustre/ai400x2/client = 268740285
CONFIG_MOFED_VERSION Found MOFED version 5.6-1.0.3.3
using nvidia driver version 515.43.04 on kernel 5.4.0-109-generic
f87d047a1632feeb1bd51a5544ac541ea91fd58910ce5d358540cc2b7da08fc5
Started container fullperf_135939
check output: docker container logs --follow fullperf_135939

root@a100-01:/usr/local/gds/docker# docker container logs --follow fullperf_135939
UserSpace RDMA Support Ok
logs file in /results/build_7-20220523_2228.log, /results/gds_7-20220523_2228.log
downloading dependencies for nvidia-fs
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease [1581 B]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages [557 kB]
Hit:5 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease
Get:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
...skipping...
Reading package lists... Done
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-compute-repo.list:1
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-compute-repo.list:1
nvidia-fs driver build success
cat: /sys/kernel/mm/memory_peers/nvidia-fs/version: No such file or directory
GDS verfication passed
GDS check passed
Max allowed GPUS: 8
path /data/0 is not a mount point
mount /data/0 not found, creating directory /data/GPU0
mount /data/0 not found, creating directory /data/GPU1
mount /data/0 not found, creating directory /data/GPU2
mount /data/0 not found, creating directory /data/GPU3
mount /data/0 not found, creating directory /data/GPU4
mount /data/0 not found, creating directory /data/GPU5
mount /data/0 not found, creating directory /data/GPU6
mount /data/0 not found, creating directory /data/GPU7
mount path: /data/GPU0 -> GPU device: 0
mount path: /data/GPU1 -> GPU device: 1
mount path: /data/GPU2 -> GPU device: 2
mount path: /data/GPU3 -> GPU device: 3
mount path: /data/GPU4 -> GPU device: 4
mount path: /data/GPU5 -> GPU device: 5
mount path: /data/GPU6 -> GPU device: 6
mount path: /data/GPU7 -> GPU device: 7
populating files:
/usr/local/gds/tools/gdsio -s 4096M -V -I 1 -x 0 -D /data/GPU0/gds -w 128 -d 0 -n 3 -D /data/GPU1/gds -w 128 -d 1 -n 3 -D /data/GPU2/gds -w 128 -d 2 -n 1 -D /data/GPU3/gds -w 128 -d 3 -n 1 -D /data/GPU4/gds -w 128 -d 4 -n 7 -D /data/GPU5/gds -w 128 -d 5 -n 7 -D /data/GPU6/gds -w 128 -d 6 -n 5 -D /data/GPU7/gds -w 128 -d 7 -n 5 -i 1M
Done populating
Running iter 1 for IOTYPE: 0 for XFERTYPE: -x 0 IOSIZE: 4 kb with threads: 128
/usr/local/gds/tools/gdsio -T 45 -s 512M -I 0 -x 0 -D /data/GPU0/gds -w 128 -d 0 -n 3 -D /data/GPU1/gds -w 128 -d 1 -n 3 -D /data/GPU2/gds -w 128 -d 2 -n 1 -D /data/GPU3/gds -w 128 -d 3 -n 1 -D /data/GPU4/gds -w 128 -d 4 -n 7 -D /data/GPU5/gds -w 128 -d 5 -n 7 -D /data/GPU6/gds -w 128 -d 6 -n 5 -D /data/GPU7/gds -w 128 -d 7 -n 5 -i 4k

ddn@a100-01:~$ lctl get_param version
NVIDIA-SMI 515.43.04 Driver Version: 515.43.04 CUDA Version: 11.7
Kernel: 5.4.0-109-generic |
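The last gdsio invocation in the log above is long and easy to mistype when rerunning it by hand outside the container harness. The following is a minimal sketch, not part of the GDS tooling, that rebuilds that same command line from its parts. The helper name `build_gdsio_cmd` and the `GPU_TO_NUMA` table are hypothetical; the flags, their order, the binary path, and the GPU/NUMA pairs are copied from the logged command, and no flag semantics beyond what the log labels (IOTYPE, XFERTYPE, IOSIZE, threads) are assumed.

```python
import shlex

# GPU index -> value passed to -n, copied from the -d/-n pairs in the logged command.
GPU_TO_NUMA = {0: 3, 1: 3, 2: 1, 3: 1, 4: 7, 5: 7, 6: 5, 7: 5}

# Binary path exactly as it appears in the log.
GDSIO = "/usr/local/gds/tools/gdsio"


def build_gdsio_cmd(io_size="4k", file_size="512M", duration=45,
                    io_type=0, xfer_type=0, workers=128,
                    data_root="/data", gpu_to_numa=GPU_TO_NUMA):
    """Return a gdsio argv list in the same flag order as the logged run.

    Hypothetical helper: defaults reproduce the "Running iter 1" command.
    """
    cmd = [GDSIO, "-T", str(duration), "-s", file_size,
           "-I", str(io_type), "-x", str(xfer_type)]
    # One -D/-w/-d/-n group per GPU, in ascending GPU order as in the log.
    for gpu, numa in sorted(gpu_to_numa.items()):
        cmd += ["-D", f"{data_root}/GPU{gpu}/gds",
                "-w", str(workers), "-d", str(gpu), "-n", str(numa)]
    cmd += ["-i", io_size]
    return cmd


if __name__ == "__main__":
    # Print the command for inspection; running it is left to the operator.
    print(shlex.join(build_gdsio_cmd()))
```

Passing a smaller `gpu_to_numa` mapping (for example a single GPU) or a different `io_size` yields a reduced command, which may help narrow down which part of the workload triggers the crash.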
| Comments |
| Comment by Colin Faber [ 24/May/22 ] |
|
Hi ssmirnov, I believe the GPUDirect work was originally handled by ashehata. Can you please take a look and assign this out to someone else on your team? Thank you!
|
| Comment by Serguei Smirnov [ 24/May/22 ] |
|
From the provided traces, it doesn't look to me like any of the Lustre code is having an issue. After checking with Amir, I'd like to request a manual gdsio run:
Do you see the same problem then? Thanks, Serguei. |