Details
-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
Lustre 2.15.0
-
None
-
NVIDIA DGX A100
-
3
Description
The kernel crashes after running the fullperf tests.
The crash occurs after the following commands and output:
root@a100-01:/usr/local/gds/docker# ./gds_docker.sh -p /lustre/ai400x2/client -v 1.2.0 -c 11.7.0 -m -t fullperf
SKIP DRIVER INSTALL 0
Available space in /lustre/ai400x2/client = 268740285
CONFIG_MOFED_VERSION Found MOFED version 5.6-1.0.3.3
using nvidia driver version 515.43.04 on kernel 5.4.0-109-generic
f87d047a1632feeb1bd51a5544ac541ea91fd58910ce5d358540cc2b7da08fc5
Started container fullperf_135939
check output: docker container logs --follow fullperf_135939
root@a100-01:/usr/local/gds/docker# docker container logs --follow fullperf_135939
UserSpace RDMA Support Ok
logs file in /results/build_7-20220523_2228.log, /results/gds_7-20220523_2228.log
downloading dependencies for nvidia-fs
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease [1581 B]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages [557 kB]
Hit:5 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease
Get:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
...skipping...
Reading package lists... Done
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-compute-repo.list:1
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-compute-repo.list:1
nvidia-fs driver build success
cat: /sys/kernel/mm/memory_peers/nvidia-fs/version: No such file or directory
GDS verfication passed
GDS check passed
Max allowed GPUS: 8
path /data/0 is not a mount point
mount /data/0 not found, creating directory /data/GPU0
mount /data/0 not found, creating directory /data/GPU1
mount /data/0 not found, creating directory /data/GPU2
mount /data/0 not found, creating directory /data/GPU3
mount /data/0 not found, creating directory /data/GPU4
mount /data/0 not found, creating directory /data/GPU5
mount /data/0 not found, creating directory /data/GPU6
mount /data/0 not found, creating directory /data/GPU7
mount path: /data/GPU0 -> GPU device: 0
mount path: /data/GPU1 -> GPU device: 1
mount path: /data/GPU2 -> GPU device: 2
mount path: /data/GPU3 -> GPU device: 3
mount path: /data/GPU4 -> GPU device: 4
mount path: /data/GPU5 -> GPU device: 5
mount path: /data/GPU6 -> GPU device: 6
mount path: /data/GPU7 -> GPU device: 7
populating files:
/usr/local/gds/tools/gdsio -s 4096M -V -I 1 -x 0 -D /data/GPU0/gds -w 128 -d 0 -n 3 -D /data/GPU1/gds -w 128 -d 1 -n 3 -D /data/GPU2/gds -w 128 -d 2 -n 1 -D /data/GPU3/gds -w 128 -d 3 -n 1 -D /data/GPU4/gds -w 128 -d 4 -n 7 -D /data/GPU5/gds -w 128 -d 5 -n 7 -D /data/GPU6/gds -w 128 -d 6 -n 5 -D /data/GPU7/gds -w 128 -d 7 -n 5 -i 1M
Done populating
Running iter 1 for IOTYPE: 0 for XFERTYPE: -x 0 IOSIZE: 4 kb with threads: 128
/usr/local/gds/tools/gdsio -T 45 -s 512M -I 0 -x 0 -D /data/GPU0/gds -w 128 -d 0 -n 3 -D /data/GPU1/gds -w 128 -d 1 -n 3 -D /data/GPU2/gds -w 128 -d 2 -n 1 -D /data/GPU3/gds -w 128 -d 3 -n 1 -D /data/GPU4/gds -w 128 -d 4 -n 7 -D /data/GPU5/gds -w 128 -d 5 -n 7 -D /data/GPU6/gds -w 128 -d 6 -n 5 -D /data/GPU7/gds -w 128 -d 7 -n 5 -i 4k
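For anyone trying to reproduce the crash outside the container, the last gdsio command in the log above can be rebuilt with a short script. This is only a sketch: the GPU-to-NUMA pairs, paths, and flag values are copied from the log, while the function name `build_gdsio_cmd` and its parameters are hypothetical and not part of gds_docker.sh. The flag meanings given in the comments are assumptions inferred from how the log uses them.

```python
import shlex

# GPU index -> NUMA node, copied from the "-d <gpu> -n <node>" pairs in the log.
GPU_TO_NUMA = {0: 3, 1: 3, 2: 1, 3: 1, 4: 7, 5: 7, 6: 5, 7: 5}


def build_gdsio_cmd(io_size, file_size="512M", duration=45, io_type=0,
                    xfer_type=0, threads=128,
                    gdsio="/usr/local/gds/tools/gdsio"):
    """Return the gdsio argument list for one fullperf-style iteration.

    Hypothetical helper. Flags are reproduced as they appear in the log:
    -T (assumed run time in seconds), -s (file size), -I (IOTYPE),
    -x (XFERTYPE), -D/-w/-d/-n (directory, threads, GPU, NUMA node), -i (IO size).
    """
    cmd = [gdsio, "-T", str(duration), "-s", file_size,
           "-I", str(io_type), "-x", str(xfer_type)]
    for gpu, numa in sorted(GPU_TO_NUMA.items()):
        cmd += ["-D", f"/data/GPU{gpu}/gds", "-w", str(threads),
                "-d", str(gpu), "-n", str(numa)]
    cmd += ["-i", io_size]
    return cmd


cmd = build_gdsio_cmd("4k")
# Print a copy-pasteable command line; nothing is executed here.
print(shlex.join(cmd))
```

Calling `build_gdsio_cmd("4k")` yields the same argument sequence as the "Running iter 1" command, so the IO size or thread count can be varied to narrow down which combination triggers the crash.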
ddn@a100-01:~$ lctl get_param version
version=2.15.50_13_gc524079_dirty
NVIDIA-SMI 515.43.04
Driver Version: 515.43.04
CUDA Version: 11.7
Kernel: 5.4.0-109-generic
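The version string reported by lctl can be split into its parts to make clear which build the client is running. The sketch below assumes the string follows a git-describe-like layout (release, commits past the tag, abbreviated commit hash, optional dirty marker); that layout is an assumption based on the value shown above, and `parse_lustre_version` is a hypothetical helper name.

```python
import re

# Assumed layout: <release>[_<commits>_g<sha>][_dirty]
VERSION_RE = re.compile(
    r"^(?P<release>\d+\.\d+\.\d+)"
    r"(?:_(?P<commits>\d+)_g(?P<sha>[0-9a-f]+))?"
    r"(?P<dirty>_dirty)?$"
)


def parse_lustre_version(text):
    """Parse a 'version=...' line as printed by 'lctl get_param version'."""
    value = text.strip().split("=", 1)[-1]
    match = VERSION_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized version string: {value!r}")
    return {
        "release": match["release"],
        "commits_past_tag": int(match["commits"] or 0),
        "sha": match["sha"],
        "dirty": match["dirty"] is not None,
    }


info = parse_lustre_version("version=2.15.50_13_gc524079_dirty")
print(info)
```

Under that assumption, the client is a locally modified development build based on 2.15.50 rather than a tagged release, which may matter when comparing against the Lustre 2.15.0 listed in the ticket details.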