LU-15883: Lustre 2.15 GPUDirect Testing fullperf crash

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.0
    • Labels: None
    • Environment: NVIDIA DGX A100
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      A kernel crash occurs after running the fullperf tests.

      The crash occurs after the following sequence of commands and output:

      root@a100-01:/usr/local/gds/docker# ./gds_docker.sh -p /lustre/ai400x2/client -v 1.2.0 -c 11.7.0 -m -t fullperf
      SKIP DRIVER INSTALL 0
      Available space in /lustre/ai400x2/client = 268740285
      CONFIG_MOFED_VERSION
      Found MOFED version 5.6-1.0.3.3
      using nvidia driver version 515.43.04 on kernel 5.4.0-109-generic
      f87d047a1632feeb1bd51a5544ac541ea91fd58910ce5d358540cc2b7da08fc5
      Started container fullperf_135939
      check output: docker container logs --follow fullperf_135939
      root@a100-01:/usr/local/gds/docker# docker container logs --follow fullperf_135939
      UserSpace RDMA Support Ok
      logs file in /results/build_7-20220523_2228.log, /results/gds_7-20220523_2228.log
      downloading dependencies for nvidia-fs
      Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease [1581 B]
      Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
      Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
      Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages [557 kB]
      Hit:5 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease
      Get:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
      ...skipping...
      Reading package lists... Done
      W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-compute-repo.list:1
      W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-compute-repo.list:1
      nvidia-fs driver build success
      cat: /sys/kernel/mm/memory_peers/nvidia-fs/version: No such file or directory
      GDS verfication passed
      GDS check passed
      Max allowed GPUS: 8
      path /data/0 is not a mount point
      mount /data/0 not found, creating directory /data/GPU0
      mount /data/0 not found, creating directory /data/GPU1
      mount /data/0 not found, creating directory /data/GPU2
      mount /data/0 not found, creating directory /data/GPU3
      mount /data/0 not found, creating directory /data/GPU4
      mount /data/0 not found, creating directory /data/GPU5
      mount /data/0 not found, creating directory /data/GPU6
      mount /data/0 not found, creating directory /data/GPU7
      mount path:  /data/GPU0  -> GPU device: 0
      mount path:  /data/GPU1  -> GPU device: 1
      mount path:  /data/GPU2  -> GPU device: 2
      mount path:  /data/GPU3  -> GPU device: 3
      mount path:  /data/GPU4  -> GPU device: 4
      mount path:  /data/GPU5  -> GPU device: 5
      mount path:  /data/GPU6  -> GPU device: 6
      mount path:  /data/GPU7  -> GPU device: 7
      populating files:
      /usr/local/gds/tools/gdsio -s 4096M -V -I 1 -x 0 -D /data/GPU0/gds -w 128 -d 0 -n 3 -D /data/GPU1/gds -w 128 -d 1 -n 3 -D /data/GPU2/gds -w 128 -d 2 -n 1 -D /data/GPU3/gds -w 128 -d 3 -n 1 -D /data/GPU4/gds -w 128 -d 4 -n 7 -D /data/GPU5/gds -w 128 -d 5 -n 7 -D /data/GPU6/gds -w 128 -d 6 -n 5 -D /data/GPU7/gds -w 128 -d 7 -n 5 -i 1M
      Done populating
      Running iter 1 for IOTYPE: 0 for XFERTYPE: -x 0 IOSIZE: 4 kb with threads: 128
      /usr/local/gds/tools/gdsio -T 45 -s 512M -I 0 -x 0 -D /data/GPU0/gds -w 128 -d 0 -n 3 -D /data/GPU1/gds -w 128 -d 1 -n 3 -D /data/GPU2/gds -w 128 -d 2 -n 1 -D /data/GPU3/gds -w 128 -d 3 -n 1 -D /data/GPU4/gds -w 128 -d 4 -n 7 -D /data/GPU5/gds -w 128 -d 5 -n 7 -D /data/GPU6/gds -w 128 -d 6 -n 5 -D /data/GPU7/gds -w 128 -d 7 -n 5 -i 4k
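
      The last command in the log above is the 4k iteration with 128 threads per GPU directory. To re-run only that step, the command line can be rebuilt from the GPU-to-NUMA pairs visible in its -d/-n arguments. The following is a minimal sketch, not part of the gds_docker.sh tooling: the helper name build_gdsio_args and the GPU_TO_NUMA table are hypothetical, and every flag value is copied from the logged command rather than taken from the gdsio documentation.

```python
# GPU index -> NUMA node, copied from the -d/-n pairs in the logged command.
GPU_TO_NUMA = {0: 3, 1: 3, 2: 1, 3: 1, 4: 7, 5: 7, 6: 5, 7: 5}


def build_gdsio_args(gpu_to_numa, io_size="4k", threads=128,
                     gdsio="/usr/local/gds/tools/gdsio"):
    """Rebuild the argument list of the logged gdsio invocation.

    Hypothetical helper: the global options (-T 45 -s 512M -I 0 -x 0)
    are reproduced verbatim from the log, not derived from any config.
    """
    args = [gdsio, "-T", "45", "-s", "512M", "-I", "0", "-x", "0"]
    for gpu, numa in sorted(gpu_to_numa.items()):
        # One "-D <dir> -w <threads> -d <gpu> -n <numa>" group per GPU.
        args += ["-D", f"/data/GPU{gpu}/gds",
                 "-w", str(threads),
                 "-d", str(gpu),
                 "-n", str(numa)]
    args += ["-i", io_size]
    return args


# Join with spaces for display; no token here needs shell quoting.
command = " ".join(build_gdsio_args(GPU_TO_NUMA))
print(command)
```

      Passing a smaller mapping, for example a single GPU, or a lower thread count may help narrow down whether the crash depends on the number of GPUs or threads involved; that is a suggestion for triage, not something the log establishes.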

      ddn@a100-01:~$ lctl get_param version
      version=2.15.50_13_gc524079_dirty
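
      The reported version string does not look like a tagged 2.15.0 release, which is the version listed under Details. Assuming it follows a git-describe-like layout with underscores as separators (base version, commits since the tag, abbreviated commit hash, optional dirty marker), it can be split as sketched below. The function name parse_lustre_version and the field names are hypothetical, and the layout itself is an assumption rather than something stated in this report.

```python
import re


def parse_lustre_version(line):
    """Split 'version=<base>[_<n>_g<hash>][_dirty]' into its parts.

    Assumes a git-describe-like layout; raises ValueError otherwise.
    """
    match = re.fullmatch(
        r"version=(?P<base>\d+(?:\.\d+)*)"
        r"(?:_(?P<commits>\d+)_g(?P<commit>[0-9a-f]+))?"
        r"(?P<dirty>_dirty)?",
        line.strip(),
    )
    if match is None:
        raise ValueError(f"unrecognized version line: {line!r}")
    return {
        "base": match["base"],
        # Commits on top of the base tag; 0 when the suffix is absent.
        "commits_since_tag": int(match["commits"]) if match["commits"] else 0,
        "commit": match["commit"],
        # True when the build tree had uncommitted local changes.
        "dirty": match["dirty"] is not None,
    }


info = parse_lustre_version("version=2.15.50_13_gc524079_dirty")
print(info)
```

      Under that reading, the client was built from a tree 13 commits past a 2.15.50 base at commit c524079, with local modifications, so the exact source may be needed to reproduce the crash.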

      NVIDIA-SMI 515.43.04
      Driver Version: 515.43.04
      CUDA Version: 11.7
      Kernel: 5.4.0-109-generic

          People

            ssmirnov Serguei Smirnov
            okulachenko Oleg Kulachenko (Inactive)