LU-15831

Lustre 2.15 client breaks DGXA100 MOFED

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major

    Description

      When trying to run GPUDirect, the following failures were found at the step that installs the required software:

      $ sudo ./mlnxofedinstall
      ...
      Installation passed successfully
      To load the new driver, run:
      /etc/init.d/openibd restart
      
      $ sudo /etc/init.d/openibd restart
      Unloading ib_uverbs [FAILED]
      rmmod: ERROR: Module ib_uverbs is in use by: nv_peer_mem
      
      $ sudo rmmod nv_peer_mem
      
      $ sudo /etc/init.d/openibd restart
      Unloading HCA driver:[ OK ]
      Loading Mellanox MLX5_IB HCA driver:  [FAILED]
      Loading Mellanox MLX5 HCA driver: [FAILED]
      Loading HCA driver and Access Layer:  [FAILED]
      Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information
      and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService
      
      $ sudo modprobe nv_peer_mem
      modprobe: FATAL: Module nv_peer_mem not found in directory /lib/modules/5.4.0-109-generic
      
      $ sudo modprobe lustre
      modprobe: ERROR: could not insert 'lustre': Invalid argument
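The `rmmod` failure above ("Module ib_uverbs is in use by: nv_peer_mem") reflects the kernel's module dependency bookkeeping. As a rough sketch that is not part of the original report, the same information can be read from `/proc/modules`, whose fourth field lists the modules currently using each module. The helper names and the sample text below are illustrative assumptions.

```python
from pathlib import Path


def module_users(proc_modules_text):
    """Map each loaded module to the list of modules that are using it.

    Each /proc/modules line looks like:
        name size refcount users state address
    where `users` is a comma-terminated list, or "-" when empty.
    """
    users = {}
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, used_by = fields[0], fields[3]
        users[name] = [] if used_by == "-" else [u for u in used_by.split(",") if u]
    return users


def read_proc_modules(path="/proc/modules"):
    """Read the live module table (Linux only)."""
    return Path(path).read_text()


# Hypothetical sample, shaped like the situation in the report.
SAMPLE = """\
nv_peer_mem 16384 0 - Live 0x0000000000000000
ib_uverbs 135168 1 nv_peer_mem, Live 0x0000000000000000
"""

if __name__ == "__main__":
    table = module_users(SAMPLE)
    print(table["ib_uverbs"])  # ['nv_peer_mem']
```

On a live system, `module_users(read_proc_modules())` would show which modules have to be unloaded before `openibd restart` can remove `ib_uverbs`.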

       

      $ sudo ./mlnxofedinstall
      
      Checking SW Requirements...Removing old packages...
      Installing new packages
      Installing ofed-scripts-5.5...
      Installing mlnx-tools-5.2.0...
      Installing mlnx-ofed-kernel-utils-5.5...
      Installing mlnx-ofed-kernel-dkms-5.5...Error: mlnx-ofed-kernel-dkms installation failed!
      Problem: mlx5_ib: module file: /lib/modules/5.4.0-105-generic/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko, from package: linux-modules-extra-5.4.0-105-generic.
      Collecting debug info...
      See:
          /tmp/MLNX_OFED_LINUX.1302312.logs/mlnx-ofed-kernel-dkms.debinstall.log
      Removing newly installed packages...
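The two console sessions above name two different kernel releases in their `/lib/modules` paths (5.4.0-109-generic and 5.4.0-105-generic). Whether that difference contributes to the failure is not established in this report; the sketch below only shows one possible way to surface such differences when triaging a log. The function name is a hypothetical helper.

```python
import re

# Matches the kernel release component of a /lib/modules/<release> path.
KERNEL_RELEASE = re.compile(r"/lib/modules/([^/\s]+)")


def kernel_releases(log_text):
    """Return the sorted set of kernel releases named in /lib/modules paths."""
    return sorted(set(KERNEL_RELEASE.findall(log_text)))


# Two lines copied from the console output above.
LOG = """\
modprobe: FATAL: Module nv_peer_mem not found in directory /lib/modules/5.4.0-109-generic
Problem: mlx5_ib: module file: /lib/modules/5.4.0-105-generic/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko, from package: linux-modules-extra-5.4.0-105-generic.
"""

if __name__ == "__main__":
    for release in kernel_releases(LOG):
        print(release)
```

On a live system, `platform.release()` from the standard library returns the running kernel release, which can be compared against the releases found in the log.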

      This prevents the gds test suite from passing; most tests fail because the nvidia-fs driver is not loaded:

      =========================
       Platform verification error :
      nvidia-fs driver is not loadedSUCCESS
      FILESYSTEM VERSION CHECK:
      ofed_info:
      current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
      min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
      SUCCESS
      nvidia-fs driver is not loadedSUCCESS
      usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
      GPUDirectStorage platform checker
      optional arguments:
        -h, --help  show this help message and exit
        -p          gds platform check
        -f FILE     gds file check
        -v          gds version checks
        -V          gds fs checks
      SUCCESS
      gdscheck.py python2 tests
      =========================
       Platform verification error :
      nvidia-fs driver is not loadedSUCCESS
      FILESYSTEM VERSION CHECK:
      ofed_info:
      current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
      min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
      SUCCESS
      nvidia-fs driver is not loadedSUCCESS
      usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
      GPUDirectStorage platform checker
      optional arguments:
        -h, --help  show this help message and exit
        -p          gds platform check
        -f FILE     gds file check
        -v          gds version checks
        -V          gds fs checks
      SUCCESS
      gdscheck.py current running python tests
      =========================
       Platform verification error :
      nvidia-fs driver is not loadedSUCCESS
      FILESYSTEM VERSION CHECK:
      ofed_info:
      current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
      min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
      SUCCESS
      nvidia-fs driver is not loadedSUCCESS
      usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
      GPUDirectStorage platform checker
      optional arguments:
        -h, --help  show this help message and exit
        -p          gds platform check
        -f FILE     gds file check
        -v          gds version checks
        -V          gds fs checks
      SUCCESS
      **************************************************
      gdscheck.py test results : 12 /  12 tests passed
      **************************************************
      Starting basic gdsio Tests
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 4k
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 3k
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 3k -o 1
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 2k
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 2k -o 1
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 1k
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 1k -o 1
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -D /data/sanity/tests// -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -D /data/sanity/tests// -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -D /data/sanity/tests// -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1 -F -R
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1 -b
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 8M:32M -x 0 -I 1 -o 1 -b
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 0 -o 1 -b
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 8M:32M -x 0 -I 0 -o 1 -b
      SUCCESS
      /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 0 -o 0 -b
      SUCCESS
      /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 8M:32M -x 0 -I 0 -o 0 -b
      SUCCESS
      **************************************************
      gdiso tests : 18 /  18 tests passed
      **************************************************
      Starting Offset Tests
      TestCase:Read odd offset
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 616  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 616  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read odd gpu offsets 1, 2, 3, 4, 4K-1, 4K, 4K+1, 60K, 64K, 68K
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 0  -d 0 -t 1 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 0  -d 0 -t 1 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 1 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 2 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 3 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4095 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4096 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4097 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4097 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 61440 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 65536 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 69632 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 1 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 2 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 3 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4095 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4096 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4097 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 61440 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 65536 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 69632 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write odd size - sync
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485761 -o 4096  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write odd size - async
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485761 -o 4096  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:odd offset and odd size - sync
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485748 -o 119  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:odd offset and odd size - async
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485748 -o 119  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write 1 byte from offset 0 (sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 1 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 1 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write 1 byte from offset 3 (sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 1 -o 3  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 1 -o 3  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write big file 10G (odd size) - sync
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 220201060 -o 4096  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write big file 10G (odd size) - async
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 220201060 -o 4096  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond EOF (read 2G on a 1G file - async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714688 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond EOF (read 2G on a 1G file - sync)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714688 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond EOF - odd size (read 2G on a 1G file - sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714689 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714689 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond EOF odd offset (read 2G on a 1G file - sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714688 -o 616  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714688 -o 616  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read size beyond EOF (small file)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1099 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1099 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1099 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1099 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read just short of EOF
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1000 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1000 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1000 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1000 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read offset from EOF
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 999 -o 1024  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 999 -o 1024  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1 -o 1024  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1 -o 1024  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read offset beyond EOF (sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1 -o 1025  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1 -o 1025  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read with odd gpu_offset
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read with odd gpu_offset and odd file offset
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 617  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 617  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read at 128k GPU page offset
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond 64k (odd gpu offset)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Overwrite an existing file within EOF
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 52428800 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 52428800 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 52428805 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 52428805 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Offset beyond EOF writes
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 4096 -o 4099  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 4099  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:offset just short of EOF writes
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 3 -o 4094  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 3 -o 4094  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:offset from EOF writes
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 4096 -o 3  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 3  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      **************************************************
      File offset and GPU Buffer offset Tests : 0 /  73 tests passed
      **************************************************
      running cufile sample tests
      sample 1
      FAILED
      sample 2
      opening file /data/sanity/tests//sparse1G_sample2
      FAILED
      sample 3
      FAILED
      sample 4
      FAILED
      sample 5
      FAILED
      sample 6
      FAILED
      sample 7
      FAILED
      sample 8
      PASS: cufile success status:Success
      SUCCESS
      sample 14
      opening file /data/sanity/tests//sparse1G
      FAILED
      sample 15
      FAILED
      **************************************************
      cufile sample tests : 1 /  10 tests passed
      **************************************************
      Testing gdscp functionality
      /usr/local/gds/tools/gdscp /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_copy 0 -v
      file register error: nvidia-fs driver is not loaded
      FAILED
      **************************************************
      gdscp tests : 0 /  1 tests passed
      **************************************************
      Testing Batch State Machine
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 0 && pass || fail
      FAILED
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 1 && pass || fail
      FAILED
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 2 && pass || fail
      FAILED
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 3 && pass || fail
      FAILED
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 4 && pass || fail
      FAILED
      **************************************************
      Batch State Machine Tests : 0 /  5 tests passed
      **************************************************
      Performing cufile API tests
      /usr/local/gds/tools//api_tests/cufile_testbufregister 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 4
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 5
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 6
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 7
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 8
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 9
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 10
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 4
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 5
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriver 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriver 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriver 2
      cufile driver close: nvidia-fs driver is not loaded
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 1
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 2
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 3
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 4
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 5
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_rw  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_rwmanaged  /data/sanity/tests//sparse1G 0 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_rw_unreg  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_rw_unreg  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_rw_unreg  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 4
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 5
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 6
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 7
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 8
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 9
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -p 8
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -b 8
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -d 8
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -c 8
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -b 1024
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -d 1024
      FAILED
      /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_io_race /data/sanity/tests//sparse1G 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testvalidnvbuf  /data/sanity/tests//sparse1G 0 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testvalidnvbuf  /data/sanity/tests//sparse1G 0 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_io_race /data/sanity/tests//sparse1G 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_invalid_write /data/sanity/tests//sparse1G 0 0
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_invalid_write /data/sanity/tests//sparse1G 0 1
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_invalid_offsets /data/sanity/tests//sparse1G 0 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testcudacontext_switch /data/sanity/tests//sparse_CTX_VERIFY 0
      FAILED
      End: nvidia-fs:
      GDS Version: 1.2.1.4 
      NVFS statistics(ver: 4.0)
      NVFS Driver(version: 2.11.0)
      Mellanox PeerDirect Supported: True
      IO stats: Disabled, peer IO stats: Disabled
      Logging level: info
      Active Shadow-Buffer (MiB): 0
      Active Process: 0
      Reads                : err=0 io_state_err=0
      Sparse Reads                : n=0 io=0 holes=0 pages=0 
      Writes                : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
      Mmap                : n=0 ok=0 err=0 munmap=0
      Bar1-map            : n=0 ok=0 err=0 free=0 callbacks=0 active=0
      Error                : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
      Ops                : Read=0 Write=0 BatchIO=0
      **************************************************
      API Tests, : 10 /  63 tests passed
      **************************************************
      Testsuite : 41 / 182 tests passed
      done tests:Mon May 2 16:48:42 UTC 2022
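The per-section summary lines in the log above can be tallied and cross-checked against the final "Testsuite" line. The following is a minimal sketch, assuming every summary line has the form "<name> : <passed> / <total> tests passed" as in this log; the function names are hypothetical, and the sample lines are copied from the output above.

```python
import re

# "<section name> : <passed> / <total> tests passed"
SUMMARY = re.compile(r"^\s*(.+?)\s*:\s*(\d+)\s*/\s*(\d+)\s+tests passed\s*$")


def parse_summaries(log_text):
    """Return [(section, passed, total), ...] in the order they appear."""
    rows = []
    for line in log_text.splitlines():
        m = SUMMARY.match(line)
        if m:
            # Some section names carry a stray trailing comma ("API Tests,").
            rows.append((m.group(1).rstrip(","), int(m.group(2)), int(m.group(3))))
    return rows


def check_totals(rows, overall_name="Testsuite"):
    """Compare the sum of per-section counts with the overall summary line."""
    sections = [r for r in rows if r[0] != overall_name]
    overall = [r for r in rows if r[0] == overall_name]
    passed = sum(r[1] for r in sections)
    total = sum(r[2] for r in sections)
    matches = bool(overall) and (passed, total) == overall[0][1:]
    return passed, total, matches


LOG = """\
gdscheck.py test results : 12 /  12 tests passed
gdiso tests : 18 /  18 tests passed
File offset and GPU Buffer offset Tests : 0 /  73 tests passed
cufile sample tests : 1 /  10 tests passed
gdscp tests : 0 /  1 tests passed
Batch State Machine Tests : 0 /  5 tests passed
API Tests, : 10 /  63 tests passed
Testsuite : 41 / 182 tests passed
"""

if __name__ == "__main__":
    rows = parse_summaries(LOG)
    for name, passed, total in rows:
        print(f"{name}: {passed}/{total}")
    print(check_totals(rows))
```

For this log the section counts add up to the reported 41 of 182. The sections that pass completely are the gdscheck.py and gdsio groups, while the offset, gdscp and batch state machine groups pass no tests at all.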
      

      It seems that the patch https://review.whamcloud.com/#/c/45327/ needs to be applied to master.

      The result is the same with both MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64 and MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64.

       

       


          Activity

            [LU-15831] Lustre 2.15 client breaks DGXA100 MOFED
            pjones Peter Jones made changes -
            Fix Version/s Original: Lustre 2.16.0 [ 15190 ]
            okulachenko Oleg Kulachenko (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]

            okulachenko Oleg Kulachenko (Inactive) added a comment:

            Updating CUDA to the latest version fixed it.
            cfaber Colin Faber made changes -
            Priority Original: Blocker [ 1 ] New: Major [ 3 ]
            cfaber Colin Faber made changes -
            Fix Version/s New: Lustre 2.16.0 [ 15190 ]
            Fix Version/s Original: Lustre 2.15.0 [ 14791 ]

            okulachenko Oleg Kulachenko (Inactive) added a comment:

            I'm trying to run the tests and now get this error:

            /usr/local/cuda-11.5/gds/tools/gdscheck.py -p
             cuInit Failed, error CUDA_ERROR_SYSTEM_NOT_READY
             cuFile initialization failed
             Platform verification error :
            CUDA Driver API error

            But this is a GDS tools problem, not a Lustre problem.

            gtapase Gaurang Tapase added a comment:

            I just built the master branch of fs/lustre-release on a DGX A100 and could load the lustre module as well.

            root@a100-01:/home/ddn/gtapase/exa-client# dpkg -l | grep lustre
            rc  lustre-client-modules-5.14.0-1032-oem             2.14.0-ddn39-11-g767352e-1              amd64        Lustre Linux kernel module (kernel 5.14.0-1032-oem)
            ii  lustre-client-modules-5.4.0-109-generic           2.15.50-13-gc524079-dirty-1             amd64        Lustre Linux kernel module (kernel 5.4.0-109-generic)
            rc  lustre-client-modules-5.4.0-96-generic            2.15.0-RC3-2-g7905359-1                 amd64        Lustre Linux kernel module (kernel 5.4.0-96-generic)
            ii  lustre-client-utils                               2.15.50-13-gc524079-dirty-1             amd64        Userspace utilities for the Lustre filesystem (client)
            ii  lustre-dev                                        2.15.50-13-gc524079-dirty-1             amd64        Development files for the Lustre filesystem
            rc  lustre-source                                     2.15.0-RC3-2-g7905359-1                 all          source for Lustre filesystem client kernel modules

            root@a100-01:/home/ddn/gtapase/exa-client# lsmod | grep lustre
            lustre               1007616  0
            lmv                   212992  1 lustre
            mdc                   274432  1 lustre
            lov                   331776  2 mdc,lustre
            ptlrpc               1355776  7 fld,osc,fid,lov,mdc,lmv,lustre
            obdclass             3297280  8 fld,osc,fid,ptlrpc,lov,mdc,lmv,lustre
            lnet                  659456  6 osc,ko2iblnd,obdclass,ptlrpc,lmv,lustre
            libcfs                245760  11 fld,lnet,osc,fid,ko2iblnd,obdclass,ptlrpc,lov,mdc,lmv,lustre
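A module listing like the one in this comment can also be checked by script rather than by eye. The following is a rough sketch in Python, assuming the usual `lsmod` column layout (name, size, use count, comma-separated users); `parse_lsmod`, `missing_modules`, and the list of required modules are hypothetical names chosen for illustration.

```python
def parse_lsmod(text):
    """Parse lsmod-style rows into {module: (size, use_count, [users])}."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header row.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        name, size, use_count = fields[0], int(fields[1]), int(fields[2])
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[name] = (size, use_count, users)
    return modules


def missing_modules(text, required):
    """Return the names in `required` that have no row of their own in `text`."""
    loaded = parse_lsmod(text)
    return [name for name in required if name not in loaded]


# Sample copied from the `lsmod | grep lustre` output in this ticket.
SAMPLE = """\
lustre               1007616  0
lmv                   212992  1 lustre
mdc                   274432  1 lustre
lov                   331776  2 mdc,lustre
ptlrpc               1355776  7 fld,osc,fid,lov,mdc,lmv,lustre
obdclass             3297280  8 fld,osc,fid,ptlrpc,lov,mdc,lmv,lustre
lnet                  659456  6 osc,ko2iblnd,obdclass,ptlrpc,lmv,lustre
libcfs                245760  11 fld,lnet,osc,fid,ko2iblnd,obdclass,ptlrpc,lov,mdc,lmv,lustre
"""

if __name__ == "__main__":
    # The required list is an illustrative assumption, not an official set.
    print(missing_modules(SAMPLE, ["lustre", "lnet", "libcfs"]))
```

Because the sample was filtered with `grep lustre`, modules such as `ko2iblnd` appear only in the "used by" column and have no row of their own, so a real check should be fed the unfiltered `lsmod` output.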

            gtapase Gaurang Tapase added a comment:

            I don't have access to a DGX A100 system, but I tried compiling Lustre 2.15 on an Ubuntu 20.04 system with MOFED 5.5 installed and it worked fine. I could load the lustre module as well.

            Is this something specific to DGX A100?

            adilger Andreas Dilger added a comment:

            Gaurang, any update on this issue? This is one of the few remaining issues before the 2.15.0 release.
            mdiep Minh Diep made changes -
            Assignee Original: Minh Diep [ mdiep ] New: Gaurang Tapase [ gtapase ]

            People

              gtapase Gaurang Tapase
              okulachenko Oleg Kulachenko (Inactive)