  Lustre / LU-15831

Lustre 2.15 client breaks DGXA100 MOFED

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major

    Description

      When trying to run GPUDirect, the following failure was hit at the "install required software" step:

      $ sudo ./mlnxofedinstall
      ...
      Installation passed successfully
      To load the new driver, run:
      /etc/init.d/openibd restart
      
      $ sudo /etc/init.d/openibd restart
      Unloading ib_uverbs [FAILED]
      rmmod: ERROR: Module ib_uverbs is in use by: nv_peer_mem
      
      $ sudo rmmod nv_peer_mem
      
      $ sudo /etc/init.d/openibd restart
      Unloading HCA driver:[ OK ]
      Loading Mellanox MLX5_IB HCA driver:  [FAILED]
      Loading Mellanox MLX5 HCA driver: [FAILED]
      Loading HCA driver and Access Layer:  [FAILED]
      Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information
      and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService
      
      $ sudo modprobe nv_peer_mem
      modprobe: FATAL: Module nv_peer_mem not found in directory /lib/modules/5.4.0-109-generic
      
      $ sudo modprobe lustre
      modprobe: ERROR: could not insert 'lustre': Invalid argument
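
      A modprobe failure with "Invalid argument" usually means the module was built against a different kernel or OFED stack than the one currently loaded, and the kernel log records the exact symbol or vermagic mismatch. A minimal diagnostic sketch (generic commands, not part of the original report):

      $ dmesg | tail -n 20                          # look for "Unknown symbol" or "disagrees about version of symbol"
      $ modinfo lustre | grep -E 'vermagic|depends'
      $ modinfo ko2iblnd | grep vermagic            # the o2ib LND must match the installed OFED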

       

      $ sudo ./mlnxofedinstall
      
      Checking SW Requirements...Removing old packages...
      Installing new packages
      Installing ofed-scripts-5.5...
      Installing mlnx-tools-5.2.0...
      Installing mlnx-ofed-kernel-utils-5.5...
      Installing mlnx-ofed-kernel-dkms-5.5...Error: mlnx-ofed-kernel-dkms installation failed!
      Problem: mlx5_ib: module file: /lib/modules/5.4.0-105-generic/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko, from package: linux-modules-extra-5.4.0-105-generic.
      Collecting debug info...
      See:
          /tmp/MLNX_OFED_LINUX.1302312.logs/mlnx-ofed-kernel-dkms.debinstall.log
      Removing newly installed packages...
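
      The dkms build fails because the stock mlx5_ib.ko shipped in linux-modules-extra conflicts with the module MOFED wants to install. As a sketch of a possible workaround (the --add-kernel-support path is what ends up being used later in this ticket), one could check which package owns the conflicting file and rebuild MOFED against the running kernel:

      $ dpkg -S /lib/modules/5.4.0-105-generic/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
      $ sudo ./mlnxofedinstall --add-kernel-support --distro ubuntu20.04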

      This prevents the GDS tests from completing successfully:

      =========================
       Platform verification error :
      nvidia-fs driver is not loaded
      SUCCESS
      FILESYSTEM VERSION CHECK:
      ofed_info:
      current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
      min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
      SUCCESS
      nvidia-fs driver is not loaded
      SUCCESS
      usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
      GPUDirectStorage platform checker
      optional arguments:
        -h, --help  show this help message and exit
        -p          gds platform check
        -f FILE     gds file check
        -v          gds version checks
        -V          gds fs checks
      SUCCESS
      gdscheck.py python2 tests
      =========================
       Platform verification error :
      nvidia-fs driver is not loaded
      SUCCESS
      FILESYSTEM VERSION CHECK:
      ofed_info:
      current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
      min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
      SUCCESS
      nvidia-fs driver is not loaded
      SUCCESS
      usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
      GPUDirectStorage platform checker
      optional arguments:
        -h, --help  show this help message and exit
        -p          gds platform check
        -f FILE     gds file check
        -v          gds version checks
        -V          gds fs checks
      SUCCESS
      gdscheck.py current running python tests
      =========================
       Platform verification error :
      nvidia-fs driver is not loaded
      SUCCESS
      FILESYSTEM VERSION CHECK:
      ofed_info:
      current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
      min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
      SUCCESS
      nvidia-fs driver is not loaded
      SUCCESS
      usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
      GPUDirectStorage platform checker
      optional arguments:
        -h, --help  show this help message and exit
        -p          gds platform check
        -f FILE     gds file check
        -v          gds version checks
        -V          gds fs checks
      SUCCESS
      **************************************************
      gdscheck.py test results : 12 /  12 tests passed
      **************************************************
      Starting basic gdsio Tests
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 4k
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 3k
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 3k -o 1
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 2k
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 2k -o 1
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 1k
      SUCCESS
      /usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 1k -o 1
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -D /data/sanity/tests// -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -D /data/sanity/tests// -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -D /data/sanity/tests// -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1 -F -R
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1 -b
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 8M:32M -x 0 -I 1 -o 1 -b
      Verifying data 
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 0 -o 1 -b
      SUCCESS
      /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 8M:32M -x 0 -I 0 -o 1 -b
      SUCCESS
      /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 0 -o 0 -b
      SUCCESS
      /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 8M:32M -x 0 -I 0 -o 0 -b
      SUCCESS
      **************************************************
      gdiso tests : 18 /  18 tests passed
      **************************************************
      Starting Offset Tests
      TestCase:Read odd offset
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 616  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 616  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read odd gpu offsets 1, 2, 3, 4, 4K-1, 4K, 4K+1, 60K, 64K, 68K
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 0  -d 0 -t 1 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 0  -d 0 -t 1 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 1 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 2 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 3 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4095 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4096 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4097 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4097 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 61440 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 65536 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 69632 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 1 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 2 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 3 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4095 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4096 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4097 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 61440 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 65536 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 69632 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write odd size - sync
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485761 -o 4096  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write odd size - async
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485761 -o 4096  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:odd offset and odd size - sync
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485748 -o 119  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:odd offset and odd size - async
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485748 -o 119  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write 1 byte from offset 0 (sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 1 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 1 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write 1 byte from offset 3 (sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 1 -o 3  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 1 -o 3  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write big file 10G (odd size) - sync
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 220201060 -o 4096  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read/write big file 10G (odd size) - async
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 220201060 -o 4096  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond EOF (read 2G on a 1G file - async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714688 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond EOF (read 2G on a 1G file - sync)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714688 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond EOF - odd size (read 2G on a 1G file - sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714689 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714689 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond EOF odd offset (read 2G on a 1G file - sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714688 -o 616  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714688 -o 616  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read size beyond EOF (small file)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1099 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1099 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1099 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1099 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read just short of EOF
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1000 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1000 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1000 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1000 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read offset from EOF
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 999 -o 1024  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 999 -o 1024  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1 -o 1024  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1 -o 1024  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read offset beyond EOF (sync and async)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1 -o 1025  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1 -o 1025  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read with odd gpu_offset
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read with odd gpu_offset and odd file offset
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 617  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 617  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read at 128k GPU page offset
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Read beyond 64k (odd gpu offset)
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Overwrite an existing file within EOF
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 52428800 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 52428800 -o 1  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 52428805 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 52428805 -o 0  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:Offset beyond EOF writes
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 4096 -o 4099  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 4099  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:offset just short of EOF writes
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 3 -o 4094  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 3 -o 4094  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      TestCase:offset from EOF writes
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 4096 -o 3  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      /usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 3  -d 0 -t 0 -p 0
      file register error: nvidia-fs driver is not loaded
      FAILED
      **************************************************
      File offset and GPU Buffer offset Tests : 0 /  73 tests passed
      **************************************************
      running cufile sample tests
      sample 1
      FAILED
      sample 2
      opening file /data/sanity/tests//sparse1G_sample2
      FAILED
      sample 3
      FAILED
      sample 4
      FAILED
      sample 5
      FAILED
      sample 6
      FAILED
      sample 7
      FAILED
      sample 8
      PASS: cufile success status:Success
      SUCCESS
      sample 14
      opening file /data/sanity/tests//sparse1G
      FAILED
      sample 15
      FAILED
      **************************************************
      cufile sample tests : 1 /  10 tests passed
      **************************************************
      Testing gdscp functionality
      /usr/local/gds/tools/gdscp /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_copy 0 -v
      file register error: nvidia-fs driver is not loaded
      FAILED
      **************************************************
      gdscp tests : 0 /  1 tests passed
      **************************************************
      Testing Batch State Machine
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 0 && pass || fail
      FAILED
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 1 && pass || fail
      FAILED
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 2 && pass || fail
      FAILED
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 3 && pass || fail
      FAILED
      /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 4 && pass || fail
      FAILED
      **************************************************
      Batch State Machine Tests : 0 /  5 tests passed
      **************************************************
      Performing cufile API tests
      /usr/local/gds/tools//api_tests/cufile_testbufregister 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 4
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 5
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 6
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 7
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 8
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 9
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufregister 10
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 4
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testbufderegister 5
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriver 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriver 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriver 2
      cufile driver close: nvidia-fs driver is not loaded
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 1
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 2
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 3
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 4
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 5
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_rw  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_rwmanaged  /data/sanity/tests//sparse1G 0 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_rw_unreg  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_rw_unreg  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_rw_unreg  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 4
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 5
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 6
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 7
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 8
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 9
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -p 8
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -b 8
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -d 8
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -c 8
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -b 1024
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testdriverprops -d 1024
      FAILED
      /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 1
      FAILED
      /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 2
      FAILED
      /usr/local/gds/tools//api_tests/cufile_io_race /data/sanity/tests//sparse1G 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testvalidnvbuf  /data/sanity/tests//sparse1G 0 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testvalidnvbuf  /data/sanity/tests//sparse1G 0 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 3
      FAILED
      /usr/local/gds/tools//api_tests/cufile_io_race /data/sanity/tests//sparse1G 
      FAILED
      /usr/local/gds/tools//api_tests/cufile_invalid_write /data/sanity/tests//sparse1G 0 0
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_invalid_write /data/sanity/tests//sparse1G 0 1
      SUCCESS
      /usr/local/gds/tools//api_tests/cufile_invalid_offsets /data/sanity/tests//sparse1G 0 0
      FAILED
      /usr/local/gds/tools//api_tests/cufile_testcudacontext_switch /data/sanity/tests//sparse_CTX_VERIFY 0
      FAILED
      End: nvidia-fs:
      GDS Version: 1.2.1.4 
      NVFS statistics(ver: 4.0)
      NVFS Driver(version: 2.11.0)
      Mellanox PeerDirect Supported: True
      IO stats: Disabled, peer IO stats: Disabled
      Logging level: info
      Active Shadow-Buffer (MiB): 0
      Active Process: 0
      Reads                : err=0 io_state_err=0
      Sparse Reads                : n=0 io=0 holes=0 pages=0 
      Writes                : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
      Mmap                : n=0 ok=0 err=0 munmap=0
      Bar1-map            : n=0 ok=0 err=0 free=0 callbacks=0 active=0
      Error                : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
      Ops                : Read=0 Write=0 BatchIO=0
      **************************************************
      API Tests, : 10 /  63 tests passed
      **************************************************
      Testsuite : 41 / 182 tests passed
      done tests:Mon May 2 16:48:42 UTC 2022
      

      It seems that patch https://review.whamcloud.com/#/c/45327/ needs to be applied to master.
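
      As a sketch of how such a change can be applied to a local master checkout, assuming the usual Gerrit change-ref layout and project path (the patchset suffix /1 below is a placeholder; take the actual patchset number from the review page):

      $ git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
      $ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/27/45327/1
      $ git cherry-pick FETCH_HEAD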

      With MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64 and MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 the result is the same.

       

       

      Attachments

        Issue Links

          Activity

            [LU-15831] Lustre 2.15 client breaks DGXA100 MOFED

            okulachenko Oleg Kulachenko (Inactive) added a comment:

            Updating cuda to the latest version fixed it.


            okulachenko Oleg Kulachenko (Inactive) added a comment:

            I'm trying to run the tests.
            Now I'm getting this error:

            /usr/local/cuda-11.5/gds/tools/gdscheck.py -p
             cuInit Failed, error CUDA_ERROR_SYSTEM_NOT_READY
             cuFile initialization failed
             Platform verification error :
            CUDA Driver API error 

            But these are gds tools problems, not Lustre
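
            A quick way to separate driver problems from Lustre problems is to confirm that the CUDA driver and the nvidia_fs module answer at all before rerunning gdscheck; a minimal check, assuming the standard NVIDIA tool and module names:

            $ nvidia-smi                          # cuInit() cannot succeed until the CUDA driver responds
            $ lsmod | grep nvidia_fs              # GPUDirect Storage kernel module
            $ sudo modprobe nvidia_fs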


            gtapase Gaurang Tapase added a comment:

            Just built the master branch of fs/lustre-release on a DGX A100 and could load the lustre module as well.

            root@a100-01:/home/ddn/gtapase/exa-client# dpkg -l | grep lustre
            rc  lustre-client-modules-5.14.0-1032-oem             2.14.0-ddn39-11-g767352e-1              amd64        Lustre Linux kernel module (kernel 5.14.0-1032-oem)
            ii  lustre-client-modules-5.4.0-109-generic           2.15.50-13-gc524079-dirty-1             amd64        Lustre Linux kernel module (kernel 5.4.0-109-generic)
            rc  lustre-client-modules-5.4.0-96-generic            2.15.0-RC3-2-g7905359-1                 amd64        Lustre Linux kernel module (kernel 5.4.0-96-generic)
            ii  lustre-client-utils                               2.15.50-13-gc524079-dirty-1             amd64        Userspace utilities for the Lustre filesystem (client)
            ii  lustre-dev                                        2.15.50-13-gc524079-dirty-1             amd64        Development files for the Lustre filesystem
            rc  lustre-source                                     2.15.0-RC3-2-g7905359-1                 all          source for Lustre filesystem client kernel modules
            
            root@a100-01:/home/ddn/gtapase/exa-client# lsmod | grep lustre 
            lustre               1007616  0 
            lmv                   212992  1 lustre 
            mdc                   274432  1 lustre 
            lov                   331776  2 mdc,lustre 
            ptlrpc               1355776  7 fld,osc,fid,lov,mdc,lmv,lustre 
            obdclass             3297280  8 fld,osc,fid,ptlrpc,lov,mdc,lmv,lustre 
            lnet                  659456  6 osc,ko2iblnd,obdclass,ptlrpc,lmv,lustre 
            libcfs                245760  11 fld,lnet,osc,fid,ko2iblnd,obdclass,ptlrpc,lov,mdc,lmv,lustre
            

             


            gtapase Gaurang Tapase added a comment:

            I don't have access to a DGX A100 system, but I tried compiling Lustre 2.15 on an Ubuntu 20.04 system with MOFED 5.5 installed and it worked fine. I could load the lustre module as well.

            Is this something specific to DGX A100?


            adilger Andreas Dilger added a comment:

            Gaurang, any update on this issue? This is one of the few remaining issues before the 2.15.0 release.

            mdiep Minh Diep added a comment:

            gtapase please take a look


            okulachenko Oleg Kulachenko (Inactive) added a comment:

            mdiep It seems that the fix did not help.

            $ lctl get_param version
            version=2.15.0_RC3_2_g7905359
            
            ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64$ sudo ./mlnxofedinstall --add-kernel-support --distro ubuntu20.04
            Note: This program will create MLNX_OFED_LINUX TGZ for ubuntu20.04 under /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic directory.
            See log file /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/mlnx_iso.3333249_logs/mlnx_ofed_iso.3333249.log
            Checking if all needed packages are installed...
            Building MLNX_OFED_LINUX DEBS . Please wait...
            Creating metadata-rpms for 5.4.0-109-generic ...
            WARNING: If you are going to configure this package as a repository, then please note
            WARNING: that it is not signed, therefore, you need to set 'trusted=yes' in the sources.list file.
            WARNING: Example: deb [trusted=yes] file:/<path to MLNX_OFED DEBS folder> ./
            Created /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-ext.tgz
            Removing old packages...
            Uninstalling the previous version of MLNX_OFED_LINUX
            Installing /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-ext
            /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-ext/mlnxofedinstall --force --without-dkms --distro ubuntu20.04
            Logs dir: /tmp/MLNX_OFED_LINUX.3838794.logs
            General log file: /tmp/MLNX_OFED_LINUX.3838794.logs/general.log
            Below is the list of MLNX_OFED_LINUX packages that you have chosen
            (some may have been added by the installer due to package dependencies):
            ofed-scripts
            mlnx-tools
            mlnx-ofed-kernel-utils
            mlnx-ofed-kernel-modules
            iser-modules
            isert-modules
            srp-modules
            rdma-core
            libibverbs1
            ibverbs-utils
            ibverbs-providers
            libibverbs-dev
            libibverbs1-dbg
            libibumad3
            libibumad-dev
            ibacm
            librdmacm1
            rdmacm-utils
            librdmacm-dev
            mstflint
            ibdump
            libibmad5
            libibmad-dev
            libopensm
            opensm
            opensm-doc
            libopensm-devel
            libibnetdisc5
            infiniband-diags
            mft
            kernel-mft-modules
            perftest
            ibutils2
            ar-mgr
            dump-pr
            ibsim
            ibsim-doc
            ucx
            sharp
            hcoll
            openmpi
            mpitests
            knem-modules
            libdapl2
            dapl2-utils
            libdapl-dev
            dpcp
            srptools
            mlnx-ethtool
            mlnx-iproute2
            rshim
            This program will install the MLNX_OFED_LINUX package on your machine.
            Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
            Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.
            Checking SW Requirements...
            Removing old packages...
            Installing new packages
            Installing ofed-scripts-5.5...
            Installing mlnx-tools-5.2.0...
            Installing mlnx-ofed-kernel-utils-5.5...
            Installing mlnx-ofed-kernel-modules-5.5...
            Installing iser-modules-5.5...
            Installing isert-modules-5.5...
            Installing srp-modules-5.5...
            Installing rdma-core-55mlnx37...
            Installing libibverbs1-55mlnx37...
            Installing ibverbs-utils-55mlnx37...
            Installing ibverbs-providers-55mlnx37...
            Installing libibverbs-dev-55mlnx37...
            Installing libibverbs1-dbg-55mlnx37...
            Installing libibumad3-55mlnx37...
            Installing libibumad-dev-55mlnx37...
            Installing ibacm-55mlnx37...
            Installing librdmacm1-55mlnx37...
            Installing rdmacm-utils-55mlnx37...
            Installing librdmacm-dev-55mlnx37...
            Installing mstflint-4.16.0...
            Installing ibdump-6.0.0...
            Installing libibmad5-55mlnx37...
            Installing libibmad-dev-55mlnx37...
            Installing libopensm-5.10.0.MLNX20211115.e645cc83...
            Installing opensm-5.10.0.MLNX20211115.e645cc83...
            Installing opensm-doc-5.10.0.MLNX20211115.e645cc83...
            Installing libopensm-devel-5.10.0.MLNX20211115.e645cc83...
            Installing libibnetdisc5-55mlnx37...
            Installing infiniband-diags-55mlnx37...
            Installing mft-4.18.0...
            Installing kernel-mft-modules-4.18.0...
            Installing perftest-4.5...
            Installing ibutils2-2.1.1...
            Installing ar-mgr-1.0...
            Installing dump-pr-1.0...
            Installing ibsim-0.10...
            Installing ibsim-doc-0.10...
            Installing ucx-1.12.0...
            Installing sharp-2.6.1.MLNX20211124.aac4a56...
            Installing hcoll-4.7.3202...
            Installing openmpi-4.1.2rc2...
            Installing mpitests-3.2.20...
            Installing knem-modules-1.1.4.90mlnx1...
            Installing libdapl2-2.1.10.1.mlnx...
            Installing dapl2-utils-2.1.10.1.mlnx...
            Installing libdapl-dev-2.1.10.1.mlnx...
            Installing dpcp-1.1.17...
            Installing srptools-55mlnx37...
            Installing mlnx-ethtool-5.13...
            Installing mlnx-iproute2-5.14.0...
            Installing rshim-2.0.6...
            Selecting previously unselected package mlnx-fw-updater.
            (Reading database ... 224924 files and directories currently installed.)
            Preparing to unpack .../mlnx-fw-updater_5.5-1.0.3.2_amd64.deb ...
            Unpacking mlnx-fw-updater (5.5-1.0.3.2) ...
            Setting up mlnx-fw-updater (5.5-1.0.3.2) ...
            Added 'RUN_FW_UPDATER_ONBOOT=no to /etc/infiniband/openib.conf
            Initializing...
            Attempting to perform Firmware update...
            Querying Mellanox devices firmware ...
            Querying Mellanox devices firmware ...
            Querying Mellanox devices firmware ...
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653105A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000223
              PCI Device Name:  0c:00.0
              Base GUID:        0c42a10300555aaa
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/lJcLV6m0FI
            Querying Mellanox devices firmware ...
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653105A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000223
              PCI Device Name:  12:00.0
              Base GUID:        0c42a10300555dee
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/paBqhmZPH7
            Querying Mellanox devices firmware ...
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653105A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000223
              PCI Device Name:  4b:00.0
              Base GUID:        043f720300f55646
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/ovxoNfgM6c
            Querying Mellanox devices firmware ...
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653105A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000223
              PCI Device Name:  54:00.0
              Base GUID:        0c42a10300555dbe
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/FZHvz2eu6S
            Querying Mellanox devices firmware ...
            Querying Mellanox devices firmware ...
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653106A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000225
              PCI Device Name:  61:00.0
              Base MAC:         1c34da6c9046
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/J6qmeEZp3k
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653105A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000223
              PCI Device Name:  8d:00.0
              Base GUID:        0c42a10300555d62
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/i3Ih6BBFkg
            Querying Mellanox devices firmware ...
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653105A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000223
              PCI Device Name:  94:00.0
              Base GUID:        0c42a10300555afe
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/ITvKfCj1dJ
            Querying Mellanox devices firmware ...
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653105A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000223
              PCI Device Name:  ba:00.0
              Base GUID:        0c42a10300555af6
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/Q8rB0zMS9E
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653105A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000223
              PCI Device Name:  cc:00.0
              Base GUID:        0c42a10300555e02
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/7EZbD0vG0H
            Device #1:
            ----------  Device Type:      ConnectX6
              Part Number:      MCX653106A-HDA_Ax
              Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
              PSID:             MT_0000000225
              PCI Device Name:  e1:00.0
              Base MAC:         0c42a11b7dee
              Versions:         Current        Available
                 FW             20.33.1048     20.32.1010
                 PXE            3.6.0502       3.6.0502
                 UEFI           14.26.0017     14.25.0017  Status:           Up to date
            Log File: /tmp/gQkEPSV1Mp
            Real log file: /tmp/MLNX_OFED_LINUX.3838794.logs/fw_update.log
            Device (0c:00.0):
                0c:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (12:00.0):
                12:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (4b:00.0):
                4b:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (54:00.0):
                54:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (61:00.0):
                61:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (61:00.1):
                61:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (8d:00.0):
                8d:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (94:00.0):
                94:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (ba:00.0):
                ba:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (cc:00.0):
                cc:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (e1:00.0):
                e1:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Device (e1:00.1):
                e1:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
                Link Width: x16
                PCI Link Speed: 16GT/s
            Installation passed successfully
            To load the new driver, run:
            /etc/init.d/openibd restart
            
            ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64$ sudo /etc/init.d/openibd restart
            Unloading HCA driver:                                      [  OK  ]
            Loading Mellanox MLX5_IB HCA driver:                       [FAILED]
            Loading Mellanox MLX5 HCA driver:                          [FAILED]
            Loading HCA driver and Access Layer:                       [FAILED]
            Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information
            and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService
            
            ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64$ sudo modprobe -v lustre
            insmod /lib/modules/5.4.0-109-generic/updates/kernel/net/libcfs.ko cpu_npartitions=32
            insmod /lib/modules/5.4.0-109-generic/updates/kernel/net/lnet.ko networks="o2ib(ibp12s0,ibp18s0,enp225s0f1,ibp75s0,ibp84s0,enp97s0f1,ibp141s0,ibp148s0,ibp186s0,ibp204s0)" lnet_transaction_timeout=100 lnet_retry_count=2
            insmod /lib/modules/5.4.0-109-generic/updates/kernel/fs/obdclass.ko
            insmod /lib/modules/5.4.0-109-generic/updates/kernel/fs/ptlrpc.ko
            modprobe: ERROR: could not insert 'lustre': Network is down
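
            The "Network is down" error is returned by LNet rather than by the lustre module itself, so it can help to confirm that LNet actually brought up the o2ib network before lustre is loaded. A minimal check, assuming lnetctl from lustre-client-utils is available:

            $ sudo modprobe lnet
            $ sudo lnetctl lnet configure --all   # apply the NIDs from the module options
            $ sudo lnetctl net show               # the o2ib net should list the ibp* interfaces
            $ sudo lctl list_nids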
            okulachenko Oleg Kulachenko (Inactive) added a comment - mdiep It seems that the fix did not help.
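
            A minimal diagnostic sketch for the "Network is down" error from modprobe lustre: it typically means LNet could not bring up the o2ib network because the underlying mlx5/IB stack never loaded. The interface name below is taken from the networks= line in the log above and is only an example; lnetctl and lctl are the standard Lustre utilities, nothing ticket-specific is assumed.

            $ lsmod | grep -E 'ko2iblnd|lnet|mlx5'   # are the o2ib LND and mlx5 stack loaded at all?
            $ sudo lnetctl lnet configure            # initialize LNet if it is not yet up
            $ sudo lnetctl net show                  # list configured networks/NIDs
            $ sudo lctl list_nids                    # should print <address>@o2ib entries
            $ ip -br link show ibp12s0               # check that the IB port exists and is up
            $ dmesg | tail -n 30                     # look for ko2iblnd / mlx5_core errors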

            After reinstalling Lustre, mlnxofedinstall completes without errors.

            Logs: mlnx_logs.txt

            But:

            ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64$ sudo lustre_rmmod
            [sudo] password for ddn:
            ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64$ sudo /etc/init.d/openibd restart
            Unloading HCA driver:                                      [  OK  ]
            Loading Mellanox MLX5_IB HCA driver:                       [FAILED]
            Loading Mellanox MLX5 HCA driver:                          [FAILED]
            Loading HCA driver and Access Layer:                       [FAILED]
            Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information
            and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService
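
            A hedged checklist for the openibd restart failures above; the usual suspect is a module that still pins the mlx5/ib_core stack (for example ko2iblnd or nv_peer_mem left loaded), so this sketch only checks for holders before retrying and is not a documented MOFED procedure:

            $ lsmod | grep -E 'mlx5|ib_core|ko2iblnd|nv_peer_mem'   # anything still holding the IB stack?
            $ sudo lustre_rmmod                                     # unload Lustre/LNet modules first
            $ sudo rmmod ko2iblnd 2>/dev/null || true               # in case the LND was left behind
            $ sudo /etc/init.d/openibd restart
            $ dmesg | tail -n 30                                    # mlx5_ib/mlx5_core load errors, if any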

            I think we have found a new issue. We will re-image the A100 and run a firmware upgrade. Most likely the new issue is not in Lustre but in the firmware.
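
            If the new failure really is firmware rather than Lustre, the installed and available firmware can be compared before re-imaging; a short sketch, assuming the MFT tools bundled with MOFED (mlxfwmanager, ibstat) are installed:

            $ ofed_info -s                                 # installed MLNX_OFED version
            $ sudo mlxfwmanager --query                    # current vs. available FW per adapter
            $ ibstat | grep -E "CA '|Firmware version"     # FW version as seen by the IB stack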

             

             

             

            okulachenko Oleg Kulachenko (Inactive) added a comment - After reinstalling Lustre, mlnxofedinstall completes without errors.
            pjones Peter Jones added a comment -

            Landed for 2.15

            mdiep Minh Diep added a comment - revert patch https://review.whamcloud.com/47238

            People

              gtapase Gaurang Tapase
              okulachenko Oleg Kulachenko (Inactive)
              Votes:
              0
              Watchers:
              5
