[LU-15831] Lustre 2.15 client breaks DGXA100 MOFED Created: 06/May/22  Updated: 25/May/22  Resolved: 25/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Kulachenko (Inactive) Assignee: Gaurang Tapase
Resolution: Fixed Votes: 0
Labels: debian, ubuntu

Attachments: Text File mlnx_logs.txt    
Issue Links:
Related
is related to LU-12019 Recognize Debian Kernel in autoconf a... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While setting up GPUDirect, the step that installs the required software failed:

$ sudo ./mlnxofedinstall
...
Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart

$ sudo /etc/init.d/openibd restart
Unloading ib_uverbs [FAILED]
rmmod: ERROR: Module ib_uverbs is in use by: nv_peer_mem

$ sudo rmmod nv_peer_mem

$ sudo /etc/init.d/openibd restart
Unloading HCA driver:[ OK ]
Loading Mellanox MLX5_IB HCA driver:  [FAILED]
Loading Mellanox MLX5 HCA driver: [FAILED]
Loading HCA driver and Access Layer:  [FAILED]
Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information
and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService

$ sudo modprobe nv_peer_mem
modprobe: FATAL: Module nv_peer_mem not found in directory /lib/modules/5.4.0-109-generic

$ sudo modprobe lustre
modprobe: ERROR: could not insert 'lustre': Invalid argument

 

$ sudo ./mlnxofedinstall

Checking SW Requirements...
Removing old packages...
Installing new packages
Installing ofed-scripts-5.5...
Installing mlnx-tools-5.2.0...
Installing mlnx-ofed-kernel-utils-5.5...
Installing mlnx-ofed-kernel-dkms-5.5...
Error: mlnx-ofed-kernel-dkms installation failed!
Problem: mlx5_ib: module file: /lib/modules/5.4.0-105-generic/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko, from package: linux-modules-extra-5.4.0-105-generic.
Collecting debug info...
See:
    /tmp/MLNX_OFED_LINUX.1302312.logs/mlnx-ofed-kernel-dkms.debinstall.log
Removing newly installed packages...
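The two transcripts above name two different kernel releases under /lib/modules: 5.4.0-109-generic in the modprobe error and 5.4.0-105-generic in the installer's conflict message. The following is a minimal Python sketch for pulling those release strings out of a saved log. The helper name kernel_releases and the sample text are illustrative assumptions, and this ticket does not establish whether the mismatch contributes to the failure.

```python
import re

# Matches the kernel release directory in paths such as /lib/modules/<release>/...
KERNEL_RELEASE = re.compile(r"/lib/modules/([^/\s]+)")


def kernel_releases(log_text):
    """Return the distinct kernel releases found under /lib/modules,
    in order of first appearance (hypothetical helper)."""
    seen = []
    for release in KERNEL_RELEASE.findall(log_text):
        if release not in seen:
            seen.append(release)
    return seen


# Two lines copied from the transcripts in this ticket.
sample = """\
modprobe: FATAL: Module nv_peer_mem not found in directory /lib/modules/5.4.0-109-generic
Problem: mlx5_ib: module file: /lib/modules/5.4.0-105-generic/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko, from package: linux-modules-extra-5.4.0-105-generic.
"""

releases = kernel_releases(sample)
print(releases)
```

More than one entry in the result suggests that module files for several kernels are present on the host, which may be worth checking against the running kernel.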

As a result, the nvidia-fs driver is not loaded and most of the GDS tests fail (41 of 182 pass):

=========================
 Platform verification error :
nvidia-fs driver is not loadedSUCCESS
FILESYSTEM VERSION CHECK:
ofed_info:
current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
SUCCESS
nvidia-fs driver is not loadedSUCCESS
usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
GPUDirectStorage platform checker
optional arguments:
  -h, --help  show this help message and exit
  -p          gds platform check
  -f FILE     gds file check
  -v          gds version checks
  -V          gds fs checks
SUCCESS
gdscheck.py python2 tests
=========================
 Platform verification error :
nvidia-fs driver is not loadedSUCCESS
FILESYSTEM VERSION CHECK:
ofed_info:
current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
SUCCESS
nvidia-fs driver is not loadedSUCCESS
usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
GPUDirectStorage platform checker
optional arguments:
  -h, --help  show this help message and exit
  -p          gds platform check
  -f FILE     gds file check
  -v          gds version checks
  -V          gds fs checks
SUCCESS
gdscheck.py current running python tests
=========================
 Platform verification error :
nvidia-fs driver is not loadedSUCCESS
FILESYSTEM VERSION CHECK:
ofed_info:
current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
SUCCESS
nvidia-fs driver is not loadedSUCCESS
usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
GPUDirectStorage platform checker
optional arguments:
  -h, --help  show this help message and exit
  -p          gds platform check
  -f FILE     gds file check
  -v          gds version checks
  -V          gds fs checks
SUCCESS
**************************************************
gdscheck.py test results : 12 /  12 tests passed
**************************************************
Starting basic gdsio Tests
/usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 4k
SUCCESS
/usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 3k
SUCCESS
/usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 3k -o 1
SUCCESS
/usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 2k
SUCCESS
/usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 2k -o 1
SUCCESS
/usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 1k
SUCCESS
/usr/local/gds/tools/gdsio -f  /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 1k -o 1
SUCCESS
/usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1
Verifying data 
SUCCESS
/usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1
Verifying data 
SUCCESS
/usr/local/gds/tools/gdsio -V -D /data/sanity/tests// -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1
Verifying data 
SUCCESS
/usr/local/gds/tools/gdsio -V -D /data/sanity/tests// -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1
Verifying data 
SUCCESS
/usr/local/gds/tools/gdsio -D /data/sanity/tests// -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1 -F -R
SUCCESS
/usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1 -b
Verifying data 
SUCCESS
/usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 8M:32M -x 0 -I 1 -o 1 -b
Verifying data 
SUCCESS
/usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 0 -o 1 -b
SUCCESS
/usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 8M:32M -x 0 -I 0 -o 1 -b
SUCCESS
/usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 0 -o 0 -b
SUCCESS
/usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0  -w 8 -s 1G -i 8M:32M -x 0 -I 0 -o 0 -b
SUCCESS
**************************************************
gdiso tests : 18 /  18 tests passed
**************************************************
Starting Offset Tests
TestCase:Read odd offset
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 616  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 616  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read odd gpu offsets 1, 2, 3, 4, 4K-1, 4K, 4K+1, 60K, 64K, 68K
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 0  -d 0 -t 1 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 0  -d 0 -t 1 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 1 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 2 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 3 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4095 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4096 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4097 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 4097 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 61440 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 65536 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 69632 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 1 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 2 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 3 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4095 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4096 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 4097 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 61440 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 65536 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 69632 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read/write odd size - sync
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485761 -o 4096  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read/write odd size - async
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485761 -o 4096  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:odd offset and odd size - sync
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485748 -o 119  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:odd offset and odd size - async
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485748 -o 119  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read/write 1 byte from offset 0 (sync and async)
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 1 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 1 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read/write 1 byte from offset 3 (sync and async)
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 1 -o 3  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 1 -o 3  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read/write big file 10G (odd size) - sync
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 220201060 -o 4096  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read/write big file 10G (odd size) - async
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 220201060 -o 4096  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read beyond EOF (read 2G on a 1G file - async)
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714688 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read beyond EOF (read 2G on a 1G file - sync)
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714688 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read beyond EOF - odd size (read 2G on a 1G file - sync and async)
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714689 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714689 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read beyond EOF odd offset (read 2G on a 1G file - sync and async)
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714688 -o 616  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714688 -o 616  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read size beyond EOF (small file)
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1099 -o 1  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1099 -o 1  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1099 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1099 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read just short of EOF
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1000 -o 1  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1000 -o 1  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1000 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1000 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read offset from EOF
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 999 -o 1024  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 999 -o 1024  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1 -o 1024  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1 -o 1024  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read offset beyond EOF (sync and async)
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1 -o 1025  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1 -o 1025  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read with odd gpu_offset
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read with odd gpu_offset and odd file offset
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 617  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 617  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read at 128k GPU page offset
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Read beyond 64k (odd gpu offset)
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Overwrite an existing file within EOF
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 52428800 -o 1  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 52428800 -o 1  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 52428805 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 52428805 -o 0  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:Offset beyond EOF writes
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 4096 -o 4099  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 4099  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:offset just short of EOF writes
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 3 -o 4094  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 3 -o 4094  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
TestCase:offset from EOF writes
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 4096 -o 3  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
/usr/local/gds/tools/gdsio_verify  -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 3  -d 0 -t 0 -p 0
file register error: nvidia-fs driver is not loaded
FAILED
**************************************************
File offset and GPU Buffer offset Tests : 0 /  73 tests passed
**************************************************
running cufile sample tests
sample 1
FAILED
sample 2
opening file /data/sanity/tests//sparse1G_sample2
FAILED
sample 3
FAILED
sample 4
FAILED
sample 5
FAILED
sample 6
FAILED
sample 7
FAILED
sample 8
PASS: cufile success status:Success
SUCCESS
sample 14
opening file /data/sanity/tests//sparse1G
FAILED
sample 15
FAILED
**************************************************
cufile sample tests : 1 /  10 tests passed
**************************************************
Testing gdscp functionality
/usr/local/gds/tools/gdscp /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_copy 0 -v
file register error: nvidia-fs driver is not loaded
FAILED
**************************************************
gdscp tests : 0 /  1 tests passed
**************************************************
Testing Batch State Machine
/usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 0 && pass || fail
FAILED
/usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 1 && pass || fail
FAILED
/usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 2 && pass || fail
FAILED
/usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 3 && pass || fail
FAILED
/usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 4 && pass || fail
FAILED
**************************************************
Batch State Machine Tests : 0 /  5 tests passed
**************************************************
Performing cufile API tests
/usr/local/gds/tools//api_tests/cufile_testbufregister 0
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 1
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 2
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 3
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 4
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 5
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 6
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 7
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 8
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 9
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufregister 10
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufderegister 0
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufderegister 1
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufderegister 2
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufderegister 3
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufderegister 4
FAILED
/usr/local/gds/tools//api_tests/cufile_testbufderegister 5
FAILED
/usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 0
FAILED
/usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 1
FAILED
/usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 2
FAILED
/usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 3
FAILED
/usr/local/gds/tools//api_tests/cufile_testdriver 0
FAILED
/usr/local/gds/tools//api_tests/cufile_testdriver 1
FAILED
/usr/local/gds/tools//api_tests/cufile_testdriver 2
cufile driver close: nvidia-fs driver is not loaded
SUCCESS
/usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 0
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 1
SUCCESS
/usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 2
SUCCESS
/usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 3
SUCCESS
/usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 4
SUCCESS
/usr/local/gds/tools//api_tests/cufile_testopenfd  /data/sanity/tests//sparse1G 5
SUCCESS
/usr/local/gds/tools//api_tests/cufile_rw  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 
FAILED
/usr/local/gds/tools//api_tests/cufile_rwmanaged  /data/sanity/tests//sparse1G 0 
FAILED
/usr/local/gds/tools//api_tests/cufile_rw_unreg  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 1
FAILED
/usr/local/gds/tools//api_tests/cufile_rw_unreg  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 2
FAILED
/usr/local/gds/tools//api_tests/cufile_rw_unreg  /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 3
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 0
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 1
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 2
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 3
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 4
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 5
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 6
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 7
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 8
FAILED
/usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 9
FAILED
/usr/local/gds/tools//api_tests/cufile_testdriverprops -p 8
FAILED
/usr/local/gds/tools//api_tests/cufile_testdriverprops -b 8
SUCCESS
/usr/local/gds/tools//api_tests/cufile_testdriverprops -d 8
SUCCESS
/usr/local/gds/tools//api_tests/cufile_testdriverprops -c 8
FAILED
/usr/local/gds/tools//api_tests/cufile_testdriverprops -b 1024
FAILED
/usr/local/gds/tools//api_tests/cufile_testdriverprops -d 1024
FAILED
/usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 0
FAILED
/usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 1
FAILED
/usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 2
FAILED
/usr/local/gds/tools//api_tests/cufile_io_race /data/sanity/tests//sparse1G 
FAILED
/usr/local/gds/tools//api_tests/cufile_testvalidnvbuf  /data/sanity/tests//sparse1G 0 
FAILED
/usr/local/gds/tools//api_tests/cufile_testvalidnvbuf  /data/sanity/tests//sparse1G 0 
FAILED
/usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 3
FAILED
/usr/local/gds/tools//api_tests/cufile_io_race /data/sanity/tests//sparse1G 
FAILED
/usr/local/gds/tools//api_tests/cufile_invalid_write /data/sanity/tests//sparse1G 0 0
SUCCESS
/usr/local/gds/tools//api_tests/cufile_invalid_write /data/sanity/tests//sparse1G 0 1
SUCCESS
/usr/local/gds/tools//api_tests/cufile_invalid_offsets /data/sanity/tests//sparse1G 0 0
FAILED
/usr/local/gds/tools//api_tests/cufile_testcudacontext_switch /data/sanity/tests//sparse_CTX_VERIFY 0
FAILED
End: nvidia-fs:
GDS Version: 1.2.1.4 
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.11.0)
Mellanox PeerDirect Supported: True
IO stats: Disabled, peer IO stats: Disabled
Logging level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads                : err=0 io_state_err=0
Sparse Reads                : n=0 io=0 holes=0 pages=0 
Writes                : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap                : n=0 ok=0 err=0 munmap=0
Bar1-map            : n=0 ok=0 err=0 free=0 callbacks=0 active=0
Error                : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops                : Read=0 Write=0 BatchIO=0
**************************************************
API Tests, : 10 /  63 tests passed
**************************************************
Testsuite : 41 / 182 tests passed
done tests:Mon May 2 16:48:42 UTC 2022
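The per-suite summary lines in the log above can be cross-checked against the final Testsuite line. The following is a small Python sketch, assuming the summaries keep the "<name> : <passed> / <total> tests passed" layout seen in this run. The tally helper is hypothetical and is not part of the GDS tools.

```python
import re

# Matches summary lines such as "gdscheck.py test results : 12 /  12 tests passed".
SUMMARY = re.compile(
    r"^(?P<name>.+?)\s*:\s*(?P<passed>\d+)\s*/\s*(?P<total>\d+) tests passed$"
)


def tally(log_text):
    """Map each summary name to a (passed, total) tuple (hypothetical helper)."""
    suites = {}
    for line in log_text.splitlines():
        m = SUMMARY.match(line.strip())
        if m:
            suites[m.group("name")] = (int(m.group("passed")), int(m.group("total")))
    return suites


# Summary lines copied from the log in this ticket.
sample = """\
gdscheck.py test results : 12 /  12 tests passed
gdiso tests : 18 /  18 tests passed
File offset and GPU Buffer offset Tests : 0 /  73 tests passed
cufile sample tests : 1 /  10 tests passed
gdscp tests : 0 /  1 tests passed
Batch State Machine Tests : 0 /  5 tests passed
API Tests, : 10 /  63 tests passed
Testsuite : 41 / 182 tests passed
"""

suites = tally(sample)
overall = suites.pop("Testsuite")            # the log's own grand total
passed = sum(p for p, _ in suites.values())  # sum of per-suite passes
total = sum(t for _, t in suites.values())   # sum of per-suite test counts
print(passed, total, overall)
```

For this run the per-suite sums agree with the reported total, and the suites that pass are the ones that do not need the nvidia-fs driver to register files.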

It seems that patch https://review.whamcloud.com/#/c/45327/ needs to be applied to master.

With MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64 and MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 the result is the same.

 

 



 Comments   
Comment by Peter Jones [ 06/May/22 ]

Minh

Can you please advise?

Thanks

Peter

Comment by Minh Diep [ 06/May/22 ]

This is caused by LU-12019 (https://review.whamcloud.com/34329).

Comment by Peter Jones [ 06/May/22 ]

OK. Can we revert that change for 2.15.0 and then take more time to assess how to support it without introducing a regression for other use cases?

Comment by Minh Diep [ 06/May/22 ]

Revert patch: https://review.whamcloud.com/47238

Comment by Peter Jones [ 09/May/22 ]

Landed for 2.15

Comment by Oleg Kulachenko (Inactive) [ 11/May/22 ]

After reinstalling Lustre, mlnxofedinstall works without errors.

Logs: mlnx_logs.txt

But:

ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64$ sudo lustre_rmmod
[sudo] password for ddn:
ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64$ sudo /etc/init.d/openibd restart
Unloading HCA driver:                                      [  OK  ]
Loading Mellanox MLX5_IB HCA driver:                       [FAILED]
Loading Mellanox MLX5 HCA driver:                          [FAILED]
Loading HCA driver and Access Layer:                       [FAILED]
Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information
and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService

I think we have found a new issue. We will re-image the A100 and run a firmware upgrade. Most likely the new issue is not in Lustre but in the firmware.

 

 

 

Comment by Oleg Kulachenko (Inactive) [ 12/May/22 ]

mdiep, it seems that the fix did not help.

$ lctl get_param version
version=2.15.0_RC3_2_g7905359
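The version string above looks like a git-describe style name with dashes replaced by underscores. Assuming that layout (base version, optional tag, commits since the tag, abbreviated commit hash), which this ticket does not itself confirm, a Python sketch for splitting it into fields could look like this. The function name parse_lustre_version is hypothetical.

```python
import re

# Assumed layout: version=<base>[_<tag>][_<commits>_g<hash>]
VERSION = re.compile(
    r"^version=(?P<base>\d+\.\d+\.\d+)"
    r"(?:_(?P<tag>[A-Za-z][A-Za-z0-9]*))?"
    r"(?:_(?P<commits>\d+)_g(?P<hash>[0-9a-f]+))?$"
)


def parse_lustre_version(line):
    """Split one line of `lctl get_param version` output into fields."""
    m = VERSION.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized version line: {line!r}")
    commits = m.group("commits")
    return {
        "base": m.group("base"),
        "tag": m.group("tag"),
        # No commit-count suffix is treated as zero commits past the tag.
        "commits": int(commits) if commits is not None else 0,
        "hash": m.group("hash"),
    }


info = parse_lustre_version("version=2.15.0_RC3_2_g7905359")
print(info)
```

Under this reading, the client was built two commits past the 2.15.0 RC3 tag, which is one way to check whether a given build could already contain the revert.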

ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64$ sudo ./mlnxofedinstall --add-kernel-support --distro ubuntu20.04
Note: This program will create MLNX_OFED_LINUX TGZ for ubuntu20.04 under /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic directory.
See log file /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/mlnx_iso.3333249_logs/mlnx_ofed_iso.3333249.log
Checking if all needed packages are installed...
Building MLNX_OFED_LINUX DEBS . Please wait...
Creating metadata-rpms for 5.4.0-109-generic ...
WARNING: If you are going to configure this package as a repository, then please note
WARNING: that it is not signed, therefore, you need to set 'trusted=yes' in the sources.list file.
WARNING: Example: deb [trusted=yes] file:/<path to MLNX_OFED DEBS folder> ./
Created /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-ext.tgz
Removing old packages...
Uninstalling the previous version of MLNX_OFED_LINUX
Installing /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-ext
/tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-ext/mlnxofedinstall --force --without-dkms --distro ubuntu20.04
Logs dir: /tmp/MLNX_OFED_LINUX.3838794.logs
General log file: /tmp/MLNX_OFED_LINUX.3838794.logs/general.log
Below is the list of MLNX_OFED_LINUX packages that you have chosen
(some may have been added by the installer due to package dependencies):
ofed-scripts
mlnx-tools
mlnx-ofed-kernel-utils
mlnx-ofed-kernel-modules
iser-modules
isert-modules
srp-modules
rdma-core
libibverbs1
ibverbs-utils
ibverbs-providers
libibverbs-dev
libibverbs1-dbg
libibumad3
libibumad-dev
ibacm
librdmacm1
rdmacm-utils
librdmacm-dev
mstflint
ibdump
libibmad5
libibmad-dev
libopensm
opensm
opensm-doc
libopensm-devel
libibnetdisc5
infiniband-diags
mft
kernel-mft-modules
perftest
ibutils2
ar-mgr
dump-pr
ibsim
ibsim-doc
ucx
sharp
hcoll
openmpi
mpitests
knem-modules
libdapl2
dapl2-utils
libdapl-dev
dpcp
srptools
mlnx-ethtool
mlnx-iproute2
rshim
This program will install the MLNX_OFED_LINUX package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.
Checking SW Requirements...
Removing old packages...
Installing new packages
Installing ofed-scripts-5.5...
Installing mlnx-tools-5.2.0...
Installing mlnx-ofed-kernel-utils-5.5...
Installing mlnx-ofed-kernel-modules-5.5...
Installing iser-modules-5.5...
Installing isert-modules-5.5...
Installing srp-modules-5.5...
Installing rdma-core-55mlnx37...
Installing libibverbs1-55mlnx37...
Installing ibverbs-utils-55mlnx37...
Installing ibverbs-providers-55mlnx37...
Installing libibverbs-dev-55mlnx37...
Installing libibverbs1-dbg-55mlnx37...
Installing libibumad3-55mlnx37...
Installing libibumad-dev-55mlnx37...
Installing ibacm-55mlnx37...
Installing librdmacm1-55mlnx37...
Installing rdmacm-utils-55mlnx37...
Installing librdmacm-dev-55mlnx37...
Installing mstflint-4.16.0...
Installing ibdump-6.0.0...
Installing libibmad5-55mlnx37...
Installing libibmad-dev-55mlnx37...
Installing libopensm-5.10.0.MLNX20211115.e645cc83...
Installing opensm-5.10.0.MLNX20211115.e645cc83...
Installing opensm-doc-5.10.0.MLNX20211115.e645cc83...
Installing libopensm-devel-5.10.0.MLNX20211115.e645cc83...
Installing libibnetdisc5-55mlnx37...
Installing infiniband-diags-55mlnx37...
Installing mft-4.18.0...
Installing kernel-mft-modules-4.18.0...
Installing perftest-4.5...
Installing ibutils2-2.1.1...
Installing ar-mgr-1.0...
Installing dump-pr-1.0...
Installing ibsim-0.10...
Installing ibsim-doc-0.10...
Installing ucx-1.12.0...
Installing sharp-2.6.1.MLNX20211124.aac4a56...
Installing hcoll-4.7.3202...
Installing openmpi-4.1.2rc2...
Installing mpitests-3.2.20...
Installing knem-modules-1.1.4.90mlnx1...
Installing libdapl2-2.1.10.1.mlnx...
Installing dapl2-utils-2.1.10.1.mlnx...
Installing libdapl-dev-2.1.10.1.mlnx...
Installing dpcp-1.1.17...
Installing srptools-55mlnx37...
Installing mlnx-ethtool-5.13...
Installing mlnx-iproute2-5.14.0...
Installing rshim-2.0.6...
Selecting previously unselected package mlnx-fw-updater.
(Reading database ... 224924 files and directories currently installed.)
Preparing to unpack .../mlnx-fw-updater_5.5-1.0.3.2_amd64.deb ...
Unpacking mlnx-fw-updater (5.5-1.0.3.2) ...
Setting up mlnx-fw-updater (5.5-1.0.3.2) ...
Added 'RUN_FW_UPDATER_ONBOOT=no to /etc/infiniband/openib.conf
Initializing...
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...
Querying Mellanox devices firmware ...
Querying Mellanox devices firmware ...
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653105A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000223
  PCI Device Name:  0c:00.0
  Base GUID:        0c42a10300555aaa
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/lJcLV6m0FI
Querying Mellanox devices firmware ...
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653105A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000223
  PCI Device Name:  12:00.0
  Base GUID:        0c42a10300555dee
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/paBqhmZPH7
Querying Mellanox devices firmware ...
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653105A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000223
  PCI Device Name:  4b:00.0
  Base GUID:        043f720300f55646
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/ovxoNfgM6c
Querying Mellanox devices firmware ...
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653105A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000223
  PCI Device Name:  54:00.0
  Base GUID:        0c42a10300555dbe
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/FZHvz2eu6S
Querying Mellanox devices firmware ...
Querying Mellanox devices firmware ...
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653106A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000225
  PCI Device Name:  61:00.0
  Base MAC:         1c34da6c9046
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/J6qmeEZp3k
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653105A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000223
  PCI Device Name:  8d:00.0
  Base GUID:        0c42a10300555d62
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/i3Ih6BBFkg
Querying Mellanox devices firmware ...
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653105A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000223
  PCI Device Name:  94:00.0
  Base GUID:        0c42a10300555afe
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/ITvKfCj1dJ
Querying Mellanox devices firmware ...
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653105A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000223
  PCI Device Name:  ba:00.0
  Base GUID:        0c42a10300555af6
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/Q8rB0zMS9E
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653105A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000223
  PCI Device Name:  cc:00.0
  Base GUID:        0c42a10300555e02
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/7EZbD0vG0H
Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653106A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000225
  PCI Device Name:  e1:00.0
  Base MAC:         0c42a11b7dee
  Versions:         Current        Available
     FW             20.33.1048     20.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.26.0017     14.25.0017
  Status:           Up to date
Log File: /tmp/gQkEPSV1Mp
Real log file: /tmp/MLNX_OFED_LINUX.3838794.logs/fw_update.log
Device (0c:00.0):
    0c:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (12:00.0):
    12:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (4b:00.0):
    4b:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (54:00.0):
    54:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (61:00.0):
    61:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (61:00.1):
    61:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (8d:00.0):
    8d:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (94:00.0):
    94:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (ba:00.0):
    ba:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (cc:00.0):
    cc:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (e1:00.0):
    e1:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Device (e1:00.1):
    e1:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    Link Width: x16
    PCI Link Speed: 16GT/s
Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart

ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64$ sudo /etc/init.d/openibd restart
Unloading HCA driver:                                      [  OK  ]
Loading Mellanox MLX5_IB HCA driver:                       [FAILED]
Loading Mellanox MLX5 HCA driver:                          [FAILED]
Loading HCA driver and Access Layer:                       [FAILED]
Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information
and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService

ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64$ sudo modprobe -v lustre
insmod /lib/modules/5.4.0-109-generic/updates/kernel/net/libcfs.ko cpu_npartitions=32
insmod /lib/modules/5.4.0-109-generic/updates/kernel/net/lnet.ko networks="o2ib(ibp12s0,ibp18s0,enp225s0f1,ibp75s0,ibp84s0,enp97s0f1,ibp141s0,ibp148s0,ibp186s0,ibp204s0)" lnet_transaction_timeout=100 lnet_retry_count=2
insmod /lib/modules/5.4.0-109-generic/updates/kernel/fs/obdclass.ko
insmod /lib/modules/5.4.0-109-generic/updates/kernel/fs/ptlrpc.ko
modprobe: ERROR: could not insert 'lustre': Network is down
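
One possible way to narrow down the "Network is down" error: since openibd failed to load the MLX5 drivers, the interfaces named in the lnet `networks=` option may simply not exist at that point. The sketch below lists the interfaces from a networks string and reports what sysfs says about each one. The function names `lnet_ifaces` and `check_ifaces` are hypothetical, and the parsing assumes a single network of the simple form net(if1,if2,...); this is not a tool from Lustre or MLNX_OFED.

```shell
#!/bin/sh
# Hypothetical helper: print the interfaces named in an LNet "networks" string
# of the simple form net(if1,if2,...), one interface per line.
lnet_ifaces() {
    printf '%s\n' "$1" | sed -n 's/^[^(]*(\([^)]*\)).*$/\1/p' | tr ',' '\n'
}

# Hypothetical helper: report the kernel's view of each interface.
# A missing sysfs entry suggests the driver for that device is not loaded.
check_ifaces() {
    for ifc in $(lnet_ifaces "$1"); do
        if [ -r "/sys/class/net/$ifc/operstate" ]; then
            printf '%s: %s\n' "$ifc" "$(cat "/sys/class/net/$ifc/operstate")"
        else
            printf '%s: missing\n' "$ifc"
        fi
    done
}
```

Calling `check_ifaces` with the `networks=` value shown in the modprobe output above would print one line per interface, which makes it easy to see whether any of them are absent or not up.
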
Comment by Minh Diep [ 13/May/22 ]

gtapase please take a look

Comment by Andreas Dilger [ 19/May/22 ]

Gaurang, any update on this issue? This is one of the few remaining issues before the 2.15.0 release.

Comment by Gaurang Tapase [ 20/May/22 ]

I don't have access to a DGX A100 system, but I tried compiling Lustre 2.15 on an Ubuntu 20.04 system with MOFED 5.5 installed and it worked fine. I could load the lustre module as well.

Is this something specific to DGX A100?

Comment by Gaurang Tapase [ 20/May/22 ]

I just built the master branch of fs/lustre-release on the DGX A100 and could load the lustre module as well.

root@a100-01:/home/ddn/gtapase/exa-client# dpkg -l | grep lustre
rc  lustre-client-modules-5.14.0-1032-oem             2.14.0-ddn39-11-g767352e-1              amd64        Lustre Linux kernel module (kernel 5.14.0-1032-oem)
ii  lustre-client-modules-5.4.0-109-generic           2.15.50-13-gc524079-dirty-1             amd64        Lustre Linux kernel module (kernel 5.4.0-109-generic)
rc  lustre-client-modules-5.4.0-96-generic            2.15.0-RC3-2-g7905359-1                 amd64        Lustre Linux kernel module (kernel 5.4.0-96-generic)
ii  lustre-client-utils                               2.15.50-13-gc524079-dirty-1             amd64        Userspace utilities for the Lustre filesystem (client)
ii  lustre-dev                                        2.15.50-13-gc524079-dirty-1             amd64        Development files for the Lustre filesystem
rc  lustre-source                                     2.15.0-RC3-2-g7905359-1                 all          source for Lustre filesystem client kernel modules
root@a100-01:/home/ddn/gtapase/exa-client# lsmod | grep lustre 
lustre               1007616  0 
lmv                   212992  1 lustre 
mdc                   274432  1 lustre 
lov                   331776  2 mdc,lustre 
ptlrpc               1355776  7 fld,osc,fid,lov,mdc,lmv,lustre 
obdclass             3297280  8 fld,osc,fid,ptlrpc,lov,mdc,lmv,lustre 
lnet                  659456  6 osc,ko2iblnd,obdclass,ptlrpc,lmv,lustre 
libcfs                245760  11 fld,lnet,osc,fid,ko2iblnd,obdclass,ptlrpc,lov,mdc,lmv,lustre
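
The same check can be scripted rather than read off the lsmod output by eye. A minimal sketch, assuming a /proc/modules-style listing whose first field is the module name; the function name `missing_modules` is hypothetical and the list of modules to check is up to the caller.

```shell
#!/bin/sh
# Hypothetical helper: print each expected module name that does not appear
# as the first field of a /proc/modules-style listing.
# Usage: missing_modules LISTING_FILE MODULE...
missing_modules() {
    listing=$1
    shift
    for m in "$@"; do
        awk -v m="$m" '$1 == m { found = 1 } END { exit !found }' "$listing" \
            || printf '%s\n' "$m"
    done
}
```

For example, `missing_modules /proc/modules lustre lnet ko2iblnd` prints nothing when all three modules are loaded.
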

Comment by Oleg Kulachenko (Inactive) [ 20/May/22 ]

I'm trying to run the tests.
Now I get this error:

/usr/local/cuda-11.5/gds/tools/gdscheck.py -p
 cuInit Failed, error CUDA_ERROR_SYSTEM_NOT_READY
 cuFile initialization failed
 Platform verification error :
CUDA Driver API error 

But this is a GDS tools problem, not a Lustre one.

Comment by Oleg Kulachenko (Inactive) [ 25/May/22 ]

Updating CUDA to the latest version fixed it.

Generated at Sat Feb 10 03:21:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.