[LU-15831] Lustre 2.15 client breaks DGXA100 MOFED Created: 06/May/22 Updated: 25/May/22 Resolved: 25/May/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Kulachenko (Inactive) | Assignee: | Gaurang Tapase |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | debian, ubuntu | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
When trying to run GPUDirect, the following problem was found at the step that installs the required software:

$ sudo ./mlnxofedinstall
...
Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart

$ sudo /etc/init.d/openibd restart
Unloading ib_uverbs [FAILED]
rmmod: ERROR: Module ib_uverbs is in use by: nv_peer_mem

$ sudo rmmod nv_peer_mem
$ sudo /etc/init.d/openibd restart
Unloading HCA driver:[ OK ]
Loading Mellanox MLX5_IB HCA driver: [FAILED]
Loading Mellanox MLX5 HCA driver: [FAILED]
Loading HCA driver and Access Layer: [FAILED]
Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService

$ sudo modprobe nv_peer_mem
modprobe: FATAL: Module nv_peer_mem not found in directory /lib/modules/5.4.0-109-generic

$ sudo modprobe lustre
modprobe: ERROR: could not insert 'lustre': Invalid argument
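The first restart above fails because nv_peer_mem still holds ib_uverbs. As a minimal sketch (not part of the original report), the helper below lists the modules that use a given module by parsing `lsmod`-formatted text. The function name `module_users` and the sample excerpt are hypothetical, and the sizes and use counts in the sample are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the modules that use the named module, one per line.
# Reads lsmod-formatted text on stdin, so it also works on saved output.
module_users() {
    awk -v mod="$1" '
        $1 == mod && NF >= 4 {
            n = split($4, users, ",")
            for (i = 1; i <= n; i++) print users[i]
        }
    '
}

# Hypothetical lsmod excerpt (sizes and counts are illustrative only).
sample='Module                  Size  Used by
nv_peer_mem            16384  0
ib_uverbs             135168  1 nv_peer_mem
ib_core               315392  2 nv_peer_mem,ib_uverbs'

printf '%s\n' "$sample" | module_users ib_uverbs
```

On a live node the same check would be `lsmod | module_users ib_uverbs`. Any module it prints would need to be unloaded before openibd can unload ib_uverbs.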
$ sudo ./mlnxofedinstall
Checking SW Requirements...
Removing old packages...
Installing new packages
Installing ofed-scripts-5.5...
Installing mlnx-tools-5.2.0...
Installing mlnx-ofed-kernel-utils-5.5...
Installing mlnx-ofed-kernel-dkms-5.5...
Error: mlnx-ofed-kernel-dkms installation failed!
Problem: mlx5_ib: module file: /lib/modules/5.4.0-105-generic/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko, from package: linux-modules-extra-5.4.0-105-generic.
Collecting debug info...
See: /tmp/MLNX_OFED_LINUX.1302312.logs/mlnx-ofed-kernel-dkms.debinstall.log
Removing newly installed packages...

This prevents the GDS tests from running to completion:

=========================
Platform verification error : nvidia-fs driver is not loaded
SUCCESS
FILESYSTEM VERSION CHECK:
ofed_info:
current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
SUCCESS
nvidia-fs driver is not loaded
SUCCESS
usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
GPUDirectStorage platform checker
optional arguments:
-h, --help show this help message and exit
-p gds platform check
-f FILE gds file check
-v gds version checks
-V gds fs checks
SUCCESS
gdscheck.py python2 tests
=========================
Platform verification error : nvidia-fs driver is not loaded
SUCCESS
FILESYSTEM VERSION CHECK:
ofed_info:
current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
SUCCESS
nvidia-fs driver is not loaded
SUCCESS
usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
GPUDirectStorage platform checker
optional arguments:
-h, --help show this help message and exit
-p gds platform check
-f FILE gds file check
-v gds version checks
-V gds fs checks
SUCCESS
gdscheck.py current running python tests
=========================
Platform verification error : nvidia-fs driver is not loaded
SUCCESS
FILESYSTEM VERSION CHECK:
ofed_info:
current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
SUCCESS
nvidia-fs driver is not loadedSUCCESS usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]GPUDirectStorage platform checkeroptional arguments: -h, --help show this help message and exit -p gds platform check -f FILE gds file check -v gds version checks -V gds fs checks SUCCESS ************************************************** gdscheck.py test results : 12 / 12 tests passed ************************************************** Starting basic gdsio Tests /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 4k SUCCESS /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 3k SUCCESS /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 3k -o 1 SUCCESS /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 2k SUCCESS /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 2k -o 1 SUCCESS /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 1k SUCCESS /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0 -f /data/sanity/tests//sparse1G -d 0 -s 128K -i 1k -o 1 SUCCESS /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0 -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1 Verifying data SUCCESS /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0 -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1 Verifying data SUCCESS /usr/local/gds/tools/gdsio -V -D /data/sanity/tests// -d 0 -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1 Verifying data SUCCESS /usr/local/gds/tools/gdsio -V -D /data/sanity/tests// -d 0 -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1 Verifying data SUCCESS /usr/local/gds/tools/gdsio -D /data/sanity/tests// -d 0 -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 3 -k 1234 -o 1 -F -R SUCCESS 
/usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0 -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 1 -o 1 -b Verifying data SUCCESS /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0 -w 8 -s 1G -i 8M:32M -x 0 -I 1 -o 1 -b Verifying data SUCCESS /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0 -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 0 -o 1 -b SUCCESS /usr/local/gds/tools/gdsio -V -f /data/sanity/tests//sparse1G -d 0 -w 8 -s 1G -i 8M:32M -x 0 -I 0 -o 1 -b SUCCESS /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0 -w 8 -s 1G -i 32K:1024K:1K -x 0 -I 0 -o 0 -b SUCCESS /usr/local/gds/tools/gdsio -f /data/sanity/tests//sparse1G -d 0 -w 8 -s 1G -i 8M:32M -x 0 -I 0 -o 0 -b SUCCESS ************************************************** gdiso tests : 18 / 18 tests passed ************************************************** Starting Offset Tests TestCase:Read odd offset /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 616 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 616 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read odd gpu offsets 1, 2, 3, 4, 4K-1, 4K, 4K+1, 60K, 64K, 68K /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 0 -d 0 -t 1 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 0 -d 0 -t 1 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 1 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 2 -p 0 file register error: nvidia-fs 
driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 3 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 4 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 4095 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 4096 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 4097 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 4097 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 61440 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 65536 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 69632 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 1 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 2 -p 0 file register error: nvidia-fs driver is not loaded FAILED 
/usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 3 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 4 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 4095 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 4096 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 4097 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 61440 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 65536 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 69632 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read/write odd size - sync /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485761 -o 4096 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read/write odd size - async /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485761 -o 4096 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:odd offset and odd size - sync /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485748 -o 119 -d 0 
-t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:odd offset and odd size - async /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485748 -o 119 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read/write 1 byte from offset 0 (sync and async) /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 1 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 1 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read/write 1 byte from offset 3 (sync and async) /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 1 -o 3 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 1 -o 3 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read/write big file 10G (odd size) - sync /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 220201060 -o 4096 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read/write big file 10G (odd size) - async /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 220201060 -o 4096 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read beyond EOF (read 2G on a 1G file - async) /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714688 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read beyond EOF (read 2G on a 1G file - sync) /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714688 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read 
beyond EOF - odd size (read 2G on a 1G file - sync and async) /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714689 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714689 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read beyond EOF odd offset (read 2G on a 1G file - sync and async) /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1G -n 1 -m 1 -s 209714688 -o 616 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1G -n 1 -m 0 -s 209714688 -o 616 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read size beyond EOF (small file) /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1099 -o 1 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1099 -o 1 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1099 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1099 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read just short of EOF /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1000 -o 1 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1000 -o 1 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1000 -o 0 -d 0 -t 0 -p 0 file register error: 
nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1000 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read offset from EOF /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 999 -o 1024 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 999 -o 1024 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1 -o 1024 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1 -o 1024 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read offset beyond EOF (sync and async) /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 1 -s 1 -o 1025 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparse1K -n 1 -m 0 -s 1 -o 1025 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read with odd gpu_offset /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read with odd gpu_offset and odd file offset /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 617 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 617 -d 0 -t 0 -p 0 file 
register error: nvidia-fs driver is not loaded FAILED TestCase:Read at 128k GPU page offset /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Read beyond 64k (odd gpu offset) /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 10485760 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 10485760 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Overwrite an existing file within EOF /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 52428800 -o 1 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 52428800 -o 1 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 52428805 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 52428805 -o 0 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:Offset beyond EOF writes /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 4096 -o 4099 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 4099 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:offset just short of EOF 
writes /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 3 -o 4094 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 3 -o 4094 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED TestCase:offset from EOF writes /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 0 -s 4096 -o 3 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED /usr/local/gds/tools/gdsio_verify -f /data/sanity/tests//sparserandom_big -n 1 -m 1 -s 4096 -o 3 -d 0 -t 0 -p 0 file register error: nvidia-fs driver is not loaded FAILED ************************************************** File offset and GPU Buffer offset Tests : 0 / 73 tests passed ************************************************** running cufile sample tests sample 1 FAILED sample 2 opening file /data/sanity/tests//sparse1G_sample2 FAILED sample 3 FAILED sample 4 FAILED sample 5 FAILED sample 6 FAILED sample 7 FAILED sample 8 PASS: cufile success status:Success SUCCESS sample 14 opening file /data/sanity/tests//sparse1G FAILED sample 15 FAILED ************************************************** cufile sample tests : 1 / 10 tests passed ************************************************** Testing gdscp functionality /usr/local/gds/tools/gdscp /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_copy 0 -v file register error: nvidia-fs driver is not loaded FAILED ************************************************** gdscp tests : 0 / 1 tests passed ************************************************** Testing Batch State Machine /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 0 && pass || fail FAILED /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 1 && pass || fail FAILED /usr/local/gds/tools//tests/cufile_batch_test_state_machine 
/data/sanity/tests//sparse1G 2 && pass || fail FAILED /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 3 && pass || fail FAILED /usr/local/gds/tools//tests/cufile_batch_test_state_machine /data/sanity/tests//sparse1G 4 && pass || fail FAILED ************************************************** Batch State Machine Tests : 0 / 5 tests passed ************************************************** Performing cufile API tests /usr/local/gds/tools//api_tests/cufile_testbufregister 0 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 1 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 2 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 3 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 4 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 5 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 6 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 7 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 8 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 9 FAILED /usr/local/gds/tools//api_tests/cufile_testbufregister 10 FAILED /usr/local/gds/tools//api_tests/cufile_testbufderegister 0 FAILED /usr/local/gds/tools//api_tests/cufile_testbufderegister 1 FAILED /usr/local/gds/tools//api_tests/cufile_testbufderegister 2 FAILED /usr/local/gds/tools//api_tests/cufile_testbufderegister 3 FAILED /usr/local/gds/tools//api_tests/cufile_testbufderegister 4 FAILED /usr/local/gds/tools//api_tests/cufile_testbufderegister 5 FAILED /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 0 FAILED /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 1 FAILED /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 2 FAILED /usr/local/gds/tools//api_tests/cufile_testclosefd /data/sanity/tests//sparse1G 3 FAILED /usr/local/gds/tools//api_tests/cufile_testdriver 0 FAILED 
/usr/local/gds/tools//api_tests/cufile_testdriver 1 FAILED /usr/local/gds/tools//api_tests/cufile_testdriver 2 cufile driver close: nvidia-fs driver is not loaded SUCCESS /usr/local/gds/tools//api_tests/cufile_testopenfd /data/sanity/tests//sparse1G 0 FAILED /usr/local/gds/tools//api_tests/cufile_testopenfd /data/sanity/tests//sparse1G 1 SUCCESS /usr/local/gds/tools//api_tests/cufile_testopenfd /data/sanity/tests//sparse1G 2 SUCCESS /usr/local/gds/tools//api_tests/cufile_testopenfd /data/sanity/tests//sparse1G 3 SUCCESS /usr/local/gds/tools//api_tests/cufile_testopenfd /data/sanity/tests//sparse1G 4 SUCCESS /usr/local/gds/tools//api_tests/cufile_testopenfd /data/sanity/tests//sparse1G 5 SUCCESS /usr/local/gds/tools//api_tests/cufile_rw /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 FAILED /usr/local/gds/tools//api_tests/cufile_rwmanaged /data/sanity/tests//sparse1G 0 FAILED /usr/local/gds/tools//api_tests/cufile_rw_unreg /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 1 FAILED /usr/local/gds/tools//api_tests/cufile_rw_unreg /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 2 FAILED /usr/local/gds/tools//api_tests/cufile_rw_unreg /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G_VERIFY 0 3 FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 0 FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 1 FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 2 FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 3 FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 4 FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 5 FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 6 FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 7 
FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 8 FAILED /usr/local/gds/tools//api_tests/cufile_testopenflags /data/sanity/tests//sparse1G 9 FAILED /usr/local/gds/tools//api_tests/cufile_testdriverprops -p 8 FAILED /usr/local/gds/tools//api_tests/cufile_testdriverprops -b 8 SUCCESS /usr/local/gds/tools//api_tests/cufile_testdriverprops -d 8 SUCCESS /usr/local/gds/tools//api_tests/cufile_testdriverprops -c 8 FAILED /usr/local/gds/tools//api_tests/cufile_testdriverprops -b 1024 FAILED /usr/local/gds/tools//api_tests/cufile_testdriverprops -d 1024 FAILED /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 0 FAILED /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 1 FAILED /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 2 FAILED /usr/local/gds/tools//api_tests/cufile_io_race /data/sanity/tests//sparse1G FAILED /usr/local/gds/tools//api_tests/cufile_testvalidnvbuf /data/sanity/tests//sparse1G 0 FAILED /usr/local/gds/tools//api_tests/cufile_testvalidnvbuf /data/sanity/tests//sparse1G 0 FAILED /usr/local/gds/tools//api_tests/cufile_driver_close /data/sanity/tests//sparse1G /data/sanity/tests//sparse1G 3 FAILED /usr/local/gds/tools//api_tests/cufile_io_race /data/sanity/tests//sparse1G FAILED /usr/local/gds/tools//api_tests/cufile_invalid_write /data/sanity/tests//sparse1G 0 0 SUCCESS /usr/local/gds/tools//api_tests/cufile_invalid_write /data/sanity/tests//sparse1G 0 1 SUCCESS /usr/local/gds/tools//api_tests/cufile_invalid_offsets /data/sanity/tests//sparse1G 0 0 FAILED /usr/local/gds/tools//api_tests/cufile_testcudacontext_switch /data/sanity/tests//sparse_CTX_VERIFY 0 FAILED End: nvidia-fs: GDS Version: 1.2.1.4 NVFS statistics(ver: 4.0) NVFS Driver(version: 2.11.0) Mellanox PeerDirect Supported: True IO stats: Disabled, peer IO stats: Disabled Logging 
level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap : n=0 ok=0 err=0 munmap=0
Bar1-map : n=0 ok=0 err=0 free=0 callbacks=0 active=0
Error : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops : Read=0 Write=0 BatchIO=0
**************************************************
API Tests, : 10 / 63 tests passed
**************************************************
Testsuite : 41 / 182 tests passed
done tests:Mon May 2 16:48:42 UTC 2022

It seems that the patch https://review.whamcloud.com/#/c/45327/ needs to be applied to master. The result is the same with both MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64 and MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64.
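The per-suite totals in the log above can be cross-checked against the final Testsuite line. The sketch below is an illustration, not part of the GDS tooling: the function name `sum_suites` is hypothetical, and the input is just the seven summary lines copied from this report plus the final total.

```shell
#!/usr/bin/env bash
# Sum every "<passed> / <total> tests passed" line read from stdin,
# skipping the final "Testsuite" line so it is not counted twice.
sum_suites() {
    awk '
        /tests passed/ && !/^Testsuite/ {
            for (i = 2; i < NF; i++) {
                if ($i == "/") { passed += $(i - 1); total += $(i + 1) }
            }
        }
        END { printf "%d / %d\n", passed, total }
    '
}

# Summary lines copied from the test log in this report.
summary='gdscheck.py test results : 12 / 12 tests passed
gdiso tests : 18 / 18 tests passed
File offset and GPU Buffer offset Tests : 0 / 73 tests passed
cufile sample tests : 1 / 10 tests passed
gdscp tests : 0 / 1 tests passed
Batch State Machine Tests : 0 / 5 tests passed
API Tests, : 10 / 63 tests passed
Testsuite : 41 / 182 tests passed'

printf '%s\n' "$summary" | sum_suites   # prints "41 / 182"
```

The sum matches the reported total. The only suites that pass fully are gdscheck.py and gdsio; most of the remaining failures reported in the log are "nvidia-fs driver is not loaded" errors.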
|
| Comments |
| Comment by Peter Jones [ 06/May/22 ] |
|
Minh, can you please advise? Thanks, Peter |
| Comment by Minh Diep [ 06/May/22 ] |
|
this is caused by |
| Comment by Peter Jones [ 06/May/22 ] |
|
Ok. Can we revert that change for 2.15.0 and then take longer to assess how to support that change without introducing a regression for other usage? |
| Comment by Minh Diep [ 06/May/22 ] |
|
revert patch https://review.whamcloud.com/47238 |
| Comment by Peter Jones [ 09/May/22 ] |
|
Landed for 2.15 |
| Comment by Oleg Kulachenko (Inactive) [ 11/May/22 ] |
|
After reinstalling Lustre, mlnxofedinstall works without errors. Logs: mlnx_logs.txt

But:

ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64$ sudo lustre_rmmod
[sudo] password for ddn:
ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64$ sudo /etc/init.d/openibd restart
Unloading HCA driver: [ OK ]
Loading Mellanox MLX5_IB HCA driver: [FAILED]
Loading Mellanox MLX5 HCA driver: [FAILED]
Loading HCA driver and Access Layer: [FAILED]
Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService

I think we have found a new issue. We will re-image the A100 and run a firmware upgrade. Most likely the new issue is not in Lustre but in the firmware.
|
| Comment by Oleg Kulachenko (Inactive) [ 12/May/22 ] |
|
mdiep It seems that the fix did not help.

$ lctl get_param version
version=2.15.0_RC3_2_g7905359

ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64$ sudo ./mlnxofedinstall --add-kernel-support --distro ubuntu20.04
Note: This program will create MLNX_OFED_LINUX TGZ for ubuntu20.04 under /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic directory.
See log file /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/mlnx_iso.3333249_logs/mlnx_ofed_iso.3333249.log
Checking if all needed packages are installed...
Building MLNX_OFED_LINUX DEBS . Please wait...
Creating metadata-rpms for 5.4.0-109-generic ...
WARNING: If you are going to configure this package as a repository, then please note
WARNING: that it is not signed, therefore, you need to set 'trusted=yes' in the sources.list file.
WARNING: Example: deb [trusted=yes] file:/<path to MLNX_OFED DEBS folder> ./
Created /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-ext.tgz
Removing old packages...
Uninstalling the previous version of MLNX_OFED_LINUX Installing /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-ext /tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-5.4.0-109-generic/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-ext/mlnxofedinstall --force --without-dkms --distro ubuntu20.04 Logs dir: /tmp/MLNX_OFED_LINUX.3838794.logs General log file: /tmp/MLNX_OFED_LINUX.3838794.logs/general.logBelow is the list of MLNX_OFED_LINUX packages that you have chosen (some may have been added by the installer due to package dependencies):ofed-scripts mlnx-tools mlnx-ofed-kernel-utils mlnx-ofed-kernel-modules iser-modules isert-modules srp-modules rdma-core libibverbs1 ibverbs-utils ibverbs-providers libibverbs-dev libibverbs1-dbg libibumad3 libibumad-dev ibacm librdmacm1 rdmacm-utils librdmacm-dev mstflint ibdump libibmad5 libibmad-dev libopensm opensm opensm-doc libopensm-devel libibnetdisc5 infiniband-diags mft kernel-mft-modules perftest ibutils2 ar-mgr dump-pr ibsim ibsim-doc ucx sharp hcoll openmpi mpitests knem-modules libdapl2 dapl2-utils libdapl-dev dpcp srptools mlnx-ethtool mlnx-iproute2 rshimThis program will install the MLNX_OFED_LINUX package on your machine. Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed. Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.Checking SW Requirements... Removing old packages... Installing new packages Installing ofed-scripts-5.5... Installing mlnx-tools-5.2.0... Installing mlnx-ofed-kernel-utils-5.5... Installing mlnx-ofed-kernel-modules-5.5... Installing iser-modules-5.5... Installing isert-modules-5.5... Installing srp-modules-5.5... Installing rdma-core-55mlnx37... Installing libibverbs1-55mlnx37... Installing ibverbs-utils-55mlnx37... Installing ibverbs-providers-55mlnx37... Installing libibverbs-dev-55mlnx37... Installing libibverbs1-dbg-55mlnx37... Installing libibumad3-55mlnx37... Installing libibumad-dev-55mlnx37... 
Installing ibacm-55mlnx37...
Installing librdmacm1-55mlnx37...
Installing rdmacm-utils-55mlnx37...
Installing librdmacm-dev-55mlnx37...
Installing mstflint-4.16.0...
Installing ibdump-6.0.0...
Installing libibmad5-55mlnx37...
Installing libibmad-dev-55mlnx37...
Installing libopensm-5.10.0.MLNX20211115.e645cc83...
Installing opensm-5.10.0.MLNX20211115.e645cc83...
Installing opensm-doc-5.10.0.MLNX20211115.e645cc83...
Installing libopensm-devel-5.10.0.MLNX20211115.e645cc83...
Installing libibnetdisc5-55mlnx37...
Installing infiniband-diags-55mlnx37...
Installing mft-4.18.0...
Installing kernel-mft-modules-4.18.0...
Installing perftest-4.5...
Installing ibutils2-2.1.1...
Installing ar-mgr-1.0...
Installing dump-pr-1.0...
Installing ibsim-0.10...
Installing ibsim-doc-0.10...
Installing ucx-1.12.0...
Installing sharp-2.6.1.MLNX20211124.aac4a56...
Installing hcoll-4.7.3202...
Installing openmpi-4.1.2rc2...
Installing mpitests-3.2.20...
Installing knem-modules-1.1.4.90mlnx1...
Installing libdapl2-2.1.10.1.mlnx...
Installing dapl2-utils-2.1.10.1.mlnx...
Installing libdapl-dev-2.1.10.1.mlnx...
Installing dpcp-1.1.17...
Installing srptools-55mlnx37...
Installing mlnx-ethtool-5.13...
Installing mlnx-iproute2-5.14.0...
Installing rshim-2.0.6...
Selecting previously unselected package mlnx-fw-updater.
(Reading database ... 224924 files and directories currently installed.)
Preparing to unpack .../mlnx-fw-updater_5.5-1.0.3.2_amd64.deb ...
Unpacking mlnx-fw-updater (5.5-1.0.3.2) ...
Setting up mlnx-fw-updater (5.5-1.0.3.2) ...
Added 'RUN_FW_UPDATER_ONBOOT=no to /etc/infiniband/openib.conf
Initializing...
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...
Querying Mellanox devices firmware ...
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000223
PCI Device Name: 0c:00.0
Base GUID: 0c42a10300555aaa
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/lJcLV6m0FI
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000223
PCI Device Name: 12:00.0
Base GUID: 0c42a10300555dee
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/paBqhmZPH7
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000223
PCI Device Name: 4b:00.0
Base GUID: 043f720300f55646
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/ovxoNfgM6c
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000223
PCI Device Name: 54:00.0
Base GUID: 0c42a10300555dbe
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/FZHvz2eu6S
Querying Mellanox devices firmware ...
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653106A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000225
PCI Device Name: 61:00.0
Base MAC: 1c34da6c9046
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/J6qmeEZp3k
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000223
PCI Device Name: 8d:00.0
Base GUID: 0c42a10300555d62
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/i3Ih6BBFkg
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000223
PCI Device Name: 94:00.0
Base GUID: 0c42a10300555afe
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/ITvKfCj1dJ
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000223
PCI Device Name: ba:00.0
Base GUID: 0c42a10300555af6
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/Q8rB0zMS9E
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000223
PCI Device Name: cc:00.0
Base GUID: 0c42a10300555e02
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/7EZbD0vG0H
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653106A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000225
PCI Device Name: e1:00.0
Base MAC: 0c42a11b7dee
Versions: Current Available
FW 20.33.1048 20.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.26.0017 14.25.0017
Status: Up to date
Log File: /tmp/gQkEPSV1Mp
Real log file: /tmp/MLNX_OFED_LINUX.3838794.logs/fw_update.log
Device (0c:00.0): 0c:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (12:00.0): 12:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (4b:00.0): 4b:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (54:00.0): 54:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (61:00.0): 61:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (61:00.1): 61:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (8d:00.0): 8d:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (94:00.0): 94:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (ba:00.0): ba:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (cc:00.0): cc:00.0 Infiniband controller: Mellanox Technologies
MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (e1:00.0): e1:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Device (e1:00.1): e1:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6] Link Width: x16 PCI Link Speed: 16GT/s
Installation passed successfully
To load the new driver, run: /etc/init.d/openibd restart
ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64$ sudo /etc/init.d/openibd restart
Unloading HCA driver: [ OK ]
Loading Mellanox MLX5_IB HCA driver: [FAILED]
Loading Mellanox MLX5 HCA driver: [FAILED]
Loading HCA driver and Access Layer: [FAILED]
Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService
ddn@a100-01:~/okulachenko/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64$ sudo modprobe -v lustre
insmod /lib/modules/5.4.0-109-generic/updates/kernel/net/libcfs.ko cpu_npartitions=32
insmod /lib/modules/5.4.0-109-generic/updates/kernel/net/lnet.ko networks="o2ib(ibp12s0,ibp18s0,enp225s0f1,ibp75s0,ibp84s0,enp97s0f1,ibp141s0,ibp148s0,ibp186s0,ibp204s0)" lnet_transaction_timeout=100 lnet_retry_count=2
insmod /lib/modules/5.4.0-109-generic/updates/kernel/fs/obdclass.ko
insmod /lib/modules/5.4.0-109-generic/updates/kernel/fs/ptlrpc.ko
modprobe: ERROR: could not insert 'lustre': Network is down |
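Editorial note: when `openibd restart` reports `[FAILED]` as above, one thing that may be worth checking is which copy of each module the kernel would load. The logs in this ticket show both locations: the distro `mlx5_ib.ko` under `kernel/`, and the Lustre modules under `updates/`. The sketch below is only an illustration; `module_origin` is a hypothetical helper name, and the rule that out-of-tree builds land under `updates/` or `extra/` is an assumption that holds for the paths seen here but may differ on other setups.

```shell
#!/bin/sh
# Hypothetical helper: classify a kernel module path by its location under
# /lib/modules/<kernel-version>/.
# Assumption: out-of-tree builds (e.g. MOFED, Lustre) are installed under
# updates/ or extra/, while distro-packaged modules live under kernel/.
module_origin() {
    case "$1" in
        # Checked first, because paths such as updates/kernel/net/... also
        # contain a /kernel/ component.
        */updates/*|*/extra/*) echo "out-of-tree" ;;
        */kernel/*)            echo "in-tree" ;;
        *)                     echo "unknown" ;;
    esac
}

# Paths copied from the logs in this ticket:
module_origin /lib/modules/5.4.0-105-generic/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
# prints: in-tree
module_origin /lib/modules/5.4.0-109-generic/updates/kernel/net/libcfs.ko
# prints: out-of-tree
```

On a live system the path could come from `modinfo -n`, for example `module_origin "$(modinfo -n mlx5_ib)"`. An "in-tree" answer for `mlx5_ib` after a MOFED install would suggest the MOFED modules are not the ones being picked up.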
| Comment by Minh Diep [ 13/May/22 ] |
|
gtapase please take a look |
| Comment by Andreas Dilger [ 19/May/22 ] |
|
Gaurang, any update on this issue? This is one of the few remaining issues before the 2.15.0 release. |
| Comment by Gaurang Tapase [ 20/May/22 ] |
|
I don't have access to a DGX A100 system, but I tried compiling Lustre 2.15 on an Ubuntu 20.04 system with MOFED 5.5 installed and it worked fine. I could load the lustre module as well. Is this something specific to DGX A100? |
| Comment by Gaurang Tapase [ 20/May/22 ] |
|
Just built the master branch of fs/lustre-release on DGX A100 and could load the lustre module as well.
root@a100-01:/home/ddn/gtapase/exa-client# dpkg -l | grep lustre
rc lustre-client-modules-5.14.0-1032-oem 2.14.0-ddn39-11-g767352e-1 amd64 Lustre Linux kernel module (kernel 5.14.0-1032-oem)
ii lustre-client-modules-5.4.0-109-generic 2.15.50-13-gc524079-dirty-1 amd64 Lustre Linux kernel module (kernel 5.4.0-109-generic)
rc lustre-client-modules-5.4.0-96-generic 2.15.0-RC3-2-g7905359-1 amd64 Lustre Linux kernel module (kernel 5.4.0-96-generic)
ii lustre-client-utils 2.15.50-13-gc524079-dirty-1 amd64 Userspace utilities for the Lustre filesystem (client)
ii lustre-dev 2.15.50-13-gc524079-dirty-1 amd64 Development files for the Lustre filesystem
rc lustre-source 2.15.0-RC3-2-g7905359-1 all source for Lustre filesystem client kernel modules
root@a100-01:/home/ddn/gtapase/exa-client# lsmod | grep lustre
lustre 1007616 0
lmv 212992 1 lustre
mdc 274432 1 lustre
lov 331776 2 mdc,lustre
ptlrpc 1355776 7 fld,osc,fid,lov,mdc,lmv,lustre
obdclass 3297280 8 fld,osc,fid,ptlrpc,lov,mdc,lmv,lustre
lnet 659456 6 osc,ko2iblnd,obdclass,ptlrpc,lmv,lustre
libcfs 245760 11 fld,lnet,osc,fid,ko2iblnd,obdclass,ptlrpc,lov,mdc,lmv,lustre
|
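Editorial note: in the `dpkg -l` listing above, only rows whose first column is `ii` are fully installed; `rc` rows are removed packages with leftover configuration files. A minimal sketch for pulling out just the installed Lustre module packages follows. `installed_lustre_modules` is a hypothetical helper name, and the sample rows are shortened copies of the ones in the comment above.

```shell
#!/bin/sh
# Hypothetical helper: read `dpkg -l` output on stdin and print the name and
# version of every lustre-client-modules package in state "ii" (installed).
installed_lustre_modules() {
    awk '$1 == "ii" && $2 ~ /^lustre-client-modules-/ { print $2, $3 }'
}

# Sample rows taken from the comment above (descriptions shortened):
printf '%s\n' \
    'rc lustre-client-modules-5.14.0-1032-oem 2.14.0-ddn39-11-g767352e-1 amd64 Lustre Linux kernel module' \
    'ii lustre-client-modules-5.4.0-109-generic 2.15.50-13-gc524079-dirty-1 amd64 Lustre Linux kernel module' \
    'rc lustre-client-modules-5.4.0-96-generic 2.15.0-RC3-2-g7905359-1 amd64 Lustre Linux kernel module' \
    | installed_lustre_modules
# prints: lustre-client-modules-5.4.0-109-generic 2.15.50-13-gc524079-dirty-1
```

On a real host the input would be `dpkg -l | installed_lustre_modules`, and the kernel version embedded in the package name can then be compared against `uname -r`.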
| Comment by Oleg Kulachenko (Inactive) [ 20/May/22 ] |
|
I'm trying to run tests.
/usr/local/cuda-11.5/gds/tools/gdscheck.py -p
cuInit Failed, error CUDA_ERROR_SYSTEM_NOT_READY
cuFile initialization failed
Platform verification error : CUDA Driver API error
But these are gds tools problems, not Lustre |
| Comment by Oleg Kulachenko (Inactive) [ 25/May/22 ] |
|
Updating cuda to the latest version fixed it. |