[LU-14798] NVIDIA GPUDirect Storage Support Created: 29/Jun/21  Updated: 26/Nov/21  Resolved: 10/Aug/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: New Feature Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: PNG File GDS-10GBFIOread-FIO.png     PNG File GDS-10GBFileWrite-FIO.png     PNG File GDS-1GBFileRead-FIO.png     PNG File GDS-1GBFileWrite-FIO.png    
Issue Links:
Related
is related to LU-14795 NVidia GDS support in lustre Closed
Rank (Obsolete): 9223372036854775807

 Description   

Now that NVIDIA has made the official release of GPUDirect Storage, we are able to release the GDS feature integration for Lustre that has been under development and testing in conjunction with NVIDIA for some time.

This feature provides the following:

  1. Use direct bulk IO with GPU workloads
  2. Select the interface nearest the GPU for optimal performance
  3. Integrate GPU selection criteria into the LNet multi-rail selection algorithm
  4. Handle IO smaller than 4K in a manner that works with the GPU Direct workflow
  5. Use the memory registration/deregistration mechanism provided by the nvidia-fs driver

Performance comparison between GPU and CPU workloads attached. Bandwidth in GB/s.
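
An illustrative way to exercise the feature on a client is with the gdscheck/gdsio tools shipped with CUDA; this is a sketch only, not the FIO setup used for the attached graphs, and the paths, mount point and -x transfer-mode values are assumptions:

GDS_TOOLS=/usr/local/cuda/gds/tools
lsmod | grep -q nvidia_fs || echo "nvidia-fs.ko is not loaded"
$GDS_TOOLS/gdscheck -p                      # summary of driver and filesystem support
# 32 threads, 1 GiB file, 1 MiB IOs, 120 s reads: GPU Direct vs. (assumed) CPU-only mode
$GDS_TOOLS/gdsio -f /lustre/testfile -d 0 -w 32 -s 1G -i 1M -x 0 -I 0 -T 120
$GDS_TOOLS/gdsio -f /lustre/testfile -d 0 -w 32 -s 1G -i 1M -x 1 -I 0 -T 120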

 



 Comments   
Comment by Patrick Farrell [ 29/Jun/21 ]

Amir, do you have non-GDS graphs available for the same hardware/configuration?

Comment by Gerrit Updater [ 29/Jun/21 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44109
Subject: LU-14798 lnet: RMDA infrastructure updates
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 00faa6f56ed24e15810a97ee1d2dd56e124ceaaa

Comment by Gerrit Updater [ 29/Jun/21 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44110
Subject: LU-14798 lnet: add LNet GPU Direct Support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 181a0e8cae63a9e41e8109807a658df77eba71f4

Comment by Gerrit Updater [ 29/Jun/21 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44111
Subject: LU-14798 lustre: Support RDMA only pages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ef3b31675623e87af7f32f0424514bae599573fa

Comment by James A Simmons [ 29/Jun/21 ]

I see Cray is also interested in this work. This approach has several problems. One is: how does this work with a CUDA stack, which is heavily used by our user applications? The second problem is that it is specific to one driver, and that driver is not even upstreamed, which creates a large barrier for acceptance. Lastly, the main problem with our o2iblnd driver is that it does way too much low-level handling, which has made the InfiniBand maintainers unhappy. Most of the IB-type drivers, iser for example, have been moved to the generic RW API, which hides all the RDMA handling internally. Honestly this should be done working with the DRI (TTM) and InfiniBand developers upstream. It would also give the added benefit of things like SRP using this feature as well.

Comment by James A Simmons [ 29/Jun/21 ]

I remembered this being an issue with NVMe devices, and yes, there is a generic solution that landed upstream with commit 52916982af48d9f9fc01ad825259de1eb3a9b25e. Additionally, I looked at the IB generic RW API and it's aware of this API.

I believe we can do this in a way that will benefit many people on many different platforms.

Comment by Andreas Dilger [ 30/Jun/21 ]

James,
these patches were developed in close conjunction with the GDS team at NVIDIA and they have tested these patches on their systems. It is explicitly designed by NVIDIA to integrate with CUDA "Magnum IO", and the goal is to improve GPU application performance for those applications using the CUDA IO interfaces by avoiding CPU usage (memory and PCI bus). An overview is available at https://developer.nvidia.com/gpudirect-storage with links to more information there if you want.

I agree with you that there may be other cleanups possible in o2iblnd, but I don't think that is relevant to these patches. This patch adds a handful of lines to o2iblnd to see whether the pages are mapped to GPU memory or not. This code is a no-op if the application is not using the CUDA IO driver at runtime to actively submit IO through the GDS interface.

I also agree that there may be additional ways to implement this functionality in the future, but I think the core interfaces would be the same - a check to RDMA pages into GPU memory when the IO request is coming from the GPU. I don't think it makes sense to prevent this implementation from landing today, when it is something that is actively used/developed already, in the hope that someone might implement a different interface in the future. Even then, CUDA IO applications are using the NVFS interface already today, and whether other applications or libraries (e.g. MPIIO) use a different interface (e.g. io_uring with GPU support, or whatever) is something that can be added separately in the future.

Comment by Amir Shehata (Inactive) [ 30/Jun/21 ]

paf0186, the graphs attached show a comparison between the GPU load (using GDS) and the standard CPU load not going through the GPU Direct workflow.

Comment by James A Simmons [ 30/Jun/21 ]

I asked in the other ticket, but I don't think you are on that ticket. One of the pushes for ko2iblnd upstream has been the request to use the generic RDMA API implemented by Christoph Hellwig. The problem is that it removes doing the SG mapping yourself, which conflicts with this work. So should we drop the potential work to move ko2iblnd to that API? If that is the case then I will just update the ko2iblnd driver for submission to Linus with the move to the generic RDMA API. What is your suggestion?

Note that the use of the NVIDIA driver is a future concern. Like many sites, we will only deploy supported and distributed packages that come with RedHat. We can't risk deploying anything unless the vendor is willing to address issues in the middle of the night.

Comment by Patrick Farrell [ 30/Jun/21 ]

How can you make use of this functionality at all without the NVIDIA driver?  Specifically on NVIDIA hardware - not in a different context.

Comment by Andreas Dilger [ 01/Jul/21 ]

Patrick, I'm not sure of the direction of your question, but the answer is that you can't use this without the nvfs interface today. The nvfs driver is itself a kernel module and injects O_DIRECT IO similar to io_uring, though I don't know the exact details. The userspace interface is integrated into the CUDA application library similar to applications using MPIIO/ADIO, and is intended to completely avoid pulling data via the CPU PCI/RAM from the filesystem before applications can start processing it.

While I agree that this is very NVIDIA specific, I don't think it is a significant burden on the code, and is definitely a feature that many sites are interested in. If it doesn't go into the upstream kernel, then it won't be the end of the world, it just means that users that use this NVIDIA GDS interface will not use the vanilla kernel client, no different than they use MOFED instead of OFED for networking, or Cray MPI instead of another version.
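
As an illustrative aside, whether the nvfs path is actually being exercised on a client can be sanity-checked at runtime; the module name and /proc path below are the ones NVIDIA documents for nvidia-fs, so treat them as assumptions if your packaging differs:

lsmod | grep nvidia_fs              # is the nvidia-fs module loaded at all?
modinfo nvidia_fs | head -3         # which nvidia-fs version is installed
cat /proc/driver/nvidia-fs/stats    # runtime counters; non-zero GDS read/write ops
                                    # mean IO really took the GPU Direct path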

Comment by Patrick Farrell [ 01/Jul/21 ]

Oh, sure - it was directed at James’s comment about the use of the NVIDIA driver being potentially problematic.

Comment by Cory Spitz [ 02/Jul/21 ]

LU-14795 "pre-dates" this ticket.

Comment by Alexey Lyashkov [ 03/Jul/21 ]

James,
>I see Cray also is interest in this work. This approach has several problems. One is how does this work with a CUDA stack which is heavily used by our user applications.

I have a modified gdrcopy version which is able to do zero-copy without any Lustre modification, but the cuFile API (it's a GPUfs successor - https://sites.google.com/site/silbersteinmark/Home/gpufs) is locked to this API.
This code is very close to our comment in Gerrit:
>>
Take a look at https://www.kernel.org/doc/html/latest/driver-api/pci/p2pdma.html.

This is the correct approach. Yes, we would have to find the struct device for the GPU (NVIDIA, AMD, etc.) and then export its PCI BAR for use. Plus, this approach is supported by the IB generic RDMA.
>>

But this code won't work with the cuFile API. Probably we can/should rewrite the nvidia-fs module to do both.

Comment by James A Simmons [ 06/Jul/21 ]

@Alexey nvidia-fs would have to do both since p2pdma is only available after 5.0 kernels.

@Farrell There are two issues with using the NVIDIA driver.

The first is that many sites like ours will not run software without some kind of support from either the distribution (RedHat) or the vendor (NVIDIA). Will NVIDIA engineers help us address problems with their drivers at 3 AM? Also the software needs to be packaged nicely for easy image management and have regular security updates.

The second issue deals with the long-term state of the ko2iblnd driver itself. The InfiniBand layer undergoes many changes every few kernel release cycles. When Lustre was in staging, the InfiniBand maintainers hated the burden of updating our driver. This was an issue for other IB components as well, so they developed a generic RDMA API to be used, and the IB maintainers requested that we move our driver to this API. Now, this API does all the SG mappings internally in the IB core kernel code, so it conflicts with this work. Thankfully, due to NVMe, the IB generic RDMA API has already adopted p2pdma, so it would just be a matter of updating the Linux DRM drivers to support this. I agree with Alexey that it might be a good idea to update the nvidia-fs driver to support the p2pdma API.
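
For reference, a quick, hedged check of whether a given kernel was even built with the upstream p2pdma support discussed here (the sysfs layout is an assumption that depends on kernel version):

grep CONFIG_PCI_P2PDMA /boot/config-$(uname -r)   # upstream peer-to-peer PCI DMA option
ls -d /sys/bus/pci/devices/*/p2pmem 2>/dev/null   # devices that registered p2p memory
                                                  # (if any) show a p2pmem directory here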

Comment by Alexey Lyashkov [ 06/Jul/21 ]

James, I looked into p2p DMA and it doesn't fit this case.
p2p DMA is between GPUs, or something similar. In the IB <> GPU case, DMA-buf (or something similar) is the good fit, because the IB card doesn't accept an arbitrary DMA stream; the IB card interprets its own IB work requests and can start PCIe transactions in bus-master mode.
In this case, the GPU software only prepares a buffer, and the physical addresses obtained are used to fill the IB WR structs. Very similar to dma-buf allocation.

But you are right, GPU zero-copy is possible without these Lustre changes. I hope you know about https://github.com/NVIDIA/gdrcopy, which is used in some MPI implementations for GPU memory exchange.
This module can be adjusted to be DIO compatible (~200 LOC), but that requires changing the license to GPL, as some GPL symbols are needed.

As for ko2iblnd: ko2iblnd is an IB ULP and there are no large changes in this layer, just new features, and most of them are Mellanox-only.

Comment by Andreas Dilger [ 06/Jul/21 ]

James wrote:

Will NVIDIA engineers help us address problems with their drivers at 3 AM?

That's obviously a question between you and NVIDIA.

Since the NVIDIA driver is part of CUDA and (probably) you are already using MOFED instead of OFED, I don't see that as a new support issue. The landing of these patches does not obligate you, or anyone, to use these interfaces. This change is a total no-op until the nvidia-fs.ko module is loaded and the application is calling specific GDS IO APIs that submit IO directly from the GPU pages. I don't see it as being any different than the Lustre tree containing the Cray GNI LND, which cannot be tested without the appropriate extra software/hardware.

The second issue deals with the long term state of the ko2iblnd driver itself.

Sure, but we also need to maintain compatibility with older kernels and MOFED, so even if there are changes to ko2iblnd for newer kernels, there would need to be compatibility with older OFED/MOFED for some number of years. I don't see this is a problem that is affecting the current patches. At worst, the GDS support would need a configure check if the old interfaces were removed, or it would need to be updated to work with the new APIs. That is an issue to be addressed when those patches arrive.

I agree with Alexey it might be a good idea to update the nvidia-fs driver to support the p2pdma API.

That is something that NVIDIA would have to do themselves; I don't control the nvidia-fs.ko driver. In the GPUDirect Storage Overview from NVIDIA, which discusses GDS + CUDA integration, they say:

There are efforts in the Linux community to add native support for DMA among peer devices, which can include NICs and GPUs. After this support is upstreamed, it will take time for all users to adopt the new Linux versions via distributions. Until then, NVIDIA will work with third-party vendors to enable GDS.

That said, there are customer sites that are interested in GDS today, as evidenced by both DDN and HPE submitting patches to add this interface. When/if nvfs/GDS changes to use the new P2PDMA API then we can deprecate the current interface over time, but in the meantime it would be good to land this patch into the release.

See also "GPU Direct IO with HDF5" from the HDF Group for other HPC projects that are adding GDS support and were tested with these patches. This is really a major win for GPU applications, or I wouldn't be spending my time discussing this with you. This isn't just something that we've made up in our spare time; this is what GPU users want, so that they don't burn CPU cycles pushing data through the CPU RAM and PCI lanes. I think one of the major wins for applications is that this can be enabled transparently for GPU apps using the cuFile* APIs, HDF5, etc.

If there is no interest in landing these patches (and I'd think you should discuss this with users at ORNL, and not just with your "comply with upstream kernel style" hat on), then we can patch this only into the EXAScaler releases and be done with it, but I think that would be doing a disservice to the broader Lustre community, since it would put Lustre at a disadvantage to all of the proprietary filesystems that are happy to support GDS, and where Lustre would actually have an advantage today because these GDS patches support full RDMA read/write into GPU pages.

Comment by James A Simmons [ 06/Jul/21 ]

@Andreas. You are correct that we use the Cray GNI LND driver, which has support behind it from Cray, so if we do have issues we can work with engineers to resolve problems. ORNL tends to be more conservative in what it deploys due to the fact that at our scale we see problems others don't. ORNL can consider GDS when we have fully tested it at scale and can work in partnership with NVIDIA engineers to resolve any at-scale bugs. MOFED is the same way; in fact we have had Mellanox engineers on site before. I was just answering Patrick's question about our own hesitation to deploy this new feature.

I'm not standing in the way of landing this work. I understand that as an organization your focus is Lustre and what exists today. ORNL is looking to expand into all HPC-related areas of the Linux kernel for forward-facing work. I do have the flexibility to rework the GDS driver and submit changes. I can also approach the DRM/TTM people to create GDS-like support on platforms other than NVIDIA. I'm thinking of the longer-term road map that is good for everyone. For example, we would like to be able to use something like GDS for SRP as well. That is not of much interest to Lustre, but it is to us.

 

Comment by Alexey Lyashkov [ 07/Jul/21 ]

@James,

Can you ask the NVIDIA guys whether they are ready to "fix" the NVIDIA driver to register GPU memory as system memory (it's part of the DMA-buf or p2p DMA process)? They could add this to the driver (it's easy - around 100-200 LOC - I can share a code example). This would avoid the requirement to patch Lustre, as the current problem is "NVIDIA doesn't register PCIe device memory as system memory, which means this memory can't be the subject of DIO".
Performance questions can be solved in a generic way via PCIe distance handling.

@Andreas,
Can you trust me if I say that GDS can work without any Lustre patches? It's true. I'm not clear about the legal part of it, but from a technical view this is possible. The current nvidia-fs module has a clean IOCTL API plus the nv_peer module for the MOFED <> GPU connection.

Comment by Andreas Dilger [ 07/Jul/21 ]

Sure, I believe that this is possible to change in the future. However, NVIDIA just spent more than a year developing and testing GDS in the current form before releasing it, so they are not going to change it quickly because we ask them. There are many other companies involved using GDS, so they would all have to change their interface as well. I think it is more likely that this may be changed in a year or two in a "GDS 2.0" release. In the meantime, I think we have to accept that GDS works the way it does today, and if/when it changes in the future we would update to match that interface when it becomes available.

Comment by Alexey Lyashkov [ 07/Jul/21 ]

I'm confused. I spent two weeks adding a 200 LOC patch to gdrcopy to get DIO + zero-copy with the GPU.
But I can't publish it because it requires changing the gdrcopy license. It is MIT now, but this implementation needs to be GPL-ed.
That implementation doesn't need any change to the GDS interface for CUDA programs, just a rewrite of some parts inside the nvidia-fs module.
The main problem is that the NVIDIA driver doesn't register PCIe memory as a system resource. Once this memory is registered and a "struct page" exists, it's easy to export it for use with userspace, like the dma_buf API.
And it requires no Lustre changes.

Comment by Andreas Dilger [ 07/Jul/21 ]

Sure, that is great. But unless NVIDIA accepts your changes and ships them out, everyone other than you will be using their version of the nvidia-fs.ko module, which needs these Lustre changes to work.

Comment by Gerrit Updater [ 08/Jul/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/44109/
Subject: LU-14798 lnet: RMDA infrastructure updates
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7ac839837c1c6cd1ff629c7898349db8cc55891d

Comment by Alexey Lyashkov [ 08/Jul/21 ]

I ran smoke performance testing. My results show the WC implementation is 5% slower than Cray's for 1G stream IO and 23% slower for 16k IO.

Comment by Patrick Farrell [ 08/Jul/21 ]

Can you share more details of the tests so someone can try to reproduce?

Comment by Alexey Lyashkov [ 08/Jul/21 ]
root@ynode02:/home/hpcd/alyashkov# bash test-wc1.sh
LNET busy
/home/hpcd/alyashkov/work/lustre-wc/lustre/tests /home/hpcd/alyashkov
Loading modules from /home/hpcd/alyashkov/work/lustre-wc/lustre/tests/..
detected 56 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../lnet/lnet/lnet options: 'networks=o2ib(ibs9f0) accept=all'
gss/krb5 is not supported
/home/hpcd/alyashkov
debug=0
subsystem_debug=0
IoType: READ XferType: GPUD Threads: 32 DataSetSize: 30885168/1048576(KiB) IOSize: 16(KiB) Throughput: 0.244956 GiB/sec, Avg_Latency: 1993.301007 usecs ops: 1930323 total_time 120.243414 secs
IoType: READ XferType: GPUD Threads: 32 DataSetSize: 190027184/1048576(KiB) IOSize: 16(KiB) Throughput: 1.517209 GiB/sec, Avg_Latency: 321.826865 usecs ops: 11876699 total_time 119.445644 secs
root@ynode02:/home/hpcd/alyashkov# bash test1-1.sh
LNET busy
/home/hpcd/alyashkov/work/lustre/lustre/tests /home/hpcd/alyashkov
e2label: No such file or directory while trying to open /tmp/lustre-mdt1
Couldn't find valid filesystem superblock.
e2label: No such file or directory while trying to open /tmp/lustre-mdt1
Couldn't find valid filesystem superblock.
e2label: No such file or directory while trying to open /tmp/lustre-ost1
Couldn't find valid filesystem superblock.
Loading modules from /home/hpcd/alyashkov/work/lustre/lustre/tests/..
detected 56 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../lnet/lnet/lnet options: 'networks=o2ib(ibs9f0) accept=all'
enable_experimental_features=1
gss/krb5 is not supported
/home/hpcd/alyashkov
debug=0
subsystem_debug=0
IoType: READ XferType: GPUD Threads: 32 DataSetSize: 30880000/1048576(KiB) IOSize: 16(KiB) Throughput: 0.247379 GiB/sec, Avg_Latency: 1973.783888 usecs ops: 1930000 total_time 119.045719 secs
IoType: READ XferType: GPUD Threads: 32 DataSetSize: 236412320/1048576(KiB) IOSize: 16(KiB) Throughput: 1.880670 GiB/sec, Avg_Latency: 259.630924 usecs ops: 14775770 total_time 119.883004 secs

IO load generated by

/usr/local/cuda-11.2/gds/tools/gdsio -f /lustre/hdd/alyashkov/foo -d 7 -w 32 -s 1G -i 16k -x 0 -I 0 -T 120
/usr/local/cuda-11.2/gds/tools/gdsio -f /lustre/hdd/alyashkov/foo -d 0 -w 32 -s 1G -i 16k -x 0 -I 0 -T 120

Host is an HP ProLiant with 8 GPUs + 2 IB cards. The GPUs are split across the two NUMA nodes - GPU0..GPU3 in NUMA0, GPU4..GPU7 in NUMA1. IB0 (active) is in NUMA0, IB1 (inactive) in NUMA1. Connected to an L300 system, 1 stripe per file.
OS Ubuntu 20.04 + 5.4 kernel + MOFED 5.3.
nvidia-fs from the GDS 1.0 release.

This difference is likely because of lnet_select_best_ni differences, and is expected for the WC patch version.
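
A hedged way to confirm which local NI the selection logic actually used during such a run (not something captured above) is to diff the per-NI send counts around one gdsio pass:

lnetctl net show -v > /tmp/lnet-before.yaml
/usr/local/cuda-11.2/gds/tools/gdsio -f /lustre/hdd/alyashkov/foo -d 0 -w 32 -s 1G -i 16k -x 0 -I 0 -T 120
lnetctl net show -v > /tmp/lnet-after.yaml
diff /tmp/lnet-before.yaml /tmp/lnet-after.yaml   # the NI whose send_count jumped carried the bulk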

Comment by Alexey Lyashkov [ 08/Jul/21 ]

The 1 GB chunk size has a smaller perf drop - around 5%.

+ bash llmount.sh
Loading modules from /home/hpcd/alyashkov/work/lustre-wc/lustre/tests/..
detected 56 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../lnet/lnet/lnet options: 'networks=o2ib(ibs9f0) accept=all'
gss/krb5 is not supported
+ popd
/home/hpcd/alyashkov
+ mount -t lustre 192.168.0.210@o2ib:/hdd /lustre/hdd
+ lctl set_param debug=0 subsystem_debug=0
debug=0
subsystem_debug=0
+ CUFILE_ENV_PATH_JSON=/home/hpcd/alyashkov/cufile.json
+ /usr/local/cuda-11.2/gds/tools/gdsio -f /lustre/hdd/alyashkov/foo -d 7 -w 32 -s 1G -i 1M -x 0 -I 0 -T 120
IoType: READ XferType: GPUD Threads: 32 DataSetSize: 27216896/1048576(KiB) IOSize: 1024(KiB) Throughput: 0.217678 GiB/sec, Avg_Latency: 143491.908904 usecs ops: 26579 total_time 119.240755 secs
+ /usr/local/cuda-11.2/gds/tools/gdsio -f /lustre/hdd/alyashkov/foo -d 0 -w 32 -s 1G -i 1M -x 0 -I 0 -T 120
IoType: READ XferType: GPUD Threads: 32 DataSetSize: 1072263168/1048576(KiB) IOSize: 1024(KiB) Throughput: 8.589992 GiB/sec, Avg_Latency: 3637.855962 usecs ops: 1047132 total_time 119.044332 secs
root@ynode02:/home/hpcd/alyashkov# bash test1.sh
LNET busy
/home/hpcd/alyashkov/work/lustre/lustre/tests /home/hpcd/alyashkov
e2label: No such file or directory while trying to open /tmp/lustre-mdt1
Couldn't find valid filesystem superblock.
e2label: No such file or directory while trying to open /tmp/lustre-mdt1
Couldn't find valid filesystem superblock.
e2label: No such file or directory while trying to open /tmp/lustre-ost1
Couldn't find valid filesystem superblock.
Loading modules from /home/hpcd/alyashkov/work/lustre/lustre/tests/..
detected 56 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../lnet/lnet/lnet options: 'networks=o2ib(ibs9f0) accept=all'
enable_experimental_features=1
gss/krb5 is not supported
/home/hpcd/alyashkov
debug=0
subsystem_debug=0
IoType: READ XferType: GPUD Threads: 32 DataSetSize: 27275264/1048576(KiB) IOSize: 1024(KiB) Throughput: 0.217832 GiB/sec, Avg_Latency: 143344.443230 usecs ops: 26636 total_time 119.411700 secs
IoType: READ XferType: GPUD Threads: 32 DataSetSize: 1117265920/1048576(KiB) IOSize: 1024(KiB) Throughput: 8.940439 GiB/sec, Avg_Latency: 3495.253255 usecs ops: 1091080 total_time 119.178470 secs
root@ynode02:/home/hpcd/alyashkov#
Comment by Alexey Lyashkov [ 08/Jul/21 ]

results after 10 iterations.

[alyashkov@hpcgate ~]$ for i in `ls log-*16k`; do echo $i; grep "Throughput: 1." $i | awk '{if ($10 == "16(KiB)") {sum += $12;}} END { print sum/10;}'; done
log-cray-16k
1.84928
log-master-16k
1.87858
log-wc-16k
1.54516
[alyashkov@hpcgate ~]$ for i in `ls log-*16k`; do echo $i; grep "Throughput: 0." $i | awk '{if ($10 == "16(KiB)") {sum += $12;}} END { print sum/10;}'; done
log-cray-16k
0.247549
log-master-16k
0.247369
log-wc-16k
0.245084

The test script is the same for each tree, except for the directory the modules are loaded from.

# cat test-wc1.sh
#!/bin/bash

# echo 1 > /sys/module/nvidia_fs/parameters/dbg_enabled
umount /lustre/hdd && lctl net down ; lustre_rmmod

pushd /home/hpcd/alyashkov/work/lustre-wc/lustre/tests
#PTLDEBUG=-1 SUBSYSTEM=-1 DEBUG_SIZE=1000

NETTYPE=o2ib LOAD=yes bash llmount.sh
popd


mount -t lustre 192.168.0.210@o2ib:/hdd /lustre/hdd
lctl set_param debug=0 subsystem_debug=0
# && lctl set_param debug=-1 subsystem_debug=-1 debug_mb=10000
CUFILE_ENV_PATH_JSON=/home/hpcd/alyashkov/cufile.json
for i in $(seq 10); do
/usr/local/cuda-11.2/gds/tools/gdsio -f /lustre/hdd/alyashkov/foo -d 7 -w 32 -s 1G -i 16k -x 0 -I 0 -T 120
/usr/local/cuda-11.2/gds/tools/gdsio -f /lustre/hdd/alyashkov/foo -d 0 -w 32 -s 1G -i 16k -x 0 -I 0 -T 120
done

# -d 0 -w 4 -s 4G -i 1M -I 1 -x 0 -V
#lctl dk > /tmp/llog
#dmesg -c > /tmp/n-log
#umount /lustre/hdd && lctl net down ; lustre_rmmod

Test system: HPE ProLiant XL270d Gen9

PCIe tree

root@ynode02:/home/hpcd/alyashkov# lspci -tv
-+-[0000:ff]-+-08.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-08.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-09.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
 |           +-09.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
 |           +-0b.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link Debug
 |           +-0c.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-10.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent
 |           +-10.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent
 |           +-10.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-10.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-10.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-12.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0
 |           +-12.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0
 |           +-12.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0 Debug
 |           +-12.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1
 |           +-12.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1
 |           +-12.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1 Debug
 |           +-13.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Target Address/Thermal/RAS
 |           +-13.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Target Address/Thermal/RAS
 |           +-13.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder
 |           +-13.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder
 |           +-13.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Broadcast
 |           +-13.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Global Broadcast
 |           +-14.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 0 Thermal Control
 |           +-14.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 1 Thermal Control
 |           +-14.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 0 Error
 |           +-14.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 1 Error
 |           +-14.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-16.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Target Address/Thermal/RAS
 |           +-16.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Target Address/Thermal/RAS
 |           +-16.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Channel Target Address Decoder
 |           +-16.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Channel Target Address Decoder
 |           +-16.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Broadcast
 |           +-16.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Global Broadcast
 |           +-17.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 0 Thermal Control
 |           +-17.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 1 Thermal Control
 |           +-17.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 0 Error
 |           +-17.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 1 Error
 |           +-17.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-1e.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1f.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           \-1f.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 +-[0000:80]-+-00.0-[94]--
 |           +-01.0-[95]--
 |           +-01.1-[96]--
 |           +-02.0-[81-8b]----00.0-[82-8b]--+-04.0-[83]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
 |           |                               +-08.0-[86]--
 |           |                               \-0c.0-[89]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
 |           +-02.1-[97]--
 |           +-02.2-[98]--
 |           +-02.3-[99]--
 |           +-03.0-[8c-93]----00.0-[8d-93]--+-08.0-[8e]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
 |           |                               \-10.0-[91]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
 |           +-03.1-[9a]--
 |           +-03.2-[9b]--
 |           +-03.3-[9c]--
 |           +-04.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 0
 |           +-04.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 1
 |           +-04.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 2
 |           +-04.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 3
 |           +-04.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 4
 |           +-04.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 5
 |           +-04.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 6
 |           +-04.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 7
 |           +-05.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Map/VTd_Misc/System Management
 |           +-05.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO Hot Plug
 |           +-05.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO RAS/Control Status/Global Errors
 |           \-05.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D I/O APIC
 +-[0000:7f]-+-08.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-08.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-09.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
 |           +-09.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
 |           +-0b.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link Debug
 |           +-0c.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-10.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent
 |           +-10.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent
 |           +-10.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-10.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-10.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-12.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0
 |           +-12.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0
 |           +-12.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0 Debug
 |           +-12.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1
 |           +-12.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1
 |           +-12.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1 Debug
 |           +-13.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Target Address/Thermal/RAS
 |           +-13.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Target Address/Thermal/RAS
 |           +-13.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder
 |           +-13.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder
 |           +-13.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Broadcast
 |           +-13.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Global Broadcast
 |           +-14.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 0 Thermal Control
 |           +-14.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 1 Thermal Control
 |           +-14.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 0 Error
 |           +-14.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 1 Error
 |           +-14.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-16.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Target Address/Thermal/RAS
 |           +-16.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Target Address/Thermal/RAS
 |           +-16.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Channel Target Address Decoder
 |           +-16.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Channel Target Address Decoder
 |           +-16.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Broadcast
 |           +-16.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Global Broadcast
 |           +-17.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 0 Thermal Control
 |           +-17.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 1 Thermal Control
 |           +-17.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 0 Error
 |           +-17.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 1 Error
 |           +-17.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-1e.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1f.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           \-1f.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 \-[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
             +-01.0-[16]--
             +-01.1-[1c]--
             +-02.0-[03-0a]----00.0-[04-0a]--+-08.0-[05]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
             |                               \-10.0-[08]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
             +-02.1-[1d]--
             +-02.2-[1e]--
             +-02.3-[1f]--
             +-03.0-[0b-15]----00.0-[0c-15]--+-04.0-[0d]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
             |                               +-08.0-[10]--+-00.0  Mellanox Technologies MT27700 Family [ConnectX-4]
             |                               |            \-00.1  Mellanox Technologies MT27700 Family [ConnectX-4]
             |                               \-0c.0-[13]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
             +-03.1-[19]--
             +-03.2-[1a]--
             +-03.3-[1b]--
             +-04.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 0
             +-04.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 1
             +-04.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 2
             +-04.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 3
             +-04.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 4
             +-04.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 5
             +-04.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 6
             +-04.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Crystal Beach DMA Channel 7
             +-05.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Map/VTd_Misc/System Management
             +-05.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO Hot Plug
             +-05.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO RAS/Control Status/Global Errors
             +-05.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D I/O APIC
             +-11.0  Intel Corporation C610/X99 series chipset SPSR
             +-14.0  Intel Corporation C610/X99 series chipset USB xHCI Host Controller
             +-1a.0  Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2
             +-1c.0-[20]--
             +-1c.2-[01]--+-00.0  Hewlett-Packard Company Integrated Lights-Out Standard Slave Instrumentation & System Support
             |            +-00.1  Matrox Electronics Systems Ltd. MGA G200EH
             |            +-00.2  Hewlett-Packard Company Integrated Lights-Out Standard Management Processor Support and Messaging
             |            \-00.4  Hewlett-Packard Company Integrated Lights-Out Standard Virtual USB Controller
             +-1c.4-[02]--+-00.0  Intel Corporation I350 Gigabit Network Connection
             |            \-00.1  Intel Corporation I350 Gigabit Network Connection
             +-1d.0  Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #1
             +-1f.0  Intel Corporation C610/X99 series chipset LPC Controller
             +-1f.2  Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
             \-1f.3  Intel Corporation C610/X99 series chipset SMBus Controller
# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          56
On-line CPU(s) list:             0-55
Thread(s) per core:              2
Core(s) per socket:              14
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping:                        1
CPU MHz:                         2220.098
BogoMIPS:                        4789.01
Virtualization:                  VT-x
L1d cache:                       896 KiB
L1i cache:                       896 KiB
L2 cache:                        7 MiB
L3 cache:                        70 MiB
NUMA node0 CPU(s):               0-13,28-41
NUMA node1 CPU(s):               14-27,42-55
# uname -a
Linux ynode02 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
# ofed_info | head -1
MLNX_OFED_LINUX-5.3-1.0.0.1 (OFED-5.3-1.0.0):
# ls -d /usr/src/nvidia*
/usr/src/nvidia-460.80 /usr/src/nvidia-fs-2.3.4 /usr/src/nvidia-fs-2.7.49
    
    
Comment by Alexey Lyashkov [ 12/Jul/21 ]

Any news on replicating the issue?

Comment by Shuichi Ihara [ 07/Aug/21 ]

shadow, please check LU-14795, where I got build failures with the latest GDS code that is part of CUDA 11.4.1. The LU-14798 patch was fine to build against CUDA 11.4 and 11.4.1 without any changes, though.

Comment by Shuichi Ihara [ 09/Aug/21 ]

Due to client (DGX-A100) availability, sorry for the delay in posting test results comparing patches LU-14795 and LU-14798.
Here are the test results in detail.

Tested Hardware
1 x AI400x (23 x NVMe)
1 x NVIDIA DGX-A100

The DGX-A100 supports up to 8 x GPU against 8 x IB-HDR200 and 2 x CPU. In my testing, 2 x IB-HDR200 and either 2 or 4 GPUs were used for GDS-IO. This is a fully NUMA-aware (GPU and IB-HDR200 on the same NUMA node) and symmetric configuration.

The test cases are "thr=32, mode=0 (GDS-IO), op=1/0 (write/read) and iosize=16KB/1MB" with gdsio, as below.

GDSIO=/usr/local/cuda-11.4/gds/tools/gdsio
TARGET=/lustre/ai400x/client/gdsio

mode=$1
op=$2
thr=$3
iosize=$4

$GDSIO -T 60 \
	-D $TARGET/md0 -d 0 -n 3 -w $thr -s 1G -i $iosize -x $mode -I $op \
	-D $TARGET/md4 -d 4 -n 7 -w $thr -s 1G -i $iosize -x $mode -I $op

$GDSIO -T 60 \
	-D $TARGET/md0 -d 0 -n 3 -w $thr -s 1G -i $iosize -x $mode -I $op \
	-D $TARGET/md1 -d 1 -n 3 -w $thr -s 1G -i $iosize -x $mode -I $op \
	-D $TARGET/md4 -d 4 -n 7 -w $thr -s 1G -i $iosize -x $mode -I $op \
	-D $TARGET/md5 -d 5 -n 7 -w $thr -s 1G -i $iosize -x $mode -I $op 

2 x GPU, 2 x IB-HDR200

		iosize=16k			iosize=1m
		Write		Read		Write		Read
LU-14795	 0.968215	 2.3704		35.3331	 	35.5543
LU-14798  	 0.979587        2.24632        34.7941         34.0566

4 x GPU, 2 x IB-HDR200

		iosize=16k			iosize=1m
		Write		Read		Write		Read
LU-14795	 1.05208	 2.62914	34.8957	 	37.4645
LU-14798  	 1.28675         2.53229        36.0412         39.2747

I saw that patch LU-14798 was ~5% slower than LU-14795 for 16K and 1M reads in the 2 x GPU case, but I didn't see a 23% drop.
However, patch LU-14795 was overall slower than LU-14798 in the 4 x GPU, 2 x HDR200 case (22% slower for 16K writes in particular).

Comment by Alexey Lyashkov [ 10/Aug/21 ]

@Ihara - you ran a different test than I showed. My test chose a SINGLE CPU + GPU near to the IB card; you chose a different number of GPUs with unknown distances. And what is the distance between the CPU and GPU? Can you please attach lspci output so we can understand it?

PS. NUMA awareness isn't applicable to GPU <> IB communication; that is based on the PCI root complex configuration. NUMA applies only to CPU <> local memory access.

Comment by Gerrit Updater [ 10/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44110/
Subject: LU-14798 lnet: add LNet GPU Direct Support
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a7a889f77cec3ad44543fd0b33669521e612097d

Comment by Gerrit Updater [ 10/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44111/
Subject: LU-14798 lustre: Support RDMA only pages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 29eabeb34c5ba2cffdb5353d108ea56e0549665b

Comment by Peter Jones [ 10/Aug/21 ]

Landed for 2.15

Comment by Shuichi Ihara [ 10/Aug/21 ]

My setup is the fully proper NUMA-aware configuration I mentioned above.
The tested GPUs and IB interfaces are located on the same NUMA nodes; see below.

root@dgxa100:~# nvidia-smi 
Tue Aug 10 00:15:20 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   27C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   26C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   26C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   31C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   31C    P0    58W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   31C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The GPU indexes I selected were 0, 1, 4 and 5 (see the "-d X" option in the test script I ran with gdsio).
Those GPUs' PCI bus IDs can be identified from the "nvidia-smi" output above.

GPU index	PCIBus-ID
0		00000000:07:00.0
1		00000000:0F:00.0
4		00000000:87:00.0
5		00000000:90:00.0

Those PCI devices' NUMA nodes are 3 or 7, as shown below. That's why I used "-n 3" or "-n 7" with gdsio.

root@dgxa100:~# cat /sys/bus/pci/drivers/nvidia/0000\:07\:00.0/numa_node 
3
root@dgxa100:~# cat /sys/bus/pci/drivers/nvidia/0000\:0f\:00.0/numa_node 
3
root@dgxa100:~# cat /sys/bus/pci/drivers/nvidia/0000\:87\:00.0/numa_node 
7
root@dgxa100:~# cat /sys/bus/pci/drivers/nvidia/0000\:90\:00.0/numa_node 
7

And two IB interfaces (ibp12s0 and ibp141s0) were configured as LNET.

root@dgxa100:~# lnetctl net show 
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 172.16.167.67@o2ib
          status: up
          interfaces:
              0: ibp12s0
        - nid: 172.16.178.67@o2ib
          status: up
          interfaces:
              0: ibp141s0

Those IB interfaces' PCI buses are 0000:0c:00.0 (ibp12s0) and 0000:8d:00.0 (ibp141s0).

root@dgxa100:~# ls -l /sys/class/net/ibp12s0/device /sys/class/net/ibp141s0/device
lrwxrwxrwx 1 root root 0 Aug  9 18:45 /sys/class/net/ibp12s0/device -> ../../../0000:0c:00.0
lrwxrwxrwx 1 root root 0 Aug  9 15:01 /sys/class/net/ibp141s0/device -> ../../../0000:8d:00.0

ibp12s0 is represented by mlx5_0 and ibp141s0 by mlx5_6, and their NUMA nodes are also 3 and 7, as shown below.

root@dgxa100:~# for a in /sys/class/infiniband/*/device; do
> ls -l $a
> done
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_0/device -> ../../../0000:0c:00.0 <- ibp12s0
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_1/device -> ../../../0000:12:00.0
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_10/device -> ../../../0000:e1:00.0
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_11/device -> ../../../0000:e1:00.1
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_2/device -> ../../../0000:4b:00.0
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_3/device -> ../../../0000:54:00.0
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_4/device -> ../../../0000:61:00.0
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_5/device -> ../../../0000:61:00.1
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_6/device -> ../../../0000:8d:00.0 <- ibp141s0
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_7/device -> ../../../0000:94:00.0
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_8/device -> ../../../0000:ba:00.0
lrwxrwxrwx 1 root root 0 Aug  6 17:36 /sys/class/infiniband/mlx5_9/device -> ../../../0000:cc:00.0
root@dgxa100:~# cat /sys/class/infiniband/mlx5_0/device/numa_node 
3
root@dgxa100:~# cat /sys/class/infiniband/mlx5_6/device/numa_node 
7

So, GPU ids 0 and 1 as well as IB interface ibp12s0 (mlx5_0) are located on NUMA node 3, and GPU ids 4 and 5 and IB interface ibp141s0 (mlx5_6) are located on NUMA node 7.
In fact, the DGX-A100 has 8 x GPU, 8 x IB interfaces and PCI switches between the GPUs (or IB) and the CPUs in the above setting. I've been testing multiple GPUs and IB interfaces; one of the GDS-IO benefits is that it can eliminate the bandwidth limitation on the PCI switches, with every GPU talking to storage through the closest IB interface.

Comment by Alexey Lyashkov [ 10/Aug/21 ]

Ihara, your comments are about the GPU <> CPU/RAM configuration, not about GPU <> IB,
and NUMA nodes are about GPU <> CPU/RAM access.

Did you read https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html?
If yes, can I ask you to look at the examples there around lspci -tv | egrep -i "nvidia | micron" or nvidia-smi topo -mp,
and see how they differ from the info you provided? That info doesn't say anything about NUMA nodes (CPU config); it is about the PCI bus config. An AMD CPU system may have 2-4 NUMA nodes but 8 PCIe root complex nodes, so an IB card and a GPU may be on the SAME NUMA node but behind different PCIe root complexes, which limits P2P transfers.
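
For example, on such a node the switch-level proximity can be read directly (commands as named above; the legend summary in the comments is hedged, check nvidia-smi's own output):

nvidia-smi topo -mp                      # PIX/PXB entries mean the GPU and HCA stay below the
                                         # PCIe host bridge; PHB/NODE/SYS mean the path crosses
                                         # the host bridge or the socket interconnect
lspci -tv | egrep -i "nvidia|mellanox"   # or walk the PCI tree and check they share a bridge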

But I don't see a reason to continue the discussion, as Whamcloud hurried to land the patches before all tests and discussion were finished. So I think Whamcloud isn't interested in this discussion.

Comment by Shuichi Ihara [ 10/Aug/21 ]

Ihara, your comments are about the GPU <> CPU/RAM configuration, not about GPU <> IB,
and NUMA nodes are about GPU <> CPU/RAM access.

You can find this information in NVIDIA's DGX-A100 or SuperPOD material; e.g. see page 10 of
https://hotchips.org/assets/program/tutorials/HC2020.NVIDIA.MichaelHouston.v02.pdf
Again, GPU0, GPU1 and mlx5_0 are under the same PCI switch on NUMA node 3, and GPU4, GPU5 and mlx5_6 are under the same PCI switch on NUMA node 7. Our test configuration was correct.
