[LU-13311] lustre2.13 module load gets stuck at lnet Created: 28/Feb/20 Updated: 06/Jan/21 Resolved: 06/Jan/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We started testing 2.13. Module load gets stuck at open("/sys/module/ptlrpc/initstate", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/sys/module/ptlrpc", 0x7fffffffc660) = -1 ENOENT (No such file or directory) open("/lib/modules/3.10.0-1062.12.1.el7_lustre2130.x86_64/extra/lustre/fs/ptlrpc.ko", O_RDONLY|O_CLOEXEC) = 7 read(7, "\177ELF\2\1", 6) = 6 lseek(7, 0, SEEK_SET) = 0 fstat(7, {st_mode=S_IFREG|0744, st_size=4181088, ...}) = 0 mmap(NULL, 4181088, PROT_READ, MAP_PRIVATE, 7, 0) = 0x2aaaabca1000 finit_module(7, "at_max=600 at_min=275 ", 0 See attached file for complete trace. |
| Comments |
| Comment by Andreas Dilger [ 29/Feb/20 ] |
|
This looks like it is getting stuck during loading of ptlrpc.ko, which is when the network connections are initialized. Could you please enable full debugging before starting to load ptlrpc.ko and dump the debug logs after it gets stuck to see where it is having problems: # modprobe libcfs # lctl set_param debug=-1 # modprobe ptlrpc.ko & sleep 10 # lctl debug_kernel /tmp/debug.ptlrpc.log and then attach the log here (maybe gzipped if very large). |
| Comment by Mahmoud Hanafi [ 29/Feb/20 ] |
|
So I tracked down the issue to our module.conf setting for the routers
options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[167-171,195-198,202-205,222]@o2ib 10.151.26.[60,127,140-144,146-154]@o2ib; o2ib417 10.151.26.[230-239,244-249]@o2ib"
See uploaded debug file. |
| Comment by Andreas Dilger [ 29/Feb/20 ] |
|
I'm not able to necessarily able diagnose the LNet issue, that may have to be our LNet team. That said, there looks to be something strange going on in the setup: 00000800:00000010:7.0:1582936086.907926:0:9830:0:(o2iblnd.c:2381:kiblnd_create_tx_pool()) alloc '(tx->tx_sge)': 8224 at ffffa9c30d64b000 (tot 1972499764). 00000800:00000010:7.0:1582936086.907928:0:9830:0:(o2iblnd.c:2387:kiblnd_create_tx_pool()) alloc '(tx->tx_rd)': 3080 at ffff9add9603f000 (tot 1972502844). It looks like it has allocated almost 2GB of memory for IB buffers (less a few MB for other things at the start). That seems pretty high. Are you sure the IB parameters are correct? options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769 However, some suggestions to help debug:
|
| Comment by Mahmoud Hanafi [ 29/Feb/20 ] |
|
We have use the same ko2iblnd setting for a long time. We came up with those values for our large number of clients (over 13K). 2.12.3 servers work just fine. I make made a mistake listing effective version it should be 2.13. So far we have only seen this issue with 2.13 servers. I think one of the routers listed is actually down. I'll remove that and try it. Our routers have 2 interfaces on each side, but they are not configure to use multirail. It may be an issue with multi-rail discovery. |
| Comment by Peter Jones [ 29/Feb/20 ] |
|
Amir Could you please advise Peter |
| Comment by Mahmoud Hanafi [ 03/Mar/20 ] |
|
Module load hangs if a router listed that is down. [ 224.782043] LNet: 8054:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface eth1: it's down [ 227.476974] LNet: 8054:0:(router.c:692:lnet_add_route()) Consider turning discovery on to enable full Multi-Rail routing functionality [ 227.521176] LNetError: 8092:0:(router.c:315:lnet_set_route_aliveness()) route to o2ib233 through 10.151.26.80@o2ib has gone from down to up [ 228.781636] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 227 seconds [ 228.817288] LNetError: 2950:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.151.26.239@o2ib added to recovery queue. Health = 999 [ 229.781593] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 228 seconds [ 262.779754] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 0 seconds [ 292.778101] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 0 seconds [ 302.777536] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 10 seconds [ 322.776432] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 0 seconds [ 353.774712] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 1 seconds |
| Comment by Amir Shehata (Inactive) [ 04/Mar/20 ] |
|
I believe this issue was fixed with the following patch:
LU-13001 lnet: Wait for single discovery attempt of routers
|
| Comment by Mahmoud Hanafi [ 06/Jan/21 ] |
|
We can close this |