[LU-13311] lustre2.13 module load gets stuck at lnet Created: 28/Feb/20  Updated: 06/Jan/21  Resolved: 06/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File debug.ptlrpc.log.gz     Text File lustre.conf     File module_load_issue.out    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We started testing 2.13.

Module load gets stuck at

 open("/sys/module/ptlrpc/initstate", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/sys/module/ptlrpc", 0x7fffffffc660) = -1 ENOENT (No such file or directory)
open("/lib/modules/3.10.0-1062.12.1.el7_lustre2130.x86_64/extra/lustre/fs/ptlrpc.ko", O_RDONLY|O_CLOEXEC) = 7
read(7, "\177ELF\2\1", 6)               = 6
lseek(7, 0, SEEK_SET)                   = 0
fstat(7, {st_mode=S_IFREG|0744, st_size=4181088, ...}) = 0
mmap(NULL, 4181088, PROT_READ, MAP_PRIVATE, 7, 0) = 0x2aaaabca1000
finit_module(7, "at_max=600 at_min=275 ", 0

See attached file for complete trace.
I also include our module.conf file.



 Comments   
Comment by Andreas Dilger [ 29/Feb/20 ]

This looks like it is getting stuck during loading of ptlrpc.ko, which is when the network connections are initialized. Could you please enable full debugging before starting to load ptlrpc.ko and dump the debug logs after it gets stuck to see where it is having problems:

# modprobe libcfs
# lctl set_param debug=-1
# modprobe ptlrpc.ko & sleep 10
# lctl debug_kernel /tmp/debug.ptlrpc.log

and then attach the log here (maybe gzipped if very large).

Comment by Mahmoud Hanafi [ 29/Feb/20 ]

So I tracked down the issue to our module.conf setting for the routers
removing this line the module loads find.

options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[167-171,195-198,202-205,222]@o2ib 10.151.26.[60,127,140-144,146-154]@o2ib; o2ib417 10.151.26.[230-239,244-249]@o2ib"

See uploaded debug file.

Comment by Andreas Dilger [ 29/Feb/20 ]

I'm not able to necessarily able diagnose the LNet issue, that may have to be our LNet team. That said, there looks to be something strange going on in the setup:

00000800:00000010:7.0:1582936086.907926:0:9830:0:(o2iblnd.c:2381:kiblnd_create_tx_pool()) alloc '(tx->tx_sge)': 8224 at ffffa9c30d64b000 (tot 1972499764).
00000800:00000010:7.0:1582936086.907928:0:9830:0:(o2iblnd.c:2387:kiblnd_create_tx_pool()) alloc '(tx->tx_rd)': 3080 at ffff9add9603f000 (tot 1972502844).

It looks like it has allocated almost 2GB of memory for IB buffers (less a few MB for other things at the start). That seems pretty high. Are you sure the IB parameters are correct?

options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769 

However, some suggestions to help debug:

  • does the hang relate to any specific router? If you remove one or two of the routers does it load OK? I'm assuming the routers are working for 2.10 clients? In the logs it looks like it is parsing the o2ib233, o2ib313, and o2ib417 routes correctly. Then it looks like it gets into peer discovery (or something) with the routers and never really finishes 20 seconds later when the debug log is dumped:
    00000800:00000200:7.0:1582936102.918672:0:3029:0:(o2iblnd.c:1356:kiblnd_map_rx_descs()) rx 63: ffff9acdfe38a000 0xffe38a000(0xffe38a000)
    00000800:00000200:7.0:1582936102.918674:0:3029:0:(o2iblnd.c:1356:kiblnd_map_rx_descs()) rx 64: ffff9acdfe38b000 0xffe38b000(0xffe38b000)
    00000800:00000200:7.0:1582936102.918676:0:3029:0:(o2iblnd.c:1356:kiblnd_map_rx_descs()) rx 65: ffff9acdfc71e000 0xffc71e000(0xffc71e000)
    00000800:00000010:7.0:1582936102.918678:0:3029:0:(o2iblnd.c:956:kiblnd_create_conn()) kfreed 'init_qp_attr': 112 at ffff9acdfb2e5d00 (tot 1973018084).
    00000800:00000200:7.0:1582936102.918680:0:3029:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffff9acdd096a200] (67)++
    00000800:00000200:7.0:1582936102.918682:0:3029:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffff9acdd096a200] (68)--
    00000800:00000200:7.0:1582936102.918684:0:3029:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffff9acdd096a200] (67)++
    00000800:00000200:7.0:1582936102.918685:0:3029:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffff9acdd096a200] (68)--
    :
    :
    00000800:00000200:7.0:1582936102.918879:0:3029:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffff9acdd096a200] (67)++
    00000800:00000200:7.0:1582936102.918881:0:3029:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffff9acdd096a200] (68)--
    00000800:00000200:7.2:1582936102.919380:0:0:0:(o2iblnd_cb.c:3737:kiblnd_cq_completion()) conn[ffff9acdd096a200] (67)++
    00000800:00000200:7.0:1582936102.919391:0:9865:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9acdd096a200] (68)++
    00000800:00000100:7.0:1582936102.919397:0:9865:0:(o2iblnd_cb.c:507:kiblnd_rx_complete()) Rx from 10.151.26.239@o2ib failed: 5
    00000800:00000200:7.0:1582936102.919399:0:9865:0:(o2iblnd_cb.c:553:kiblnd_rx_complete()) rx ffff9acd505ec000 conn ffff9acdd096a200
    :
    :
    00000800:00000200:7.0:1582936102.919845:0:9865:0:(o2iblnd_cb.c:158:kiblnd_drop_rx()) conn[ffff9acdd096a200] (3)--
    00000800:00000200:1.0:1582936102.919845:0:9864:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9acdd096a200] (3)--
    00000800:00000200:7.0:1582936102.919849:0:9865:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9acdd096a200] (1)--
    00000800:00000200:1.0:1582936102.919857:0:9863:0:(o2iblnd_cb.c:3528:kiblnd_connd()) peer_ni[ffff9acdfd59b300] -> 10.151.26.239@o2ib (2)++
    00000800:00000010:7.0:1582936102.920306:0:9863:0:(o2iblnd.c:1272:kiblnd_free_pages()) kfreed 'p': 536 at ffff9acd4b677c00 (tot 1973017412).
    00000800:00000010:7.0:1582936102.920309:0:9863:0:(o2iblnd.c:1059:kiblnd_destroy_conn()) kfreed 'conn->ibc_rxs': 6864 at ffff9acd505ec000 (tot 1973010548).
    00000800:00000200:7.0:1582936102.920314:0:9863:0:(o2iblnd.c:1072:kiblnd_destroy_conn()) peer_ni[ffff9acdfd59b300] -> 10.151.26.239@o2ib (3)--
    
  • have you tried 2.12.4 yet? It fixed several LNet issues that were in 2.12.3 (though I see you already have some of the tunables in lustre.conf to disable lnet_transaction_timeout). Since this problem hits immediately, it should be easy to verify whether it is already fixed.
Comment by Mahmoud Hanafi [ 29/Feb/20 ]

We have use the same ko2iblnd setting for a long time. We came up with those values for our large number of clients (over 13K). 2.12.3 servers  work just fine. I make made a mistake listing effective version it should be 2.13.

So far we have only seen this issue with 2.13 servers. I think one of the routers listed is actually down. I'll remove that and try it.

Our routers have 2 interfaces on each side, but they are not configure to use multirail. It may be an issue with multi-rail discovery.

Comment by Peter Jones [ 29/Feb/20 ]

Amir

Could you please advise

Peter

Comment by Mahmoud Hanafi [ 03/Mar/20 ]

Module load hangs if a router listed that is down.
If I remove 10.151.26.239@o2ib module load works.
When it is listed in module.conf
This is the output on console. Why is it enumerating eth1?

[  224.782043] LNet: 8054:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface eth1: it's down
[  227.476974] LNet: 8054:0:(router.c:692:lnet_add_route()) Consider turning discovery on to enable full Multi-Rail routing functionality
[  227.521176] LNetError: 8092:0:(router.c:315:lnet_set_route_aliveness()) route to o2ib233 through 10.151.26.80@o2ib has gone from down to up
[  228.781636] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 227 seconds
[  228.817288] LNetError: 2950:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.151.26.239@o2ib added to recovery queue. Health = 999
[  229.781593] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 228 seconds
[  262.779754] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 0 seconds
[  292.778101] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 0 seconds
[  302.777536] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 10 seconds
[  322.776432] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 0 seconds
[  353.774712] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 1 seconds

Comment by Amir Shehata (Inactive) [ 04/Mar/20 ]

I believe this issue was fixed with the following patch:

  LU-13001 lnet: Wait for single discovery attempt of routers
Comment by Mahmoud Hanafi [ 06/Jan/21 ]

We can close this

Generated at Sat Feb 10 03:00:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.