
[LU-13311] lustre2.13 module load gets stuck at lnet

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: Lustre 2.13.0
    • Severity: 3

    Description

      We started testing 2.13.

      Module load gets stuck at

       open("/sys/module/ptlrpc/initstate", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      stat("/sys/module/ptlrpc", 0x7fffffffc660) = -1 ENOENT (No such file or directory)
      open("/lib/modules/3.10.0-1062.12.1.el7_lustre2130.x86_64/extra/lustre/fs/ptlrpc.ko", O_RDONLY|O_CLOEXEC) = 7
      read(7, "\177ELF\2\1", 6)               = 6
      lseek(7, 0, SEEK_SET)                   = 0
      fstat(7, {st_mode=S_IFREG|0744, st_size=4181088, ...}) = 0
      mmap(NULL, 4181088, PROT_READ, MAP_PRIVATE, 7, 0) = 0x2aaaabca1000
      finit_module(7, "at_max=600 at_min=275 ", 0

      See the attached file for the complete trace. Our module.conf file is also attached.
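
      For reference, a trace like the attached one can be captured by running modprobe under strace; this is only a sketch, assuming the hang reproduces on a plain modprobe of the lustre stack with the options from the attached module.conf in place:

       # follow children and timestamp each syscall; modprobe is expected to hang here
       strace -f -tt -o module_load_issue.out modprobe lustre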

      Attachments

        1. lustre.conf
          1.0 kB
        2. module_load_issue.out
          27 kB
        3. debug.ptlrpc.log.gz
          2.49 MB

        Activity

          [LU-13311] lustre2.13 module load gets stuck at lnet

          mhanafi Mahmoud Hanafi added a comment

          We can close this.

          ashehata Amir Shehata (Inactive) added a comment

          I believe this issue was fixed with the following patch:

            LU-13001 lnet: Wait for single discovery attempt of routers
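
          One way to check whether a given build already carries that change is to search the commit history for the ticket tag; a sketch, assuming a checkout of the lustre source tree:

            # list commits in the lnet tree that reference the ticket
            git log --oneline --grep='LU-13001' -- lnet/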

          mhanafi Mahmoud Hanafi added a comment

          Module load hangs if one of the listed routers is down.
          If I remove 10.151.26.239@o2ib from module.conf, the module load works.
          When it is listed, this is the output on the console. Why is it enumerating eth1?

          [  224.782043] LNet: 8054:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface eth1: it's down
          [  227.476974] LNet: 8054:0:(router.c:692:lnet_add_route()) Consider turning discovery on to enable full Multi-Rail routing functionality
          [  227.521176] LNetError: 8092:0:(router.c:315:lnet_set_route_aliveness()) route to o2ib233 through 10.151.26.80@o2ib has gone from down to up
          [  228.781636] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 227 seconds
          [  228.817288] LNetError: 2950:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.151.26.239@o2ib added to recovery queue. Health = 999
          [  229.781593] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 228 seconds
          [  262.779754] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 0 seconds
          [  292.778101] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 0 seconds
          [  302.777536] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 10 seconds
          [  322.776432] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 0 seconds
          [  353.774712] LNet: 8089:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.151.26.239@o2ib: 1 seconds
          
          
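          Once lnet does come up (for example with the dead gateway removed, as described above), the configured routes and the health of the suspect peer can be inspected with lnetctl; a sketch, assuming lnetctl from the same release is installed:

            # list configured routes and whether each gateway is considered alive
            lnetctl route show -v
            # show state and health details for the gateway that keeps timing out
            lnetctl peer show --nid 10.151.26.239@o2ib -v
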
          pjones Peter Jones added a comment

          Amir

          Could you please advise?

          Peter


          mhanafi Mahmoud Hanafi added a comment

          We have used the same ko2iblnd settings for a long time; we came up with those values for our large number of clients (over 13K). 2.12.3 servers work just fine. I made a mistake listing the affected version; it should be 2.13.

          So far we have only seen this issue with 2.13 servers. I think one of the routers listed is actually down. I'll remove that and try it.

          Our routers have two interfaces on each side, but they are not configured to use Multi-Rail. It may be an issue with Multi-Rail discovery.

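          If Multi-Rail discovery is the suspect, it can be disabled explicitly to test that theory; a sketch, assuming the standard lnet tunables (either the module option at load time or lnetctl at runtime):

            # in the modprobe configuration (e.g. the attached lustre.conf):
            options lnet lnet_peer_discovery_disabled=1

            # or at runtime, once lnet is loaded:
            lnetctl set discovery 0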

          adilger Andreas Dilger added a comment

          I'm not necessarily able to diagnose the LNet issue; that may have to be our LNet team. That said, there looks to be something strange going on in the setup:

          00000800:00000010:7.0:1582936086.907926:0:9830:0:(o2iblnd.c:2381:kiblnd_create_tx_pool()) alloc '(tx->tx_sge)': 8224 at ffffa9c30d64b000 (tot 1972499764).
          00000800:00000010:7.0:1582936086.907928:0:9830:0:(o2iblnd.c:2387:kiblnd_create_tx_pool()) alloc '(tx->tx_rd)': 3080 at ffff9add9603f000 (tot 1972502844).
          

          It looks like it has allocated almost 2GB of memory for IB buffers (less a few MB for other things at the start). That seems pretty high. Are you sure the IB parameters are correct?

          options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769 
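
          For reference, the running 'tot' counter quoted from the debug log converts as follows; this is just a unit conversion of the number above, not a recomputation of the pool sizes:

            # bytes reported by the 'tot' allocation counter, in decimal GB
            echo 'scale=2; 1972502844 / 1000^3' | bc    # prints 1.97, i.e. the "almost 2GB" above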
          

          However, some suggestions to help debug:

          • Does the hang relate to any specific router? If you remove one or two of the routers, does it load OK? (A bisection sketch follows this comment.) I'm assuming the routers are working for 2.10 clients? In the logs it looks like it is parsing the o2ib233, o2ib313, and o2ib417 routes correctly. Then it looks like it gets into peer discovery (or something) with the routers and still has not finished 20 seconds later when the debug log is dumped:
            00000800:00000200:7.0:1582936102.918672:0:3029:0:(o2iblnd.c:1356:kiblnd_map_rx_descs()) rx 63: ffff9acdfe38a000 0xffe38a000(0xffe38a000)
            00000800:00000200:7.0:1582936102.918674:0:3029:0:(o2iblnd.c:1356:kiblnd_map_rx_descs()) rx 64: ffff9acdfe38b000 0xffe38b000(0xffe38b000)
            00000800:00000200:7.0:1582936102.918676:0:3029:0:(o2iblnd.c:1356:kiblnd_map_rx_descs()) rx 65: ffff9acdfc71e000 0xffc71e000(0xffc71e000)
            00000800:00000010:7.0:1582936102.918678:0:3029:0:(o2iblnd.c:956:kiblnd_create_conn()) kfreed 'init_qp_attr': 112 at ffff9acdfb2e5d00 (tot 1973018084).
            00000800:00000200:7.0:1582936102.918680:0:3029:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffff9acdd096a200] (67)++
            00000800:00000200:7.0:1582936102.918682:0:3029:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffff9acdd096a200] (68)--
            00000800:00000200:7.0:1582936102.918684:0:3029:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffff9acdd096a200] (67)++
            00000800:00000200:7.0:1582936102.918685:0:3029:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffff9acdd096a200] (68)--
            :
            :
            00000800:00000200:7.0:1582936102.918879:0:3029:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffff9acdd096a200] (67)++
            00000800:00000200:7.0:1582936102.918881:0:3029:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffff9acdd096a200] (68)--
            00000800:00000200:7.2:1582936102.919380:0:0:0:(o2iblnd_cb.c:3737:kiblnd_cq_completion()) conn[ffff9acdd096a200] (67)++
            00000800:00000200:7.0:1582936102.919391:0:9865:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9acdd096a200] (68)++
            00000800:00000100:7.0:1582936102.919397:0:9865:0:(o2iblnd_cb.c:507:kiblnd_rx_complete()) Rx from 10.151.26.239@o2ib failed: 5
            00000800:00000200:7.0:1582936102.919399:0:9865:0:(o2iblnd_cb.c:553:kiblnd_rx_complete()) rx ffff9acd505ec000 conn ffff9acdd096a200
            :
            :
            00000800:00000200:7.0:1582936102.919845:0:9865:0:(o2iblnd_cb.c:158:kiblnd_drop_rx()) conn[ffff9acdd096a200] (3)--
            00000800:00000200:1.0:1582936102.919845:0:9864:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9acdd096a200] (3)--
            00000800:00000200:7.0:1582936102.919849:0:9865:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9acdd096a200] (1)--
            00000800:00000200:1.0:1582936102.919857:0:9863:0:(o2iblnd_cb.c:3528:kiblnd_connd()) peer_ni[ffff9acdfd59b300] -> 10.151.26.239@o2ib (2)++
            00000800:00000010:7.0:1582936102.920306:0:9863:0:(o2iblnd.c:1272:kiblnd_free_pages()) kfreed 'p': 536 at ffff9acd4b677c00 (tot 1973017412).
            00000800:00000010:7.0:1582936102.920309:0:9863:0:(o2iblnd.c:1059:kiblnd_destroy_conn()) kfreed 'conn->ibc_rxs': 6864 at ffff9acd505ec000 (tot 1973010548).
            00000800:00000200:7.0:1582936102.920314:0:9863:0:(o2iblnd.c:1072:kiblnd_destroy_conn()) peer_ni[ffff9acdfd59b300] -> 10.151.26.239@o2ib (3)--
            
          • Have you tried 2.12.4 yet? It fixed several LNet issues that were in 2.12.3 (though I see you already have some of the tunables in lustre.conf to disable lnet_transaction_timeout). Since this problem hits immediately, it should be easy to verify whether it is already fixed.
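
          To follow up on the first suggestion, the routes module option can be bisected by loading with a single destination network at a time; this is only a sketch, reusing the o2ib233 group from the routes line quoted below:

            # /etc/modprobe.d/lustre.conf: keep only one route group per attempt
            options lnet routes="o2ib233 10.151.26.[80-94]@o2ib"
            # repeat with only the o2ib313 group, then only o2ib417,
            # to narrow down which gateway blocks the module load
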
          mhanafi Mahmoud Hanafi added a comment (edited)

          So I tracked down the issue to our module.conf setting for the routers.
          Removing this line, the module loads fine.

          options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[167-171,195-198,202-205,222]@o2ib 10.151.26.[60,127,140-144,146-154]@o2ib; o2ib417 10.151.26.[230-239,244-249]@o2ib"
          

          See uploaded debug file.

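          If routing is still needed, one possible workaround (a sketch only, not verified in this ticket) is to drop the routes= line so the modules load cleanly, and then add the routes at runtime, where a dead gateway cannot block finit_module:

            # load the modules with no "options lnet routes=..." line present
            modprobe lustre
            # then add routes one gateway at a time with lnetctl,
            # skipping the gateway known to be down (10.151.26.239@o2ib)
            lnetctl route add --net o2ib233 --gateway 10.151.26.80@o2ib
            lnetctl route add --net o2ib417 --gateway 10.151.26.230@o2ib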

          adilger Andreas Dilger added a comment

          This looks like it is getting stuck during loading of ptlrpc.ko, which is when the network connections are initialized. Could you please enable full debugging before starting to load ptlrpc.ko and dump the debug logs after it gets stuck to see where it is having problems:

          # modprobe libcfs
          # lctl set_param debug=-1
          # modprobe ptlrpc & sleep 10
          # lctl debug_kernel /tmp/debug.ptlrpc.log
          

          and then attach the log here (maybe gzipped if very large).
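
          For completeness, the dump can be compressed in place before uploading; the attached debug.ptlrpc.log.gz is consistent with something like:

            # compress the dumped log; produces /tmp/debug.ptlrpc.log.gz
            gzip -9 /tmp/debug.ptlrpc.log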


          People

            Assignee: ashehata Amir Shehata (Inactive)
            Reporter: mhanafi Mahmoud Hanafi
            Votes: 0
            Watchers: 6
