LU-15541: Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.7
    • Environment:
      3.10.0-1160.45.1.1chaos.ch6.x86_64
      lustre-2.12.7_2.llnl
      3.10.0-1160.53.1.1chaos.ch6.x86_64
      lustre-2.12.8_6.llnl
      RHEL7.9
      zfs-0.7.11-9.8llnl
    • Severity: 3

    Description

      We upgraded a lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. Almost immediately after boot, clients begin reporting soft lockups on the console, with stacks like this:

      2022-02-08 09:43:10 [1644342190.528916] 
      Call Trace:
       queued_spin_lock_slowpath+0xb/0xf
       _raw_spin_lock+0x30/0x40
       cfs_percpt_lock+0xc1/0x110 [libcfs]
       lnet_discover_peer_locked+0xa0/0x450 [lnet]
       ? wake_up_atomic_t+0x30/0x30
       LNetPrimaryNID+0xd5/0x220 [lnet]
       ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
       target_handle_connect+0x12f1/0x2b90 [ptlrpc]
       ? enqueue_task_fair+0x208/0x6c0
       ? check_preempt_curr+0x80/0xa0
       ? ttwu_do_wakeup+0x19/0x100
       tgt_request_handle+0x4fa/0x1570 [ptlrpc]
       ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
       ? __getnstimeofday64+0x3f/0xd0
       ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
       ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
       ? __wake_up_common_lock+0x91/0xc0
       ? sched_feat_set+0xf0/0xf0
       ptlrpc_main+0xc49/0x1c50 [ptlrpc]
       ? __switch_to+0xce/0x5a0
       ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
       kthread+0xd1/0xe0
       ? insert_kthread_work+0x40/0x40
       ret_from_fork_nospec_begin+0x21/0x21
       ? insert_kthread_work+0x40/0x40
      

      Some servers never exit recovery, and others do but seem to be unable to service requests.

      Seen during the same lustre server update where we saw LU-15539, but this appears to be a separate issue.

      Patch stacks are:
      https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl
      https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl

      Attachments

        Issue Links

          Activity


            ofaaland Olaf Faaland added a comment -

            Thank you, Serguei. We'll add them to our stack and do some testing. We haven't successfully reproduced the original issue, so we'll only be able to tell you if we have unexpected new symptoms with LNet; but that's a start.

            ssmirnov Serguei Smirnov added a comment -

            Here's the link to the LU-14668 patch series ported to b2_15: https://review.whamcloud.com/51135/
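
            For reference, changes from a Gerrit series like that one can usually be pulled into a local tree as follows. This is a minimal sketch: the project path and the trailing patch-set number ("/1") are assumptions to adjust for the actual review.

             # Fetch and cherry-pick one change of the series from Whamcloud Gerrit
             # (51135 is the change number; "35" is its last two digits).
             git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/35/51135/1
             git cherry-pick FETCH_HEAD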

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            Yes, there were some distractions so I started on this only late last week. I'm still porting the patches. There's a chance I'll push the ports by the end of this week.

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment -

            > OK, I'll Port them to b2_15.

            Is this still being done?

            thanks
            hxing Xing Huang added a comment -

            OK, I'll Port them to b2_15.

            pjones Peter Jones added a comment -

            hxing could you please port the LU-14668 patches to b2_15?


            eaujames Etienne Aujames added a comment -

            The patches of LU-14668 seem to resolve this issue.

            client_import_add_conn() does not hang anymore because LNetPrimaryNID() does not wait for the end of node discovery (it does the discovery in the background).
            eaujames Etienne Aujames added a comment - - edited

            Hello,

            We observed the same kind of stack trace as Olaf when mounting new clients after losing a DDN controller (4 OSSs, with the targets mounted on the other controller):

             [<ffffffff98ba9ae6>] queued_spin_lock_slowpath+0xb/0xf
             [<ffffffff98bb8b00>] _raw_spin_lock+0x30/0x40
             [<ffffffffc0df7b51>] cfs_percpt_lock+0xc1/0x110 [libcfs]
             [<ffffffffc10637a0>] lnet_discover_peer_locked+0xa0/0x450 [lnet]
             [<ffffffff984cc540>] ? wake_up_atomic_t+0x30/0x30
             [<ffffffffc1063c25>] LNetPrimaryNID+0xd5/0x220 [lnet]
             [<ffffffffc15cf57e>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
             [<ffffffffc15c344c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
             [<ffffffffc1594292>] import_set_conn+0xb2/0x7e0 [ptlrpc]
             [<ffffffffc15949d3>] client_import_add_conn+0x13/0x20 [ptlrpc]
             [<ffffffffc1339e98>] class_add_conn+0x418/0x630 [obdclass]
             [<ffffffffc133bb31>] class_process_config+0x1a81/0x2830 [obdclass]
            

            I have done some testing on the master branch with discovery enabled, and I was able to reproduce this.
            I can only reproduce it with the client behind an LNet router or with message drop rules on the OSS (lctl net_drop_add).
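
            For reference, a minimal sketch of the message-drop reproduction: the OSS NID matches the llog example below, the client NID 10.0.2.10@tcp is a placeholder, and the -r 1 rate (drop every matching message) is an assumption.

             # On the OSS: drop LNet messages exchanged with the test client so
             # that discovery pings never complete (example NIDs).
             lctl net_drop_add -s 10.0.2.5@tcp -d 10.0.2.10@tcp -r 1
             lctl net_drop_add -s 10.0.2.10@tcp -d 10.0.2.5@tcp -r 1

             # Mount the client while the rules are active, then watch for the
             # LNetPrimaryNID()/lnet_discover_peer_locked() stacks.

             # Remove all drop rules afterwards.
             lctl net_drop_del -a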

            The client parses the MGS configuration llog "<fsname>-client" to initialize all the Lustre devices.
            e.g.: MGS client osc configuration:

            #35 (224)marker  24 (flags=0x01, v2.15.51.0) lustrefs-OST0000 'add osc' Fri Sep 30 09:57:22 2022-       
            #36 (080)add_uuid  nid=10.0.2.5@tcp(0x200000a000205)  0:  1:10.0.2.5@tcp                                
            #37 (128)attach    0:lustrefs-OST0000-osc  1:osc  2:lustrefs-clilov_UUID                                
            #38 (136)setup     0:lustrefs-OST0000-osc  1:lustrefs-OST0000_UUID  2:10.0.2.5@tcp                      
            #39 (080)add_uuid  nid=10.0.2.4@tcp(0x200000a000204)  0:  1:10.0.2.4@tcp                                
            #40 (104)add_conn  0:lustrefs-OST0000-osc  1:10.0.2.4@tcp                                               
            #41 (128)lov_modify_tgts add 0:lustrefs-clilov  1:lustrefs-OST0000_UUID  2:0  3:1                       
            #42 (224)END   marker  24 (flags=0x02, v2.15.51.0) lustrefs-OST0000 'add osc' Fri Sep 30 09:57:22 2022- 
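
            For reference, a dump like the one above can usually be obtained on the MGS with lctl llog_print; a minimal sketch, assuming the fsname "lustrefs" from the records above (device and llog names depend on the setup).

             # On the MGS node: print the client configuration llog for the
             # filesystem (records #35-#42 above come from such a dump).
             lctl --device MGS llog_print lustrefs-client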
            

            The records are parsed sequentially:

            • setup -> client_obd_setup(): initializes the device and the primary connection (10.0.2.5@tcp, client_import_add_conn())
            • add_conn -> client_import_add_conn(): initializes the failover connections (10.0.2.4@tcp, client_import_add_conn())

            The issue here is that client_import_add_conn() calls LNetPrimaryNID(), which does discovery to get the remote node's interfaces.
            The discovery thread starts by pinging the node, which takes transaction_timeout (+/- transaction_timeout/2).
            In our case, we lost 4 OSSs with 2 unreachable failover nodes each: 50s * 4 * 2 = 400s (max time is 600s).

            On the client side (2.12.7 LTS) we do not have https://review.whamcloud.com/#/c/43537/ ("LU-14566 lnet: Skip discovery in LNetPrimaryNID if DD disabled"), so to work around this issue we decrease transaction_timeout before mounting the client and then restore it: 2s * 4 * 2 = 16s (max time: 24s).
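
            For reference, a minimal sketch of that workaround, assuming lnetctl set transaction_timeout and the lnet_transaction_timeout module parameter are available on the client (LNet health, 2.12+); the MGS NID, fsname and values are placeholders.

             # Current LNet transaction timeout (50s in the example above).
             cat /sys/module/lnet/parameters/lnet_transaction_timeout

             # Shrink it so discovery pings to unreachable failover NIDs fail fast.
             lnetctl set transaction_timeout 2

             # Mount the client while the short timeout is in effect.
             mount -t lustre 10.0.2.2@tcp:/lustrefs /mnt/lustre

             # Restore the original value once the mount has completed.
             lnetctl set transaction_timeout 50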

            With https://review.whamcloud.com/39613 ("LU-10360 mgc: Use IR for client->MDS/OST connections"), I am not sure whether we still have to do discovery when parsing the configuration. But I have not played much with Multi-Rail, so someone would have to confirm or refute this.

            sarah Sarah Liu added a comment - similar one on 2.12.9 https://testing.whamcloud.com/test_sets/c858c157-7ecd-4c98-bfe3-1da2ce125f8c
            ofaaland Olaf Faaland added a comment -

            Thanks, Etienne


            eaujames Etienne Aujames added a comment -

            Hello,

            The CEA has seen this kind of symptom on servers with missing routes (asymmetrical routes), or on clients (at mount time) when a target has failed over to another node (with the original node not responding and its lnet module unloaded).
            At that time the CEA had LNet credit issues (starvation) in the client mount cases: mounting all the clients at the same time could result in LNet credit starvation.

            I have backported https://review.whamcloud.com/#/c/43537/ because the CEA doesn't use Multi-Rail.
            We also set drop_asym_route=1 on the servers to protect against route misconfiguration between clients and servers.

            Maybe https://review.whamcloud.com/45898/ ("LU-10931 lnet: handle unlink before send completes") could fix that issue with discovery on.
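
            For reference, a minimal sketch of the two settings mentioned in the comment above (the LU-14566 patch only makes LNetPrimaryNID() skip discovery when peer discovery is disabled); command and parameter names are as in recent lnetctl releases and may need checking against the local version.

             # Disable LNet peer discovery (single-rail setups only), so the
             # backported LU-14566 change makes LNetPrimaryNID() skip discovery.
             lnetctl set discovery 0

             # Drop messages arriving over an asymmetrical route on the servers.
             lnetctl set drop_asym_route 1

             # A persistent equivalent for discovery, e.g. in /etc/modprobe.d/lnet.conf:
             #   options lnet lnet_peer_discovery_disabled=1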

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 11
