Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked()

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.12.7
    • 3.10.0-1160.45.1.1chaos.ch6.x86_64
      lustre-2.12.7_2.llnl
      3.10.0-1160.53.1.1chaos.ch6.x86_64
      lustre-2.12.8_6.llnl
      RHEL7.9
      zfs-0.7.11-9.8llnl
    • 3

    Description

      We upgraded a Lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. Almost immediately after boot, clients began reporting soft lockups on the console, with stacks like this:

      2022-02-08 09:43:10 [1644342190.528916] 
      Call Trace:
       queued_spin_lock_slowpath+0xb/0xf
       _raw_spin_lock+0x30/0x40
       cfs_percpt_lock+0xc1/0x110 [libcfs]
       lnet_discover_peer_locked+0xa0/0x450 [lnet]
       ? wake_up_atomic_t+0x30/0x30
       LNetPrimaryNID+0xd5/0x220 [lnet]
       ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
       target_handle_connect+0x12f1/0x2b90 [ptlrpc]
       ? enqueue_task_fair+0x208/0x6c0
       ? check_preempt_curr+0x80/0xa0
       ? ttwu_do_wakeup+0x19/0x100
       tgt_request_handle+0x4fa/0x1570 [ptlrpc]
       ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
       ? __getnstimeofday64+0x3f/0xd0
       ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
       ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
       ? __wake_up_common_lock+0x91/0xc0
       ? sched_feat_set+0xf0/0xf0
       ptlrpc_main+0xc49/0x1c50 [ptlrpc]
       ? __switch_to+0xce/0x5a0
       ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
       kthread+0xd1/0xe0
       ? insert_kthread_work+0x40/0x40
       ret_from_fork_nospec_begin+0x21/0x21
       ? insert_kthread_work+0x40/0x40
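
In Linux kernel stack dumps, frames prefixed with "?" are addresses found on the stack that the unwinder could not confirm as part of the call chain. As a reading aid, here is a minimal Python sketch that keeps only the confirmed frames from an excerpt of the trace above. The helper name reliable_frames and the regular expression are illustrative assumptions, not part of any Lustre or kernel tooling.

```python
import re

# Excerpt of the trace from this ticket (timestamps and header removed).
TRACE = """\
 queued_spin_lock_slowpath+0xb/0xf
 _raw_spin_lock+0x30/0x40
 cfs_percpt_lock+0xc1/0x110 [libcfs]
 lnet_discover_peer_locked+0xa0/0x450 [lnet]
 ? wake_up_atomic_t+0x30/0x30
 LNetPrimaryNID+0xd5/0x220 [lnet]
 ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 target_handle_connect+0x12f1/0x2b90 [ptlrpc]
 ? enqueue_task_fair+0x208/0x6c0
"""

# Matches "[? ]symbol+0xOFF/0xSIZE [module]"; the module part is optional.
FRAME = re.compile(
    r"^\s*(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)


def reliable_frames(trace):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        m = FRAME.match(line)
        if m and m.group(1) is None:
            # Frames without a module tag belong to the core kernel image.
            frames.append((m.group(2), m.group(3) or "kernel"))
    return frames


if __name__ == "__main__":
    for symbol, module in reliable_frames(TRACE):
        print(f"{module:8s} {symbol}")
```

Read from the bottom up, the confirmed frames show a connect request (target_handle_connect) calling ptlrpc_connection_get, then LNetPrimaryNID, then lnet_discover_peer_locked, which is spinning in cfs_percpt_lock.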
      

      Some servers never exit recovery, and others do but seem to be unable to service requests.

      This was seen during the same Lustre server update in which we saw LU-15539, but it appears to be a separate issue.

      Patch stacks are:
      https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl
      https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl

          Activity

            [LU-15541] Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked()

            ofaaland Olaf Faaland added a comment - We have not seen this issue since landing the patches.
            pjones Peter Jones added a comment - As per Olaf - this is resolved
            pjones Peter Jones added a comment - The patch series has merged to b2_15 for 2.15.4

            ofaaland Olaf Faaland added a comment - Thank you, Serguei. We'll add them to our stack and do some testing. We haven't successfully reproduced the original issue, so we'll only be able to tell you if we have unexpected new symptoms with LNet; but that's a start.

            ssmirnov Serguei Smirnov added a comment - Here's the link to the LU-14668 patch series ported to b2_15: https://review.whamcloud.com/51135/

            ssmirnov Serguei Smirnov added a comment - Hi Olaf, Yes, there were some distractions so I started on this only late last week. I'm still porting the patches. There's a chance I'll push the ports by the end of this week. Thanks, Serguei.

            ofaaland Olaf Faaland added a comment - > OK, I'll Port them to b2_15. Is this still being done? thanks
            hxing Xing Huang added a comment - OK, I'll Port them to b2_15.
            pjones Peter Jones added a comment - hxing, could you please port the LU-14668 patches to b2_15?

            eaujames Etienne Aujames added a comment - The LU-14668 patches seem to resolve this issue. client_import_add_conn() no longer hangs because LNetPrimaryNID() does not wait for node discovery to finish (it performs the discovery in the background).
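
To illustrate the difference described in the comment above, here is a toy Python model contrasting a lookup that waits for discovery with one that returns the currently known NID and lets discovery finish in the background. This is not LNet code: ToyPeer, primary_nid_blocking, primary_nid_background, the "primary-of-" naming, and the delay value are all hypothetical and only sketch the general idea.

```python
import threading
import time

DISCOVERY_DELAY = 0.2  # hypothetical time for a peer to answer discovery


class ToyPeer:
    """Stand-in for a peer whose primary NID is learned by discovery."""

    def __init__(self, nid):
        self.nid = nid
        self.primary_nid = nid  # best-known value until discovery completes
        self.discovered = threading.Event()

    def discover(self):
        time.sleep(DISCOVERY_DELAY)  # stands in for the network round trip
        self.primary_nid = "primary-of-" + self.nid
        self.discovered.set()


def primary_nid_blocking(peer):
    """The caller is stuck here until discovery has finished."""
    peer.discover()
    return peer.primary_nid


def primary_nid_background(peer):
    """Return the currently known NID at once; discovery runs in a thread."""
    known = peer.primary_nid
    worker = threading.Thread(target=peer.discover, daemon=True)
    worker.start()
    return known, worker


if __name__ == "__main__":
    peer = ToyPeer("10.0.0.1@tcp")
    nid, worker = primary_nid_background(peer)
    print("returned immediately with:", nid)
    worker.join()
    print("after discovery:", peer.primary_nid)
```

In the blocking variant, every request handler that needs the primary NID of an unresponsive peer stalls for the full discovery time; in the background variant the handler proceeds with the NID it already has.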

            People

              ssmirnov Serguei Smirnov
              ofaaland Olaf Faaland