Lustre / LU-15541

Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked()

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.12.7
    • 3.10.0-1160.45.1.1chaos.ch6.x86_64
      lustre-2.12.7_2.llnl
      3.10.0-1160.53.1.1chaos.ch6.x86_64
      lustre-2.12.8_6.llnl
      RHEL7.9
      zfs-0.7.11-9.8llnl
    • 3
    • 9223372036854775807

    Description

      We upgraded a Lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. Almost immediately after boot, clients began reporting soft lockups on the console, with stacks like this:

      2022-02-08 09:43:10 [1644342190.528916] 
      Call Trace:
       queued_spin_lock_slowpath+0xb/0xf
       _raw_spin_lock+0x30/0x40
       cfs_percpt_lock+0xc1/0x110 [libcfs]
       lnet_discover_peer_locked+0xa0/0x450 [lnet]
       ? wake_up_atomic_t+0x30/0x30
       LNetPrimaryNID+0xd5/0x220 [lnet]
       ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
       target_handle_connect+0x12f1/0x2b90 [ptlrpc]
       ? enqueue_task_fair+0x208/0x6c0
       ? check_preempt_curr+0x80/0xa0
       ? ttwu_do_wakeup+0x19/0x100
       tgt_request_handle+0x4fa/0x1570 [ptlrpc]
       ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
       ? __getnstimeofday64+0x3f/0xd0
       ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
       ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
       ? __wake_up_common_lock+0x91/0xc0
       ? sched_feat_set+0xf0/0xf0
       ptlrpc_main+0xc49/0x1c50 [ptlrpc]
       ? __switch_to+0xce/0x5a0
       ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
       kthread+0xd1/0xe0
       ? insert_kthread_work+0x40/0x40
       ret_from_fork_nospec_begin+0x21/0x21
       ? insert_kthread_work+0x40/0x40
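
      For comparing stacks across the attached vmcore-dmesg files, a trace like the one above can be reduced to its reliable frames. The following is a minimal sketch, not part of the original report: the helper name reliable_frames and the SAMPLE excerpt are illustrative assumptions. It relies only on the usual x86 console format, in which a leading "?" marks an address found on the stack that the unwinder could not confirm as a caller.

```python
import re

# Frame lines look like "func+0xOFF/0xSIZE [module]". A leading "?" marks an
# entry the unwinder could not confirm, so it is skipped here.
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def reliable_frames(trace_text):
    """Return (function, module) pairs for confirmed frames, innermost first.

    Frames without a "[module]" suffix are attributed to "kernel".
    (Hypothetical helper, written for illustration only.)
    """
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if m and not m.group("unreliable"):
            frames.append((m.group("func"), m.group("module") or "kernel"))
    return frames


# Excerpt of the trace from this report, used as sample input.
SAMPLE = """\
Call Trace:
 queued_spin_lock_slowpath+0xb/0xf
 _raw_spin_lock+0x30/0x40
 cfs_percpt_lock+0xc1/0x110 [libcfs]
 lnet_discover_peer_locked+0xa0/0x450 [lnet]
 ? wake_up_atomic_t+0x30/0x30
 LNetPrimaryNID+0xd5/0x220 [lnet]
 ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 target_handle_connect+0x12f1/0x2b90 [ptlrpc]
"""

if __name__ == "__main__":
    for func, module in reliable_frames(SAMPLE):
        print(f"{module:8s} {func}")
```

      Read from the bottom up, the confirmed frames show the connect handler in ptlrpc calling LNetPrimaryNID(), which enters lnet_discover_peer_locked() and then spins in cfs_percpt_lock().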
      

      Some servers never exit recovery, and others do but seem to be unable to service requests.

      This was seen during the same Lustre server update in which we saw LU-15539, but it appears to be a separate issue.

      Patch stacks are:
      https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl
      https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl

      Attachments

        1. vmcore-dmesg.copper1.txt
          993 kB
          Olaf Faaland
        2. vmcore-dmesg.copper2.txt
          569 kB
          Olaf Faaland

            People

              ssmirnov Serguei Smirnov
              ofaaland Olaf Faaland
              Votes: 0
              Watchers: 11