Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13218

conf-sanity test 98 hangs in socknal_sd00_01: lnet_nid2peerni_locked

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.12.4
    • None
    • 3
    • 9223372036854775807

    Description

      conf-sanity test_98 hangs for review-dne-zfs-part-3 for the patch https://review.whamcloud.com/37445/ for LU-12593. We've seen this test hang with the errors below once only so far.

      Looking at the hang at https://testing.whamcloud.com/test_sets/469de552-4869-11ea-b58e-52540065bddc, in the client1 console log we see LNET issues

      [19629.856282] Lustre: Unmounted lustre-client
      [19659.244701] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
      [19659.384715] Key type lgssc unregistered
      [19661.272912] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
      [19665.275764] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
      [19673.280446] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
      [19689.288850] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
      [19721.304636] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
      [19785.335214] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
      [19809.073214] INFO: task socknal_sd00_01:4386 blocked for more than 120 seconds.
      [19809.074558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [19809.075866] socknal_sd00_01 D ffff8e17ffd1ac80     0  4386      2 0x00000080
      [19809.077194] Call Trace:
      [19809.077665]  [<ffffffffb0581929>] schedule_preempt_disabled+0x29/0x70
      [19809.078735]  [<ffffffffb057f8b7>] __mutex_lock_slowpath+0xc7/0x1d0
      [19809.079762]  [<ffffffffb057ec8f>] mutex_lock+0x1f/0x2f
      [19809.080741]  [<ffffffffc0891e31>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
      [19809.082031]  [<ffffffffc087ed01>] lnet_parse+0x791/0x11f0 [lnet]
      [19809.083044]  [<ffffffffc0916838>] ksocknal_process_receive+0x498/0xde0 [ksocklnd]
      [19809.084277]  [<ffffffffc0917626>] ksocknal_scheduler+0x206/0xd50 [ksocklnd]
      [19809.085499]  [<ffffffffafec72e0>] ? wake_up_atomic_t+0x30/0x30
      [19809.086567]  [<ffffffffc0917420>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
      [19809.087734]  [<ffffffffafec61f1>] kthread+0xd1/0xe0
      [19809.088612]  [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40
      [19809.089668]  [<ffffffffb058dd37>] ret_from_fork_nospec_begin+0x21/0x21
      [19809.090850]  [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40
      [19913.395403] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
      [19929.146713] INFO: task socknal_sd00_01:4386 blocked for more than 120 seconds.
      

      Attachments

        Issue Links

          Activity

            People

              ashehata Amir Shehata (Inactive)
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: