[LU-15616] sanity-lnet test_226: Timeout occurred after 112 minutes, last suite running was sanity-lnet

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Severity: 3

    Description

      This issue was created by maloo for Chris Horn <chris.horn@hpe.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/01cd6eae-ac03-48f6-980d-3977224dfaad

      test_226 failed with the following error:

      Timeout occurred after 112 minutes, last suite running was sanity-lnet
      

      LNetNIFini() and the discovery thread appear to have hit a deadlock:

      [Thu Mar  3 19:42:41 2022] INFO: task lnet_discovery:424118 blocked for more than 120 seconds.
      [Thu Mar  3 19:42:41 2022]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_lustre.x86_64 #1
      [Thu Mar  3 19:42:41 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Mar  3 19:42:41 2022] lnet_discovery  D    0 424118      2 0x80004080
      [Thu Mar  3 19:42:41 2022] Call Trace:
      [Thu Mar  3 19:42:41 2022]  __schedule+0x2c4/0x700
      [Thu Mar  3 19:42:41 2022]  schedule+0x38/0xa0
      [Thu Mar  3 19:42:41 2022]  schedule_preempt_disabled+0xa/0x10
      [Thu Mar  3 19:42:41 2022]  __mutex_lock.isra.5+0x2d0/0x4a0
      [Thu Mar  3 19:42:41 2022]  lnet_peer_discovery+0x929/0x16c0 [lnet]
      [Thu Mar  3 19:42:41 2022]  ? finish_wait+0x80/0x80
      [Thu Mar  3 19:42:41 2022]  ? lnet_peer_merge_data+0xff0/0xff0 [lnet]
      [Thu Mar  3 19:42:41 2022]  kthread+0x112/0x130
      [Thu Mar  3 19:42:41 2022]  ? kthread_flush_work_fn+0x10/0x10
      [Thu Mar  3 19:42:41 2022]  ret_from_fork+0x35/0x40
      [Thu Mar  3 19:42:41 2022] INFO: task lnetctl:428295 blocked for more than 120 seconds.
      [Thu Mar  3 19:42:41 2022]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_lustre.x86_64 #1
      [Thu Mar  3 19:42:41 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Mar  3 19:42:41 2022] lnetctl         D    0 428295 428283 0x00004080
      [Thu Mar  3 19:42:41 2022] Call Trace:
      [Thu Mar  3 19:42:41 2022]  __schedule+0x2c4/0x700
      [Thu Mar  3 19:42:41 2022]  ? __wake_up_common_lock+0x89/0xc0
      [Thu Mar  3 19:42:41 2022]  schedule+0x38/0xa0
      [Thu Mar  3 19:42:41 2022]  lnet_peer_discovery_stop+0x112/0x260 [lnet]
      [Thu Mar  3 19:42:41 2022]  ? finish_wait+0x80/0x80
      [Thu Mar  3 19:42:41 2022]  LNetNIFini+0x5e/0x100 [lnet]
      [Thu Mar  3 19:42:41 2022]  lnet_ioctl+0x220/0x260 [lnet]
      [Thu Mar  3 19:42:41 2022]  notifier_call_chain+0x47/0x70
      [Thu Mar  3 19:42:41 2022]  blocking_notifier_call_chain+0x3e/0x60
      [Thu Mar  3 19:42:41 2022]  libcfs_psdev_ioctl+0x346/0x590 [libcfs]
      [Thu Mar  3 19:42:41 2022]  do_vfs_ioctl+0xa4/0x640
      [Thu Mar  3 19:42:41 2022]  ? syscall_trace_enter+0x1d3/0x2c0
      [Thu Mar  3 19:42:41 2022]  ksys_ioctl+0x60/0x90
      [Thu Mar  3 19:42:41 2022]  __x64_sys_ioctl+0x16/0x20
      [Thu Mar  3 19:42:41 2022]  do_syscall_64+0x5b/0x1a0
      [Thu Mar  3 19:42:41 2022]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      

      LNetNIFini() holds ln_api_mutex while it waits for the discovery thread to stop, but the discovery thread needs ln_api_mutex to make progress, so neither side can proceed.
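
      To make the cycle concrete, here is a minimal userspace sketch of the
      same pattern in C with pthreads (illustrative only, not Lustre code;
      all names are invented): the "fini" path takes a mutex and then waits
      for a worker thread that is itself blocked trying to take that mutex.

      /* deadlock_sketch.c - build with: cc deadlock_sketch.c -lpthread */
      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER;

      /* Stands in for lnet_peer_discovery(): needs the mutex to progress. */
      static void *discovery_thread(void *arg)
      {
              pthread_mutex_lock(&api_mutex);   /* blocks: "fini" holds it */
              /* ... discovery work would happen here ... */
              pthread_mutex_unlock(&api_mutex);
              return NULL;
      }

      int main(void)
      {
              pthread_t tid;

              /* Stands in for LNetNIFini(): take the mutex first ... */
              pthread_mutex_lock(&api_mutex);
              pthread_create(&tid, NULL, discovery_thread, NULL);
              /* ... then wait for the thread to stop. Neither side can
               * ever proceed, exactly like the stacks above. */
              pthread_join(tid, NULL);
              pthread_mutex_unlock(&api_mutex);
              return 0;
      }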

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lnet test_226 - Timeout occurred after 112 minutes, last suite running was sanity-lnet

    Activity

            pjones Peter Jones added a comment -

            Landed for 2.15


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46727/
            Subject: LU-15616 lnet: ln_api_mutex deadlocks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 22de0bd145b649768b16dd42559d326af3c13200

            hornc Chris Horn added a comment -

            One more example of this with 2.12:

            <root@ncn-w001>ps axOT | grep -Pi '^\s*\S+\s+\S+\s+D'
            1117335 ?        D      0:00 [kiblnd_sd_02_00]
            1142272 ?        D      0:00 lnetctl lnet unconfigure
            <root@ncn-w001>cat /proc/1117335/stack
            [<0>] lnet_nid2peerni_locked+0x53/0x170 [lnet]
            [<0>] lnet_parse+0x4c9/0x12e0 [lnet]
            [<0>] kiblnd_handle_rx+0x1bd/0x630 [ko2iblnd]
            [<0>] kiblnd_scheduler+0xfc8/0x1020 [ko2iblnd]
            [<0>] kthread+0x10d/0x130
            [<0>] ret_from_fork+0x22/0x40
            <root@ncn-w001>cat /proc/1142272/stack
            [<0>] kiblnd_shutdown+0xfa/0x490 [ko2iblnd]
            [<0>] lnet_shutdown_lndni+0x265/0x460 [lnet]
            [<0>] lnet_shutdown_lndnet+0x67/0xd0 [lnet]
            [<0>] lnet_shutdown_lndnets+0x119/0x2d0 [lnet]
            [<0>] LNetNIFini+0x8b/0x100 [lnet]
            [<0>] lnet_ioctl+0x222/0x250 [lnet]
            [<0>] notifier_call_chain+0x47/0x70
            [<0>] blocking_notifier_call_chain+0x3e/0x60
            [<0>] libcfs_ioctl+0xab/0x4a0 [libcfs]
            [<0>] libcfs_psdev_ioctl+0xba/0xd0 [libcfs]
            [<0>] do_vfs_ioctl+0xa0/0x680
            [<0>] ksys_ioctl+0x70/0x80
            [<0>] __x64_sys_ioctl+0x16/0x20
            [<0>] do_syscall_64+0x5b/0x1e0
            [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
            <root@ncn-w001>
            

            "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46727
            Subject: LU-15616 lnet: ln_api_mutex deadlocks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 60722c29610275f533b4ca8e28fa9852c570a52d

            hornc Chris Horn added a comment -

            LU-12233 was another instance of a deadlock during shutdown.

            I've seen other variations of this. For example, here is an LNet shutdown initiated by lustre_rmmod:

            crash_x86_64> bt
            PID: 524217  TASK: ffff8fc35f65af80  CPU: 2   COMMAND: "rmmod"
             #0 [ffffa02f069abc68] __schedule at ffffffffae54e1d4
             #1 [ffffa02f069abd00] schedule at ffffffffae54e648
             #2 [ffffa02f069abd10] schedule_timeout at ffffffffae551cd3
             #3 [ffffa02f069abda8] kiblnd_shutdown at ffffffffc0fa67e4 [ko2iblnd]
             #4 [ffffa02f069abe28] lnet_shutdown_lndni at ffffffffc0ef0e43 [lnet]
             #5 [ffffa02f069abe60] lnet_shutdown_lndnet at ffffffffc0ef10ac [lnet]
             #6 [ffffa02f069abe88] lnet_shutdown_lndnets at ffffffffc0ef2029 [lnet]
             #7 [ffffa02f069abed0] LNetNIFini at ffffffffc0ef3ccb [lnet]
             #8 [ffffa02f069abed8] ptlrpc_exit at ffffffffc14e772e [ptlrpc]
             #9 [ffffa02f069abee0] __x64_sys_delete_module at ffffffffadd8a0f9
            #10 [ffffa02f069abf38] do_syscall_64 at ffffffffadc0420b
            #11 [ffffa02f069abf50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
                RIP: 00007f91fc94e72b  RSP: 00007ffcdb74c588  RFLAGS: 00000206
                RAX: ffffffffffffffda  RBX: 0000557518d633d0  RCX: 00007f91fc94e72b
                RDX: 000000000000000a  RSI: 0000000000000800  RDI: 0000557518d63438
                RBP: 0000000000000000   R8: 00007ffcdb74b501   R9: 0000000000000000
                R10: 00007f91fc9c2200  R11: 0000000000000206  R12: 00007ffcdb74c7b0
                R13: 00007ffcdb74e5f9  R14: 0000557518d612a0  R15: 0000557518d633d0
                ORIG_RAX: 00000000000000b0  CS: 0033  SS: 002b
            crash_x86_64>
            

            LNetNIFini() takes ln_api_mutex:

            LNetNIFini(void)
            {
                    mutex_lock(&the_lnet.ln_api_mutex);
            

            The shutdown seems to be stuck in kiblnd_shutdown():

            *hornc@cflosbld09 LUS-10766 $ grep kiblnd_shutdown dk.log.fmt
            00000800:00000200:1.0:1645008262.725728:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:1.0:1645008265.797688:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:1.0:1645008272.965750:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:1.0:1645008288.325727:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:1.0:1645008320.069708:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:1.0:1645008384.581724:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:2.0:1645008514.629693:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:2.0:1645008775.749732:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:8.0:1645009299.013721:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:8.0:1645010346.565723:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:18.0:1645012442.693692:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:4.0:1645016635.973730:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            00000800:00000200:5.0:1645025023.557725:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
            *hornc@cflosbld09 LUS-10766 $
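
            The doubling gaps between those timestamps show kiblnd_shutdown()
            backing off and retrying for hours without making progress. A
            rough userspace analogue of that wait loop (pthreads; the names,
            backoff, and mechanism are invented, the real code uses LNet's
            own wait primitives) shows why this wedges the caller: if the
            peer count never drains, the loop never exits, and whatever mutex
            the caller holds is never released.

            /* wait_sketch.c - build with: cc wait_sketch.c -lpthread */
            #include <pthread.h>
            #include <stdio.h>
            #include <time.h>

            static pthread_mutex_t peer_lock = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  peer_cond = PTHREAD_COND_INITIALIZER;
            static int peer_count = 1;      /* nothing ever decrements it */

            static void wait_for_peers(const char *ni)
            {
                    struct timespec deadline;
                    int backoff = 1;        /* seconds; doubles like the log */

                    pthread_mutex_lock(&peer_lock);
                    while (peer_count > 0) {
                            printf("%s: waiting for %d peers to disconnect\n",
                                   ni, peer_count);
                            clock_gettime(CLOCK_REALTIME, &deadline);
                            deadline.tv_sec += backoff;
                            pthread_cond_timedwait(&peer_cond, &peer_lock,
                                                   &deadline);
                            if (backoff < 4096)
                                    backoff *= 2;
                    }
                    pthread_mutex_unlock(&peer_lock);
            }

            int main(void)
            {
                    wait_for_peers("10.12.0.50@o2ib4000"); /* never returns */
                    return 0;
            }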
            

            The hung tasks in the kernel log come from someone running lctl commands:

            crash_x86_64> ps | grep lctl
              620605  620604   6  ffff8fc2bf4c5f00  UN   0.0   32392   2656  lctl
              621757  621756   5  ffff8fad312797c0  UN   0.0   32392   2828  lctl
            crash_x86_64> bt 620605
            PID: 620605  TASK: ffff8fc2bf4c5f00  CPU: 6   COMMAND: "lctl"
             #0 [ffffa02f082c7c38] __schedule at ffffffffae54e1d4
             #1 [ffffa02f082c7cd0] schedule at ffffffffae54e648
             #2 [ffffa02f082c7ce0] schedule_preempt_disabled at ffffffffae54e98a
             #3 [ffffa02f082c7ce8] __mutex_lock at ffffffffae550690
             #4 [ffffa02f082c7d78] LNetNIInit at ffffffffc0ef5634 [lnet]
             #5 [ffffa02f082c7dd0] lnet_ioctl at ffffffffc0f16fdf [lnet]
             #6 [ffffa02f082c7de8] notifier_call_chain at ffffffffadd06197
             #7 [ffffa02f082c7e18] blocking_notifier_call_chain at ffffffffadd068be
             #8 [ffffa02f082c7e40] libcfs_psdev_ioctl at ffffffffc0c20cb6 [libcfs]
             #9 [ffffa02f082c7e80] do_vfs_ioctl at ffffffffadf2e084
            #10 [ffffa02f082c7ef8] ksys_ioctl at ffffffffadf2e6c0
            #11 [ffffa02f082c7f30] __x64_sys_ioctl at ffffffffadf2e706
            #12 [ffffa02f082c7f38] do_syscall_64 at ffffffffadc0420b
            #13 [ffffa02f082c7f50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
                RIP: 00007fa4dc43563b  RSP: 00007ffee2603018  RFLAGS: 00000206
                RAX: ffffffffffffffda  RBX: 00007fa4ddbb8b20  RCX: 00007fa4dc43563b
                RDX: 00007ffee2603060  RSI: 00000000c0086532  RDI: 0000000000000003
                RBP: 00000000c0086532   R8: 00007fa4ddde5960   R9: 0000000000000003
                R10: 000000000000000f  R11: 0000000000000206  R12: 000055fd4563253b
                R13: 00007ffee2603060  R14: 0000000000000000  R15: 0000000000000000
                ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
            crash_x86_64> bt 621757
            PID: 621757  TASK: ffff8fad312797c0  CPU: 5   COMMAND: "lctl"
             #0 [ffffa02f06727c38] __schedule at ffffffffae54e1d4
             #1 [ffffa02f06727cd0] schedule at ffffffffae54e648
             #2 [ffffa02f06727ce0] schedule_preempt_disabled at ffffffffae54e98a
             #3 [ffffa02f06727ce8] __mutex_lock at ffffffffae550690
             #4 [ffffa02f06727d78] LNetNIInit at ffffffffc0ef5634 [lnet]
             #5 [ffffa02f06727dd0] lnet_ioctl at ffffffffc0f16fdf [lnet]
             #6 [ffffa02f06727de8] notifier_call_chain at ffffffffadd06197
             #7 [ffffa02f06727e18] blocking_notifier_call_chain at ffffffffadd068be
             #8 [ffffa02f06727e40] libcfs_psdev_ioctl at ffffffffc0c20cb6 [libcfs]
             #9 [ffffa02f06727e80] do_vfs_ioctl at ffffffffadf2e084
            #10 [ffffa02f06727ef8] ksys_ioctl at ffffffffadf2e6c0
            #11 [ffffa02f06727f30] __x64_sys_ioctl at ffffffffadf2e706
            #12 [ffffa02f06727f38] do_syscall_64 at ffffffffadc0420b
            #13 [ffffa02f06727f50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
                RIP: 00007f881e90c63b  RSP: 00007ffca55ea0f8  RFLAGS: 00000202
                RAX: ffffffffffffffda  RBX: 00007f882008fb20  RCX: 00007f881e90c63b
                RDX: 00007ffca55ea140  RSI: 00000000c0086532  RDI: 0000000000000003
                RBP: 00000000c0086532   R8: 00007f88202bc960   R9: 0000000000000003
                R10: 000000000000000f  R11: 0000000000000202  R12: 0000561639d7753b
                R13: 00007ffca55ea140  R14: 0000000000000000  R15: 0000000000000000
                ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
            crash_x86_64>
            

            The lctl commands attempt to acquire ln_api_mutex. Since the mutex is held by the shutdown path, and shutdown itself is stuck, the lctl commands hang as well.
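
            One general way to break this class of cycle (a sketch under
            assumptions, not the actual LU-15616 patch) is for the shutdown
            path to set a stop flag and release the mutex before waiting for
            the worker, so the worker can still take the mutex, observe the
            flag, and exit:

            #include <pthread.h>
            #include <stdatomic.h>

            static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER;
            static atomic_bool shutting_down;
            static pthread_t discovery_tid;

            static void *discovery_thread(void *arg)
            {
                    /* A real thread would sleep between passes; elided. */
                    while (!atomic_load(&shutting_down)) {
                            pthread_mutex_lock(&api_mutex);
                            /* Re-check under the lock and bail out early. */
                            if (atomic_load(&shutting_down)) {
                                    pthread_mutex_unlock(&api_mutex);
                                    break;
                            }
                            /* ... discovery work ... */
                            pthread_mutex_unlock(&api_mutex);
                    }
                    return NULL;
            }

            static void api_fini(void)
            {
                    pthread_mutex_lock(&api_mutex);
                    /* ... teardown that genuinely needs the mutex ... */
                    atomic_store(&shutting_down, true);
                    pthread_mutex_unlock(&api_mutex); /* drop it first ... */
                    pthread_join(discovery_tid, NULL); /* ... then wait */
            }

            int main(void)
            {
                    pthread_create(&discovery_tid, NULL, discovery_thread,
                                   NULL);
                    api_fini();             /* returns; no deadlock now */
                    return 0;
            }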


            People

              Assignee: hornc Chris Horn
              Reporter: maloo Maloo
