[LU-12354] MDT threads stuck at ldlm_completion_ast

Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major
    • Affects Version: Lustre 2.10.7
    Description

      MDT under very high load, with many threads stuck in ldlm_completion_ast and
      osp_sync_process_queues.

      May be a duplicate of LU-11359.

      [2418705.962173] LNet: Service thread pid 34570 was inactive for 400.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [2418706.013418] Pid: 34570, comm: mdt01_236 3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 #1 SMP Wed Jan 30 00:30:34 UTC 2019
      [2418706.013419] Call Trace:
      [2418706.013433]  [<ffffffffa0beba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [2418706.034654]  [<ffffffffa0becb53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [2418706.034673]  [<ffffffffa1248d72>] mdt_object_local_lock+0x512/0xb00 [mdt]
      [2418706.034680]  [<ffffffffa12493be>] mdt_object_lock_internal+0x5e/0x300 [mdt]
      [2418706.034687]  [<ffffffffa124a164>] mdt_getattr_name_lock+0x8a4/0x1910 [mdt]
      [2418706.034694]  [<ffffffffa124b480>] mdt_intent_getattr+0x2b0/0x480 [mdt]
      [2418706.034701]  [<ffffffffa124746b>] mdt_intent_opc+0x1eb/0xaf0 [mdt]
      [2418706.034708]  [<ffffffffa124fd08>] mdt_intent_policy+0x138/0x320 [mdt]
      [2418706.034727]  [<ffffffffa0bd12cd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
      [2418706.034751]  [<ffffffffa0bfac13>] ldlm_handle_enqueue0+0xa83/0x1670 [ptlrpc]
      [2418706.034787]  [<ffffffffa0c80622>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [2418706.034818]  [<ffffffffa0c8428a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [2418706.034845]  [<ffffffffa0c2c6cb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [2418706.034870]  [<ffffffffa0c306b2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [2418706.034873]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      [2418706.034875]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      [2418706.034894]  [<ffffffffffffffff>] 0xffffffffffffffff
      [2418706.034897] LustreError: dumping log to /tmp/lustre-log.1557763688.34570
      
      PID: 32153  TASK: ffff883ed28d5ee0  CPU: 10  COMMAND: "osp-syn-153-0"
       #0 [ffff883ed292f908] __schedule at ffffffff816946af
       #1 [ffff883ed292f990] schedule at ffffffff81694cd9
       #2 [ffff883ed292f9a0] osp_sync_process_queues at ffffffffa13f8197 [osp]
       #3 [ffff883ed292fab0] llog_process_thread at ffffffffa09ad8bb [obdclass]
       #4 [ffff883ed292fb78] llog_process_or_fork at ffffffffa09ae5cc [obdclass]
       #5 [ffff883ed292fbc0] llog_cat_process_cb at ffffffffa09b3dad [obdclass]
       #6 [ffff883ed292fc30] llog_process_thread at ffffffffa09ad8bb [obdclass]
       #7 [ffff883ed292fcf8] llog_process_or_fork at ffffffffa09ae5cc [obdclass]
       #8 [ffff883ed292fd40] llog_cat_process_or_fork at ffffffffa09b2e49 [obdclass]
       #9 [ffff883ed292fdb8] llog_cat_process at ffffffffa09b2f7e [obdclass]
      #10 [ffff883ed292fdd8] osp_sync_thread at ffffffffa13f63af [osp]
      #11 [ffff883ed292fec8] kthread at ffffffff810b1131
      #12 [ffff883ed292ff50] ret_from_fork at ffffffff816a14f7
      

      And

      PID: 33442  TASK: ffff881fcfa38000  CPU: 21  COMMAND: "mdt01_073"
       #0 [ffff883eb683f718] __schedule at ffffffff816946af
       #1 [ffff883eb683f7a8] schedule at ffffffff81694cd9
       #2 [ffff883eb683f7b8] schedule_timeout at ffffffff81692574
       #3 [ffff883eb683f860] ldlm_completion_ast at ffffffffa0beba11 [ptlrpc]
       #4 [ffff883eb683f900] ldlm_cli_enqueue_local at ffffffffa0becb53 [ptlrpc]
       #5 [ffff883eb683f998] mdt_object_local_lock at ffffffffa1248d72 [mdt]
       #6 [ffff883eb683fa48] mdt_object_lock_internal at ffffffffa12493be [mdt]
       #7 [ffff883eb683fa98] mdt_getattr_name_lock at ffffffffa124a164 [mdt]
       #8 [ffff883eb683fb20] mdt_intent_getattr at ffffffffa124b480 [mdt]
       #9 [ffff883eb683fb60] mdt_intent_opc at ffffffffa124746b [mdt]
      #10 [ffff883eb683fbc0] mdt_intent_policy at ffffffffa124fd08 [mdt]
      #11 [ffff883eb683fbf8] ldlm_lock_enqueue at ffffffffa0bd12cd [ptlrpc]
      #12 [ffff883eb683fc50] ldlm_handle_enqueue0 at ffffffffa0bfac13 [ptlrpc]
      #13 [ffff883eb683fce0] tgt_enqueue at ffffffffa0c80622 [ptlrpc]
      #14 [ffff883eb683fd00] tgt_request_handle at ffffffffa0c8428a [ptlrpc]
      #15 [ffff883eb683fd48] ptlrpc_server_handle_request at ffffffffa0c2c6cb [ptlrpc]
      #16 [ffff883eb683fde8] ptlrpc_main at ffffffffa0c306b2 [ptlrpc]
      #17 [ffff883eb683fec8] kthread at ffffffff810b1131
      #18 [ffff883eb683ff50] ret_from_fork at ffffffff816a14f7
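
      For reference, backtraces like the ones above can usually be gathered either on a live system
      (SysRq dumps blocked tasks to the kernel log) or from a crash dump with the crash utility; a
      minimal sketch, with assumed paths:

      # live: dump all blocked (D-state) tasks to dmesg
      echo w > /proc/sysrq-trigger
      dmesg | tail -n 300

      # post-mortem: backtraces of every task from a vmcore
      crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
      crash> foreach bt > /tmp/foreach-bt.txt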
      

      Attachments

      Issue Links

      Activity

            mhanafi Mahmoud Hanafi added a comment -

            We can close this for now.
            mhanafi Mahmoud Hanafi added a comment -

            # lctl get_param at_max
            at_max=600
            # lctl get_param timeout
            timeout=100
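
            For reference, these tunables can usually be adjusted at runtime with lctl set_param and made
            persistent with lctl conf_param on the MGS; a minimal sketch (values and the nbp2 fsname are
            taken from this ticket, the commands are illustrative, not a recommendation from the ticket):

            # runtime change, on each server
            lctl set_param timeout=100 at_max=600

            # persistent change, run on the MGS node
            lctl conf_param nbp2.sys.timeout=100
            lctl conf_param nbp2.sys.at_max=600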
            hongchao.zhang Hongchao Zhang added a comment - - edited

            According to the logs, an IBITS lock on the root directory is held by the client at 10.151.10.140, but a network
            error occurred when sending the blocking callback to that client:

            [2418411.342868] LNetError: 21360:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 5 seconds
            [2418411.373829] LNetError: 21360:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 10.151.10.140@o2ib (207): c: 54, oc: 0, rc: 63
            [2418411.411904] Lustre: 33336:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1557763288/real 1557763393]  req@ffff880139dd6600 x1630930117996992/t0(0) o104->nbp2-MDT0000@10.151.10.140@o2ib:15/16 lens 296/224 e 0 to 1 dl 1557763743 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
            

            After 407 seconds the lock waiting timer expired and the lock was released; until then, all other lock enqueue
            requests were blocked in ldlm_completion_ast waiting for their locks to be granted.

            [2418713.129522] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 407s: evicting client at 10.151.10.140@o2ib  ns: mdt-nbp2-MDT0000_UUID lock: ffff8810b458a200/0x843e968fdb2cd3cc lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13 rrc: 2189 type: IBT flags: 0x60200400000020 nid: 10.151.10.140@o2ib remote: 0x7e3f2695c7cccb4e expref: 160144 pid: 33420 timeout: 6713595267 lvb_type: 0
            

            What is the output of "lctl get_param at_max" and "lctl get_param timeout"?
            This issue can be mitigated by reducing the timeout, but that could cause some operations to fail if they take
            longer to be processed, especially under high load.
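
            For reference, when a single unresponsive client is holding a lock like this, an administrator can
            usually clear it sooner by evicting that client from the MDT by hand; a minimal sketch (target name
            and NID taken from the logs above, exact syntax assumed for this Lustre version):

            # on the MDS: evict the client that holds the root-directory lock
            lctl set_param mdt.nbp2-MDT0000.evict_client=nid:10.151.10.140@o2ib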


            degremoa Aurelien Degremont (Inactive) added a comment -

            @mhanafi, there are plenty of stack traces in your vmcore logs related to stuck read I/O, like:

            [2415376.231996] Pid: 35506, comm: mdt_rdpg00_051 3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 #1 SMP Wed Jan 30 00:30:34 UTC 2019
            [2415376.231998] Call Trace:
            [2415376.232023] [<ffffffff8123499a>] __wait_on_buffer+0x2a/0x30
            [2415376.249546] [<ffffffffa0ee553c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
            [2415376.249554] [<ffffffffa0ed3e7a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
            [2415376.249560] [<ffffffffa0ed4270>] htree_dirblock_to_tree+0x40/0x190 [ldiskfs]
            [2415376.249566] [<ffffffffa0ed6451>] ldiskfs_htree_fill_tree+0x201/0x2f0 [ldiskfs]

            Are you sure you do not have any disk issues? This could be a side effect, but you should first verify that there is no disk issue preventing Lustre threads from doing I/O.
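
            For reference, a few quick checks for backend storage trouble on the MDS; a minimal sketch (the
            device name is a placeholder, adapt it to the actual MDT block device):

            # kernel-level I/O errors
            dmesg | grep -iE 'i/o error|blk_update_request|scsi' | tail

            # latency and utilization of the MDT device
            iostat -x 5 3

            # drive health, if the MDT sits on a plain SCSI/SATA disk
            smartctl -a /dev/sdX | grep -iE 'error|reallocated|pending'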


            mhanafi Mahmoud Hanafi added a comment -

            Attaching dmesg. We don't have /tmp/lustre-log.1557763688.34570; it was deleted. vmcore-dmesg.txt.gz


            pfarrell Patrick Farrell (Inactive) added a comment -

            This is not related to LU-11359, as that is a DOM-specific issue.

            It would be good to have dmesg from the MDS and also the dumped lustre log(s), like the one listed here:

            [2418706.034897] LustreError: dumping log to /tmp/lustre-log.1557763688.34570
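
            For reference, those dump files are binary Lustre debug logs and can normally be converted to text
            with lctl before being attached; a minimal sketch (paths assumed):

            lctl debug_file /tmp/lustre-log.1557763688.34570 /tmp/lustre-log.1557763688.34570.txt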
            
            pjones Peter Jones added a comment -

            Hongchao

            Could you please investigate?

            Thanks

            Peter


            People

              Assignee:
              hongchao.zhang Hongchao Zhang
              Reporter:
              mhanafi Mahmoud Hanafi
              Votes:
              0
              Watchers:
              8
