Lustre / LU-6527

Journal commit callback optimization


Details

    • Improvement
    • Resolution: Fixed
    • Major
    • Lustre 2.9.0
    • None

    Description

      We hit the following soft lockups on the MDS (the MDS had an internal journal with a commit interval of 5):

      Feb 18 14:03:28 snx11127n003 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [jbd2/md66-8:132029]
      ...
      Feb 18 14:03:28 snx11127n003 kernel: Pid: 132029, comm: jbd2/md66-8 Not tainted 2.6.32-431.17.1.x2.0.47.x86_64 #1 Intel Corporation S2600JF/S2600JF
      Feb 18 14:03:28 snx11127n003 kernel: RIP: 0010:[<ffffffffa08c9e99>]  [<ffffffffa08c9e99>] ptlrpc_commit_replies+0xb9/0x320 [ptlrpc]
      Feb 18 14:03:28 snx11127n003 kernel: RSP: 0018:ffff880791763c20  EFLAGS: 00000206
      Feb 18 14:03:28 snx11127n003 kernel: RAX: ffff880d260c6030 RBX: ffff880791763c80 RCX: 0000000000000000
      Feb 18 14:03:28 snx11127n003 kernel: RDX: ffff880bc7b01030 RSI: ffff880d1974f880 RDI: ffff881014c32928
      Feb 18 14:03:28 snx11127n003 kernel: RBP: ffffffff8100bb8e R08: 0000000000000002 R09: 5a5a5a5a5a5a5a5a
      Feb 18 14:03:28 snx11127n003 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff88083fcb02c0
      Feb 18 14:03:28 snx11127n003 kernel: R13: ffff88083febc140 R14: ffff88083febccc0 R15: 000001b100000000
      Feb 18 14:03:28 snx11127n003 kernel: FS:  0000000000000000(0000) GS:ffff880044640000(0000) knlGS:0000000000000000
      Feb 18 14:03:28 snx11127n003 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      Feb 18 14:03:28 snx11127n003 kernel: CR2: 00007fb86046c518 CR3: 0000000001a85000 CR4: 00000000000407e0
      Feb 18 14:03:28 snx11127n003 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Feb 18 14:03:28 snx11127n003 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Feb 18 14:03:28 snx11127n003 kernel: Process jbd2/md66-8 (pid: 132029, threadinfo ffff880791762000, task ffff8807ef3e4ae0)
      Feb 18 14:03:28 snx11127n003 kernel: Stack:
      Feb 18 14:03:28 snx11127n003 kernel:  0000000000000018 ffff881014c32928 ffff880791763c30 ffff880791763c30
      Feb 18 14:03:28 snx11127n003 kernel: <d> 0000000000000000 0000000000000000 ffff880791763c80 ffff880bcc9fe240
      Feb 18 14:03:28 snx11127n003 kernel: <d> 0000000000000000 ffff8807d7ac0000 ffff880d1974f900 ffff880d1974f900
      Feb 18 14:03:28 snx11127n003 kernel: Call Trace:
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffffa0914258>] ? tgt_cb_last_committed+0x298/0x410 [ptlrpc]
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffffa0f6dba4>] ? osd_trans_commit_cb+0xb4/0x2b0 [osd_ldiskfs]
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffffa0f1c9ba>] ? ldiskfs_journal_commit_callback+0x8a/0xc0 [ldiskfs]
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffffa03df8ef>] ? jbd2_journal_commit_transaction+0x116f/0x15a0 [jbd2]
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffff81084a8b>] ? try_to_del_timer_sync+0x7b/0xe0
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffffa03e4c08>] ? kjournald2+0xb8/0x220 [jbd2]
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffff8109b010>] ? autoremove_wake_function+0x0/0x40
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffffa03e4b50>] ? kjournald2+0x0/0x220 [jbd2]
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffff8109ac66>] ? kthread+0x96/0xa0
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffff8109abd0>] ? kthread+0x0/0xa0
      Feb 18 14:03:28 snx11127n003 kernel:  [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Analysis shows that there were about 28 million journal callback entries (jce) on the transaction:

      crash> transaction_t 0xffff880c0331ccc0 | grep handle_c
        t_handle_count = 28881407,

      While kjournald2 is busy, it cannot mark the running transaction as T_LOCKED, so ldiskfs writers can keep opening transaction handles, adding blocks to the transaction, and registering transaction commit hooks. This makes the commit/checkpoint of that transaction even more complex and time-consuming.
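
      The feedback effect described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every name in it (toy_jce, toy_txn, toy_writer_add, toy_commit, writers_per_cb) is hypothetical and not part of jbd2, ldiskfs, or Lustre. It is single-threaded and does not represent the change made for this ticket. It only shows how a commit thread that runs callbacks one by one leaves the next transaction unlocked, so writers keep attaching handles and callbacks to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one journal callback entry (jce). */
struct toy_jce {
    struct toy_jce *next;
    void (*func)(struct toy_jce *jce, int error);
};

/* Hypothetical stand-in for a journal transaction. */
struct toy_txn {
    struct toy_jce *cb_head;     /* pending commit callbacks */
    unsigned long handle_count;  /* models t_handle_count */
    int locked;                  /* models the T_LOCKED state */
};

/* Trivial commit callback: it only releases its own entry. */
static void toy_cb(struct toy_jce *jce, int error)
{
    (void)error;
    free(jce);
}

/*
 * A writer opens a handle on txn and registers one commit callback.
 * Returns 0 on success, -1 if the transaction is locked or the
 * allocation fails.
 */
static int toy_writer_add(struct toy_txn *txn)
{
    struct toy_jce *jce;

    assert(txn != NULL);
    if (txn->locked)
        return -1;

    jce = malloc(sizeof(*jce));
    if (jce == NULL)
        return -1;

    jce->func = toy_cb;
    jce->next = txn->cb_head;
    txn->cb_head = jce;
    txn->handle_count++;
    return 0;
}

/*
 * Commit thread: locks the committing transaction and runs its callbacks
 * one at a time. While it is busy, the running transaction stays unlocked,
 * so writers_per_cb writers attach to it for every callback processed.
 * Returns the number of callbacks that were run.
 */
static unsigned long toy_commit(struct toy_txn *committing,
                                struct toy_txn *running,
                                unsigned int writers_per_cb)
{
    unsigned long ran = 0;
    unsigned int i;

    assert(committing != NULL);
    committing->locked = 1;

    while (committing->cb_head != NULL) {
        struct toy_jce *jce = committing->cb_head;

        committing->cb_head = jce->next;
        jce->func(jce, 0);
        ran++;

        /* Writers are not blocked, so the next transaction keeps growing. */
        for (i = 0; running != NULL && i < writers_per_cb; i++)
            toy_writer_add(running);
    }
    return ran;
}
```

      In this model, committing a transaction that holds 1000 callbacks with writers_per_cb set to 3 leaves the running transaction with 3000 handles, so the next commit has three times as many callbacks to walk. The real growth rate depends on the workload; the model only shows the direction of the effect.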

          People

            ys Yang Sheng
            scherementsev Sergey Cheremencev