Lustre / LU-19236

Journaling deadlock pattern on MDS for Lustre 2.15.5, 2.15.6 and 2.15.7 with RHEL 8.10 (ldiskfs setup)


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.15.5, Lustre 2.15.6, Lustre 2.15.7
    • None
    • Environment: RHEL 8.10, seen on different hardware versions and controllers (Dell R640 and Supermicro AS-2115HS-TNR)

    Description

      We would like to report what we believe is a journaling deadlock
      pattern in our MDS setup that is crippling our production. We have seen
      this on different hardware (Dell and Supermicro servers) with different
      storage controllers that have been thoroughly checked at the hardware
      and firmware level. We have reproduced it consistently with the
      following software stack combinations:

      Kernels:
      4.18.0-553.5.1.el8_lustre.x86_64, 4.18.0-553.27.1.el8_lustre.x86_64
      and 4.18.0-553.53.1.el8_lustre.x86_64
      Lustre versions: 2.15.5, 2.15.6 and 2.15.7 (the bug exists in all versions)
      DRBD kmod: 9.1.23

      What we are hoping to achieve with this message is:
      - To find out whether anyone has seen this bug with a similar software
        stack.
      - To find out whether our plan to upgrade the MDS server to RHEL 9 and
        Lustre 2.16.1 (5.14.0-427.31.1_lustre.el9.x86_64 with the latest
        kmod-drbd 9.2.14) is going to fix the issue.

      We have also reported the issue to Red Hat (case number 04215952) and
      they/we believe that

      Here is a bit more about our setup and the problem, along with some
      vmcore analysis that we did together with the Red Hat storage and
      kernel engineers:

      Our system is ldiskfs based and we use DRBD to replicate the metadata
      targets. We have done that successfully for a number of years. Here is
      what our DRBD setup looks like:

      [root@lustrea-mds-nx10077836-an ~]# lsblk
      NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
      sda 8:0 0 446.6G 0 disk
      ├─sda1 8:1 0 600M 0 part /boot/efi
      ├─sda2 8:2 0 4G 0 part /boot
      └─sda3 8:3 0 249G 0 part
      ├─rhel-root 253:0 0 150G 0 lvm /
      ├─rhel-swap 253:1 0 20G 0 lvm [SWAP]
      ├─rhel-backup 253:2 0 47G 0 lvm /backup
      └─rhel-var 253:3 0 32G 0 lvm /var
      sdb 8:16 0 17.5T 0 disk
      ├─sdb1 8:17 0 1.8T 0 part
      │ └─drbd0 147:0 0 1.8T 0 disk /lustre/mgs
      ├─sdb2 8:18 0 3.7T 0 part
      │ └─drbd1 147:1 0 3.7T 0 disk /lustre/mdt0
      └─sdb3 8:19 0 3.7T 0 part
      └─drbd2 147:2 0 3.7T 0 disk /lustre/mdt1

      In terms of the problem, what happens is that I/O blocks on either
      drbd1 (mdt0) or drbd2 (mdt1), with messages like this:
      Jul 17 07:24:19 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
      ldlm_cn19_000:17761 blocked for more than 120 seconds.
      Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
      mdt21_011:19591 blocked for more than 120 seconds.
      Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
      mdt21_010:19584 blocked for more than 120 seconds.
      Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
      mdt21_009:19574 blocked for more than 120 seconds.
      Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
      mdt21_008:19564 blocked for more than 120 seconds.
      Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
      jbd2/drbd1-8:17843 blocked for more than 120 seconds.
      Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
      ldlm_cn19_002:17763 blocked for more than 120 seconds.
      Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
      ldlm_cn19_000:17761 blocked for more than 120 seconds.

      At that point the DRBD processes remain blocked until we power-cycle
      the server. It is not possible to cleanly unmount the metadata target
      on the blocked drbd device; we can unmount the mgs (drbd0) and the
      other drbd device fine (only one blocks at a time). Having to
      power-cycle the server causes production outages and fsck operations
      that create serious problems for our operational environment. The
      frequency of the problem varies: anything from 12 hours to several days
      for some filesystems, and over a month for other filesystems (with the
      same software stack).

      We (together with the Red Hat kernel engineers) do not believe DRBD is
      to blame here. In fact, we have gone as far as running the active MDS
      in standalone mode (no DRBD replication) and we still triggered the
      deadlock. We suspect something happens in:
      qsd_reserve_or_free_quota
      ↳ osd_reserve_or_free_quota
      ↳ ldiskfs_sync_fs
      ↳ jbd2_transaction_committed

      Thanks for any hints you can provide!

      A little bit of kdump/vmcore analysis reveals the deadlock between jbd2
      and the way Lustre is using it. What we see in the ldiskfs layer is a
      class of threads (MDT and LDLM?) that are stuck. Every one of them ends
      up blocking here when ldiskfs_sync_fs is called. See below:


      crash> ps -m | grep UN
      [0 00:19:19.605] [UN] PID: 464849 TASK: ff4103d58655c000 CPU: 38
      COMMAND: "ldlm_cn19_006"
      [0 00:19:56.741] [UN] PID: 55781 TASK: ff4103b341b68000 CPU: 10
      COMMAND: "mdt05_001"
      [0 00:21:32.753] [UN] PID: 125286 TASK: ff4103c86207c000 CPU: 74
      COMMAND: "mdt13_009"
      [0 00:22:56.640] [UN] PID: 55720 TASK: ff4103b2fd31c000 CPU: 38
      COMMAND: "ldlm_cn19_000"
      [0 00:26:43.200] [UN] PID: 57585 TASK: ff4103b373fcc000 CPU: 40
      COMMAND: "ldlm_cn20_004"
      [0 00:27:22.124] [UN] PID: 56529 TASK: ff4103b427844000 CPU: 4
      COMMAND: "jbd2/drbd2-8"
      [0 00:27:23.085] [UN] PID: 57687 TASK: ff4103b29eae0000 CPU: 37
      COMMAND: "mdt18_012"
      [0 00:27:25.832] [UN] PID: 57639 TASK: ff4103b3d2318000 CPU: 77
      COMMAND: "mdt14_013"
      [0 00:27:26.014] [UN] PID: 57703 TASK: ff4103b2ddb58000 CPU: 84
      COMMAND: "mdt18_013"

      crash> set ff4103b2ddb58000
      PID: 57703
      COMMAND: "mdt18_013"
      TASK: ff4103b2ddb58000 [THREAD_INFO: ff4103b2ddb58000]
      CPU: 84
      STATE: TASK_UNINTERRUPTIBLE
      crash> bt
      PID: 57703 TASK: ff4103b2ddb58000 CPU: 84 COMMAND: "mdt18_013"
      #0 [ff75fb8e88fe7bc0] __schedule at ffffffff8881dd01
      #1 [ff75fb8e88fe7c18] schedule at ffffffff8881e2f5
      #2 [ff75fb8e88fe7c30] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c
      [jbd2]
      #3 [ff75fb8e88fe7c98] ldiskfs_sync_file at ffffffffc16174f5 [ldiskfs]
      #4 [ff75fb8e88fe7cd8] osd_object_sync at ffffffffc1698378 [osd_ldiskfs]
      #5 [ff75fb8e88fe7d00] mdt_object_sync at ffffffffc18537aa [mdt]
      #6 [ff75fb8e88fe7d20] mdt_sync at ffffffffc1863c63 [mdt]
      #7 [ff75fb8e88fe7d50] tgt_request_handle at ffffffffc11f82dd [ptlrpc]
      #8 [ff75fb8e88fe7dd0] ptlrpc_server_handle_request at ffffffffc11a3723
      [ptlrpc]
      #9 [ff75fb8e88fe7e38] ptlrpc_main at ffffffffc11a521c [ptlrpc]
      #10 [ff75fb8e88fe7f10] kthread at ffffffff87f20e24
      #11 [ff75fb8e88fe7f50] ret_from_fork at ffffffff88a0028f
      crash> whatis __pfx_jbd2_transaction_committed
      <text variable, no debug info> __pfx_jbd2_transaction_committed;
      crash> bt
      PID: 57703 TASK: ff4103b2ddb58000 CPU: 84 COMMAND: "mdt18_013"
      #0 [ff75fb8e88fe7bc0] __schedule at ffffffff8881dd01
      #1 [ff75fb8e88fe7c18] schedule at ffffffff8881e2f5
      #2 [ff75fb8e88fe7c30] jbd2_log_wait_commit at ffffffffc02f3c3c [jbd2]
      #3 [ff75fb8e88fe7c98] ldiskfs_sync_file at ffffffffc16174f5 [ldiskfs]
      #4 [ff75fb8e88fe7cd8] osd_object_sync at ffffffffc1698378 [osd_ldiskfs]
      #5 [ff75fb8e88fe7d00] mdt_object_sync at ffffffffc18537aa [mdt]
      #6 [ff75fb8e88fe7d20] mdt_sync at ffffffffc1863c63 [mdt]
      #7 [ff75fb8e88fe7d50] tgt_request_handle at ffffffffc11f82dd [ptlrpc]
      #8 [ff75fb8e88fe7dd0] ptlrpc_server_handle_request at ffffffffc11a3723
      [ptlrpc]
      #9 [ff75fb8e88fe7e38] ptlrpc_main at ffffffffc11a521c [ptlrpc]
      #10 [ff75fb8e88fe7f10] kthread at ffffffff87f20e24
      #11 [ff75fb8e88fe7f50] ret_from_fork at ffffffff88a0028f
      crash> sym jbd2_log_wait_commit
      ffffffffc02f3b90 (T) jbd2_log_wait_commit [jbd2]
      /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/fs/jbd2/journal.c:
      681
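
      For reference, the core of jbd2_log_wait_commit() at that location is
      essentially the wait loop below (our simplified paraphrase of the
      upstream fs/jbd2/journal.c code, not the exact RHEL 8.10 source). The
      schedule() frames in the backtraces are the wait_event() on
      j_wait_done_commit, which only returns once kjournald2 has advanced
      j_commit_sequence past the requested tid:

      int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
      {
              int err = 0;

              read_lock(&journal->j_state_lock);
              while (tid_gt(tid, journal->j_commit_sequence)) {
                      /* ask kjournald2 to commit, then sleep until it
                       * reports completion via j_wait_done_commit */
                      wake_up(&journal->j_wait_commit);
                      read_unlock(&journal->j_state_lock);
                      wait_event(journal->j_wait_done_commit,
                                 !tid_gt(tid, journal->j_commit_sequence));
                      read_lock(&journal->j_state_lock);
              }
              read_unlock(&journal->j_state_lock);

              if (unlikely(is_journal_aborted(journal)))
                      err = -EIO;
              return err;
      }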

      Now the backtrace looks much better:

      crash> dis -r ffffffffc02f3c3c | tail -10
      0xffffffffc02f3c1e <jbd2_log_wait_commit+0x8e>: mov %r12d,%eax
      0xffffffffc02f3c21 <jbd2_log_wait_commit+0x91>: sub 0x3c0(%rbx),%eax
      0xffffffffc02f3c27 <jbd2_log_wait_commit+0x97>: test %eax,%eax
      0xffffffffc02f3c29 <jbd2_log_wait_commit+0x99>: jle 0xffffffffc02f3be8
      <jbd2_log_wait_commit+0x58>
      0xffffffffc02f3c2b <jbd2_log_wait_commit+0x9b>: xor %esi,%esi
      0xffffffffc02f3c2d <jbd2_log_wait_commit+0x9d>: mov %rsp,%rdi
      0xffffffffc02f3c30 <jbd2_log_wait_commit+0xa0>: call 0xffffffff87f4e510
      <init_wait_entry>
      0xffffffffc02f3c35 <jbd2_log_wait_commit+0xa5>: jmp 0xffffffffc02f3c3c
      <jbd2_log_wait_commit+0xac>
      0xffffffffc02f3c37 <jbd2_log_wait_commit+0xa7>: call 0xffffffff8881e2a0
      <schedule>
      0xffffffffc02f3c3c <jbd2_log_wait_commit+0xac>: mov $0x2,%edx
      crash>
      crash> dis -r ffffffffc16174f5 | tail
      0xffffffffc16174d6 <ldiskfs_sync_file+0x346>: jmp 0xffffffffc16173f3
      <ldiskfs_sync_file+0x263>
      0xffffffffc16174db <ldiskfs_sync_file+0x34b>: mov %r13d,%esi
      0xffffffffc16174de <ldiskfs_sync_file+0x34e>: mov %r15,%rdi
      0xffffffffc16174e1 <ldiskfs_sync_file+0x351>: call 0xffffffffc02f41a0
      <jbd2_trans_will_send_data_barrier>
      0xffffffffc16174e6 <ldiskfs_sync_file+0x356>: mov %r13d,%esi
      0xffffffffc16174e9 <ldiskfs_sync_file+0x359>: mov %r15,%rdi
      0xffffffffc16174ec <ldiskfs_sync_file+0x35c>: test %eax,%eax
      0xffffffffc16174ee <ldiskfs_sync_file+0x35e>: je 0xffffffffc16174fd
      <ldiskfs_sync_file+0x36d>
      0xffffffffc16174f0 <ldiskfs_sync_file+0x360>: call 0xffffffffc02f5a40
      <jbd2_complete_transaction>
      0xffffffffc16174f5 <ldiskfs_sync_file+0x365>: mov %eax,%r12d
      crash> dis -r ffffffffc02f3c3c | head
      0xffffffffc02f3b90 <jbd2_log_wait_commit>: nopl 0x0(%rax,%rax,1) [FTRACE
      NOP]
      0xffffffffc02f3b95 <jbd2_log_wait_commit+0x5>: push %r15
      0xffffffffc02f3b97 <jbd2_log_wait_commit+0x7>: push %r14
      0xffffffffc02f3b99 <jbd2_log_wait_commit+0x9>: mov $0xfffffe00,%r14d
      0xffffffffc02f3b9f <jbd2_log_wait_commit+0xf>: push %r13
      0xffffffffc02f3ba1 <jbd2_log_wait_commit+0x11>: push %r12
      0xffffffffc02f3ba3 <jbd2_log_wait_commit+0x13>: mov %esi,%r12d
      0xffffffffc02f3ba6 <jbd2_log_wait_commit+0x16>: push %rbp
      0xffffffffc02f3ba7 <jbd2_log_wait_commit+0x17>: lea 0x44(%rdi),%rbp
      0xffffffffc02f3bab <jbd2_log_wait_commit+0x1b>: push %rbx

      crash> journal_t.j_wait_commit ff4103b2f19c7800
        j_wait_commit = {
          lock = {
            {
              rlock = {
                raw_lock = {
                  {
                    val = {
                      counter = 0x0
                    },
                    {
                      locked = 0x0,
                      pending = 0x0
                    },
                    {
                      locked_pending = 0x0,
                      tail = 0x0
                    }
                  }
                }
              }
            }
          },
          head = {
            next = 0xff4103b2f19c78c0,
            prev = 0xff4103b2f19c78c0
          }
        },

      So ldiskfs is using jbd2 for journaling.

      The TID being waited on is 0x1c0d1446, while the journal's commit
      sequence is still one behind it:

      TID: 000000001c0d1446
      crash> journal_t.j_commit_sequence ff4103b2f19c7800
      j_commit_sequence = 0x1c0d1445,

      crash> waitq journal_t.j_wait_done_commit ff4103b2f19c7800
      PID: 464849 TASK: ff4103d58655c000 CPU: 38 COMMAND: "ldlm_cn19_006"
      PID: 55781 TASK: ff4103b341b68000 CPU: 10 COMMAND: "mdt05_001"
      PID: 125286 TASK: ff4103c86207c000 CPU: 74 COMMAND: "mdt13_009"
      PID: 55720 TASK: ff4103b2fd31c000 CPU: 38 COMMAND: "ldlm_cn19_000"
      PID: 57585 TASK: ff4103b373fcc000 CPU: 40 COMMAND: "ldlm_cn20_004"
      PID: 57687 TASK: ff4103b29eae0000 CPU: 37 COMMAND: "mdt18_012"
      PID: 57639 TASK: ff4103b3d2318000 CPU: 77 COMMAND: "mdt14_013"
      PID: 57703 TASK: ff4103b2ddb58000 CPU: 84 COMMAND: "mdt18_013"
      PID: 56529 TASK: ff4103b427844000 CPU: 4 COMMAND: "jbd2/drbd2-8"


      crash> bt ff4103b427844000
      PID: 56529 TASK: ff4103b427844000 CPU: 4 COMMAND: "jbd2/drbd2-8"
      #0 [ff75fb8e7c333968] __schedule at ffffffff8881dd01
      #1 [ff75fb8e7c3339c0] schedule at ffffffff8881e2f5
      #2 [ff75fb8e7c3339d8] jbd2_log_wait_commit at ffffffffc02f3c3c [jbd2]
      #3 [ff75fb8e7c333a40] ldiskfs_sync_fs at ffffffffc1606840 [ldiskfs]
      #4 [ff75fb8e7c333a78] osd_sync at ffffffffc16972df [osd_ldiskfs]
      #5 [ff75fb8e7c333a98] qsd_acquire at ffffffffc156d56c [lquota]
      #6 [ff75fb8e7c333b18] qsd_op_begin0 at ffffffffc156e045 [lquota]
      #7 [ff75fb8e7c333b98] qsd_reserve_or_free_quota at ffffffffc156ffd5
      [lquota]
      #8 [ff75fb8e7c333bc8] osd_reserve_or_free_quota at ffffffffc1696c86
      [osd_ldiskfs]
      #9 [ff75fb8e7c333c10] tgt_cb_last_committed at ffffffffc11e3512 [ptlrpc]
      #10 [ff75fb8e7c333c78] osd_trans_commit_cb at ffffffffc1699376 [osd_ldiskfs]
      #11 [ff75fb8e7c333cc0] ldiskfs_journal_commit_callback at ffffffffc1607b85
      [ldiskfs]
      #12 [ff75fb8e7c333cf8] jbd2_journal_commit_transaction at ffffffffc02ef960
      [jbd2]
      #13 [ff75fb8e7c333ea0] kjournald2 at ffffffffc02f3fed [jbd2]
      #14 [ff75fb8e7c333f10] kthread at ffffffff87f20e24
      #15 [ff75fb8e7c333f50] ret_from_fork at ffffffff88a0028f
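
      This is the key backtrace: the jbd2/drbd2-8 commit thread, while
      running the commit callbacks of the transaction it is currently
      committing, re-enters the journal through the quota path and ends up
      waiting in jbd2_log_wait_commit() on j_wait_done_commit (it even shows
      up on its own waitqueue in the waitq output above). Since only this
      thread can advance j_commit_sequence and perform that wake-up, it can
      never make progress, and every other thread on the waitqueue stays
      blocked with it. In simplified pseudo-C (our reading of the traces,
      with an illustrative function name, not actual source):

      /* what kjournald2 -> jbd2_journal_commit_transaction() ends up doing,
       * heavily simplified to show the self-wait */
      static void commit_transaction_simplified(journal_t *journal,
                                                transaction_t *commit_transaction)
      {
              /* ... journal blocks written out ... */

              /* commit callbacks run in the context of the commit thread */
              journal->j_commit_callback(journal, commit_transaction);
              /*   -> ldiskfs_journal_commit_callback()
               *   -> osd_trans_commit_cb() -> tgt_cb_last_committed()
               *   -> qsd_reserve_or_free_quota() -> qsd_op_begin0()
               *   -> qsd_acquire() -> osd_sync() -> ldiskfs_sync_fs()
               *   -> jbd2_log_wait_commit(journal, tid)      <-- sleeps here
               */

              /* never reached while the callback sleeps: this is the only
               * place j_commit_sequence advances and j_wait_done_commit is
               * woken, so the MDT/LDLM threads above stay blocked too */
              journal->j_commit_sequence = commit_transaction->t_tid;
              wake_up(&journal->j_wait_done_commit);
      }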

      [293564.548588] Lustre: Skipped 1 previous similar message
      [293564.548588] Pid: 57703, comm: mdt18_013
      4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri Jun 28 18:44:24 UTC 2024
      [293564.548591] Call Trace TBD:
      [293564.548607] [<0>] jbd2_log_wait_commit+0xac/0x120 [jbd2]
      [293564.548615] [<0>] ldiskfs_sync_file+0x365/0x3b0 [ldiskfs]
      [293564.548634] [<0>] osd_object_sync+0x188/0x1e0 [osd_ldiskfs]
      [293564.548645] [<0>] mdt_object_sync.isra.69+0x7a/0x3b0 [mdt]
      [293564.548664] [<0>] mdt_sync+0xf3/0x320 [mdt]
      [293564.548681] [<0>] tgt_request_handle+0xccd/0x1a20 [ptlrpc]
      [293564.548733] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
      [293564.548780] [<0>] ptlrpc_main+0xbec/0x1530 [ptlrpc]
      [293564.548823] [<0>] kthread+0x134/0x150
      [293564.548826] [<0>] ret_from_fork+0x1f/0x40

      [293564.548830] Pid: 57639, comm: mdt14_013
      4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri Jun 28 18:44:24 UTC 2024
      [293564.548835] Call Trace TBD:
      [293564.548856] [<0>] jbd2_log_wait_commit+0xac/0x120 [jbd2]
      [293564.548864] [<0>] ldiskfs_sync_fs+0x170/0x1d0 [ldiskfs]
      [293564.548881] [<0>] osd_sync+0xdf/0x180 [osd_ldiskfs]
      [293564.548888] INFO: task jbd2/drbd2-8:56529 blocked for more than 120
      seconds.
      [293564.548894] Tainted: G OE -------- - -
      4.18.0-553.5.1.el8_lustre.x86_64 #1
      [293564.549271] [<0>] qsd_acquire+0x46c/0xde0 [lquota]
      [293564.550413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
      this message.
      [293564.550412] [<0>] qsd_op_begin0+0x165/0x8a0 [lquota]

      [293564.550642] task:jbd2/drbd2-8 state stack:0 pid:56529 ppid:2
      flags:0x80004080
      [293564.550641] [<0>] qsd_reserve_or_free_quota+0x2e5/0x7d0 [lquota]
      [293564.550648] Call Trace:
      [293564.550652] __schedule+0x2d1/0x870
      [293564.550652] [<0>] osd_reserve_or_free_quota+0xe6/0x1c0 [osd_ldiskfs]
      [293564.550659] ? __wake_up_common_lock+0x89/0xc0
      [293564.550663] schedule+0x55/0xf0
      [293564.550663] [<0>] dt_reserve_or_free_quota.constprop.45+0x2c/0xd0 [mdd]
      [293564.550667] jbd2_log_wait_commit+0xac/0x120 [jbd2]
      [293564.550675] ? finish_wait+0x80/0x80
      [293564.550675] [<0>] mdd_attr_set+0xa70/0x1020 [mdd]
      [293564.550679] ldiskfs_sync_fs+0x170/0x1d0 [ldiskfs]
      [293564.550685] [<0>] mdt_reint_setattr+0xea8/0x1720 [mdt]
      [293564.550695] osd_sync+0xdf/0x180 [osd_ldiskfs]
      [293564.550708] qsd_acquire+0x46c/0xde0 [lquota]
      [293564.550707] [<0>] mdt_reint_rec+0x11f/0x270 [mdt]
      [293564.550723] qsd_op_begin0+0x165/0x8a0 [lquota]
      [293564.550725] [<0>] mdt_reint_internal+0x4d0/0x7e0 [mdt]
      [293564.550733] ? lfsck_key_init+0x1c/0x130 [lfsck]
      [293564.550747] ? srso_alias_return_thunk+0x5/0xfcdfd
      [293564.550743] [<0>] mdt_reint+0x5d/0x110 [mdt]
      [293564.550749] ? kmem_cache_alloc_trace+0x142/0x280
      [293564.550754] qsd_reserve_or_free_quota+0x2e5/0x7d0 [lquota]
      [293564.550764] osd_reserve_or_free_quota+0xe6/0x1c0 [osd_ldiskfs]
      [293564.550775] ? lu_context_init+0xa5/0x1b0 [obdclass]
      [293564.550760] [<0>] tgt_request_handle+0xccd/0x1a20 [ptlrpc]
      [293564.550807] tgt_cb_last_committed+0xc2/0x5c0 [ptlrpc]
      [293564.550817] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
      [293564.550863] osd_trans_commit_cb+0xe6/0x300 [osd_ldiskfs]
      [293564.550875] ldiskfs_journal_commit_callback+0xa5/0xe0 [ldiskfs]
      [293564.550865] [<0>] ptlrpc_main+0xbec/0x1530 [ptlrpc]
      [293564.550888] jbd2_journal_commit_transaction+0x1690/0x19f0 [jbd2]
      [293564.550897] ? __try_to_del_timer_sync+0x5a/0xa0
      [293564.550900] kjournald2+0xbd/0x270 [jbd2]
      [293564.550904] ? finish_wait+0x80/0x80
      [293564.550907] ? commit_timeout+0x10/0x10 [jbd2]
      [293564.550908] [<0>] kthread+0x134/0x150
      [293564.550911] kthread+0x134/0x150
      [293564.550912] [<0>] ret_from_fork+0x1f/0x40
      [293564.550915] ? set_kthread_struct+0x50/0x50
      [293564.550918] ret_from_fork+0x1f/0x40

      __pfx_jbd2_transaction_committed should map to jbd2_transaction_committed.

      PID: 56529 TASK: ff4103b427844000 CPU: 4 COMMAND: "jbd2/drbd2-8"
      #0 [ff75fb8e7c333968] __schedule at ffffffff8881dd01
      #1 [ff75fb8e7c3339c0] schedule at ffffffff8881e2f5
      #2 [ff75fb8e7c3339d8] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c
      [jbd2]
      #3 [ff75fb8e7c333a40] ldiskfs_sync_fs at ffffffffc1606840 [ldiskfs]
      #4 [ff75fb8e7c333a78] osd_sync at ffffffffc16972df [osd_ldiskfs]
      #5 [ff75fb8e7c333a98] qsd_acquire at ffffffffc156d56c [lquota]
      #6 [ff75fb8e7c333b18] qsd_op_begin0 at ffffffffc156e045 [lquota]
      #7 [ff75fb8e7c333b98] qsd_reserve_or_free_quota at ffffffffc156ffd5
      [lquota]
      #8 [ff75fb8e7c333bc8] osd_reserve_or_free_quota at ffffffffc1696c86
      [osd_ldiskfs]
      #9 [ff75fb8e7c333c10] tgt_cb_last_committed at ffffffffc11e3512 [ptlrpc]
      #10 [ff75fb8e7c333c78] osd_trans_commit_cb at ffffffffc1699376 [osd_ldiskfs]
      #11 [ff75fb8e7c333cc0] ldiskfs_journal_commit_callback at ffffffffc1607b85
      [ldiskfs]
      #12 [ff75fb8e7c333cf8] __pfx_jbd2_journal_commit_transaction at
      ffffffffc02ef960 [jbd2]
      #13 [ff75fb8e7c333ea0] kjournald2 at ffffffffc02f3fed [jbd2]
      #14 [ff75fb8e7c333f10] kthread at ffffffff87f20e24
      #15 [ff75fb8e7c333f50] ret_from_fork at ffffffff88a0028f

      Focusing on osd_sync -> ldiskfs_sync_fs -> __pfx_jbd2_transaction_committed
      to see why we stalled

      /* Return 1 when transaction with given tid has already committed. */
      int jbd2_transaction_committed(journal_t *journal, tid_t tid)

      {
              int ret = 1;

              read_lock(&journal->j_state_lock);   /* j_state_lock seems to be a rwlock_t */
              if (journal->j_running_transaction &&
                  journal->j_running_transaction->t_tid == tid)
                      ret = 0;
              if (journal->j_committing_transaction &&
                  journal->j_committing_transaction->t_tid == tid)
                      ret = 0;
              read_unlock(&journal->j_state_lock);
              return ret;
      }

      EXPORT_SYMBOL(jbd2_transaction_committed);

      What does not make sense is how we get to schedule() in each case if it
      is jbd2_transaction_committed; then maybe here:

      0xffffffffc16174a5 <ldiskfs_sync_file+789>: mov %r12,%rdi
      0xffffffffc16174a8 <ldiskfs_sync_file+792>: mov %r14,%r15
      0xffffffffc16174ab <ldiskfs_sync_file+795>: call 0xffffffff881a4170
      <sync_inode_metadata>
      0xffffffffc16174b0 <ldiskfs_sync_file+800>: mov %eax,%r13d
      0xffffffffc16174b3 <ldiskfs_sync_file+803>: test %eax,%eax
      0xffffffffc16174b5 <ldiskfs_sync_file+805>: jne 0xffffffffc161752e
      <ldiskfs_sync_file+926>
      0xffffffffc16174b7 <ldiskfs_sync_file+807>: mov -0x98(%r12),%rax
      0xffffffffc16174bf <ldiskfs_sync_file+815>: movabs $0x10000000000,%rsi
      0xffffffffc16174c9 <ldiskfs_sync_file+825>: test %rsi,%rax
      0xffffffffc16174cc <ldiskfs_sync_file+828>: jne 0xffffffffc161746a
      <ldiskfs_sync_file+730>
      0xffffffffc16174ce <ldiskfs_sync_file+830>: mov %r15,%rdi
      0xffffffffc16174d1 <ldiskfs_sync_file+833>: call 0xffffffff8818a140 <dput>
      0xffffffffc16174d6 <ldiskfs_sync_file+838>: jmp 0xffffffffc16173f3
      <ldiskfs_sync_file+611>
      0xffffffffc16174db <ldiskfs_sync_file+843>: mov %r13d,%esi
      0xffffffffc16174de <ldiskfs_sync_file+846>: mov %r15,%rdi
      0xffffffffc16174e1 <ldiskfs_sync_file+849>: call 0xffffffffc02f41a0
      <jbd2_trans_will_send_data_barrier>
      0xffffffffc16174e6 <ldiskfs_sync_file+854>: mov %r13d,%esi
      0xffffffffc16174e9 <ldiskfs_sync_file+857>: mov %r15,%rdi
      0xffffffffc16174ec <ldiskfs_sync_file+860>: test %eax,%eax
      0xffffffffc16174ee <ldiskfs_sync_file+862>: je 0xffffffffc16174fd
      <ldiskfs_sync_file+877>
      0xffffffffc16174f0 <ldiskfs_sync_file+864>: call 0xffffffffc02f5a40
      <__pfx_jbd2_journal_start_commit>
      0xffffffffc16174f5 <ldiskfs_sync_file+869>: mov %eax,%r12d

      called here
      0xffffffffc16174f0 <ldiskfs_sync_file+864>: call 0xffffffffc02f5a40
      <__pfx_jbd2_journal_start_commit>
      0xffffffffc16174f5 <ldiskfs_sync_file+869>: mov %eax,%r12d
      0xffffffffc16174f8 <ldiskfs_sync_file+872>: jmp 0xffffffffc16172e7
      <ldiskfs_sync_file+343>
      0xffffffffc16174fd <ldiskfs_sync_file+877>: call 0xffffffffc02f5a40
      <__pfx_jbd2_journal_start_commit>
      0xffffffffc1617502 <ldiskfs_sync_file+882>: mov %eax,%r13d
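
      Note that the earlier disassembly of ldiskfs_sync_file resolved the
      call target 0xffffffffc02f5a40 to jbd2_complete_transaction, while here
      it shows up as __pfx_jbd2_journal_start_commit, so symbol resolution in
      this vmcore looks unreliable. Assuming the call really is
      jbd2_complete_transaction, reaching schedule() is expected: when the
      requested tid still belongs to the running or committing transaction,
      it falls through to jbd2_log_wait_commit(). Sketch based on the
      upstream fs/jbd2/journal.c, not the exact RHEL source:

      int jbd2_complete_transaction(journal_t *journal, tid_t tid)
      {
              int need_to_wait = 1;

              read_lock(&journal->j_state_lock);
              if (journal->j_running_transaction &&
                  journal->j_running_transaction->t_tid == tid) {
                      if (journal->j_commit_request != tid) {
                              /* commit not requested yet, kick it off */
                              read_unlock(&journal->j_state_lock);
                              jbd2_log_start_commit(journal, tid);
                              goto wait_commit;
                      }
              } else if (!(journal->j_committing_transaction &&
                           journal->j_committing_transaction->t_tid == tid))
                      need_to_wait = 0;
              read_unlock(&journal->j_state_lock);
              if (!need_to_wait)
                      return 0;
      wait_commit:
              return jbd2_log_wait_commit(journal, tid);  /* -> schedule() */
      }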

      So stuck in this loop

      0xffffffffc02f3c20 <__pfx_jbd2_transaction_committed>: loopne
      0xffffffffc02f3c4d <__pfx_jbd2_transaction_committed+45>
      0xffffffffc02f3c22 <__pfx_jbd2_transaction_committed+2>: add $0x3,%eax
      0xffffffffc02f3c25 <__pfx_jbd2_transaction_committed+5>: add %al,(%rax)
      0xffffffffc02f3c27 <__pfx_jbd2_transaction_committed+7>: test %eax,%eax

      rest of the data
      ----------------
      crash> bt -l
      PID: 56529 TASK: ff4103b427844000 CPU: 4 COMMAND: "jbd2/drbd2-8"
      #0 [ff75fb8e7c333968] __schedule at ffffffff8881dd01

      /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/sched/core.c:
      3804
      #1 [ff75fb8e7c3339c0] schedule at ffffffff8881e2f5

      /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/./arch/x86/include/asm/current.h:
      15
      #2 [ff75fb8e7c3339d8] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c
      [jbd2]

      /usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/./include/trace/events/jbd2.h:
      462
      #3 [ff75fb8e7c333a40] ldiskfs_sync_fs at ffffffffc1606840 [ldiskfs]
      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/ldiskfs/super.c: 5310
      #4 [ff75fb8e7c333a78] osd_sync at ffffffffc16972df [osd_ldiskfs]

      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/osd-ldiskfs/osd_handler.c:
      2497
      #5 [ff75fb8e7c333a98] qsd_acquire at ffffffffc156d56c [lquota]
      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/quota/qsd_internal.h:
      217
      #6 [ff75fb8e7c333b18] qsd_op_begin0 at ffffffffc156e045 [lquota]
      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/quota/qsd_handler.c:
      751
      #7 [ff75fb8e7c333b98] qsd_reserve_or_free_quota at ffffffffc156ffd5
      [lquota]
      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/quota/qsd_handler.c:
      1358
      #8 [ff75fb8e7c333bc8] osd_reserve_or_free_quota at ffffffffc1696c86
      [osd_ldiskfs]

      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/include/lustre_quota.h: 276
      #9 [ff75fb8e7c333c10] tgt_cb_last_committed at ffffffffc11e3512 [ptlrpc]
      #10 [ff75fb8e7c333c78] osd_trans_commit_cb at ffffffffc1699376 [osd_ldiskfs]

      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/osd-ldiskfs/osd_handler.c:
      1795
      #11 [ff75fb8e7c333cc0] ldiskfs_journal_commit_callback at ffffffffc1607b85
      [ldiskfs]

      /var/lib/jenkins/workspace/lustre-b2_15/arch/x86_64/build_type/server/distro/el8.10/ib_stack/inkernel/BUILD/reused/usr/src/kernels/4.18.0-553.5.1.el8_lustre.x86_64/include/linux/spinlock.h:
      350
      #12 [ff75fb8e7c333cf8] __pfx_jbd2_journal_commit_transaction at
      ffffffffc02ef960 [jbd2]

      /usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/fs/jbd2/commit.c:
      926
      #13 [ff75fb8e7c333ea0] kjournald2 at ffffffffc02f3fed [jbd2]

      /usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/./include/trace/events/jbd2.h:
      14
      #14 [ff75fb8e7c333f10] kthread at ffffffff87f20e24

      /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/kthread.c:
      364
      #15 [ff75fb8e7c333f50] ret_from_fork at ffffffff88a0028f

      /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/arch/x86/entry/entry_64.S:
      322

      Trying to sync to drbd2

      crash> super_block ff4103b252bc8000
      struct super_block {
        s_list = {
          next = 0xff4103b218029800,
          prev = 0xff4103b252bcc000
        },
        s_dev = 154140674,
        s_blocksize_bits = 12 '\f',
        s_blocksize = 4096,
        s_maxbytes = 17592186040320,
        s_type = 0xffffffffc16615e0 <ldiskfs_fs_type>,
        s_op = 0xffffffffc1632d60 <ldiskfs_sops>,
        dq_op = 0xffffffffc1632ee0 <ldiskfs_quota_operations>,
        s_qcop = 0xffffffff88e47160 <dquot_quotactl_sysfile_ops>,
        s_export_op = 0xffffffffc1632d00 <ldiskfs_export_ops>,
        s_flags = 1879113728,
        s_iflags = 9,
        s_magic = 61267,
        s_root = 0xff4103b48587d380,
        s_umount = {
          count = {
            counter = 1536
          },
        ..
        ..
        s_id = "drbd2\0"
      No pending requests so not waiting in the block layer

      147 ff4103b215d2e000 drbd2 ff4103b27ddddf58 0 0 0

      First PID to hang was this one:

      PID: 57703 TASK: ff4103b2ddb58000 CPU: 84 COMMAND: "mdt18_013"
      #0 [ff75fb8e88fe7bc0] __schedule at ffffffff8881dd01

      /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/sched/core.c:
      3804
      #1 [ff75fb8e88fe7c18] schedule at ffffffff8881e2f5

      /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/./arch/x86/include/asm/current.h:
      15
      #2 [ff75fb8e88fe7c30] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c
      [jbd2]

      /usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/./include/trace/events/jbd2.h:
      462
      #3 [ff75fb8e88fe7c98] ldiskfs_sync_file at ffffffffc16174f5 [ldiskfs]
      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/ldiskfs/fsync.c: 152
      #4 [ff75fb8e88fe7cd8] osd_object_sync at ffffffffc1698378 [osd_ldiskfs]

      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/osd-ldiskfs/osd_handler.c:
      5108
      #5 [ff75fb8e88fe7d00] mdt_object_sync at ffffffffc18537aa [mdt]
      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/include/md_object.h:
      564
      #6 [ff75fb8e88fe7d20] mdt_sync at ffffffffc1863c63 [mdt]
      /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/mdt/mdt_handler.c: 3121
      #7 [ff75fb8e88fe7d50] tgt_request_handle at ffffffffc11f82dd [ptlrpc]
      #8 [ff75fb8e88fe7dd0] ptlrpc_server_handle_request at ffffffffc11a3723
      [ptlrpc]
      #9 [ff75fb8e88fe7e38] ptlrpc_main at ffffffffc11a521c [ptlrpc]
      #10 [ff75fb8e88fe7f10] kthread at ffffffff87f20e24

      /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/kthread.c:
      364
      #11 [ff75fb8e88fe7f50] ret_from_fork at ffffffff88a0028f

      /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/arch/x86/entry/entry_64.S:
      322

      crash> file ff4103cd3f0e9900
      struct file {
        f_u = {
          fu_llist = {
            next = 0x0
          },
          fu_rcuhead = {
            next = 0x0,
            func = 0x0
          }
        },
        f_path = {
          mnt = 0xff4103b427901ca0,
          dentry = 0xff4103d887835140
        },
        f_inode = 0xff4103cddbaf3e58,
        f_op = 0xffffffffc1631540 <ldiskfs_file_operations>,
        f_lock = {

      crash> super_block 0xff4103b252bc8000 | grep s_id
      s_id =
      "drbd2\000s\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",

      Best regards,
      GM

      Georgios Magklaras PhD
      Chief Engineer
      IT Infrastructure/HPC
      The Norwegian Meteorological Institute

      https://www.met.no/
      https://www.steelcyber.com/georgioshome/

      Attachments

        1. dmesg-T-aug18.txt
          1.09 MB
        2. kexec-dmesg.log
          144 kB
        3. quota.log
          267.56 MB
        4. vmcore-dmesg.txt
          257 kB


            People

              Assignee: hongchao.zhang (Hongchao Zhang)
              Reporter: georgiosm (Georgios Magklaras)
              Votes: 0
              Watchers: 6
