[LU-14729] read-only remount due to transaction error in __ldiskfs_handle_dirty_metadata Created: 02/Jun/21  Updated: 05/Apr/22  Resolved: 17/Sep/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Wang Shilong (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14134 reduce credits for new writing potent... Resolved
is related to LU-15088 obdfilter-survey test_1a osd_trans_ex... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: ------------[ cut here ]------------

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: WARNING: CPU: 15 PID: 6012 at /tmp/rpmbuild-lustre-gs-jenkins-qQMpKt7I/BUILD/lustre-2.12.6_ddn36/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) ksocklnd(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) xt_nat veth vxlan ip6_udp_tunnel udp_tunnel iptable_mangle xt_mark xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc overlay(T) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) sunrpc iTCO_wdt ppdev iTCO_vendor_support nfit libnvdimm iosf_mbi crc32_pclmul

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd i2c_i801 parport_pc joydev pcspkr lpc_ich sg parport i6300esb ip_tables ext4 mbcache jbd2 sr_mod sd_mod cdrom crc_t10dif crct10dif_generic bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net net_failover ahci virtio_blk failover virtio_scsi(OE) drm libahci mlx5_core(OE) libata crct10dif_pclmul crct10dif_common mlxfw(OE) crc32c_intel ptp pps_core iavf devlink serio_raw virtio_pci virtio_ring mlx_compat(OE) virtio drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: CPU: 15 PID: 6012 Comm: ll_ost_io03_055 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.21.1.el7_lustre.ddn15.x86_64 #1

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: Hardware name: DDN SFA400NVXE, BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: Call Trace:

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff85b8305a>] dump_stack+0x19/0x1b

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff8549b1b8>] __warn+0xd8/0x100

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff8549b2fd>] warn_slowpath_null+0x1d/0x20

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0f9f6e2>] __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0407fc2>] ? jbd2_journal_get_write_access+0x32/0x40 [jbd2]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0fa35f5>] ldiskfs_mb_mark_diskspace_used+0x2e5/0x510 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0fa79a0>] ldiskfs_mb_new_blocks+0x350/0xb20 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854d3281>] ? __wake_up_common_lock+0x91/0xc0

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854d2d20>] ? task_rq_unlock+0x20/0x20

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0fe23ce>] ldiskfs_new_meta_blocks+0xfe/0x110 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff228d>] ldiskfs_ext_split+0x10d/0x9a0 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0408321>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff413d>] ldiskfs_ext_insert_extent+0x7bd/0x8d0 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff441d>] ldiskfs_split_extent_at+0x1cd/0x410 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff478a>] ldiskfs_split_extent.isra.40+0x12a/0x190 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff63c2>] ldiskfs_ext_handle_unwritten_extents+0x682/0xc20 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff33a4>] ? ldiskfs_ext_find_extent+0x134/0x340 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff7287>] ldiskfs_ext_map_blocks+0x8f7/0xfb0 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff8578d922>] ? radix_tree_lookup_slot+0x22/0x50

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0fbd47c>] ldiskfs_map_blocks+0x12c/0x6a0 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0f9f5ab>] ? __ldiskfs_handle_dirty_metadata+0x8b/0x220 [ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc10c4b80>] osd_ldiskfs_map_inode_pages+0x300/0x820 [osd_ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc10c543a>] osd_write_commit+0x39a/0x860 [osd_ldiskfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc1796086>] ofd_commitrw_write+0xfa6/0x1da0 [ofd]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc179ab29>] ofd_commitrw+0x479/0xe20 [ofd]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc1237cf6>] tgt_brw_write+0x1026/0x1d60 [ptlrpc]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff85625f8d>] ? __slab_free+0x9d/0x290

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0a50bce>] ? lnet_ptl_attach_md+0x37e/0x530 [lnet]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc1184a90>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc123380e>] tgt_request_handle+0xaee/0x15f0 [ptlrpc]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc120ef61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0969bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc11da7cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc11d7445>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc11de134>] ptlrpc_main+0xb34/0x1470 [ptlrpc]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854d4c1e>] ? finish_task_switch+0x4e/0x1c0

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc11dd600>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854c5da1>] kthread+0xd1/0xe0

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854c5cd0>] ? insert_kthread_work+0x40/0x40

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff85b95ddd>] ret_from_fork_nospec_begin+0x7/0x21

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854c5cd0>] ? insert_kthread_work+0x40/0x40

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: ---[ end trace 1bb5251d0179a71d ]---

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: LDISKFS-fs: ldiskfs_mb_mark_diskspace_used:3509: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: LDISKFS: jbd2_journal_dirty_metadata failed: handle type 0 started at line 1976, credits 28/0, errcode -28

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: Quota error (device sdf): qtree_write_dquot: dquota write failed

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: LDISKFS-fs error (device sdf) in ldiskfs_write_dquot:5609: error 28

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: Aborting journal on device sdf-8.

Jun  1 12:17:54 ai400x-0ab2-vm00 kernel: LDISKFS-fs (sdf): Remounting filesystem read-only



 Comments   
Comment by Gerrit Updater [ 02/Jun/21 ]

Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43890
Subject: LU-14729 osd-ldiskfs: declare dirty block groups correctly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fa09b77c841344742d268d7eba26bb8f7f7cdff1

Comment by Gerrit Updater [ 14/Jun/21 ]

Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43994
Subject: LU-14729 osd-ldiskfs: fix to declare write commmits
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aeada0a76bdb4c807430ab4aa97faa3299af6046

Comment by Gerrit Updater [ 14/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43890/
Subject: LU-14729 osd-ldiskfs: declare dirty block groups correctly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 42cda8781f94ad1138afac2d23180ea48f3c3450

Comment by Gerrit Updater [ 21/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43994/
Subject: LU-14729 osd-ldiskfs: fix to declare write commits
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9810341a839c27b7a53cdc047e0395f8f906c4bf

Comment by Peter Jones [ 21/Jun/21 ]

Landed for 2.15

Comment by Andreas Dilger [ 13/Jul/21 ]

Shilong,
is there still another patch beyond this one to fix the osd-ldiskfs transaction credits? I've found that obdfilter-survey is crashing regularly since 2021-06-21, 41 of 209 runs in "full" test sessions starting on the day the "osd-ldiskfs: fix to declare write commmits" patch landed. This was being incorrectly attributed to LU-12830, but that issue is specific to ZFS.

Comment by Andreas Dilger [ 13/Jul/21 ]

One example is:
https://testing.whamcloud.com/test_sets/67e19f95-7af9-4fd1-809f-99d8f6959fd1

Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey === 02:25:32 
Lustre: 617162:0:(osd_internal.h:1385:osd_trans_exec_check()) lustre-OST0004: opcode 7: used 8, used now 8, reserved 4
Lustre: 617162:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 0/0/0, destroy: 0/0/0
Lustre: 617162:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 0/0/0, xattr_set: 0/0/0
Lustre: 617162:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 1/4/8, punch: 0/0/0, quota 3/3/0
Lustre: 617162:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 0/0/0, delete: 0/0/0
Lustre: 617162:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0
LustreError: 617162:0:(osd_internal.h:1388:osd_trans_exec_check()) LBUG
Pid: 617162, comm: lctl 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Mon Jun 14 23:42:59 UTC 2021
Call Trace TBD:
 libcfs_call_trace+0x6f/0x90 [libcfs]
 lbug_with_loc+0x43/0x80 [libcfs]
 osd_write_commit+0x6a2/0x980 [osd_ldiskfs]
 ofd_commitrw_write+0x698/0x1990 [ofd]
 ofd_commitrw+0x376/0x9d0 [ofd]
 echo_client_prep_commit.isra.50+0x86f/0xf00 [obdecho]
 echo_client_iocontrol+0xe83/0x1540 [obdecho]
 class_handle_ioctl+0x16b0/0x1c90 [obdclass]
 obd_class_ioctl+0x118/0x160 [obdclass]
 do_vfs_ioctl+0xa4/0x640
 ksys_ioctl+0x60/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x5b/0x1a0

It might be that this issue is limited to obdfilter-survey, but it would be good to fix this regardless, since this is sometimes used in the field to debug storage performance problems.

Comment by Wang Shilong (Inactive) [ 14/Jul/21 ]

I am checking, this is a bit different than original problem, BUG triggered on Lustre side rather than in ldiskfs, from Ldiskfs point, it looks credits is enough, otherwise we will trigger original warning.

But problem is extra Lustre osd check failed, it found used credits larger than reserved which was unexpected..

Comment by Gerrit Updater [ 15/Jul/21 ]

Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44316
Subject: LU-14729 osd-ldiskfs: declare should consider concurrency
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e040832855eaf4630d80c42df8e5c64bba7ffc23

Comment by Gerrit Updater [ 22/Jul/21 ]

Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44380
Subject: LU-14729 test: debug hang
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 069c8f35d67be3e6c96d006ce4ae7a46ab8b5efd

Comment by James Nunez (Inactive) [ 30/Aug/21 ]

Bobijam

Would you please look into this issue?

New crashes are at:
https://testing.whamcloud.com/test_sets/39c1ac09-2d8b-464c-8ebe-51b85fc7103d
https://testing.whamcloud.com/test_sets/55abefe3-684d-4f52-9f8d-ceb84c5b698f

James

Comment by Andreas Dilger [ 31/Aug/21 ]

There is still patch https://review.whamcloud.com/44316 "LU-14729 osd-ldiskfs: declare should consider concurrency" from Shilong that has not been landed yet.

Comment by Gerrit Updater [ 17/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44316/
Subject: LU-14729 osd-ldiskfs: declare should consider concurrency
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 934758366dbde211743d53b8c048c57819603cb3

Comment by Peter Jones [ 17/Sep/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 09/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45506
Subject: LU-14729 osd-ldiskfs: declare dirty block groups correctly
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: b27bf53e9f14b9e5bdba8c9337a1d227cccd863a

Comment by Gerrit Updater [ 09/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45507
Subject: LU-14729 osd-ldiskfs: fix to declare write commits
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 892864518364a1d3480e9b9de20f1abe2cdd8df8

Comment by Gerrit Updater [ 09/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45508
Subject: LU-14729 osd-ldiskfs: declare should consider concurrency
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: d15735e860b22a771680479e6c1279508e6d7235

Comment by Gerrit Updater [ 14/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/45506/
Subject: LU-14729 osd-ldiskfs: declare dirty block groups correctly
Project: fs/lustre-release
Branch: b2_14
Current Patch Set:
Commit: 8e1e4194258ba83ecba2e4ed5c1d19a79faf28c4

Comment by Gerrit Updater [ 14/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/45507/
Subject: LU-14729 osd-ldiskfs: fix to declare write commits
Project: fs/lustre-release
Branch: b2_14
Current Patch Set:
Commit: f9c06dc07cc6c2125a014f472bba8f6022b660ac

Generated at Sat Feb 10 03:12:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.