[LU-14729] read-only remount due to transaction error in __ldiskfs_handle_dirty_metadata Created: 02/Jun/21 Updated: 05/Apr/22 Resolved: 17/Sep/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Wang Shilong (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: ------------[ cut here ]------------ Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: WARNING: CPU: 15 PID: 6012 at /tmp/rpmbuild-lustre-gs-jenkins-qQMpKt7I/BUILD/lustre-2.12.6_ddn36/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) ksocklnd(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) xt_nat veth vxlan ip6_udp_tunnel udp_tunnel iptable_mangle xt_mark xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc overlay(T) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) sunrpc iTCO_wdt ppdev iTCO_vendor_support nfit libnvdimm iosf_mbi crc32_pclmul Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd i2c_i801 parport_pc joydev pcspkr lpc_ich sg parport i6300esb ip_tables ext4 mbcache jbd2 sr_mod sd_mod cdrom crc_t10dif crct10dif_generic bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net net_failover ahci virtio_blk failover virtio_scsi(OE) drm libahci mlx5_core(OE) libata crct10dif_pclmul crct10dif_common mlxfw(OE) crc32c_intel ptp pps_core iavf devlink serio_raw virtio_pci virtio_ring mlx_compat(OE) virtio drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: CPU: 15 PID: 6012 Comm: ll_ost_io03_055 Kdump: loaded Tainted: G OE ------------ T 3.10.0-1160.21.1.el7_lustre.ddn15.x86_64 #1 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: Hardware name: DDN SFA400NVXE, BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: Call Trace: Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff85b8305a>] dump_stack+0x19/0x1b Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff8549b1b8>] __warn+0xd8/0x100 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff8549b2fd>] warn_slowpath_null+0x1d/0x20 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0f9f6e2>] __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0407fc2>] ? jbd2_journal_get_write_access+0x32/0x40 [jbd2] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0fa35f5>] ldiskfs_mb_mark_diskspace_used+0x2e5/0x510 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0fa79a0>] ldiskfs_mb_new_blocks+0x350/0xb20 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854d3281>] ? __wake_up_common_lock+0x91/0xc0 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854d2d20>] ? task_rq_unlock+0x20/0x20 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0fe23ce>] ldiskfs_new_meta_blocks+0xfe/0x110 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff228d>] ldiskfs_ext_split+0x10d/0x9a0 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0408321>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff413d>] ldiskfs_ext_insert_extent+0x7bd/0x8d0 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff441d>] ldiskfs_split_extent_at+0x1cd/0x410 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff478a>] ldiskfs_split_extent.isra.40+0x12a/0x190 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff63c2>] ldiskfs_ext_handle_unwritten_extents+0x682/0xc20 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff33a4>] ? ldiskfs_ext_find_extent+0x134/0x340 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0ff7287>] ldiskfs_ext_map_blocks+0x8f7/0xfb0 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff8578d922>] ? radix_tree_lookup_slot+0x22/0x50 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0fbd47c>] ldiskfs_map_blocks+0x12c/0x6a0 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0f9f5ab>] ? __ldiskfs_handle_dirty_metadata+0x8b/0x220 [ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc10c4b80>] osd_ldiskfs_map_inode_pages+0x300/0x820 [osd_ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc10c543a>] osd_write_commit+0x39a/0x860 [osd_ldiskfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc1796086>] ofd_commitrw_write+0xfa6/0x1da0 [ofd] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc179ab29>] ofd_commitrw+0x479/0xe20 [ofd] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc1237cf6>] tgt_brw_write+0x1026/0x1d60 [ptlrpc] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff85625f8d>] ? __slab_free+0x9d/0x290 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0a50bce>] ? lnet_ptl_attach_md+0x37e/0x530 [lnet] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc1184a90>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc123380e>] tgt_request_handle+0xaee/0x15f0 [ptlrpc] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc120ef61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc0969bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc11da7cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc11d7445>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc11de134>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854d4c1e>] ? finish_task_switch+0x4e/0x1c0 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffffc11dd600>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854c5da1>] kthread+0xd1/0xe0 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854c5cd0>] ? insert_kthread_work+0x40/0x40 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff85b95ddd>] ret_from_fork_nospec_begin+0x7/0x21 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: [<ffffffff854c5cd0>] ? insert_kthread_work+0x40/0x40 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: ---[ end trace 1bb5251d0179a71d ]--- Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: LDISKFS-fs: ldiskfs_mb_mark_diskspace_used:3509: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: LDISKFS: jbd2_journal_dirty_metadata failed: handle type 0 started at line 1976, credits 28/0, errcode -28 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: Quota error (device sdf): qtree_write_dquot: dquota write failed Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: LDISKFS-fs error (device sdf) in ldiskfs_write_dquot:5609: error 28 Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: Aborting journal on device sdf-8. Jun 1 12:17:54 ai400x-0ab2-vm00 kernel: LDISKFS-fs (sdf): Remounting filesystem read-only |
| Comments |
| Comment by Gerrit Updater [ 02/Jun/21 ] |
|
Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43890 |
| Comment by Gerrit Updater [ 14/Jun/21 ] |
|
Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43994 |
| Comment by Gerrit Updater [ 14/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43890/ |
| Comment by Gerrit Updater [ 21/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43994/ |
| Comment by Peter Jones [ 21/Jun/21 ] |
|
Landed for 2.15 |
| Comment by Andreas Dilger [ 13/Jul/21 ] |
|
Shilong, |
| Comment by Andreas Dilger [ 13/Jul/21 ] |
|
One example is: Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey === 02:25:32 Lustre: 617162:0:(osd_internal.h:1385:osd_trans_exec_check()) lustre-OST0004: opcode 7: used 8, used now 8, reserved 4 Lustre: 617162:0:(osd_handler.c:1867:osd_trans_dump_creds()) create: 0/0/0, destroy: 0/0/0 Lustre: 617162:0:(osd_handler.c:1874:osd_trans_dump_creds()) attr_set: 0/0/0, xattr_set: 0/0/0 Lustre: 617162:0:(osd_handler.c:1884:osd_trans_dump_creds()) write: 1/4/8, punch: 0/0/0, quota 3/3/0 Lustre: 617162:0:(osd_handler.c:1891:osd_trans_dump_creds()) insert: 0/0/0, delete: 0/0/0 Lustre: 617162:0:(osd_handler.c:1898:osd_trans_dump_creds()) ref_add: 0/0/0, ref_del: 0/0/0 LustreError: 617162:0:(osd_internal.h:1388:osd_trans_exec_check()) LBUG Pid: 617162, comm: lctl 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Mon Jun 14 23:42:59 UTC 2021 Call Trace TBD: libcfs_call_trace+0x6f/0x90 [libcfs] lbug_with_loc+0x43/0x80 [libcfs] osd_write_commit+0x6a2/0x980 [osd_ldiskfs] ofd_commitrw_write+0x698/0x1990 [ofd] ofd_commitrw+0x376/0x9d0 [ofd] echo_client_prep_commit.isra.50+0x86f/0xf00 [obdecho] echo_client_iocontrol+0xe83/0x1540 [obdecho] class_handle_ioctl+0x16b0/0x1c90 [obdclass] obd_class_ioctl+0x118/0x160 [obdclass] do_vfs_ioctl+0xa4/0x640 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5b/0x1a0 It might be that this issue is limited to obdfilter-survey, but it would be good to fix this regardless, since this is sometimes used in the field to debug storage performance problems. |
| Comment by Wang Shilong (Inactive) [ 14/Jul/21 ] |
|
I am checking, this is a bit different than original problem, BUG triggered on Lustre side rather than in ldiskfs, from Ldiskfs point, it looks credits is enough, otherwise we will trigger original warning. But problem is extra Lustre osd check failed, it found used credits larger than reserved which was unexpected.. |
| Comment by Gerrit Updater [ 15/Jul/21 ] |
|
Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44316 |
| Comment by Gerrit Updater [ 22/Jul/21 ] |
|
Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44380 |
| Comment by James Nunez (Inactive) [ 30/Aug/21 ] |
|
Bobijam Would you please look into this issue? New crashes are at: James |
| Comment by Andreas Dilger [ 31/Aug/21 ] |
|
There is still patch https://review.whamcloud.com/44316 " |
| Comment by Gerrit Updater [ 17/Sep/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44316/ |
| Comment by Peter Jones [ 17/Sep/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 09/Nov/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45506 |
| Comment by Gerrit Updater [ 09/Nov/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45507 |
| Comment by Gerrit Updater [ 09/Nov/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45508 |
| Comment by Gerrit Updater [ 14/Nov/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/45506/ |
| Comment by Gerrit Updater [ 14/Nov/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/45507/ |