[LU-9740] Most of OSTs remounted read-only due to abort transaction in __ldiskfs_handle_dirty_metadata Created: 06/Jul/17  Updated: 06/Sep/18  Resolved: 19/Jul/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.1, Lustre 2.11.0

Type: Bug Priority: Major
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Most of the OSTs (36 out of 40) were remounted read-only at the same time due to an aborted transaction in __ldiskfs_handle_dirty_metadata, as follows.

Jun 15 18:54:54 oss01-mg kernel: ------------[ cut here ]------------
Jun 15 18:54:54 oss01-mg kernel: WARNING: at /tmp/rpmbuild-lustre-root-FtLAmY5x/BUILD/lustre-2.7.18.4.ddn0.g557254f/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]()
Jun 15 18:54:54 oss01-mg kernel: Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) mlx4_core(OE) crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul ppdev glue_helper ablk_helper cryptd sg pcspkr i6300esb parport_pc parport i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod crct10dif_generic cdrom ata_generic mlx5_ib(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) pata_acpi cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper mlx5_core(OE) ttm vxlan ip6_udp_tunnel udp_tunnel
Jun 15 18:54:54 oss01-mg kernel: ata_piix crct10dif_pclmul ptp crct10dif_common e1000 drm libata pps_core serio_raw igbvf crc32c_intel i2c_core mlx_compat(OE) sfablkdriver(OE) floppy dm_mirror dm_region_hash dm_log dm_mod
Jun 15 18:54:54 oss01-mg kernel: CPU: 9 PID: 4980 Comm: ll_ost03_019 Tainted: G           OE  ------------   3.10.0-327.36.1.el7_lustre.2.7.18.4.ddn0.g557254f.x86_64 #1
Jun 15 18:54:54 oss01-mg kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Jun 15 18:54:54 oss01-mg kernel: 0000000000000000 00000000e3d423b3 ffff8809cea03820 ffffffff816366a1
Jun 15 18:54:54 oss01-mg kernel: ffff8809cea03858 ffffffff8107b260 ffff88083ec95d68 ffff880a0070d450
Jun 15 18:54:54 oss01-mg kernel: ffff8809897a4548 ffffffffa0c66a1c 0000000000000327 ffff8809cea03868
Jun 15 18:54:54 oss01-mg kernel: Call Trace:
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff816366a1>] dump_stack+0x19/0x1b
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff8107b260>] warn_slowpath_common+0x70/0xb0
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff8107b3aa>] warn_slowpath_null+0x1a/0x20
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0c09622>] __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0c2cf21>] ldiskfs_getblk+0x131/0x200 [ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0c2d01a>] ldiskfs_bread+0x2a/0x1e0 [ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0cf33c9>] osd_ldiskfs_write_record+0x169/0x360 [osd_ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0cf36b8>] osd_write+0xf8/0x230 [osd_ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0790325>] dt_record_write+0x45/0x130 [obdclass]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a4ceac>] tgt_client_data_write.isra.19+0x12c/0x140 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a5112b>] tgt_client_data_update+0x36b/0x510 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a51a0b>] tgt_client_new+0x3fb/0x5f0 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0e29358>] ofd_obd_connect+0x2e8/0x3f0 [ofd]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa09b6c6f>] target_handle_connect+0x11ef/0x2bf0 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810c5618>] ? load_balance+0x218/0x890
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810c1a96>] ? dequeue_entity+0x106/0x520
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a35400>] ? nrs_request_removed+0x80/0x120 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a5c8ba>] tgt_request_handle+0x55a/0x11f0 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa09ffa0b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0634d08>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa09fcad8>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a03330>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a02730>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810a5b8f>] kthread+0xcf/0xe0
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff81646cd8>] ret_from_fork+0x58/0x90
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
Jun 15 18:54:54 oss01-mg kernel: ---[ end trace 120678ee9d6e4000 ]---
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs: ldiskfs_getblk:807: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs error (device sfa0007): ldiskfs_getblk:807: inode #81: block 805347324: comm ll_ost03_019: journal_dirty_metadata failed: handle type 0 started at line 1156, credits 8/0, errcode -28
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs: ldiskfs_getblk:807: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs error (device sfa0000): ldiskfs_getblk:807: inode #81: block 110813181: comm ll_ost03_045: journal_dirty_metadata failed: handle type 0 started at line 1156, credits 8/0, errcode -28
Jun 15 18:54:54 oss01-mg kernel: Aborting journal on device sfa0000-8.
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs (sfa0000): Remounting filesystem read-only
Jun 15 18:54:54 oss01-mg kernel: LustreError: 5006:0:(osd_io.c:1694:osd_ldiskfs_write_record()) sfa0000: error reading offset 20480 (block 5): rc = -28
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs error (device sfa0000) in osd_trans_stop:1240: error 28

The customer was trying to mount Lustre from new clients, which are Intel Xeon Phi KNL servers running Lustre 2.7.19.8.ddn3. They have 16 machines; mounting Lustre worked as expected on 14 clients, but they found a problem on 2 clients. When we checked the messages files from the servers, the OSTs had been remounted read-only.

A similar issue is reported in LU-6722, and the fix is included from 2.7.19.12 onward. Can you check whether this symptom is due to the issue in LU-6722?
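For context, the -28 (ENOSPC) in the log is not a lack of disk space: __ldiskfs_handle_dirty_metadata returns it when the journal handle has run out of declared credits (the log shows "credits 8/0"), which aborts the journal and forces the read-only remount. The write here is tgt_client_new() updating a client slot in the last_rcvd file, i.e. a write into the middle of the file that may have to allocate new blocks, so it needs more credits than a simple append. Below is a minimal, illustrative sketch of the credit-sizing idea, not the actual LU-9740 patch; demo_start_write() and DEMO_CREDITS_PER_NEW_BLOCK are hypothetical names, while the real logic lives in the osd_ldiskfs write path (osd_io.c).

/*
 * Illustrative sketch, assuming a jbd2 journal handle sized by the caller.
 * Not the LU-9740 patch itself.
 */
#include <linux/jbd2.h>

/* rough cost of allocating one new block in an indexed file:
 * index block + block bitmap + group descriptor + the data block itself */
#define DEMO_CREDITS_PER_NEW_BLOCK	4

static handle_t *demo_start_write(journal_t *journal, unsigned int new_blocks)
{
	/*
	 * An append-only write dirties a small, predictable set of metadata
	 * blocks.  A non-append write into a hole may allocate every block
	 * it covers, so each of those blocks needs its own allocation
	 * credits.  Under-declaring here is what makes
	 * __ldiskfs_handle_dirty_metadata() return -ENOSPC (-28), aborting
	 * the journal and remounting the filesystem read-only.
	 */
	int credits = new_blocks * DEMO_CREDITS_PER_NEW_BLOCK +
		      1 /* inode block */ + 1 /* superblock */;

	return jbd2_journal_start(journal, credits);
}

The merged patch ("LU-9740 ldiskfs: more credits for non-append write") addresses exactly this under-declaration for the non-append case.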



 Comments   
Comment by Gerrit Updater [ 06/Jul/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/27947
Subject: LU-9740 ldiskfs: more credits for non-append write
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4ae97971f3ed0156ecbe81bf088b5e6a23ce10ec

Comment by Gerrit Updater [ 19/Jul/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27947/
Subject: LU-9740 ldiskfs: more credits for non-append write
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c668a8d405a9d8819bf9b96e0c610ccc5353d77d

Comment by Peter Jones [ 19/Jul/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 26/Jul/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28229
Subject: LU-9740 ldiskfs: more credits for non-append write
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 98f1b59aaf81ac9d57b8091a0517fc89faf5a1d3

Comment by Gerrit Updater [ 07/Aug/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28229/
Subject: LU-9740 ldiskfs: more credits for non-append write
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 138c9a3bae52a2d6abeb5af07fc2076bcd9526b1
