Details
Type: Bug
Resolution: Fixed
Priority: Major
Description
Most of the OSTs (36 out of 40) were remounted read-only at the same time, due to an aborted transaction in __ldiskfs_handle_dirty_metadata, as follows.
Jun 15 18:54:54 oss01-mg kernel: ------------[ cut here ]------------
Jun 15 18:54:54 oss01-mg kernel: WARNING: at /tmp/rpmbuild-lustre-root-FtLAmY5x/BUILD/lustre-2.7.18.4.ddn0.g557254f/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]()
Jun 15 18:54:54 oss01-mg kernel: Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) mlx4_core(OE) crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul ppdev glue_helper ablk_helper cryptd sg pcspkr i6300esb parport_pc parport i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod crct10dif_generic cdrom ata_generic mlx5_ib(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) pata_acpi cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper mlx5_core(OE) ttm vxlan ip6_udp_tunnel udp_tunnel
Jun 15 18:54:54 oss01-mg kernel: ata_piix crct10dif_pclmul ptp crct10dif_common e1000 drm libata pps_core serio_raw igbvf crc32c_intel i2c_core mlx_compat(OE) sfablkdriver(OE) floppy dm_mirror dm_region_hash dm_log dm_mod
Jun 15 18:54:54 oss01-mg kernel: CPU: 9 PID: 4980 Comm: ll_ost03_019 Tainted: G OE ------------ 3.10.0-327.36.1.el7_lustre.2.7.18.4.ddn0.g557254f.x86_64 #1
Jun 15 18:54:54 oss01-mg kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Jun 15 18:54:54 oss01-mg kernel: 0000000000000000 00000000e3d423b3 ffff8809cea03820 ffffffff816366a1
Jun 15 18:54:54 oss01-mg kernel: ffff8809cea03858 ffffffff8107b260 ffff88083ec95d68 ffff880a0070d450
Jun 15 18:54:54 oss01-mg kernel: ffff8809897a4548 ffffffffa0c66a1c 0000000000000327 ffff8809cea03868
Jun 15 18:54:54 oss01-mg kernel: Call Trace:
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff816366a1>] dump_stack+0x19/0x1b
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff8107b260>] warn_slowpath_common+0x70/0xb0
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff8107b3aa>] warn_slowpath_null+0x1a/0x20
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0c09622>] __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0c2cf21>] ldiskfs_getblk+0x131/0x200 [ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0c2d01a>] ldiskfs_bread+0x2a/0x1e0 [ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0cf33c9>] osd_ldiskfs_write_record+0x169/0x360 [osd_ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0cf36b8>] osd_write+0xf8/0x230 [osd_ldiskfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0790325>] dt_record_write+0x45/0x130 [obdclass]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a4ceac>] tgt_client_data_write.isra.19+0x12c/0x140 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a5112b>] tgt_client_data_update+0x36b/0x510 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a51a0b>] tgt_client_new+0x3fb/0x5f0 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0e29358>] ofd_obd_connect+0x2e8/0x3f0 [ofd]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa09b6c6f>] target_handle_connect+0x11ef/0x2bf0 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810c5618>] ? load_balance+0x218/0x890
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810c1a96>] ? dequeue_entity+0x106/0x520
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a35400>] ? nrs_request_removed+0x80/0x120 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a5c8ba>] tgt_request_handle+0x55a/0x11f0 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa09ffa0b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0634d08>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa09fcad8>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a03330>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffffa0a02730>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810a5b8f>] kthread+0xcf/0xe0
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff81646cd8>] ret_from_fork+0x58/0x90
Jun 15 18:54:54 oss01-mg kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
Jun 15 18:54:54 oss01-mg kernel: ---[ end trace 120678ee9d6e4000 ]---
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs: ldiskfs_getblk:807: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs error (device sfa0007): ldiskfs_getblk:807: inode #81: block 805347324: comm ll_ost03_019: journal_dirty_metadata failed: handle type 0 started at line 1156, credits 8/0, errcode -28
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs: ldiskfs_getblk:807: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs error (device sfa0000): ldiskfs_getblk:807: inode #81: block 110813181: comm ll_ost03_045: journal_dirty_metadata failed: handle type 0 started at line 1156, credits 8/0, errcode -28
Jun 15 18:54:54 oss01-mg kernel: Aborting journal on device sfa0000-8.
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs (sfa0000): Remounting filesystem read-only
Jun 15 18:54:54 oss01-mg kernel: LustreError: 5006:0:(osd_io.c:1694:osd_ldiskfs_write_record()) sfa0000: error reading offset 20480 (block 5): rc = -28
Jun 15 18:54:54 oss01-mg kernel: LDISKFS-fs error (device sfa0000) in osd_trans_stop:1240: error 28
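For context, errcode -28 is -ENOSPC coming from the journal layer rather than from the disk being full: a running handle carries a fixed reservation of buffer credits, and dirtying one more metadata buffer than was reserved fails and aborts the transaction, which then forces the device read-only. The user-space sketch below is only an illustrative model of that credit accounting, not the real jbd2/ldiskfs code; the struct and function names are invented stand-ins, and the 8-credit reservation is taken from the "credits 8/0" line in the log above.

/*
 * Illustrative user-space model (NOT kernel code) of how a journal
 * handle runs out of buffer credits and returns -ENOSPC (28), which
 * is the "errcode -28" seen above.  "struct handle" and
 * dirty_metadata() are simplified stand-ins for jbd2's handle and
 * its metadata-dirtying path.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct handle {
	int  buffer_credits;   /* how many metadata buffers may still be dirtied */
	bool aborted;          /* set once the transaction has been aborted      */
};

/* Each buffer dirtied for the first time under this handle consumes one
 * credit; dirtying more buffers than were reserved fails with -ENOSPC
 * and the transaction is aborted. */
static int dirty_metadata(struct handle *h)
{
	if (h->aborted)
		return -EROFS;
	if (h->buffer_credits <= 0) {
		h->aborted = true;      /* "Aborting journal on device ..."      */
		return -ENOSPC;         /* caller logs "errcode -28"             */
	}
	h->buffer_credits--;
	return 0;
}

int main(void)
{
	struct handle h = { .buffer_credits = 8, .aborted = false };

	/* Reserve 8 credits but touch 9 buffers: the 9th exceeds the
	 * reservation, mirroring the failure reported in the log. */
	for (int i = 1; i <= 9; i++) {
		int rc = dirty_metadata(&h);
		if (rc) {
			printf("buffer %d: rc = %d, credits left %d\n",
			       i, rc, h.buffer_credits);
			break;
		}
	}
	return 0;
}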
The customer was trying to mount Lustre from new clients, which are Intel Xeon Phi KNL servers, and was installing Lustre 2.7.19.8.ddn3 on them. They have 16 machines; mounting Lustre worked as expected on 14 clients, but they hit a problem on 2 clients. When we checked the messages files from the servers, the OSTs had been remounted read-only.
A similar issue is reported in LU-6722, and the fix is included from 2.7.19.12 onward. Could you check whether this symptom is caused by the issue in LU-6722?