Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version: Lustre 2.11.0
Description
I added DNE testing to my testbed, and test 300k seems to be having problems on ldiskfs.
It starts like this:
[13013.691307] Lustre: DEBUG MARKER: == sanity test 300k: test large striped directory ==================================================== 18:35:09 (1520206509)
[13014.055529] Lustre: *** cfs_fail_loc=1703, val=0***
[13014.098222] Lustre: 24226:0:(osd_internal.h:1151:osd_trans_exec_op()) lustre-MDT0000: opcode 3: before 512 < left 1480, rollback = 4
[13014.099829] Lustre: 24226:0:(osd_internal.h:1151:osd_trans_exec_op()) Skipped 23 previous similar messages
[13014.101329] Lustre: 24226:0:(osd_handler.c:1741:osd_trans_dump_creds()) create: 185/1480/0, destroy: 0/0/0
[13014.102702] Lustre: 24226:0:(osd_handler.c:1741:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13014.103935] Lustre: 24226:0:(osd_handler.c:1748:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 346/33383/0
[13014.105479] Lustre: 24226:0:(osd_handler.c:1748:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13014.106887] Lustre: 24226:0:(osd_handler.c:1758:osd_trans_dump_creds()) write: 53/1648/0, punch: 0/0/0, quota 2/2/0
[13014.107836] Lustre: 24226:0:(osd_handler.c:1758:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13014.112440] Lustre: 24226:0:(osd_handler.c:1765:osd_trans_dump_creds()) insert: 1055/23885/0, delete: 0/0/0
[13014.114028] Lustre: 24226:0:(osd_handler.c:1765:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13014.115799] Lustre: 24226:0:(osd_handler.c:1772:osd_trans_dump_creds()) ref_add: 685/685/0, ref_del: 0/0/0
[13014.132704] Lustre: 24226:0:(osd_handler.c:1772:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13017.644534] ------------[ cut here ]------------
[13017.650823] WARNING: at /home/green/git/lustre-release/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1d2/0x230 [ldiskfs]()
[13017.671187] Modules linked in: brd lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) ext4 loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate mbcache jbd2 syscopyarea sysfillrect sysimgblt ata_generic ttm pata_acpi drm_kms_helper drm i2c_piix4 ata_piix virtio_balloon pcspkr serio_raw virtio_blk i2c_core virtio_console libata floppy nfsd ip_tables rpcsec_gss_krb5 [last unloaded: obdecho]
[13017.687393] CPU: 6 PID: 24226 Comm: mdt03_004 Tainted: P OE ------------ 3.10.0-debug #2
[13017.688706] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[13017.689495] 0000000000000000 000000009b5bd8bc ffff880213757438 ffffffff816fd3e4
[13017.691291] ffff880213757470 ffffffff81076250 ffff880076c48f88 ffff880237e11fa0
[13017.692912] ffff8800855c39f8 ffffffffa0b48d9c 0000000000000323 ffff880213757480
[13017.694222] Call Trace:
[13017.694832] [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[13017.695505] [<ffffffff81076250>] warn_slowpath_common+0x70/0xb0
[13017.696204] [<ffffffff8107639a>] warn_slowpath_null+0x1a/0x20
[13017.697091] [<ffffffffa0b38192>] __ldiskfs_handle_dirty_metadata+0x1d2/0x230 [ldiskfs]
[13017.698752] [<ffffffffa0b17602>] ldiskfs_getblk+0x142/0x210 [ldiskfs]
[13017.699645] [<ffffffffa0b176f7>] ldiskfs_bread+0x27/0xe0 [ldiskfs]
[13017.700518] [<ffffffffa0b091a1>] ldiskfs_append+0x81/0x150 [ldiskfs]
[13017.701366] [<ffffffffa0b1022f>] ldiskfs_init_new_dir+0xcf/0x230 [ldiskfs]
[13017.702238] [<ffffffffa0b103de>] ldiskfs_add_dot_dotdot+0x4e/0x90 [ldiskfs]
[13017.703247] [<ffffffffa0b8559f>] osd_add_dot_dotdot_internal.isra.73+0x5f/0x80 [osd_ldiskfs]
[13017.706196] [<ffffffffa0b9a962>] osd_index_ea_insert+0xa22/0x11c0 [osd_ldiskfs]
[13017.707436] [<ffffffffa0dbb1b2>] lod_sub_insert+0x1c2/0x340 [lod]
[13017.708022] [<ffffffffa0da0387>] lod_xattr_set_lmv.isra.49.constprop.66+0x647/0xe30 [lod]
[13017.709667] [<ffffffffa0dae0cf>] lod_dir_striping_create_internal+0x46f/0x950 [lod]
[13017.712608] [<ffffffffa0dafb2f>] lod_xattr_set+0x25f/0x860 [lod]
[13017.713517] [<ffffffffa0daf8d0>] ? lod_layout_del+0xab0/0xab0 [lod]
[13017.714535] [<ffffffffa0c5e6da>] mdo_xattr_set+0x7a/0x180 [mdd]
[13017.715478] [<ffffffffa0c76981>] ? mdd_env_info+0x21/0x60 [mdd]
[13017.716396] [<ffffffffa0c649f6>] mdd_create_object+0x416/0x980 [mdd]
[13017.717305] [<ffffffffa0c6c383>] mdd_create+0xce3/0x13a0 [mdd]
[13017.718352] [<ffffffffa0ce7296>] mdt_create+0x846/0xbb0 [mdt]
[13017.719404] [<ffffffffa03d1b89>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[13017.720412] [<ffffffffa0406481>] ? lprocfs_job_stats_log+0xd1/0x600 [obdclass]
[13017.726742] [<ffffffffa0ce776b>] mdt_reint_create+0x16b/0x350 [mdt]
[13017.727541] [<ffffffffa0ce8cc0>] mdt_reint_rec+0x80/0x210 [mdt]
[13017.728262] [<ffffffffa0cc87bb>] mdt_reint_internal+0x5fb/0x990 [mdt]
[13017.728967] [<ffffffffa0cd3587>] mdt_reint+0x67/0x140 [mdt]
[13017.730200] [<ffffffffa063b28b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[13017.730981] [<ffffffffa05e1ce1>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[13017.732359] [<ffffffffa05e5a98>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[13017.733076] [<ffffffffa05e5040>] ? ptlrpc_register_service+0xeb0/0xeb0 [ptlrpc]
[13017.748161] [<ffffffff810a2eba>] kthread+0xea/0xf0
[13017.750350] [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[13017.751067] [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[13017.751743] [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[13017.752450] ---[ end trace 4ba61d7be859e71c ]---
[13017.753099] LDISKFS-fs: ldiskfs_getblk:803: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
[13017.758373] LDISKFS-fs error (device loop0): ldiskfs_getblk:803: inode #26416: block 17641: comm mdt03_004: journal_dirty_metadata failed: handle type 0 started at line 1837, credits 512/0, errcode -28
[13017.761414] Aborting journal on device loop0-8.
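For context, errcode -28 is -ENOSPC, which here means the handle ran out of journal credits rather than the filesystem running out of space. A rough back-of-the-envelope check (my own arithmetic, assuming 4KB journal blocks and the usual jbd2 behaviour of giving a single handle roughly a quarter of the journal) shows why the default 16MB journal cannot cover the credits declared for the large striped directory:
# rough arithmetic only, not taken from the Lustre or jbd2 sources
echo $((16 * 1024 / 4))   # 4096  journal blocks in a 16MB journal with 4KB blocks
echo $((4096 / 4))        # ~1024 credits at most for a single handle
# compare with the credits declared above: xattr_set 33383, insert 23885 --
# the handle ends up capped (credits 512/0 in the abort message) and
# __ldiskfs_handle_dirty_metadata fails with -28 once they are exhausted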
Understandably, a lot of errors ensue after that, and then we get leaked objects in the cache as we try to remove the modules:
[13293.993223] Lustre: 31450:0:(gss_mech_switch.c:80:lgss_mech_unregister()) Unregister krb5 mechanism
[13295.637401] LNet: Removed LNI 192.168.123.193@tcp
[13296.208435] kmem_cache_destroy ldiskfs_inode_cache: Slab cache still has objects
[13296.230451] CPU: 7 PID: 31473 Comm: rmmod Tainted: P W OE ------------ 3.10.0-debug #2
[13296.246837] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[13296.247525] ffff880322083b00 0000000055345a9a ffff880303a6fec0 ffffffff816fd3e4
[13296.248832] ffff880303a6fee0 ffffffff811965cf fffffffffffffff5 ffffffffa0b65b80
[13296.250123] ffff880303a6fef0 ffffffffa0b46cc0 ffff880303a6ff78 ffffffff810f06bb
[13296.251433] Call Trace:
[13296.252040] [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[13296.252717] [<ffffffff811965cf>] kmem_cache_destroy+0xdf/0xf0
[13296.253481] [<ffffffffa0b46cc0>] ldiskfs_exit_fs+0x3c/0x37c [ldiskfs]
[13296.267203] [<ffffffff810f06bb>] SyS_delete_module+0x16b/0x2d0
[13296.267942] [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b
[13296.375897] kmem_cache_destroy ldiskfs_extent_status: Slab cache still has objects
[13296.386704] CPU: 7 PID: 31473 Comm: rmmod Tainted: P W OE ------------ 3.10.0-debug #2
[13296.388371] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[13296.389248] ffff88030d64d6c0 0000000055345a9a ffff880303a6feb0 ffffffff816fd3e4
[13296.390923] ffff880303a6fed0 ffffffff811965cf fffffffffffffff5 ffffffffa0b65b80
[13296.392558] ffff880303a6fee0 ffffffffa0b423ca ffff880303a6fef0 ffffffffa0b46cf8
[13296.394264] Call Trace:
[13296.397125] [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[13296.398053] [<ffffffff811965cf>] kmem_cache_destroy+0xdf/0xf0
[13296.399147] [<ffffffffa0b423ca>] ldiskfs_exit_es+0x1a/0x20 [ldiskfs]
[13296.400204] [<ffffffffa0b46cf8>] ldiskfs_exit_fs+0x74/0x37c [ldiskfs]
[13296.401109] [<ffffffff810f06bb>] SyS_delete_module+0x16b/0x2d0
[13296.402219] [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b
The MDS journal size is too low with DNE. The default journal size is 16MB, and the 4MB available to a single transaction is no longer enough; we probably need some calculation against MDSCOUNT (a rough sketch follows the workaround below).
[root@rhel7-1 tests]# REFORMAT=yes MDSCOUNT=4 MDSSIZE=512000 MDSJOURNALSIZE=128 sh sanity.sh
These options avoid the jbd2 panic in test 300k, and the other tests passed without osd_trans_exec_op errors.
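As a sketch of the MDSCOUNT-based calculation mentioned above (my own guess at a formula, not code from the test framework; the 32MB-per-MDT factor is chosen only so that MDSCOUNT=4 reproduces the 128MB value that worked here):
# hypothetical scaling rule, for illustration only
: ${MDSCOUNT:=4}
MDSJOURNALSIZE=$((32 * MDSCOUNT))        # MB; gives 128 for MDSCOUNT=4
REFORMAT=yes MDSCOUNT=$MDSCOUNT MDSSIZE=512000 \
    MDSJOURNALSIZE=$MDSJOURNALSIZE sh sanity.sh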