[LU-10762] sanity test 300k transaction accounting problem Created: 05/Mar/18 Updated: 30/Jul/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | dne2 | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I added DNE testing into my testbed and test 300k seems to be having problems on ldiskfs. It starts like this:

[13013.691307] Lustre: DEBUG MARKER: == sanity test 300k: test large striped directory ==================================================== 18:35:09 (1520206509)
[13014.055529] Lustre: *** cfs_fail_loc=1703, val=0***
[13014.098222] Lustre: 24226:0:(osd_internal.h:1151:osd_trans_exec_op()) lustre-MDT0000: opcode 3: before 512 < left 1480, rollback = 4
[13014.099829] Lustre: 24226:0:(osd_internal.h:1151:osd_trans_exec_op()) Skipped 23 previous similar messages
[13014.101329] Lustre: 24226:0:(osd_handler.c:1741:osd_trans_dump_creds()) create: 185/1480/0, destroy: 0/0/0
[13014.102702] Lustre: 24226:0:(osd_handler.c:1741:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13014.103935] Lustre: 24226:0:(osd_handler.c:1748:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 346/33383/0
[13014.105479] Lustre: 24226:0:(osd_handler.c:1748:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13014.106887] Lustre: 24226:0:(osd_handler.c:1758:osd_trans_dump_creds()) write: 53/1648/0, punch: 0/0/0, quota 2/2/0
[13014.107836] Lustre: 24226:0:(osd_handler.c:1758:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13014.112440] Lustre: 24226:0:(osd_handler.c:1765:osd_trans_dump_creds()) insert: 1055/23885/0, delete: 0/0/0
[13014.114028] Lustre: 24226:0:(osd_handler.c:1765:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13014.115799] Lustre: 24226:0:(osd_handler.c:1772:osd_trans_dump_creds()) ref_add: 685/685/0, ref_del: 0/0/0
[13014.132704] Lustre: 24226:0:(osd_handler.c:1772:osd_trans_dump_creds()) Skipped 23 previous similar messages
[13017.644534] ------------[ cut here ]------------
[13017.650823] WARNING: at /home/green/git/lustre-release/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1d2/0x230 [ldiskfs]()
[13017.671187] Modules linked in: brd lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) ext4 loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate mbcache jbd2 syscopyarea sysfillrect sysimgblt ata_generic ttm pata_acpi drm_kms_helper drm i2c_piix4 ata_piix virtio_balloon pcspkr serio_raw virtio_blk i2c_core virtio_console libata floppy nfsd ip_tables rpcsec_gss_krb5 [last unloaded: obdecho]
[13017.687393] CPU: 6 PID: 24226 Comm: mdt03_004 Tainted: P OE ------------ 3.10.0-debug #2
[13017.688706] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[13017.689495] 0000000000000000 000000009b5bd8bc ffff880213757438 ffffffff816fd3e4
[13017.691291] ffff880213757470 ffffffff81076250 ffff880076c48f88 ffff880237e11fa0
[13017.692912] ffff8800855c39f8 ffffffffa0b48d9c 0000000000000323 ffff880213757480
[13017.694222] Call Trace:
[13017.694832] [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[13017.695505] [<ffffffff81076250>] warn_slowpath_common+0x70/0xb0
[13017.696204] [<ffffffff8107639a>] warn_slowpath_null+0x1a/0x20
[13017.697091] [<ffffffffa0b38192>] __ldiskfs_handle_dirty_metadata+0x1d2/0x230 [ldiskfs]
[13017.698752] [<ffffffffa0b17602>] ldiskfs_getblk+0x142/0x210 [ldiskfs]
[13017.699645] [<ffffffffa0b176f7>] ldiskfs_bread+0x27/0xe0 [ldiskfs]
[13017.700518] [<ffffffffa0b091a1>] ldiskfs_append+0x81/0x150 [ldiskfs]
[13017.701366] [<ffffffffa0b1022f>] ldiskfs_init_new_dir+0xcf/0x230 [ldiskfs]
[13017.702238] [<ffffffffa0b103de>] ldiskfs_add_dot_dotdot+0x4e/0x90 [ldiskfs]
[13017.703247] [<ffffffffa0b8559f>] osd_add_dot_dotdot_internal.isra.73+0x5f/0x80 [osd_ldiskfs]
[13017.706196] [<ffffffffa0b9a962>] osd_index_ea_insert+0xa22/0x11c0 [osd_ldiskfs]
[13017.707436] [<ffffffffa0dbb1b2>] lod_sub_insert+0x1c2/0x340 [lod]
[13017.708022] [<ffffffffa0da0387>] lod_xattr_set_lmv.isra.49.constprop.66+0x647/0xe30 [lod]
[13017.709667] [<ffffffffa0dae0cf>] lod_dir_striping_create_internal+0x46f/0x950 [lod]
[13017.712608] [<ffffffffa0dafb2f>] lod_xattr_set+0x25f/0x860 [lod]
[13017.713517] [<ffffffffa0daf8d0>] ? lod_layout_del+0xab0/0xab0 [lod]
[13017.714535] [<ffffffffa0c5e6da>] mdo_xattr_set+0x7a/0x180 [mdd]
[13017.715478] [<ffffffffa0c76981>] ? mdd_env_info+0x21/0x60 [mdd]
[13017.716396] [<ffffffffa0c649f6>] mdd_create_object+0x416/0x980 [mdd]
[13017.717305] [<ffffffffa0c6c383>] mdd_create+0xce3/0x13a0 [mdd]
[13017.718352] [<ffffffffa0ce7296>] mdt_create+0x846/0xbb0 [mdt]
[13017.719404] [<ffffffffa03d1b89>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[13017.720412] [<ffffffffa0406481>] ? lprocfs_job_stats_log+0xd1/0x600 [obdclass]
[13017.726742] [<ffffffffa0ce776b>] mdt_reint_create+0x16b/0x350 [mdt]
[13017.727541] [<ffffffffa0ce8cc0>] mdt_reint_rec+0x80/0x210 [mdt]
[13017.728262] [<ffffffffa0cc87bb>] mdt_reint_internal+0x5fb/0x990 [mdt]
[13017.728967] [<ffffffffa0cd3587>] mdt_reint+0x67/0x140 [mdt]
[13017.730200] [<ffffffffa063b28b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[13017.730981] [<ffffffffa05e1ce1>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[13017.732359] [<ffffffffa05e5a98>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[13017.733076] [<ffffffffa05e5040>] ? ptlrpc_register_service+0xeb0/0xeb0 [ptlrpc]
[13017.748161] [<ffffffff810a2eba>] kthread+0xea/0xf0
[13017.750350] [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[13017.751067] [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[13017.751743] [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[13017.752450] ---[ end trace 4ba61d7be859e71c ]---
[13017.753099] LDISKFS-fs: ldiskfs_getblk:803: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
[13017.758373] LDISKFS-fs error (device loop0): ldiskfs_getblk:803: inode #26416: block 17641: comm mdt03_004: journal_dirty_metadata failed: handle type 0 started at line 1837, credits 512/0, errcode -28
[13017.761414] Aborting journal on device loop0-8.

Understandably, a lot of errors ensue after that, and then we get leaked objects in the cache as we try to remove the modules:

[13293.993223] Lustre: 31450:0:(gss_mech_switch.c:80:lgss_mech_unregister()) Unregister krb5 mechanism
[13295.637401] LNet: Removed LNI 192.168.123.193@tcp
[13296.208435] kmem_cache_destroy ldiskfs_inode_cache: Slab cache still has objects
[13296.230451] CPU: 7 PID: 31473 Comm: rmmod Tainted: P W OE ------------ 3.10.0-debug #2
[13296.246837] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[13296.247525] ffff880322083b00 0000000055345a9a ffff880303a6fec0 ffffffff816fd3e4
[13296.248832] ffff880303a6fee0 ffffffff811965cf fffffffffffffff5 ffffffffa0b65b80
[13296.250123] ffff880303a6fef0 ffffffffa0b46cc0 ffff880303a6ff78 ffffffff810f06bb
[13296.251433] Call Trace:
[13296.252040] [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[13296.252717] [<ffffffff811965cf>] kmem_cache_destroy+0xdf/0xf0
[13296.253481] [<ffffffffa0b46cc0>] ldiskfs_exit_fs+0x3c/0x37c [ldiskfs]
[13296.267203] [<ffffffff810f06bb>] SyS_delete_module+0x16b/0x2d0
[13296.267942] [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b
[13296.375897] kmem_cache_destroy ldiskfs_extent_status: Slab cache still has objects
[13296.386704] CPU: 7 PID: 31473 Comm: rmmod Tainted: P W OE ------------ 3.10.0-debug #2
[13296.388371] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[13296.389248] ffff88030d64d6c0 0000000055345a9a ffff880303a6feb0 ffffffff816fd3e4
[13296.390923] ffff880303a6fed0 ffffffff811965cf fffffffffffffff5 ffffffffa0b65b80
[13296.392558] ffff880303a6fee0 ffffffffa0b423ca ffff880303a6fef0 ffffffffa0b46cf8
[13296.394264] Call Trace:
[13296.397125] [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[13296.398053] [<ffffffff811965cf>] kmem_cache_destroy+0xdf/0xf0
[13296.399147] [<ffffffffa0b423ca>] ldiskfs_exit_es+0x1a/0x20 [ldiskfs]
[13296.400204] [<ffffffffa0b46cf8>] ldiskfs_exit_fs+0x74/0x37c [ldiskfs]
[13296.401109] [<ffffffff810f06bb>] SyS_delete_module+0x16b/0x2d0
[13296.402219] [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b |
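
Editor's note, a back-of-the-envelope reading of the credit dump above (my own sketch, not confirmed anywhere in this ticket): the credits declared for a single striped-directory create far exceed both the 512 credits the handle was actually started with and what a default 16MB MDT journal can grant to one handle (jbd2 normally caps a handle at roughly a quarter of the journal), which would be consistent with the -28 (ENOSPC) abort and with the journal-size comment further down.

```
# Rough arithmetic from the osd_trans_dump_creds() lines above (illustrative only;
# the "~1/4 of the journal per handle" cap is jbd2's usual limit, assumed here).
declared=$((1480 + 1 + 33383 + 1648 + 2 + 23885 + 685))  # create+attr_set+xattr_set+write+quota+insert+ref_add
echo "credits declared for one striped-dir create: ${declared} blocks"        # ~61084
journal_blocks=$((16 * 1024 / 4))                         # 16MB journal / 4KB blocks = 4096
echo "default journal: ${journal_blocks} blocks, per-handle cap ~$((journal_blocks / 4))"  # ~1024
```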
| Comments |
| Comment by Chris Horn [ 20/Jul/18 ] |
|
I believe I'm seeing this same issue with Lustre 2.11 on a SLES 15 vm:

[ 5011.405770] Lustre: DEBUG MARKER: == sanity test 300k: test large striped directory ==================================================== 13:09:56 (1532110196)
[ 5011.523914] Lustre: *** cfs_fail_loc=1703, val=0***
[ 5011.529595] Lustre: 1525:0:(osd_internal.h:1168:osd_trans_exec_op()) lustre-MDT0000: opcode 3: before 512 < left 2160, rollback = 4
[ 5011.529604] Lustre: 1525:0:(osd_internal.h:1168:osd_trans_exec_op()) Skipped 71 previous similar messages
[ 5011.529613] Lustre: 1525:0:(osd_handler.c:1745:osd_trans_dump_creds()) create: 270/2160/0, destroy: 0/0/0
[ 5011.529617] Lustre: 1525:0:(osd_handler.c:1745:osd_trans_dump_creds()) Skipped 23 previous similar messages
[ 5011.529623] Lustre: 1525:0:(osd_handler.c:1752:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 516/49873/0
[ 5011.529627] Lustre: 1525:0:(osd_handler.c:1752:osd_trans_dump_creds()) Skipped 23 previous similar messages
[ 5011.529636] Lustre: 1525:0:(osd_handler.c:1762:osd_trans_dump_creds()) write: 53/1648/0, punch: 0/0/0, quota 3/3/0
[ 5011.529640] Lustre: 1525:0:(osd_handler.c:1762:osd_trans_dump_creds()) Skipped 23 previous similar messages
[ 5011.529648] Lustre: 1525:0:(osd_handler.c:1769:osd_trans_dump_creds()) insert: 1310/26350/0, delete: 0/0/0
[ 5011.529652] Lustre: 1525:0:(osd_handler.c:1769:osd_trans_dump_creds()) Skipped 23 previous similar messages
[ 5011.529659] Lustre: 1525:0:(osd_handler.c:1776:osd_trans_dump_creds()) ref_add: 770/770/0, ref_del: 0/0/0
[ 5011.529664] Lustre: 1525:0:(osd_handler.c:1776:osd_trans_dump_creds()) Skipped 23 previous similar messages
[ 5015.565453] Lustre: 1525:0:(osd_internal.h:1168:osd_trans_exec_op()) lustre-MDT0000: opcode 8: before 17 < left 25904, rollback = 9
[ 5015.565459] Lustre: 1525:0:(osd_internal.h:1168:osd_trans_exec_op()) Skipped 2125 previous similar messages
[ 5015.995265] ------------[ cut here ]------------
[ 5015.995346] WARNING: CPU: 26 PID: 1525 at /tmp/rpmbuild-lustre-build-dm3Xm3GS/BUILD/lustre-2.11.0_1_cray_131_gba03050/ldiskfs/ext4_jbd2.c:277 __ldiskfs_handle_dirty_metadata+0xa7/0x1a0 [ldiskfs]
[ 5015.995348] Modules linked in: obdecho(OEN) brd ofd(OEN) ost(OEN) osp(OEN) mdd(OEN) lod(OEN) mdt(OEN) lfsck(OEN) mgs(OEN) osd_ldiskfs(OEN) ldiskfs(OEN) lquota(OEN) lustre(OEN) mgc(OEN) lov(OEN) mdc(OEN) osc(OEN) lmv(OEN) fid(OEN) fld(OEN) ptlrpc(OEN) obdclass(OEN) ksocklnd(OEN) lnet(OEN) libcfs(OEN) loop quota_v2 quota_tree af_packet iscsi_ibft iscsi_boot_sysfs vmw_vsock_vmci_transport sb_edac coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 vsock vmwgfx ttm drm_kms_helper drm vmw_balloon crypto_simd ppdev drm_panel_orientation_quirks syscopyarea sysfillrect sysimgblt fb_sys_fops glue_helper vmxnet3 cryptd joydev shpchp vmw_vmci i2c_piix4 parport_pc pcspkr parport ac button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod ata_piix ahci libahci
[ 5015.995423] serio_raw mptspi scsi_transport_spi libata mptscsih mptbase floppy sg scsi_mod autofs4 [last unloaded: obdecho]
[ 5015.995434] Supported: No, Unsupported modules are loaded
[ 5015.995439] CPU: 26 PID: 1525 Comm: mdt03_004 Tainted: G W OE 4.12.14-23-default #1 SLE15
[ 5015.995440] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
[ 5015.995442] task: ffff9ed67ebb8900 task.stack: ffffaf9ed8468000
[ 5015.995456] RIP: 0010:__ldiskfs_handle_dirty_metadata+0xa7/0x1a0 [ldiskfs]
[ 5015.995458] RSP: 0018:ffffaf9ed846b770 EFLAGS: 00010286
[ 5015.995460] RAX: ffff9ed89de62800 RBX: ffff9ed874087f70 RCX: ffff9ed8a642ca40
[ 5015.995461] RDX: 0000000000000000 RSI: ffff9ed874087f70 RDI: ffff9ed8a64bc210
[ 5015.995462] RBP: ffff9ed8a64bc210 R08: ffff9ed874087f70 R09: 0000000000000000
[ 5015.995463] R10: ffff9ed8ac3a4000 R11: 0000000000000040 R12: ffff9ed8a642ca40
[ 5015.995465] R13: 00000000ffffffe4 R14: ffffffffc10314d0 R15: 00000000000003fb
[ 5015.995468] FS: 0000000000000000(0000) GS:ffff9ed8bfc80000(0000) knlGS:0000000000000000
[ 5015.995469] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5015.995470] CR2: 00007f651bf8e860 CR3: 0000000437a7a000 CR4: 00000000000406e0
[ 5015.995558] Call Trace:
[ 5015.995584] ldiskfs_getblk+0xeb/0x170 [ldiskfs]
[ 5015.995597] ldiskfs_bread+0xe/0x80 [ldiskfs]
[ 5015.995610] ldiskfs_append+0x5f/0x100 [ldiskfs]
[ 5015.995623] ldiskfs_init_new_dir+0xc1/0x1a0 [ldiskfs]
[ 5015.995634] ldiskfs_add_dot_dotdot+0x41/0x70 [ldiskfs]
[ 5015.995686] osd_add_dot_dotdot_internal.isra.79+0x53/0x70 [osd_ldiskfs]
[ 5015.995698] osd_index_ea_insert+0xe81/0xfb0 [osd_ldiskfs]
[ 5015.995737] lod_sub_insert+0x168/0x360 [lod]
[ 5015.995750] lod_xattr_set_lmv.isra.49.constprop.65+0x65d/0xda0 [lod]
[ 5015.995763] lod_dir_striping_create_internal+0x3a8/0x810 [lod]
[ 5015.995774] lod_xattr_set+0x27e/0x740 [lod]
[ 5015.995882] ? lu_context_refill+0x15/0x50 [obdclass]
[ 5015.995910] mdd_create_object+0x3e9/0x950 [mdd]
[ 5015.995920] mdd_create+0x11c0/0x1d60 [mdd]
[ 5015.995984] ? mdt_version_save+0x66/0x170 [mdt]
[ 5015.995999] mdt_create+0x8cd/0xc50 [mdt]
[ 5015.996030] ? lprocfs_job_stats_log+0x2b4/0x680 [obdclass]
[ 5015.996046] mdt_reint_create+0x23d/0x340 [mdt]
[ 5015.996060] mdt_reint_rec+0x7f/0x250 [mdt]
[ 5015.996075] mdt_reint_internal+0x5ee/0x680 [mdt]
[ 5015.996089] mdt_reint+0x5e/0x110 [mdt]
[ 5015.996311] tgt_request_handle+0x814/0x1500 [ptlrpc]
[ 5015.996363] ptlrpc_server_handle_request+0x2de/0xb90 [ptlrpc]
[ 5015.996411] ptlrpc_main+0xaf7/0x1f10 [ptlrpc]
[ 5015.996456] ? ptlrpc_register_service+0xec0/0xec0 [ptlrpc]
[ 5015.996464] kthread+0x11a/0x130
[ 5015.996467] ? kthread_create_on_node+0x40/0x40
[ 5015.996475] ? do_syscall_64+0x74/0x140
[ 5015.996481] ? SyS_exit_group+0x10/0x10
[ 5015.996489] ret_from_fork+0x35/0x40
[ 5015.996492] Code: e5 48 89 de 48 89 ef e8 88 eb 34 ff f6 45 1c 08 41 89 c5 75 d4 48 8b 45 00 48 85 c0 74 cb 48 8b 00 f6 00 02 75 c3 45 85 ed 74 bb <0f> 0b 48 c7 c2 60 24 03 c1 45 89 e8 48 89 e9 44 89 fe 4c 89 f7
[ 5015.996521] ---[ end trace cbe0954090541ae3 ]---
[ 5015.996526] LDISKFS-fs: ldiskfs_getblk:1019: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
[ 5015.998649] LDISKFS-fs error (device sdc): ldiskfs_getblk:1019: inode #26034: block 17004: comm mdt03_004: journal_dirty_metadata failed: handle type 0 started at line 1841, credits 512/0, errcode -28
[ 5016.002471] Aborting journal on device sdc-8. |
| Comment by Chris Horn [ 23/Jul/18 ] |
|
FWIW - I do not see the same behavior when I run sanity on a remote client. There are messages in the server's dmesg from osd_trans_exec_op() and osd_trans_dump_creds() for subtests 300d, 300g, 300h, and 300n, but not for 300k. And I don't see the "aborting transaction" error from ldiskfs_getblk(), nor is the MDT getting re-mounted read-only. I only see the read-only remount when the client is on the same node as the server. |
| Comment by Alexey Lyashkov [ 30/Jul/18 ] |
|
The MDS journal size is too small with DNE. The default journal size is 16MB, so the ~4MB limit for a single transaction is no longer enough; we probably need to scale the journal size calculation against MDSCOUNT. |
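
Editor's note, a possible workaround along the lines of the comment above (my own untested sketch, not an agreed fix): format the test MDTs with a larger journal so one large striped-directory create fits in a single transaction. The device, fsname, MGS NID, and the naive "16MB x MDSCOUNT" scaling below are placeholders.

```
# Hypothetical reformat of MDT0000 with a journal scaled by the number of MDTs.
MDSCOUNT=4
JSIZE=$((16 * MDSCOUNT))              # journal size in MB; naive scaling with MDSCOUNT
mkfs.lustre --reformat --mdt --index=0 \
    --fsname=lustre --mgsnode=mgs@tcp \
    --mkfsoptions="-J size=${JSIZE}" /dev/sdc
```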