[LU-8119] conf-sanity test_47: timeout Created: 10/May/16 Updated: 19/Apr/17 Resolved: 19/Apr/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for liuying <emoly.liu@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ff7c62c8-13ad-11e6-9e5d-5254006e85c2. The sub-test test_47 failed with the following error: test failed to respond and timed out Please provide additional information about the failure here. MDS console showed: 06:28:40:Call Trace: 06:28:40: [<ffffffffa0c92a8c>] osd_trans_commit_cb+0x17c/0x3c0 [osd_zfs] 06:28:40: [<ffffffffa021686d>] dmu_tx_do_callbacks+0x3d/0x60 [zfs] 06:28:40: [<ffffffffa0258b58>] txg_do_callbacks+0x18/0x40 [zfs] 06:28:40: [<ffffffffa019d727>] taskq_thread+0x1e7/0x3f0 [spl] 06:28:40: [<ffffffff81067670>] ? default_wake_function+0x0/0x20 06:28:40: [<ffffffffa019d540>] ? taskq_thread+0x0/0x3f0 [spl] 06:28:40: [<ffffffff810a138e>] kthread+0x9e/0xc0 06:28:40: [<ffffffff8100c28a>] child_rip+0xa/0x20 06:28:40: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 06:28:40: [<ffffffff8100c280>] ? child_rip+0x0/0x20 06:28:40:Code: 48 8b 0f 48 8b 41 48 48 8d 51 48 48 39 d0 48 8d 58 d8 74 2e 0f 1f 80 00 00 00 00 48 8b 43 10 48 85 c0 74 0d 48 8b 73 18 4c 89 e7 <ff> d0 49 8b 0c 24 48 8b 43 28 48 8d 51 48 48 39 d0 48 8d 58 d8 06:28:40:RIP [<ffffffffa136d550>] dt_txn_hook_commit+0x40/0x60 [obdclass] 06:28:40: RSP <ffff88005a2f3d80> Info required for matching: conf-sanity 47 |
| Comments |
| Comment by James Nunez (Inactive) [ 13/May/16 ] |
|
There's a similar stack trace for a conf-sanity test 44 timeout at https://testing.hpdd.intel.com/test_sets/ecc6376e-1892-11e6-855a-5254006e85c2. From the MDS console log: 11:49:02:[10785.693635] Lustre: DEBUG MARKER: umount -d -f /mnt/mds1 11:49:02:[10786.476837] general protection fault: 0000 [#1] SMP 11:49:02:[10786.477004] Modules linked in: osd_zfs(OE) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd grace spl(OE) auth_rpcgss zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus ata_piix syscopyarea 8139too virtio_blk sysfillrect sysimgblt drm_kms_helper ttm drm floppy libata virtio_pci i2c_core serio_raw virtio_ring 8139cp virtio mii [last unloaded: libcfs] 11:49:02:[10786.477004] CPU: 0 PID: 6194 Comm: tx_commit_cb Tainted: P OE ------------ 3.10.0-327.13.1.el7_lustre.x86_64 #1 11:49:02:[10786.477004] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 11:49:02:[10786.477004] task: ffff88007a155080 ti: ffff88004d714000 task.ti: ffff88004d714000 11:49:02:[10786.477004] RIP: 0010:[<ffffffffa0ae0910>] [<ffffffffa0ae0910>] dt_txn_hook_commit+0x30/0x60 [obdclass] 11:49:02:[10786.477004] RSP: 0018:ffff88004d717d88 EFLAGS: 00010207 11:49:02:[10786.477004] RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a32 RCX: ffff88004e49c000 11:49:02:[10786.477004] RDX: ffff88004e49c048 RSI: ffffffffa13485c0 RDI: ffff880053a15810 11:49:02:[10786.477004] RBP: ffff88004d717d98 R08: 20737365636f7250 R09: 0a64657265746e65 11:49:02:[10786.477004] R10: ffff880054140b1f R11: 0a64657265746e65 R12: ffff880053a15800 11:49:02:[10786.477004] R13: 0000000000000000 R14: ffff880053a15850 R15: ffff88004e49c000 11:49:02:[10786.477004] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 11:49:02:[10786.477004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 11:49:02:[10786.477004] CR2: 00007f348ade1000 CR3: 000000007b9e0000 CR4: 00000000000006f0 11:49:02:[10786.477004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 11:49:02:[10786.477004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 11:49:02:[10786.477004] Stack: 11:49:02:[10786.477004] ffff880052c72ec0 ffff880053a15800 ffff88004d717dd0 ffffffffa14bf75d 11:49:02:[10786.477004] ffff880052c72ec0 ffff880052c72e10 0000000000000000 ffff880052c72e00 11:49:02:[10786.477004] ffff88004db7dd00 ffff88004d717e00 ffffffffa03d1884 ffff880052c72e00 11:49:02:[10786.477004] Call Trace: 11:49:02:[10786.477004] [<ffffffffa14bf75d>] osd_trans_commit_cb+0xed/0x440 [osd_zfs] 11:49:02:[10786.477004] [<ffffffffa03d1884>] dmu_tx_do_callbacks+0x44/0x70 [zfs] 11:49:02:[10786.477004] [<ffffffffa04182c4>] txg_do_callbacks+0x14/0x30 [zfs] 11:49:02:[10786.477004] [<ffffffffa02886de>] taskq_thread+0x21e/0x430 [spl] 11:49:02:[10786.477004] [<ffffffff810b8c00>] ? wake_up_state+0x20/0x20 11:49:02:[10786.477004] [<ffffffffa02884c0>] ? taskq_thread_spawn+0x60/0x60 [spl] 11:49:02:[10786.477004] [<ffffffff810a5acf>] kthread+0xcf/0xe0 11:49:02:[10786.477004] [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 11:49:02:[10786.477004] [<ffffffff81646018>] ret_from_fork+0x58/0x90 11:49:02:[10786.477004] [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 11:49:02:[10786.477004] Code: 55 48 89 e5 41 54 49 89 fc 53 f6 47 48 02 75 42 48 8b 0f 48 8b 41 48 48 8d 51 48 48 39 d0 48 8d 58 d8 74 2e 0f 1f 80 00 00 00 00 <48> 8b 43 10 48 85 c0 74 0d 48 8b 73 18 4c 89 e7 ff d0 49 8b 0c 11:49:02:[10786.477004] RIP [<ffffffffa0ae0910>] dt_txn_hook_commit+0x30/0x60 [obdclass] 11:49:02:[10786.477004] RSP <ffff88004d717d88> |
| Comment by Mikhail Pershin [ 07/Dec/16 ] |
|
I saw this issue regularly in my tests and have patch for it. The problem is that code in osd_trans_commit_cb(): /* call per-transaction callbacks if any */
list_for_each_entry_safe(dcb, tmp, &oh->ot_dcb_list, dcb_linkage)
dcb->dcb_func(NULL, th, dcb, error);
it misses list_del_init(); before dcb_func() call. the dcb entry can be destroyed by the dcb_func() and must be removed from the list prior that. Interesting that there is no such problem with ldiskfs OSD, the bug exists only in ZFS OSD code. I will push patch shortly. |
| Comment by Gerrit Updater [ 07/Dec/16 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/24202 |
| Comment by Mikhail Pershin [ 07/Dec/16 ] |
|
I suppose that this bug is quite rare because there is only one commit callback in most cases but from JIRA I see it happens time to time and tickets are just closed when bug is not appearing for some time. Meanwhile it is critical bug anyway. |
| Comment by Mikhail Pershin [ 24/Dec/16 ] |
|
I see several kinds of this issue still when applying high load on OSD and then umount it. Alex reported the same issues are happening in his tests. |
| Comment by Gerrit Updater [ 19/Apr/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24202/ |
| Comment by Peter Jones [ 19/Apr/17 ] |
|
Landed for 2.10 |