[LU-8119] conf-sanity test_47: timeout Created: 10/May/16  Updated: 19/Apr/17  Resolved: 19/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for liuying <emoly.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ff7c62c8-13ad-11e6-9e5d-5254006e85c2.

The sub-test test_47 failed with the following error:

test failed to respond and timed out

Please provide additional information about the failure here.

MDS console showed:

06:28:40:Call Trace:
06:28:40: [<ffffffffa0c92a8c>] osd_trans_commit_cb+0x17c/0x3c0 [osd_zfs]
06:28:40: [<ffffffffa021686d>] dmu_tx_do_callbacks+0x3d/0x60 [zfs]
06:28:40: [<ffffffffa0258b58>] txg_do_callbacks+0x18/0x40 [zfs]
06:28:40: [<ffffffffa019d727>] taskq_thread+0x1e7/0x3f0 [spl]
06:28:40: [<ffffffff81067670>] ? default_wake_function+0x0/0x20
06:28:40: [<ffffffffa019d540>] ? taskq_thread+0x0/0x3f0 [spl]
06:28:40: [<ffffffff810a138e>] kthread+0x9e/0xc0
06:28:40: [<ffffffff8100c28a>] child_rip+0xa/0x20
06:28:40: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
06:28:40: [<ffffffff8100c280>] ? child_rip+0x0/0x20
06:28:40:Code: 48 8b 0f 48 8b 41 48 48 8d 51 48 48 39 d0 48 8d 58 d8 74 2e 0f 1f 80 00 00 00 00 48 8b 43 10 48 85 c0 74 0d 48 8b 73 18 4c 89 e7 <ff> d0 49 8b 0c 24 48 8b 43 28 48 8d 51 48 48 39 d0 48 8d 58 d8 
06:28:40:RIP  [<ffffffffa136d550>] dt_txn_hook_commit+0x40/0x60 [obdclass]
06:28:40: RSP <ffff88005a2f3d80>

Info required for matching: conf-sanity 47



 Comments   
Comment by James Nunez (Inactive) [ 13/May/16 ]

There's a similar stack trace for a conf-sanity test 44 timeout at https://testing.hpdd.intel.com/test_sets/ecc6376e-1892-11e6-855a-5254006e85c2. From the MDS console log:

11:49:02:[10785.693635] Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
11:49:02:[10786.476837] general protection fault: 0000 [#1] SMP 
11:49:02:[10786.477004] Modules linked in: osd_zfs(OE) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd grace spl(OE) auth_rpcgss zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus ata_piix syscopyarea 8139too virtio_blk sysfillrect sysimgblt drm_kms_helper ttm drm floppy libata virtio_pci i2c_core serio_raw virtio_ring 8139cp virtio mii [last unloaded: libcfs]
11:49:02:[10786.477004] CPU: 0 PID: 6194 Comm: tx_commit_cb Tainted: P           OE  ------------   3.10.0-327.13.1.el7_lustre.x86_64 #1
11:49:02:[10786.477004] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
11:49:02:[10786.477004] task: ffff88007a155080 ti: ffff88004d714000 task.ti: ffff88004d714000
11:49:02:[10786.477004] RIP: 0010:[<ffffffffa0ae0910>]  [<ffffffffa0ae0910>] dt_txn_hook_commit+0x30/0x60 [obdclass]
11:49:02:[10786.477004] RSP: 0018:ffff88004d717d88  EFLAGS: 00010207
11:49:02:[10786.477004] RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a32 RCX: ffff88004e49c000
11:49:02:[10786.477004] RDX: ffff88004e49c048 RSI: ffffffffa13485c0 RDI: ffff880053a15810
11:49:02:[10786.477004] RBP: ffff88004d717d98 R08: 20737365636f7250 R09: 0a64657265746e65
11:49:02:[10786.477004] R10: ffff880054140b1f R11: 0a64657265746e65 R12: ffff880053a15800
11:49:02:[10786.477004] R13: 0000000000000000 R14: ffff880053a15850 R15: ffff88004e49c000
11:49:02:[10786.477004] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
11:49:02:[10786.477004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
11:49:02:[10786.477004] CR2: 00007f348ade1000 CR3: 000000007b9e0000 CR4: 00000000000006f0
11:49:02:[10786.477004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
11:49:02:[10786.477004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
11:49:02:[10786.477004] Stack:
11:49:02:[10786.477004]  ffff880052c72ec0 ffff880053a15800 ffff88004d717dd0 ffffffffa14bf75d
11:49:02:[10786.477004]  ffff880052c72ec0 ffff880052c72e10 0000000000000000 ffff880052c72e00
11:49:02:[10786.477004]  ffff88004db7dd00 ffff88004d717e00 ffffffffa03d1884 ffff880052c72e00
11:49:02:[10786.477004] Call Trace:
11:49:02:[10786.477004]  [<ffffffffa14bf75d>] osd_trans_commit_cb+0xed/0x440 [osd_zfs]
11:49:02:[10786.477004]  [<ffffffffa03d1884>] dmu_tx_do_callbacks+0x44/0x70 [zfs]
11:49:02:[10786.477004]  [<ffffffffa04182c4>] txg_do_callbacks+0x14/0x30 [zfs]
11:49:02:[10786.477004]  [<ffffffffa02886de>] taskq_thread+0x21e/0x430 [spl]
11:49:02:[10786.477004]  [<ffffffff810b8c00>] ? wake_up_state+0x20/0x20
11:49:02:[10786.477004]  [<ffffffffa02884c0>] ? taskq_thread_spawn+0x60/0x60 [spl]
11:49:02:[10786.477004]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
11:49:02:[10786.477004]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
11:49:02:[10786.477004]  [<ffffffff81646018>] ret_from_fork+0x58/0x90
11:49:02:[10786.477004]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
11:49:02:[10786.477004] Code: 55 48 89 e5 41 54 49 89 fc 53 f6 47 48 02 75 42 48 8b 0f 48 8b 41 48 48 8d 51 48 48 39 d0 48 8d 58 d8 74 2e 0f 1f 80 00 00 00 00 <48> 8b 43 10 48 85 c0 74 0d 48 8b 73 18 4c 89 e7 ff d0 49 8b 0c 
11:49:02:[10786.477004] RIP  [<ffffffffa0ae0910>] dt_txn_hook_commit+0x30/0x60 [obdclass]
11:49:02:[10786.477004]  RSP <ffff88004d717d88>
Comment by Mikhail Pershin [ 07/Dec/16 ]

I saw this issue regularly in my tests and have a patch for it. The problem is this code in osd_trans_commit_cb():

 /* call per-transaction callbacks if any */
 list_for_each_entry_safe(dcb, tmp, &oh->ot_dcb_list, dcb_linkage)
         dcb->dcb_func(NULL, th, dcb, error);

It is missing a list_del_init() before the dcb_func() call. The dcb entry can be destroyed by dcb_func(), so it must be removed from the list before the callback runs.
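
A minimal sketch of the corrected loop, assuming the standard kernel list helpers (the actual patch may differ), just removes each entry from the list before invoking its callback:

 /* call per-transaction callbacks if any; remove each entry from the
  * list first, since dcb_func() may free the dcb (sketch only) */
 list_for_each_entry_safe(dcb, tmp, &oh->ot_dcb_list, dcb_linkage) {
         list_del_init(&dcb->dcb_linkage);
         dcb->dcb_func(NULL, th, dcb, error);
 }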

Interestingly, there is no such problem with the ldiskfs OSD; the bug exists only in the ZFS OSD code.

I will push a patch shortly.

Comment by Gerrit Updater [ 07/Dec/16 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/24202
Subject: LU-8119 osd: remove dcb entry from the list prior processing.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1b51c67568583be3d70506723c4ac8fa97176dcd

Comment by Mikhail Pershin [ 07/Dec/16 ]

I suppose this bug is quite rare because in most cases there is only one commit callback, but from JIRA I see it happens from time to time and the tickets are simply closed when the bug does not reappear for a while. It is still a critical bug.

Comment by Mikhail Pershin [ 24/Dec/16 ]

I still see several variants of this issue when applying a high load to the OSD and then unmounting it. Alex reported the same issues happening in his tests.

Comment by Gerrit Updater [ 19/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24202/
Subject: LU-8119 osd: remove dcb entry from the list prior processing.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: aa64476cef7d5f0050de20839925868013caa0b6

Comment by Peter Jones [ 19/Apr/17 ]

Landed for 2.10
