[LU-8392] sanity test_27z: soft lockup - CPU#0 stuck for 22s! [ptlrpcd_rcv:6145] Created: 13/Jul/16 Updated: 12/Aug/16 Resolved: 12/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Sebastien Buisson (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/2fcd5716-48f6-11e6-bf87-5254006e85c2.

The sub-test test_27z failed with the following error:

test failed to respond and timed out

11:07:16:[ 1632.062003] BUG: soft lockup - CPU#0 stuck for 22s! [ptlrpcd_rcv:6145]
11:07:16:[ 1632.062003] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper 8139too ttm ata_piix serio_raw virtio_pci virtio_ring virtio libata 8139cp mii drm i2c_core floppy
11:07:16:[ 1632.062003] CPU: 0 PID: 6145 Comm: ptlrpcd_rcv Tainted: G OE ------------ 3.10.0-327.22.2.el7.x86_64 #1
11:07:16:[ 1632.062003] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
11:07:16:[ 1632.062003] task: ffff880078184500 ti: ffff880078e58000 task.ti: ffff880078e58000
11:07:16:[ 1632.062003] RIP: 0010:[<ffffffff8163dab2>] [<ffffffff8163dab2>] _raw_spin_lock+0x32/0x50
11:07:16:[ 1632.062003] RSP: 0018:ffff880078e5b9f0 EFLAGS: 00000202
11:07:16:[ 1632.062003] RAX: 0000000000002fb8 RBX: 0000000000000000 RCX: 000000000000701c
11:07:16:[ 1632.062003] RDX: 000000000000701e RSI: 000000000000701e RDI: ffff880079cb4d00
11:07:16:[ 1632.062003] RBP: ffff880078e5b9f0 R08: 0000000000000000 R09: 0000000000000208
11:07:16:[ 1632.062003] R10: 0000000000000009 R11: ffff880078e5b85e R12: ffff880078e5bfd8
11:07:16:[ 1632.062003] R13: ffffffff812fd8e3 R14: ffff880078e5b9f0 R15: ffffffffa05cf498
11:07:16:[ 1632.062003] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
11:07:16:[ 1632.062003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
11:07:16:[ 1632.062003] CR2: 00007f5378b2fed0 CR3: 000000007856d000 CR4: 00000000000006f0
11:07:16:[ 1632.062003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
11:07:16:[ 1632.062003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
11:07:16:[ 1632.062003] Stack:
11:07:16:[ 1632.062003]  ffff880078e5ba18 ffffffffa05db498 0000020000000100 0000000000000000
11:07:16:[ 1632.062003]  ffffffffffffffff ffff880078e5ba90 ffffffffa065a296 000200000a090430
11:07:16:[ 1632.062003]  000200000a09042e ffff880044ae7c00 ffffffffa06916e0 0000000000020000
11:07:16:[ 1632.062003] Call Trace:
11:07:16:[ 1632.062003]  [<ffffffffa05db498>] cfs_percpt_lock+0x58/0x110 [libcfs]
11:07:16:[ 1632.062003]  [<ffffffffa065a296>] lnet_send+0xb6/0xc90 [lnet]
11:07:16:[ 1632.062003]  [<ffffffff811c178e>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
11:07:16:[ 1632.062003]  [<ffffffffa065b0b5>] LNetPut+0x245/0x7a0 [lnet]
11:07:16:[ 1632.062003]  [<ffffffffa0919aa3>] ptl_send_buf+0x183/0x500 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa091b5b1>] ptl_send_rpc+0x611/0xda0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa0910ff0>] ptlrpc_send_new_req+0x460/0xa60 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa0914358>] ptlrpc_check_set.part.23+0x9a8/0x1dd0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa09157db>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa09403bb>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa094076b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffff810b88d0>] ? wake_up_state+0x20/0x20
11:07:16:[ 1632.062003]  [<ffffffffa09404b0>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
11:07:16:[ 1632.062003]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
11:07:16:[ 1632.062003]  [<ffffffff816467d8>] ret_from_fork+0x58/0x90
11:07:16:[ 1632.062003]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140

Info required for matching: sanity 27z |
| Comments |
| Comment by John Hammond [ 13/Jul/16 ] |
|
Another failure: https://testing.hpdd.intel.com/test_sets/0cc7cb54-4872-11e6-8968-5254006e85c2 |
| Comment by John Hammond [ 13/Jul/16 ] |
|
Seems to be introduced by http://review.whamcloud.com/#/c/18782/ which calls lnet_ipif_query() while holding the lnet_net_lock:

PID: 9297  TASK: ffff88007a2f3980  CPU: 0  COMMAND: "ll_cfg_requeue"
 #0 [ffff8800796c73c8] __schedule at ffffffff8163b16d
 #1 [ffff8800796c7430] __cond_resched at ffffffff810b5ed6
 #2 [ffff8800796c7448] _cond_resched at ffffffff8163baaa
 #3 [ffff8800796c7458] kmem_cache_alloc at ffffffff811c1425
 #4 [ffff8800796c7498] sock_alloc_inode at ffffffff8151025d
 #5 [ffff8800796c74b8] alloc_inode at ffffffff811f986d
 #6 [ffff8800796c74d8] new_inode_pseudo at ffffffff811fb891
 #7 [ffff8800796c74f8] sock_alloc at ffffffff8150fdba
 #8 [ffff8800796c7510] __sock_create at ffffffff81510b05
 #9 [ffff8800796c7560] sock_create at ffffffff81510d10
#10 [ffff8800796c7570] lnet_sock_ioctl at ffffffffa0654070 [lnet]
#11 [ffff8800796c75b8] lnet_ipif_query at ffffffffa0654ca8 [lnet]
#12 [ffff8800796c7620] LNetDist at ffffffffa06568b4 [lnet]
#13 [ffff8800796c7688] ptlrpc_uuid_to_peer at ffffffffa09262d4 [ptlrpc]
#14 [ffff8800796c76f0] ptlrpc_uuid_to_connection at ffffffffa090d26f [ptlrpc]
#15 [ffff8800796c7730] import_set_conn at ffffffffa08f1600 [ptlrpc]
#16 [ffff8800796c7788] import_set_conn_priority at ffffffffa08f3555 [ptlrpc]
#17 [ffff8800796c7798] ptlrpc_recover_import at ffffffffa091816b [ptlrpc]
#18 [ffff8800796c7848] lprocfs_import_seq_write at ffffffffa0944600 [ptlrpc]
#19 [ffff8800796c78a8] osc_import_seq_write at ffffffffa0adb539 [osc]
#20 [ffff8800796c78b8] class_process_proc_param at ffffffffa0731ea4 [obdclass]
#21 [ffff8800796c7ac8] osc_process_config_base at ffffffffa0adab82 [osc]
#22 [ffff8800796c7ad8] osc_cl_process_config at ffffffffa0adcebc [osc]
#23 [ffff8800796c7af8] class_process_config at ffffffffa0737d9e [obdclass]
#24 [ffff8800796c7bb0] mgc_apply_recover_logs at ffffffffa0b91bbc [mgc]
#25 [ffff8800796c7cd8] mgc_process_recover_nodemap_log at ffffffffa0b93d48 [mgc]
#26 [ffff8800796c7d68] mgc_process_log at ffffffffa0b96614 [mgc]
#27 [ffff8800796c7e28] mgc_requeue_thread at ffffffffa0b98538 [mgc]
#28 [ffff8800796c7ec8] kthread at ffffffff810a5aef
#29 [ffff8800796c7f50] ret_from_fork at ffffffff816467d8 |
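| Comment (editor's note) |
|
For context, the pattern described above is the classic "sleep while holding a spinlock" problem: lnet_net_lock()/cfs_percpt_lock() take per-CPU spinlocks, while lnet_ipif_query() ends up in sock_create() and kmem_cache_alloc(), which may reschedule. The following is a minimal, generic kernel-style sketch of that anti-pattern; the lock and function names are placeholders for illustration only, not the actual LNet/libcfs code.

#include <linux/spinlock.h>
#include <linux/slab.h>

/* Illustration only: generic names, not the real Lustre code. */
static DEFINE_SPINLOCK(net_lock);	/* stands in for the per-CPU lnet_net_lock */

/* Thread A (like the ll_cfg_requeue thread above): sleeps under the spinlock. */
static void config_path(void)
{
	void *buf;

	spin_lock(&net_lock);
	/*
	 * A GFP_KERNEL allocation (or socket creation, as in
	 * lnet_ipif_query() -> sock_create()) may call schedule().
	 * Sleeping while holding a spinlock keeps the lock held for
	 * an unbounded time.
	 */
	buf = kzalloc(4096, GFP_KERNEL);	/* may sleep */
	kfree(buf);
	spin_unlock(&net_lock);
}

/* Thread B (like ptlrpcd_rcv above): spins on the same lock and never
 * acquires it, so the watchdog reports "BUG: soft lockup - CPU#N stuck". */
static void send_path(void)
{
	spin_lock(&net_lock);	/* spins here */
	/* ... message send work ... */
	spin_unlock(&net_lock);
} |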
| Comment by nasf (Inactive) [ 15/Jul/16 ] |
|
I hit similar trouble in replay-single test_102c on master: |
| Comment by Jian Yu [ 19/Jul/16 ] |
|
The same failure occurred on replay-single test 38 on master branch: |
| Comment by Peter Jones [ 19/Jul/16 ] |
|
Sebastien, could you please look into this one? Thanks, Peter |
| Comment by Gerrit Updater [ 20/Jul/16 ] |
|
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: http://review.whamcloud.com/21437 |
| Comment by Jian Yu [ 22/Jul/16 ] |
|
Another failure instance: https://testing.hpdd.intel.com/test_sets/32f28fce-4f90-11e6-9f8e-5254006e85c2 |
| Comment by Sebastien Buisson (Inactive) [ 22/Jul/16 ] |
|
Hmm, this soft lockup was hit during testing of patch http://review.whamcloud.com/21437 itself. So it seems that calling lnet_ipif_query() while holding the lnet_net_lock (introduced by the patch landed via http://review.whamcloud.com/18782), although probably not recommended, is not the primary cause of this soft lockup. |
| Comment by nasf (Inactive) [ 26/Jul/16 ] |
|
Another failure instance on master: |
| Comment by Jian Yu [ 28/Jul/16 ] |
|
sanity-quota test 18 also hit the same issue: |
| Comment by nasf (Inactive) [ 29/Jul/16 ] |
|
Another failure on master: |
| Comment by nasf (Inactive) [ 31/Jul/16 ] |
|
More failures on master: |
| Comment by James Nunez (Inactive) [ 01/Aug/16 ] |
|
It looks like we started hitting this soft lockup in recovery-small test_61 in the past two weeks: |
| Comment by Sebastien Buisson (Inactive) [ 02/Aug/16 ] |
|
Oleg, I guess you reverted patch http://review.whamcloud.com/18782 to solve the issue described here. But since the issue was still hit with patch http://review.whamcloud.com/21437 (which avoids calling lnet_ipif_query() while holding the lnet_net_lock), there is no proof that patch http://review.whamcloud.com/18782 is responsible for this soft lockup. If you nevertheless think that patch http://review.whamcloud.com/21437 is necessary, let's work to get it merged. Sebastien. |
| Comment by Sebastien Buisson (Inactive) [ 03/Aug/16 ] |
|
I pushed a new version of the patch at http://review.whamcloud.com/21437, which avoids calling LIBCFS_ALLOC() while holding the lock. |
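| Comment (editor's note) |
|
The usual way to remove an allocation from under a spinlock is to do the possibly sleeping allocation before taking the lock and only publish the result while the lock is held. The sketch below is a generic illustration of that pattern under placeholder names; it is not the actual change in patch 21437.

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static DEFINE_SPINLOCK(net_lock);	/* placeholder for the real lock */
static void *shared_state;		/* data the lock protects */

static int update_state(void)
{
	/* Allocate outside the lock, where sleeping is allowed. */
	void *new_state = kzalloc(4096, GFP_KERNEL);

	if (!new_state)
		return -ENOMEM;

	/* Only the short, non-sleeping pointer swap happens under the lock. */
	spin_lock(&net_lock);
	swap(shared_state, new_state);
	spin_unlock(&net_lock);

	/* new_state now holds the old pointer; free it outside the lock. */
	kfree(new_state);
	return 0;
} |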
| Comment by Sebastien Buisson (Inactive) [ 04/Aug/16 ] |
|
FYI, patch http://review.whamcloud.com/21437 successfully passed Maloo tests. |
| Comment by Sebastien Buisson (Inactive) [ 05/Aug/16 ] |
|
Hi, I confirmed that http://review.whamcloud.com/21437 passes the sanity tests consistently. |
| Comment by Peter Jones [ 11/Aug/16 ] |
|
If I understand correctly, the revert of the |
| Comment by Sebastien Buisson (Inactive) [ 12/Aug/16 ] |
|
Peter, it appears this soft lockup was due to http://review.whamcloud.com/18782. Now that this patch has been reverted, the bug should not appear any more. Maybe we can wait a bit before closing this bug, just to make sure it does not occur again. Sebastien. |
| Comment by Peter Jones [ 12/Aug/16 ] |
|
In that case let's close this as a duplicate of |