[LU-8392] sanity test_27z: soft lockup - CPU#0 stuck for 22s! [ptlrpcd_rcv:6145] Created: 13/Jul/16  Updated: 12/Aug/16  Resolved: 12/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Sebastien Buisson (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7845 Support namespace in credentials retr... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/2fcd5716-48f6-11e6-bf87-5254006e85c2.

The sub-test test_27z failed with the following error:

test failed to respond and timed out
11:07:16:[ 1632.062003] BUG: soft lockup - CPU#0 stuck for 22s! [ptlrpcd_rcv:6145]
11:07:16:[ 1632.062003] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper 8139too ttm ata_piix serio_raw virtio_pci virtio_ring virtio libata 8139cp mii drm i2c_core floppy
11:07:16:[ 1632.062003] CPU: 0 PID: 6145 Comm: ptlrpcd_rcv Tainted: G           OE  ------------   3.10.0-327.22.2.el7.x86_64 #1
11:07:16:[ 1632.062003] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
11:07:16:[ 1632.062003] task: ffff880078184500 ti: ffff880078e58000 task.ti: ffff880078e58000
11:07:16:[ 1632.062003] RIP: 0010:[<ffffffff8163dab2>]  [<ffffffff8163dab2>] _raw_spin_lock+0x32/0x50
11:07:16:[ 1632.062003] RSP: 0018:ffff880078e5b9f0  EFLAGS: 00000202
11:07:16:[ 1632.062003] RAX: 0000000000002fb8 RBX: 0000000000000000 RCX: 000000000000701c
11:07:16:[ 1632.062003] RDX: 000000000000701e RSI: 000000000000701e RDI: ffff880079cb4d00
11:07:16:[ 1632.062003] RBP: ffff880078e5b9f0 R08: 0000000000000000 R09: 0000000000000208
11:07:16:[ 1632.062003] R10: 0000000000000009 R11: ffff880078e5b85e R12: ffff880078e5bfd8
11:07:16:[ 1632.062003] R13: ffffffff812fd8e3 R14: ffff880078e5b9f0 R15: ffffffffa05cf498
11:07:16:[ 1632.062003] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
11:07:16:[ 1632.062003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
11:07:16:[ 1632.062003] CR2: 00007f5378b2fed0 CR3: 000000007856d000 CR4: 00000000000006f0
11:07:16:[ 1632.062003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
11:07:16:[ 1632.062003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
11:07:16:[ 1632.062003] Stack:
11:07:16:[ 1632.062003]  ffff880078e5ba18 ffffffffa05db498 0000020000000100 0000000000000000
11:07:16:[ 1632.062003]  ffffffffffffffff ffff880078e5ba90 ffffffffa065a296 000200000a090430
11:07:16:[ 1632.062003]  000200000a09042e ffff880044ae7c00 ffffffffa06916e0 0000000000020000
11:07:16:[ 1632.062003] Call Trace:
11:07:16:[ 1632.062003]  [<ffffffffa05db498>] cfs_percpt_lock+0x58/0x110 [libcfs]
11:07:16:[ 1632.062003]  [<ffffffffa065a296>] lnet_send+0xb6/0xc90 [lnet]
11:07:16:[ 1632.062003]  [<ffffffff811c178e>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
11:07:16:[ 1632.062003]  [<ffffffffa065b0b5>] LNetPut+0x245/0x7a0 [lnet]
11:07:16:[ 1632.062003]  [<ffffffffa0919aa3>] ptl_send_buf+0x183/0x500 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa091b5b1>] ptl_send_rpc+0x611/0xda0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa0910ff0>] ptlrpc_send_new_req+0x460/0xa60 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa0914358>] ptlrpc_check_set.part.23+0x9a8/0x1dd0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa09157db>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa09403bb>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffffa094076b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffff810b88d0>] ? wake_up_state+0x20/0x20
11:07:16:[ 1632.062003]  [<ffffffffa09404b0>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
11:07:16:[ 1632.062003]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
11:07:16:[ 1632.062003]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
11:07:16:[ 1632.062003]  [<ffffffff816467d8>] ret_from_fork+0x58/0x90
11:07:16:[ 1632.062003]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140

Info required for matching: sanity 27z



 Comments   
Comment by John Hammond [ 13/Jul/16 ]

Another failure: https://testing.hpdd.intel.com/test_sets/0cc7cb54-4872-11e6-8968-5254006e85c2

Comment by John Hammond [ 13/Jul/16 ]

This seems to have been introduced by http://review.whamcloud.com/#/c/18782/, which calls lnet_ipif_query() while holding the lnet_net_lock (a minimal sketch of the pattern follows the backtrace below):

PID: 9297   TASK: ffff88007a2f3980  CPU: 0   COMMAND: "ll_cfg_requeue"
 #0 [ffff8800796c73c8] __schedule at ffffffff8163b16d
 #1 [ffff8800796c7430] __cond_resched at ffffffff810b5ed6
 #2 [ffff8800796c7448] _cond_resched at ffffffff8163baaa
 #3 [ffff8800796c7458] kmem_cache_alloc at ffffffff811c1425
 #4 [ffff8800796c7498] sock_alloc_inode at ffffffff8151025d
 #5 [ffff8800796c74b8] alloc_inode at ffffffff811f986d
 #6 [ffff8800796c74d8] new_inode_pseudo at ffffffff811fb891
 #7 [ffff8800796c74f8] sock_alloc at ffffffff8150fdba
 #8 [ffff8800796c7510] __sock_create at ffffffff81510b05
 #9 [ffff8800796c7560] sock_create at ffffffff81510d10
#10 [ffff8800796c7570] lnet_sock_ioctl at ffffffffa0654070 [lnet]
#11 [ffff8800796c75b8] lnet_ipif_query at ffffffffa0654ca8 [lnet]
#12 [ffff8800796c7620] LNetDist at ffffffffa06568b4 [lnet]
#13 [ffff8800796c7688] ptlrpc_uuid_to_peer at ffffffffa09262d4 [ptlrpc]
#14 [ffff8800796c76f0] ptlrpc_uuid_to_connection at ffffffffa090d26f [ptlrpc]
#15 [ffff8800796c7730] import_set_conn at ffffffffa08f1600 [ptlrpc]
#16 [ffff8800796c7788] import_set_conn_priority at ffffffffa08f3555 [ptlrpc]
#17 [ffff8800796c7798] ptlrpc_recover_import at ffffffffa091816b [ptlrpc]
#18 [ffff8800796c7848] lprocfs_import_seq_write at ffffffffa0944600 [ptlrpc]
#19 [ffff8800796c78a8] osc_import_seq_write at ffffffffa0adb539 [osc]
#20 [ffff8800796c78b8] class_process_proc_param at ffffffffa0731ea4 [obdclass]
#21 [ffff8800796c7ac8] osc_process_config_base at ffffffffa0adab82 [osc]
#22 [ffff8800796c7ad8] osc_cl_process_config at ffffffffa0adcebc [osc]
#23 [ffff8800796c7af8] class_process_config at ffffffffa0737d9e [obdclass]
#24 [ffff8800796c7bb0] mgc_apply_recover_logs at ffffffffa0b91bbc [mgc]
#25 [ffff8800796c7cd8] mgc_process_recover_nodemap_log at ffffffffa0b93d48 [mgc]
#26 [ffff8800796c7d68] mgc_process_log at ffffffffa0b96614 [mgc]
#27 [ffff8800796c7e28] mgc_requeue_thread at ffffffffa0b98538 [mgc]
#28 [ffff8800796c7ec8] kthread at ffffffff810a5aef
#29 [ffff8800796c7f50] ret_from_fork at ffffffff816467d8
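
For illustration only, here is a minimal userspace sketch (plain pthreads, not Lustre code; the thread names, the 2-second sleep and the 100 ms delay are placeholders) of the pattern the two backtraces point at: one thread performs a blocking call while holding a lock that another thread busy-waits on, which is what the soft-lockup watchdog flags once a spinning CPU is stuck for more than ~20 seconds.

/*
 * Userspace analogue of the suspected hazard:
 *  - config_thread stands in for ll_cfg_requeue, which ends up sleeping
 *    (socket creation / allocation in lnet_ipif_query()) while holding
 *    the net lock;
 *  - rpc_thread stands in for ptlrpcd_rcv, which spins on the same lock
 *    in cfs_percpt_lock() and would trigger the watchdog.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_spinlock_t net_lock;          /* stands in for lnet_net_lock */

static void *config_thread(void *arg)
{
        (void)arg;
        pthread_spin_lock(&net_lock);
        sleep(2);                            /* blocking call made under the lock: the bug pattern */
        pthread_spin_unlock(&net_lock);
        return NULL;
}

static void *rpc_thread(void *arg)
{
        struct timespec t0, t1;

        (void)arg;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_spin_lock(&net_lock);        /* burns this CPU until the lock is released */
        pthread_spin_unlock(&net_lock);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("spun on the lock for %ld ms\n",
               (t1.tv_sec - t0.tv_sec) * 1000 +
               (t1.tv_nsec - t0.tv_nsec) / 1000000);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_spin_init(&net_lock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&a, NULL, config_thread, NULL);
        usleep(100000);                      /* let config_thread take the lock first */
        pthread_create(&b, NULL, rpc_thread, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

In the kernel there is no such 2-second bound: the lock holder can block indefinitely, so the spinning thread is reported after 22 s, as in the console log above.
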
Comment by nasf (Inactive) [ 15/Jul/16 ]

I hit similar trouble in replay-single test_102c on master:
https://testing.hpdd.intel.com/test_sets/335fffb2-4a47-11e6-a80f-5254006e85c2

Comment by Jian Yu [ 19/Jul/16 ]

The same failure occurred on replay-single test 38 on master branch:
https://testing.hpdd.intel.com/test_sets/64b16e48-49c5-11e6-bf87-5254006e85c2

Comment by Peter Jones [ 19/Jul/16 ]

Sebastien

Could you please look into this one?

Thanks

Peter

Comment by Gerrit Updater [ 20/Jul/16 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: http://review.whamcloud.com/21437
Subject: LU-8392 lnet: no lnet_net_lock for address visibility check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 29538f69e00d8677789e2c18387e4a8e150ca241

Comment by Jian Yu [ 22/Jul/16 ]

More failure instance: https://testing.hpdd.intel.com/test_sets/32f28fce-4f90-11e6-9f8e-5254006e85c2

Comment by Sebastien Buisson (Inactive) [ 22/Jul/16 ]

Hmm, this soft lockup bug was hit during testing of patch http://review.whamcloud.com/21437:
https://testing.hpdd.intel.com/test_sets/3a5366ea-4fd0-11e6-bf87-5254006e85c2

So it seems that calling lnet_ipif_query() while holding the lnet_net_lock (the change landed at http://review.whamcloud.com/18782), while probably not recommended, is not the primary cause of this soft lockup.

Comment by nasf (Inactive) [ 26/Jul/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/d79fa8e8-532f-11e6-bf87-5254006e85c2

Comment by Jian Yu [ 28/Jul/16 ]

sanity-quota test 18 also hit the same issue:
https://testing.hpdd.intel.com/test_sets/f2f11e80-54a4-11e6-905c-5254006e85c2

Comment by nasf (Inactive) [ 29/Jul/16 ]

Another failure on master:
https://testing.hpdd.intel.com/test_sets/b99f71fc-5573-11e6-b5b1-5254006e85c2

Comment by nasf (Inactive) [ 31/Jul/16 ]

More failures on master:
https://testing.hpdd.intel.com/test_sets/7ce11cfc-56a4-11e6-906c-5254006e85c2

Comment by James Nunez (Inactive) [ 01/Aug/16 ]

Looks like we started hitting this soft lockup in recovery-small test_61 in the past two weeks:
https://testing.hpdd.intel.com/test_sets/af0b71a4-4e8f-11e6-9f8e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6e43ceea-528c-11e6-bf87-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ec71f75a-5382-11e6-87c4-5254006e85c2
https://testing.hpdd.intel.com/test_sets/2e6697d8-55e5-11e6-aa74-5254006e85c2

Comment by Sebastien Buisson (Inactive) [ 02/Aug/16 ]

Oleg,

I guess you reverted patch http://review.whamcloud.com/18782 to solve the issue described here. But since the issue was still hit with patch http://review.whamcloud.com/21437 (which avoids calling lnet_ipif_query() while holding the lnet_net_lock), there is no clear evidence that the reverted patch is responsible for this soft lockup.

If you nevertheless think that patch http://review.whamcloud.com/21437 is necessary, let's work to have it merged.

Sebastien.

Comment by Sebastien Buisson (Inactive) [ 03/Aug/16 ]

I pushed a new version of the patch at http://review.whamcloud.com/21437, which avoids calling LIBCFS_ALLOC() while holding the lock (see the sketch below).
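
As a rough sketch of what "avoid calling LIBCFS_ALLOC() while holding the lock" means in general (hypothetical names and a userspace analogue, not the actual patch): any allocation that may block is done before entering the critical section, so only non-blocking pointer updates happen while the spinlock is held.

#include <pthread.h>
#include <stdlib.h>

struct peer_buf { int data[64]; };           /* placeholder structure */

static pthread_spinlock_t net_lock;          /* stands in for lnet_net_lock */
static struct peer_buf *current_buf;

/*
 * Problematic shape (allocation, which may sleep in the kernel, under the lock):
 *
 *      pthread_spin_lock(&net_lock);
 *      buf = malloc(sizeof(*buf));          // LIBCFS_ALLOC() analogue -- may block
 *      ...
 *      pthread_spin_unlock(&net_lock);
 *
 * Revised shape: allocate first, then take the lock only for the pointer swap.
 */
static int update_peer(void)
{
        struct peer_buf *buf = malloc(sizeof(*buf));    /* allocation done outside the lock */
        struct peer_buf *old;

        if (buf == NULL)
                return -1;

        pthread_spin_lock(&net_lock);
        old = current_buf;                   /* only pointer updates under the lock */
        current_buf = buf;
        pthread_spin_unlock(&net_lock);

        free(old);                           /* release the old buffer after dropping the lock */
        return 0;
}

int main(void)
{
        pthread_spin_init(&net_lock, PTHREAD_PROCESS_PRIVATE);
        return update_peer();
}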

Comment by Sebastien Buisson (Inactive) [ 04/Aug/16 ]

FYI, patch http://review.whamcloud.com/21437 successfully passed Maloo tests.
I have just retriggered them with:
Test-Parameters: fortestonly testlist=sanity,sanity,sanity,sanity
to see if it passes sanity consistently.

Comment by Sebastien Buisson (Inactive) [ 05/Aug/16 ]

Hi,

Confirmed that http://review.whamcloud.com/21437 passes sanity tests consistently.

Comment by Peter Jones [ 11/Aug/16 ]

If I understand correctly, the revert of the LU-7845 patch from master has meant that this is no longer occurring regularly and so is not a blocker for 2.9. As Sebastien reports it can still happen occasionally, I have moved it to 2.10 with a lower priority.

Comment by Sebastien Buisson (Inactive) [ 12/Aug/16 ]

Peter,

It appears this soft lockup was due to http://review.whamcloud.com/18782. But now that this patch is reverted, the bug should not appear any more. Maybe we can wait a bit before closing this bug, just to make sure it does not occur again.

Sebastien.

Comment by Peter Jones [ 12/Aug/16 ]

In that case let's close this as a duplicate of LU-7845 and open a new ticket if there is still a residual issue
