[LU-4410] sanityn test 40a: BUG: soft lockup - CPU#0 stuck for 67s! [ptlrpcd_0:2892] Created: 23/Dec/13 Updated: 10/Oct/21 Resolved: 10/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.2, Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 22pl |
| Environment: | Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/70/ (2.4.2 RC2) |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 12104 |
| Description |
|
sanityn test 40a hung and hit the following failure on one client:

21:36:52:Lustre: DEBUG MARKER: == sanityn test 40a: pdirops: create vs others ================ 21:34:49 (1387604089)
21:36:53:BUG: soft lockup - CPU#0 stuck for 67s! [ptlrpcd_0:2892]
21:36:53:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode 8139too 8139cp mii virtio_balloon i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
21:36:53:CPU 0
21:36:53:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U)
21:36:53:BUG: soft lockup - CPU#1 stuck for 67s! [ll_sa_4070:4079]
21:36:53:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode 8139too 8139cp mii virtio_balloon i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
21:36:53:CPU 1
21:36:53:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode 8139too 8139cp mii virtio_balloon i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
21:36:53:
21:36:53:Pid: 4079, comm: ll_sa_4070 Not tainted 2.6.32-358.23.2.el6.x86_64 #1 Red Hat KVM
21:36:53:RIP: 0010:[<ffffffff81510aae>] [<ffffffff81510aae>] _spin_lock+0x1e/0x30
21:36:53:RSP: 0018:ffff88006c26bda0 EFLAGS: 00000206
21:36:53:RAX: 0000000000000002 RBX: ffff88006c26bda0 RCX: ffff88007cfd8800
21:36:54:RDX: 0000000000000000 RSI: ffff88006c25fec0 RDI: ffff88007a737ec0
21:36:54:RBP: ffffffff8100bb8e R08: ffff88007d860e68 R09: 00000000fffffffe
21:36:54:R10: 0000000000000000 R11: 0000000000000001 R12: ffff88006c26bd80
21:36:54:R13: ffff88006d6c9000 R14: 0000000000001000 R15: 0000000000000000
21:36:54:FS: 00007fb227702700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
21:36:54:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
21:36:54:CR2: 00007f7bbff64000 CR3: 000000006c183000 CR4: 00000000000006e0
21:36:54:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
21:36:54:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
21:36:54:Process ll_sa_4070 (pid: 4079, threadinfo ffff88006c26a000, task ffff88006bd25500)
21:36:54:Stack:
21:36:54: ffff88006c26be10 ffffffffa0abb680 ffff88007a737bf8 ffff88006e9501c8
21:36:54:<d> 0000000000000000 ffff88007a737b00 ffff88007caa01c0 ffff88006bf57200
21:36:54:<d> ffff88006c26bdf0 ffff88007a7ba800 ffff88007a7ba970 ffff88007a737e80
21:36:54:Call Trace:
21:36:54: [<ffffffffa0abb680>] ? ll_post_statahead+0x50/0xa80 [lustre]
21:36:55: [<ffffffffa0abf8c8>] ? ll_statahead_thread+0x268/0xfa0 [lustre]
21:36:55: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
21:36:55: [<ffffffffa0abf660>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
21:36:55: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
21:36:55: [<ffffffffa0abf660>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
21:36:55: [<ffffffffa0abf660>] ? ll_statahead_thread+0x0/0xfa0 [lustre]
21:36:55: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/7cca784a-6b4b-11e3-99ba-52540035b04c
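Both stuck tasks are executing _spin_lock(), which is the classic signature of a spinlock that stays contended for longer than the soft-lockup watchdog threshold: one thread holds the lock across lengthy work while another CPU busy-waits on it. The program below is only a minimal userspace illustration of that pattern under that assumption; the names (demo_lock, holder, waiter) are invented for the demo and this is not Lustre code.

```c
/*
 * Hypothetical demo of the soft-lockup pattern (not Lustre code):
 * one thread holds a spinlock across long work, so the other thread
 * burns CPU inside pthread_spin_lock() the whole time, just as
 * ptlrpcd_0 and ll_sa_4070 burn CPU inside _spin_lock() above.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_spinlock_t demo_lock;

static void *holder(void *arg)
{
    (void)arg;
    pthread_spin_lock(&demo_lock);
    sleep(5);                       /* "long work" while holding the lock */
    pthread_spin_unlock(&demo_lock);
    return NULL;
}

static void *waiter(void *arg)
{
    (void)arg;
    /* Spins at 100% CPU until the holder releases the lock. */
    pthread_spin_lock(&demo_lock);
    pthread_spin_unlock(&demo_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_spin_init(&demo_lock, PTHREAD_PROCESS_PRIVATE);
    pthread_create(&a, NULL, holder, NULL);
    sleep(1);                       /* ensure the holder grabs the lock first */
    pthread_create(&b, NULL, waiter, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("done");
    return 0;
}
```

In the kernel the effect is worse than in this demo: the spinning CPU cannot schedule anything else, so once the watchdog interval elapses it emits the "BUG: soft lockup" report shown above.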
| Comments |
| Comment by Jian Yu [ 23/Dec/13 ] |
|
Here is the Maloo query result for the sanityn test 40a timeout failure on the Lustre b2_4 branch: The failure did not occur on previous Lustre b2_4 builds. |
| Comment by Jian Yu [ 23/Dec/13 ] |
|
Just ran the same test with FSTYPE=zfs on Lustre 2.4.2 RC2 ten times. All runs passed, so this appears to be an intermittent failure. |
| Comment by James Nunez (Inactive) [ 29/May/14 ] |
|
I hit this problem while testing a b2_5 patch; lustre-rsync-test test_6 timed out. Logs are at https://maloo.whamcloud.com/test_sets/0712c6e6-e762-11e3-b2f3-52540035b04c Note: this run was on ldiskfs, not ZFS. |
| Comment by Jian Yu [ 05/Jun/14 ] |
|
Another instance on the Lustre b2_5 branch while running lustre-rsync-test test 6 with FSTYPE=ldiskfs: |
| Comment by Jian Yu [ 09/Jun/14 ] |
|
Hi Nasf, it looks like the failure is related to statahead. It originally occurred on the Lustre b2_4 branch with ZFS in sanityn test 40a, and now occurs frequently on the Lustre b2_5 branch with ldiskfs in lustre-rsync-test test 6. Could you please take a look and check whether these two test failures have the same root cause? Thanks. |
| Comment by nasf (Inactive) [ 09/Jun/14 ] |
|
It seems that some thread was blocked while holding the ll_inode_info::lli_sa_lock. There is a known bug in this area. Here is the patch: http://review.whamcloud.com/#/c/9665/ Would you please try the patch? Thanks! |
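For readers following along, the usual remedy for this class of problem is to keep only cheap list/state manipulation under the spinlock and move anything slow outside of it. The sketch below is a generic, hypothetical illustration of that "detach under the lock, process after dropping it" pattern; the names (list_lock, entry_list, process_entry, drain_entries) are invented and it does not reproduce the actual change in the 9665 patch.

```c
/*
 * Generic illustration of the fix pattern: detach the work items while
 * holding the spinlock, then do the slow processing after dropping it,
 * so no other CPU can end up spinning on the lock for tens of seconds.
 * Hypothetical names; not code from the Lustre patch.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct entry {
    struct entry *next;
    int value;
};

static pthread_spinlock_t list_lock;
static struct entry *entry_list;    /* protected by list_lock */

static void process_entry(struct entry *e)
{
    /* Potentially slow work (I/O, allocation, ...) that must not run
     * while list_lock is held. */
    printf("processed %d\n", e->value);
    free(e);
}

static void drain_entries(void)
{
    struct entry *local, *next;

    pthread_spin_lock(&list_lock);
    local = entry_list;             /* grab the whole list ...          */
    entry_list = NULL;              /* ... and empty it under the lock  */
    pthread_spin_unlock(&list_lock);

    for (; local != NULL; local = next) {   /* slow part, lock dropped */
        next = local->next;
        process_entry(local);
    }
}

int main(void)
{
    pthread_spin_init(&list_lock, PTHREAD_PROCESS_PRIVATE);
    for (int i = 0; i < 3; i++) {
        struct entry *e = malloc(sizeof(*e));
        e->value = i;
        e->next = entry_list;
        entry_list = e;
    }
    drain_entries();
    return 0;
}
```

With this structure, the lock is only held for a few pointer updates, so no other thread can spin on it for the duration of the slow processing.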
| Comment by Jian Yu [ 10/Jun/14 ] |
Sure, I'll do this. Thank you! |
| Comment by Jian Yu [ 11/Jun/14 ] |
|
Here is the patch back-ported to the Lustre b2_5 branch: http://review.whamcloud.com/10674 |
| Comment by Di Wang [ 13/Jun/14 ] |
|
Hmm, I saw a similar problem when I ran my patch http://review.whamcloud.com/#/c/10622/ on master: https://maloo.whamcloud.com/test_sets/196f5da8-f2d5-11e3-b88b-52540035b04c Is the patch also needed on master? |
| Comment by Jian Yu [ 17/Jun/14 ] |
The patch was reverted from the Lustre b2_5 branch because we need to wait until the master version is fully ready. |
| Comment by Jian Yu [ 18/Jun/14 ] |
|
Another sanityn test 40a failure instance on the Lustre b2_5 branch: |
| Comment by Nathaniel Clark [ 16/Jul/14 ] |
|
Another lustre-rsync-test test_6 failure on the master branch (review-dne-part-1): |
| Comment by Li Wei (Inactive) [ 06/Aug/14 ] |
|
lustre-rsync-test 6, master, zfs, single MDT: https://testing.hpdd.intel.com/test_sets/67240e86-1cf6-11e4-9a83-5254006e85c2 |
| Comment by nasf (Inactive) [ 07/Aug/14 ] |
|
Another failure instance: https://testing.hpdd.intel.com/test_sessions/6d60501e-1dbb-11e4-8fe8-5254006e85c2 |
| Comment by Jian Yu [ 05/Sep/14 ] |
|
While verifying patch http://review.whamcloud.com/11615 with FSTYPE=zfs on the Lustre b2_5 branch, lustre-rsync-test hit the same failure: |
| Comment by Jian Yu [ 24/Feb/15 ] |
|
Here is the back-ported patch for the Lustre b2_5 branch: http://review.whamcloud.com/13846 |