[LU-10539] Hang in sanity test 27y after test 27wa failure Created: 20/Jan/18  Updated: 11/May/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Recently I added hang detection to my test scripts (forcing a crashdump) after a string of unnoticed hangs.

These hangs started perhaps within the last month.

This is a typical dmesg excerpt I see:

[416495.506504] Lustre: DEBUG MARKER: == sanity test 27w: check /home/green/git/lustre-release/lustre/utils/lfs setstripe -S and getstrip -d options ====================================================================================================== 11:43:55 (1516466635)
[416497.255866] Lustre: DEBUG MARKER: == sanity test 27wa: check /home/green/git/lustre-release/lustre/utils/lfs setstripe -c -i options === 11:43:57 (1516466637)
[416497.534629] Lustre: DEBUG MARKER: sanity test_27wa: @@@@@@ FAIL: stripe offset 1 != 0
[416502.783473] Lustre: DEBUG MARKER: == sanity test 27x: create files while OST0 is degraded ============================================== 11:44:03 (1516466643)
[416514.642209] Lustre: DEBUG MARKER: == sanity test 27y: create files while OST0 is degraded and the rest inactive ======================== 11:44:15 (1516466655)
[416515.530002] Lustre: setting import lustre-OST0001_UUID INACTIVE by administrator request
[416567.482888] Lustre: lustre-OST0001: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 0@lo) in 55 seconds. I think it's dead, and I am evicting it. exp ffff880050d64800, cur 1516466708 expire 1516466678 last 1516466653
[422479.178646] SysRq : Trigger a crash

So this "trigger a crash" is my hang-detecting script action.

I also see somewhat similar crashes in sanityn during final cleanup.

[ 5243.482652] Lustre: DEBUG MARKER: == sanityn test 101c: Discard DoM data on close-unlink =============================================== 05:54:08 (1516445648)
[ 5246.056667] Lustre: DEBUG MARKER: cleanup: ======================================================
[ 5246.824947] Lustre: DEBUG MARKER: == sanityn test complete, duration 3366 sec ========================================================== 05:54:11 (1516445651)
[ 5406.643264] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[ 5406.646379] LustreError: 10927:0:(ldlm_resource.c:1093:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff88029712e800: namespace resource [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 5406.649224] LustreError: 10927:0:(ldlm_resource.c:1093:ldlm_resource_complain()) Skipped 1 previous similar message
[ 5406.650574] LustreError: 10927:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount = 2
[ 5406.653403] LustreError: 10927:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount = 2
[ 5458.369294] Lustre: lustre-OST0001: haven't heard from client 973784fe-361f-e798-f5c2-e30e4191bdb1 (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff8802ab25f800, cur 1516445863 expire 1516445833 last 1516445810
[ 5458.941486] Lustre: lustre-OST0000: haven't heard from client 973784fe-361f-e798-f5c2-e30e4191bdb1 (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff8803204f5800, cur 1516445863 expire 1516445833 last 1516445810
[ 8436.036588] SysRq : Trigger a crash
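
The "setting import ... INACTIVE by administrator request" lines in both excerpts are printed when an import is deactivated administratively, as test 27y does to the OSTs it marks inactive. A minimal sketch of the kind of lctl call that produces this state (the device name below is illustrative; the tests derive the real one from the configuration):

# deactivate the OSC for OST0001 on the MDS, marking its import INACTIVE
lctl --device %lustre-OST0001-osc-MDT0000 deactivate
# ... and restore it afterwards
lctl --device %lustre-OST0001-osc-MDT0000 activate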


 Comments   
Comment by Bob Glossman (Inactive) [ 27/Jan/18 ]

more on master:
https://testing.hpdd.intel.com/test_sets/601a1c42-0385-11e8-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/637b7002-03c1-11e8-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/e5467cea-03f2-11e8-bd00-52540065bddc
https://testing.hpdd.intel.com/test_sets/ceba37a2-0454-11e8-a7cd-52540065bddc

Likely a master-only problem.
Hit repeatedly while testing on el6.9 for LU-10564.
Not seen in similar tests on b2_10.
Not seen in earlier tests on el6.9 for LU-10456, done only a few weeks ago.

Comment by Bob Glossman (Inactive) [ 31/Jan/18 ]

another on master:
https://testing.hpdd.intel.com/test_sets/17b5efb2-06c8-11e8-a10a-52540065bddc

Comment by Jian Yu [ 06/Feb/18 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/f6e6ce10-095a-11e8-a6ad-52540065bddc

Comment by Bob Glossman (Inactive) [ 06/Feb/18 ]

another on master:
https://testing.hpdd.intel.com/test_sets/cbcbd91a-0b70-11e8-bd00-52540065bddc

I'm starting to think this is a 100% failure on el6.9.
I haven't seen a single instance that didn't fail.

Comment by Bob Glossman (Inactive) [ 13/Feb/18 ]

another on master:
https://testing.hpdd.intel.com/test_sets/06087af8-10de-11e8-a10a-52540065bddc

Comment by Bob Glossman (Inactive) [ 21/Feb/18 ]

more on master:
https://testing.hpdd.intel.com/test_sets/4a93087c-16a7-11e8-bd00-52540065bddc
https://testing.hpdd.intel.com/test_sets/061926e6-20d3-11e8-b046-52540065bddc

https://testing.hpdd.intel.com/test_sets/64b8214c-263d-11e8-9e0e-52540065bddc

Comment by Bob Glossman (Inactive) [ 15/Mar/18 ]

more on master:
https://testing.hpdd.intel.com/test_sets/3b6caff0-2878-11e8-b74b-52540065bddc
https://testing.hpdd.intel.com/test_sets/718ab598-293e-11e8-b3c6-52540065bddc

This seems to be a 100% failure on el6.9, but only on master.
Since el6 is officially unsupported on master, and this looks very likely to be a test-only issue rather than a Lustre bug, it may not make sense to pursue this issue.

Comment by Saurabh Tandan (Inactive) [ 11/May/18 ]

A similar hang was seen in interop testing between 2.10.3_132 <-> EE3:
RHEL7.4 2.10 server / RHEL7.3 EE3 client

https://testing.hpdd.intel.com/test_sets/68e72d66-50fa-11e8-abc3-52540065bddc
