[LU-10539] Hang in sanity test 27y after test 27wa failure Created: 20/Jan/18 Updated: 11/May/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Recently I added hang detection to my test scripts (forcing a crashdump) after a string of unnoticed hangs. These started perhaps within the last month. This is a typical dmesg excerpt:

[416495.506504] Lustre: DEBUG MARKER: == sanity test 27w: check /home/green/git/lustre-release/lustre/utils/lfs setstripe -S and getstrip -d options ====================================================================================================== 11:43:55 (1516466635)
[416497.255866] Lustre: DEBUG MARKER: == sanity test 27wa: check /home/green/git/lustre-release/lustre/utils/lfs setstripe -c -i options === 11:43:57 (1516466637)
[416497.534629] Lustre: DEBUG MARKER: sanity test_27wa: @@@@@@ FAIL: stripe offset 1 != 0
[416502.783473] Lustre: DEBUG MARKER: == sanity test 27x: create files while OST0 is degraded ============================================== 11:44:03 (1516466643)
[416514.642209] Lustre: DEBUG MARKER: == sanity test 27y: create files while OST0 is degraded and the rest inactive ======================== 11:44:15 (1516466655)
[416515.530002] Lustre: setting import lustre-OST0001_UUID INACTIVE by administrator request
[416567.482888] Lustre: lustre-OST0001: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 0@lo) in 55 seconds. I think it's dead, and I am evicting it. exp ffff880050d64800, cur 1516466708 expire 1516466678 last 1516466653
[422479.178646] SysRq : Trigger a crash

This "Trigger a crash" is my hang-detecting script's action. I also see somewhat similar crashes in sanityn during final cleanup:

[ 5243.482652] Lustre: DEBUG MARKER: == sanityn test 101c: Discard DoM data on close-unlink =============================================== 05:54:08 (1516445648)
[ 5246.056667] Lustre: DEBUG MARKER: cleanup: ======================================================
[ 5246.824947] Lustre: DEBUG MARKER: == sanityn test complete, duration 3366 sec ========================================================== 05:54:11 (1516445651)
[ 5406.643264] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[ 5406.646379] LustreError: 10927:0:(ldlm_resource.c:1093:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff88029712e800: namespace resource [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 5406.649224] LustreError: 10927:0:(ldlm_resource.c:1093:ldlm_resource_complain()) Skipped 1 previous similar message
[ 5406.650574] LustreError: 10927:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount = 2
[ 5406.653403] LustreError: 10927:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount = 2
[ 5458.369294] Lustre: lustre-OST0001: haven't heard from client 973784fe-361f-e798-f5c2-e30e4191bdb1 (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff8802ab25f800, cur 1516445863 expire 1516445833 last 1516445810
[ 5458.941486] Lustre: lustre-OST0000: haven't heard from client 973784fe-361f-e798-f5c2-e30e4191bdb1 (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff8803204f5800, cur 1516445863 expire 1516445833 last 1516445810
[ 8436.036588] SysRq : Trigger a crash |
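For context, the hang-detection mechanism described above (a script that forces a crashdump when a test stops making progress, producing the "SysRq : Trigger a crash" line) can be sketched roughly as follows. This is a minimal illustrative sketch, not the reporter's actual script; the variable names (TIMEOUT, LOGFILE, DRY_RUN) and the log-mtime heuristic are assumptions. The crash trigger itself (`echo c > /proc/sysrq-trigger`) is the standard Linux SysRq interface that matches the dmesg line.

```shell
#!/bin/sh
# Hypothetical hang watchdog: if the test log has seen no activity for
# TIMEOUT seconds, force a kernel crashdump via the SysRq interface.
# All names here are illustrative, not from the bug report.
TIMEOUT=${TIMEOUT:-3600}            # seconds of silence before we call it a hang
LOGFILE=${LOGFILE:-/tmp/test.log}   # file the test suite appends output to
DRY_RUN=${DRY_RUN:-1}               # 1 = only report; 0 = actually crash the node

# Seconds since the log file was last modified (GNU stat).
age() {
    echo $(( $(date +%s) - $(stat -c %Y "$LOGFILE") ))
}

touch "$LOGFILE"   # ensure the file exists for the demo

if [ "$(age)" -ge "$TIMEOUT" ]; then
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would trigger crash: no log activity for $TIMEOUT s"
    else
        # Produces the 'SysRq : Trigger a crash' line seen in dmesg,
        # panicking the kernel so kdump captures a crashdump.
        echo c > /proc/sysrq-trigger
    fi
else
    echo "log active, no hang detected"
fi
```

In practice such a check would run periodically (e.g. from cron or a background loop) rather than once, and the node must have kdump configured so the forced panic actually yields a usable vmcore.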
| Comments |