Details
- Bug
- Resolution: Unresolved
- Major
- None
- Lustre 2.11.0
- None
- 3
- 9223372036854775807
Description
Recently I added hang detection in my test scripts (forcing a crashdump) after a string of unnoticed hangs.
These started perhaps within the last month.
This is a typical dmesg excerpt I see:
[416495.506504] Lustre: DEBUG MARKER: == sanity test 27w: check /home/green/git/lustre-release/lustre/utils/lfs setstripe -S and getstrip -d options ====================================================================================================== 11:43:55 (1516466635)
[416497.255866] Lustre: DEBUG MARKER: == sanity test 27wa: check /home/green/git/lustre-release/lustre/utils/lfs setstripe -c -i options === 11:43:57 (1516466637)
[416497.534629] Lustre: DEBUG MARKER: sanity test_27wa: @@@@@@ FAIL: stripe offset 1 != 0
[416502.783473] Lustre: DEBUG MARKER: == sanity test 27x: create files while OST0 is degraded ============================================== 11:44:03 (1516466643)
[416514.642209] Lustre: DEBUG MARKER: == sanity test 27y: create files while OST0 is degraded and the rest inactive ======================== 11:44:15 (1516466655)
[416515.530002] Lustre: setting import lustre-OST0001_UUID INACTIVE by administrator request
[416567.482888] Lustre: lustre-OST0001: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 0@lo) in 55 seconds. I think it's dead, and I am evicting it. exp ffff880050d64800, cur 1516466708 expire 1516466678 last 1516466653
[422479.178646] SysRq : Trigger a crash
The final "Trigger a crash" entry is the action taken by my hang-detection script.
I also see somewhat similar crashes in sanityn on final cleanup:
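The hang-detection script itself is not shown in this ticket. As a rough illustration of the kind of check described, the Python sketch below treats the newest "DEBUG MARKER" line in dmesg as the last sign of test progress and reports a hang when it is older than a timeout. The function names, the choice of DEBUG MARKER as the progress signal, and the timeout are all assumptions made for illustration, not the reporter's actual script. Writing "c" to /proc/sysrq-trigger is the standard kernel mechanism that produces the "SysRq : Trigger a crash" message; it needs root and will crash the machine.

```python
import re

# Matches dmesg lines such as:
# [416514.642209] Lustre: DEBUG MARKER: == sanity test 27y: ...
MARKER_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+Lustre: DEBUG MARKER: ")


def last_marker_time(dmesg_lines):
    """Return the kernel timestamp (seconds) of the newest DEBUG MARKER line, or None."""
    latest = None
    for line in dmesg_lines:
        match = MARKER_RE.match(line)
        if match:
            latest = float(match.group(1))
    return latest


def is_hung(dmesg_lines, uptime_seconds, timeout_seconds):
    """True if the newest DEBUG MARKER is older than timeout_seconds.

    Assumption: dmesg timestamps and uptime are both seconds since boot,
    which is close enough for a coarse watchdog.
    """
    last = last_marker_time(dmesg_lines)
    if last is None:
        return False  # no test has started yet, nothing to judge
    return uptime_seconds - last > timeout_seconds


def read_uptime(path="/proc/uptime"):
    """Read the system uptime in seconds from /proc/uptime."""
    with open(path) as f:
        return float(f.read().split()[0])


def force_crashdump(path="/proc/sysrq-trigger"):
    """Ask the kernel to crash immediately (requires root). Destructive."""
    with open(path, "w") as f:
        f.write("c")
```

A periodic job could feed the output of dmesg, split into lines, to `is_hung` together with `read_uptime()` and call `force_crashdump()` only when it returns True. The timeout has to exceed the longest legitimate test.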
[ 5243.482652] Lustre: DEBUG MARKER: == sanityn test 101c: Discard DoM data on close-unlink =============================================== 05:54:08 (1516445648)
[ 5246.056667] Lustre: DEBUG MARKER: cleanup: ======================================================
[ 5246.824947] Lustre: DEBUG MARKER: == sanityn test complete, duration 3366 sec ========================================================== 05:54:11 (1516445651)
[ 5406.643264] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[ 5406.646379] LustreError: 10927:0:(ldlm_resource.c:1093:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff88029712e800: namespace resource [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 5406.649224] LustreError: 10927:0:(ldlm_resource.c:1093:ldlm_resource_complain()) Skipped 1 previous similar message
[ 5406.650574] LustreError: 10927:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount = 2
[ 5406.653403] LustreError: 10927:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount = 2
[ 5458.369294] Lustre: lustre-OST0001: haven't heard from client 973784fe-361f-e798-f5c2-e30e4191bdb1 (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff8802ab25f800, cur 1516445863 expire 1516445833 last 1516445810
[ 8436.036588] SysRq : Trigger a crash
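To gauge how long the node sat silent before the forced crash, the gap between the "SysRq : Trigger a crash" entry and the entry just before it can be computed from the kernel timestamps. The following Python sketch does that for a pasted dmesg excerpt, whether the entries are on separate lines or run together; `silence_before_crash` is a hypothetical helper name, not part of any Lustre tooling.

```python
import re

# One dmesg entry: "[ seconds.micros] message", ending where the next
# bracketed timestamp begins or at the end of the text.
ENTRY_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(.*?)(?=\s*\[\s*\d+\.\d+\]\s|\Z)",
    re.DOTALL,
)


def silence_before_crash(dmesg_text):
    """Return seconds between the SysRq crash entry and the preceding entry.

    Returns None if there is no crash entry or nothing precedes it.
    """
    entries = [(float(ts), msg) for ts, msg in ENTRY_RE.findall(dmesg_text)]
    for i, (ts, msg) in enumerate(entries):
        if msg.startswith("SysRq : Trigger a crash") and i > 0:
            return ts - entries[i - 1][0]
    return None
```

Applied to the two excerpts above, the gaps come out at roughly 5912 seconds for the sanity run and roughly 2977 seconds for the sanityn run, with no kernel messages logged in between.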