  1. Lustre
  2. LU-10539

Hang in sanity test 27y after test 27wa failure

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.11.0
    • None
    • 3

    Description

      Recently I added hang detection in my test scripts (forcing a crashdump) after a string of unnoticed hangs.

      These started perhaps within the last month.

      This is a typical dmesg excerpt I see:

      [416495.506504] Lustre: DEBUG MARKER: == sanity test 27w: check /home/green/git/lustre-release/lustre/utils/lfs setstripe -S and getstrip -d options ====================================================================================================== 11:43:55 (1516466635)
      [416497.255866] Lustre: DEBUG MARKER: == sanity test 27wa: check /home/green/git/lustre-release/lustre/utils/lfs setstripe -c -i options === 11:43:57 (1516466637)
      [416497.534629] Lustre: DEBUG MARKER: sanity test_27wa: @@@@@@ FAIL: stripe offset 1 != 0
      [416502.783473] Lustre: DEBUG MARKER: == sanity test 27x: create files while OST0 is degraded ============================================== 11:44:03 (1516466643)
      [416514.642209] Lustre: DEBUG MARKER: == sanity test 27y: create files while OST0 is degraded and the rest inactive ======================== 11:44:15 (1516466655)
      [416515.530002] Lustre: setting import lustre-OST0001_UUID INACTIVE by administrator request
      [416567.482888] Lustre: lustre-OST0001: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 0@lo) in 55 seconds. I think it's dead, and I am evicting it. exp ffff880050d64800, cur 1516466708 expire 1516466678 last 1516466653
      [422479.178646] SysRq : Trigger a crash
      

      The "Trigger a crash" line is my hang-detection script taking action.
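
      For readers who want to reproduce the setup, here is a minimal sketch of such a hang detector. It is not the actual script used here. The function names, the timeout, and the progress callback are assumptions for illustration. The only real interface it relies on is the Linux SysRq trigger file: writing "c" to /proc/sysrq-trigger crashes the kernel, which yields a crashdump only if kdump or a similar mechanism is configured.

```python
import time
from typing import Callable

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # standard Linux SysRq interface


def trigger_crash(path: str = SYSRQ_TRIGGER) -> None:
    """Write 'c' to the SysRq trigger file; this crashes the kernel (needs root)."""
    with open(path, "w") as f:
        f.write("c")


def watch(last_progress: Callable[[], float],
          timeout: float,
          on_hang: Callable[[], None] = trigger_crash,
          now: Callable[[], float] = time.time,
          poll: float = 10.0,
          sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until no progress has been seen for more than `timeout` seconds.

    `last_progress` (hypothetical callback) returns the timestamp of the
    most recent sign of life from the test run. When the gap exceeds
    `timeout`, `on_hang` is called once and the watcher returns.
    """
    while True:
        if now() - last_progress() > timeout:
            on_hang()
            return
        sleep(poll)
```

      One possible progress signal is the modification time of the test log, for example last_progress=lambda: os.path.getmtime(log_path), where log_path is whatever file the test run appends to. The clock, sleep, and on_hang parameters are injectable so the logic can be exercised without crashing a machine.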

      I also have somewhat similar crashes in sanityn on final cleanup.

      [ 5243.482652] Lustre: DEBUG MARKER: == sanityn test 101c: Discard DoM data on close-unlink =============================================== 05:54:08 (1516445648)
      [ 5246.056667] Lustre: DEBUG MARKER: cleanup: ======================================================
      [ 5246.824947] Lustre: DEBUG MARKER: == sanityn test complete, duration 3366 sec ========================================================== 05:54:11 (1516445651)
      [ 5406.643264] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
      [ 5406.646379] LustreError: 10927:0:(ldlm_resource.c:1093:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff88029712e800: namespace resource [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 5406.649224] LustreError: 10927:0:(ldlm_resource.c:1093:ldlm_resource_complain()) Skipped 1 previous similar message
      [ 5406.650574] LustreError: 10927:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount = 2
      [ 5406.653403] LustreError: 10927:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0x200000401:0x61:0x0].0x0 (ffff8802a3125e80) refcount = 2
      [ 5458.369294] Lustre: lustre-OST0001: haven't heard from client 973784fe-361f-e798-f5c2-e30e4191bdb1 (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff8802ab25f800, cur 1516445863 expire 1516445833 last 1516445810
      [ 5458.941486] Lustre: lustre-OST0000: haven't heard from client 973784fe-361f-e798-f5c2-e30e4191bdb1 (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff8803204f5800, cur 1516445863 expire 1516445833 last 1516445810
      [ 8436.036588] SysRq : Trigger a crash
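
      The eviction messages in both excerpts carry three epoch timestamps (cur, expire, last). A small parser makes the timing easier to compare across runs. This is a sketch that assumes the message format stays exactly as shown in the excerpts above. The regular expression and the returned field names are my own. In these logs the reported silence equals cur minus last, and cur minus expire is 30 seconds in every eviction.

```python
import re

# Matches the "haven't heard from client" eviction lines quoted above.
EVICT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] Lustre: (?P<target>\S+): haven't heard from client "
    r"(?P<client>\S+) \(at (?P<nid>\S+)\) in (?P<secs>\d+) seconds\..*"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def parse_eviction(line: str):
    """Return timing details for an eviction line, or None if it does not match."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    cur, expire, last = (int(m[k]) for k in ("cur", "expire", "last"))
    return {
        "target": m["target"],
        "client": m["client"],
        "reported_silence": int(m["secs"]),  # seconds, as printed in the message
        "cur_minus_last": cur - last,        # recomputed from the timestamps
        "cur_minus_expire": cur - expire,    # how far past expiry the export was
    }
```

      Applied to the sanity 27y excerpt, this gives target lustre-OST0001, client lustre-MDT0000-mdtlov_UUID, a silence of 55 seconds, and 30 seconds past expiry.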
      

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Oleg Drokin (green)