Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-20003

recovery-small.sh test_131 FAIL: dd succeeded

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Medium
    • Lustre 2.18.0
    • Lustre 2.17.0
    • 3
    • 9223372036854775807

    Description

      It appears recovery-small.sh test_131 is failing regularly in Janitor testing:
      https://testing.whamcloud.com/gerrit-janitor/61262/results.html reports 50 failures in the past month, for example:
      https://testing.whamcloud.com/gerrit-janitor/61262/testresults/recovery-small-ldiskfs-DNE-rocky8.10_x86_64-rocky8.10_x86_64-retry1/

      It appears that what is causing the problem is that the "dd sleep 4" fail_loc=0x80000414 on the client is waking up before the server fail_loc=0x8000031e is set, so the client completes the "dd" successfully before it is evicted.

      It is a bit hard to tell, because the timestamps in the client and server dmesg logs are relative to boot and not absolute times shown by dmesg -T.

      The client logs show the dd thread waking up at 6113.67, and using the DEBUG MARKER line for reference this works out to @1773674221.67 +/- 1s:

      [ 6108.477238] Lustre: DEBUG MARKER: == recovery-small test 131: IO vs evict results to IO under staled lock === 11:16:56 (1773674216)
      [ 6109.579798] LustreError: 94095:0:(osc_request.c:2983:osc_build_rpc()) cfs_fail_timeout id 414 sleeping for 4000ms
      [ 6113.672214] LustreError: 94095:0:(osc_request.c:2983:osc_build_rpc()) cfs_fail_timeout id 414 awake
      [ 6114.881162] LustreError: lustre-OST0000-osc-ffff98dc07f29800: operation ost_statfs to node 192.168.201.103@tcp failed: rc = -107
      
      $ echo $((1773674216 - 6108 + 6113))
      1773674221
      

      Checking the server logs shows the eviction is triggered at 6116.64, and using the DEBUG MARKER timestamp this works out to @1773674222.64 +/-1s.

      [ 6110.836160] Lustre: DEBUG MARKER: == recovery-small test 131: IO vs evict results to IO under staled lock === 11:16:56 (1773674216)
      [ 6116.644026] Lustre: 85956:0:(genops.c:1793:obd_export_evict_by_uuid()) lustre-OST0000: evicting 50b6f219-0167-4a18-8ed7-c655d95343ea at adminstrative request
      [ 6116.677331] LustreError: 13152:0:(ldlm_lockd.c:2933:ldlm_bl_thread_exports()) cfs_fail_timeout id 31e sleeping for 4000ms
      [ 6120.377949] Lustre: DEBUG MARKER: recovery-small test_131: @@@@@@ FAIL: dd succeeded
      [ 6120.746865] LustreError: 13152:0:(ldlm_lockd.c:2933:ldlm_bl_thread_exports()) cfs_fail_timeout id 31e awake
      
      $ echo $((1773674216 - 6110 + 6116))
      1773674222
      

      Attachments

        Activity

          People

            adilger Andreas Dilger
            adilger Andreas Dilger
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: