Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-463

orphan recovery happens too late, causing writes to fail with ENOENT after recovery

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.1.0, (10)
      Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6
    • None
    • 3
    • 22,777
    • 5680

    Description

      While running recovery-mds-scale with FLAVOR=OSS, it failed as follows after running 3 hours:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost5 has failed over 5 times, and counting...
      sleeping 246 seconds ... 
      tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-21-ib
      Client load failed on node client-21-ib
      
      client client-21-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
      2011-06-26 08:08:03 Terminating clients loads ...
      Duration:                86400
      Server failover period: 600 seconds
      Exited after:           13565 seconds
      Number of failovers before exit:
      mds: 0 times
      ost1: 2 times
      ost2: 6 times
      ost3: 3 times
      ost4: 4 times
      ost5: 5 times
      ost6: 3 times
      Status: FAIL: rc=1
      

      Syslog on client node client-21-ib showed that:

      Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the MDS node client-10-ib showed that:

      Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the OSS node fat-amd-1-ib showed that:

      Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messagesJun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messagesJun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messagesJun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clientsJun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af

      Please find the debug logs in the attachment.

      This is a known issue: bug 22777

      Attachments

        Issue Links

          Activity

            [LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery
            adilger Andreas Dilger made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
            adilger Andreas Dilger made changes -
            Link New: This issue duplicates LU-6200 [ LU-6200 ]
            hongchao.zhang Hongchao Zhang added a comment - - edited

            Yes, this should a duplicate of LU-6200, which contains a patch to fix the problem in a similar way.

            hongchao.zhang Hongchao Zhang added a comment - - edited Yes, this should a duplicate of LU-6200 , which contains a patch to fix the problem in a similar way.
            yujian Jian Yu added a comment -

            Hi Andreas,

            The failure is still blocking recovery-mds-scale failover_ost testing on all of the Lustre branches. LU-4621 and LU-6200 were created later to track the failures.

            yujian Jian Yu added a comment - Hi Andreas, The failure is still blocking recovery-mds-scale failover_ost testing on all of the Lustre branches. LU-4621 and LU-6200 were created later to track the failures.
            adilger Andreas Dilger made changes -
            Resolution Original: Incomplete [ 4 ]
            Status Original: Resolved [ 5 ] New: Reopened [ 4 ]

            I noticed that the patch http://review.whamcloud.com/4868 (against b2_1) was never landed to any branch. Was this bug fixed in some other way (in which case it can be closed again) or are we just ignoring these test failures now?

            adilger Andreas Dilger added a comment - I noticed that the patch http://review.whamcloud.com/4868 (against b2_1) was never landed to any branch. Was this bug fixed in some other way (in which case it can be closed again) or are we just ignoring these test failures now?
            jfc John Fuchs-Chesney (Inactive) made changes -
            Resolution New: Incomplete [ 4 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            Incomplete and out of date.
            ~ jfc.

            jfc John Fuchs-Chesney (Inactive) added a comment - Incomplete and out of date. ~ jfc.
            hongchao.zhang Hongchao Zhang added a comment - the patch is updated http://review.hpdd.intel.com/#/c/4868/
            yujian Jian Yu made changes -
            Affects Version/s New: Lustre 2.1.6 [ 10292 ]

            People

              hongchao.zhang Hongchao Zhang
              yujian Jian Yu
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: