Lustre / LU-463

orphan recovery happens too late, causing writes to fail with ENOENT after recovery

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6
    • None
    • 3
    • 22,777
    • 5680

    Description

      While running recovery-mds-scale with FLAVOR=OSS, the test failed as follows after running for about 3 hours:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost5 has failed over 5 times, and counting...
      sleeping 246 seconds ... 
      tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-21-ib
      Client load failed on node client-21-ib
      
      client client-21-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
      2011-06-26 08:08:03 Terminating clients loads ...
      Duration:                86400
      Server failover period: 600 seconds
      Exited after:           13565 seconds
      Number of failovers before exit:
      mds: 0 times
      ost1: 2 times
      ost2: 6 times
      ost3: 3 times
      ost4: 4 times
      ost5: 5 times
      ost6: 3 times
      Status: FAIL: rc=1
      

      Syslog on the client node client-21-ib showed:

      Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the MDS node client-10-ib showed:

      Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the OSS node fat-amd-1-ib showed:

      Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messages
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clients
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib
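
      The replay errors above and in the client/MDS syslogs complete with rc -2, which on Linux is -ENOENT ("No such file or directory"), the same error the tar client load hit. The one-liner below is only an illustrative way to decode the errno; it is not part of the test suite:

      # Decode errno 2 on any Linux node (illustrative only)
      python -c 'import os; print(os.strerror(2))'
      # -> No such file or directory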
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af

      Please find the debug logs in the attachment.

      This is a known issue: bug 22777
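
      For reference, the scenario above is normally driven straight from lustre/tests. The sketch below is only an assumption pieced together from the parameters visible in this ticket (FLAVOR=OSS, the 86400-second duration and 600-second failover period in the summary, and the ENABLE_QUOTA/FAILURE_MODE settings quoted in the comments); the DURATION and SERVER_FAILOVER_PERIOD variable names are inferred from the printed summary, and the exact invocation may differ between branches:

      # Hedged sketch of a recovery-mds-scale failover_ost run; variable names
      # not quoted elsewhere in this ticket are assumptions.
      cd lustre/tests
      FLAVOR=OSS DURATION=86400 SERVER_FAILOVER_PERIOD=600 \
      ENABLE_QUOTA=yes FAILURE_MODE=HARD \
      sh recovery-mds-scale.sh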

    Attachments

    Issue Links

    Activity

            [LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery
            hongchao.zhang Hongchao Zhang added a comment - - edited

            Yes, this should be a duplicate of LU-6200, which contains a patch to fix the problem in a similar way.

            yujian Jian Yu added a comment -

            Hi Andreas,

            The failure is still blocking recovery-mds-scale failover_ost testing on all of the Lustre branches. LU-4621 and LU-6200 were created later to track the failures.


            adilger Andreas Dilger added a comment -

            I noticed that the patch http://review.whamcloud.com/4868 (against b2_1) was never landed to any branch. Was this bug fixed in some other way (in which case it can be closed again) or are we just ignoring these test failures now?

            jfc John Fuchs-Chesney (Inactive) added a comment -

            Incomplete and out of date.
            ~ jfc.
            hongchao.zhang Hongchao Zhang added a comment -

            The patch is updated: http://review.hpdd.intel.com/#/c/4868/
            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_6_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/208
            Distro/Arch: RHEL6.4/x86_64
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The issue occurred again while running recovery-mds-scale failover_ost test:
            https://maloo.whamcloud.com/test_sets/bca04cd4-cdf2-11e2-ba28-52540035b04c

            yujian Jian Yu added a comment - - edited

            Lustre Branch: b2_1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/189/

            The same issue occurred: https://maloo.whamcloud.com/test_sets/278684ba-902b-11e2-9b28-52540035b04c

            Dmesg on the OSS node showed:

            Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
            cannot allocate a tage (271)
            cannot allocate a tage (271)
            Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost6 has failed over 5 times, and counting...
            Lustre: DEBUG MARKER: ost6 has failed over 5 times, and counting...
            Lustre: lustre-OST0006: sending delayed replies to recovered clients
            LustreError: 3691:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 14662: rc -2
            Lustre: lustre-OST0006: received MDS connection from 10.10.4.190@tcp
            __ratelimit: 12 callbacks suppressed
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            __ratelimit: 8 callbacks suppressed
            
            yujian Jian Yu added a comment - - edited

            Lustre Tag: v1_8_9_WC1_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/258
            Distro/Arch: RHEL5.9/x86_64(server), RHEL6.3/x86_64(client)
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred: https://maloo.whamcloud.com/test_sets/79cd620e-7af3-11e2-b916-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v1_8_9_WC1_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
            Distro/Arch: RHEL5.9/x86_64
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred: https://maloo.whamcloud.com/test_sets/d7e82752-79db-11e2-8fd2-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Branch: b1_8
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/251/

            The same issue occurred: https://maloo.whamcloud.com/test_sets/9734baba-661f-11e2-a42b-52540035b04c

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: