[LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • Affects Versions: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6
    • None
    • 3
    • 22,777
    • 5680

    Description

      While running recovery-mds-scale with FLAVOR=OSS, the test failed as follows after running for about 3 hours:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost5 has failed over 5 times, and counting...
      sleeping 246 seconds ... 
      tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-21-ib
      Client load failed on node client-21-ib
      
      client client-21-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
      2011-06-26 08:08:03 Terminating clients loads ...
      Duration:                86400
      Server failover period: 600 seconds
      Exited after:           13565 seconds
      Number of failovers before exit:
      mds: 0 times
      ost1: 2 times
      ost2: 6 times
      ost3: 3 times
      ost4: 4 times
      ost5: 5 times
      ost6: 3 times
      Status: FAIL: rc=1
      

      Syslog on client node client-21-ib showed that:

      Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the MDS node client-10-ib showed that:

      Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the OSS node fat-amd-1-ib showed that:

      Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messages
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clients
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af

      Please find the debug logs in the attachment.

      This is a known issue: bug 22777

      Attachments

        Issue Links

          Activity

            [LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery
            yujian Jian Yu added a comment - edited

            Lustre Tag: v1_8_9_WC1_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/258
            Distro/Arch: RHEL5.9/x86_64(server), RHEL6.3/x86_64(client)
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred: https://maloo.whamcloud.com/test_sets/79cd620e-7af3-11e2-b916-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v1_8_9_WC1_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
            Distro/Arch: RHEL5.9/x86_64
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred: https://maloo.whamcloud.com/test_sets/d7e82752-79db-11e2-8fd2-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Branch: b1_8
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/251/

            The same issue occurred: https://maloo.whamcloud.com/test_sets/9734baba-661f-11e2-a42b-52540035b04c

            hongchao.zhang Hongchao Zhang added a comment -

            The patch is updated: since precreation RPCs are not replayable, it will try to create the missing object if it was not created during recovery. It will either wait for the creation to complete or return -EINPROGRESS if OBD_CONNECT_EINPROGRESS is set in obd_export->exp_connect_flags.
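
            A minimal sketch of that logic, in plain C rather than the actual patch: recreate_object() and wait_for_creation() are hypothetical stand-ins, and the flag value below is illustrative, not the real OBD_CONNECT_EINPROGRESS bit.

                #include <errno.h>

                /* illustrative bit value only; the real flag lives in the Lustre headers */
                #define OBD_CONNECT_EINPROGRESS 0x1ULL

                /* hypothetical stand-ins for the real server-side calls */
                int recreate_object(unsigned long long objid);    /* 0, or -EINPROGRESS while creating */
                int wait_for_creation(unsigned long long objid);  /* blocks until the object exists */

                /*
                 * Precreate RPCs are not replayed, so a replayed write may reference
                 * an object that does not exist yet.  Recreate it; if the client
                 * advertised OBD_CONNECT_EINPROGRESS in its connect flags, hand the
                 * wait back to the client, otherwise block until the object is there.
                 */
                int handle_missing_object(unsigned long long objid,
                                          unsigned long long exp_connect_flags)
                {
                        int rc = recreate_object(objid);

                        if (rc == -EINPROGRESS &&
                            !(exp_connect_flags & OBD_CONNECT_EINPROGRESS))
                                rc = wait_for_creation(objid);

                        return rc;
                }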

            hongchao.zhang Hongchao Zhang added a comment -

            The patch is tracked at http://review.whamcloud.com/#change,4868

            hongchao.zhang Hongchao Zhang added a comment -

            The patch against b2_1 is being created and tested.

            yujian Jian Yu added a comment -

            This has been blocking the recovery-mds-scale failover_ost test.


            hongchao.zhang Hongchao Zhang added a comment -

            How about fixing the bug by waiting for some time when -2 (ENOENT) is encountered on an OST that is still in recovery mode? I will produce a patch along these lines.
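
            A minimal sketch of that wait-and-retry idea, in plain C; ost_write() and ost_in_recovery() are hypothetical stand-ins for whatever the client code actually calls, not real Lustre functions.

                #include <errno.h>
                #include <unistd.h>

                /* hypothetical stand-ins for the real client-side calls */
                int ost_write(unsigned long long objid);   /* returns 0 or a negative errno */
                int ost_in_recovery(void);                 /* non-zero while the OST is recovering */

                /*
                 * If a write fails with -ENOENT while the target OST is still in
                 * recovery, the object may simply not have been recreated yet, so
                 * back off briefly and retry instead of failing right away.
                 */
                int write_with_recovery_wait(unsigned long long objid)
                {
                        int rc = ost_write(objid);

                        while (rc == -ENOENT && ost_in_recovery()) {
                                sleep(1);   /* give orphan recovery time to recreate the object */
                                rc = ost_write(objid);
                        }
                        return rc;
                }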

            pjones Peter Jones added a comment -

            Hongchao

            Could you please look into this one?

            Thanks

            Peter

            yujian Jian Yu added a comment -

            Another instance: https://maloo.whamcloud.com/test_sets/f99459d2-eb26-11e1-b137-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_3_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/113/
            Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.2.1.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The issue occurred again while running recovery-mds-scale failover_ost test:
            https://maloo.whamcloud.com/test_sets/b18a1330-e5ad-11e1-ae4e-52540035b04c


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 15
