[LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6
    • Labels: None
    • Severity: 3
    • Bugzilla ID: 22,777
    • Rank (Obsolete): 5680

    Description

      While running recovery-mds-scale with FLAVOR=OSS, the test failed as follows after running for 3 hours:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost5 has failed over 5 times, and counting...
      sleeping 246 seconds ... 
      tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-21-ib
      Client load failed on node client-21-ib
      
      client client-21-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
      2011-06-26 08:08:03 Terminating clients loads ...
      Duration:                86400
      Server failover period: 600 seconds
      Exited after:           13565 seconds
      Number of failovers before exit:
      mds: 0 times
      ost1: 2 times
      ost2: 6 times
      ost3: 3 times
      ost4: 4 times
      ost5: 5 times
      ost6: 3 times
      Status: FAIL: rc=1
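
      For context, a run like the one summarized above is typically started from the lustre/tests directory of the build under test, roughly as sketched below. FLAVOR=OSS, the 86400-second duration and the 600-second failover period come from the summary; the exact script invocation and the DURATION/SERVER_FAILOVER_PERIOD variable names are assumptions and may differ between branches.

      # Hedged sketch of the test invocation -- not a verbatim reproduction recipe.
      # FLAVOR=OSS selects OST failover; DURATION and SERVER_FAILOVER_PERIOD are assumed
      # names matching the "Duration" and "Server failover period" summary fields above.
      cd /usr/lib64/lustre/tests    # assumed install location of the test suite
      FLAVOR=OSS DURATION=86400 SERVER_FAILOVER_PERIOD=600 bash recovery-mds-scale.sh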
      

      Syslog on client node client-21-ib showed that:

      Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the MDS node client-10-ib showed that:

      Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the OSS node fat-amd-1-ib showed that:

      Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messages
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clients
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af

      Please find the debug logs in the attachment.

      This is a known issue: bug 22777
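
      As a note on the error path: the "status -2" in the ptlrpc_replay_interpret() messages above is the negative errno for ENOENT, which is what the tar load later reports as "Cannot stat/write: No such file or directory". A quick, Lustre-independent check of that errno mapping from a shell:

      # RPC status values in the kernel logs are negative errnos; -2 is -ENOENT
      python3 -c 'import errno, os; print(errno.errorcode[2], "->", os.strerror(2))'
      # prints: ENOENT -> No such file or directory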

      Attachments

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Hongchao

            Could you please look into this one?

            Thanks

            Peter

            yujian Jian Yu added a comment -

            Another instance: https://maloo.whamcloud.com/test_sets/f99459d2-eb26-11e1-b137-52540035b04c
            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_3_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/113/
            Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.2.1.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The issue occurred again while running recovery-mds-scale failover_ost test:
            https://maloo.whamcloud.com/test_sets/b18a1330-e5ad-11e1-ae4e-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_2_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/86/
            Distro/Arch: RHEL6.2/x86_64
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/c9193e08-abca-11e1-9b8f-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v1_8_8_WC1_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/195/
            Distro/Arch: RHEL5.8/x86_64(server), RHEL6.2/x86_64(client)
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/be9c60e0-9e82-11e1-9080-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_2_0_0_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_2/17/
            Distro/Arch: SLES11SP1/x86_64(client), RHEL6.2/x86_64(server)
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/b6eb20c8-799f-11e1-9d2a-5254004bbbd3

            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_1_0_RC4
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/44/
            e2fsprogs Build: http://build.whamcloud.com/job/e2fsprogs-master/217/
            Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-220.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            Configuration:

            MGS/MDS Nodes: client-8-ib
            
            OSS Nodes: client-18-ib(active), client-19-ib(active)
                                          \ /
                                          OST1 (active in client-18-ib)
                                          OST2 (active in client-19-ib)
                                          OST3 (active in client-18-ib)
                                          OST4 (active in client-19-ib)
                                          OST5 (active in client-18-ib)
                                          OST6 (active in client-19-ib)
                       client-9-ib(OST7)
            
            Client Nodes: client-[1,4,17],fat-amd-2,fat-intel-2
            
            Network Addresses:
            client-1-ib: 192.168.4.1
            client-4-ib: 192.168.4.4
            client-8-ib: 192.168.4.8
            client-9-ib: 192.168.4.9
            client-17-ib: 192.168.4.17
            client-18-ib: 192.168.4.18
            client-19-ib: 192.168.4.19
            fat-amd-2-ib: 192.168.4.133
            fat-intel-2-ib: 192.168.4.129
            

            While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

            ==== Checking the clients loads AFTER  failover -- failure NOT OK
            ost1 has failed over 1 times, and counting...
            sleeping 717 seconds ...
            tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
            client-1-ib
            Client load failed on node client-1-ib
            
            client client-1-ib load stdout and debug files :
                          /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
                          /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:

            tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug

            <~snip~>
            2012-02-22 03:56:04: tar run starting
            + mkdir -p /mnt/lustre/d0.tar-client-1-ib
            + cd /mnt/lustre/d0.tar-client-1-ib
            + wait 11196
            + do_tar
            + tar cf - /etc
            + tar xf -
            + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
            tar: Removing leading `/' from member names
            + return 2
            + RC=2
            ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
            + PREV_ERRORS=
            + true
            + '[' 2 -ne 0 -a '' -a '' ']'
            + '[' 2 -eq 0 ']'
            ++ date '+%F %H:%M:%S'
            + echoerr '2012-02-22 03:59:25: tar failed'
            + echo '2012-02-22 03:59:25: tar failed'
            2012-02-22 03:59:25: tar failed
            <~snip~>
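
            The trace above implies a client load loop roughly like the sketch below. This is not the actual run_tar.sh from lustre/tests, only a reconstruction from the trace; LOG and echoerr stand in for the script's own helpers, and the loop/exit handling is an assumption.

            LOG=/tmp/recovery-mds-scale.log_run_tar.sh-$(hostname)
            echoerr() { echo "$@" >&2; }

            do_tar() {
                # stream a copy of /etc into the Lustre test directory, logging tar's output
                tar cf - /etc | tar xf - 2>&1 | tee "$LOG"
                return ${PIPESTATUS[1]}    # exit status of the extracting tar
            }

            mkdir -p /mnt/lustre/d0.tar-$(hostname)
            cd /mnt/lustre/d0.tar-$(hostname) || exit 1
            while true; do
                do_tar
                RC=$?
                PREV_ERRORS=$(grep 'exit delayed from previous errors' "$LOG")
                # "exit delayed from previous errors" alone is tolerated; any other
                # non-zero rc (here rc=2 on the ENOENT) ends the load, after which the
                # harness notices the failure via END_RUN_FILE and aborts the whole run
                if [ "$RC" -ne 0 ] && [ -z "$PREV_ERRORS" ]; then
                    echoerr "$(date '+%F %H:%M:%S'): tar failed"
                    break
                fi
            done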
            

            Syslog on client node client-1-ib showed that:

            Feb 22 03:59:12 client-1 kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
            Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031d605c00 x1394513519058221/t379(379) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages
            Feb 22 03:59:19 client-1 kernel: Lustre: lustre-OST0004-osc-ffff88032c89a400: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib.
            

            Syslog on MDS node client-8-ib showed that:

            Feb 22 03:59:12 client-8-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
            Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88030708c400 x1394513506470444/t380(380) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages
            Feb 22 03:59:19 client-8-ib kernel: Lustre: lustre-OST0004-osc-MDT0000: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib.
            Feb 22 03:59:19 client-8-ib kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans
            Feb 22 03:59:19 client-8-ib kernel: Lustre: 7395:0:(quota_master.c:1760:mds_quota_recovery()) Only 3/7 OSTs are active, abort quota recovery
            

            Syslog on OSS node client-19-ib showed that:

            Feb 22 03:59:12 client-19-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
            Feb 22 03:59:18 client-19-ib kernel: Lustre: 7501:0:(filter.c:2697:filter_connect_internal()) lustre-OST0004: Received MDS connection for group 0
            Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy())  lustre-OST0004: can not find olg of group 0
            Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy()) Skipped 22 previous similar messages
            Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: sending delayed replies to recovered clients
            Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: received MDS connection from 192.168.4.8@o2ib
            Feb 22 03:59:19 client-19-ib kernel: Lustre: 7530:0:(filter.c:2553:filter_llog_connect()) lustre-OST0004: Recovery from log 0xff506/0x0:8f36a744
            

            Please refer to /scratch/logs/2.1.1/recovery-oss-scale.1329912676.log.tar.bz2 on brent node for debug and other logs.

            yujian Jian Yu added a comment -

            Lustre Tag: v1_8_7_WC1_RC1
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-b1_8/142/
            e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/65/
            Distro/Arch: RHEL5/x86_64(server, OFED 1.5.3.2, ext4-based ldiskfs), RHEL6/x86_64(client, in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            recovery-mds-scale (FLAVOR=OSS) test failed with the same issue: https://maloo.whamcloud.com/test_sets/004f464c-f550-11e0-908b-52540025f9af

            Please refer to the attached recovery-oss-scale.1318474116.log.tar.bz2 for more logs.

            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_0_0_RC2
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/283/
            e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/54/
            Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            After running for about 2 hours (the OSS had failed over 7 times), the recovery-mds-scale (FLAVOR=OSS) test failed with the same issue:
            https://maloo.whamcloud.com/test_sets/50e83640-e362-11e0-9909-52540025f9af

            Please refer to the attached recovery-oss-scale.1316503823.log.tar.bz2 for more logs.

            yujian Jian Yu added a comment -

            Lustre Branch: master
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/276/
            e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/54/
            Distro/Arch: RHEL5/x86_64(in-kernel OFED, kernel version: 2.6.18-238.19.1.el5)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            recovery-mds-scale (FLAVOR=OSS) failed with the same issue: https://maloo.whamcloud.com/test_sets/d0a83c78-da97-11e0-8d02-52540025f9af

            Please refer to the attached recovery-oss-scale.1315539020.log.tar.bz2 for more logs.

            yujian Jian Yu added a comment -

            Lustre Tag: v2_0_65_0
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/204/
            e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
            Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.2.1.el6)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            Lustre cluster configuration:

            MGS/MDS Node: client-7-ib
            OSS Nodes:    fat-amd-1-ib(active), fat-amd-2-ib(active)
                                             \  /
                                             OST1 (active in fat-amd-1-ib)
                                             OST2 (active in fat-amd-2-ib)
                                             OST3 (active in fat-amd-1-ib)
                                             OST4 (active in fat-amd-2-ib)
                                             OST5 (active in fat-amd-1-ib)
                                             OST6 (active in fat-amd-2-ib)
                          client-8-ib (OST7)
            Client Nodes: fat-amd-3-ib, client-[9,11,12,13]-ib
            

            While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

            ==== Checking the clients loads AFTER  failover -- failure NOT OK
            ost5 has failed over 1 times, and counting...
            sleeping 417 seconds ... 
            tar: etc/selinux/targeted/modules/active/modules/postgrey.pp: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
            client-13-ib
            client-12-ib
            Client load failed on node client-13-ib
            
            client client-13-ib load stdout and debug files :
                          /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib
                          /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib.debug
            

            /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib:

            copying /usr/share/dbench/client.txt to /mnt/lustre/d0.dbench-client-13-ib/client.txt
            running 'dbench 2' on /mnt/lustre/d0.dbench-client-13-ib at Sun Jul 24 21:00:21 PDT 2011
            dbench PID=29959
            dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
            
            Running for 600 seconds with load 'client.txt' and minimum warmup 120 secs
            0 of 2 processes prepared for launch   0 sec
            2 of 2 processes prepared for launch   0 sec
            releasing clients
               2       666    42.32 MB/sec  warmup   1 sec  latency 354.182 ms
               2      1436    26.02 MB/sec  warmup   2 sec  latency 386.166 ms
            <~snip~>
               2     18452     0.00 MB/sec  execute  67 sec  latency 171356.589 ms
               2     18452     0.00 MB/sec  execute  68 sec  latency 172356.684 ms
            [18468] write failed on handle 13839 (Cannot send after transport endpoint shutdown)
            Child failed with status 1
            

            /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib.debug:

            2011-07-24 21:00:21: dbench run starting
            + mkdir -p /mnt/lustre/d0.dbench-client-13-ib
            + load_pid=29927
            + wait 29927
            + rundbench -D /mnt/lustre/d0.dbench-client-13-ib 2
            touch: missing file operand
            Try `touch --help' for more information.
            + '[' 1 -eq 0 ']'
            ++ date '+%F %H:%M:%S'
            + echoerr '2011-07-24 21:03:30: dbench failed'
            + echo '2011-07-24 21:03:30: dbench failed'
            2011-07-24 21:03:30: dbench failed
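
            The dbench load in this trace follows the same pattern; a rough sketch (rundbench and its -D argument are taken from the trace, the surrounding loop is an assumption):

            TESTDIR=/mnt/lustre/d0.dbench-$(hostname)
            mkdir -p "$TESTDIR"
            while true; do
                # two dbench client processes against the Lustre test directory, as in the trace
                rundbench -D "$TESTDIR" 2 &
                load_pid=$!
                if ! wait "$load_pid"; then
                    echo "$(date '+%F %H:%M:%S'): dbench failed" >&2
                    break
                fi
            done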
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib:

            tar: etc/selinux/targeted/modules/active/modules/postgrey.pp: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib.debug:

            2011-07-24 21:00:20: tar run starting
            + mkdir -p /mnt/lustre/d0.tar-client-12-ib
            + cd /mnt/lustre/d0.tar-client-12-ib
            + wait 29934
            + do_tar
            + tar cf - /etc
            + tar xf -
            + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib
            tar: Removing leading `/' from member names
            + return 2
            + RC=2
            ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib
            + PREV_ERRORS=
            + true
            + '[' 2 -ne 0 -a '' -a '' ']'
            + '[' 2 -eq 0 ']'
            ++ date '+%F %H:%M:%S'
            + echoerr '2011-07-24 21:03:32: tar failed'
            + echo '2011-07-24 21:03:32: tar failed'
            2011-07-24 21:03:32: tar failed
            

            Syslog on client node client-13-ib showed that:

            Jul 24 21:03:27 client-13 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
            Jul 24 21:03:29 client-13 kernel: Lustre: 29024:0:(client.c:2527:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
            Jul 24 21:03:29 client-13 kernel:  req@ffff8802fcbb5000 x1375277041099764/t470(470) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:6/4 lens 512/400 e 1 to 0 dl 1311566653 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
            Jul 24 21:03:30 client-13 kernel: Lustre: 29024:0:(import.c:1190:completed_replay_interpret()) lustre-OST0002-osc-ffff88030ec62400: version recovery fails, reconnecting
            Jul 24 21:03:30 client-13 kernel: LustreError: 167-0: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
            Jul 24 21:03:30 client-13 kernel: LustreError: 29023:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8802fcbdf400 x1375277041107454/t0(0) o-1->lustre-OST0002_UUID@192.168.4.133@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
            Jul 24 21:03:30 client-13 kernel: LustreError: 29021:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8803071b3800 x1375277041107455/t0(0) o-1->lustre-OST0002_UUID@192.168.4.133@o2ib:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
            Jul 24 21:03:30 client-13 kernel: Lustre: lustre-OST0002-osc-ffff88030ec62400: Connection restored to service lustre-OST0002 using nid 192.168.4.133@o2ib.
            

            Syslog on client node client-12-ib showed that:

            Jul 24 21:03:27 client-12 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
            Jul 24 21:03:29 client-12 kernel: LustreError: 29051:0:(client.c:2570:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff880302aa3800 x1375277041064805/t507(507) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:28/4 lens 408/400 e 0 to 0 dl 1311566655 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Jul 24 21:03:29 client-12 kernel: LustreError: 29051:0:(client.c:2570:ptlrpc_replay_interpret()) Skipped 10 previous similar messages
            Jul 24 21:03:29 client-12 kernel: Lustre: lustre-OST0004-osc-ffff88030b964000: Connection restored to service lustre-OST0004 using nid 192.168.4.133@o2ib.
            

            Syslog on MDS node client-7-ib showed that:

            Jul 24 21:03:27 client-7 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
            Jul 24 21:03:29 client-7 kernel: LustreError: 29520:0:(client.c:2570:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff8802a519e000 x1375277024283756/t508(508) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:28/4 lens 408/400 e 0 to 0 dl 1311566655 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Jul 24 21:03:29 client-7 kernel: LustreError: 29520:0:(client.c:2570:ptlrpc_replay_interpret()) Skipped 10 previous similar messages
            Jul 24 21:03:29 client-7 kernel: Lustre: lustre-OST0004-osc-MDT0000: Connection restored to service lustre-OST0004 using nid 192.168.4.133@o2ib.
            Jul 24 21:03:29 client-7 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans
            Jul 24 21:03:29 client-7 kernel: Lustre: 31049:0:(quota_master.c:1760:mds_quota_recovery()) Only 6/7 OSTs are active, abort quota recovery
            

            Syslog on OSS node fat-amd-2-ib showed that:

            Jul 24 21:03:27 fat-amd-2 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:4111:filter_destroy())  lustre-OST0004: can not find olg of group 0
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:4111:filter_destroy()) Skipped 9 previous similar messages
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(genops.c:1267:class_disconnect_stale_exports()) lustre-OST0004: disconnect stale client de6f48d2-f9b1-c66d-ae70-0bfaf5c8e6b5@192.168.4.13@o2ib
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:2927:filter_grant_sanity_check()) filter_disconnect: tot_granted 5345280 != fo_tot_granted 11374592
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(ldlm_resource.c:1084:ldlm_resource_get()) lvbo_init failed for resource 326: rc -2
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(ldlm_resource.c:1084:ldlm_resource_get()) Skipped 10 previous similar messages
            Jul 24 21:03:29 fat-amd-2 kernel: Lustre: lustre-OST0004: sending delayed replies to recovered clients
            Jul 24 21:03:29 fat-amd-2 kernel: Lustre: lustre-OST0004: received MDS connection from 192.168.4.7@o2ib
            Jul 24 21:03:29 fat-amd-2 kernel: Lustre: 30378:0:(filter.c:2550:filter_llog_connect()) lustre-OST0004: Recovery from log 0xff506/0x0:5a402c04
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 30513:0:(filter_io.c:723:filter_preprw_write()) lustre-OST0004: BRW to missing obj 342/0:rc -2
            Jul 24 21:03:30 fat-amd-2 kernel: Lustre: 30368:0:(filter.c:2846:filter_connect()) lustre-OST0002: Received MDS connection (0x2d0794f47e7586a3); group 0
            Jul 24 21:03:30 fat-amd-2 kernel: Lustre: 30368:0:(filter.c:2846:filter_connect()) Skipped 9 previous similar messages
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(ldlm_lib.c:871:target_handle_connect()) lustre-OST0004: connection from de6f48d2-f9b1-c66d-ae70-0bfaf5c8e6b5@192.168.4.13@o2ib t470 exp (null) cur 1311566655 last 0
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(ldlm_lib.c:871:target_handle_connect()) Skipped 5 previous similar messages
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(filter.c:2846:filter_connect()) lustre-OST0004: Received MDS connection (0x2d0794f47e759c6e); group 0
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-OST0004->NET_0x50000c0a8040d_UUID netid 50000: select flavor null
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 5 previous similar messages
            

            Maloo report: https://maloo.whamcloud.com/test_sets/eff97732-b67e-11e0-8bdf-52540025f9af

            Please refer to the attached recovery-oss-scale-1311567030.tar.bz2 for more syslogs and debug logs.


            People

              Assignee: Hongchao Zhang (hongchao.zhang)
              Reporter: Jian Yu (yujian)
              Votes: 0
              Watchers: 15
