Lustre / LU-463

orphan recovery happens too late, causing writes to fail with ENOENT after recovery

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6
    • Labels: None
    • Severity: 3
    • Bugzilla ID: 22777
    • Rank: 5680

    Description

      While running recovery-mds-scale with FLAVOR=OSS, it failed as follows after running for about 3 hours:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost5 has failed over 5 times, and counting...
      sleeping 246 seconds ... 
      tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-21-ib
      Client load failed on node client-21-ib
      
      client client-21-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
      2011-06-26 08:08:03 Terminating clients loads ...
      Duration:                86400
      Server failover period: 600 seconds
      Exited after:           13565 seconds
      Number of failovers before exit:
      mds: 0 times
      ost1: 2 times
      ost2: 6 times
      ost3: 3 times
      ost4: 4 times
      ost5: 5 times
      ost6: 3 times
      Status: FAIL: rc=1
      

      Syslog on client node client-21-ib showed that:

      Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the MDS node client-10-ib showed that:

      Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the OSS node fat-amd-1-ib showed that:

      Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messages
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clients
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af

      Please find the debug logs in the attachment.

      This is a known issue: bug 22777.

      Attachments

        Issue Links

          Activity

            [LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery

            hongchao.zhang Hongchao Zhang added a comment -

            The patch against b2_1 is under creation and testing.
            yujian Jian Yu added a comment -

            This has been blocking the recovery-mds-scale failover_ost test.


            hongchao.zhang Hongchao Zhang added a comment -

            How about fixing the bug by waiting for some time when -2 (ENOENT) is encountered on an OST that is still in recovery mode? I will produce a patch along these lines.
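            A minimal, self-contained C sketch of the retry-on-ENOENT idea proposed above (illustrative only, not the actual Lustre patch; send_replay(), target_in_recovery(), and replay_with_backoff() are hypothetical stand-ins for the real client replay machinery):

            /*
             * Illustrative sketch only, not the actual Lustre patch.  It models the
             * idea above: if a replayed request completes with -ENOENT while the
             * target (OST) is still in recovery, back off and retry instead of
             * treating the error as fatal.  send_replay() and target_in_recovery()
             * are hypothetical stand-ins for the real client replay machinery.
             */
            #include <errno.h>
            #include <stdbool.h>
            #include <stdio.h>
            #include <unistd.h>

            /* Hypothetical: pretend the first few replays race with recovery. */
            static int send_replay(int attempt)
            {
                    return attempt < 3 ? -ENOENT : 0;
            }

            /* Hypothetical: whether the target still reports itself as recovering. */
            static bool target_in_recovery(int attempt)
            {
                    return attempt < 3;
            }

            static int replay_with_backoff(int max_retries)
            {
                    int attempt;

                    for (attempt = 0; attempt <= max_retries; attempt++) {
                            int rc = send_replay(attempt);

                            /* Success, or a genuinely missing object: return as-is. */
                            if (rc != -ENOENT || !target_in_recovery(attempt))
                                    return rc;

                            fprintf(stderr,
                                    "replay got -ENOENT during recovery, retrying (%d)\n",
                                    attempt + 1);
                            sleep(1);       /* crude fixed backoff for the sketch */
                    }
                    return -ENOENT;
            }

            int main(void)
            {
                    int rc = replay_with_backoff(5);

                    printf("replay finished: rc = %d\n", rc);
                    return rc ? 1 : 0;
            }

            In a real fix the wait would presumably be bounded by the target's recovery window rather than a fixed one-second sleep.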
            pjones Peter Jones added a comment -

            Hongchao

            Could you please look into this one?

            Thanks

            Peter

            yujian Jian Yu added a comment - Another instance: https://maloo.whamcloud.com/test_sets/f99459d2-eb26-11e1-b137-52540035b04c
            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_3_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/113/
            Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.2.1.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The issue occurred again while running recovery-mds-scale failover_ost test:
            https://maloo.whamcloud.com/test_sets/b18a1330-e5ad-11e1-ae4e-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_2_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/86/
            Distro/Arch: RHEL6.2/x86_64
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/c9193e08-abca-11e1-9b8f-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v1_8_8_WC1_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/195/
            Distro/Arch: RHEL5.8/x86_64(server), RHEL6.2/x86_64(client)
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/be9c60e0-9e82-11e1-9080-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_2_0_0_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_2/17/
            Distro/Arch: SLES11SP1/x86_64(client), RHEL6.2/x86_64(server)
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/b6eb20c8-799f-11e1-9d2a-5254004bbbd3

            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_1_0_RC4
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/44/
            e2fsprogs Build: http://build.whamcloud.com/job/e2fsprogs-master/217/
            Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-220.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            Configuration:

            MGS/MDS Nodes: client-8-ib
            
            OSS Nodes: client-18-ib(active), client-19-ib(active)
                                          \ /
                                          OST1 (active in client-18-ib)
                                          OST2 (active in client-19-ib)
                                          OST3 (active in client-18-ib)
                                          OST4 (active in client-19-ib)
                                          OST5 (active in client-18-ib)
                                          OST6 (active in client-19-ib)
                       client-9-ib(OST7)
            
            Client Nodes: client-[1,4,17],fat-amd-2,fat-intel-2
            
            Network Addresses:
            client-1-ib: 192.168.4.1
            client-4-ib: 192.168.4.4
            client-8-ib: 192.168.4.8
            client-9-ib: 192.168.4.9
            client-17-ib: 192.168.4.17
            client-18-ib: 192.168.4.18
            client-19-ib: 192.168.4.19
            fat-amd-2-ib: 192.168.4.133
            fat-intel-2-ib: 192.168.4.129
            

            While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

            ==== Checking the clients loads AFTER  failover -- failure NOT OK
            ost1 has failed over 1 times, and counting...
            sleeping 717 seconds ...
            tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
            client-1-ib
            Client load failed on node client-1-ib
            
            client client-1-ib load stdout and debug files :
                          /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
                          /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:

            tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug

            <~snip~>
            2012-02-22 03:56:04: tar run starting
            + mkdir -p /mnt/lustre/d0.tar-client-1-ib
            + cd /mnt/lustre/d0.tar-client-1-ib
            + wait 11196
            + do_tar
            + tar cf - /etc
            + tar xf -
            + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
            tar: Removing leading `/' from member names
            + return 2
            + RC=2
            ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
            + PREV_ERRORS=
            + true
            + '[' 2 -ne 0 -a '' -a '' ']'
            + '[' 2 -eq 0 ']'
            ++ date '+%F %H:%M:%S'
            + echoerr '2012-02-22 03:59:25: tar failed'
            + echo '2012-02-22 03:59:25: tar failed'
            2012-02-22 03:59:25: tar failed
            <~snip~>
            

            Syslog on client node client-1-ib showed that:

            Feb 22 03:59:12 client-1 kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
            Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031d605c00 x1394513519058221/t379(379) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages
            Feb 22 03:59:19 client-1 kernel: Lustre: lustre-OST0004-osc-ffff88032c89a400: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib.
            

            Syslog on MDS node client-8-ib showed that:

            Feb 22 03:59:12 client-8-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
            Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88030708c400 x1394513506470444/t380(380) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages
            Feb 22 03:59:19 client-8-ib kernel: Lustre: lustre-OST0004-osc-MDT0000: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib.
            Feb 22 03:59:19 client-8-ib kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans
            Feb 22 03:59:19 client-8-ib kernel: Lustre: 7395:0:(quota_master.c:1760:mds_quota_recovery()) Only 3/7 OSTs are active, abort quota recovery
            

            Syslog on OSS node client-19-ib showed that:

            Feb 22 03:59:12 client-19-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
            Feb 22 03:59:18 client-19-ib kernel: Lustre: 7501:0:(filter.c:2697:filter_connect_internal()) lustre-OST0004: Received MDS connection for group 0
            Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy())  lustre-OST0004: can not find olg of group 0
            Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy()) Skipped 22 previous similar messages
            Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: sending delayed replies to recovered clients
            Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: received MDS connection from 192.168.4.8@o2ib
            Feb 22 03:59:19 client-19-ib kernel: Lustre: 7530:0:(filter.c:2553:filter_llog_connect()) lustre-OST0004: Recovery from log 0xff506/0x0:8f36a744
            

            Please refer to /scratch/logs/2.1.1/recovery-oss-scale.1329912676.log.tar.bz2 on brent node for debug and other logs.

            yujian Jian Yu added a comment -

            Lustre Tag: v1_8_7_WC1_RC1
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-b1_8/142/
            e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/65/
            Distro/Arch: RHEL5/x86_64(server, OFED 1.5.3.2, ext4-based ldiskfs), RHEL6/x86_64(client, in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            recovery-mds-scale (FLAVOR=OSS) test failed with the same issue: https://maloo.whamcloud.com/test_sets/004f464c-f550-11e0-908b-52540025f9af

            Please refer to the attached recovery-oss-scale.1318474116.log.tar.bz2 for more logs.


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 15
