[LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery Created: 26/Jun/11  Updated: 05/Apr/16  Resolved: 05/Apr/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre Branch: v1_8_6_RC3
Lustre Build: http://newbuild.whamcloud.com/job/lustre-b1_8/90/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
Distro/Arch: RHEL6/x86_64(patchless client, in-kernel OFED, kernel version: 2.6.32-131.2.1.el6)
RHEL5/x86_64(server, OFED 1.5.3.1, kernel version: 2.6.18-238.12.1.el5_lustre)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

MGS/MDS Nodes: client-10-ib(active), client-12-ib(passive)
                              \ /
                     1 combined MGS/MDT

OSS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(active)
                              \ /
                              OST1 (active in fat-amd-1-ib)
                              OST2 (active in fat-amd-2-ib)
                              OST3 (active in fat-amd-1-ib)
                              OST4 (active in fat-amd-2-ib)
                              OST5 (active in fat-amd-1-ib)
                              OST6 (active in fat-amd-2-ib)

Client Nodes: fat-amd-3-ib, client-[6,7,16,21,24]-ib


Attachments: File recovery-mds-scale-1309100884.tar.bz2     File recovery-oss-scale-1311567030.tar.bz2     File recovery-oss-scale.1315539020.log.tar.bz2     File recovery-oss-scale.1318474116.log.tar.bz2    
Issue Links:
Duplicate
duplicates LU-6200 Failover recovery-mds-scale test_fail... Resolved
Severity: 3
Bugzilla ID: 22777
Rank (Obsolete): 5680

 Description   

While running recovery-mds-scale with FLAVOR=OSS, it failed as follows after running 3 hours:

==== Checking the clients loads AFTER  failover -- failure NOT OK
ost5 has failed over 5 times, and counting...
sleeping 246 seconds ... 
tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-21-ib
Client load failed on node client-21-ib

client client-21-ib load stdout and debug files :
              /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
              /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
2011-06-26 08:08:03 Terminating clients loads ...
Duration:                86400
Server failover period: 600 seconds
Exited after:           13565 seconds
Number of failovers before exit:
mds: 0 times
ost1: 2 times
ost2: 6 times
ost3: 3 times
ost4: 4 times
ost5: 5 times
ost6: 3 times
Status: FAIL: rc=1

Syslog on client node client-21-ib showed that:

Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2

Syslog on the MDS node client-10-ib showed that:

Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2

Syslog on the OSS node fat-amd-1-ib showed that:

Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messages
Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messages
Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messages
Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clients
Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib

Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af

Please find the debug logs in the attachment.

This is a known issue: bug 22777



 Comments   
Comment by Peter Jones [ 27/Jun/11 ]

johann

Do you agree that this is a known issue? If so, does this mean that this is not a blocker for 1.8.6-wc?

Peter

Comment by Johann Lombardi (Inactive) [ 27/Jun/11 ]

Yes, this is a known issue that we hit sometimes with those tests.

Comment by Jian Yu [ 25/Jul/11 ]

Lustre Tag: v2_0_65_0
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/204/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.2.1.el6)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS

Lustre cluster configuration:

MGS/MDS Node: client-7-ib
OSS Nodes:    fat-amd-1-ib(active), fat-amd-2-ib(active)
                                 \  /
                                 OST1 (active in fat-amd-1-ib)
                                 OST2 (active in fat-amd-2-ib)
                                 OST3 (active in fat-amd-1-ib)
                                 OST4 (active in fat-amd-2-ib)
                                 OST5 (active in fat-amd-1-ib)
                                 OST6 (active in fat-amd-2-ib)
              client-8-ib (OST7)
Client Nodes: fat-amd-3-ib, client-[9,11,12,13]-ib

While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

==== Checking the clients loads AFTER  failover -- failure NOT OK
ost5 has failed over 1 times, and counting...
sleeping 417 seconds ... 
tar: etc/selinux/targeted/modules/active/modules/postgrey.pp: Cannot write: No such file or directory
tar: Exiting with failure status due to previous errors
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-13-ib
client-12-ib
Client load failed on node client-13-ib

client client-13-ib load stdout and debug files :
              /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib
              /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib.debug

/tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib:

copying /usr/share/dbench/client.txt to /mnt/lustre/d0.dbench-client-13-ib/client.txt
running 'dbench 2' on /mnt/lustre/d0.dbench-client-13-ib at Sun Jul 24 21:00:21 PDT 2011
dbench PID=29959
dbench version 4.00 - Copyright Andrew Tridgell 1999-2004

Running for 600 seconds with load 'client.txt' and minimum warmup 120 secs
0 of 2 processes prepared for launch   0 sec
2 of 2 processes prepared for launch   0 sec
releasing clients
   2       666    42.32 MB/sec  warmup   1 sec  latency 354.182 ms
   2      1436    26.02 MB/sec  warmup   2 sec  latency 386.166 ms
<~snip~>
   2     18452     0.00 MB/sec  execute  67 sec  latency 171356.589 ms
   2     18452     0.00 MB/sec  execute  68 sec  latency 172356.684 ms
[18468] write failed on handle 13839 (Cannot send after transport endpoint shutdown)
Child failed with status 1

/tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib.debug:

2011-07-24 21:00:21: dbench run starting
+ mkdir -p /mnt/lustre/d0.dbench-client-13-ib
+ load_pid=29927
+ wait 29927
+ rundbench -D /mnt/lustre/d0.dbench-client-13-ib 2
touch: missing file operand
Try `touch --help' for more information.
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2011-07-24 21:03:30: dbench failed'
+ echo '2011-07-24 21:03:30: dbench failed'
2011-07-24 21:03:30: dbench failed

/tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib:

tar: etc/selinux/targeted/modules/active/modules/postgrey.pp: Cannot write: No such file or directory
tar: Exiting with failure status due to previous errors

/tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib.debug:

2011-07-24 21:00:20: tar run starting
+ mkdir -p /mnt/lustre/d0.tar-client-12-ib
+ cd /mnt/lustre/d0.tar-client-12-ib
+ wait 29934
+ do_tar
+ tar cf - /etc
+ tar xf -
+ tee /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib
tar: Removing leading `/' from member names
+ return 2
+ RC=2
++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib
+ PREV_ERRORS=
+ true
+ '[' 2 -ne 0 -a '' -a '' ']'
+ '[' 2 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2011-07-24 21:03:32: tar failed'
+ echo '2011-07-24 21:03:32: tar failed'
2011-07-24 21:03:32: tar failed

Syslog on client node client-13-ib showed that:

Jul 24 21:03:27 client-13 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
Jul 24 21:03:29 client-13 kernel: Lustre: 29024:0:(client.c:2527:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
Jul 24 21:03:29 client-13 kernel:  req@ffff8802fcbb5000 x1375277041099764/t470(470) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:6/4 lens 512/400 e 1 to 0 dl 1311566653 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
Jul 24 21:03:30 client-13 kernel: Lustre: 29024:0:(import.c:1190:completed_replay_interpret()) lustre-OST0002-osc-ffff88030ec62400: version recovery fails, reconnecting
Jul 24 21:03:30 client-13 kernel: LustreError: 167-0: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
Jul 24 21:03:30 client-13 kernel: LustreError: 29023:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8802fcbdf400 x1375277041107454/t0(0) o-1->lustre-OST0002_UUID@192.168.4.133@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
Jul 24 21:03:30 client-13 kernel: LustreError: 29021:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8803071b3800 x1375277041107455/t0(0) o-1->lustre-OST0002_UUID@192.168.4.133@o2ib:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
Jul 24 21:03:30 client-13 kernel: Lustre: lustre-OST0002-osc-ffff88030ec62400: Connection restored to service lustre-OST0002 using nid 192.168.4.133@o2ib.

Syslog on client node client-12-ib showed that:

Jul 24 21:03:27 client-12 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
Jul 24 21:03:29 client-12 kernel: LustreError: 29051:0:(client.c:2570:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff880302aa3800 x1375277041064805/t507(507) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:28/4 lens 408/400 e 0 to 0 dl 1311566655 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
Jul 24 21:03:29 client-12 kernel: LustreError: 29051:0:(client.c:2570:ptlrpc_replay_interpret()) Skipped 10 previous similar messages
Jul 24 21:03:29 client-12 kernel: Lustre: lustre-OST0004-osc-ffff88030b964000: Connection restored to service lustre-OST0004 using nid 192.168.4.133@o2ib.

Syslog on MDS node client-7-ib showed that:

Jul 24 21:03:27 client-7 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
Jul 24 21:03:29 client-7 kernel: LustreError: 29520:0:(client.c:2570:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff8802a519e000 x1375277024283756/t508(508) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:28/4 lens 408/400 e 0 to 0 dl 1311566655 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
Jul 24 21:03:29 client-7 kernel: LustreError: 29520:0:(client.c:2570:ptlrpc_replay_interpret()) Skipped 10 previous similar messages
Jul 24 21:03:29 client-7 kernel: Lustre: lustre-OST0004-osc-MDT0000: Connection restored to service lustre-OST0004 using nid 192.168.4.133@o2ib.
Jul 24 21:03:29 client-7 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans
Jul 24 21:03:29 client-7 kernel: Lustre: 31049:0:(quota_master.c:1760:mds_quota_recovery()) Only 6/7 OSTs are active, abort quota recovery

Syslog on OSS node fat-amd-2-ib showed that:

Jul 24 21:03:27 fat-amd-2 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:4111:filter_destroy())  lustre-OST0004: can not find olg of group 0
Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:4111:filter_destroy()) Skipped 9 previous similar messages
Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(genops.c:1267:class_disconnect_stale_exports()) lustre-OST0004: disconnect stale client de6f48d2-f9b1-c66d-ae70-0bfaf5c8e6b5@192.168.4.13@o2ib
Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:2927:filter_grant_sanity_check()) filter_disconnect: tot_granted 5345280 != fo_tot_granted 11374592
Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(ldlm_resource.c:1084:ldlm_resource_get()) lvbo_init failed for resource 326: rc -2
Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(ldlm_resource.c:1084:ldlm_resource_get()) Skipped 10 previous similar messages
Jul 24 21:03:29 fat-amd-2 kernel: Lustre: lustre-OST0004: sending delayed replies to recovered clients
Jul 24 21:03:29 fat-amd-2 kernel: Lustre: lustre-OST0004: received MDS connection from 192.168.4.7@o2ib
Jul 24 21:03:29 fat-amd-2 kernel: Lustre: 30378:0:(filter.c:2550:filter_llog_connect()) lustre-OST0004: Recovery from log 0xff506/0x0:5a402c04
Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 30513:0:(filter_io.c:723:filter_preprw_write()) lustre-OST0004: BRW to missing obj 342/0:rc -2
Jul 24 21:03:30 fat-amd-2 kernel: Lustre: 30368:0:(filter.c:2846:filter_connect()) lustre-OST0002: Received MDS connection (0x2d0794f47e7586a3); group 0
Jul 24 21:03:30 fat-amd-2 kernel: Lustre: 30368:0:(filter.c:2846:filter_connect()) Skipped 9 previous similar messages
Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(ldlm_lib.c:871:target_handle_connect()) lustre-OST0004: connection from de6f48d2-f9b1-c66d-ae70-0bfaf5c8e6b5@192.168.4.13@o2ib t470 exp (null) cur 1311566655 last 0
Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(ldlm_lib.c:871:target_handle_connect()) Skipped 5 previous similar messages
Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(filter.c:2846:filter_connect()) lustre-OST0004: Received MDS connection (0x2d0794f47e759c6e); group 0
Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-OST0004->NET_0x50000c0a8040d_UUID netid 50000: select flavor null
Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 5 previous similar messages

Maloo report: https://maloo.whamcloud.com/test_sets/eff97732-b67e-11e0-8bdf-52540025f9af

Please refer to the attached recovery-oss-scale-1311567030.tar.bz2 for more syslogs and debug logs.

Comment by Jian Yu [ 09/Sep/11 ]

Lustre Branch: master
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/276/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/54/
Distro/Arch: RHEL5/x86_64(in-kernel OFED, kernel version: 2.6.18-238.19.1.el5)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS

recovery-mds-scale(FLAVOR=OSS) failed with the same issue: https://maloo.whamcloud.com/test_sets/d0a83c78-da97-11e0-8d02-52540025f9af

Please refer to the attached recovery-oss-scale.1315539020.log.tar.bz2 for more logs.

Comment by Jian Yu [ 20/Sep/11 ]

Lustre Tag: v2_1_0_0_RC2
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/283/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/54/
Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS

After running about 2 hours (OSS failed over 7 times), recovery-mds-scale (FLAVOR=OSS) test failed with the same issue:
https://maloo.whamcloud.com/test_sets/50e83640-e362-11e0-9909-52540025f9af

Please refer to the attached recovery-oss-scale.1316503823.log.tar.bz2 for more logs.

Comment by Jian Yu [ 13/Oct/11 ]

Lustre Tag: v1_8_7_WC1_RC1
Lustre Build: http://newbuild.whamcloud.com/job/lustre-b1_8/142/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/65/
Distro/Arch: RHEL5/x86_64(server, OFED 1.5.3.2, ext4-based ldiskfs), RHEL6/x86_64(client, in-kernel OFED)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS

recovery-mds-scale (FLAVOR=OSS) test failed with the same issue: https://maloo.whamcloud.com/test_sets/004f464c-f550-11e0-908b-52540025f9af

Please refer to the attached recovery-oss-scale.1318474116.log.tar.bz2 for more logs.

Comment by Jian Yu [ 22/Feb/12 ]

Lustre Tag: v2_1_1_0_RC4
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/44/
e2fsprogs Build: http://build.whamcloud.com/job/e2fsprogs-master/217/
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-220.el6)
Network: IB (in-kernel OFED)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS

Configuration:

MGS/MDS Nodes: client-8-ib

OSS Nodes: client-18-ib(active), client-19-ib(active)
                              \ /
                              OST1 (active in client-18-ib)
                              OST2 (active in client-19-ib)
                              OST3 (active in client-18-ib)
                              OST4 (active in client-19-ib)
                              OST5 (active in client-18-ib)
                              OST6 (active in client-19-ib)
           client-9-ib(OST7)

Client Nodes: client-[1,4,17],fat-amd-2,fat-intel-2

Network Addresses:
client-1-ib: 192.168.4.1
client-4-ib: 192.168.4.4
client-8-ib: 192.168.4.8
client-9-ib: 192.168.4.9
client-17-ib: 192.168.4.17
client-18-ib: 192.168.4.18
client-19-ib: 192.168.4.19
fat-amd-2-ib: 192.168.4.133
fat-intel-2-ib: 192.168.4.129

While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

==== Checking the clients loads AFTER  failover -- failure NOT OK
ost1 has failed over 1 times, and counting...
sleeping 717 seconds ...
tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory
tar: Exiting with failure status due to previous errors
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-1-ib
Client load failed on node client-1-ib

client client-1-ib load stdout and debug files :
              /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
              /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug

/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:

tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory
tar: Exiting with failure status due to previous errors

/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug

<~snip~>
2012-02-22 03:56:04: tar run starting
+ mkdir -p /mnt/lustre/d0.tar-client-1-ib
+ cd /mnt/lustre/d0.tar-client-1-ib
+ wait 11196
+ do_tar
+ tar cf - /etc
+ tar xf -
+ tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
tar: Removing leading `/' from member names
+ return 2
+ RC=2
++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
+ PREV_ERRORS=
+ true
+ '[' 2 -ne 0 -a '' -a '' ']'
+ '[' 2 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2012-02-22 03:59:25: tar failed'
+ echo '2012-02-22 03:59:25: tar failed'
2012-02-22 03:59:25: tar failed
<~snip~>

Syslog on client node client-1-ib showed that:

Feb 22 03:59:12 client-1 kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031d605c00 x1394513519058221/t379(379) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages
Feb 22 03:59:19 client-1 kernel: Lustre: lustre-OST0004-osc-ffff88032c89a400: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib.

Syslog on MDS node client-8-ib showed that:

Feb 22 03:59:12 client-8-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88030708c400 x1394513506470444/t380(380) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages
Feb 22 03:59:19 client-8-ib kernel: Lustre: lustre-OST0004-osc-MDT0000: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib.
Feb 22 03:59:19 client-8-ib kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans
Feb 22 03:59:19 client-8-ib kernel: Lustre: 7395:0:(quota_master.c:1760:mds_quota_recovery()) Only 3/7 OSTs are active, abort quota recovery

Syslog on OSS node client-19-ib showed that:

Feb 22 03:59:12 client-19-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
Feb 22 03:59:18 client-19-ib kernel: Lustre: 7501:0:(filter.c:2697:filter_connect_internal()) lustre-OST0004: Received MDS connection for group 0
Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy())  lustre-OST0004: can not find olg of group 0
Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy()) Skipped 22 previous similar messages
Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: sending delayed replies to recovered clients
Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: received MDS connection from 192.168.4.8@o2ib
Feb 22 03:59:19 client-19-ib kernel: Lustre: 7530:0:(filter.c:2553:filter_llog_connect()) lustre-OST0004: Recovery from log 0xff506/0x0:8f36a744

Please refer to /scratch/logs/2.1.1/recovery-oss-scale.1329912676.log.tar.bz2 on brent node for debug and other logs.

Comment by Jian Yu [ 29/Mar/12 ]

Lustre Tag: v2_2_0_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_2/17/
Distro/Arch: SLES11SP1/x86_64(client), RHEL6.2/x86_64(server)
Network: TCP (1GigE)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/b6eb20c8-799f-11e1-9d2a-5254004bbbd3

Comment by Jian Yu [ 15/May/12 ]

Lustre Tag: v1_8_8_WC1_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/195/
Distro/Arch: RHEL5.8/x86_64(server), RHEL6.2/x86_64(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/be9c60e0-9e82-11e1-9080-52540035b04c

Comment by Jian Yu [ 01/Jun/12 ]

Lustre Tag: v2_1_2_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/86/
Distro/Arch: RHEL6.2/x86_64
Network: TCP (1GigE)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/c9193e08-abca-11e1-9b8f-52540035b04c

Comment by Jian Yu [ 14/Aug/12 ]

Lustre Tag: v2_1_3_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/113/
Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.2.1.el6)
Network: IB (in-kernel OFED)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

The issue occurred again while running recovery-mds-scale failover_ost test:
https://maloo.whamcloud.com/test_sets/b18a1330-e5ad-11e1-ae4e-52540035b04c

Comment by Jian Yu [ 21/Aug/12 ]

Another instance:
https://maloo.whamcloud.com/test_sets/f99459d2-eb26-11e1-b137-52540035b04c

Comment by Peter Jones [ 05/Dec/12 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 06/Dec/12 ]

How about fixing the bug by waiting for some time when -2 (ENOENT) is encountered on an OST that is still in recovery mode? I will produce a patch along these lines.

Comment by Jian Yu [ 12/Dec/12 ]

This has been blocking the recovery-mds-scale failover_ost test.

Comment by Hongchao Zhang [ 18/Dec/12 ]

The patch against b2_1 is being created and tested.

Comment by Hongchao Zhang [ 19/Dec/12 ]

the patch is tracked at http://review.whamcloud.com/#change,4868

Comment by Hongchao Zhang [ 18/Jan/13 ]

The patch is updated. Since precreation RPCs are not replayable, it will try to create the missing object if it was not re-created during recovery; it will either wait for the creation to complete or return -EINPROGRESS if OBD_CONNECT_EINPROGRESS is set in obd_export->exp_connect_flags.

Comment by Jian Yu [ 24/Jan/13 ]

Lustre Branch: b1_8
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/251/

The same issue occurred: https://maloo.whamcloud.com/test_sets/9734baba-661f-11e2-a42b-52540035b04c

Comment by Jian Yu [ 18/Feb/13 ]

Lustre Tag: v1_8_9_WC1_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
Distro/Arch: RHEL5.9/x86_64
Network: IB (in-kernel OFED)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

The same issue occurred: https://maloo.whamcloud.com/test_sets/d7e82752-79db-11e2-8fd2-52540035b04c

Comment by Jian Yu [ 20/Feb/13 ]

Lustre Tag: v1_8_9_WC1_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/258
Distro/Arch: RHEL5.9/x86_64(server), RHEL6.3/x86_64(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

The same issue occurred: https://maloo.whamcloud.com/test_sets/79cd620e-7af3-11e2-b916-52540035b04c

Comment by Jian Yu [ 19/Mar/13 ]

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/189/

The same issue occurred: https://maloo.whamcloud.com/test_sets/278684ba-902b-11e2-9b28-52540035b04c

Dmesg on OSS node showed that:

Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
cannot allocate a tage (271)
cannot allocate a tage (271)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost6 has failed over 5 times, and counting...
Lustre: DEBUG MARKER: ost6 has failed over 5 times, and counting...
Lustre: lustre-OST0006: sending delayed replies to recovered clients
LustreError: 3691:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 14662: rc -2
Lustre: lustre-OST0006: received MDS connection from 10.10.4.190@tcp
__ratelimit: 12 callbacks suppressed
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
__ratelimit: 8 callbacks suppressed

Comment by Jian Yu [ 06/Jun/13 ]

Lustre Tag: v2_1_6_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/208
Distro/Arch: RHEL6.4/x86_64
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

The issue occurred again while running recovery-mds-scale failover_ost test:
https://maloo.whamcloud.com/test_sets/bca04cd4-cdf2-11e2-ba28-52540035b04c

Comment by Hongchao Zhang [ 18/Jun/13 ]

The patch is updated: http://review.hpdd.intel.com/#/c/4868/

Comment by John Fuchs-Chesney (Inactive) [ 07/Jan/16 ]

Incomplete and out of date.
~ jfc.

Comment by Andreas Dilger [ 03/Apr/16 ]

I noticed that the patch http://review.whamcloud.com/4868 (against b2_1) was never landed to any branch. Was this bug fixed in some other way (in which case it can be closed again) or are we just ignoring these test failures now?

Comment by Jian Yu [ 05/Apr/16 ]

Hi Andreas,

The failure is still blocking recovery-mds-scale failover_ost testing on all of the Lustre branches. LU-4621 and LU-6200 were created later to track the failures.

Comment by Hongchao Zhang [ 05/Apr/16 ]

Yes, this should be a duplicate of LU-6200, which contains a patch that fixes the problem in a similar way.
