[LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery Created: 26/Jun/11 Updated: 05/Apr/16 Resolved: 05/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Branch: v1_8_6_RC3
MGS/MDS Nodes: client-10-ib(active), client-12-ib(passive)
OSS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(active)
Client Nodes: fat-amd-3-ib, client-[6,7,16,21,24]-ib |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Bugzilla ID: | 22777 |
| Rank (Obsolete): | 5680 |
| Description |
|
While running recovery-mds-scale with FLAVOR=OSS, the test failed as follows after running for 3 hours: ==== Checking the clients loads AFTER failover -- failure NOT OK
ost5 has failed over 5 times, and counting...
sleeping 246 seconds ...
tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-21-ib
Client load failed on node client-21-ib
client client-21-ib load stdout and debug files :
/tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
/tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
2011-06-26 08:08:03 Terminating clients loads ...
Duration: 86400
Server failover period: 600 seconds
Exited after: 13565 seconds
Number of failovers before exit:
mds: 0 times
ost1: 2 times
ost2: 6 times
ost3: 3 times
ost4: 4 times
ost5: 5 times
ost6: 3 times
Status: FAIL: rc=1
Syslog on client node client-21-ib showed that:
Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2
Syslog on the MDS node client-10-ib showed that:
Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2
Syslog on the OSS node fat-amd-1-ib showed that:
Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messages
Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messages
Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messages
Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clients
Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib
Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af
Please find the debug logs in the attachment. This is a known issue: bug 22777 |
| Comments |
| Comment by Peter Jones [ 27/Jun/11 ] |
|
johann, do you agree that this is a known issue? If so, does this mean that it is not a blocker for 1.8.6-wc? Peter |
| Comment by Johann Lombardi (Inactive) [ 27/Jun/11 ] |
|
Yes, this is a known issue that we hit sometimes with those tests. |
| Comment by Jian Yu [ 25/Jul/11 ] |
|
Lustre Tag: v2_0_65_0
Lustre cluster configuration:
MGS/MDS Node: client-7-ib
OSS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(active)
\ /
OST1 (active in fat-amd-1-ib)
OST2 (active in fat-amd-2-ib)
OST3 (active in fat-amd-1-ib)
OST4 (active in fat-amd-2-ib)
OST5 (active in fat-amd-1-ib)
OST6 (active in fat-amd-2-ib)
client-8-ib (OST7)
Client Nodes: fat-amd-3-ib, client-[9,11,12,13]-ib
While running recovery-mds-scale with FLAVOR=OSS, it failed as follows: ==== Checking the clients loads AFTER failover -- failure NOT OK
ost5 has failed over 1 times, and counting...
sleeping 417 seconds ...
tar: etc/selinux/targeted/modules/active/modules/postgrey.pp: Cannot write: No such file or directory
tar: Exiting with failure status due to previous errors
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-13-ib
client-12-ib
Client load failed on node client-13-ib
client client-13-ib load stdout and debug files :
/tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib
/tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib.debug
/tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib: copying /usr/share/dbench/client.txt to /mnt/lustre/d0.dbench-client-13-ib/client.txt running 'dbench 2' on /mnt/lustre/d0.dbench-client-13-ib at Sun Jul 24 21:00:21 PDT 2011 dbench PID=29959 dbench version 4.00 - Copyright Andrew Tridgell 1999-2004 Running for 600 seconds with load 'client.txt' and minimum warmup 120 secs 0 of 2 processes prepared for launch 0 sec 2 of 2 processes prepared for launch 0 sec releasing clients 2 666 42.32 MB/sec warmup 1 sec latency 354.182 ms 2 1436 26.02 MB/sec warmup 2 sec latency 386.166 ms <~snip~> 2 18452 0.00 MB/sec execute 67 sec latency 171356.589 ms 2 18452 0.00 MB/sec execute 68 sec latency 172356.684 ms [18468] write failed on handle 13839 (Cannot send after transport endpoint shutdown) Child failed with status 1 /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib.debug: 2011-07-24 21:00:21: dbench run starting + mkdir -p /mnt/lustre/d0.dbench-client-13-ib + load_pid=29927 + wait 29927 + rundbench -D /mnt/lustre/d0.dbench-client-13-ib 2 touch: missing file operand Try `touch --help' for more information. + '[' 1 -eq 0 ']' ++ date '+%F %H:%M:%S' + echoerr '2011-07-24 21:03:30: dbench failed' + echo '2011-07-24 21:03:30: dbench failed' 2011-07-24 21:03:30: dbench failed /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib: tar: etc/selinux/targeted/modules/active/modules/postgrey.pp: Cannot write: No such file or directory tar: Exiting with failure status due to previous errors /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib.debug: 2011-07-24 21:00:20: tar run starting + mkdir -p /mnt/lustre/d0.tar-client-12-ib + cd /mnt/lustre/d0.tar-client-12-ib + wait 29934 + do_tar + tar cf - /etc + tar xf - + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib tar: Removing leading `/' from member names + return 2 + RC=2 ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib + PREV_ERRORS= + true + '[' 2 -ne 0 -a '' -a '' ']' + '[' 2 -eq 0 ']' ++ date '+%F %H:%M:%S' + echoerr '2011-07-24 21:03:32: tar failed' + echo '2011-07-24 21:03:32: tar failed' 2011-07-24 21:03:32: tar failed Syslog on client node client-13-ib showed that: Jul 24 21:03:27 client-13 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting... Jul 24 21:03:29 client-13 kernel: Lustre: 29024:0:(client.c:2527:ptlrpc_replay_interpret()) @@@ Version mismatch during replay Jul 24 21:03:29 client-13 kernel: req@ffff8802fcbb5000 x1375277041099764/t470(470) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:6/4 lens 512/400 e 1 to 0 dl 1311566653 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1 Jul 24 21:03:30 client-13 kernel: Lustre: 29024:0:(import.c:1190:completed_replay_interpret()) lustre-OST0002-osc-ffff88030ec62400: version recovery fails, reconnecting Jul 24 21:03:30 client-13 kernel: LustreError: 167-0: This client was evicted by lustre-OST0002; in progress operations using this service will fail. 
Jul 24 21:03:30 client-13 kernel: LustreError: 29023:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8802fcbdf400 x1375277041107454/t0(0) o-1->lustre-OST0002_UUID@192.168.4.133@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1 Jul 24 21:03:30 client-13 kernel: LustreError: 29021:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8803071b3800 x1375277041107455/t0(0) o-1->lustre-OST0002_UUID@192.168.4.133@o2ib:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1 Jul 24 21:03:30 client-13 kernel: Lustre: lustre-OST0002-osc-ffff88030ec62400: Connection restored to service lustre-OST0002 using nid 192.168.4.133@o2ib. Syslog on client node client-12-ib showed that: Jul 24 21:03:27 client-12 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting... Jul 24 21:03:29 client-12 kernel: LustreError: 29051:0:(client.c:2570:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff880302aa3800 x1375277041064805/t507(507) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:28/4 lens 408/400 e 0 to 0 dl 1311566655 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1 Jul 24 21:03:29 client-12 kernel: LustreError: 29051:0:(client.c:2570:ptlrpc_replay_interpret()) Skipped 10 previous similar messages Jul 24 21:03:29 client-12 kernel: Lustre: lustre-OST0004-osc-ffff88030b964000: Connection restored to service lustre-OST0004 using nid 192.168.4.133@o2ib. Syslog on MDS node client-7-ib showed that: Jul 24 21:03:27 client-7 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting... Jul 24 21:03:29 client-7 kernel: LustreError: 29520:0:(client.c:2570:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff8802a519e000 x1375277024283756/t508(508) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:28/4 lens 408/400 e 0 to 0 dl 1311566655 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1 Jul 24 21:03:29 client-7 kernel: LustreError: 29520:0:(client.c:2570:ptlrpc_replay_interpret()) Skipped 10 previous similar messages Jul 24 21:03:29 client-7 kernel: Lustre: lustre-OST0004-osc-MDT0000: Connection restored to service lustre-OST0004 using nid 192.168.4.133@o2ib. Jul 24 21:03:29 client-7 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans Jul 24 21:03:29 client-7 kernel: Lustre: 31049:0:(quota_master.c:1760:mds_quota_recovery()) Only 6/7 OSTs are active, abort quota recovery Syslog on OSS node fat-amd-2-ib showed that: Jul 24 21:03:27 fat-amd-2 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting... 
Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:4111:filter_destroy()) lustre-OST0004: can not find olg of group 0 Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:4111:filter_destroy()) Skipped 9 previous similar messages Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(genops.c:1267:class_disconnect_stale_exports()) lustre-OST0004: disconnect stale client de6f48d2-f9b1-c66d-ae70-0bfaf5c8e6b5@192.168.4.13@o2ib Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:2927:filter_grant_sanity_check()) filter_disconnect: tot_granted 5345280 != fo_tot_granted 11374592 Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(ldlm_resource.c:1084:ldlm_resource_get()) lvbo_init failed for resource 326: rc -2 Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(ldlm_resource.c:1084:ldlm_resource_get()) Skipped 10 previous similar messages Jul 24 21:03:29 fat-amd-2 kernel: Lustre: lustre-OST0004: sending delayed replies to recovered clients Jul 24 21:03:29 fat-amd-2 kernel: Lustre: lustre-OST0004: received MDS connection from 192.168.4.7@o2ib Jul 24 21:03:29 fat-amd-2 kernel: Lustre: 30378:0:(filter.c:2550:filter_llog_connect()) lustre-OST0004: Recovery from log 0xff506/0x0:5a402c04 Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 30513:0:(filter_io.c:723:filter_preprw_write()) lustre-OST0004: BRW to missing obj 342/0:rc -2 Jul 24 21:03:30 fat-amd-2 kernel: Lustre: 30368:0:(filter.c:2846:filter_connect()) lustre-OST0002: Received MDS connection (0x2d0794f47e7586a3); group 0 Jul 24 21:03:30 fat-amd-2 kernel: Lustre: 30368:0:(filter.c:2846:filter_connect()) Skipped 9 previous similar messages Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(ldlm_lib.c:871:target_handle_connect()) lustre-OST0004: connection from de6f48d2-f9b1-c66d-ae70-0bfaf5c8e6b5@192.168.4.13@o2ib t470 exp (null) cur 1311566655 last 0 Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(ldlm_lib.c:871:target_handle_connect()) Skipped 5 previous similar messages Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(filter.c:2846:filter_connect()) lustre-OST0004: Received MDS connection (0x2d0794f47e759c6e); group 0 Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-OST0004->NET_0x50000c0a8040d_UUID netid 50000: select flavor null Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 5 previous similar messages Maloo report: https://maloo.whamcloud.com/test_sets/eff97732-b67e-11e0-8bdf-52540025f9af Please refer to the attached recovery-oss-scale-1311567030.tar.bz2 for more syslogs and debug logs. |
| Comment by Jian Yu [ 09/Sep/11 ] |
|
Lustre Branch: master recovery-mds-scale(FLAVOR=OSS) failed with the same issue: https://maloo.whamcloud.com/test_sets/d0a83c78-da97-11e0-8d02-52540025f9af Please refer to the attached recovery-oss-scale.1315539020.log.tar.bz2 for more logs. |
| Comment by Jian Yu [ 20/Sep/11 ] |
|
Lustre Tag: v2_1_0_0_RC2 After running about 2 hours (OSS failed over 7 times), recovery-mds-scale (FLAVOR=OSS) test failed with the same issue: Please refer to the attached recovery-oss-scale.1316503823.log.tar.bz2 for more logs. |
| Comment by Jian Yu [ 13/Oct/11 ] |
|
Lustre Tag: v1_8_7_WC1_RC1 recovery-mds-scale (FLAVOR=OSS) test failed with the same issue: https://maloo.whamcloud.com/test_sets/004f464c-f550-11e0-908b-52540025f9af Please refer to the attached recovery-oss-scale.1318474116.log.tar.bz2 for more logs. |
| Comment by Jian Yu [ 22/Feb/12 ] |
|
Lustre Tag: v2_1_1_0_RC4
Configuration:
MGS/MDS Nodes: client-8-ib
OSS Nodes: client-18-ib(active), client-19-ib(active)
\ /
OST1 (active in client-18-ib)
OST2 (active in client-19-ib)
OST3 (active in client-18-ib)
OST4 (active in client-19-ib)
OST5 (active in client-18-ib)
OST6 (active in client-19-ib)
client-9-ib(OST7)
Client Nodes: client-[1,4,17],fat-amd-2,fat-intel-2
Network Addresses:
client-1-ib: 192.168.4.1
client-4-ib: 192.168.4.4
client-8-ib: 192.168.4.8
client-9-ib: 192.168.4.9
client-17-ib: 192.168.4.17
client-18-ib: 192.168.4.18
client-19-ib: 192.168.4.19
fat-amd-2-ib: 192.168.4.133
fat-intel-2-ib: 192.168.4.129
While running recovery-mds-scale with FLAVOR=OSS, it failed as follows: ==== Checking the clients loads AFTER failover -- failure NOT OK
ost1 has failed over 1 times, and counting...
sleeping 717 seconds ...
tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory
tar: Exiting with failure status due to previous errors
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-1-ib
Client load failed on node client-1-ib
client client-1-ib load stdout and debug files :
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib: tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory tar: Exiting with failure status due to previous errors /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug <~snip~> 2012-02-22 03:56:04: tar run starting + mkdir -p /mnt/lustre/d0.tar-client-1-ib + cd /mnt/lustre/d0.tar-client-1-ib + wait 11196 + do_tar + tar cf - /etc + tar xf - + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib tar: Removing leading `/' from member names + return 2 + RC=2 ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib + PREV_ERRORS= + true + '[' 2 -ne 0 -a '' -a '' ']' + '[' 2 -eq 0 ']' ++ date '+%F %H:%M:%S' + echoerr '2012-02-22 03:59:25: tar failed' + echo '2012-02-22 03:59:25: tar failed' 2012-02-22 03:59:25: tar failed <~snip~> Syslog on client node client-1-ib showed that: Feb 22 03:59:12 client-1 kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting... Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff88031d605c00 x1394513519058221/t379(379) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1 Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages Feb 22 03:59:19 client-1 kernel: Lustre: lustre-OST0004-osc-ffff88032c89a400: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib. Syslog on MDS node client-8-ib showed that: Feb 22 03:59:12 client-8-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting... Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff88030708c400 x1394513506470444/t380(380) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1 Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages Feb 22 03:59:19 client-8-ib kernel: Lustre: lustre-OST0004-osc-MDT0000: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib. Feb 22 03:59:19 client-8-ib kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans Feb 22 03:59:19 client-8-ib kernel: Lustre: 7395:0:(quota_master.c:1760:mds_quota_recovery()) Only 3/7 OSTs are active, abort quota recovery Syslog on OSS node client-19-ib showed that: Feb 22 03:59:12 client-19-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting... 
Feb 22 03:59:18 client-19-ib kernel: Lustre: 7501:0:(filter.c:2697:filter_connect_internal()) lustre-OST0004: Received MDS connection for group 0 Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy()) lustre-OST0004: can not find olg of group 0 Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy()) Skipped 22 previous similar messages Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: sending delayed replies to recovered clients Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: received MDS connection from 192.168.4.8@o2ib Feb 22 03:59:19 client-19-ib kernel: Lustre: 7530:0:(filter.c:2553:filter_llog_connect()) lustre-OST0004: Recovery from log 0xff506/0x0:8f36a744 Please refer to /scratch/logs/2.1.1/recovery-oss-scale.1329912676.log.tar.bz2 on brent node for debug and other logs. |
| Comment by Jian Yu [ 29/Mar/12 ] |
|
Lustre Tag: v2_2_0_0_RC2 The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/b6eb20c8-799f-11e1-9d2a-5254004bbbd3 |
| Comment by Jian Yu [ 15/May/12 ] |
|
Lustre Tag: v1_8_8_WC1_RC1 The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/be9c60e0-9e82-11e1-9080-52540035b04c |
| Comment by Jian Yu [ 01/Jun/12 ] |
|
Lustre Tag: v2_1_2_RC2 The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/c9193e08-abca-11e1-9b8f-52540035b04c |
| Comment by Jian Yu [ 14/Aug/12 ] |
|
Lustre Tag: v2_1_3_RC1 The issue occurred again while running recovery-mds-scale failover_ost test: |
| Comment by Jian Yu [ 21/Aug/12 ] |
|
Another instance: |
| Comment by Peter Jones [ 05/Dec/12 ] |
|
Hongchao Could you please look into this one? Thanks Peter |
| Comment by Hongchao Zhang [ 06/Dec/12 ] |
|
How about fixing the bug by waiting for some time when -2 (ENOENT) is encountered on an OST that is still in recovery mode (see the sketch below)? |
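A minimal sketch of that idea (not the actual patch), assuming a simplified OST write-prep path: ost_object_lookup() is a hypothetical helper standing in for the filter/ofd object lookup, and the wait condition uses the obd_recovering flag and obd_next_transno_waitq wait queue as found in Lustre trees of this vintage, so treat the details as illustrative.

static int ost_object_lookup(struct obd_device *obd, __u64 objid); /* hypothetical helper */

static int ost_lookup_or_wait(struct obd_device *obd, __u64 objid)
{
        int rc;

        rc = ost_object_lookup(obd, objid);
        if (rc == -ENOENT && obd->obd_recovering) {
                /* The object may still be re-created by replayed requests or
                 * by the MDS orphan handling, so wait until recovery finishes
                 * and retry instead of failing the write immediately. */
                wait_event(obd->obd_next_transno_waitq, !obd->obd_recovering);
                rc = ost_object_lookup(obd, objid);
        }
        return rc;
}

The obvious cost is that a write can stall for the remainder of the recovery window; the 18/Jan/13 comment below describes the updated approach of re-creating the missing object instead.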
| Comment by Jian Yu [ 12/Dec/12 ] |
|
This has been blocking the recovery-mds-scale failover_ost test. |
| Comment by Hongchao Zhang [ 18/Dec/12 ] |
|
The patch against b2_1 is being created and tested. |
| Comment by Hongchao Zhang [ 19/Dec/12 ] |
|
the patch is tracked at http://review.whamcloud.com/#change,4868 |
| Comment by Hongchao Zhang [ 18/Jan/13 ] |
|
The patch has been updated: since precreation RPCs are not replayable, it will now try to create the missing object if it was not created during recovery (see the sketch below). |
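A minimal sketch of that approach (all helper names are hypothetical, not the code in the actual change): if a write hits a missing object whose id falls at or below the last id the OST precreated, the id was already handed out by the MDS and the object is expected to exist, so it is re-created rather than failing the request with -ENOENT.

static int ost_object_lookup(struct obd_device *obd, __u64 objid);     /* hypothetical */
static int ost_object_recreate(struct obd_device *obd, __u64 objid);   /* hypothetical */
static __u64 ost_last_precreated_id(struct obd_device *obd);           /* hypothetical */

static int ost_write_find_object(struct obd_device *obd, __u64 objid)
{
        int rc;

        rc = ost_object_lookup(obd, objid);
        if (rc == -ENOENT && objid <= ost_last_precreated_id(obd)) {
                /* OST_CREATE (precreation) RPCs are not replayed, so an id the
                 * MDS already handed out may be missing after failover;
                 * re-create the object instead of failing the write. */
                rc = ost_object_recreate(obd, objid);
        }
        return rc;
}

Guarding on the last precreated id keeps the OST from materialising arbitrary object ids that were never allocated by the MDS.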
| Comment by Jian Yu [ 24/Jan/13 ] |
|
Lustre Branch: b1_8 The same issue occurred: https://maloo.whamcloud.com/test_sets/9734baba-661f-11e2-a42b-52540035b04c |
| Comment by Jian Yu [ 18/Feb/13 ] |
|
Lustre Tag: v1_8_9_WC1_RC1 The same issue occurred: https://maloo.whamcloud.com/test_sets/d7e82752-79db-11e2-8fd2-52540035b04c |
| Comment by Jian Yu [ 20/Feb/13 ] |
|
Lustre Tag: v1_8_9_WC1_RC2 The same issue occurred: https://maloo.whamcloud.com/test_sets/79cd620e-7af3-11e2-b916-52540035b04c |
| Comment by Jian Yu [ 19/Mar/13 ] |
|
Lustre Branch: b2_1
The same issue occurred: https://maloo.whamcloud.com/test_sets/278684ba-902b-11e2-9b28-52540035b04c
Dmesg on OSS node showed that:
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
cannot allocate a tage (271)
cannot allocate a tage (271)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost6 has failed over 5 times, and counting...
Lustre: DEBUG MARKER: ost6 has failed over 5 times, and counting...
Lustre: lustre-OST0006: sending delayed replies to recovered clients
LustreError: 3691:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 14662: rc -2
Lustre: lustre-OST0006: received MDS connection from 10.10.4.190@tcp
__ratelimit: 12 callbacks suppressed
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
cannot allocate a tage (402)
__ratelimit: 8 callbacks suppressed |
| Comment by Jian Yu [ 06/Jun/13 ] |
|
Lustre Tag: v2_1_6_RC1 The issue occurred again while running recovery-mds-scale failover_ost test: |
| Comment by Hongchao Zhang [ 18/Jun/13 ] |
|
The patch has been updated: http://review.hpdd.intel.com/#/c/4868/ |
| Comment by John Fuchs-Chesney (Inactive) [ 07/Jan/16 ] |
|
Incomplete and out of date. |
| Comment by Andreas Dilger [ 03/Apr/16 ] |
|
I noticed that the patch http://review.whamcloud.com/4868 (against b2_1) was never landed to any branch. Was this bug fixed in some other way (in which case it can be closed again) or are we just ignoring these test failures now? |
| Comment by Jian Yu [ 05/Apr/16 ] |
|
Hi Andreas, The failure is still blocking recovery-mds-scale failover_ost testing on all of the Lustre branches. |
| Comment by Hongchao Zhang [ 05/Apr/16 ] |
|
Yes, this should be a duplicate of |