Details
- Type: Bug
- Resolution: Duplicate
- Priority: Blocker
- Fix Version/s: None
- Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6
- Labels: None
- Environment:
Lustre Branch: v1_8_6_RC3
Lustre Build: http://newbuild.whamcloud.com/job/lustre-b1_8/90/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
Distro/Arch: RHEL6/x86_64(patchless client, in-kernel OFED, kernel version: 2.6.32-131.2.1.el6)
RHEL5/x86_64(server, OFED 1.5.3.1, kernel version: 2.6.18-238.12.1.el5_lustre)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
MGS/MDS Nodes: client-10-ib(active), client-12-ib(passive)
\ /
1 combined MGS/MDT
OSS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(active)
\ /
OST1 (active in fat-amd-1-ib)
OST2 (active in fat-amd-2-ib)
OST3 (active in fat-amd-1-ib)
OST4 (active in fat-amd-2-ib)
OST5 (active in fat-amd-1-ib)
OST6 (active in fat-amd-2-ib)
Client Nodes: fat-amd-3-ib, client-[6,7,16,21,24]-ib
- Severity: 3
- Bugzilla ID: 22777
- Rank: 5680
Description
While running recovery-mds-scale with FLAVOR=OSS, the test failed as follows after running for 3 hours:
==== Checking the clients loads AFTER failover -- failure NOT OK
ost5 has failed over 5 times, and counting...
sleeping 246 seconds ...
tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-21-ib
Client load failed on node client-21-ib
client client-21-ib load stdout and debug files :
  /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
  /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
2011-06-26 08:08:03 Terminating clients loads ...
Duration: 86400
Server failover period: 600 seconds
Exited after: 13565 seconds
Number of failovers before exit:
mds: 0 times
ost1: 2 times
ost2: 6 times
ost3: 3 times
ost4: 4 times
ost5: 5 times
ost6: 3 times
Status: FAIL: rc=1
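For context, recovery-mds-scale in the Lustre test framework is driven by environment variables; the parameters reported above (FLAVOR=OSS, Duration: 86400, Server failover period: 600, plus ENABLE_QUOTA and FAILURE_MODE from the environment section) correspond to an invocation roughly like the sketch below. The install path is an assumption and varies by site; this is illustrative, not the exact command from this run:

```shell
# Hypothetical invocation sketch of the OSS-failover flavor of
# recovery-mds-scale. The tests directory path is an assumption;
# the variable names match the values reported in this ticket.
cd /usr/lib64/lustre/tests   # assumed test-framework location

ENABLE_QUOTA=yes \
FAILURE_MODE=HARD \
FLAVOR=OSS \
DURATION=86400 \
SERVER_FAILOVER_PERIOD=600 \
sh recovery-mds-scale.sh
```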
Syslog on the client node client-21-ib showed:
Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2
Syslog on the MDS node client-10-ib showed:
Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2
Syslog on the OSS node fat-amd-1-ib showed:
Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messages
Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messages
Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messages
Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clients
Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib
Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af
Please find the debug logs in the attachment.
This is a known issue: Bugzilla bug 22777.
Attachments
Issue Links
- duplicates:
  - LU-6200 Failover recovery-mds-scale test_failover_ost: test_failover_ost returned 1 (Resolved)
Trackbacks
- Lustre 1.8.6-wc1 release testing tracker: Lustre 1.8.6-wc1 RC1, Tag: v186RC1, Created Date: 2011-06-10. RC1 was DOA due to a build failure related to the tag name (LU-408).
- Lustre 1.8.7-wc1 release testing tracker: Lustre 1.8.7-wc1 RC1, Tag: v187WC1RC1, Build:
- Lustre 1.8.8-wc1 release testing tracker: Lustre 1.8.8-wc1 RC1, Tag: v188WC1RC1, Build:
- Lustre 1.8.x known issues tracker: While testing against the Lustre b1_8 branch, we would hit known bugs which were already reported in Lustre Bugzilla (https://bugzilla.lustre.org/). In order to move away from relying on Bugzilla, we would create a JIRA
- Lustre 2.1.0 release testing tracker: Lustre 2.1.0 RC0, Tag: v2100RC0, Build:
- Lustre 2.1.1 release testing tracker: Lustre 2.1.1 RC4, Tag: v2110RC4, Build:
- Lustre 2.1.2 release testing tracker: Lustre 2.1.2 RC2, Tag: v212RC2, Build:
- Lustre 2.1.3 release testing tracker: Lustre 2.1.3 RC1, Tag: v213RC1, Build: