[LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6
    • Labels: None
    • Severity: 3
    • Bugzilla ID: 22,777
    • Rank (Obsolete): 5680

    Description

      While running recovery-mds-scale with FLAVOR=OSS, the test failed as follows after running for 3 hours:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost5 has failed over 5 times, and counting...
      sleeping 246 seconds ... 
      tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-21-ib
      Client load failed on node client-21-ib
      
      client client-21-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
      2011-06-26 08:08:03 Terminating clients loads ...
      Duration:                86400
      Server failover period: 600 seconds
      Exited after:           13565 seconds
      Number of failovers before exit:
      mds: 0 times
      ost1: 2 times
      ost2: 6 times
      ost3: 3 times
      ost4: 4 times
      ost5: 5 times
      ost6: 3 times
      Status: FAIL: rc=1
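
      For context, a run like the one summarized above is typically started from the lustre/tests directory of the build under test, roughly as sketched below. FLAVOR=OSS, the 86400-second duration and the 600-second failover period come from the summary; the exact script invocation and the DURATION/SERVER_FAILOVER_PERIOD variable names are assumptions and may differ between branches.

      # Hedged sketch of the test invocation -- not a verbatim reproduction recipe.
      # FLAVOR=OSS selects OST failover; DURATION and SERVER_FAILOVER_PERIOD are assumed
      # names matching the "Duration" and "Server failover period" summary fields above.
      cd /usr/lib64/lustre/tests    # assumed install location of the test suite
      FLAVOR=OSS DURATION=86400 SERVER_FAILOVER_PERIOD=600 bash recovery-mds-scale.sh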
      

      Syslog on client node client-21-ib showed that:

      Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the MDS node client-10-ib showed that:

      Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the OSS node fat-amd-1-ib showed that:

      Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messages
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clients
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af

      Please find the debug logs in the attachment.

      This is a known issue: bug 22777
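
      As a note on the error path: the "status -2" in the ptlrpc_replay_interpret() messages above is the negative errno for ENOENT, which is what the tar load later reports as "Cannot stat/write: No such file or directory". A quick, Lustre-independent check of that errno mapping from a shell:

      # RPC status values in the kernel logs are negative errnos; -2 is -ENOENT
      python3 -c 'import errno, os; print(errno.errorcode[2], "->", os.strerror(2))'
      # prints: ENOENT -> No such file or directory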

      Attachments

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Hongchao

            Could you please look into this one?

            Thanks

            Peter

            yujian Jian Yu added a comment -

            Another instance: https://maloo.whamcloud.com/test_sets/f99459d2-eb26-11e1-b137-52540035b04c
            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_3_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/113/
            Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.2.1.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The issue occurred again while running recovery-mds-scale failover_ost test:
            https://maloo.whamcloud.com/test_sets/b18a1330-e5ad-11e1-ae4e-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_2_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/86/
            Distro/Arch: RHEL6.2/x86_64
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/c9193e08-abca-11e1-9b8f-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v1_8_8_WC1_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/195/
            Distro/Arch: RHEL5.8/x86_64(server), RHEL6.2/x86_64(client)
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/be9c60e0-9e82-11e1-9080-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_2_0_0_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_2/17/
            Distro/Arch: SLES11SP1/x86_64(client), RHEL6.2/x86_64(server)
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred while failing over OST: https://maloo.whamcloud.com/test_sets/b6eb20c8-799f-11e1-9d2a-5254004bbbd3

            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_1_0_RC4
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/44/
            e2fsprogs Build: http://build.whamcloud.com/job/e2fsprogs-master/217/
            Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-220.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            Configuration:

            MGS/MDS Nodes: client-8-ib
            
            OSS Nodes: client-18-ib(active), client-19-ib(active)
                                          \ /
                                          OST1 (active in client-18-ib)
                                          OST2 (active in client-19-ib)
                                          OST3 (active in client-18-ib)
                                          OST4 (active in client-19-ib)
                                          OST5 (active in client-18-ib)
                                          OST6 (active in client-19-ib)
                       client-9-ib(OST7)
            
            Client Nodes: client-[1,4,17],fat-amd-2,fat-intel-2
            
            Network Addresses:
            client-1-ib: 192.168.4.1
            client-4-ib: 192.168.4.4
            client-8-ib: 192.168.4.8
            client-9-ib: 192.168.4.9
            client-17-ib: 192.168.4.17
            client-18-ib: 192.168.4.18
            client-19-ib: 192.168.4.19
            fat-amd-2-ib: 192.168.4.133
            fat-intel-2-ib: 192.168.4.129
            

            While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

            ==== Checking the clients loads AFTER  failover -- failure NOT OK
            ost1 has failed over 1 times, and counting...
            sleeping 717 seconds ...
            tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
            client-1-ib
            Client load failed on node client-1-ib
            
            client client-1-ib load stdout and debug files :
                          /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
                          /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:

            tar: etc/selinux/targeted/contexts/users/root: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug

            <~snip~>
            2012-02-22 03:56:04: tar run starting
            + mkdir -p /mnt/lustre/d0.tar-client-1-ib
            + cd /mnt/lustre/d0.tar-client-1-ib
            + wait 11196
            + do_tar
            + tar cf - /etc
            + tar xf -
            + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
            tar: Removing leading `/' from member names
            + return 2
            + RC=2
            ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
            + PREV_ERRORS=
            + true
            + '[' 2 -ne 0 -a '' -a '' ']'
            + '[' 2 -eq 0 ']'
            ++ date '+%F %H:%M:%S'
            + echoerr '2012-02-22 03:59:25: tar failed'
            + echo '2012-02-22 03:59:25: tar failed'
            2012-02-22 03:59:25: tar failed
            <~snip~>
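
            The trace above implies a client load loop roughly like the sketch below. This is not the actual run_tar.sh from lustre/tests, only a reconstruction from the trace; LOG and echoerr stand in for the script's own helpers, and the loop/exit handling is an assumption.

            LOG=/tmp/recovery-mds-scale.log_run_tar.sh-$(hostname)
            echoerr() { echo "$@" >&2; }

            do_tar() {
                # stream a copy of /etc into the Lustre test directory, logging tar's output
                tar cf - /etc | tar xf - 2>&1 | tee "$LOG"
                return ${PIPESTATUS[1]}    # exit status of the extracting tar
            }

            mkdir -p /mnt/lustre/d0.tar-$(hostname)
            cd /mnt/lustre/d0.tar-$(hostname) || exit 1
            while true; do
                do_tar
                RC=$?
                PREV_ERRORS=$(grep 'exit delayed from previous errors' "$LOG")
                # "exit delayed from previous errors" alone is tolerated; any other
                # non-zero rc (here rc=2 on the ENOENT) ends the load, after which the
                # harness notices the failure via END_RUN_FILE and aborts the whole run
                if [ "$RC" -ne 0 ] && [ -z "$PREV_ERRORS" ]; then
                    echoerr "$(date '+%F %H:%M:%S'): tar failed"
                    break
                fi
            done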
            

            Syslog on client node client-1-ib showed that:

            Feb 22 03:59:12 client-1 kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
            Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031d605c00 x1394513519058221/t379(379) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Feb 22 03:59:19 client-1 kernel: LustreError: 10064:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages
            Feb 22 03:59:19 client-1 kernel: Lustre: lustre-OST0004-osc-ffff88032c89a400: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib.
            

            Syslog on MDS node client-8-ib showed that:

            Feb 22 03:59:12 client-8-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
            Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88030708c400 x1394513506470444/t380(380) o-1->lustre-OST0004_UUID@192.168.4.19@o2ib:28/4 lens 408/400 e 0 to 0 dl 1329912005 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Feb 22 03:59:19 client-8-ib kernel: LustreError: 5628:0:(client.c:2590:ptlrpc_replay_interpret()) Skipped 4 previous similar messages
            Feb 22 03:59:19 client-8-ib kernel: Lustre: lustre-OST0004-osc-MDT0000: Connection restored to service lustre-OST0004 using nid 192.168.4.19@o2ib.
            Feb 22 03:59:19 client-8-ib kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans
            Feb 22 03:59:19 client-8-ib kernel: Lustre: 7395:0:(quota_master.c:1760:mds_quota_recovery()) Only 3/7 OSTs are active, abort quota recovery
            

            Syslog on OSS node client-19-ib showed that:

            Feb 22 03:59:12 client-19-ib kernel: Lustre: DEBUG MARKER: ost1 has failed over 1 times, and counting...
            Feb 22 03:59:18 client-19-ib kernel: Lustre: 7501:0:(filter.c:2697:filter_connect_internal()) lustre-OST0004: Received MDS connection for group 0
            Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy())  lustre-OST0004: can not find olg of group 0
            Feb 22 03:59:18 client-19-ib kernel: LustreError: 9874:0:(filter.c:4141:filter_destroy()) Skipped 22 previous similar messages
            Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: sending delayed replies to recovered clients
            Feb 22 03:59:19 client-19-ib kernel: Lustre: lustre-OST0004: received MDS connection from 192.168.4.8@o2ib
            Feb 22 03:59:19 client-19-ib kernel: Lustre: 7530:0:(filter.c:2553:filter_llog_connect()) lustre-OST0004: Recovery from log 0xff506/0x0:8f36a744
            

            Please refer to /scratch/logs/2.1.1/recovery-oss-scale.1329912676.log.tar.bz2 on brent node for debug and other logs.

            yujian Jian Yu added a comment -

            Lustre Tag: v1_8_7_WC1_RC1
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-b1_8/142/
            e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/65/
            Distro/Arch: RHEL5/x86_64(server, OFED 1.5.3.2, ext4-based ldiskfs), RHEL6/x86_64(client, in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            recovery-mds-scale (FLAVOR=OSS) test failed with the same issue: https://maloo.whamcloud.com/test_sets/004f464c-f550-11e0-908b-52540025f9af

            Please refer to the attached recovery-oss-scale.1318474116.log.tar.bz2 for more logs.

            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_0_0_RC2
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/283/
            e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/54/
            Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            After running for about 2 hours (the OSS had failed over 7 times), the recovery-mds-scale (FLAVOR=OSS) test failed with the same issue:
            https://maloo.whamcloud.com/test_sets/50e83640-e362-11e0-9909-52540025f9af

            Please refer to the attached recovery-oss-scale.1316503823.log.tar.bz2 for more logs.

            yujian Jian Yu added a comment -

            Lustre Branch: master
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/276/
            e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/54/
            Distro/Arch: RHEL5/x86_64(in-kernel OFED, kernel version: 2.6.18-238.19.1.el5)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            recovery-mds-scale (FLAVOR=OSS) failed with the same issue: https://maloo.whamcloud.com/test_sets/d0a83c78-da97-11e0-8d02-52540025f9af

            Please refer to the attached recovery-oss-scale.1315539020.log.tar.bz2 for more logs.

            yujian Jian Yu added a comment -

            Lustre Tag: v2_0_65_0
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/204/
            e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
            Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.2.1.el6)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD
            FLAVOR=OSS

            Lustre cluster configuration:

            MGS/MDS Node: client-7-ib
            OSS Nodes:    fat-amd-1-ib(active), fat-amd-2-ib(active)
                                             \  /
                                             OST1 (active in fat-amd-1-ib)
                                             OST2 (active in fat-amd-2-ib)
                                             OST3 (active in fat-amd-1-ib)
                                             OST4 (active in fat-amd-2-ib)
                                             OST5 (active in fat-amd-1-ib)
                                             OST6 (active in fat-amd-2-ib)
                          client-8-ib (OST7)
            Client Nodes: fat-amd-3-ib, client-[9,11,12,13]-ib
            

            While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

            ==== Checking the clients loads AFTER  failover -- failure NOT OK
            ost5 has failed over 1 times, and counting...
            sleeping 417 seconds ... 
            tar: etc/selinux/targeted/modules/active/modules/postgrey.pp: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
            client-13-ib
            client-12-ib
            Client load failed on node client-13-ib
            
            client client-13-ib load stdout and debug files :
                          /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib
                          /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib.debug
            

            /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib:

            copying /usr/share/dbench/client.txt to /mnt/lustre/d0.dbench-client-13-ib/client.txt
            running 'dbench 2' on /mnt/lustre/d0.dbench-client-13-ib at Sun Jul 24 21:00:21 PDT 2011
            dbench PID=29959
            dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
            
            Running for 600 seconds with load 'client.txt' and minimum warmup 120 secs
            0 of 2 processes prepared for launch   0 sec
            2 of 2 processes prepared for launch   0 sec
            releasing clients
               2       666    42.32 MB/sec  warmup   1 sec  latency 354.182 ms
               2      1436    26.02 MB/sec  warmup   2 sec  latency 386.166 ms
            <~snip~>
               2     18452     0.00 MB/sec  execute  67 sec  latency 171356.589 ms
               2     18452     0.00 MB/sec  execute  68 sec  latency 172356.684 ms
            [18468] write failed on handle 13839 (Cannot send after transport endpoint shutdown)
            Child failed with status 1
            

            /tmp/recovery-mds-scale.log_run_dbench.sh-client-13-ib.debug:

            2011-07-24 21:00:21: dbench run starting
            + mkdir -p /mnt/lustre/d0.dbench-client-13-ib
            + load_pid=29927
            + wait 29927
            + rundbench -D /mnt/lustre/d0.dbench-client-13-ib 2
            touch: missing file operand
            Try `touch --help' for more information.
            + '[' 1 -eq 0 ']'
            ++ date '+%F %H:%M:%S'
            + echoerr '2011-07-24 21:03:30: dbench failed'
            + echo '2011-07-24 21:03:30: dbench failed'
            2011-07-24 21:03:30: dbench failed
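
            The dbench load in this trace follows the same pattern; a rough sketch (rundbench and its -D argument are taken from the trace, the surrounding loop is an assumption):

            TESTDIR=/mnt/lustre/d0.dbench-$(hostname)
            mkdir -p "$TESTDIR"
            while true; do
                # two dbench client processes against the Lustre test directory, as in the trace
                rundbench -D "$TESTDIR" 2 &
                load_pid=$!
                if ! wait "$load_pid"; then
                    echo "$(date '+%F %H:%M:%S'): dbench failed" >&2
                    break
                fi
            done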
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib:

            tar: etc/selinux/targeted/modules/active/modules/postgrey.pp: Cannot write: No such file or directory
            tar: Exiting with failure status due to previous errors
            

            /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib.debug:

            2011-07-24 21:00:20: tar run starting
            + mkdir -p /mnt/lustre/d0.tar-client-12-ib
            + cd /mnt/lustre/d0.tar-client-12-ib
            + wait 29934
            + do_tar
            + tar cf - /etc
            + tar xf -
            + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib
            tar: Removing leading `/' from member names
            + return 2
            + RC=2
            ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-12-ib
            + PREV_ERRORS=
            + true
            + '[' 2 -ne 0 -a '' -a '' ']'
            + '[' 2 -eq 0 ']'
            ++ date '+%F %H:%M:%S'
            + echoerr '2011-07-24 21:03:32: tar failed'
            + echo '2011-07-24 21:03:32: tar failed'
            2011-07-24 21:03:32: tar failed
            

            Syslog on client node client-13-ib showed that:

            Jul 24 21:03:27 client-13 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
            Jul 24 21:03:29 client-13 kernel: Lustre: 29024:0:(client.c:2527:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
            Jul 24 21:03:29 client-13 kernel:  req@ffff8802fcbb5000 x1375277041099764/t470(470) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:6/4 lens 512/400 e 1 to 0 dl 1311566653 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
            Jul 24 21:03:30 client-13 kernel: Lustre: 29024:0:(import.c:1190:completed_replay_interpret()) lustre-OST0002-osc-ffff88030ec62400: version recovery fails, reconnecting
            Jul 24 21:03:30 client-13 kernel: LustreError: 167-0: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
            Jul 24 21:03:30 client-13 kernel: LustreError: 29023:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8802fcbdf400 x1375277041107454/t0(0) o-1->lustre-OST0002_UUID@192.168.4.133@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
            Jul 24 21:03:30 client-13 kernel: LustreError: 29021:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8803071b3800 x1375277041107455/t0(0) o-1->lustre-OST0002_UUID@192.168.4.133@o2ib:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
            Jul 24 21:03:30 client-13 kernel: Lustre: lustre-OST0002-osc-ffff88030ec62400: Connection restored to service lustre-OST0002 using nid 192.168.4.133@o2ib.
            

            Syslog on client node client-12-ib showed that:

            Jul 24 21:03:27 client-12 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
            Jul 24 21:03:29 client-12 kernel: LustreError: 29051:0:(client.c:2570:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff880302aa3800 x1375277041064805/t507(507) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:28/4 lens 408/400 e 0 to 0 dl 1311566655 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Jul 24 21:03:29 client-12 kernel: LustreError: 29051:0:(client.c:2570:ptlrpc_replay_interpret()) Skipped 10 previous similar messages
            Jul 24 21:03:29 client-12 kernel: Lustre: lustre-OST0004-osc-ffff88030b964000: Connection restored to service lustre-OST0004 using nid 192.168.4.133@o2ib.
            

            Syslog on MDS node client-7-ib showed that:

            Jul 24 21:03:27 client-7 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
            Jul 24 21:03:29 client-7 kernel: LustreError: 29520:0:(client.c:2570:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff8802a519e000 x1375277024283756/t508(508) o-1->lustre-OST0004_UUID@192.168.4.133@o2ib:28/4 lens 408/400 e 0 to 0 dl 1311566655 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Jul 24 21:03:29 client-7 kernel: LustreError: 29520:0:(client.c:2570:ptlrpc_replay_interpret()) Skipped 10 previous similar messages
            Jul 24 21:03:29 client-7 kernel: Lustre: lustre-OST0004-osc-MDT0000: Connection restored to service lustre-OST0004 using nid 192.168.4.133@o2ib.
            Jul 24 21:03:29 client-7 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0004_UUID now active, resetting orphans
            Jul 24 21:03:29 client-7 kernel: Lustre: 31049:0:(quota_master.c:1760:mds_quota_recovery()) Only 6/7 OSTs are active, abort quota recovery
            

            Syslog on OSS node fat-amd-2-ib showed that:

            Jul 24 21:03:27 fat-amd-2 kernel: Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:4111:filter_destroy())  lustre-OST0004: can not find olg of group 0
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:4111:filter_destroy()) Skipped 9 previous similar messages
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(genops.c:1267:class_disconnect_stale_exports()) lustre-OST0004: disconnect stale client de6f48d2-f9b1-c66d-ae70-0bfaf5c8e6b5@192.168.4.13@o2ib
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(filter.c:2927:filter_grant_sanity_check()) filter_disconnect: tot_granted 5345280 != fo_tot_granted 11374592
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(ldlm_resource.c:1084:ldlm_resource_get()) lvbo_init failed for resource 326: rc -2
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 32133:0:(ldlm_resource.c:1084:ldlm_resource_get()) Skipped 10 previous similar messages
            Jul 24 21:03:29 fat-amd-2 kernel: Lustre: lustre-OST0004: sending delayed replies to recovered clients
            Jul 24 21:03:29 fat-amd-2 kernel: Lustre: lustre-OST0004: received MDS connection from 192.168.4.7@o2ib
            Jul 24 21:03:29 fat-amd-2 kernel: Lustre: 30378:0:(filter.c:2550:filter_llog_connect()) lustre-OST0004: Recovery from log 0xff506/0x0:5a402c04
            Jul 24 21:03:29 fat-amd-2 kernel: LustreError: 30513:0:(filter_io.c:723:filter_preprw_write()) lustre-OST0004: BRW to missing obj 342/0:rc -2
            Jul 24 21:03:30 fat-amd-2 kernel: Lustre: 30368:0:(filter.c:2846:filter_connect()) lustre-OST0002: Received MDS connection (0x2d0794f47e7586a3); group 0
            Jul 24 21:03:30 fat-amd-2 kernel: Lustre: 30368:0:(filter.c:2846:filter_connect()) Skipped 9 previous similar messages
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(ldlm_lib.c:871:target_handle_connect()) lustre-OST0004: connection from de6f48d2-f9b1-c66d-ae70-0bfaf5c8e6b5@192.168.4.13@o2ib t470 exp (null) cur 1311566655 last 0
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(ldlm_lib.c:871:target_handle_connect()) Skipped 5 previous similar messages
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(filter.c:2846:filter_connect()) lustre-OST0004: Received MDS connection (0x2d0794f47e759c6e); group 0
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-OST0004->NET_0x50000c0a8040d_UUID netid 50000: select flavor null
            Jul 24 21:04:15 fat-amd-2 kernel: Lustre: 30360:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 5 previous similar messages
            

            Maloo report: https://maloo.whamcloud.com/test_sets/eff97732-b67e-11e0-8bdf-52540025f9af

            Please refer to the attached recovery-oss-scale-1311567030.tar.bz2 for more syslogs and debug logs.


            People

              Assignee: Hongchao Zhang (hongchao.zhang)
              Reporter: Jian Yu (yujian)
              Votes: 0
              Watchers: 15
