Lustre / LU-463

orphan recovery happens too late, causing writes to fail with ENOENT after recovery

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8, Lustre 1.8.6, Lustre 1.8.9, Lustre 2.1.6
    • None
    • 3
    • 22,777
    • 5680

    Description

      While running recovery-mds-scale with FLAVOR=OSS, the test failed as follows after running for about 3 hours:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost5 has failed over 5 times, and counting...
      sleeping 246 seconds ... 
      tar: etc/rc.d/rc6.d/K88rsyslog: Cannot stat: No such file or directory
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-21-ib
      Client load failed on node client-21-ib
      
      client client-21-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-21-ib.debug
      2011-06-26 08:08:03 Terminating clients loads ...
      Duration:                86400
      Server failover period: 600 seconds
      Exited after:           13565 seconds
      Number of failovers before exit:
      mds: 0 times
      ost1: 2 times
      ost2: 6 times
      ost3: 3 times
      ost4: 4 times
      ost5: 5 times
      ost6: 3 times
      Status: FAIL: rc=1
      

      Syslog on the client node client-21-ib showed:

      Jun 26 08:03:55 client-21 kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:20 client-21 kernel: LustreError: 18613:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88031daf6c00 x1372677268199869/t98784270264 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100718 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the MDS node client-10-ib showed:

      Jun 26 08:03:57 client-10-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:22 client-10-ib kernel: LustreError: 17651:0:(client.c:2347:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff810320674400 x1372677249608261/t98784270265 o2->lustre-OST0005_UUID@192.168.4.132@o2ib:28/4 lens 400/592 e 0 to 1 dl 1309100720 ref 2 fl Interpret:R/4/0 rc -2/-2
      

      Syslog on the OSS node fat-amd-1-ib showed:

      Jun 26 08:03:57 fat-amd-1-ib kernel: Lustre: DEBUG MARKER: ost5 has failed over 5 times, and counting...
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) lustre-OST0005: 5 recoverable clients remain
      Jun 26 08:04:21 fat-amd-1-ib kernel: Lustre: 6278:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 2 previous similar messages
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lustre-OST0005_UUID: lvbo_init failed for resource 161916: rc -2
      Jun 26 08:04:21 fat-amd-1-ib kernel: LustreError: 6336:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 18 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
      Jun 26 08:04:25 fat-amd-1-ib kernel: LustreError: 7708:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 8 previous similar messages
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: Recovery period over after 0:05, of 6 clients 6 recovered and 0 were evicted.
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: sending delayed replies to recovered clients
      Jun 26 08:04:25 fat-amd-1-ib kernel: Lustre: lustre-OST0005: received MDS connection from 192.168.4.10@o2ib
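
      The replay errors above and in the client/MDS syslogs complete with rc -2, which on Linux is -ENOENT ("No such file or directory"), the same error the tar client load hit. The one-liner below is only an illustrative way to decode the errno; it is not part of the test suite:

      # Decode errno 2 on any Linux node (illustrative only)
      python -c 'import os; print(os.strerror(2))'
      # -> No such file or directory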
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f1c2fd72-a067-11e0-aee5-52540025f9af

      Please find the debug logs in the attachment.

      This is a known issue: bug 22777
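
      For reference, the scenario above is normally driven straight from lustre/tests. The sketch below is only an assumption pieced together from the parameters visible in this ticket (FLAVOR=OSS, the 86400-second duration and 600-second failover period in the summary, and the ENABLE_QUOTA/FAILURE_MODE settings quoted in the comments); the DURATION and SERVER_FAILOVER_PERIOD variable names are inferred from the printed summary, and the exact invocation may differ between branches:

      # Hedged sketch of a recovery-mds-scale failover_ost run; variable names
      # not quoted elsewhere in this ticket are assumptions.
      cd lustre/tests
      FLAVOR=OSS DURATION=86400 SERVER_FAILOVER_PERIOD=600 \
      ENABLE_QUOTA=yes FAILURE_MODE=HARD \
      sh recovery-mds-scale.sh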

    Attachments

    Issue Links

    Activity

            [LU-463] orphan recovery happens too late, causing writes to fail with ENOENT after recovery
            hongchao.zhang Hongchao Zhang added a comment - - edited

            Yes, this should be a duplicate of LU-6200, which contains a patch to fix the problem in a similar way.

            yujian Jian Yu added a comment -

            Hi Andreas,

            The failure is still blocking recovery-mds-scale failover_ost testing on all of the Lustre branches. LU-4621 and LU-6200 were created later to track the failures.


            adilger Andreas Dilger added a comment -

            I noticed that the patch http://review.whamcloud.com/4868 (against b2_1) was never landed to any branch. Was this bug fixed in some other way (in which case it can be closed again) or are we just ignoring these test failures now?

            jfc John Fuchs-Chesney (Inactive) added a comment -

            Incomplete and out of date.
            ~ jfc.
            hongchao.zhang Hongchao Zhang added a comment -

            The patch is updated: http://review.hpdd.intel.com/#/c/4868/
            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_6_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/208
            Distro/Arch: RHEL6.4/x86_64
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The issue occurred again while running recovery-mds-scale failover_ost test:
            https://maloo.whamcloud.com/test_sets/bca04cd4-cdf2-11e2-ba28-52540035b04c

            yujian Jian Yu added a comment - - edited

            Lustre Branch: b2_1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/189/

            The same issue occurred: https://maloo.whamcloud.com/test_sets/278684ba-902b-11e2-9b28-52540035b04c

            Dmesg on the OSS node showed:

            Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
            cannot allocate a tage (271)
            cannot allocate a tage (271)
            Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost6 has failed over 5 times, and counting...
            Lustre: DEBUG MARKER: ost6 has failed over 5 times, and counting...
            Lustre: lustre-OST0006: sending delayed replies to recovered clients
            LustreError: 3691:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 14662: rc -2
            Lustre: lustre-OST0006: received MDS connection from 10.10.4.190@tcp
            __ratelimit: 12 callbacks suppressed
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            cannot allocate a tage (402)
            __ratelimit: 8 callbacks suppressed
            
            yujian Jian Yu added a comment - - edited

            Lustre Tag: v1_8_9_WC1_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/258
            Distro/Arch: RHEL5.9/x86_64(server), RHEL6.3/x86_64(client)
            Network: TCP (1GigE)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred: https://maloo.whamcloud.com/test_sets/79cd620e-7af3-11e2-b916-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v1_8_9_WC1_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
            Distro/Arch: RHEL5.9/x86_64
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The same issue occurred: https://maloo.whamcloud.com/test_sets/d7e82752-79db-11e2-8fd2-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Branch: b1_8
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/251/

            The same issue occurred: https://maloo.whamcloud.com/test_sets/9734baba-661f-11e2-a42b-52540035b04c

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: