Lustre / LU-2193

lvbo_init failed for resource XXX: rc -2, after recovery

Details


    Description

We are seeing a large number (>30,000) of the following errors after recovery on our OSTs. This is only observed when the OSTs are under load at the time they are restarted. I believe this is the same as issue LU-463, but I thought I'd open it as a separate issue until someone else can confirm. If these errors actually do percolate all the way back to the client and are returned to the application, then this is a serious issue for us.

      Lustre: Lustre: Build Version: orion-2_2_49_57_1-44chaos-44chaos--PRISTINE-2.6.32-220.13.1.2chaos.ch5.x86_64
      LustreError: 11-0: MGC10.1.1.211@o2ib9: Communicating with 10.1.1.211@o2ib9, operation llog_origin_handle_create failed with -2
      LustreError: 5497:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
      Lustre: lcz-OST0000: Mounted
      Lustre: lcz-OST0000: Will be in recovery for at least 5:00, or until 60 clients reconnect.
      Lustre: lcz-OST0000: Recovery over after 1:12, of 60 clients 60 recovered and 0 were evicted.
      LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25345259: rc -2
      LustreError: 5663:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25352185: rc -2
      LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 2 previous similar messages
      LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25353836: rc -2
      LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 168 previous similar messages
      LustreError: 5665:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25418903: rc -2
      LustreError: 5665:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 561 previous similar messages
      LustreError: 5659:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25627504: rc -2
      LustreError: 5659:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 2043 previous similar messages
      LustreError: 5673:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25772020: rc -2
      LustreError: 5673:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 4424 previous similar messages
      LustreError: 5662:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 26456381: rc -2
      LustreError: 5662:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 10386 previous similar messages
      LustreError: 5672:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 28049681: rc -2
      LustreError: 5672:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 23629 previous similar messages
      

      Attachments

        Activity

          [LU-2193] lvbo_init failed for resource XXX: rc -2, after recovery
          bobijam Zhenyu Xu added a comment - - edited

          patch http://review.whamcloud.com/6266

          commit message
          LU-2193 ofd: check object exists before orphan destroy
          
MDS replay of object destroys after recovery can encounter non-existent
objects; skip them before attempting further futile actions.
          

We need to check for object existence before trying to destroy an orphan object.

          bobijam Zhenyu Xu added a comment -

the lvbo_init failure still happens in ofd_orphans_destroy().


It seems that the initial patch to do the object lookup before the destroy has slowed these messages down but not eliminated them entirely, or maybe I just don't see as many messages under my test load. It appears that the lvbo_init objects are NOT in the range of the just-cleaned-up orphan objects, which is what I would expect from the original debugging:

          Lustre: testfs-OST0000: deleting orphan objects from 0x0:28913 to 28929
          Lustre: testfs-OST0001: deleting orphan objects from 0x0:29944 to 29985
          :
          :
          Lustre: testfs-OST0001: deleting orphan objects from 0x0:55389 to 57441
          Lustre: testfs-OST0000: deleting orphan objects from 0x0:54397 to 54977
          LustreError: 6979:0:ldlm_resource_get()) testfs-OST0001: lvbo_init failed for resource 0x7e32:0x0: rc = -2
          LustreError: 6999:0:ldlm_resource_get()) testfs-OST0001: lvbo_init failed for resource 0x80c4:0x0: rc = -2
          LustreError: 6995:0:ldlm_resource_get()) testfs-OST0000: lvbo_init failed for resource 0x998d:0x0: rc = -2
          (gdb) p 0x7e32
          $5 = 32306
          (gdb) p 0x80c4
          $6 = 32964
          (gdb) p 0x998d
          $7 = 39309
          

          There are of course messages being skipped in these console logs, but it seems like a disjoint set of objects.

adilger Andreas Dilger added a comment -

Hi, Andreas.
After taking a closer look at the patches here, I realize that you're completely right. Something related to a stale handle, or to post-recovery behavior on the MDS and OSS, may be causing my issue, rather than this one.

I am sorry to have caused trouble, and thank you!

nozaki Hiroya Nozaki (Inactive) added a comment -

Sorry, the patches here will not fix this problem. My first patch was only for a bug in the master branch, and the second patch is only for improving the error message.

adilger Andreas Dilger added a comment -

Hi, Alex. Thank you for your quick reply.
It looks possible to me that the OSS issue affects the client side somehow, because both messages show the same res_name[0].

Hm, but in any case, it seems I'd better read the Lustre source code more closely.
Thank you.

nozaki Hiroya Nozaki (Inactive) added a comment -

          no, that patch is not supposed to fix this LBUG.

bzzz Alex Zhuravlev added a comment -
          nozaki Hiroya Nozaki (Inactive) added a comment - - edited

Hi, Andreas. We at Fujitsu have almost the same issue, apart from one thing, so I'm planning to port this patch (http://review.whamcloud.com/4276) to b1_8.
But before making the patch, I'd like to ask you one question.

Our customer got the message below, as described here, on an OSS.

          14761:0:(ldlm_resource.c:1011:ldlm_resource_add()) filter-fefs-OST0113_UUID: lvbo_init failed for resource 332822: rc -2
          

Then, after a while, umount hit an LBUG in ldlm_lock_cancel on a client node because of an extra refcount on ldlm_lock->l_readers of the extent lock, and the client messages showed the same res_name[0] as the OSS message.
I'm not sure, but I think both issues may come from the same root cause, so I'd like to know whether or not this patch will fix the client issue too.

          5538:0:(ldlm_lock.c:1606:ldlm_lock_cancel()) ### lock still has references ns: cx-OST0113-osc-ffff883fd57f4800 lock: ffff881dc8ddb000/0xead40fdb3d64db52 lrc: 4/1,0 mode: PR/PR res: 332822/0 rrc: 2 type: EXT [0->18446744073
          5538:0:(ldlm_lock.c:1607:ldlm_lock_cancel()) LBUG
          
          crash> bt
          PID: 5538   TASK: ffff883577ac6080  CPU: 8   COMMAND: "umount"
           #0 [ffff883db6139640] machine_kexec at ffffffff810310db
           #1 [ffff883db61396a0] crash_kexec at ffffffff810b63b2
           #2 [ffff883db6139770] panic at ffffffff814dac2f
           #3 [ffff883db61397f0] lbug_with_loc at ffffffffa04ba453
           #4 [ffff883db6139870] ldlm_cli_cancel_local at ffffffffa0697651
           #5 [ffff883db61398d0] ldlm_cli_cancel at ffffffffa0699d23
           #6 [ffff883db6139950] cleanup_resource at ffffffffa0681e83
           #7 [ffff883db61399d0] ldlm_namespace_cleanup at ffffffffa068218f
           #8 [ffff883db6139a60] __ldlm_namespace_free at ffffffffa068239b
           #9 [ffff883db6139b40] ldlm_namespace_free_prior at ffffffffa06829bc
          #10 [ffff883db6139b80] client_disconnect_export at ffffffffa0687766
          #11 [ffff883db6139c00] osc_disconnect at ffffffffa07a6df4
          #12 [ffff883db6139c60] lov_putref at ffffffffa0829f21
          #13 [ffff883db6139d30] ll_put_super at ffffffffa08bbf0d
          #14 [ffff883db6139e30] generic_shutdown_super at ffffffff81174adb
          #15 [ffff883db6139e50] kill_anon_super at ffffffff81174bc6
          #16 [ffff883db6139e70] lustre_kill_super at ffffffffa05dc588
          #17 [ffff883db6139e90] deactivate_super at ffffffff81175c40
          #18 [ffff883db6139eb0] mntput_no_expire at ffffffff8119113f
          #19 [ffff883db6139ee0] sys_umount at ffffffff8119156b
          #20 [ffff883db6139f80] tracesys at ffffffff8100b387 (via system_call)
              RIP: 00007fdce2a7b717  RSP: 00007fff1f6b5728  RFLAGS: 00000246
              RAX: ffffffffffffffda  RBX: ffffffff8100b387  RCX: ffffffffffffffff
              RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00007fdce3c11a10
              RBP: 00007fdce3c119d0   R8: 00007fdce3c11a30   R9: 0000000000000000
              R10: 00007fff1f6b54b0  R11: 0000000000000246  R12: 0000000000000000
              R13: 0000000000000000  R14: 0000000000000000  R15: 00007fdce3c11a70
              ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b
          

          Minor patch to improve lvbo_init() error message to print full FID: http://review.whamcloud.com/4501

adilger Andreas Dilger added a comment -

          > Is it possible to just pass the lu_env down via the lvbo_init code?

yes, and there are other places in ldlm where we want to pass an env in.

bzzz Alex Zhuravlev added a comment -
          adilger Andreas Dilger added a comment - Untested patch: http://review.whamcloud.com/4276

People

  bobijam Zhenyu Xu
  behlendorf Brian Behlendorf
  Votes: 0
  Watchers: 10