|
Duplicate of LU-3601
|
|
Not a duplicate of LU-3601: that issue describes a similar problem that could have been addressed using http://review.whamcloud.com/#/c/7148/ but was eventually fixed by changing the locking used by the coordinator.
|
|
I believe I have a patch for this in http://review.whamcloud.com/8083
Could you please share exact reproduction details (and also please give the patch a try)?
|
|
Also, there's a patch with an alternative approach at http://review.whamcloud.com/8088
We might want this one to go into master only and not 2.5, since it's somewhat more involved.
Please give this one a try as well.
|
|
Our first test run with the simpler of the two patches has locked up. The original lock re-ordering patch from LU-3601 allowed us to complete the test run successfully.
I'm going to make the dump available here shortly. It's possible it's the same issue, or possibly there's some other layout-lock-related livelock that John's original LU-3601 patch also allowed us to avoid.
The second patch LBUGged almost immediately:
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: LustreError: 17452:0:(mdt_handler.c:3841:mdt_intent_reint()) ASSERTION( lustre_handle_is_used(&lhc->mlh_reg_lh) ) failed:
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: LustreError: 17452:0:(mdt_handler.c:3841:mdt_intent_reint()) LBUG
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: Pid: 17452, comm: mdt02_003
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel:
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: Call Trace:
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0a20895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0a20e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa13ab367>] mdt_intent_reint+0x4e7/0x520 [mdt]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa13a8f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0cc1831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0ce893f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa13a93a6>] mdt_enqueue+0x46/0xe0 [mdt]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa13afa97>] mdt_handle_common+0x647/0x16d0 [mdt]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0d0b40c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa13e94d5>] mds_regular_handle+0x15/0x20 [mdt]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0d1ac28>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0a215de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0a32ebf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0d11f89>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0d1bfbe>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0d1b4f0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0d1b4f0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffffa0d1b4f0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Oct 29 15:52:23 perses-esf-mds001 perses-esf-mds001 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
|
|
MDS dump from apparent live lock with the first patch is up on Cray's FTP at:
ftp.cray.com
u: anonymous
p: anonymous
Then:
cd outbound/LU-4152
And then the file is:
LU-4152-patch_8083.tar.gz
Note that the anonymous user doesn't have permission to run ls in outbound or in LU-4152, but a get command will work fine.
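The steps above can be scripted; assuming a standard anonymous FTP setup, something like the following untested sketch should retrieve the dump (since listing is disabled, the file must be named explicitly):

```shell
# Anonymous FTP fetch of the MDS dump. `ls` is not permitted in
# outbound/ or LU-4152/, so request the full path directly.
curl -u anonymous:anonymous -O \
    ftp://ftp.cray.com/outbound/LU-4152/LU-4152-patch_8083.tar.gz
```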
|
|
Apparently the vmcore that's part of this file is from the second panic, because I see the panic message inside.
|
|
Oops. You're right.
My apologies, Oleg, I grabbed the wrong dump package. Just a minute here.
|
|
The correct dump is now there.
There are now two files in outbound/LU-4152:
LU-4152-patch_8083.tar.gz
LU-4152-patch_8088.tar.gz
The file named for patch 8083 NOW correctly contains the dump from the hang we saw with patch 8083.
The one named for patch 8088 is the dump I uploaded previously, which has the LBUG in it.
|
|
I see.
Duh, apparently I made a check with the wrong lock handle.
Please update the patch with this fix and try again. (Note it introduces some sort of crash on client shutdown that I'll tackle soon, but I want to make sure it really kills the problem this time.)
|
|
Oleg,
We've just tested the fixed patch and it appears to resolve the issue. We've run our reproducer twice now without a hang, whereas it hit every time we ran with unpatched code.
Looks like it fixes it. If you can address the client shutdown crash, I think it's good to go.
|
|
By the way, here's the stack trace from our client dump on shutdown. It seems simple enough: just an unreleased reference, presumably to one of those locks:
2013-10-30T16:34:38.526134-05:00 c0-0c0s2n3 LustreError: 11673:0:(lu_object.c:1141:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
2013-10-30T16:34:38.526158-05:00 c0-0c0s2n3 LustreError: 11673:0:(lu_object.c:1141:lu_device_fini()) LBUG
2013-10-30T16:34:38.526169-05:00 c0-0c0s2n3 Pid: 11673, comm: umount
2013-10-30T16:34:38.526175-05:00 c0-0c0s2n3 Call Trace:
2013-10-30T16:34:38.555669-05:00 c0-0c0s2n3 [<ffffffff81007e59>] try_stack_unwind+0x1a9/0x200
2013-10-30T16:34:38.555680-05:00 c0-0c0s2n3 [<ffffffff81006625>] dump_trace+0x95/0x300
2013-10-30T16:34:38.555693-05:00 c0-0c0s2n3 [<ffffffffa02a28d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2013-10-30T16:34:38.555700-05:00 c0-0c0s2n3 [<ffffffffa02a2e27>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-10-30T16:34:38.555725-05:00 c0-0c0s2n3 [<ffffffffa05190c7>] lu_device_fini+0x87/0xc0 [obdclass]
2013-10-30T16:34:38.555734-05:00 c0-0c0s2n3 [<ffffffffa094efd4>] lovsub_device_free+0x24/0x160 [lov]
2013-10-30T16:34:38.585779-05:00 c0-0c0s2n3 [<ffffffffa051cd0e>] lu_stack_fini+0x7e/0xc0 [obdclass]
2013-10-30T16:34:38.585790-05:00 c0-0c0s2n3 [<ffffffffa0522c1e>] cl_stack_fini+0xe/0x10 [obdclass]
2013-10-30T16:34:38.585801-05:00 c0-0c0s2n3 [<ffffffffa093e9a8>] lov_device_fini+0x58/0x120 [lov]
2013-10-30T16:34:38.585866-05:00 c0-0c0s2n3 [<ffffffffa051ccda>] lu_stack_fini+0x4a/0xc0 [obdclass]
2013-10-30T16:34:38.585887-05:00 c0-0c0s2n3 [<ffffffffa0522c1e>] cl_stack_fini+0xe/0x10 [obdclass]
2013-10-30T16:34:38.585896-05:00 c0-0c0s2n3 [<ffffffffa0a1edec>] cl_sb_fini+0x6c/0x1a0 [lustre]
2013-10-30T16:34:38.615774-05:00 c0-0c0s2n3 [<ffffffffa09e4fc5>] client_common_put_super+0x55/0xad0 [lustre]
2013-10-30T16:34:38.615784-05:00 c0-0c0s2n3 [<ffffffffa09e5b05>] ll_put_super+0xc5/0x330 [lustre]
2013-10-30T16:34:38.647596-05:00 c0-0c0s2n3 [<ffffffff8111546d>] generic_shutdown_super+0x5d/0x110
2013-10-30T16:34:38.647614-05:00 c0-0c0s2n3 [<ffffffff81115586>] kill_anon_super+0x16/0x60
2013-10-30T16:34:38.647620-05:00 c0-0c0s2n3 [<ffffffffa050f2c6>] lustre_kill_super+0x36/0x50 [obdclass]
2013-10-30T16:34:38.647626-05:00 c0-0c0s2n3 [<ffffffff81115a23>] deactivate_super+0x73/0x90
2013-10-30T16:34:38.673196-05:00 c0-0c0s2n3 [<ffffffff8112db02>] mntput_no_expire+0xc2/0xf0
2013-10-30T16:34:38.673208-05:00 c0-0c0s2n3 [<ffffffff8112debc>] sys_umount+0x7c/0x360
2013-10-30T16:34:38.673213-05:00 c0-0c0s2n3 [<ffffffff8100305b>] system_call_fastpath+0x16/0x1b
2013-10-30T16:34:38.673237-05:00 c0-0c0s2n3 [<00007fe4da8afd07>] 0x7fe4da8afd07
2013-10-30T16:34:38.673245-05:00 c0-0c0s2n3 Kernel panic - not syncing: LBUG
2013-10-30T16:34:38.698848-05:00 c0-0c0s2n3 Pid: 11673, comm: umount Tainted: P 2.6.32.59-0.7.1_1.0402.7496-cray_gem_s #1
2013-10-30T16:34:38.698859-05:00 c0-0c0s2n3 Call Trace:
2013-10-30T16:34:38.698882-05:00 c0-0c0s2n3 [<ffffffff81007e59>] try_stack_unwind+0x1a9/0x200
2013-10-30T16:34:38.698891-05:00 c0-0c0s2n3 [<ffffffff81006625>] dump_trace+0x95/0x300
2013-10-30T16:34:38.698896-05:00 c0-0c0s2n3 [<ffffffff8100786c>] show_trace_log_lvl+0x5c/0x80
2013-10-30T16:34:38.698911-05:00 c0-0c0s2n3 [<ffffffff810078a5>] show_trace+0x15/0x20
2013-10-30T16:34:38.698916-05:00 c0-0c0s2n3 [<ffffffff81427e15>] dump_stack+0x77/0x82
2013-10-30T16:34:38.698921-05:00 c0-0c0s2n3 [<ffffffff81427e9a>] panic+0x7a/0x165
2013-10-30T16:34:38.724451-05:00 c0-0c0s2n3 [<ffffffffa02a2e7b>] lbug_with_loc+0x9b/0xb0 [libcfs]
2013-10-30T16:34:38.724462-05:00 c0-0c0s2n3 [<ffffffffa05190c7>] lu_device_fini+0x87/0xc0 [obdclass]
2013-10-30T16:34:38.724484-05:00 c0-0c0s2n3 [<ffffffffa094efd4>] lovsub_device_free+0x24/0x160 [lov]
2013-10-30T16:34:38.750105-05:00 c0-0c0s2n3 [<ffffffffa051cd0e>] lu_stack_fini+0x7e/0xc0 [obdclass]
2013-10-30T16:34:38.750118-05:00 c0-0c0s2n3 [<ffffffffa0522c1e>] cl_stack_fini+0xe/0x10 [obdclass]
2013-10-30T16:34:38.750142-05:00 c0-0c0s2n3 [<ffffffffa093e9a8>] lov_device_fini+0x58/0x120 [lov]
2013-10-30T16:34:38.750151-05:00 c0-0c0s2n3 [<ffffffffa051ccda>] lu_stack_fini+0x4a/0xc0 [obdclass]
2013-10-30T16:34:38.750156-05:00 c0-0c0s2n3 [<ffffffffa0522c1e>] cl_stack_fini+0xe/0x10 [obdclass]
2013-10-30T16:34:38.750168-05:00 c0-0c0s2n3 [<ffffffffa0a1edec>] cl_sb_fini+0x6c/0x1a0 [lustre]
2013-10-30T16:34:38.775711-05:00 c0-0c0s2n3 [<ffffffffa09e4fc5>] client_common_put_super+0x55/0xad0 [lustre]
2013-10-30T16:34:38.775723-05:00 c0-0c0s2n3 [<ffffffffa09e5b05>] ll_put_super+0xc5/0x330 [lustre]
2013-10-30T16:34:38.775749-05:00 c0-0c0s2n3 [<ffffffff8111546d>] generic_shutdown_super+0x5d/0x110
2013-10-30T16:34:38.775760-05:00 c0-0c0s2n3 [<ffffffff81115586>] kill_anon_super+0x16/0x60
2013-10-30T16:34:38.775766-05:00 c0-0c0s2n3 [<ffffffffa050f2c6>] lustre_kill_super+0x36/0x50 [obdclass]
2013-10-30T16:34:38.801373-05:00 c0-0c0s2n3 [<ffffffff81115a23>] deactivate_super+0x73/0x90
2013-10-30T16:34:38.801384-05:00 c0-0c0s2n3 [<ffffffff8112db02>] mntput_no_expire+0xc2/0xf0
2013-10-30T16:34:38.801410-05:00 c0-0c0s2n3 [<ffffffff8112debc>] sys_umount+0x7c/0x360
2013-10-30T16:34:38.801419-05:00 c0-0c0s2n3 [<ffffffff8100305b>] system_call_fastpath+0x16/0x1b
2013-10-30T16:34:38.801429-05:00 c0-0c0s2n3 [<00007fe4da8afd07>] 0x7fe4da8afd07
|
|
We've now observed this same deadlock in a non-NFS situation as well.
Oleg, if you can give a pointer on how to get the client to release that reference correctly, I'd be happy to generate and test the patch.
|
|
This LBUG appeared because, during shutdown, ll_file_release() and the final dput() are called before ll_md_blocking_ast() and d_lustre_invalidate(), so the condition to call d_kill() is never met.
|
|
Andriy, can you be more specific about which portion of that patch you believe caused this issue, and why you think it's at fault here, based on the stack trace?
|
|
Well, I came to the conclusion that the assertion is in fact a preexisting bug condition, since I can reproduce it at will without any patches with this:
I am going to open a new bug for this issue since it seems to be separate.
|
|
Opened LU-4272
Updated patch 8083 to fix the typo.
|
|
Thank you for the update, Oleg. It's nice to know the crash is unrelated.
|
|
http://review.whamcloud.com/8083 landed to master. What else is still needed to close this bug?
|
|
Patch landed to master. Closing ticket.
|
|
Are we sure this problem should not be fixed in 2.5 too?
|
|
Aurelien,
That would be tracked separately.
Peter
|