Details


    Description

      Hi,

      When trying to unmount a Lustre client, we got the following problem:

      Lustre: DEBUG MARKER: Wed Nov 21 06:25:01 2012
      
      LustreError: 11559:0:(ldlm_lock.c:1697:ldlm_lock_cancel()) ### lock still has references ns:
      ptmp-MDT0000-mdc-ffff88030871bc00 lock: ffff88060dbd2d80/0x4618f3ec8d79d8be lrc: 4/0,1 mode: PW/PW res: 8590405073/266
      rrc: 2 type: FLK pid: 4414 [0->551] flags: 0x22002890 remote: 0xc8980c051f8f6afd expref: -99 pid: 4414 timeout: 0
      LustreError: 11559:0:(ldlm_lock.c:1698:ldlm_lock_cancel()) LBUG
      Pid: 11559, comm: umount
      
      Call Trace:
       [<ffffffffa040d7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa040de07>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa063343d>] ldlm_lock_cancel+0x1ad/0x1b0 [ptlrpc]
       [<ffffffffa064d245>] ldlm_cli_cancel_local+0xb5/0x380 [ptlrpc]
       [<ffffffffa06510b8>] ldlm_cli_cancel+0x58/0x3b0 [ptlrpc]
       [<ffffffffa063ae18>] cleanup_resource+0x168/0x300 [ptlrpc]
       [<ffffffffa063afda>] ldlm_resource_clean+0x2a/0x50 [ptlrpc]
       [<ffffffffa041e28f>] cfs_hash_for_each_relax+0x17f/0x380 [libcfs]
       [<ffffffffa063afb0>] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
       [<ffffffffa063afb0>] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
       [<ffffffffa041fcaf>] cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
       [<ffffffffa0637a69>] ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc]
       [<ffffffffa0638adb>] __ldlm_namespace_free+0x4b/0x540 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa041fcb7>] ? cfs_hash_for_each_nolock+0x87/0x1c0 [libcfs]
       [<ffffffffa063903f>] ldlm_namespace_free_prior+0x6f/0x230 [ptlrpc]
       [<ffffffffa063fc4c>] client_disconnect_export+0x23c/0x460 [ptlrpc]
       [<ffffffffa0b42a44>] lmv_disconnect+0x644/0xc70 [lmv]
       [<ffffffffa0a470bc>] client_common_put_super+0x46c/0xe80 [lustre]
       [<ffffffffa0a47ba0>] ll_put_super+0xd0/0x360 [lustre]
       [<ffffffff8117e01c>] ? dispose_list+0x11c/0x140
       [<ffffffff8117e4a8>] ? invalidate_inodes+0x158/0x1a0
       [<ffffffff8116542b>] generic_shutdown_super+0x5b/0x110
       [<ffffffff81165546>] kill_anon_super+0x16/0x60
       [<ffffffffa050897a>] lustre_kill_super+0x4a/0x60 [obdclass]
       [<ffffffff811664e0>] deactivate_super+0x70/0x90
       [<ffffffff811826bf>] mntput_no_expire+0xbf/0x110
       [<ffffffff81183188>] sys_umount+0x78/0x3c0
       [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      
      Kernel panic - not syncing: LBUG
      Pid: 11559, comm: umount Not tainted 2.6.32-220.23.1.bl6.Bull.28.8.x86_64 #1
      Call Trace:
       [<ffffffff81484650>] ? panic+0x78/0x143
       [<ffffffffa040de5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa063343d>] ? ldlm_lock_cancel+0x1ad/0x1b0 [ptlrpc]
       [<ffffffffa064d245>] ? ldlm_cli_cancel_local+0xb5/0x380 [ptlrpc]
       [<ffffffffa06510b8>] ? ldlm_cli_cancel+0x58/0x3b0 [ptlrpc]
       [<ffffffffa063ae18>] ? cleanup_resource+0x168/0x300 [ptlrpc]
       [<ffffffffa063afda>] ? ldlm_resource_clean+0x2a/0x50 [ptlrpc]
       [<ffffffffa041e28f>] ? cfs_hash_for_each_relax+0x17f/0x380 [libcfs]
       [<ffffffffa063afb0>] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
       [<ffffffffa063afb0>] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
       [<ffffffffa041fcaf>] ? cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
       [<ffffffffa0637a69>] ? ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc]
       [<ffffffffa0638adb>] ? __ldlm_namespace_free+0x4b/0x540 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa041fcb7>] ? cfs_hash_for_each_nolock+0x87/0x1c0 [libcfs]
       [<ffffffffa063903f>] ? ldlm_namespace_free_prior+0x6f/0x230 [ptlrpc]
       [<ffffffffa063fc4c>] ? client_disconnect_export+0x23c/0x460 [ptlrpc]
       [<ffffffffa0b42a44>] ? lmv_disconnect+0x644/0xc70 [lmv]
       [<ffffffffa0a470bc>] ? client_common_put_super+0x46c/0xe80 [lustre]
       [<ffffffffa0a47ba0>] ? ll_put_super+0xd0/0x360 [lustre]
       [<ffffffff8117e01c>] ? dispose_list+0x11c/0x140
       [<ffffffff8117e4a8>] ? invalidate_inodes+0x158/0x1a0
       [<ffffffff8116542b>] ? generic_shutdown_super+0x5b/0x110
       [<ffffffff81165546>] ? kill_anon_super+0x16/0x60
       [<ffffffffa050897a>] ? lustre_kill_super+0x4a/0x60 [obdclass]
       [<ffffffff811664e0>] ? deactivate_super+0x70/0x90
       [<ffffffff811826bf>] ? mntput_no_expire+0xbf/0x110
       [<ffffffff81183188>] ? sys_umount+0x78/0x3c0
       [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
      

      This issue is exactly the same as the one described in LU-1429, which is a duplicate of LU-1328, which itself seems to be related to LU-1421.
      The issue seems to be resolved, but it is unclear to me which patches are needed to completely fix it.
      I should add that we need a fix for b2_1.

      Can you please advise?

      TIA,
      Sebastien.

          Activity

            [LU-2665] LBUG while unmounting client
            yujian Jian Yu added a comment -

            Patch http://review.whamcloud.com/6415 was cherry-picked to the Lustre b2_4 branch.

            pjones Peter Jones added a comment -

            ok - thanks Sebastien!


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Sure, this bug can be closed, as additional work was carried out by Bruno in LU-3701.

            Thanks,
            Sebastien.

            adilger Andreas Dilger added a comment -

            Patches have landed to b2_1 and master for 2.5.0. Can this bug be closed?

            > But "retry" mechanism doesn't guarantee that lock will reach MDS
            Yes, but we need to do our best! And particularly for FLock/F_UNLCKs (where Server must know, unless other Clients/processes will stay stuck forever) which can not be trashed. At least with retries we now also cover MDS crashes and communications problems, if reboot/restart/failover/fix/... occurs finally and if not Lustre is dead on this Client and who cares that we retry forever ?

            > cleanup_resource() should deal with orphaned locks also
            Only during evict or "umount -f".

            bfaccini Bruno Faccini (Inactive) added a comment - > But "retry" mechanism doesn't guarantee that lock will reach MDS Yes, but we need to do our best! And particularly for FLock/F_UNLCKs (where Server must know, unless other Clients/processes will stay stuck forever) which can not be trashed. At least with retries we now also cover MDS crashes and communications problems, if reboot/restart/failover/fix/... occurs finally and if not Lustre is dead on this Client and who cares that we retry forever ? > cleanup_resource() should deal with orphaned locks also Only during evict or "umount -f".
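            For illustration only (this is not Lustre code): a minimal, self-contained userspace sketch of the retry idea described above. send_unlock_rpc() and its failure pattern are hypothetical placeholders standing in for the client resending an F_UNLCK request until the server acknowledges it.

            #include <errno.h>
            #include <stdio.h>
            #include <unistd.h>

            /* Hypothetical transport call: returns 0 once the server acks the
             * unlock, a negative errno while the MDS is unreachable. */
            static int send_unlock_rpc(unsigned long long lock_cookie)
            {
                static int attempts;
                (void)lock_cookie;
                /* Stub: pretend the first two attempts hit a network error. */
                return (++attempts < 3) ? -EIO : 0;
            }

            int main(void)
            {
                unsigned long long cookie = 0x4618f3ec8d79d8beULL; /* cookie taken from the log above, for flavor */
                int rc;

                /* An F_UNLCK must reach the server, otherwise other clients stay
                 * blocked forever, so retry until it is acknowledged; if Lustre is
                 * dead on this client anyway, looping here costs nothing. */
                do {
                    rc = send_unlock_rpc(cookie);
                    if (rc) {
                        fprintf(stderr, "unlock RPC failed (%d), retrying\n", rc);
                        sleep(1);
                    }
                } while (rc);

                printf("unlock acknowledged by server\n");
                return 0;
            }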

            But "retry" mechanism doesn't guarantee that lock will reach MDS. cleanup_resource() should deal with orphaned locks also.

            askulysh Andriy Skulysh added a comment - But "retry" mechanism doesn't guarantee that lock will reach MDS. cleanup_resource() should deal with orphaned locks also.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Cory: thanks for adding a reference to LU-3701. It is the follow-on to this ticket, as this ticket's change introduced some regressions against the POSIX test suite.

            Andriy: if you carefully read the history/comments of this ticket (LU-2665), you will find that it addresses a very particular scenario, which can be summarized as "FLock/F_UNLCK requests can be trashed upon an MDS crash or a communication problem, leaving orphaned FLocks". This is definitely not the race problem described in LU-2177. And the "retry" mechanism in the LU-3701 change is now fully limited to FLock/F_UNLCK requests.
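            For context, the userspace side of that scenario can be illustrated with standard POSIX record-locking calls (ordinary fcntl() usage, not code from the ticket). The final F_UNLCK step is the request that, per this ticket, could be trashed on an MDS crash or communication problem, leaving the record locked for every other client. The file path is just a placeholder; on a Lustre client it would be a file inside the Lustre mount.

            #include <fcntl.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <unistd.h>

            int main(void)
            {
                /* Placeholder path; on a Lustre client this would be a file in the mount. */
                int fd = open("/tmp/flock-demo", O_RDWR | O_CREAT, 0644);
                if (fd < 0) { perror("open"); return EXIT_FAILURE; }

                struct flock fl = {
                    .l_type   = F_WRLCK,   /* a write lock, matching the PW mode in the log */
                    .l_whence = SEEK_SET,
                    .l_start  = 0,
                    .l_len    = 552,       /* the log shows an extent of [0->551] */
                };

                /* Acquire the record lock (blocks until granted). */
                if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("F_SETLKW"); return EXIT_FAILURE; }
                printf("lock acquired\n");

                /* Release it: on a Lustre client this becomes the F_UNLCK request to the MDS. */
                fl.l_type = F_UNLCK;
                if (fcntl(fd, F_SETLK, &fl) < 0) { perror("F_UNLCK"); return EXIT_FAILURE; }
                printf("lock released\n");

                close(fd);
                return 0;
            }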

            askulysh Andriy Skulysh added a comment -

            I don't like the idea of resending flocks. There is no guarantee that the request will succeed.
            The client already has a cleanup mechanism.
            ldlm_flock_completion_ast() should call ldlm_lock_decref_internal() correctly.
            The patch should be combined with the similar patch from LU-2177.
            spitzcor Cory Spitz added a comment -

            Related to LU-3701 and http://review.whamcloud.com/#/c/7453.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi Bruno,

            We have retrieved the patch for b2_1. It will be rolled out at CEA's next maintenance, planned for early September.

            Cheers,
            Sebastien.

            People

              bfaccini Bruno Faccini (Inactive)
              sebastien.buisson Sebastien Buisson (Inactive)
              Votes: 0
              Watchers: 12
