[LU-10302] hsm: obscure bug with multi-mountpoints and ldlm Created: 30/Nov/17 Updated: 05/Aug/20 Resolved: 22/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | CEA | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I do not have much to share except the attached reproducer. The key elements of the reproducer seem to be:
* You can use mountpoint B at step 5, but only if you created the file from mountpoint A.

I added some debug in the reproducer that should be logged in /tmp. I suspect these two lines in the dmesg are related to this issue (they are logged at umount time):

[ 143.575078] LustreError: 3703:0:(ldlm_resource.c:1094:ldlm_resource_complain()) filter-lustre-OST0000_UUID: namespace resource [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 143.578233] LustreError: 3703:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount = 2

Note: the title should probably be updated once we figure out what the issue exactly is. |
| Comments |
| Comment by Andreas Dilger [ 30/Nov/17 ] |
|
Quentin, it isn't clear from your bug report what the actual problem is that you are hitting. Does the client unmount fail, or are the error messages unexpected but not otherwise a problem? Is this problem hit in normal usage? It does look like the copytool is holding a lock reference on the OST object longer than it should, but those references should be cleaned up at unmount.
|
| Comment by Quentin Bouget [ 01/Dec/17 ] |
|
My bad, I updated the description: the client unmount hangs.

> Is this problem hit in normal usage?

The reproducer I provided works on a single-node setup, but you can also reproduce on a multi-node setup (copytool on one node, client doing the rm on another node), so this definitely impacts production setups. |
| Comment by Quentin Bouget [ 01/Dec/17 ] |
|
Letting the HSM request time out is not required to reproduce; what matters is syncing data/metadata. I updated the description (once again) and the reproducer accordingly. |
| Comment by Peter Jones [ 01/Dec/17 ] |
|
Bruno, can you look into this one? Thanks. Peter |
| Comment by Quentin Bouget [ 04/Dec/17 ] |
|
The condition to trigger the bug is a bit more complex than I first thought:

lhsmtool_posix != rm && !(create == lfs hsm_archive == rm)

The more verbose version: lhsmtool_posix and rm are run on different mountpoints, and the file is not created, archived, and deleted all from the same mountpoint. I am not sure how useful this is; I am putting it here... just in case. |
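To make the "==" chain above unambiguous, here is a hypothetical restatement in C (my own helper and variable names, not taken from the reproducer), where each variable identifies the mountpoint an action was performed on:

```c
#include <stdbool.h>

/* Hypothetical restatement of the trigger condition described above. */
static bool triggers_bug(int mnt_copytool, int mnt_rm,
                         int mnt_create, int mnt_archive)
{
    /* lhsmtool_posix != rm  &&  !(create == lfs hsm_archive == rm) */
    return mnt_copytool != mnt_rm &&
           !(mnt_create == mnt_archive && mnt_archive == mnt_rm);
}
```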
| Comment by John Hammond [ 08/Dec/17 ] |
|
You are seeing the fact that the lock and resource reference counting in LDLM is intolerant of some lvbo init errors. In particular, if ofd_lvbo_init() fails because the object could not be found, then a reference on the resource is somehow leaked. |
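For illustration, a minimal self-contained C sketch of that failure shape, using hypothetical names rather than the actual Lustre code: a reference taken on the resource before LVB initialization has to be dropped on the init-failure path as well, otherwise the namespace still holds it at unmount and complains with the "refcount nonzero (1) after lock cleanup" message shown in the description.

```c
/*
 * Illustrative sketch only -- hypothetical types and functions, not the
 * actual Lustre code paths.  It shows the generic shape of the bug: a
 * refcount taken before an init step that can fail must be released on
 * the failure path as well.
 */
#include <errno.h>
#include <stdio.h>

struct fake_resource {
    int refcount;       /* stands in for the LDLM resource refcount */
    int lvb_ready;
};

static void resource_put(struct fake_resource *res)
{
    res->refcount--;
}

static int lvbo_init(struct fake_resource *res)
{
    /* Pretend the backing OST object was already unlinked. */
    (void)res;
    return -ENOENT;
}

static int enqueue_lock(struct fake_resource *res)
{
    int rc;

    res->refcount++;    /* reference held for the new lock */

    rc = lvbo_init(res);
    if (rc != 0) {
        /*
         * Without this resource_put() the reference leaks, and the
         * namespace cleanup at unmount finds a nonzero refcount.
         */
        resource_put(res);
        return rc;
    }

    res->lvb_ready = 1;
    return 0;
}

int main(void)
{
    struct fake_resource res = { .refcount = 0, .lvb_ready = 0 };

    if (enqueue_lock(&res) != 0)
        fprintf(stderr, "enqueue failed, refcount now %d\n", res.refcount);
    return 0;
}
```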
| Comment by John Hammond [ 08/Dec/17 ] |
|
BTW, the CT is able to hit this because it goes through search_inode_for_lustre() to get the data version, so it does not see that the file has been deleted.
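As a hedged illustration of the data-version step (an assumption about the flow, not the actual lhsmtool_posix source; only llapi_get_data_version() and LL_DV_RD_FLUSH are real lustreapi/Lustre names here): the copytool samples the data version through a file descriptor it reached by FID, so the call keeps working even after the last path to the file has been unlinked.

```c
#include <stdio.h>
#include <lustre/lustreapi.h>   /* llapi_get_data_version(), LL_DV_RD_FLUSH */

/*
 * Hypothetical helper: 'fd' is assumed to have been obtained through a
 * FID-based open (resolved via search_inode_for_lustre() on the client),
 * which is why an already-unlinked file is still reachable here.
 */
static int ct_sample_data_version(int fd)
{
    __u64 dv = 0;
    int rc;

    /* Flush cached writes on the OSTs before reading the data version. */
    rc = llapi_get_data_version(fd, &dv, LL_DV_RD_FLUSH);
    if (rc < 0) {
        fprintf(stderr, "cannot get data version: %d\n", rc);
        return rc;
    }

    printf("data version: %llu\n", (unsigned long long)dv);
    return 0;
}
```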
| Comment by Quentin Bouget [ 11/Dec/17 ] |
|
I cannot reproduce the bug anymore when I apply the patch you proposed for

Maybe we can keep this LU to fix search_inode_for_lustre() or ofd_lvbo_init()... or both, depending on what makes more sense. =) |
| Comment by Gerrit Updater [ 11/Dec/17 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30477 |
| Comment by Gerrit Updater [ 22/Dec/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30477/ |
| Comment by Peter Jones [ 22/Dec/17 ] |
|
Landed for 2.11 |