[LU-10302] hsm: obscure bug with multi-mountpoints and ldlm Created: 30/Nov/17  Updated: 05/Aug/20  Resolved: 22/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: CEA Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: None

Attachments: File reproducer-lu-10302.sh    
Issue Links:
Related
is related to LU-10357 ll_ioc_copy_{start,end}() depend on s... Resolved
is related to LU-10723 Interop 2.10.3<->2.11 sanity test_232... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I do not have much to share except the attached reproducer.

The key elements of the reproducer seem to be:

  1. setup lustre with two mountpoints;
  2. create a file;
  3. launch a copytool on mountpoint A;
  4. suspend the copytool;
  5. archive the file created at step 2 from mountpoint A*;
  6. delete the file on mountpoint B;
  7. sync;
  8. un-suspend the copytool (the output of the copytool should indicate that llapi_hsm_action_begin() failed with EIO, not ENOENT)
  9. umount => the process hangs in an unkillable state.

*You can use mountpoint B at step 5, but only if you created the file from mountpoint A.
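The steps above can be sketched as a shell script (a minimal sketch, not the attached reproducer: the mountpoints /mnt/lustreA and /mnt/lustreB, the archive directory /tmp/arc, and archive index 1 are hypothetical, and it assumes the HSM coordinator is already enabled):

```shell
#!/bin/sh
# Hypothetical paths; adapt to your setup. Assumes the filesystem is
# mounted twice and the MDT HSM coordinator is enabled.
MNT_A=/mnt/lustreA
MNT_B=/mnt/lustreB

touch "$MNT_A/victim"                                     # 2. create the file (from A)
lhsmtool_posix --hsm-root=/tmp/arc --archive=1 "$MNT_A" & # 3. copytool on mountpoint A
CT_PID=$!
sleep 1
kill -STOP "$CT_PID"                                      # 4. suspend the copytool
lfs hsm_archive "$MNT_A/victim"                           # 5. archive from A
rm "$MNT_B/victim"                                        # 6. delete on B
sync                                                      # 7. sync data/metadata
kill -CONT "$CT_PID"                                      # 8. resume; expect EIO from llapi_hsm_action_begin()
sleep 5
umount "$MNT_B"; umount "$MNT_A"                          # 9. hangs here before the fix
```

This cannot run outside a Lustre test environment; the attached reproducer-lu-10302.sh remains authoritative.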

I added some debug in the reproducer that should be logged in /tmp.

I suspect those two lines in the dmesg are related to this issue (they are logged at umount time):

[  143.575078] LustreError: 3703:0:(ldlm_resource.c:1094:ldlm_resource_complain()) filter-lustre-OST0000_UUID: namespace resource [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount nonzero (1) after lock cleanup; forcing cleanup.
[  143.578233] LustreError: 3703:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount = 2

Note: the title should probably be updated once we figure out exactly what the issue is



 Comments   
Comment by Andreas Dilger [ 30/Nov/17 ]

Quentin, it isn't clear from your bug report what actual problem you are hitting. Does the client unmount fail, or are the error messages unexpected but otherwise harmless? Is this problem hit in normal usage?

It does look like the copytool is holding a lock reference on the OST object longer than it should, but such references should be cleaned up at unmount.

 

Comment by Quentin Bouget [ 01/Dec/17 ]

My bad, I updated the description: the client unmount hangs.

> Is this problem hit in normal usage?

The reproducer I provided works on a single-node setup, but you can also reproduce the issue on a multi-node setup (copytool on one node, client doing the rm on another), so this definitely impacts production setups.

Comment by Quentin Bouget [ 01/Dec/17 ]

Letting the HSM request time out is not required to reproduce; rather, syncing data/metadata is what matters.

I updated the description (once again) and the reproducer accordingly.

Comment by Peter Jones [ 01/Dec/17 ]

Bruno

Can you look into this one?

Thanks

Peter

Comment by Quentin Bouget [ 04/Dec/17 ]

The condition to trigger the bug is a bit more complex than I first thought: lhsmtool_posix != rm && !(create == lfs hsm_archive == rm)

The more verbose version: lhsmtool_posix and rm are run on different mountpoints, and the file is not created, archived and deleted from the same mountpoint.

I am not sure how useful this is. I am putting it here... just in case.
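Quentin's condition can be modeled as a small shell predicate (the function name and argument convention are mine, not from the ticket): each argument names the mountpoint a given action ran on, and the function exits 0 when the reported trigger condition holds.

```shell
# triggers_bug COPYTOOL CREATE ARCHIVE RM
# Each argument is the mountpoint (e.g. A or B) the action ran on.
# Exits 0 (true) when: the copytool and the rm ran on different
# mountpoints, AND the file was not created, archived, and deleted
# all from the same mountpoint.
triggers_bug() {
    ct=$1; cr=$2; ar=$3; rm_=$4
    [ "$ct" != "$rm_" ] || return 1
    if [ "$cr" = "$ar" ] && [ "$ar" = "$rm_" ]; then
        return 1
    fi
    return 0
}

triggers_bug A A A B && echo "triggers"   # the scenario in the description
triggers_bug A A A A || echo "no bug"     # everything on one mountpoint
```

This also covers the note in the description: archiving from B still triggers the bug as long as the file was created on A (copytool=A, create=A, archive=B, rm=B).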

Comment by John Hammond [ 08/Dec/17 ]

You are seeing that the lock and resource reference counting in LDLM is intolerant of some LVB init errors. In particular, if ofd_lvbo_init() fails because the object could not be found, then a reference on the resource is leaked.
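The leak pattern can be illustrated with an abstract model (plain shell with a counter standing in for the LDLM resource refcount; none of this is actual Lustre code): taking a lock grabs a resource reference, LVB init can fail when the object is gone, and if the error path returns without dropping that reference, unmount later finds a nonzero refcount, matching the dmesg lines in the description.

```shell
refcount=0
res_get() { refcount=$((refcount + 1)); }
res_put() { refcount=$((refcount - 1)); }
lvb_init() { [ "$1" = "exists" ]; }   # fails when the object was deleted

# Buggy error path: returns early and leaks the reference.
enqueue_buggy() {
    res_get
    lvb_init "$1" || return 1         # reference never dropped
    res_put
}

# Fixed error path, in the spirit of "destroy lock if LVB init fails":
# the reference is dropped whether or not LVB init succeeded.
enqueue_fixed() {
    res_get
    lvb_init "$1"; rc=$?
    res_put
    return $rc
}

refcount=0; enqueue_buggy deleted || true
echo "buggy: refcount=$refcount"      # 1 -> "refcount nonzero ... after lock cleanup"
refcount=0; enqueue_fixed deleted || true
echo "fixed: refcount=$refcount"      # 0 -> cleanup succeeds at unmount
```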

Comment by John Hammond [ 08/Dec/17 ]

BTW, the CT is able to hit this because it calls search_inode_for_lustre() to get the data version, so it does not see that the file has been deleted.

Comment by Quentin Bouget [ 11/Dec/17 ]

I cannot reproduce the bug anymore when I apply the patch you proposed for LU-10357. Thank you!

Maybe we can keep this LU to fix search_inode_for_lustre() or ofd_lvbo_init()... or both, depending on what makes more sense. =)

Comment by Gerrit Updater [ 11/Dec/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30477
Subject: LU-10302 ldlm: destroy lock if LVB init fails
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0be0459c0b1409c790a214a73735673ed9907b57

Comment by Gerrit Updater [ 22/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30477/
Subject: LU-10302 ldlm: destroy lock if LVB init fails
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c91cb6ee81e7751b719228efa58dc32fdea836e5

Comment by Peter Jones [ 22/Dec/17 ]

Landed for 2.11

Generated at Sat Feb 10 02:33:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.