Details
Bug
Resolution: Fixed
Minor
Description
I do not have much to share except the attached reproducer.
The key elements of the reproducer seem to be:
1. set up Lustre with two mountpoints;
2. create a file;
3. launch a copytool on mountpoint A;
4. suspend the copytool;
5. archive the file created at step 2 from mountpoint A*;
6. delete the file on mountpoint B;
7. sync;
8. resume the copytool (its output should show that llapi_hsm_action_begin() failed with EIO, not ENOENT);
9. umount: the process hangs in an unkillable state.
*You can use mountpoint B at step 5, but only if you created the file from mountpoint A.
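For readers without access to the attachment, the ordering of the steps above can be sketched as a dry-run plan. This is not the attached reproducer: the mountpoint paths, the file name, the `COPYTOOL` placeholder, and the choice of which mountpoint gets unmounted are all assumptions made for illustration. Setting up the two mountpoints is left out because it depends on the test environment. Nothing is executed; the script only prints the sequence.

```python
import signal

# All names below are hypothetical placeholders, not taken from the reproducer.
MOUNT_A = "/mnt/lustre"       # assumed first client mountpoint
MOUNT_B = "/mnt/lustre2"      # assumed second client mountpoint
FILENAME = "hsm_repro_file"   # assumed test file name
COPYTOOL = "COPYTOOL"         # stands in for whatever copytool binary is used


def build_plan(mount_a=MOUNT_A, mount_b=MOUNT_B, name=FILENAME):
    """Return the steps as (description, action) pairs without running them.

    An action is either an argv list or a signal to send to the copytool.
    """
    path_a = f"{mount_a}/{name}"
    path_b = f"{mount_b}/{name}"
    return [
        ("create a file", ["touch", path_a]),
        ("launch a copytool on mountpoint A", [COPYTOOL, mount_a]),
        ("suspend the copytool", signal.SIGSTOP),
        ("archive the file from mountpoint A", ["lfs", "hsm_archive", path_a]),
        ("delete the file on mountpoint B", ["rm", path_b]),
        ("sync", ["sync"]),
        ("resume the copytool", signal.SIGCONT),
        # Assumption: the report does not say which mountpoint is unmounted.
        ("umount", ["umount", mount_a]),
    ]


if __name__ == "__main__":
    for number, (description, action) in enumerate(build_plan(), start=1):
        if isinstance(action, list):
            shown = " ".join(action)
        else:
            shown = f"send {action.name} to the copytool"
        print(f"{number}. {description}: {shown}")
```

The point of the ordering is that the archive request is queued while the copytool is stopped, so the file is already gone by the time the copytool picks the request up.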
I added some debug output to the reproducer; it should be logged under /tmp.
I suspect these two lines in dmesg are related to this issue (they are logged at umount time):
[ 143.575078] LustreError: 3703:0:(ldlm_resource.c:1094:ldlm_resource_complain()) filter-lustre-OST0000_UUID: namespace resource [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 143.578233] LustreError: 3703:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount = 2
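When checking whether another run hits the same symptom, a small filter over the dmesg output can pull out the leaked resource and its refcount. The sketch below is only an illustration: the regular expression is written against the two messages quoted above and is an assumption about their layout, not a description of every message `ldlm_resource.c` can emit.

```python
import re

# Pattern derived from the two quoted messages only (an assumption, not a spec).
LEAK_RE = re.compile(
    r"LustreError: .*?:(?P<func>ldlm_resource_\w+)\(\)\)"
    r".*?(?P<res>\[0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:0x[0-9a-fA-F]+\]\.0x[0-9a-fA-F]+)"
    r".*?refcount (?:nonzero \(|= )(?P<refcount>\d+)"
)


def find_leaked_resources(dmesg_text):
    """Return (function, resource, refcount) for each matching message."""
    return [
        (m.group("func"), m.group("res"), int(m.group("refcount")))
        for m in LEAK_RE.finditer(dmesg_text)
    ]


# The two messages from this report, used as sample input.
SAMPLE = (
    "[ 143.575078] LustreError: 3703:0:(ldlm_resource.c:1094:"
    "ldlm_resource_complain()) filter-lustre-OST0000_UUID: namespace resource "
    "[0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount nonzero (1) after lock "
    "cleanup; forcing cleanup.\n"
    "[ 143.578233] LustreError: 3703:0:(ldlm_resource.c:1676:"
    "ldlm_resource_dump()) --- Resource: [0x2:0x0:0x0].0x0 "
    "(ffff8806ab7b6900) refcount = 2\n"
)

if __name__ == "__main__":
    for func, res, refcount in find_leaked_resources(SAMPLE):
        print(f"{func}: resource {res} refcount {refcount}")
```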
Note: the title should probably be updated once we figure out what exactly the issue is.