[LU-12659] change log level for "lvbo_init failed" back to CERROR Created: 12/Aug/19  Updated: 04/Oct/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Gu Zheng (Inactive) Assignee: Gu Zheng (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5702 ldlm_handle_enqueue0()) ### delayed l... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

The log level for LVB initialization failures ("lvbo_init failed") was changed to CDEBUG/DLMTRACE in commit 8739f132 (LU-5042 ldlm: delay filling resource's LVB upon replay). As a result, users are unlikely to notice the failure until they analyze the debug log or an application reports it.



 Comments   
Comment by Gerrit Updater [ 12/Aug/19 ]

Gu Zheng (gzheng@ddn.com) uploaded a new patch: https://review.whamcloud.com/35767
Subject: LU-12659 ldlm: change log level for "lvbo_init failed" back to CERROR
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0bba47c0a7289e32e86cbc296b44ad34d04a3c05

Comment by Andreas Dilger [ 04/Oct/19 ]

Discussion on the patch indicates that this is not very easily fixed:

>>> Do you mean it's possible that the file may not exist when
>>> this is called? and it's a legitimate condition. Could you
>>> please show me an example for that?
>>>
>> Yes. Imagine a situation of two clients. One does a lookup
>> on a file and another does unlink.
>> Both got their lookups with EA data back and then one sends
>> in the destroy (now handled by mds) and the other one sends
>> in the stat/getattr to OST. If the destroy wins, the getattr will
>> get a false alarm because the object is already gone.
>
> Is there any better way to distinguish these two conditions
> (legitimate race or missing objects)? IMO, from the OST side, the
> phenomena are the same: the object is gone.

There's no way to distinguish them that I can think of readily.
But the legitimate race is a lot more common I imagine.

One possibility that comes to mind to distinguish this race condition is if the client sends the MDT parent FID in its RPC (this might need to be added for getattr/glimpse RPCs) then the OSS can verify if the MDT inode still exists or not, like LFSCK does. The filter_fid xattr on the object would be gone at this point, so it would not be usable to determine the parent FID. If the MDT parent inode still exists then this is an error, otherwise it is a harmless race. This would need an extra OSS->MDS GETATTR RPC, which could cause issues if the MDS is offline or busy, but should only happen in the rare case of a missing object.
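
The decision logic in the paragraph above can be sketched roughly as follows. This is only an illustration in Python, not Lustre code: `classify_missing_object`, `Verdict`, and the `mdt_inode_exists` callback (standing in for the extra OSS->MDS GETATTR RPC) are hypothetical names, and the mapping of verdicts to log levels is an assumption about how the result might be used.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    HARMLESS_RACE = "parent gone: getattr raced with unlink"
    MISSING_OBJECT = "parent still exists: object is really missing"
    UNKNOWN = "parent FID unavailable or MDS unreachable"


def classify_missing_object(
    parent_fid: Optional[str],
    mdt_inode_exists: Callable[[str], bool],
) -> Verdict:
    """Classify an lvbo_init failure for an OST object that does not exist.

    parent_fid: MDT parent FID carried in the client RPC (hypothetical field;
        the filter_fid xattr cannot be used because the object is gone).
    mdt_inode_exists: hypothetical stand-in for the OSS->MDS GETATTR RPC;
        assumed to raise ConnectionError if the MDS is offline.
    """
    if parent_fid is None:
        return Verdict.UNKNOWN
    try:
        exists = mdt_inode_exists(parent_fid)
    except ConnectionError:
        # MDS offline or busy: we cannot tell the two cases apart.
        return Verdict.UNKNOWN
    return Verdict.MISSING_OBJECT if exists else Verdict.HARMLESS_RACE


def log_level_for(verdict: Verdict) -> str:
    """Assumed policy: only a confirmed missing object is logged as an error."""
    return "CERROR" if verdict is Verdict.MISSING_OBJECT else "CDEBUG"
```

The `UNKNOWN` case is kept separate so that an unreachable MDS does not turn every race into a console error; whether that is the right policy is an open question.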

Another alternative would be to keep the DLM resource around for a short time after unlink (e.g. a few seconds) to allow distinguishing the "just deleted" case from the "has not existed for a long time" case. This would need some asynchronous process to clean up the resources after a timeout, and would consume memory, so it is probably not ideal.
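
A minimal sketch of this second idea, again in Python for illustration only. `RecentUnlinkCache` and its methods are hypothetical, the 5-second default is an arbitrary stand-in for "a few seconds", and a real implementation would keep the DLM resource itself rather than a separate table.

```python
import time
from typing import Callable, Dict


class RecentUnlinkCache:
    """Remember recently unlinked resources for a short time (hypothetical)."""

    def __init__(
        self,
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps resource ID -> time at which the entry expires.
        self._tombstones: Dict[str, float] = {}

    def record_unlink(self, resource_id: str) -> None:
        """Called when an object is destroyed."""
        self._tombstones[resource_id] = self._clock() + self._ttl

    def was_just_deleted(self, resource_id: str) -> bool:
        """True if the object was unlinked within the last ttl_seconds."""
        expiry = self._tombstones.get(resource_id)
        return expiry is not None and self._clock() < expiry

    def reap_expired(self) -> int:
        """Cleanup pass for the asynchronous reaper; returns entries removed."""
        now = self._clock()
        expired = [rid for rid, exp in self._tombstones.items() if exp <= now]
        for rid in expired:
            del self._tombstones[rid]
        return len(expired)
```

A getattr that finds no object would consult `was_just_deleted()` before deciding how loudly to log. The memory cost noted above shows up here as the size of the tombstone table between `reap_expired()` passes.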
