[LU-12659] change log level for "lvbo_init failed" back to CERROR

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • Upstream

    Description

      The log level of the LVB initialization failure message ("lvbo_init failed") was changed to CDEBUG/DLMTRACE in commit 8739f132 (LU-5042 ldlm: delay filling resource's LVB upon replay). That change makes the failure hard for users to notice until they analyze the debug log or an application reports it.

          Activity

            adilger Andreas Dilger added a comment (edited)

            Discussion on the patch indicates that this is not easily fixed:

            >>> Do you mean it's possible that the file may not exist when
            >>> this is called? and it's a legitimate condition. Could you
            >>> please show me an example for that?
            >>>
            >> Yes. Imagine a situation of two clients. One does a lookup
            >> on a file and another does unlink.
            >> Both got their lookups with EA data back and then one sends
            >> in the destroy (now handled by mds) and the other one sends
            >> in the stat/getattr to OST. If the destroy wins, the getattr will
            >> get a false alarm because the object is already gone.
            >
            > Is there any better way to distinguish these two conditions
            > (legitimate race or objects missing)? IMO, from the OST side, the
            > phenomena are the same: the object is gone.

            There's no way to distinguish them that I can think of readily.
            But the legitimate race is a lot more common I imagine.

            One possibility for distinguishing this race is for the client to send the MDT parent FID in its RPC (this might need to be added for getattr/glimpse RPCs), so that the OSS can verify whether the MDT inode still exists, as LFSCK does. The filter_fid xattr on the object would be gone at this point, so it could not be used to determine the parent FID. If the MDT parent inode still exists then the missing object is an error; otherwise it is a harmless race. This would need an extra OSS->MDS GETATTR RPC, which could cause issues if the MDS is offline or busy, but it should only happen in the rare case of a missing object.
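The parent-FID check described above could be modelled roughly as follows. This is a user-space sketch only, not Lustre code: `struct sketch_fid`, `struct sketch_mdt`, `sketch_mdt_lookup()` and `classify_missing_object()` are hypothetical names invented for illustration, and the callback merely stands in for the extra OSS->MDS GETATTR RPC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified FID; not the real Lustre structure. */
struct sketch_fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Possible outcomes when an OST object turns out to be missing. */
enum missing_verdict {
	MISSING_HARMLESS_RACE, /* parent inode is gone too: unlink won the race */
	MISSING_REAL_ERROR,    /* parent inode still exists: the object was lost */
	MISSING_UNKNOWN        /* the MDS could not be asked (offline or busy) */
};

/*
 * Stand-in for the OSS->MDS GETATTR RPC (an assumption of this sketch).
 * Returns 1 if the parent exists, 0 if it does not, negative on failure.
 */
typedef int (*parent_lookup_fn)(const struct sketch_fid *parent, void *arg);

/* Toy MDT: a table of parent FIDs that still exist, plus an "online" flag. */
struct sketch_mdt {
	const struct sketch_fid *fids;
	size_t count;
	bool online;
};

int sketch_mdt_lookup(const struct sketch_fid *parent, void *arg)
{
	const struct sketch_mdt *mdt = arg;
	size_t i;

	assert(mdt != NULL);
	if (!mdt->online)
		return -1; /* models an unreachable MDS */

	for (i = 0; i < mdt->count; i++) {
		if (mdt->fids[i].seq == parent->seq &&
		    mdt->fids[i].oid == parent->oid &&
		    mdt->fids[i].ver == parent->ver)
			return 1;
	}
	return 0;
}

/* Decide how to treat a missing object, given the parent FID from the client. */
enum missing_verdict classify_missing_object(const struct sketch_fid *parent,
					     parent_lookup_fn lookup, void *arg)
{
	int rc;

	if (parent == NULL || lookup == NULL)
		return MISSING_UNKNOWN;

	rc = lookup(parent, arg);
	if (rc < 0)
		return MISSING_UNKNOWN;

	return rc > 0 ? MISSING_REAL_ERROR : MISSING_HARMLESS_RACE;
}
```

In such a scheme only `MISSING_REAL_ERROR` would warrant a console error; how to log `MISSING_UNKNOWN` is left open here, since the comment only notes that an offline or busy MDS could cause issues.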

            Another alternative would be to keep the DLM resource around for a short time after unlink (e.g. a few seconds) to allow distinguishing the "just deleted" case from the "has not existed for a long time" case. This would need some asynchronous process to clean up the resources after a timeout, and would consume memory, so it is probably not ideal.
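The "keep it around briefly" idea can be illustrated with a small tombstone table. This is a user-space sketch under stated assumptions, not the DLM resource code: the names, the 5-second TTL and the 8-slot bound are all hypothetical, and time is passed in explicitly so that `tombstone_expire()` can stand in for the asynchronous cleanup process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOMBSTONE_SLOTS   8 /* hypothetical bound on memory use */
#define TOMBSTONE_TTL_SEC 5 /* hypothetical "few seconds" window */

struct tombstone {
	uint64_t object_id;
	uint64_t destroyed_at; /* seconds, from any monotonic clock */
	bool used;
};

struct tombstone_cache {
	struct tombstone slots[TOMBSTONE_SLOTS];
};

void tombstone_cache_init(struct tombstone_cache *cache)
{
	size_t i;

	assert(cache != NULL);
	for (i = 0; i < TOMBSTONE_SLOTS; i++)
		cache->slots[i].used = false;
}

/* Models the asynchronous cleanup: drop entries older than the TTL. */
void tombstone_expire(struct tombstone_cache *cache, uint64_t now)
{
	size_t i;

	assert(cache != NULL);
	for (i = 0; i < TOMBSTONE_SLOTS; i++) {
		struct tombstone *t = &cache->slots[i];

		if (t->used && now - t->destroyed_at >= TOMBSTONE_TTL_SEC)
			t->used = false;
	}
}

/* Called at unlink/destroy time: remember that the object was just removed. */
void tombstone_record(struct tombstone_cache *cache, uint64_t object_id,
		      uint64_t now)
{
	struct tombstone *victim;
	size_t i;

	assert(cache != NULL);
	tombstone_expire(cache, now);

	/* Prefer a free slot; otherwise overwrite the oldest entry. */
	victim = &cache->slots[0];
	for (i = 0; i < TOMBSTONE_SLOTS; i++) {
		struct tombstone *t = &cache->slots[i];

		if (!t->used) {
			victim = t;
			break;
		}
		if (t->destroyed_at < victim->destroyed_at)
			victim = t;
	}

	victim->object_id = object_id;
	victim->destroyed_at = now;
	victim->used = true;
}

/* True if the missing object was destroyed within the TTL window. */
bool tombstone_recently_destroyed(const struct tombstone_cache *cache,
				  uint64_t object_id, uint64_t now)
{
	size_t i;

	assert(cache != NULL);
	for (i = 0; i < TOMBSTONE_SLOTS; i++) {
		const struct tombstone *t = &cache->slots[i];

		if (t->used && t->object_id == object_id &&
		    now - t->destroyed_at < TOMBSTONE_TTL_SEC)
			return true;
	}
	return false;
}
```

A lookup that finds the object missing would consult `tombstone_recently_destroyed()`: a hit suggests the harmless unlink race, a miss suggests an object that has not existed for a long time. The fixed slot count makes the memory cost mentioned above explicit, at the price of forgetting deletions early under a burst of unlinks.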


            gerrit Gerrit Updater added a comment

            Gu Zheng (gzheng@ddn.com) uploaded a new patch: https://review.whamcloud.com/35767
            Subject: LU-12659 ldlm: change log level for "lvbo_init failed" back to CERROR
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0bba47c0a7289e32e86cbc296b44ad34d04a3c05

            People

              guzheng Gu Zheng (Inactive)
              guzheng Gu Zheng (Inactive)