Lustre / LU-6664

(ost_handler.c:1765:ost_blocking_ast()) Error -2 syncing data on lock cancel

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.5.3
    • 3

    Description

      On all of our filesystems, the following error message is extremely common:

      LustreError: 8746:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
      

      There is nothing else in the logs that gives any hint as to why this message is appearing.

      Our filesystems all use osd-zfs, and we are currently running Lustre 2.5.3-5chaos (see github.com/chaos/lustre).

      If this is a symptom of a bug, then please fix it. If this is not a symptom of a bug, then please stop scaring our system administrators with this message.

          Activity

            green Oleg Drokin added a comment -

            As I was reviewing this patch (the comments are in the patch), I remembered that LLNL had suspicions of double-referenced objects in the past (LU-5648), where the same object was potentially referenced twice (that was never confirmed, though).
            So having objects owned by two files would most likely lead to this message too.

            I wonder if lfsck in 2.5.4 is already in a good enough shape to be able to detect that.

            bobijam Zhenyu Xu added a comment -

            http://review.whamcloud.com/15167

            commit message
            LU-6664 ofd: LDLM lock should cover object destroy 
            
            The exclusive PW lock protecting OST object destroy should be 
            released after object destroy procedure. 
            
            Quench error messages of object unavailability when trying to cancel 
            a LDLM lock. 
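
            The second half of the commit message (quenching the message when the object is unavailable) can be illustrated with a minimal sketch. This is not the code from the patch: `sync_error_worth_logging` is a hypothetical helper, and the assumption that "object unavailability" means exactly -ENOENT is taken from the "Error -2" in the reported message.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not a Lustre function.
 * Decide whether the result of a sync-on-cancel deserves an error message.
 * Assumption: a missing object (-ENOENT) means the object was destroyed
 * under the lock, so there was nothing to flush and nothing to report.
 */
bool sync_error_worth_logging(int rc)
{
    return rc != 0 && rc != -ENOENT;
}
```

            With a check like this, a cancel that finds the object already gone stays silent, while other failures such as -EIO are still reported.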
            

            morrone Christopher Morrone (Inactive) added a comment -

            No, it has not just undergone recovery. There are no other messages in the logs surrounding these ost_blocking_ast() messages.

            bobijam Zhenyu Xu added a comment -

            Chris,

            Did your system undergo recovery just before these messages appeared?

            green Oleg Drokin added a comment -

            Bobi: This message is ENOENT when we are supposedly trying to flush data on lock cancel. But technically if we have a lock, the object should be there (the only exception I can think of is the actual object destroy is happening under the lock, so at the end of destroy the lock is still there and the object is not, but then there should be nothing to flush).
            So can you please examine server side code for lock cancel to see if there are any possible races that could lead to this message.
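
            The exception described above (the object is destroyed under the lock, so the later cancel finds no object) can be modelled with a toy sketch. The `toy_*` types and functions are hypothetical and are not Lustre structures or APIs; the sketch only assumes that a sync against a destroyed object returns -ENOENT, which matches the "Error -2" in the message.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy object; not a Lustre structure. */
struct toy_object {
    bool   exists;      /* false once the object has been destroyed */
    size_t dirty_bytes; /* data still waiting to be flushed */
};

/* Flush dirty data; fails with -ENOENT if the object is gone. */
int toy_sync(struct toy_object *obj)
{
    if (!obj->exists)
        return -ENOENT;
    obj->dirty_bytes = 0;
    return 0;
}

/* Destroy the object while the lock is still held; dirty data is discarded. */
void toy_destroy(struct toy_object *obj)
{
    obj->dirty_bytes = 0;
    obj->exists = false;
}

/* What a blocking callback does on cancel in this model: sync and return the result. */
int toy_cancel_lock(struct toy_object *obj)
{
    return toy_sync(obj);
}
```

            In this model, calling `toy_cancel_lock()` after `toy_destroy()` returns -ENOENT even though `dirty_bytes` is already zero, which is the harmless case. Any other path to -ENOENT while the lock is held would point to a real race.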

            pjones Peter Jones added a comment -

            Bobijam

            Could you please look into this issue?

            Thanks

            Peter


            People

              bobijam Zhenyu Xu
              morrone Christopher Morrone (Inactive)