Lustre / LU-6664

(ost_handler.c:1765:ost_blocking_ast()) Error -2 syncing data on lock cancel

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.5.3

    Description

      On all of our filesystems, the following error message is extremely common:

      LustreError: 8746:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
      

      There is nothing else in the logs that gives any hint as to why this message is appearing.

      Our filesystems all use osd-zfs, and we are currently running Lustre 2.5.3-5chaos (see github.com/chaos/lustre).

      If this is a symptom of a bug, then please fix it. If this is not a symptom of a bug, then please stop scaring our system administrators with this message.

      Activity

            pjones Peter Jones added a comment -

            As per LLNL, OK to close.


            ahkumar Amit (Inactive) added a comment -

            I too see tons of these error messages; any help in resolving them would be very helpful. I can provide debug logs if required. The message appears consistently on most of the OSSs.

            LustreError: 7577:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
            LustreError: 25058:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
            LustreError: 1634:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
            LustreError: 1634:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 1 previous similar message
            LustreError: 25058:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
            LustreError: 25058:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 2 previous similar messages
            LustreError: 33552:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel

            Thank you,
            Amit

            adilger Andreas Dilger added a comment -

            Oleg and I looked into this issue more closely, and the current patch doesn't really solve the problem, since the race occurs while the two destroy threads are getting and dropping the DLM lock, not while the actual destroy is happening. In master, the equivalent function tgt_blocking_ast() already has a check for dt_object_exists() and skips the call into ofd_sync() that generates this message completely.

            I think the right fix (for 2.5.x only) is to just skip this message for rc == -ENOENT, as is already done in master.

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15997
            Subject: LU-6664 ofd: LDLM lock should cover object destroy
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4575d04887bfd2a78a5a0340841d2da6ef23c165

            adilger Andreas Dilger added a comment -

            Bobijam, can you please also make a version of your patch for master?

            adilger Andreas Dilger added a comment -

            The LFSCK in 2.5.x does not check MDT<->OST consistency. That feature ("lctl lfsck_start -t layout") wasn't added until 2.6.0.
            green Oleg Drokin added a comment -

            Just as I was reviewing this patch (the comments are in the patch), I remembered that LLNL had suspicions of double-referenced objects in the past (LU-5648), where the same object was potentially referenced twice (that was never confirmed, though).
            So having objects owned by two files would most likely lead to this message too.

            I wonder if lfsck in 2.5.4 is already in a good enough shape to be able to detect that.

            bobijam Zhenyu Xu added a comment -

            http://review.whamcloud.com/15167

            commit message
            LU-6664 ofd: LDLM lock should cover object destroy 
            
            The exclusive PW lock protecting OST object destroy should be 
            released after object destroy procedure. 
            
            Quench error messages of object unavailability when trying to cancel 
            a LDLM lock. 
            

            morrone Christopher Morrone (Inactive) added a comment -

            No, it has not just undergone recovery. There are no other messages in the logs surrounding these ost_blocking_ast() messages.
            bobijam Zhenyu Xu added a comment -

            Chris,

            Did your system just undergo recovery before these messages appeared?


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 10
