[LU-6664] (ost_handler.c:1765:ost_blocking_ast()) Error -2 syncing data on lock cancel

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.5.3
    • 3

    Description

      On all of our filesystems, the following error message is extremely common:

      LustreError: 8746:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
      

      There is nothing else in the logs that gives any hint as to why this message is appearing.

      Our filesystems all use osd-zfs, and we are currently running Lustre 2.5.3-5chaos (see github.com/chaos/lustre).

      If this is a symptom of a bug, then please fix it. If this is not a symptom of a bug, then please stop scaring our system administrators with this message.
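      For readers decoding the message: the -2 is a negated Linux errno value, and errno 2 is ENOENT ("No such file or directory"). The following is a minimal Python sketch for pulling the fields out of such a line. It assumes the console line has exactly the shape quoted in this ticket; the regular expression and the `decode_lustre_error` helper are illustrative names, not part of Lustre.

```python
import errno
import os
import re

# Assumed shape of the console line quoted in this ticket; other Lustre
# messages may be formatted differently.
LINE_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"Error (?P<rc>-?\d+) (?P<text>.*)"
)


def decode_lustre_error(line):
    """Return the parsed fields plus the symbolic errno name, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "pid": int(m.group("pid")),
        "function": m.group("func"),
        "rc": rc,
        # Kernel code returns negated errno values, so look up abs(rc).
        "errno_name": errno.errorcode.get(abs(rc), "UNKNOWN"),
        "meaning": os.strerror(abs(rc)),
    }


sample = ("LustreError: 8746:0:(ost_handler.c:1776:ost_blocking_ast()) "
          "Error -2 syncing data on lock cancel")
info = decode_lustre_error(sample)
print(info["errno_name"], "-", info["meaning"])
```

      The errno name alone does not say which object was missing; it only confirms that the sync on lock cancel failed because something no longer existed.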

          Activity

            pjones Peter Jones added a comment -

            As per LLNL ok to close


            ahkumar Amit (Inactive) added a comment -

            I too see tons of these error messages, consistently on most of our OSSs. Any help in resolving them would be appreciated. I can provide debug logs if required.

            LustreError: 7577:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
            LustreError: 25058:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
            LustreError: 1634:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
            LustreError: 1634:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 1 previous similar message
            LustreError: 25058:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
            LustreError: 25058:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 2 previous similar messages
            LustreError: 33552:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel

            Thank you,
            Amit
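            To gauge how often the message really fires, the "Skipped N previous similar message(s)" lines have to be counted as well as the printed ones. A rough Python sketch follows, assuming each "Skipped N" line stands for N further occurrences suppressed by console rate limiting; the `summarize` helper and both regular expressions are hypothetical and written only against the excerpt above.

```python
import re

# Patterns written against the excerpt in this comment only.
ERROR_RE = re.compile(r"LustreError: \d+:\d+:\(.*\) Error (-?\d+) ")
SKIPPED_RE = re.compile(
    r"LustreError: \d+:\d+:\(.*\) Skipped (\d+) previous similar message"
)


def summarize(lines):
    """Return (printed, skipped) counts for the rate-limited error lines."""
    printed = 0
    skipped = 0
    for line in lines:
        m = SKIPPED_RE.search(line)
        if m:
            skipped += int(m.group(1))
        elif ERROR_RE.search(line):
            printed += 1
    return printed, skipped


SAMPLE = [
    "LustreError: 7577:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel",
    "LustreError: 25058:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel",
    "LustreError: 1634:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel",
    "LustreError: 1634:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 1 previous similar message",
    "LustreError: 25058:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel",
    "LustreError: 25058:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 2 previous similar messages",
    "LustreError: 33552:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel",
]

printed, skipped = summarize(SAMPLE)
print(f"printed={printed} skipped={skipped} total={printed + skipped}")
```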

            adilger Andreas Dilger added a comment -

            Oleg and I looked into this issue more closely, and the current patch doesn't really solve the problem, since the race is when the two destroy threads are getting and dropping the DLM lock, and not when the actual destroy is happening. In master, the equivalent function tgt_blocking_ast() already has a check for dt_object_exists() and skips the call into ofd_sync() that generates this message completely.

            I think the right fix (for 2.5.x only) is to just skip this message for rc == -ENOENT as is already done in master.
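            The two behaviours described in this comment can be modelled in a few lines. This is a Python model of the control flow only, not the Lustre C code: `should_log_sync_error`, `blocking_ast_model`, and the `sync` callable are hypothetical names standing in for the logic attributed to ost_blocking_ast() in 2.5.x and tgt_blocking_ast() in master.

```python
import errno


def should_log_sync_error(rc):
    """Model of the proposed 2.5.x change: stay quiet on success and on -ENOENT."""
    return rc != 0 and rc != -errno.ENOENT


def blocking_ast_model(object_exists, sync):
    """Model of the master behaviour as described in the comment.

    `sync` is a hypothetical callable returning 0 or a negated errno.
    Returns the message that would be logged, or None if nothing is logged.
    """
    if not object_exists:
        # Object already destroyed: skip the sync entirely, so no error
        # can be produced for it.
        return None
    rc = sync()
    if should_log_sync_error(rc):
        return f"Error {rc} syncing data on lock cancel"
    return None


# An object destroyed by a racing thread produces no message either way.
print(blocking_ast_model(False, lambda: -errno.ENOENT))
print(blocking_ast_model(True, lambda: -errno.ENOENT))
# A genuine I/O failure is still reported.
print(blocking_ast_model(True, lambda: -errno.EIO))
```

            Under this model only failures other than "object no longer exists" reach the log, which matches the intent stated above of keeping the message for real errors.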

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15997
            Subject: LU-6664 ofd: LDLM lock should cover object destroy
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4575d04887bfd2a78a5a0340841d2da6ef23c165

            adilger Andreas Dilger added a comment -

            Bobijam, can you please also make a version of your patch for master?

            adilger Andreas Dilger added a comment -

            The LFSCK in 2.5.x does not check MDT<->OST consistency. That feature ("lctl lfsck_start -t layout") wasn't added until 2.6.0.

            People

              bobijam Zhenyu Xu
              morrone Christopher Morrone (Inactive)