
    Description

      Unfortunately, the problem is not solved, even with the patch http://review.whamcloud.com/#/c/15841/
      I propose reopening the topic from LU-6944.
      The system restarts unexpectedly with the following errors:
      Message from syslogd@hard at Aug 13 8:56:38 ...
        kernel: LustreError: 2796:0:(osp_sync.c:1139:osp_sync_thread()) ASSERTION(thread->t_flags != SVC_RUNNING) failed: 684 changes, 1137 in progress, 7 in flight
      Message from syslogd@hard at Aug 13 8:56:38 ...
        kernel: LustreError: 2796:0:(osp_sync.c:1139:osp_sync_thread()) LBUG

          Activity

            [LU-7001] osp_sync.c: 1139: osp_sync_thread

            Gerrit Updater added a comment:

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32097/
            Subject: LU-7001 osp: fix llog processing
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 10cc97e3c1487692b460702bf46220b1acb452ee

            Gerrit Updater added a comment:

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32097
            Subject: LU-7001 osp: fix llog processing
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: bf6768549dfa09711daa66ccbf9db766c9f074f6

            Peter Jones added a comment:

            Landed for 2.11

            Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26132/
            Subject: LU-7001 osp: fix llog processing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8da9fb0cf14cc79bf1985d144d0a201e136dfe51

            Gerrit Updater added a comment:

            Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/26359
            Subject: LU-7001 tests: check osp_sync_thread for wrapped llog
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8d779dab1025cd46ac44fc80f4dbd5ac85cba8a8

            Gerrit Updater added a comment:

            Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/26132
            Subject: LU-7001 osp: fix llog processing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bcebaa04977761773d24fed0821c44fcbd5bef83

            Li Xi (Inactive) added a comment:

            Thank you, Andreas, for the information.

            Do you think it is possible to write a userspace tool that can both read and edit the llog files? I know that llog_reader is being changed, so hopefully we will at least be able to dump the llog file. But since the llog files can be read locally from the MDT/OST ldiskfs, maybe we could also use a tool to remove bad records manually?
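To make the idea of a userspace checker more concrete, here is a minimal sketch in C. It assumes a simplified framing in which every record starts with a header and ends with a tail that repeats the record length and index. The struct names (`rec_hdr`, `rec_tail`), their fields, and the function `count_valid_records` are hypothetical and chosen for illustration; the real on-disk llog format is defined in the Lustre source and is not reproduced here. The sketch also assumes host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record framing (an assumption, not the
 * actual llog layout): a header, a payload, and a tail that repeats
 * the record length and index. */
struct rec_hdr  { uint32_t len; uint32_t index; uint32_t type; uint32_t id; };
struct rec_tail { uint32_t len; uint32_t index; };

/*
 * Walk records packed back to back in buf and return how many
 * consistent records precede the first inconsistent one (or the
 * end of the buffer). A record is consistent when its length fits
 * in the buffer and its tail matches its header.
 */
size_t count_valid_records(const unsigned char *buf, size_t size)
{
	size_t off = 0, count = 0;

	assert(buf != NULL);

	while (size - off >= sizeof(struct rec_hdr) + sizeof(struct rec_tail)) {
		struct rec_hdr hdr;
		struct rec_tail tail;

		/* memcpy avoids unaligned access into the raw buffer */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) + sizeof(tail) || hdr.len > size - off)
			break;  /* length is impossible: stop here */

		memcpy(&tail, buf + off + hdr.len - sizeof(tail), sizeof(tail));
		if (tail.len != hdr.len || tail.index != hdr.index)
			break;  /* header and tail disagree: stop here */

		off += hdr.len;
		count++;
	}
	return count;
}
```

A tool built this way could report the offset of the first bad record instead of editing anything, which is the safer first step before attempting any manual removal.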

            Andreas Dilger added a comment:

            Li Xi, there are a couple of patches in flight that will repair or skip corrupted log records, but more types of corruption may still be found in the future.

            Li Xi (Inactive) added a comment:

            In the end, we worked around this problem by removing the CATALOGS file. I am wondering whether there is any way to check and recover broken llog records...

            Li Xi (Inactive) added a comment:

            We are hitting this issue repeatedly. I guess it will never recover unless we skip recovery or use some trick.

            Can we just remove the assertion? It does not seem appropriate, since the running thread has no idea when it will be asked to stop. Also, in osp_init0(), if ptlrpc_init_import() returned a failure (it never does, at least currently), it seems the assertion would fail. So this assertion looks dangerous.
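The concern above can be pictured with a small sketch in C. It assumes a hypothetical worker whose processing loop may return before anyone has asked it to stop, and it handles that case as a reported error instead of an assertion. The names (`sync_thread`, `sync_thread_finish`, the `THREAD_*` states) are illustrative and are not Lustre's own; this is a sketch of the idea, not the change that landed for this ticket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified thread states; not Lustre's SVC_* flags. */
enum thread_state { THREAD_STOPPED, THREAD_RUNNING, THREAD_STOPPING };

struct sync_thread {
	enum thread_state state;
	int changes;      /* records still queued */
	int in_progress;  /* requests being processed */
	int in_flight;    /* requests sent, not yet acknowledged */
};

/*
 * Called when the log-processing loop returns with result rc.
 * Returns 0 if a stop had been requested, or -1 if the loop ended
 * while the thread was still marked as running. Either way the
 * thread ends up in the stopped state.
 */
int sync_thread_finish(struct sync_thread *t, int rc)
{
	int unexpected;

	/* An assertion is reserved for a programming error (NULL argument),
	 * not for a condition that can occur at run time. */
	assert(t != NULL);

	unexpected = (t->state == THREAD_RUNNING);
	if (unexpected)
		fprintf(stderr,
			"log processing ended while still running: rc=%d, "
			"%d changes, %d in progress, %d in flight\n",
			rc, t->changes, t->in_progress, t->in_flight);

	t->state = THREAD_STOPPED;
	return unexpected ? -1 : 0;
}
```

The trade-off is that an early exit is logged and returned to the caller instead of taking the whole node down, at the cost of possibly hiding the underlying log corruption unless the caller acts on the error.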

            People

              aboyko Alexander Boyko
              Lexsoft Alex (Inactive)