Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Severity: 1
    • Rank (Obsolete): 9223372036854775807

    Description

      When trying to mount the MDT, all osp-syn threads get stuck in 'D' state.

      The debug logs are filled with messages like these:

      00000004:00080000:8.0:1463850740.016156:0:14081:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883ebf799800, transno 0
      00000004:00080000:8.0:1463850740.016164:0:14081:0:(osp_sync.c:351:osp_sync_interpret()) reply req ffff883ebf799800/1, rc -2, transno 0
      00000100:00100000:8.0:1463850740.016176:0:14081:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_3:nbp2-MDT0000-mdtlov_UUID:14081:1534957896521600:10.151.26.98@o2ib:6
      00000004:00080000:9.0:1463850740.016219:0:14087:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883ebed48800, transno 0
      00000004:00080000:9.0:1463850740.016226:0:14087:0:(osp_sync.c:351:osp_sync_interpret()) reply req ffff883ebed48800/1, rc -2, transno 0
      

      I will upload the full debug logs to the FTP site.
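
      To confirm which threads are stuck and capture their kernel stacks, one option is the standard Linux sysrq route (nothing here is specific to this ticket, and sysrq must be enabled):

      # List threads in uninterruptible sleep ('D' state).
      ps -eo pid,stat,comm | awk '$2 ~ /^D/'

      # Dump stack traces of all blocked tasks to the kernel log
      # (assumes sysrq is enabled, e.g. sysctl kernel.sysrq=1).
      echo w > /proc/sysrq-trigger
      dmesg | tail -n 200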

    Activity

            [LU-8177] osp-syn threads in D state

            mhanafi Mahmoud Hanafi added a comment -

            I looked in /O/1/d* and there were files going back to 2015.

            Should I just delete everything in /O/1/* and remount?
            green Oleg Drokin added a comment -

            If you really need to clear the condition immediately, it's possible to unmount the MDT, mount it as ldiskfs, remove the stale llogs, unmount ldiskfs, and remount the MDT as Lustre.
            Perhaps not all of them need removing, just the really old ones (you can tell by the date).

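            A rough sketch of that procedure follows. The device name (/dev/mdtdev), the mountpoints, and the 2016-01-01 cutoff are placeholders, not values from this ticket; review the file list before deleting anything:

            # Stop the Lustre MDT, then mount the same device as plain ldiskfs.
            umount /mnt/lustre-mdt
            mount -t ldiskfs /dev/mdtdev /mnt/mdt-ldiskfs

            # The llog objects live under O/1/d*; the mtime shows which are stale.
            ls -l /mnt/mdt-ldiskfs/O/1/d*

            # Review first, then remove only the clearly old ones.
            find /mnt/mdt-ldiskfs/O/1/d* -type f ! -newermt "2016-01-01" -print
            find /mnt/mdt-ldiskfs/O/1/d* -type f ! -newermt "2016-01-01" -delete

            # Put the MDT back into service.
            umount /mnt/mdt-ldiskfs
            mount -t lustre /dev/mdtdev /mnt/lustre-mdt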
            green Oleg Drokin added a comment -

            Alex advises that the condition will clear on its own after all llogs are reprocessed. The duration is hard to tell, as it depends on the number of those llogs.

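            If you let the recovery run, the backlog can be watched from the MDS via the OSP sync counters (parameter names as found in recent Lustre releases; they may differ on 2.5.x):

            # Records each OSP still has to ship to its OST; these should
            # trend toward zero as the old llogs are consumed.
            lctl get_param osp.*.sync_changes osp.*.sync_in_flight osp.*.sync_in_progress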

            mhanafi Mahmoud Hanafi added a comment -

            This was a remount after a power down.
            The OSTs are mounted.
            The shutdown was normal.

            I unmounted all the OSTs and then the MDT got mounted. Then I remounted the OSTs and the MDT went back to osp-sync in 'D' state, but at least I am able to mount it on the client.

            So do we need to apply the patch from LU-7079 and remount? Or can we somehow stop the osp-sync?

            bzzz Alex Zhuravlev added a comment -

            Basically, some llog cancels got lost by mistake, causing lots of IO to rescan the llogs at startup.
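            To gauge the size of that backlog without mounting the MDT, the llog objects can be inspected read-only with debugfs from e2fsprogs (the device name is a placeholder):

            # -c opens the device read-only; a huge file count in O/1/d*
            # is what drives the heavy IO at startup.
            debugfs -c -R 'ls -l /O/1/d0' /dev/mdtdev | head -n 20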

            bzzz Alex Zhuravlev added a comment -

            I think this can be a dup of LU-7079.
            green Oleg Drokin added a comment -

            Are there any messages in dmesg on the MDS or OSTs?
            Is this a normal mount after a normal shutdown, or a failover after something else?
            Are the OSTs up?


            jfc John Fuchs-Chesney (Inactive) added a comment -

            Assigning to me – Oleg is looking.
            ~ jfc.

            jfc John Fuchs-Chesney (Inactive) added a comment -

            Email from Mahmoud: "Sorry this should be severity1. The production site is down and unusable."

            ~ jfc.

            mhanafi Mahmoud Hanafi added a comment -

            Sorry, it should be level 1.

            jfc John Fuchs-Chesney (Inactive) added a comment -

            Mahmoud,

            Can you please clarify whether you have a production-site-down emergency? We rate that as a SEV-1 event, and you have selected SEV-4.

            Thanks,
            ~ jfc.

            People

              Assignee:
              jfc John Fuchs-Chesney (Inactive)
              Reporter:
              mhanafi Mahmoud Hanafi
              Votes:
              0
              Watchers:
              9
