
[LU-3302] ll_fill_super() Unable to process log: -2

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • Environment: PPC client
    • Severity: 3
    • 8173

    Description

      We updated a client to 2.3.64-4chaos and tried to mount a 2.3.63-6chaos server. The mount fails with:

      LustreError: 15c-8: MGC172.20.20.201@o2ib500: The configuration from log 'fsv-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      LustreError: 14351:0:(llite_lib.c:1043:ll_fill_super()) Unable to process log: -2
      Lustre: Unmounted fsv-client
      LustreError: 14351:0:(obd_mount.c:1265:lustre_fill_super()) Unable to mount  (-2)
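
      The -2 is a negative errno. As a point of reference, here is a minimal sketch in plain C (not Lustre code) showing what it decodes to; the later comments suggest the MGS simply could not find the 'fsv-client' configuration llog:

          #include <errno.h>
          #include <stdio.h>
          #include <string.h>

          /* Lustre error codes are negative errno values; -2 is -ENOENT. */
          int main(void)
          {
                  printf("%d => %s\n", -ENOENT, strerror(ENOENT));
                  /* prints: -2 => No such file or directory */
                  return 0;
          }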
      

      Using git bisect, I found that the mount failure was introduced by this patch:

      http://review.whamcloud.com/#change,5820

      LU-2684 fid: unify ostid and FID
      

      The critical questions at this point are:

      • Can we solve this problem by updating both server and client to 2.3.64-4chaos?
      • Can we safely upgrade the server, or does the above patch introduce on-disk format incompatibilities?
      • Will we be able to safely revert the server to 2.3.63 in case we find problems, or will it write new objects in an incompatible format?

      LLNL-bug-id: TOSS-2060
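
      For context on the questions above, here is a rough sketch in C of the unified ost_id/llog_logid layout that LU-2684 introduces (declarations paraphrased from the Lustre headers of that era; treat the field names as illustrative rather than exact). The same 16 bytes can be read either as a legacy (id, seq) pair or as a FID, and a big-endian client and a little-endian server must agree on how those bytes are swabbed:

          #include <linux/types.h>         /* __u32, __u64 */

          struct lu_fid {                  /* 128-bit file identifier */
                  __u64 f_seq;             /* sequence number */
                  __u32 f_oid;             /* object id within the sequence */
                  __u32 f_ver;             /* version */
          };

          struct ost_id {                  /* "ostid and FID" unified by LU-2684 */
                  union {
                          struct {
                                  __u64 oi_id;   /* legacy object id */
                                  __u64 oi_seq;  /* legacy object sequence */
                          } oi;
                          struct lu_fid oi_fid;  /* same bytes viewed as a FID */
                  };
          };

          struct llog_logid {              /* identifies a llog object, e.g. a config log */
                  struct ost_id lgl_oi;
                  __u32         lgl_ogen;  /* generation */
          } __attribute__((packed));

      At least in this sketch, the union keeps the old 16-byte size, so the interoperability questions are mostly about how each side interprets and byte-swaps those fields rather than about the amount of data written.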

          Activity

            [LU-3302] ll_fill_super() Unable to process log: -2

            jlevi Jodi Levi (Inactive) added a comment - Based on latest comments, this patch landed and has fixed the issue. Closing ticket.

            nedbass Ned Bass (Inactive) added a comment - With the patch, a 2.3.64 PPC client can mount from a 2.3.63 server. So this appears to be fixed. Thanks

            jlevi Jodi Levi (Inactive) added a comment - Now that this patch has landed, can we get confirmation that this is fixed? Thank you!

            nedbass Ned Bass (Inactive) added a comment - Yes these are x86_64 servers and ppc64 clients. Also, if it is an unfixed swabbing bug, I would expect the mount to also fail with 2.3.64 servers.

            jhammond John Hammond added a comment - Ned would you confirm that these are x86_64 servers and ppc/ppc64 clients? In that case it's unlikely that you're affected by LU-3294 since that issue is probably limited to BE servers.
            di.wang Di Wang added a comment - http://review.whamcloud.com/#change,6305
            di.wang Di Wang added a comment -

            Ned, I just checked the debug log; it seems the client gets the correct log ID after swabbing.

            Here is the client log:

            00000040:00000001:5.0:1368040600.989913:5152:8187:0:(llog_swab.c:86:lustre_swab_llogd_body()) Process entered
            00000040:00001000:5.0:1368040600.989914:5328:8187:0:(llog_swab.c:53:print_llogd_body()) llogd body: c000000f50e9a100
            00000040:00001000:5.0:1368040600.989915:5328:8187:0:(llog_swab.c:55:print_llogd_body())         lgd_logid.lgl_oi: 0x6400000000000000:16777216
            00000040:00001000:5.0:1368040600.989915:5328:8187:0:(llog_swab.c:56:print_llogd_body())         lgd_logid.lgl_ogen: 0x0
            00000040:00001000:5.0:1368040600.989916:5328:8187:0:(llog_swab.c:57:print_llogd_body())         lgd_ctxt_idx: 0x0
            00000040:00001000:5.0:1368040600.989917:5328:8187:0:(llog_swab.c:58:print_llogd_body())         lgd_llh_flags: 0x0
            00000040:00001000:5.0:1368040600.989917:5328:8187:0:(llog_swab.c:59:print_llogd_body())         lgd_index: 0x0
            00000040:00001000:5.0:1368040600.989918:5328:8187:0:(llog_swab.c:60:print_llogd_body())         lgd_saved_index: 0x0
            00000040:00001000:5.0:1368040600.989918:5328:8187:0:(llog_swab.c:61:print_llogd_body())         lgd_len: 0x0
            00000040:00001000:5.0:1368040600.989919:5328:8187:0:(llog_swab.c:62:print_llogd_body())         lgd_cur_offset: 0x0
            00000040:00001000:5.0:1368040600.989920:5328:8187:0:(llog_swab.c:53:print_llogd_body()) llogd body: c000000f50e9a100
            00000040:00001000:5.0:1368040600.989920:5328:8187:0:(llog_swab.c:55:print_llogd_body())         lgd_logid.lgl_oi: 0x64:1
            00000040:00001000:5.0:1368040600.989921:5328:8187:0:(llog_swab.c:56:print_llogd_body())         lgd_logid.lgl_ogen: 0x0
            00000040:00001000:5.0:1368040600.989921:5328:8187:0:(llog_swab.c:57:print_llogd_body())         lgd_ctxt_idx: 0x0
            00000040:00001000:5.0:1368040600.989922:5328:8187:0:(llog_swab.c:58:print_llogd_body())         lgd_llh_flags: 0x0
            00000040:00001000:5.0:1368040600.989923:5328:8187:0:(llog_swab.c:59:print_llogd_body())         lgd_index: 0x0
            00000040:00001000:5.0:1368040600.989923:5328:8187:0:(llog_swab.c:60:print_llogd_body())         lgd_saved_index: 0x0
            00000040:00001000:5.0:1368040600.989924:5328:8187:0:(llog_swab.c:61:print_llogd_body())         lgd_len: 0x0
            00000040:00001000:5.0:1368040600.989924:5328:8187:0:(llog_swab.c:62:print_llogd_body())         lgd_cur_offset: 0x0
            00000040:00000001:5.0:1368040600.989925:5152:8187:0:(llog_swab.c:97:lustre_swab_llogd_body()) Process leaving
            

            But somehow the server cannot find the log object by this ID. Unfortunately, I cannot find the corresponding MGS handling information in the MDS debug log. Could you please redo the test and upload the debug log?

            In the meantime, I do see there are some problems in the logid swab (John also pointed out one in LU-3294). I will cook up a patch now.
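
            A minimal standalone check (plain C with GCC builtins, not Lustre code; it assumes the a:b values above are printed as seq:id) confirming that the values before and after the swab are byte-swapped views of the same log ID, 0x64:1:

                #include <stdio.h>
                #include <stdint.h>

                int main(void)
                {
                        /* Values as printed before the swab in the log above. */
                        uint64_t seq_raw = 0x6400000000000000ULL;  /* lgl_oi, first field */
                        uint32_t oid_raw = 16777216;               /* 0x01000000, second field */

                        /* Byte-swapping recovers the values printed after the swab. */
                        printf("seq: %#llx -> %#llx\n",
                               (unsigned long long)seq_raw,
                               (unsigned long long)__builtin_bswap64(seq_raw));   /* -> 0x64 */
                        printf("id:  %u -> %u\n", oid_raw, __builtin_bswap32(oid_raw));  /* -> 1 */
                        return 0;
                }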


            nedbass Ned Bass (Inactive) added a comment -

            Haven't tried 2.3.65 yet, but initial testing suggests updating the server to 2.3.64 lets the mount succeed. Here's what I did:

            1. Tried to mount a 2.3.62 server from a 2.3.64 PPC client. Fails with "ll_fill_super() Unable to process log: -2"
            2. Updated the server to 2.3.64. Mount from 2.3.64 PPC client succeeds.

            nedbass Ned Bass (Inactive) added a comment - John, okay, we're getting a test environment set up where I should be able to do that test.
            jhammond John Hammond added a comment - - edited

            Possibly. Please see LU-3294.

            Ned, it would be interesting to know what happens when you create a new 2.3.65 FS on ppc, unmount, and then remount it.


            People

              Assignee: di.wang Di Wang
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 6
