  Lustre / LU-3302

ll_fill_super() Unable to process log: -2

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • Environment: PPC client
    • 3
    • 8173

    Description

      We updated a client to 2.3.64-4chaos and tried to mount a 2.3.63-6chaos server. The mount fails with:

      LustreError: 15c-8: MGC172.20.20.201@o2ib500: The configuration from log 'fsv-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      LustreError: 14351:0:(llite_lib.c:1043:ll_fill_super()) Unable to process log: -2
      Lustre: Unmounted fsv-client
      LustreError: 14351:0:(obd_mount.c:1265:lustre_fill_super()) Unable to mount  (-2)
      

      Using git bisect, I found that the mount failure was introduced by this patch:

      http://review.whamcloud.com/#change,5820

      LU-2684 fid: unify ostid and FID
      

      The critical questions at this point are:

      • Can we solve this problem by updating both server and client to 2.3.64-4chaos?
      • Can we safely upgrade the server, or does the above patch introduce on-disk format incompatibilities?
      • Will we be able to safely revert the server to 2.3.63 in case we find problems, or will it write new objects in an incompatible format?

      LLNL-bug-id: TOSS-2060

        Activity


            nedbass Ned Bass (Inactive) added a comment -

            With the patch, a 2.3.64 PPC client can mount from a 2.3.63 server, so this appears to be fixed. Thanks.

            jlevi Jodi Levi (Inactive) added a comment -

            Now that this patch has landed, can we get confirmation that this is fixed?
            Thank you!

            nedbass Ned Bass (Inactive) added a comment -

            Yes, these are x86_64 servers and ppc64 clients.

            Also, if it is an unfixed swabbing bug, I would expect the mount to also fail with 2.3.64 servers.

            jhammond John Hammond added a comment -

            Ned, would you confirm that these are x86_64 servers and ppc/ppc64 clients? In that case it's unlikely that you're affected by LU-3294, since that issue is probably limited to BE servers.
            di.wang Di Wang added a comment - http://review.whamcloud.com/#change,6305
            di.wang Di Wang added a comment -

            Ned, I just checked the debug log; it seems the client gets the correct log ID after swabbing.

            Here is the client log:

            00000040:00000001:5.0:1368040600.989913:5152:8187:0:(llog_swab.c:86:lustre_swab_llogd_body()) Process entered
            00000040:00001000:5.0:1368040600.989914:5328:8187:0:(llog_swab.c:53:print_llogd_body()) llogd body: c000000f50e9a100
            00000040:00001000:5.0:1368040600.989915:5328:8187:0:(llog_swab.c:55:print_llogd_body())         lgd_logid.lgl_oi: 0x6400000000000000:16777216
            00000040:00001000:5.0:1368040600.989915:5328:8187:0:(llog_swab.c:56:print_llogd_body())         lgd_logid.lgl_ogen: 0x0
            00000040:00001000:5.0:1368040600.989916:5328:8187:0:(llog_swab.c:57:print_llogd_body())         lgd_ctxt_idx: 0x0
            00000040:00001000:5.0:1368040600.989917:5328:8187:0:(llog_swab.c:58:print_llogd_body())         lgd_llh_flags: 0x0
            00000040:00001000:5.0:1368040600.989917:5328:8187:0:(llog_swab.c:59:print_llogd_body())         lgd_index: 0x0
            00000040:00001000:5.0:1368040600.989918:5328:8187:0:(llog_swab.c:60:print_llogd_body())         lgd_saved_index: 0x0
            00000040:00001000:5.0:1368040600.989918:5328:8187:0:(llog_swab.c:61:print_llogd_body())         lgd_len: 0x0
            00000040:00001000:5.0:1368040600.989919:5328:8187:0:(llog_swab.c:62:print_llogd_body())         lgd_cur_offset: 0x0
            00000040:00001000:5.0:1368040600.989920:5328:8187:0:(llog_swab.c:53:print_llogd_body()) llogd body: c000000f50e9a100
            00000040:00001000:5.0:1368040600.989920:5328:8187:0:(llog_swab.c:55:print_llogd_body())         lgd_logid.lgl_oi: 0x64:1
            00000040:00001000:5.0:1368040600.989921:5328:8187:0:(llog_swab.c:56:print_llogd_body())         lgd_logid.lgl_ogen: 0x0
            00000040:00001000:5.0:1368040600.989921:5328:8187:0:(llog_swab.c:57:print_llogd_body())         lgd_ctxt_idx: 0x0
            00000040:00001000:5.0:1368040600.989922:5328:8187:0:(llog_swab.c:58:print_llogd_body())         lgd_llh_flags: 0x0
            00000040:00001000:5.0:1368040600.989923:5328:8187:0:(llog_swab.c:59:print_llogd_body())         lgd_index: 0x0
            00000040:00001000:5.0:1368040600.989923:5328:8187:0:(llog_swab.c:60:print_llogd_body())         lgd_saved_index: 0x0
            00000040:00001000:5.0:1368040600.989924:5328:8187:0:(llog_swab.c:61:print_llogd_body())         lgd_len: 0x0
            00000040:00001000:5.0:1368040600.989924:5328:8187:0:(llog_swab.c:62:print_llogd_body())         lgd_cur_offset: 0x0
            00000040:00000001:5.0:1368040600.989925:5152:8187:0:(llog_swab.c:97:lustre_swab_llogd_body()) Process leaving
            

            But somehow the server cannot find the log object by this ID. Unfortunately, I cannot find the corresponding MGS handling information in the MDS debug log. Could you please redo the test and update the debug log?

            In the meantime, I do see some problems during the logid swab (John also pointed out one in LU-3294). I will cook up a patch now.


            nedbass Ned Bass (Inactive) added a comment -

            Haven't tried 2.3.65 yet, but initial testing suggests updating the server to 2.3.64 lets the mount succeed. Here's what I did:

            1. Tried to mount a 2.3.62 server from a 2.3.64 PPC client. Fails with "ll_fill_super() Unable to process log: -2"
            2. Updated the server to 2.3.64. Mount from the 2.3.64 PPC client succeeds.

            nedbass Ned Bass (Inactive) added a comment -

            John, okay, we're getting a test environment set up where I should be able to do that test.
            jhammond John Hammond added a comment - edited

            Possibly. Please see LU-3294.

            Ned, it would be interesting to know what happens when you create a new 2.3.65 FS on ppc, unmount, and then remount it.

            adilger Andreas Dilger added a comment -

            John, it was mentioned to me that you have already found some endian issues with the FID-on-OST code. Could you please point out where they are? They might be the source of the problem being seen here, since we didn't see any interoperability problems with our x86_64 clients.

            nedbass Ned Bass (Inactive) added a comment -

            Yes, it is. Sorry for the omission.

            People

              Assignee: di.wang Di Wang
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 6