Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.8.0
    • Lustre 1.8.x (1.8.0 - 1.8.5), Lustre 2.1.6
    • None
    • 3
    • 9223372036854775807

    Description

      We hit a problem when using a SPARC client against an x86_64 Lustre server (the server is running 2.1.6):

      Call Trace:
      [<ffffffff886ed601>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      [<ffffffff886edb08>] lbug_with_loc+0x48/0x90 [libcfs]
      [<ffffffff89032f87>] filter_cancel_cookies_cb+0x1e7/0x5a0 [obdfilter]
      [<ffffffff88e2b7c7>] fsfilt_ldiskfs_cb_func+0x17/0x160 [fsfilt_ldiskfs]
      [<ffffffff88d46b00>] jbd2_journal_commit_transaction+0xbb8/0x1120 [jbd2]
      [<ffffffff8003dddd>] lock_timer_base+0x1b/0x3c
      [<ffffffff88d4a2c3>] kjournald2+0x9a/0x1ec [jbd2]
      [<ffffffff800a3cdf>] autoremove_wake_function+0x0/0x2e
      [<ffffffff88d4a229>] kjournald2+0x0/0x1ec [jbd2]
      [<ffffffff800a3ac7>] keventd_create_kthread+0x0/0xc4
      [<ffffffff80032c4f>] kthread+0xfe/0x132
      [<ffffffff8005dfc1>] child_rip+0xa/0x11
      [<ffffffff800a3ac7>] keventd_create_kthread+0x0/0xc4
      [<ffffffff80032b51>] kthread+0x0/0x132
      [<ffffffff8005dfb7>] child_rip+0x0/0x11

      Kernel panic - not syncing: LBUG

      An x86_64 client running the same Lustre version can mount without problems. The cause
      seems to be that o_lcookie is not swabbed properly. After we applied a fix, the SPARC
      client could mount the server without problems.

      I still cannot see where the master branch fixes this problem, so I think the latest
      master branch also has it.
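
As a rough illustration of what swabbing a cookie-sized structure involves when a big-endian peer talks to a little-endian one, here is a minimal sketch. The `struct demo_cookie` layout and all `demo_*` helper names are hypothetical stand-ins, not the Lustre definitions; only the 32-byte total size is taken from the wiretest check quoted in this ticket.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-byte log cookie; field names are assumptions. */
struct demo_cookie {
        uint64_t oid;
        uint64_t seq;
        uint32_t gen;
        uint32_t subsys;
        uint32_t index;
        uint32_t padding;
};

_Static_assert(sizeof(struct demo_cookie) == 32, "demo cookie must be 32 bytes");

/* Reverse the byte order of a 32-bit value. */
static uint32_t demo_swab32(uint32_t v)
{
        return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
               ((v >> 8) & 0x0000ff00u) | (v >> 24);
}

/* Reverse the byte order of a 64-bit value by swapping and swabbing both halves. */
static uint64_t demo_swab64(uint64_t v)
{
        return ((uint64_t)demo_swab32((uint32_t)v) << 32) |
               demo_swab32((uint32_t)(v >> 32));
}

/* Swab every multi-byte field in place; calling it twice restores the original. */
static void demo_cookie_swab(struct demo_cookie *c)
{
        c->oid = demo_swab64(c->oid);
        c->seq = demo_swab64(c->seq);
        c->gen = demo_swab32(c->gen);
        c->subsys = demo_swab32(c->subsys);
        c->index = demo_swab32(c->index);
        c->padding = demo_swab32(c->padding);
}
```

If a field such as `index` is skipped by the swab routine, a value of 3 written on one architecture is read as 0x03000000 on the other, which is the kind of out-of-range value that can trip a server-side sanity check.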

        Activity

          [LU-8858] o_lcookie is not swabbed properly

          Problem does not exist in the latest master.

          wangshilong Wang Shilong (Inactive) added a comment

          I checked the code again. In the 1.8 and 2.1 series branches, I can see o_lcookie being passed from clients to the MDS.

          [root@localhost lustre-release]# git grep o_lcookie
          lustre/include/lustre/lustre_idl.h:        struct llog_cookie      o_lcookie;      /* destroy: unlink cookie from MDS */
          lustre/obdclass/obdo.c:                dst->o_lcookie = src->o_lcookie;
          lustre/obdfilter/filter.c:                        *fcc = oa->o_lcookie;
          lustre/obdfilter/filter.c:                        fcc = &oa->o_lcookie;
          lustre/obdfilter/filter.c:                        *fcc = oa->o_lcookie;
          lustre/obdfilter/filter_log.c:        oa->o_lcookie = *cookie;
          lustre/obdfilter/filter_log.c:        oinfo.oi_oa->o_lcookie = *cookie;
          lustre/osc/osc_request.c:                oinfo->oi_oa->o_lcookie = *oti->oti_logcookies;
          lustre/osc/osc_request.c:                        *oti->oti_logcookies = oa->o_lcookie;
          lustre/osc/osc_request.c:                oa->o_lcookie = *oti->oti_logcookies;
          lustre/ost/ost_handler.c:                oti->oti_logcookies = &body->oa.o_lcookie;
          lustre/ost/ost_handler.c:        oti->oti_logcookies = &repbody->oa.o_lcookie;
          lustre/ptlrpc/wiretest.c:        LASSERTF((int)offsetof(struct obdo, o_lcookie) == 136, " found %lld\n",
          lustre/ptlrpc/wiretest.c:                 (long long)(int)offsetof(struct obdo, o_lcookie));
          lustre/ptlrpc/wiretest.c:        LASSERTF((int)sizeof(((struct obdo *)0)->o_lcookie) == 32, " found %lld\n",
          lustre/ptlrpc/wiretest.c:                 (long long)(int)sizeof(((struct obdo *)0)->o_lcookie));
          lustre/utils/wirecheck.c:        CHECK_MEMBER(obdo, o_lcookie);
          lustre/utils/wiretest.c:        LASSERTF((int)offsetof(struct obdo, o_lcookie) == 136, " found %lld\n",
          lustre/utils/wiretest.c:                 (long long)(int)offsetof(struct obdo, o_lcookie));
          lustre/utils/wiretest.c:        LASSERTF((int)sizeof(((struct obdo *)0)->o_lcookie) == 32, " found %lld\n",
          lustre/utils/wiretest.c:                 (long long)(int)sizeof(((struct obdo *)0)->o_lcookie));
          

          But in the latest master code, clients no longer pass o_lcookie to the server.

          wangshilong Wang Shilong (Inactive) added a comment - edited

          Hi Andreas,

          Sorry, I did not make it clear enough. The problem client version is b1_8 based. We applied this patch on the
          server side, and b1_8 SPARC clients could then mount without problems.

          wangshilong Wang Shilong (Inactive) added a comment

          Since 2.8 the client does not pass o_lcookie from the MDS at all; it is only used internally by the OSP on the MDS. All of the code to pass the cookie from the MDS to the OSS via the client was removed in patch http://review.whamcloud.com/12922 "LU-6017 obd: remove destroy cookie handling". It was kept from 2.4 to 2.7 for compatibility with older MDSes, but it wasn't expected that 2.8 clients would be used with 2.1 servers.

          Even on 2.3 and earlier clients, the cookie is opaque to the client and is only passed through the client from the MDS to the OSS, so the client shouldn't be swabbing this structure.

          Do you have the actual LASSERT() that is failing? Is it:

          static inline struct llog_ctxt *llog_group_get_ctxt(struct obd_llog_group *olg,
                                                              int index)
          {
                  struct llog_ctxt *ctxt;
          
                  LASSERT(index >= 0 && index < LLOG_MAX_CTXTS);
          

          This shouldn't be LASSERTing on data from the network. Do you have any idea what values were being sent for index?

          I'm not against adding the swabbing on the client, but I don't think that is actually fixing the right problem. It may be that the 2.8 client is just sending garbage values in this field and swabbing won't make any difference.
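
To make the point about not asserting on network data concrete, here is a minimal sketch of a lookup that validates an index received from a peer and returns an error instead of crashing. The `demo_*` names and the `DEMO_MAX_CTXTS` value are hypothetical and chosen for illustration only; this is not the Lustre implementation.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_CTXTS 16 /* hypothetical table size, an assumption */

struct demo_ctxt {
        int id;
};

struct demo_group {
        struct demo_ctxt *ctxts[DEMO_MAX_CTXTS];
};

/*
 * Look up a context by an index that may come from an untrusted peer.
 * Returns 0 and sets *out on success, -EINVAL for an out-of-range index,
 * and -ENOENT when the slot is empty. It never asserts on the input.
 */
static int demo_group_get_ctxt(const struct demo_group *group, int index,
                               struct demo_ctxt **out)
{
        *out = NULL;

        if (index < 0 || index >= DEMO_MAX_CTXTS)
                return -EINVAL;

        if (group->ctxts[index] == NULL)
                return -ENOENT;

        *out = group->ctxts[index];
        return 0;
}
```

With this shape, an unswabbed or garbage index such as 0x03000000 produces -EINVAL that the request handler can turn into an error reply, and the server keeps running.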

          adilger Andreas Dilger added a comment

          Wang Shilong (wshilong@ddn.com) uploaded a new patch: http://review.whamcloud.com/23891
          Subject: LU-8858 ptlrpc: swab o_lcookie propely
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 3ee629a0134d137f0364dbf8f883d53c804a009b

          gerrit Gerrit Updater added a comment

          People

            wc-triage WC Triage
            wangshilong Wang Shilong (Inactive)
