
osp_sync_llog_init(): ASSERTION( lgh != ((void *)0) ) failed


    Description

      When starting an MDT on a SPARC MDS, this assertion failure occurred:

      Lustre: lustre-MDT0000: used disk, loading
      Lustre: lustre-OST0000-osc-MDT0000: Init llog for 0 - catid 0x2:0:0
      LustreError: 11309:0:(osp_sync.c:963:osp_sync_llog_init()) ASSERTION( lgh != ((void *)0) ) failed:
      LustreError: 11309:0:(osp_sync.c:963:osp_sync_llog_init()) LBUG
      Pid: 11309, comm: llog_process_th
      
      Call Trace:
      
      Kernel panic - not syncing: LBUG
      Call Trace:
       [0000000010181194] lbug_with_loc+0x94/0xc0 [libcfs]
       [0000000010dc28c8] osp_sync_llog_init+0xa28/0xc00 [osp]
       [0000000010dc6d78] osp_sync_init+0x1f8/0xbe0 [osp]
       [0000000010daf51c] osp_device_alloc+0x4d7c/0x5c40 [osp]
       [000000001033a500] class_setup+0x6e0/0xf00 [obdclass]
       [000000001033da58] class_process_config+0x1738/0x5180 [obdclass]
      [...]
      

      According to the "catid" printed, I guess the FID of the log must be [1:2:0]. The problem is in the definition of ost_id:

      struct ost_id { 
              union {
                      struct ostid {
                              __u64   oi_id;
                              __u64   oi_seq;
                      } oi;
                      struct lu_fid oi_fid;
              };      
      };
      

      When fid_to_logid() assigns a 64-bit sequence number to oi_seq, which 32 bits land in f_oid and which in f_ver depends on the endianness of the MDS. On the SPARC MDS, FID_SEQ_LLOG goes to f_ver, causing ostid_id() to return 0 while the log ID as a whole is nonzero. Together, these caused osp_sync_llog_init() to neither open nor re-create the log.
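The aliasing described above can be reproduced on any host by building the byte image by hand. This is a minimal sketch, not Lustre code: it assumes struct lu_fid is laid out as a 64-bit f_seq followed by 32-bit f_oid and f_ver (the definition is not quoted in this ticket), and the names put_u64, get_u32, demo_fid_view and demo_alias are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value into buf in the requested byte order. */
static void put_u64(uint8_t *buf, uint64_t v, int big_endian)
{
	for (int i = 0; i < 8; i++) {
		int shift = big_endian ? 8 * (7 - i) : 8 * i;
		buf[i] = (uint8_t)(v >> shift);
	}
}

/* Load a 32-bit value from buf in the requested byte order. */
static uint32_t get_u32(const uint8_t *buf, int big_endian)
{
	uint32_t v = 0;
	for (int i = 0; i < 4; i++) {
		int shift = big_endian ? 8 * (3 - i) : 8 * i;
		v |= (uint32_t)buf[i] << shift;
	}
	return v;
}

/* The two 32-bit fields that overlap oi_seq (assumed lu_fid layout). */
struct demo_fid_view {
	uint32_t f_oid;
	uint32_t f_ver;
};

/*
 * Lay out {oi_id, oi_seq} as 16 bytes in the given byte order, then read
 * bytes 8..11 as f_oid and bytes 12..15 as f_ver, as the FID view of the
 * union would on a machine with that byte order.
 */
static struct demo_fid_view demo_alias(uint64_t oi_id, uint64_t oi_seq,
				       int big_endian)
{
	uint8_t image[16];
	struct demo_fid_view view;

	put_u64(image, oi_id, big_endian);
	put_u64(image + 8, oi_seq, big_endian);
	view.f_oid = get_u32(image + 8, big_endian);
	view.f_ver = get_u32(image + 12, big_endian);
	return view;
}
```

Calling demo_alias(2, 1, 1) models the big-endian case from this report: the sequence value ends up in f_ver and f_oid reads as 0. With the last argument set to 0 (little-endian), the same value shows up in f_oid instead.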

          Activity

            [LU-3294] osp_sync_llog_init(): ASSERTION( lgh != ((void *)0) ) failed
            jhammond John Hammond added a comment -

            Patch landed.

            jhammond John Hammond added a comment -

            The proposed change to osp_sync_llog_init() was landed to master as part of http://review.whamcloud.com/6305.

            jhammond John Hammond added a comment -

            The proposed change for osp_sync_llog_init() has been rolled into http://review.whamcloud.com/#change,6305.

            However, there are still some spots where ostid_id() is being applied to lgl_oi, and similar combinations. This only affects big-endian servers and so is not an issue for common setups (including LLNL's x86_64 servers with ppc64 clients).

            If I understand this code correctly, it may be useful to add some assertions or trace messages checking that:

            In logid_id() the seq is FID_SEQ_LLOG or FID_SEQ_LLOG_NAME.

            Same for logid_set_id().

            In ostid_id() the seq is not FID_SEQ_LLOG or FID_SEQ_LLOG_NAME.

            Same for ostid_set_id().

            A double swab of a ost_id/llog_logid for various valid seqs and oids (big and small) is the identity.

            This may require fixing up POSTID, DOSTID, ... or replacing them with PLOGID(), DLOGID(), ...
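A rough sketch of what such checks might look like follows. It is an illustration under stated assumptions, not the landed patch: the DEMO_SEQ_* values are placeholders standing in for FID_SEQ_LLOG and FID_SEQ_LLOG_NAME, the struct is a simplified two-field stand-in for ost_id, and every demo_* function is hypothetical. The swab here blindly reverses both 64-bit fields, so the round trip is trivially the identity; the test would be more interesting against a swab routine that interprets fields by sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder sequence values (assumptions, not taken from Lustre headers). */
#define DEMO_SEQ_LLOG      1ULL
#define DEMO_SEQ_LLOG_NAME 10ULL

/* Simplified stand-in for the id/seq pair of ost_id. */
struct demo_ost_id {
	uint64_t oi_id;
	uint64_t oi_seq;
};

/* Reverse the byte order of a 64-bit value. */
static uint64_t demo_swab64(uint64_t v)
{
	v = ((v & 0x00ff00ff00ff00ffULL) << 8) |
	    ((v >> 8) & 0x00ff00ff00ff00ffULL);
	v = ((v & 0x0000ffff0000ffffULL) << 16) |
	    ((v >> 16) & 0x0000ffff0000ffffULL);
	return (v << 32) | (v >> 32);
}

/* Swab both fields in place. */
static void demo_ost_id_swab(struct demo_ost_id *oi)
{
	oi->oi_id = demo_swab64(oi->oi_id);
	oi->oi_seq = demo_swab64(oi->oi_seq);
}

/* True when the sequence marks a log ID rather than an object ID. */
static int demo_seq_is_llog(uint64_t seq)
{
	return seq == DEMO_SEQ_LLOG || seq == DEMO_SEQ_LLOG_NAME;
}

/* Log-ID accessor guarded by the proposed sequence check. */
static uint64_t demo_logid_id(const struct demo_ost_id *oi)
{
	assert(demo_seq_is_llog(oi->oi_seq));
	return oi->oi_id;
}

/* Returns 1 if swabbing twice restores the original id and seq. */
static int demo_double_swab_is_identity(uint64_t id, uint64_t seq)
{
	struct demo_ost_id oi = { id, seq };

	demo_ost_id_swab(&oi);
	demo_ost_id_swab(&oi);
	return oi.oi_id == id && oi.oi_seq == seq;
}
```

An object-ID accessor would carry the opposite guard, asserting that demo_seq_is_llog() is false for its argument.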

            di.wang Di Wang added a comment -

            John, yes, you are right. It should be logid_id() instead of ostid_id(). Sigh, I had thought all of these had been reverted to logid_id() in LU-2888.

            jhammond John Hammond added a comment -

            After 725f3f8e it looks like we should use logid_id(&osi->osi_cid.lci_logid) instead of ostid_id(&osi->osi_cid.lci_logid.lgl_oi) in osp_sync_llog_init() and similarly elsewhere. Di, can you comment?
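To illustrate why the choice of accessor matters, here is a small sketch using simplified stand-in types. It assumes that a log-ID accessor reads oi.oi_id directly, and that struct lu_fid is a 64-bit f_seq followed by 32-bit f_oid and f_ver; neither definition is quoted in this ticket, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed FID layout: 64-bit sequence, 32-bit object ID, 32-bit version. */
struct demo_lu_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* Same union shape as the ost_id definition quoted in the description. */
struct demo_ost_id {
	union {
		struct {
			uint64_t oi_id;
			uint64_t oi_seq;
		} oi;
		struct demo_lu_fid oi_fid;
	};
};

/* Simplified log ID holding only the lgl_oi member. */
struct demo_llog_logid {
	struct demo_ost_id lgl_oi;
};

/* Log-ID style accessor: reads the 64-bit id, independent of byte order. */
static uint64_t demo_logid_id(const struct demo_llog_logid *lgl)
{
	return lgl->lgl_oi.oi.oi_id;
}

/* FID-view read of the same storage: the result depends on byte order. */
static uint32_t demo_fid_view_oid(const struct demo_llog_logid *lgl)
{
	return lgl->lgl_oi.oi_fid.f_oid;
}

/* Build a log ID from an id and a sequence. */
static struct demo_llog_logid demo_make_logid(uint64_t id, uint64_t seq)
{
	struct demo_llog_logid lgl;

	memset(&lgl, 0, sizeof(lgl));
	lgl.lgl_oi.oi.oi_id = id;
	lgl.lgl_oi.oi.oi_seq = seq;
	return lgl;
}
```

For a log ID built with demo_make_logid(2, 1), demo_logid_id() returns 2 on any host, whereas demo_fid_view_oid() returns whichever half of the sequence happens to overlap f_oid on that machine.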


            People

              jhammond John Hammond
              liwei Li Wei (Inactive)