
osp_precreate_send()) ASSERTION( lu_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.4.0
    • Severity: 3
    • Rank: 7623

    Description

      When starting Lustre on Sequoia's MDS/MGS, it hits the following assertion:

      2013-04-09 16:46:16 Lustre: lsv-MDT0000: Will be in recovery for at least 5:00, or until 2 clients reconnect.
      2013-04-09 16:46:19 Lustre: lsv-MDT0000: Recovery over after 0:03, of 2 clients 2 recovered and 0 were evicted.
      2013-04-09 16:46:58 LustreError: 11-0: lsv-OST000c-osc-MDT0000: Communicating with 172.20.20.12@o2ib500, operation ost_connect failed with -16.
      2013-04-09 16:47:38 LustreError: 11-0: lsv-OST000b-osc-MDT0000: Communicating with 172.20.20.11@o2ib500, operation ost_connect failed with -16.
      2013-04-09 16:47:38 LustreError: Skipped 9 previous similar messages
      2013-04-09 16:48:03 LustreError: 11-0: lsv-OST0007-osc-MDT0000: Communicating with 172.20.20.7@o2ib500, operation ost_connect failed with -16.
      2013-04-09 16:48:03 LustreError: Skipped 9 previous similar messages
      2013-04-09 16:48:24 Lustre: lsv-OST0001-osc-MDT0000: Connection restored to lsv-OST0001 (at 172.20.20.1@o2ib500)
      2013-04-09 16:48:24 Lustre: lsv-OST0003-osc-MDT0000: Connection restored to lsv-OST0003 (at 172.20.20.3@o2ib500)
      2013-04-09 16:49:44 LustreError: 18017:0:(osp_precreate.c:496:osp_precreate_send()) ASSERTION( lu_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed: reply fid [0x100090000:0x4c00:0x0] pre used fid [0x100090000:0x16bec0:0x0]
      2013-04-09 16:49:44 LustreError: 18017:0:(osp_precreate.c:496:osp_precreate_send()) LBUG
      

      This is an x86_64 server with ppc64 clients. Lustre versions 2.3.63-3chaos and 2.3.63-4chaos.

      Seeing some vague similarity with LU-2895, we applied the patch from that issue with no improvement. But that assertion is in a different function, so the lack of improvement is not necessarily surprising.
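
      For context, the failing LASSERT requires the FID the OST reports back to be strictly ahead of the last FID the MDS has already handed out. Below is a minimal standalone sketch of that comparison (my own simplification of lu_fid_diff(), ignoring IDIF handling), plugged with the two FIDs from the log above:

      #include <stdint.h>
      #include <stdio.h>

      struct lu_fid {
              uint64_t f_seq;   /* sequence */
              uint32_t f_oid;   /* object id within the sequence */
              uint32_t f_ver;
      };

      /* simplified stand-in for lu_fid_diff(): assumes both fids share a sequence */
      static int64_t fid_diff(const struct lu_fid *f1, const struct lu_fid *f2)
      {
              return (int64_t)f1->f_oid - (int64_t)f2->f_oid;
      }

      int main(void)
      {
              struct lu_fid reply = { .f_seq = 0x100090000ULL, .f_oid = 0x4c00 };   /* from the OST */
              struct lu_fid used  = { .f_seq = 0x100090000ULL, .f_oid = 0x16bec0 }; /* opd_pre_used_fid */

              /* 0x4c00 - 0x16bec0 < 0, so ASSERTION( diff > 0 ) fires and the MDS LBUGs */
              printf("diff = %lld\n", (long long)fid_diff(&reply, &used));
              return 0;
      }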

      Attachments

        Activity

          [LU-3139] osp_precreate_send()) ASSERTION( lu_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed

          nedbass Ned Bass (Inactive) added a comment -

          Okay, we'll schedule a time to try out this fix. It will probably be sometime next week.

          di.wang Di Wang added a comment -

          Yes, LAST_ID should be used.


          nedbass Ned Bass (Inactive) added a comment -

          Alex, the oi.* directories are mostly empty through the ZPL. /oi.1/0x200000001:0x14:0x0 doesn't exist, but /O/0/LAST_ID and /O/0/d0/0 are there with contents as you describe. So we should use the LAST_ID file instead, correct?

          bzzz Alex Zhuravlev added a comment -

          This is good news then. I'd suggest the following: take a snapshot for safety, then...

          Mount with ZPL, find the file /oi.1/0x200000001:0x14:0x0 and check its content; it should be 8 bytes long and contain a number close to 0x16bec0 (the last id used on the MDS).

          The new file should be /O/0/d0/0 - it should be 8 bytes too, and the number much less than 0x16bec0, close to the first number you last saw in
          a message like: Apr 9 16:50:12 vesta5 kernel: Lustre: lsv-OST0005: Slow creates, 2048/1482320 objects created at a rate of 40/s

          I think it should be enough to write the content from the old file (/oi.1/0x200000001:0x14:0x0) into the new one (/O/0/d0/0).

          Niu, Di, could you please confirm this suggestion is sane?
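
          A minimal sketch of that 8-byte copy (not a tested tool: /mnt/ost is a placeholder ZPL mount point, and per the follow-up above the source on this particular system would be /O/0/LAST_ID, since the oi.1 object turned out not to exist):

          #include <stdint.h>
          #include <stdio.h>

          int main(void)
          {
                  /* placeholder paths: adjust the mount point and, as discussed
                   * above, the source file for the system at hand */
                  const char *old_path = "/mnt/ost/oi.1/0x200000001:0x14:0x0";
                  const char *new_path = "/mnt/ost/O/0/d0/0";
                  uint64_t last_id;
                  FILE *f;

                  /* read the old 8-byte LAST_ID value */
                  f = fopen(old_path, "rb");
                  if (f == NULL || fread(&last_id, sizeof(last_id), 1, f) != 1) {
                          perror(old_path);
                          return 1;
                  }
                  fclose(f);
                  printf("old LAST_ID: 0x%llx\n", (unsigned long long)last_id);

                  /* overwrite the 8-byte value in the new file in place */
                  f = fopen(new_path, "r+b");
                  if (f == NULL || fwrite(&last_id, sizeof(last_id), 1, f) != 1) {
                          perror(new_path);
                          return 1;
                  }
                  fclose(f);
                  return 0;
          }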

          nedbass Ned Bass (Inactive) added a comment -

          Alex, BTW this filesystem will not be long-lived due to the risk of these on-disk incompatibilities. We will provide a newly-formatted filesystem for users that will coexist in the same zpools as this legacy one. We just need to be able to mount the legacy one under 2.3.63+ long enough for users to migrate their data.

          nedbass Ned Bass (Inactive) added a comment -

          We can mount it with ZPL. There is some strange behavior, like . or .. missing or showing up twice, or incorrect hard link counts. But we can read/write/open/close local objects like LAST_ID, last_rcvd, lov_objid, etc.

          bzzz Alex Zhuravlev added a comment -

          AFAIU, yes, it usually can be mounted with ZPL. But this may not work for the old filesystem, as compatibility with ZPL was implemented just before the landing in September, IIRC.

          niu Niu Yawei (Inactive) added a comment -

          > hmm, this is not quite right as new object to track last_id with oid=0 has been created already.. I guess instead we should lookup OFD_GROUP0_LAST_OID first if osd_fid_lookup() is called for {FID_SEQ_OST_MDT0; 0}

          Indeed. I'm wondering if there are any tools for zfs which can copy the old LAST_ID into the new {seq, 0, 0}? Then we could probably avoid the extra check above (I assume no other system in the world needs such a check). Of course, it would be great if it were possible to reformat the system.

          bzzz Alex Zhuravlev added a comment -

          hmm, this is not quite right, as the new object to track last_id with oid=0 has been created already.. I guess instead we should lookup OFD_GROUP0_LAST_OID first if osd_fid_lookup() is called for {FID_SEQ_OST_MDT0; 0}?

          another (hopefully not fatal) issue is the large number of orphans we just created. luckily, the creation rate wasn't great..

          bzzz Alex Zhuravlev added a comment - - edited

          Niu's description seems to be correct.. and we can do something like:

          if (zap lookup in OI failed) {
            if (fid_is_idif(fid) && seq == FID_SEQ_OST_MDT0 && oid == 0)
              lookup {FID_SEQ_LOCAL_FILE; OFD_GROUP0_LAST_OID} in OI
          }

          though all the numbers/names above should be double-checked..

          the ideal "solution" would be to reformat, but I'm not sure this is possible.

          for reference, we were using:

                    lu_local_obj_fid(&info->fti_fid, OFD_GROUP0_LAST_OID + group);

          in OFD to access last_id for group 0, where

          static inline void lu_local_obj_fid(struct lu_fid *fid, __u32 oid)
          {
                  fid->f_seq = FID_SEQ_LOCAL_FILE;
                  fid->f_oid = oid;
                  fid->f_ver = 0;
          }

          then

          OFD_GROUP0_LAST_OID = 20UL,
          FID_SEQ_LOCAL_FILE  = 0x200000001ULL,

          so, it should be

          {FID_SEQ_LOCAL_FILE; 20}

          as OFD_GROUP0_LAST_OID is not defined anymore.
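
          To make the mapping concrete, here is a standalone sketch (my own, built only from the constants quoted above) showing that the legacy LAST_ID fid Alex describes renders as [0x200000001:0x14:0x0] - the very object path checked via ZPL earlier in this thread:

          #include <stdint.h>
          #include <stdio.h>

          #define FID_SEQ_LOCAL_FILE  0x200000001ULL
          #define OFD_GROUP0_LAST_OID 20UL   /* removed from current headers, quoted above */

          struct lu_fid {
                  uint64_t f_seq;
                  uint32_t f_oid;
                  uint32_t f_ver;
          };

          static void local_obj_fid(struct lu_fid *fid, uint32_t oid)
          {
                  fid->f_seq = FID_SEQ_LOCAL_FILE;
                  fid->f_oid = oid;
                  fid->f_ver = 0;
          }

          int main(void)
          {
                  struct lu_fid fid;

                  local_obj_fid(&fid, OFD_GROUP0_LAST_OID);
                  /* prints [0x200000001:0x14:0x0], since 20 == 0x14 */
                  printf("[0x%llx:0x%x:0x%x]\n",
                         (unsigned long long)fid.f_seq,
                         (unsigned)fid.f_oid, (unsigned)fid.f_ver);
                  return 0;
          }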

          niu Niu Yawei (Inactive) added a comment -

          If the system was upgraded from 2.3.58 to 2.3.63, I think the LASSERT was probably triggered as follows:

          1. The MDT read the correct last_used_fid 1490242 (or some very large number) from lov_objid, and used it to do orphan cleanup;

          2. On the OST side, the OST got the incorrect last_oid 0 from disk (because of the fid-on-ost changes, it failed to locate the old last_rcvd), so orphan cleanup tried to recreate over a million objects;

          3. The million-object re-creation broke off in the middle (see the "Slow creates.." message), and returned to the MDT with the created number 2304 (or some small value);

          4. The MDT found that the returned last_fid was smaller than its current last_used_fid, so it kept using last_used_fid (the larger one) for precreate;

          5. The current last_oid was still a very small value, so the OST started the million-object precreation again; it should also break off in the middle ("Slow creates...") and return the last created fid (0x4c00 or something similar);

          6. The assert triggered on the MDT, because the last fid returned from the OST was still much smaller than opd_pre_used_fid (the correct one).
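
          A toy model of that loop (my own illustration, not Lustre code), using the per-round reply values from the analysis above, showing that the OST's reply never catches up with what the MDS already used:

          #include <stdint.h>
          #include <stdio.h>

          int main(void)
          {
                  uint64_t mds_last_used = 0x16bec0;           /* step 1: from lov_objid */
                  uint64_t ost_reply[2]  = { 0x900, 0x4c00 };  /* steps 3 and 5: OST breaks off early */
                  int round;

                  for (round = 0; round < 2; round++) {
                          /* steps 4 and 6: the reply stays far below last_used_fid,
                           * so the MDS keeps the larger value and, on the reply
                           * path, ASSERTION( lu_fid_diff(...) > 0 ) fails */
                          printf("round %d: reply 0x%llx < used 0x%llx\n",
                                 round + 1,
                                 (unsigned long long)ost_reply[round],
                                 (unsigned long long)mds_last_used);
                  }
                  return 0;
          }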

          People

            Assignee:
            niu Niu Yawei (Inactive)
            Reporter:
            morrone Christopher Morrone (Inactive)
            Votes:
            0
            Watchers:
            9

            Dates

              Created:
              Updated:
              Resolved: