[LU-3139] osp_precreate_send()) ASSERTION( lu_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed Created: 10/Apr/13  Updated: 24/Apr/14  Resolved: 22/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: LB, sequoia, topsequoia

Severity: 3
Rank (Obsolete): 7623

 Description   

When starting lustre on Sequoia's MDS/MGS, it is hitting the following assertion:

2013-04-09 16:46:16 Lustre: lsv-MDT0000: Will be in recovery for at least 5:00, or until 2 clients reconnect.
2013-04-09 16:46:19 Lustre: lsv-MDT0000: Recovery over after 0:03, of 2 clients 2 recovered and 0 were evicted.
2013-04-09 16:46:58 LustreError: 11-0: lsv-OST000c-osc-MDT0000: Communicating with 172.20.20.12@o2ib500, operation ost_connect failed with -16.
2013-04-09 16:47:38 LustreError: 11-0: lsv-OST000b-osc-MDT0000: Communicating with 172.20.20.11@o2ib500, operation ost_connect failed with -16.
2013-04-09 16:47:38 LustreError: Skipped 9 previous similar messages
2013-04-09 16:48:03 LustreError: 11-0: lsv-OST0007-osc-MDT0000: Communicating with 172.20.20.7@o2ib500, operation ost_connect failed with -16.
2013-04-09 16:48:03 LustreError: Skipped 9 previous similar messages
2013-04-09 16:48:24 Lustre: lsv-OST0001-osc-MDT0000: Connection restored to lsv-OST0001 (at 172.20.20.1@o2ib500)
2013-04-09 16:48:24 Lustre: lsv-OST0003-osc-MDT0000: Connection restored to lsv-OST0003 (at 172.20.20.3@o2ib500)
2013-04-09 16:49:44 LustreError: 18017:0:(osp_precreate.c:496:osp_precreate_send()) ASSERTION( lu_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed: reply fid [0x100090000:0x4c00:0x0] pre used fid [0x100090000:0x16bec0:0x0]
2013-04-09 16:49:44 LustreError: 18017:0:(osp_precreate.c:496:osp_precreate_send()) LBUG

This is an x86_64 server with ppc64 clients. Lustre versions 2.3.63-3chaos and 2.3.63-4chaos.

Seeing some vague similarity to LU-2895, we applied the patch from that issue with no improvement. But since this assertion is in a different function, that is not necessarily surprising.



 Comments   
Comment by Peter Jones [ 10/Apr/13 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Christopher Morrone [ 10/Apr/13 ]

Any suggestions? Sequoia's filesystem can't be started currently. We need a quick solution.

Comment by Andreas Dilger [ 10/Apr/13 ]

Alex is looking into this.

Comment by Alex Zhuravlev [ 10/Apr/13 ]

Christopher, would it be possible to start with full debug and attach the log here, please?

Comment by Ned Bass [ 10/Apr/13 ]

I uploaded ftp.whamcloud.com/uploads/LU-3139.tar.gz which contains several debug logs from the MDS. We had panic_on_lbug disabled and hit the assertion multiple times.

Comment by Ned Bass [ 10/Apr/13 ]

We see some suspicious-looking Lustre messages related to object precreation on the OSTs. When running 2.3.63, the OSTs all log messages like these:

Apr  9 16:50:06 vesta3 kernel: Lustre: lsv-OST0003: Slow creates, 2304/1482096 objects created at a rate of 46/s
Apr  9 16:50:06 vesta1 kernel: Lustre: lsv-OST0001: Slow creates, 2176/1474192 objects created at a rate of 43/s
Apr  9 16:50:06 vesta4 kernel: Lustre: lsv-OST0004: Slow creates, 2304/1478224 objects created at a rate of 46/s
Apr  9 16:50:08 vesta2 kernel: Lustre: lsv-OST0002: Slow creates, 2304/1472688 objects created at a rate of 46/s
Apr  9 16:50:10 vesta7 kernel: Lustre: lsv-OST0007: Slow creates, 2048/1482448 objects created at a rate of 40/s
Apr  9 16:50:12 vesta6 kernel: Lustre: lsv-OST0006: Slow creates, 2176/1482448 objects created at a rate of 43/s
Apr  9 16:50:12 vesta5 kernel: Lustre: lsv-OST0005: Slow creates, 2048/1482320 objects created at a rate of 40/s

This makes us wonder why they seem to be trying to precreate ~1.5 million objects. Perhaps the FID of the last created object is not being honored, causing precreation to restart from zero?

Perhaps related, the OSTs also seem to be requesting very large grants:

Apr  9 16:49:14 vesta3 kernel: LustreError: 5369:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0003: client lsv-OST0003_UUID/ffff880809acf000 requesting > 2GB grant 3035332608
Apr  9 16:49:15 vesta1 kernel: LustreError: 5386:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0001: client lsv-OST0001_UUID/ffff8807f6423400 requesting > 2GB grant 3019145216
Apr  9 16:49:15 vesta4 kernel: LustreError: 5350:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0004: client lsv-OST0004_UUID/ffff8807f90d1800 requesting > 2GB grant 3027402752
Apr  9 16:49:16 vesta2 kernel: LustreError: 5375:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0002: client lsv-OST0002_UUID/ffff8807f8a9ac00 requesting > 2GB grant 3016065024
Apr  9 16:49:18 vesta7 kernel: LustreError: 5496:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0007: client lsv-OST0007_UUID/ffff8807f7d64800 requesting > 2GB grant 3036053504
Apr  9 16:49:21 vesta6 kernel: LustreError: 5487:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0006: client lsv-OST0006_UUID/ffff8807f7404c00 requesting > 2GB grant 3036053504
Apr  9 16:49:21 vesta5 kernel: LustreError: 5557:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0005: client lsv-OST0005_UUID/ffff880808ccf800 requesting > 2GB grant 3035791360

Finally, when we bring the OSTs back up running the old Lustre version 2.3.58, we see messages like these.

Apr 10 12:50:42 vesta44 kernel: Lustre: lsv-OST002c: deleting orphan objects from 1490242 to 1490307
Apr 10 12:50:42 vesta43 kernel: Lustre: lsv-OST002b: deleting orphan objects from 1489633 to 1489699
Apr 10 12:50:42 vesta72 kernel: Lustre: lsv-OST0048: deleting orphan objects from 1490721 to 1490787
Apr 10 12:50:42 vesta29 kernel: Lustre: lsv-OST001d: deleting orphan objects from 1490592 to 1490659
Apr 10 12:50:42 vesta78 kernel: Lustre: lsv-OST004e: deleting orphan objects from 1490080 to 1490147
Apr 10 12:50:42 vesta41 kernel: Lustre: lsv-OST0029: deleting orphan objects from 1490401 to 1490467
Apr 10 12:50:42 vesta25 kernel: Lustre: lsv-OST0019: deleting orphan objects from 1490337 to 1490403

Note that the ranges roughly coincide with the upper bound of the 'Slow create' messages quoted above.

Comment by Ned Bass [ 11/Apr/13 ]

I wonder if this is related to the changes for fid on OST. In particular I am looking at http://review.whamcloud.com/#change,4324 which seems to have changed the on-disk representation of the last-used object id on the OST.

Comment by Niu Yawei (Inactive) [ 11/Apr/13 ]

Hi Ned, was the filesystem upgraded from 2.3.58? If so, I agree with you that the fid-on-ost changes could be the culprit. It would be helpful to have full logs from the OSTs.

Comment by Di Wang [ 11/Apr/13 ]

Hmm, with patch 4324 it will use {seq, 0, 0} as the FID of last_rcvd. I guess the original OSD-ZFS does not map this zero-oid FID to the last_rcvd file. We probably need some special handling here. Alex, could you please confirm?
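
For context, here is a minimal illustration of what a {seq, 0, 0} FID means in terms of struct lu_fid; the field layout matches the real structure, but the snippet itself is only an example, not code from the patch:

#include <linux/types.h>

/* The lu_fid fields relevant to this discussion. */
struct lu_fid {
        __u64 f_seq;    /* sequence number */
        __u32 f_oid;    /* object id within the sequence */
        __u32 f_ver;    /* version, zero here */
};

/* "{seq, 0, 0}": the group-0 sequence (FID_SEQ_OST_MDT0 == 0) with a
 * zero object id and version. */
static const struct lu_fid last_id_fid = {
        .f_seq = 0ULL,  /* FID_SEQ_OST_MDT0 */
        .f_oid = 0,
        .f_ver = 0,
};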

Comment by Niu Yawei (Inactive) [ 11/Apr/13 ]

If the system was upgraded from 2.3.58 to 2.3.63, I think the LASSERT was probably triggered as follows (see the sketch below):

1. The MDT read the correct last_used_fid 1490242 (or some very large number) from lov_objid and used it to do orphan cleanup;

2. On the OST side, the OST got an incorrect last_oid of 0 from disk (because of the fid-on-ost changes, it failed to locate the old last_rcvd), so orphan cleanup tried to recreate over a million objects;

3. The million-object re-creation broke off in the middle (see the "Slow creates.." messages) and returned to the MDT with the number created so far, 2304 (or some other small value);

4. The MDT found that the returned last_fid was smaller than its current last_used_fid, so it kept using last_used_fid (the larger one) for precreation;

5. The current last_oid on the OST was still a very small value, so the OST started the million-object precreation again; it should also have broken off in the middle ("Slow creates...") and returned the last created fid (0x4c00 or something similar);

6. The assert triggered on the MDT because the last fid returned from the OST was still much smaller than opd_pre_used_fid (the correct one).
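
In code terms, the failing check is roughly the following. This is a minimal, self-contained sketch of the logic, not the actual osp_precreate.c source; fid_oid_diff() here is a stand-in for lu_fid_diff():

#include <assert.h>
#include <stdint.h>

/* Simplified FID: only the fields that matter for this check. */
struct fid {
        uint64_t seq;
        uint32_t oid;
};

/* Stand-in for lu_fid_diff(): how far ahead fid a is of fid b within
 * the same sequence (negative if a is behind b). */
static int64_t fid_oid_diff(const struct fid *a, const struct fid *b)
{
        assert(a->seq == b->seq);
        return (int64_t)a->oid - (int64_t)b->oid;
}

int main(void)
{
        /* Values from the LBUG message above. */
        struct fid reply    = { .seq = 0x100090000ULL, .oid = 0x4c00 };
        struct fid pre_used = { .seq = 0x100090000ULL, .oid = 0x16bec0 };

        /* The MDT expects the FID in the OST's precreate reply to be
         * strictly ahead of the highest FID it has already handed out
         * (opd_pre_used_fid).  With the OST restarting from a tiny
         * last_id, the difference is negative and this fires. */
        assert(fid_oid_diff(&reply, &pre_used) > 0);
        return 0;
}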

Comment by Alex Zhuravlev [ 11/Apr/13 ]

Niu's description seem to be correct.. and we can do something like:

if (zap lookup in OI failed) {
  if (fid_idif() && seq==FID_SEQ_OST_MDT0 && oid==0)
    lookup {FID_SEQ_LOCAL_FILE; OFD_GROUP0_LAST_OID} in OI
}

though all the numbers/names above should be double-checked..

The ideal "solution" would be to reformat the filesystem, but I'm not sure this is possible.

For reference, we were using:

          lu_local_obj_fid(&info->fti_fid, OFD_GROUP0_LAST_OID + group);

in OFD to access the last_id for group 0, where

static inline void lu_local_obj_fid(struct lu_fid *fid, __u32 oid)
{
        fid->f_seq = FID_SEQ_LOCAL_FILE;
        fid->f_oid = oid;
        fid->f_ver = 0;
}

and

        OFD_GROUP0_LAST_OID = 20UL,
        FID_SEQ_LOCAL_FILE  = 0x200000001ULL,

so it should be

{FID_SEQ_LOCAL_FILE; 20}

as OFD_GROUP0_LAST_OID is not defined anymore.

Comment by Alex Zhuravlev [ 11/Apr/13 ]

Hmm, this is not quite right, as a new object to track last_id with oid=0 has been created already.. I guess instead we should look up OFD_GROUP0_LAST_OID first if osd_fid_lookup() is called for {FID_SEQ_OST_MDT0; 0}?
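
A rough sketch of what that could look like; the helper names and the oi_lookup callback are invented for illustration, not actual osd-zfs code:

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define FID_SEQ_OST_MDT0        0ULL
#define FID_SEQ_LOCAL_FILE      0x200000001ULL
#define OFD_GROUP0_LAST_OID     20U     /* the old 2.3.58 constant */

struct lu_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* Is this the new-style group-0 last_id FID, {FID_SEQ_OST_MDT0, 0, 0}? */
static bool fid_is_new_last_id(const struct lu_fid *fid)
{
        return fid->f_seq == FID_SEQ_OST_MDT0 && fid->f_oid == 0;
}

/* Illustrative only: when asked for {FID_SEQ_OST_MDT0, 0}, try the
 * pre-4324 local-file FID {FID_SEQ_LOCAL_FILE, OFD_GROUP0_LAST_OID}
 * first, so an upgraded OST finds its old LAST_ID instead of an empty
 * new object. */
static int lookup_last_id(const struct lu_fid *fid,
                          int (*oi_lookup)(const struct lu_fid *))
{
        if (fid_is_new_last_id(fid)) {
                const struct lu_fid old = {
                        .f_seq = FID_SEQ_LOCAL_FILE,
                        .f_oid = OFD_GROUP0_LAST_OID,
                        .f_ver = 0,
                };
                int rc = oi_lookup(&old);

                if (rc != -ENOENT)
                        return rc;      /* found the old LAST_ID */
        }
        return oi_lookup(fid);          /* fall back to the new FID */
}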

Another (hopefully not fatal) issue is the large number of orphans we just created; luckily, the creation rate wasn't high.

Comment by Niu Yawei (Inactive) [ 11/Apr/13 ]

hmm, this is not quite right, as a new object to track last_id with oid=0 has been created already.. I guess instead we should look up OFD_GROUP0_LAST_OID first if osd_fid_lookup() is called for {FID_SEQ_OST_MDT0; 0}?

Indeed. I'm wondering if there are any ZFS tools that can copy the old LAST_ID into the new {seq, 0, 0} object? Then we could probably avoid the extra check above (I assume there isn't any other system in the world that needs such a check). Of course, it would be great if the system could be reformatted.

Comment by Alex Zhuravlev [ 11/Apr/13 ]

AFAIU, yes, it usually can be mounted with ZPL. But this may not work for the old filesystem, as compatibility with ZPL was implemented just before the landing in September, IIRC.

Comment by Ned Bass [ 11/Apr/13 ]

We can mount it with ZPL. There is some strange behavior like . or .. missing or showing up twice, or incorrect hard link counts. But we can read/write/open/close local objects like LAST_ID, last_rcvd, lov_objid, etc.

Comment by Ned Bass [ 11/Apr/13 ]

Alex, BTW this filesystem will not be long-lived due to the risk of these on-disk incompatibilities. We will provide a newly-formatted filesystem for users that will coexist in the same zpools as this legacy one. We just need to be able to mount the legacy one under 2.3.63+ long enough for users to migrate their data.

Comment by Alex Zhuravlev [ 11/Apr/13 ]

This is good news then. I'd suggest taking a snapshot for safety, then:

Mount with ZPL and check the file /oi.1/0x200000001:0x14:0x0; it should be 8 bytes long and contain a number close to 0x16bec0 (the last id used on the MDS).

The new file should be at /O/0/d0/0; it should also be 8 bytes, with a number much less than 0x16bec0, close to the first of the two numbers you saw in a message like: Apr 9 16:50:12 vesta5 kernel: Lustre: lsv-OST0005: Slow creates, 2048/1482320 objects created at a rate of 40/s

I think it should be enough to write the content of the old file (/oi.1/0x200000001:0x14:0x0) into the new one (/O/0/d0/0).
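
A minimal sketch of that copy, assuming the OST dataset is mounted via ZPL at /mnt/ost (the mountpoint is an assumption; the two paths are the ones named above, and in practice an 8-byte dd would do the same):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Old-style file holding the last used object id, per above.
         * /mnt/ost is an assumed ZPL mountpoint; adjust as needed. */
        const char *old_path = "/mnt/ost/oi.1/0x200000001:0x14:0x0";
        /* New-style file that currently holds the too-small value. */
        const char *new_path = "/mnt/ost/O/0/d0/0";
        uint64_t last_id;
        int fd;

        fd = open(old_path, O_RDONLY);
        if (fd < 0 || read(fd, &last_id, sizeof(last_id)) != sizeof(last_id)) {
                perror("read old last id");
                return 1;
        }
        close(fd);

        printf("old last id: %#llx\n", (unsigned long long)last_id);

        fd = open(new_path, O_WRONLY);
        if (fd < 0 || write(fd, &last_id, sizeof(last_id)) != sizeof(last_id)) {
                perror("write new last id");
                return 1;
        }
        close(fd);
        return 0;
}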

Niu, Di, could you please confirm this suggestion is sane?

Comment by Ned Bass [ 11/Apr/13 ]

Alex, the oi.* directories are mostly empty when viewed through the ZPL. /oi.1/0x200000001:0x14:0x0 doesn't exist, but /O/0/LAST_ID and /O/0/d0/0 are there with contents as you describe. So we should use the LAST_ID file instead, correct?

Comment by Di Wang [ 11/Apr/13 ]

Yes, LAST_ID should be used.

Comment by Ned Bass [ 11/Apr/13 ]

Okay, we'll schedule a time to try out this fix. It will probably be sometime next week.

Comment by Ned Bass [ 12/Apr/13 ]

Niu, Alex, Di, can you think of any other on-disk format changes that may bite us after this one? I don't want to get into a state where we can't mount the filesystem under any version of Lustre.

Comment by Andreas Dilger [ 12/Apr/13 ]

Ned, there was a change with http://review.whamcloud.com/5820 (LU-2684) that affects the MDS FID storage in the LOV EA. This shouldn't affect normal Lustre operation, but there is a bit of churn in that code right now (e.g. LU-3152, LU-2888) that may affect upgraded filesystems and it would probably be better to wait until that issue is resolved.

Comment by Ned Bass [ 15/Apr/13 ]

We went ahead with the proposed workaround for one affected filesystem (lscratchv, used by vulcan). We were able to bring it up under Lustre 2.3.63 without hitting this bug.

We will do the same for the legacy Sequoia filesystem (lscratch1) tomorrow. Sequoia is already mounting a new filesystem formatted using Lustre 2.3.63, but we want to mount the old one read-only to allow data migration. I'll report back on how it goes tomorrow.

Comment by Peter Jones [ 15/Apr/13 ]

Thanks for the update Ned. I have dropped the priority slightly to reflect that this is still an important support issue but is not a general blocker for the release itself.

Comment by Ned Bass [ 18/Apr/13 ]

The update of the sequoia filesystem was successful. If no work is planned for adding compatibility code, I think we can close this issue.

Comment by Niu Yawei (Inactive) [ 22/Apr/13 ]

The Sequoia system is fixed, and there is no plan to add compatibility code for 2.3.58 <-> 2.3.63.

Comment by James A Simmons [ 24/Apr/14 ]

I just hit this same bug on the 2.5.1 branch. This happened on a newly formatted filesystem.

Comment by James A Simmons [ 24/Apr/14 ]

Actually, this is more likely LU-4653.
