[LU-3139] osp_precreate_send()) ASSERTION( lu_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed Created: 10/Apr/13 Updated: 24/Apr/14 Resolved: 22/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB, sequoia, topsequoia | ||
| Severity: | 3 |
| Rank (Obsolete): | 7623 |
| Description |
|
When starting Lustre on Sequoia's MDS/MGS, it is hitting the following assertion:

2013-04-09 16:46:16 Lustre: lsv-MDT0000: Will be in recovery for at least 5:00, or until 2 clients reconnect.
2013-04-09 16:46:19 Lustre: lsv-MDT0000: Recovery over after 0:03, of 2 clients 2 recovered and 0 were evicted.
2013-04-09 16:46:58 LustreError: 11-0: lsv-OST000c-osc-MDT0000: Communicating with 172.20.20.12@o2ib500, operation ost_connect failed with -16.
2013-04-09 16:47:38 LustreError: 11-0: lsv-OST000b-osc-MDT0000: Communicating with 172.20.20.11@o2ib500, operation ost_connect failed with -16.
2013-04-09 16:47:38 LustreError: Skipped 9 previous similar messages
2013-04-09 16:48:03 LustreError: 11-0: lsv-OST0007-osc-MDT0000: Communicating with 172.20.20.7@o2ib500, operation ost_connect failed with -16.
2013-04-09 16:48:03 LustreError: Skipped 9 previous similar messages
2013-04-09 16:48:24 Lustre: lsv-OST0001-osc-MDT0000: Connection restored to lsv-OST0001 (at 172.20.20.1@o2ib500)
2013-04-09 16:48:24 Lustre: lsv-OST0003-osc-MDT0000: Connection restored to lsv-OST0003 (at 172.20.20.3@o2ib500)
2013-04-09 16:49:44 LustreError: 18017:0:(osp_precreate.c:496:osp_precreate_send()) ASSERTION( lu_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed: reply fid [0x100090000:0x4c00:0x0] pre used fid [0x100090000:0x16bec0:0x0]
2013-04-09 16:49:44 LustreError: 18017:0:(osp_precreate.c:496:osp_precreate_send()) LBUG

This is an x86_64 server with ppc64 clients. Lustre versions 2.3.63-3chaos and 2.3.63-4chaos. Seeing some vague similarity with |
| Comments |
| Comment by Peter Jones [ 10/Apr/13 ] |
|
Niu Could you please look into this one? Thanks Peter |
| Comment by Christopher Morrone [ 10/Apr/13 ] |
|
Any suggestions? Sequoia's filesystem can't be started currently. We need a quick solution. |
| Comment by Andreas Dilger [ 10/Apr/13 ] |
|
Alex is looking into this. |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
Christopher, would it be possible to start with full debug and attach the log here, please? |
| Comment by Ned Bass [ 10/Apr/13 ] |
|
I uploaded ftp.whamcloud.com/uploads/ |
| Comment by Ned Bass [ 10/Apr/13 ] |
|
We see some suspicious-looking Lustre messages related to object precreation on the OSTs. When running 2.3.63, the OSTs all log messages like these:

Apr 9 16:50:06 vesta3 kernel: Lustre: lsv-OST0003: Slow creates, 2304/1482096 objects created at a rate of 46/s
Apr 9 16:50:06 vesta1 kernel: Lustre: lsv-OST0001: Slow creates, 2176/1474192 objects created at a rate of 43/s
Apr 9 16:50:06 vesta4 kernel: Lustre: lsv-OST0004: Slow creates, 2304/1478224 objects created at a rate of 46/s
Apr 9 16:50:08 vesta2 kernel: Lustre: lsv-OST0002: Slow creates, 2304/1472688 objects created at a rate of 46/s
Apr 9 16:50:10 vesta7 kernel: Lustre: lsv-OST0007: Slow creates, 2048/1482448 objects created at a rate of 40/s
Apr 9 16:50:12 vesta6 kernel: Lustre: lsv-OST0006: Slow creates, 2176/1482448 objects created at a rate of 43/s
Apr 9 16:50:12 vesta5 kernel: Lustre: lsv-OST0005: Slow creates, 2048/1482320 objects created at a rate of 40/s

Which makes us wonder why they seem to be trying to precreate ~1.5 million objects. Perhaps the FID of the last created object is not being honored, causing it to restart from zero? Perhaps related, the OSTs also seem to be requesting very large grants:

Apr 9 16:49:14 vesta3 kernel: LustreError: 5369:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0003: client lsv-OST0003_UUID/ffff880809acf000 requesting > 2GB grant 3035332608
Apr 9 16:49:15 vesta1 kernel: LustreError: 5386:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0001: client lsv-OST0001_UUID/ffff8807f6423400 requesting > 2GB grant 3019145216
Apr 9 16:49:15 vesta4 kernel: LustreError: 5350:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0004: client lsv-OST0004_UUID/ffff8807f90d1800 requesting > 2GB grant 3027402752
Apr 9 16:49:16 vesta2 kernel: LustreError: 5375:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0002: client lsv-OST0002_UUID/ffff8807f8a9ac00 requesting > 2GB grant 3016065024
Apr 9 16:49:18 vesta7 kernel: LustreError: 5496:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0007: client lsv-OST0007_UUID/ffff8807f7d64800 requesting > 2GB grant 3036053504
Apr 9 16:49:21 vesta6 kernel: LustreError: 5487:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0006: client lsv-OST0006_UUID/ffff8807f7404c00 requesting > 2GB grant 3036053504
Apr 9 16:49:21 vesta5 kernel: LustreError: 5557:0:(ofd_grant.c:605:ofd_grant()) lsv-OST0005: client lsv-OST0005_UUID/ffff880808ccf800 requesting > 2GB grant 3035791360

Finally, when we bring the OSTs back up running the old Lustre version 2.3.58, we see messages like these:

Apr 10 12:50:42 vesta44 kernel: Lustre: lsv-OST002c: deleting orphan objects from 1490242 to 1490307
Apr 10 12:50:42 vesta43 kernel: Lustre: lsv-OST002b: deleting orphan objects from 1489633 to 1489699
Apr 10 12:50:42 vesta72 kernel: Lustre: lsv-OST0048: deleting orphan objects from 1490721 to 1490787
Apr 10 12:50:42 vesta29 kernel: Lustre: lsv-OST001d: deleting orphan objects from 1490592 to 1490659
Apr 10 12:50:42 vesta78 kernel: Lustre: lsv-OST004e: deleting orphan objects from 1490080 to 1490147
Apr 10 12:50:42 vesta41 kernel: Lustre: lsv-OST0029: deleting orphan objects from 1490401 to 1490467
Apr 10 12:50:42 vesta25 kernel: Lustre: lsv-OST0019: deleting orphan objects from 1490337 to 1490403

Note that the ranges roughly coincide with the upper bound of the 'Slow creates' messages quoted above. |
| Comment by Ned Bass [ 11/Apr/13 ] |
|
I wonder if this is related to the changes for fid on OST. In particular I am looking at http://review.whamcloud.com/#change,4324 which seems to have changed the on-disk representation of the last-used object id on the OST. |
| Comment by Niu Yawei (Inactive) [ 11/Apr/13 ] |
|
Hi Ned, was the fs upgraded from 2.3.58? If so, I agree with you that the fid-on-OST changes could be the culprit. It would be helpful to have full logs from the OSTs. |
| Comment by Di Wang [ 11/Apr/13 ] |
|
Hmm, with patch 4324 it will use {seq, 0, 0} as the fid of last_rcvd. I guess the original OSD-ZFS does not map this zero-oid FID to the last_rcvd file. We probably need special handling here. Alex, could you please confirm? |
| Comment by Niu Yawei (Inactive) [ 11/Apr/13 ] |
|
If the system was upgraded from 2.3.58 to 2.3.63, I think the LASSERT was probably triggered as follows:
1. The MDT read the correct last_used_fid 1490242 (or some very large number) from lov_objid, and used it to do orphan cleanup;
2. On the OST side, the OST got the incorrect last_oid 0 from disk (because of the fid-on-ost changes it failed to locate the old last_rcvd), so orphan cleanup tried to re-create millions of objects;
3. The re-creation of those millions of objects broke off in the middle (see the "Slow creates..." messages), and the OST replied to the MDT with the created number 2304 (or some small value);
4. The MDT found that the returned last_fid is smaller than the current last_used_fid, so it kept using the last_used_fid (the larger one) to do precreate;
5. The current last_oid is still a very small value, so the OST started the massive pre-creation again; it should also break off in the middle ("Slow creates...") and return the last created fid (0x4c00 or something similar);
6. The assert triggered on the MDT, because the last fid returned from the OST is still much smaller than opd_pre_used_fid (the correct one). |
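For illustration only, here is a minimal standalone sketch of that sequence (simplified stand-in names and a made-up batch size, not the actual Lustre code): the OST's on-disk last id has been reset to 0, each precreate round only advances it by a small batch, so the fid in the reply never catches up with opd_pre_used_fid and the check in osp_precreate_send() fires.

    /* Simplified simulation; sim_fid and the batch size are made up. */
    #include <assert.h>
    #include <stdio.h>

    struct sim_fid {
            unsigned long long f_seq;
            unsigned int       f_oid;
    };

    /* OST side: pretend each precreate round only manages a small batch
     * of creations ("Slow creates") before replying to the MDT. */
    static unsigned int ost_precreate(unsigned int last_oid, unsigned int wanted_oid)
    {
            const unsigned int batch = 2304;        /* arbitrary small batch */

            if (wanted_oid - last_oid > batch)
                    return last_oid + batch;
            return wanted_oid;
    }

    int main(void)
    {
            /* MDT side: correct last used fid, as read from lov_objid. */
            struct sim_fid pre_used_fid = { 0x100090000ULL, 0x16bec0 };
            unsigned int ost_last_oid = 0;  /* lost after the upgrade */

            struct sim_fid reply = {
                    0x100090000ULL,
                    ost_precreate(ost_last_oid, pre_used_fid.f_oid)
            };

            printf("reply oid 0x%x, pre used oid 0x%x\n",
                   reply.f_oid, pre_used_fid.f_oid);

            /* Stand-in for the LASSERT in osp_precreate_send(): the reply
             * fid must be beyond what the MDT has already handed out. */
            assert(reply.f_oid > pre_used_fid.f_oid);       /* fires here */
            return 0;
    }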
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
Niu's description seems to be correct.. and we can do something like:

    if (zap lookup in OI failed) {
            if (fid_idif() && seq == FID_SEQ_OST_MDT0 && oid == 0)
                    lookup {FID_SEQ_LOCAL_FILE; OFD_GROUP0_LAST_OID} in OI
    }

though all the numbers/names above should be double-checked.. The ideal "solution" would be to reformat, but I'm not sure this is possible. For reference, we were using:

    lu_local_obj_fid(&info->fti_fid, OFD_GROUP0_LAST_OID + group);

in OFD to access last_id for group 0, where

    static inline void lu_local_obj_fid(struct lu_fid *fid, __u32 oid)
    {
            fid->f_seq = FID_SEQ_LOCAL_FILE;
            fid->f_oid = oid;

then

    OFD_GROUP0_LAST_OID = 20UL,
    FID_SEQ_LOCAL_FILE = 0x200000001ULL,

so it should be {FID_SEQ_LOCAL_FILE; 20}, as OFD_GROUP0_LAST_OID is not defined now. |
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
Hmm, this is not quite right, as a new object to track last_id with oid=0 has already been created.. I guess instead we should look up OFD_GROUP0_LAST_OID first if osd_fid_lookup() is called for {FID_SEQ_OST_MDT0; 0}? Another (hopefully not fatal) issue is all the orphans we just created. Luckily, the creation rate wasn't great.. |
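As a sketch only (simplified types, a stubbed OI lookup, and constants taken from the comment above — not the actual osd-zfs code), the compatibility lookup could look roughly like this:

    #include <stdint.h>

    #define FID_SEQ_OST_MDT0        0ULL            /* MDT0 OST sequence */
    #define FID_SEQ_LOCAL_FILE      0x200000001ULL
    #define OFD_GROUP0_LAST_OID     20U

    struct lu_fid {
            uint64_t f_seq;
            uint32_t f_oid;
            uint32_t f_ver;
    };

    /* Stub standing in for the OI ZAP lookup: 0 on success, fills *dnode. */
    static int oi_lookup(const struct lu_fid *fid, uint64_t *dnode)
    {
            (void)fid; (void)dnode;
            return -2;      /* pretend "not found" for this sketch */
    }

    /* When asked for the new LAST_ID fid {seq, 0, 0}, try the pre-4324
     * local-file fid {FID_SEQ_LOCAL_FILE, OFD_GROUP0_LAST_OID} first, so an
     * upgraded filesystem keeps using its old last-id object. */
    static int osd_fid_lookup_compat(const struct lu_fid *fid, uint64_t *dnode)
    {
            if (fid->f_seq == FID_SEQ_OST_MDT0 && fid->f_oid == 0) {
                    struct lu_fid legacy = {
                            .f_seq = FID_SEQ_LOCAL_FILE,
                            .f_oid = OFD_GROUP0_LAST_OID,
                            .f_ver = 0,
                    };

                    if (oi_lookup(&legacy, dnode) == 0)
                            return 0;
            }
            return oi_lookup(fid, dnode);
    }

    int main(void)
    {
            struct lu_fid last_id = { FID_SEQ_OST_MDT0, 0, 0 };
            uint64_t dnode = 0;

            return osd_fid_lookup_compat(&last_id, &dnode) ? 1 : 0;
    }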
| Comment by Niu Yawei (Inactive) [ 11/Apr/13 ] |
Indeed. I'm wondering if there are any tools for ZFS which can copy the old LAST_ID into the new {seq, 0, 0} object? Then we could probably avoid the extra checking above (I assume no other system in the world needs such checking). Of course, it would be great if the system could simply be reformatted. |
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
AFAIU, yes, it usually can be mounted with ZPL. But this may not work for the old filesystem, as compatibility with ZPL was implemented just before the landing in September, IIRC. |
| Comment by Ned Bass [ 11/Apr/13 ] |
|
We can mount it with ZPL. There is some strange behavior like . or .. missing or showing up twice, or incorrect hard link counts. But we can read/write/open/close local objects like LAST_ID, last_rcvd, lov_objid, etc. |
| Comment by Ned Bass [ 11/Apr/13 ] |
|
Alex, BTW this filesystem will not be long-lived due to the risk of these on-disk incompatibilities. We will provide a newly-formatted filesystem for users that will coexist in the same zpools as this legacy one. We just need to be able to mount the legacy one under 2.3.63+ long enough for users to migrate their data. |
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
This is good news then. I'd suggest:
- take a snapshot for safety, then mount with ZPL;
- check the file /oi.1/0x200000001:0x14:0x0 and its content — it should be 8 bytes long and contain a number close to 0x16bec0 (the last id used on the MDS);
- the new file should be in /O/0/d0/0 — it should be 8 bytes too, and the number much smaller than 0x16bec0, close to the first number you saw;
- I think it should be enough to write the content of the old file (/oi.1/0x200000001:0x14:0x0) into the new one (/O/0/d0/0).
Niu, Di, could you please confirm this suggestion is sane? |
| Comment by Ned Bass [ 11/Apr/13 ] |
|
Alex, the oi.* directories are mostly empty through the ZPL. /oi.1/0x200000001:0x14:0x0 doesn't exist, but /O/0/LAST_ID and /O/0/d0/0 are there with contents as you describe. So we should use the LAST_ID file instead, correct? |
| Comment by Di Wang [ 11/Apr/13 ] |
|
Yes, LAST_ID should be used. |
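As a rough sketch of that copy, here is a tiny C helper rather than a supported tool — the paths are the ones mentioned above, and it assumes the OST dataset is mounted via ZPL, a snapshot has been taken, and both files hold a single native-endian 64-bit value:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            /* Old and new last-id files, relative to the ZPL mount point. */
            const char *src = argc > 1 ? argv[1] : "O/0/LAST_ID";
            const char *dst = argc > 2 ? argv[2] : "O/0/d0/0";
            uint64_t last_id;
            int in, out;

            in = open(src, O_RDONLY);
            if (in < 0 || read(in, &last_id, sizeof(last_id)) != sizeof(last_id)) {
                    perror(src);
                    return 1;
            }
            close(in);

            printf("%s: last id 0x%llx\n", src, (unsigned long long)last_id);

            /* Overwrite the 8-byte value in the new last-id object. */
            out = open(dst, O_WRONLY);
            if (out < 0 || write(out, &last_id, sizeof(last_id)) != sizeof(last_id)) {
                    perror(dst);
                    return 1;
            }
            close(out);
            return 0;
    }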
| Comment by Ned Bass [ 11/Apr/13 ] |
|
Okay, we'll schedule a time to try out this fix. It will probably be sometime next week. |
| Comment by Ned Bass [ 12/Apr/13 ] |
|
Niu, Alex, Di, can you think of any other on-disk format changes that may bite us after this one? I don't want to get into a state where we can't mount the filesystem under any version of Lustre. |
| Comment by Andreas Dilger [ 12/Apr/13 ] |
|
Ned, there was a change with http://review.whamcloud.com/5820 ( |
| Comment by Ned Bass [ 15/Apr/13 ] |
|
We went ahead with the proposed workaround for one affected filesystem (lscratchv, used by vulcan). We were able to bring it up under Lustre 2.3.63 without hitting this bug. We will do the same for the legacy Sequoia filesystem (lscratch1) tomorrow. Sequoia is already mounting a new filesystem formatted using Lustre 2.3.63, but we want to mount the old one read-only to allow data migration. I'll report back on how it goes tomorrow. |
| Comment by Peter Jones [ 15/Apr/13 ] |
|
Thanks for the update Ned. I have dropped the priority slightly to reflect that this is still an important support issue but is not a general blocker for the release itself. |
| Comment by Ned Bass [ 18/Apr/13 ] |
|
The update of the sequoia filesystem was successful. If no work is planned for adding compatibility code, I think we can close this issue. |
| Comment by Niu Yawei (Inactive) [ 22/Apr/13 ] |
|
The Sequoia system is fixed, and there is no plan to add compatibility code for 2.3.58 <-> 2.3.63. |
| Comment by James A Simmons [ 24/Apr/14 ] |
|
I just hit this same bug on the 2.5.1 branch. This happened on a newly formatted file system. |
| Comment by James A Simmons [ 24/Apr/14 ] |
|
Actually this is more likely |