[LU-6050] Master testing: Unable to set striping after master downgrade to 2.5 Created: 18/Dec/14 Updated: 29/Nov/16 Resolved: 08/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: | Upgrade from 2.5.3 to Master (Latest commit I99ea077ae79fcdfedd7bb16c2a664714e0ea5ea3) |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 16860 |
| Description |
|
After downgrading from master (latest commit was I99ea077ae79fcdfedd7bb16c2a664714e0ea5ea3) back to 2.5, we are unable to do lfs setstripe to specific OSTs, and setstripe with a stripe count greater than 1 appears to succeed, but the files created are singly striped. Viz:

lfs setstripe -i 2 some_file
error on ioctl 0x4008669a for 'some_file' (3): File too large
error: setstripe: create stripe file 'some_file' failed

lfs setstripe -c 2 some_file
lfs getstripe some_file
some_file
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		objid		objid		group
	     0	     86890763	   0x52dd90b	        0

Digging on the client, I've traced this to a failed RPC to the MDT:

00000002:00100000:5.0:1418933152.848140:0:4339:0:(mdc_locks.c:642:mdc_finish_enqueue()) @@@ op: 1 disposition: 3, status: -27 req@ffff8803c4f9d000 x1487836602112680/t0(0) o101->perses1-MDT0000-mdc-ffff8803f3c1e400@4@gni:12/10 lens 600/544 e 0 to 0 dl 1418933314 ref 1 fl Complete:R/0/0 rc 301/301

And looking on the MDT, I see the failure starting here:

00000004:00000001:2.0:1418933152.838169:0:625:0:(osp_precreate.c:1057:osp_precreate_reserve()) Process entered
00000004:00000001:2.0:1418933152.838170:0:625:0:(osp_precreate.c:1139:osp_precreate_reserve()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)

(Note the -22 is turned into -27 before being passed back to the client.)

Looking at the code, we see the only way to return -EINVAL here is if it's already set in d->opd_pre_status. After logging the system startup, I see it being set here (MDT log):

00000100:00000040:4.0:1418933124.438340:0:668:0:(client.c:1176:ptlrpc_check_status()) @@@ status is -22 req@ffff88022d32cc00 x1487859182731960/t0(0) o5->perses1-OST0005-osc-MDT0000@26@gni:28/4 lens 432/400 e 0 to 0 dl 1418933286 ref 2 fl Rpc:RN/0/0 rc 0/-22
00000004:00020000:4.0:1418933124.438394:0:668:0:(osp_precreate.c:736:osp_precreate_cleanup_orphans()) perses1-OST0005-osc-MDT0000: cannot cleanup orphans: rc = -22
^-- Which sets the 'EINVAL' in d->opd_pre_status.

And that RPC is coming from this action on the MDT:

00000004:00080000:4.0:1418933124.429108:0:668:0:(osp_precreate.c:643:osp_precreate_cleanup_orphans()) perses1-OST0005-osc-MDT0000: going to cleanup orphans since [0x100050000:0x490dcec:0x0]

Looking at logs from OST0005, I see the following:

00000004:00000002:2.0:1418933124.428804:0:24404:0:(osd_handler.c:487:osd_check_lma()) perses1-OST0005: FID [0x100000000:0x0:0x0] != self_fid [0x100050000:0x0:0x0]
00000004:00000001:2.0:1418933124.428806:0:24404:0:(osd_handler.c:491:osd_check_lma()) Process leaving (rc=18446744073709551538 : -78 : ffffffffffffffb2)
00000001:00000001:2.0:1418933124.428807:0:24404:0:(osd_compat.c:330:osd_lookup_in_remote_parent()) Process entered
00000001:00000001:2.0:1418933124.428810:0:24404:0:(osd_compat.c:349:osd_lookup_in_remote_parent()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000004:00000001:2.0:1418933124.428811:0:24404:0:(osd_handler.c:624:osd_fid_lookup()) Process leaving via out (rc=18446744073709551501 : -115 : 0xffffffffffffff8d)
00000020:00000001:2.0:1418933124.428813:0:24404:0:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=4278189824 : 4278189824 : feffff00)
00000004:00000010:2.0:1418933124.428815:0:24404:0:(osd_handler.c:721:osd_object_free()) kfreed 'obj': 176 at ffff880328190e00.
00002000:00000001:2.0:1418933124.428816:0:24404:0:(ofd_dev.c:327:ofd_object_free()) Process entered
00002000:00000040:2.0:1418933124.428817:0:24404:0:(ofd_dev.c:331:ofd_object_free()) object free, fid = [0x100000000:0x0:0x0]
00002000:00000010:2.0:1418933124.428818:0:24404:0:(ofd_dev.c:335:ofd_object_free()) slab-freed '(of)': 160 at ffff880250060310.
00002000:00000001:2.0:1418933124.428819:0:24404:0:(ofd_dev.c:336:ofd_object_free()) Process leaving
00000020:00000001:2.0:1418933124.428819:0:24404:0:(lu_object.c:242:lu_object_alloc()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
00000020:00000001:2.0:1418933124.428821:0:24404:0:(dt_object.c:386:dt_find_or_create()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
00002000:00000010:2.0:1418933124.428822:0:24404:0:(ofd_fs.c:291:ofd_seq_load()) kfreed 'oseq': 96 at ffff8802b52ab140.
00002000:00000001:2.0:1418933124.428824:0:24404:0:(ofd_fs.c:292:ofd_seq_load()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
00002000:00020000:2.0:1418933124.428825:0:24404:0:(ofd_obd.c:1209:ofd_create()) perses1-OST0005: Can't find FID Sequence 0x0: rc = -115
00002000:00000001:2.0:1418933124.428826:0:24404:0:(ofd_obd.c:1210:ofd_create()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
00000010:00000001:2.0:1418933124.428827:0:24404:0:(obd_class.h:840:obd_create()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)

And this error eventually gets bubbled back to the MDT.

I can't figure out what items should be in that FID sequence, or how those orphans came to be. I'm posting this mainly so the underlying issue can hopefully be identified, but I'd also welcome it if anyone had any ideas how to get the live system working again. We would prefer not to have to reformat. I'm going to attach logs from the client, MDS, and an OSS (all OSSes report this issue). |
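The self-FID that osd_check_lma() is comparing against lives in the trusted.lma xattr of the object, so it can be inspected directly once the OST's backing filesystem is mounted as plain ldiskfs. A minimal sketch only; the device and mount-point names below are placeholders:

```
# Stop the OST, then mount its backing device directly as ldiskfs (read-only is enough for inspection).
mount -t ldiskfs -o ro /dev/ost0005_dev /mnt/ost0005

# Dump the trusted.lma xattr of the LAST_ID object for FID sequence 0;
# it holds the self-FID that osd_check_lma() compares against.
getfattr -e hex -n trusted.lma /mnt/ost0005/O/0/LAST_ID

umount /mnt/ost0005
```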
| Comments |
| Comment by Patrick Farrell (Inactive) [ 18/Dec/14 ] |
|
Logs are on FTP at ftp.whamcloud.com:/uploads/ |
| Comment by Andreas Dilger [ 19/Dec/14 ] |
|
The MDS-side problem looks like it would be fixed by http://review.whamcloud.com/12617

There is a separate question of why the OST is down. This appears to be caused by:

osd_check_lma() perses1-OST0005: FID [0x100000000:0x0:0x0] != self_fid [0x100050000:0x0:0x0]

This is because master uses the "correct" IDIF FID for OST index 5 (SEQ = 0x100000000 | (ost_idx << 16)), while the old code used the "wrong" FID (only storing OST index 0 in the IDIF FID). The object in question looks to be the O/0/LAST_ID object (OID=0, VER=0 is the LAST_ID object for a particular IDIF sequence number).

You may be able to fix this problem by mounting the OST as type ldiskfs and deleting the trusted.lma xattr (either using setfattr -x trusted.lma O/0/LAST_ID, or copying it to a temporary file and then renaming it back on top of the original, which will work as long as "cp" doesn't copy xattrs).

Did you run in DNE mode on master, or run LFSCK on OST0005?

A quick look at the commit log shows this might relate to http://review.whamcloud.com/11304 part (2), or http://review.whamcloud.com/10580, or http://review.whamcloud.com/9560, or http://review.whamcloud.com/7053, so we need to see what can be done to avoid this, and whether there is already a patch in b2_5 that will avoid this problem. |
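A minimal sketch of that workaround, assuming the OST has been stopped first; /dev/ost0005_dev and /mnt/ost0005 are placeholders for the real device and mount point:

```
# Mount the OST backing filesystem directly as ldiskfs (the OST must not be mounted as Lustre).
mount -t ldiskfs /dev/ost0005_dev /mnt/ost0005
cd /mnt/ost0005

# Option 1: remove the stale trusted.lma xattr from the LAST_ID object.
setfattr -x trusted.lma O/0/LAST_ID

# Option 2: copy the file and rename the copy back over the original;
# cp does not copy xattrs by default, so the replacement has no trusted.lma.
# cp O/0/LAST_ID O/0/LAST_ID.tmp && mv O/0/LAST_ID.tmp O/0/LAST_ID

cd /
umount /mnt/ost0005
```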
| Comment by Patrick Farrell (Inactive) [ 19/Dec/14 ] |
|
Andreas,

Thanks for the response. Good to understand what that object is. (It was clearly something special, but I couldn't quite tell which special object it was or find it on the underlying volume.)

I've just tried the suggested workaround and can confirm that it worked to get the file system up and running. Thank you very much. If you don't mind, could you elaborate on what's stored in that xattr and how the file system recovered from having it deleted?

We did NOT run in DNE mode on master (just one MDT on this file system), and while I didn't deliberately run LFSCK on OST0005, I note that the OI scrub was run automatically. I have log messages of it trying to run and failing. (Note that we have this same error on all of the OSTs. There were similar messages for OST0005; these are just the errors I have handy.)

2014-12-18T10:53:07.060468-06:00 c1-0c1s4n2 Lustre: perses1-OST0003-: trigger OI scrub by RPC for [0x100000000:0x491c8a1:0x0], rc = 0 [1]
| Comment by Andreas Dilger [ 19/Dec/14 ] |
|
The trusted.lma (struct lustre_mdt_attrs) xattr stores the "self" FID for each object. In older OST filesystems this was stored in the trusted.fid xattr along with the MDT inode's FID, but in 2.4 the back-end code between the OST and MDT was unified. The old OST code didn't have any concept of which OST index it was using, so it always stored f_seq=0 into the LMA FID. One of the patches referenced in my previous comment changed the OST code to store this FID as a proper IDIF FID, which encodes the OST index into the FID so that it is unique across all OSTs, instead of always re-using OST index 0. That resolved some issues with LFSCK checking consistency between the MDT and OST objects.

We've done upgrade/downgrade testing previously, so it isn't clear why this problem wasn't seen in our local testing, or on the other OSTs in your test system. There were some changes made to the code to allow the old and new styles of IDIF FIDs to compare properly, which likely need to be backported to b2_5, but I think Wang Di and Fan Yong recall the details better than I do. |
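To make the "correct" versus "wrong" IDIF formats concrete, the sequence numbers can be computed by hand from the SEQ = 0x100000000 | (ost_idx << 16) formula quoted earlier; a small shell illustration only, not Lustre code:

```
# New-style IDIF sequence written by master for OST index 5:
printf 'OST index 5: seq = 0x%x\n' $(( 0x100000000 | (5 << 16) ))   # -> 0x100050000

# Old-style code always behaved as if the index were 0, so it expects:
printf 'OST index 0: seq = 0x%x\n' $(( 0x100000000 | (0 << 16) ))   # -> 0x100000000
```

This is exactly the mismatch in the osd_check_lma() message above: the downgraded 2.5 code computes the expected FID with sequence 0x100000000, while master had stored a self-FID with sequence 0x100050000.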
| Comment by Patrick Farrell (Inactive) [ 20/Dec/14 ] |
|
Actually, this hit all the OSTs on our test system. Sorry, I may not have made that clear.

One thing I forgot to mention: this is a system that was upgraded from 1.8 to 2.4, then from 2.4 to 2.5. The file system was originally formatted with 1.8. (Dirdata was added and quotas were updated.)

Thanks for the detailed explanation. I'm a bit concerned by the idea that users need to patch 2.5 in order to be able to downgrade - that sort of defeats the idea of backwards compatibility, doesn't it? |
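As an aside on that upgrade history, the dirdata change mentioned here is an ldiskfs feature flag that is typically enabled on the MDT during the 1.8-to-2.x upgrade; a hedged example with a placeholder device path (it generally requires the Lustre e2fsprogs, and is unrelated to the LAST_ID issue itself):

```
# Enable the dirdata feature on the MDT's backing ldiskfs filesystem (MDT must be unmounted).
tune2fs -O dirdata /dev/mdt_dev

# Confirm the feature is now listed.
dumpe2fs -h /dev/mdt_dev | grep -i 'features'
```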
| Comment by nasf (Inactive) [ 21/Dec/14 ] |
|
The issue you hit is that, when your system was upgraded to master, master generated some new-format IDIF FIDs (which contain the real OST index) and stored them in the OST objects' LMA EA. There are three cases that may cause a new-format IDIF to be generated:

1) The OST-object was newly created after the upgrade. Currently, master generates a new-format IDIF for every newly created OST-object.

If an OST-object contains a new-format IDIF in its LMA EA, regardless of whether it is a newly created or a converted one, then after the system is downgraded to Lustre 2.5 the old code finds that the FID given by the upper layer (old-format IDIF) does not match the self-FID in the LMA EA (new-format IDIF). Because there is no further compatibility handling on b2_5, this causes various unexpected failures.

To resolve the compatibility issues, there are two possible solutions: |
| Comment by Patrick Farrell (Inactive) [ 05/Jan/15 ] |
|
nasf, since we've effectively created an incompatibility preventing downgrades, I think your second option is the way to go. (Unless there is some other way to resolve this.) I think in general it's not OK to make downgrade impossible without the user taking specific action. Your second suggestion would provide that. |
| Comment by Andreas Dilger [ 07/Jan/15 ] |
|
The tune2fs tool is not the right mechanism for this. The tune2fs utility is for ext2/3/4 filesystems only and has nothing to do with the problem here. This needs to be done with the feature flags in the last_rcvd file, which can be set for newly formatted filesystems, set manually via tunefs.lustre, or perhaps set when a manual LFSCK is run on the OST.

I also recommend that we make a patch for b2_5 if this problem was present in 2.6.0, since we are not making any updates for 2.6.x. The code should not be fragile in the face of a minor on-disk inconsistency; it should be able to handle this case, and you can bet that someone will want to downgrade even after they "knew" they didn't have to anymore. I doubt that 2.5.3 is the end of the road for any 2.5 installation, so we may as well fix this problem in both places. |
| Comment by Andreas Dilger [ 07/Jan/15 ] |
|
Note: the new ROCOMPAT flag in the OSTs should be in the SUPPORTED mask for 2.7, and can be added to the SUPPORTED mask for 2.5 when it is patched, so that downgrades are safe. The b2_5 patch should NOT set the ROCOMPAT flag in the last_rcvd file, since this would cause problems for 2.5.x interop. |
| Comment by Gerrit Updater [ 23/Jan/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13516 |
| Comment by Andreas Dilger [ 31/Jan/15 ] |
|
Making this one a blocker so it is landed for 2.7; otherwise it isn't possible to downgrade to 2.5 after an upgrade. Note that it would also be possible to backport support for the OST index in IDIF FIDs to 2.5 in order to allow downgrades after an upgrade, but that isn't a requirement for this to land in 2.7 and is only mentioned here for future reference. |
| Comment by Gerrit Updater [ 08/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13516/ |
| Comment by Peter Jones [ 08/Feb/15 ] |
|
Landed for 2.7 |