[LU-5447] MGS fails to mount after 1.8 to 2.4.3 upgrade: checking for existing Lustre data: not found Created: 04/Aug/14 Updated: 08/Aug/14 Resolved: 08/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Blake Caldwell | Assignee: | Oleg Drokin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: | RHEL6.5 |
| Attachments: | |
| Severity: | 2 |
| Rank (Obsolete): | 15165 |
| Description |
|
After upgrading our Lustre servers to lustre-2.4.3 (the exact branch can be seen at the github link below), we are not able to start the filesystem: it fails at the first mount of the MGT with the messages below. A debugfs 'stats' output is attached.

The only difference I am able to notice between this filesystem and another that was successfully upgraded from 1.8.9 to 2.4.3 is that this one has the flag "update" in the tunefs.lustre output. Is there a particular meaning to that? We mounted with e2fsprogs-1.42.9 first, then downgraded to e2fsprogs-1.42.7 and still saw the same result. The system was last mounted as a 1.8.9 filesystem and was cleanly unmounted. The multipath configuration would have changed slightly in the rhel5 to rhel6 transition, but the block device is still readable by debugfs.

Is this related to the index not being assigned in 1.8? There were several related jira tickets, but they all appear to have been resolved in 2.4.0.

We are working on a public repo of our branch. This should be it, but the one we are running has the patch for

Our repo:

[root@sns-mds1 ~]# mount -vt lustre /dev/mpath/snsfs-mgt /tmp/lustre/snsfs/sns-mgs

[root@sns-mds1 ~]# tunefs.lustre --dryrun /dev/mapper/snsfs-mgt
Read previous values:

Permanent disk data:

exiting before disk write.
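For reference, the mount attempt above uses /dev/mpath/snsfs-mgt while the tunefs.lustre dry run uses /dev/mapper/snsfs-mgt. A quick way to confirm that the two paths resolve to the same underlying block device is to compare st_rdev from stat(2); the sketch below is only an illustration (not part of the Lustre tools), with the device paths taken from the commands above:

/* Illustrative sketch only: verify that two device paths refer to the
 * same block device by comparing st_rdev from stat(2).  The paths are
 * the ones from the commands above; nothing here is part of Lustre. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/sysmacros.h>
#include <sys/stat.h>

int main(void)
{
        const char *a = "/dev/mpath/snsfs-mgt";   /* path given to mount */
        const char *b = "/dev/mapper/snsfs-mgt";  /* path given to tunefs.lustre */
        struct stat sa, sb;

        if (stat(a, &sa) != 0 || stat(b, &sb) != 0) {
                perror("stat");
                return EXIT_FAILURE;
        }

        if (S_ISBLK(sa.st_mode) && S_ISBLK(sb.st_mode) &&
            sa.st_rdev == sb.st_rdev)
                printf("same block device %u:%u\n",
                       major(sa.st_rdev), minor(sa.st_rdev));
        else
                printf("paths do not resolve to the same block device\n");

        return EXIT_SUCCESS;
}

If the two paths do match, the difference is only in the device-mapper alias, not in the on-disk data. |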
| Comments |
| Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ] |
|
We are locating an engineer to take a look at this problem. |
| Comment by Oleg Drokin [ 04/Aug/14 ] |
|
Looking at the mount util code, the decision that prints after "checking for existing Lustre data" is:

int ldiskfs_is_lustre(char *dev, unsigned *mount_type)
{
        int ret;

        ret = file_in_dev(MOUNT_DATA_FILE, dev);
        if (ret) {
                /* in the -1 case, 'extents' means IS a lustre target */
                *mount_type = LDD_MT_LDISKFS;
                return 1;
        }

        ret = file_in_dev(LAST_RCVD, dev);
        if (ret) {
                *mount_type = LDD_MT_LDISKFS;
                return 1;
        }

        return 0;
}
But the strangest thing is that the same check works from one tool but not from the other. As a first step, can you please check whether /dev/mapper/snsfs-mgt also fails with mount?
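For context, file_in_dev() essentially asks debugfs whether a given file exists on the unmounted device. Below is a minimal stand-alone sketch of that style of check (a simplified assumption of how it behaves, not the actual mount_utils implementation), assuming MOUNT_DATA_FILE and LAST_RCVD expand to CONFIGS/mountdata and last_rcvd:

/* Illustrative sketch only: a simplified stand-in for a
 * file_in_dev()-style check.  It runs debugfs against the unmounted
 * device and looks for the "Inode:" line that debugfs prints when the
 * file exists.  The real mount_utils code differs in detail. */
#include <stdio.h>
#include <string.h>

static int file_exists_on_dev(const char *path, const char *dev)
{
        char cmd[512];
        char line[256];
        FILE *fp;
        int found = 0;

        /* -c: catastrophic mode (skip bitmaps), -R: run a single request */
        snprintf(cmd, sizeof(cmd),
                 "debugfs -c -R 'stat %s' '%s' 2>&1", path, dev);
        fp = popen(cmd, "r");
        if (fp == NULL)
                return 0;

        while (fgets(line, sizeof(line), fp) != NULL) {
                if (strstr(line, "Inode:") != NULL) {
                        found = 1;
                        break;
                }
        }
        pclose(fp);
        return found;
}

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "/dev/mapper/snsfs-mgt";

        printf("CONFIGS/mountdata: %s\n",
               file_exists_on_dev("/CONFIGS/mountdata", dev) ? "present" : "missing");
        printf("last_rcvd:         %s\n",
               file_exists_on_dev("/last_rcvd", dev) ? "present" : "missing");
        return 0;
}

When neither file is found, ldiskfs_is_lustre() above returns 0 and the mount tool reports the device as not containing Lustre data. |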
| Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ] |
|
Thank you Oleg for jumping in. |
| Comment by Blake Caldwell [ 04/Aug/14 ] |
|
Thank you! Forgive my blind omission... /dev/mapper allowed the MGT to mount! Except now we hit |
| Comment by Oleg Drokin [ 04/Aug/14 ] |
|
The patch is trivial and cleanly cherry-picks into my b2_4 tree, so I expect it to work in your tree as well. |
| Comment by Blake Caldwell [ 04/Aug/14 ] |
|
Sounds good. Sorry to be pedantic, but any way to identify the obsoleted record type 10612401 that will be skipped? |
| Comment by Oleg Drokin [ 04/Aug/14 ] |
|
That used to be a setattr record that is now deprecated:

/* MDS_SETATTR_REC = LLOG_OP_MAGIC | 0x12401, obsolete 1.8.0 */
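The value in the log matches that definition: assuming LLOG_OP_MAGIC is 0x10600000 (its value in the Lustre headers of this era; that constant is an assumption, not quoted in this ticket), ORing in 0x12401 gives 0x10612401, i.e. the record type 10612401 reported as skipped. A trivial check:

/* Trivial check that the skipped record type 10612401 (hex) matches the
 * obsolete MDS_SETATTR_REC.  LLOG_OP_MAGIC = 0x10600000 is assumed from
 * the Lustre headers of this era; the OR with 0x12401 is quoted above. */
#include <stdio.h>

#define LLOG_OP_MAGIC   0x10600000
#define MDS_SETATTR_REC (LLOG_OP_MAGIC | 0x12401)

int main(void)
{
        printf("MDS_SETATTR_REC = 0x%08x\n", MDS_SETATTR_REC);  /* prints 0x10612401 */
        return 0;
}

Records of this type are simply skipped during log processing, as the message indicates. |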
| Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ] |
|
Blake, Many thanks, |
| Comment by Blake Caldwell [ 04/Aug/14 ] |
|
Thank you. Yes, please lower the severity level. I'll update this ticket once we've applied LU-4743. |
| Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ] |
|
Done – thanks Blake. |
| Comment by Blake Caldwell [ 04/Aug/14 ] |
|
It mounted successfully! Thanks for your help. This ticket can be closed. |
| Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ] |
|
Excellent! Thank you Blake, and thank you Oleg. Best regards, |
| Comment by Blake Caldwell [ 04/Aug/14 ] |
|
However, now we have an LBUG on MDT unmount that appears to be caused by this patch. This is a similar situation to

Could this get attention as to the possibility of a backport? While it only happens on unmount during testing, the concern is that it could happen under load. We are aiming for a return to production tomorrow a.m.

Aug 4 18:29:32 sns-mds1.ornl.gov kernel: [10346.158586] Lustre: Failing over snsfs-MDT0000 |
| Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ] |
|
Reopened due to reported new LBUG. |
| Comment by Oleg Drokin [ 04/Aug/14 ] |
|
Hm, indeed, it looks like this is the case. I tried http://review.whamcloud.com/#/c/10828/4 and it also cleanly applies to b2_4, so you should be able to apply it as is. |
| Comment by Blake Caldwell [ 05/Aug/14 ] |
|
The patch was successful. Several mount/unmount cycles were completed without a hitch. Thanks Oleg and John! All done with this ticket. |
| Comment by John Fuchs-Chesney (Inactive) [ 05/Aug/14 ] |
|
Thank you for this update Blake – glad to see that things are working well. I'll leave this ticket 'as is' for a few days, and then we can decide to mark it resolved, if no further problems come along. Best regards, |
| Comment by James Nunez (Inactive) [ 08/Aug/14 ] |
|
ORNL applied patches for

Confirmed with ORNL that we can close this ticket. |