[LU-5447] MGS fails to mount after 1.8 to 2.4.3 upgrade: checking for existing Lustre data: not found Created: 04/Aug/14  Updated: 08/Aug/14  Resolved: 08/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Blake Caldwell Assignee: Oleg Drokin
Resolution: Duplicate Votes: 0
Labels: None
Environment:

RHEL6.5
kernel-2.6.32-358.23.2.el6
e2fsprogs-1.42.7.wc1-7.el6.x86_64


Attachments: File debugfs.snsfs-mgt     File lustre_2.4.3_upgrade_kernel_logs.gz    
Severity: 2
Rank (Obsolete): 15165

 Description   

After upgrading our Lustre servers to lustre-2.4.3 (the exact branch can be seen at the GitHub link below), we are not able to start the filesystem; it fails at the first mount of the MGT with the messages below. A debugfs 'stats' output is attached.

The only difference I am able to notice between this filesystem and another that was successfully upgraded from 1.8.9 to 2.4.3 is that this one has the "update" flag in the tunefs.lustre output. Is there a particular meaning to that?

We first attempted the mount with e2fsprogs-1.42.9 installed, then downgraded to e2fsprogs-1.42.7 and saw the same result. The filesystem was last mounted as a 1.8.9 filesystem and was cleanly unmounted. The multipath configuration would have changed slightly in the RHEL5 to RHEL6 transition, but the block device is still readable by debugfs.

Is this related to the index not being assigned in 1.8? There were several related Jira tickets, but they all appear to have been resolved in 2.4.0.

We are working on a public repo of our branch. This should be it (link below), but the one we are running has the patch for LU-5284: http://review.whamcloud.com/#/c/11136/

Our repo:
https://github.com/ORNL-TechInt/lustre/commits/master

[root@sns-mds1 ~]# mount -vt lustre /dev/mpath/snsfs-mgt /tmp/lustre/snsfs/sns-mgs
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = /dev/mpath/snsfs-mgt
arg[5] = /tmp/lustre/snsfs/sns-mgs
source = /dev/mpath/snsfs-mgt (/dev/mpath/snsfs-mgt), target = /tmp/lustre/snsfs/sns-mgs
options = rw
checking for existing Lustre data: not found
mount.lustre: /dev/mpath/snsfs-mgt has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool

[root@sns-mds1 ~]# tunefs.lustre --dryrun /dev/mapper/snsfs-mgt
checking for existing Lustre data: found
Reading CONFIGS/mountdata

Read previous values:
Target: MGS
Index: unassigned
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x54
(MGS needs_index update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:

Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x44
(MGS update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:

exiting before disk write.
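
For reference, the flag words above decompose into individual LDD_F_* bits; a minimal sketch, assuming the bit values from lustre_disk.h (MGS = 0x04, needs_index = 0x10, update = 0x40):

# Assumed LDD_F_* bit values from lustre_disk.h:
#   SV_TYPE_MGS = 0x04, NEED_INDEX = 0x10, UPDATE = 0x40
printf '0x%02x\n' $(( 0x04 | 0x10 | 0x40 ))   # 0x54 -> "MGS needs_index update"
printf '0x%02x\n' $(( 0x04 | 0x40 ))          # 0x44 -> "MGS update"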



 Comments   
Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ]

We are locating an engineer to take a look at this problem.
~ jfc.

Comment by Oleg Drokin [ 04/Aug/14 ]

Looking at the mount util code, the decision behind what is printed after "checking for existing Lustre data" is made here:

int ldiskfs_is_lustre(char *dev, unsigned *mount_type)
{
        int ret;

        ret = file_in_dev(MOUNT_DATA_FILE, dev);
        if (ret) {
                /* in the -1 case, 'extents' means IS a lustre target */
                *mount_type = LDD_MT_LDISKFS;
                return 1;
        }

        ret = file_in_dev(LAST_RCVD, dev);
        if (ret) {
                *mount_type = LDD_MT_LDISKFS;
                return 1;
        }

        return 0;
}
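
The same check can be reproduced by hand with debugfs; a minimal sketch, assuming MOUNT_DATA_FILE and LAST_RCVD resolve to CONFIGS/mountdata and last_rcvd as in the standard headers:

# Hand-run the equivalent of file_in_dev(): ask debugfs (catastrophic,
# read-only mode) whether the Lustre marker files exist on the target.
debugfs -c -R 'stat /CONFIGS/mountdata' /dev/mapper/snsfs-mgt
debugfs -c -R 'stat /last_rcvd' /dev/mapper/snsfs-mgt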

But the strangest thing is that the same check works from one tool but not from the other.
I cannot help but notice that the mount command was given a different path: /dev/mpath/snsfs-mgt, whereas tunefs was given /dev/mapper/snsfs-mgt.

As a first step, can you please check whether /dev/mapper/snsfs-mgt also fails with mount?

Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ]

Thank you Oleg for jumping in.
~ jfc.

Comment by Blake Caldwell [ 04/Aug/14 ]

Thank you! Forgive my blind omission... /dev/mapper allowed the MGT to mount! Except now we hit LU-4743. I am uploading the log messages. Could you advise whether the patch that landed for 2.5.2 can be backported to 2.4.3?
http://review.whamcloud.com/#/c/9574/

Comment by Oleg Drokin [ 04/Aug/14 ]

The patch is trivial and cleanly cherry-picks into my b2_4 tree, so I expect it to work in your tree as well.
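
For reference, a minimal sketch of pulling the change from Gerrit and cherry-picking it onto a local 2.4.3-based branch (the fs/lustre-release fetch URL and the refs/changes path for change 9574 are assumptions; confirm the exact patch-set ref in the Gerrit UI):

# Fetch change 9574 from the Whamcloud Gerrit and cherry-pick it.
# <N> is the patch-set number shown in Gerrit (placeholder).
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/74/9574/<N>
git cherry-pick FETCH_HEAD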

Comment by Blake Caldwell [ 04/Aug/14 ]

Sounds good. Sorry to be pedantic, but is there any way to identify the obsoleted record type 10612401 that will be skipped?

Comment by Oleg Drokin [ 04/Aug/14 ]

That used to be a setattr record that is now deprecated:

/* MDS_SETATTR_REC = LLOG_OP_MAGIC | 0x12401, obsolete 1.8.0 */
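
In other words, assuming the record type in the log is printed in hex and LLOG_OP_MAGIC is 0x10600000 as in the Lustre llog headers, the value decomposes as follows (a sketch):

# MDS_SETATTR_REC = LLOG_OP_MAGIC | 0x12401
printf '0x%x\n' $(( 0x10600000 | 0x12401 ))   # -> 0x10612401, matching the "10612401" record type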

Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ]

Blake,
May we mark this ticket as resolved?
Or, if you want us to keep it open a while longer, can I downgrade the severity level?

Many thanks,
~ jfc.

Comment by Blake Caldwell [ 04/Aug/14 ]

Thank you. Yes, please lower the severity level. I'll update this ticket once we've applied the LU-4743 patch.

Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ]

Done – thanks Blake.
~ jfc.

Comment by Blake Caldwell [ 04/Aug/14 ]

It mounted successfully! Thanks for your help. This ticket can be closed.

Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ]

Excellent! Thank you Blake, and thank you Oleg.

Best regards,
~ jfc.

Comment by Blake Caldwell [ 04/Aug/14 ]

However, we now have an LBUG on MDT unmount that appears to be caused by this patch. This is a similar situation to LU-5188, which caused LU-5244 (the error below).

Could this get attention regarding the possibility of a backport? While it only happens on unmount during testing, the concern is that it could happen under load. We are aiming for a return to production tomorrow morning.

Aug 4 18:29:32 sns-mds1.ornl.gov kernel: [10346.158586] Lustre: Failing over snsfs-MDT0000
Aug 4 18:29:38 sns-mds1.ornl.gov kernel: [10352.174342] Lustre: 20277:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1407191372/real 1407191372] req@ffff8808054b5000 x1475540343371484/t0(0) o9->snsfs-OST0033-osc@128.219.249.38@tcp:28/4 lens 224/224 e 0 to 1 dl 1407191378 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Aug 4 18:29:38 sns-mds1.ornl.gov kernel: [10352.202412] Lustre: 20277:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 58 previous similar messages
Aug 4 18:29:44 sns-mds1.ornl.gov kernel: [10358.215880] Lustre: 20277:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1407191378/real 1407191378] req@ffff880805bce400 x1475540343371524/t0(0) o9->snsfs-OST0034-osc@128.219.249.35@tcp:28/4 lens 224/224 e 0 to 1 dl 1407191384 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Aug 4 18:29:45 sns-mds1.ornl.gov kernel: [10359.063530] Lustre: 10969:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1407191373/real 1407191373] req@ffff880415a31000 x1475540343371488/t0(0) o13->snsfs-OST0038-osc@128.219.249.35@tcp:7/4 lens 224/368 e 0 to 1 dl 1407191385 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Aug 4 18:29:45 sns-mds1.ornl.gov kernel: [10359.065516] Lustre: snsfs-OST003b-osc: Connection to snsfs-OST003b (at 128.219.249.38@tcp) was lost; in progress operations using this service will wait for recovery to complete
Aug 4 18:29:45 sns-mds1.ornl.gov kernel: [10359.065521] Lustre: Skipped 12 previous similar messages
Aug 4 18:29:45 sns-mds1.ornl.gov kernel: [10359.112859] Lustre: 10969:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Aug 4 18:29:50 sns-mds1.ornl.gov kernel: [10364.247409] Lustre: 20277:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1407191384/real 1407191384] req@ffff880805bce400 x1475540343371532/t0(0) o9->snsfs-OST0035-osc@128.219.249.36@tcp:28/4 lens 224/224 e 0 to 1 dl 1407191390 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.341707] LustreError: 11515:0:(osp_sync.c:885:osp_sync_thread()) ASSERTION( count < 10 ) failed: snsfs-OST0001-osc: 2 2 empty
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.353399] LustreError: 11515:0:(osp_sync.c:885:osp_sync_thread()) LBUG
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.360167] Pid: 11515, comm: osp-syn-1
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.364061]
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.364062] Call Trace:
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.368132] [<ffffffffa04af895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.375168] [<ffffffffa04afe97>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.381423] [<ffffffffa0f81f04>] osp_sync_thread+0x6d4/0x7e0 [osp]
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.387758] [<ffffffff81063b80>] ? default_wake_function+0x0/0x20
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.394006] [<ffffffffa0f81830>] ? osp_sync_thread+0x0/0x7e0 [osp]
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.400343] [<ffffffff8100c0ca>] child_rip+0xa/0x20
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.405377] [<ffffffffa0f81830>] ? osp_sync_thread+0x0/0x7e0 [osp]
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.411712] [<ffffffffa0f81830>] ? osp_sync_thread+0x0/0x7e0 [osp]
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.418043] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Aug 4 18:30:40 sns-mds1.ornl.gov kernel: [10414.423244]

Comment by John Fuchs-Chesney (Inactive) [ 04/Aug/14 ]

Reopened due to reported new LBUG.
~ jfc.

Comment by Oleg Drokin [ 04/Aug/14 ]

Hm, indeed, it looks like this is the case.

I tried http://review.whamcloud.com/#/c/10828/4 and it also cleanly applies to b2_4, so you should be able to apply it as is.

Comment by Blake Caldwell [ 05/Aug/14 ]

The patch was successful. Several mount/unmount cycles were completed without a hitch. Thanks Oleg and John! All done with this ticket.

Comment by John Fuchs-Chesney (Inactive) [ 05/Aug/14 ]

Thank you for this update Blake – glad to see that things are working well.

I'll leave this ticket 'as is' for a few days, and then we can decide to mark it resolved, if no further problems come along.

Best regards,
~ jfc.

Comment by James Nunez (Inactive) [ 08/Aug/14 ]

ORNL applied patches for LU-4743 (http://review.whamcloud.com/#/c/10624/) and one of the LU-5188 patches (http://review.whamcloud.com/#/c/10828/4), and this fixed the issues they were seeing.

Confirmed with ORNL that we can close this ticket.
