
osd_start() LBUG when doing parallel mount of the same target


    Description

      One of our customers hit a crash when mounting OST targets. Here are the system log and backtrace.

      LustreError: 5048:0:(obd_class.h:1008:obd_connect()) Device 35 not setup
      LustreError: 5048:0:(obd_config.c:619:class_cleanup()) Device 35 not setup
      LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) ASSERTION( obd->obd_lu_dev ) failed:
      LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) LBUG
      
      crash> bt
      PID: 5048 TASK: ffff88085be42b00 CPU: 7 COMMAND: "mount.lustre"
       #0 [ffff8807c8189ac0] machine_kexec at ffffffff81035f1b
       #1 [ffff8807c8189b20] crash_kexec at ffffffff810c0f12
       #2 [ffff8807c8189bf0] panic at ffffffff8150e884
       #3 [ffff8807c8189c70] lbug_with_loc at ffffffffa047deeb [libcfs]
       #4 [ffff8807c8189c90] server_fill_super at ffffffffa0717673 [obdclass]
       #5 [ffff8807c8189d70] lustre_fill_super at ffffffffa06e76f8 [obdclass]
       #6 [ffff8807c8189da0] get_sb_nodev at ffffffff81184bcf
       #7 [ffff8807c8189de0] lustre_get_sb at ffffffffa06df175 [obdclass]
       #8 [ffff8807c8189e00] vfs_kern_mount at ffffffff8118420b
       #9 [ffff8807c8189e50] do_kern_mount at ffffffff811843b2
      #10 [ffff8807c8189ea0] do_mount at ffffffff811a4572
      #11 [ffff8807c8189f20] sys_mount at ffffffff811a4c00
      #12 [ffff8807c8189f80] system_call_fastpath at ffffffff8100b072
          RIP: 00000030ce4e909a RSP: 00007fff59ccbb70 RFLAGS: 00010202
          RAX: 00000000000000a5 RBX: ffffffff8100b072 RCX: 00002b58cacf1006
          RDX: 0000000000408bbf RSI: 00007fff59cd4f28 RDI: 00000000012a3170
          RBP: 0000000000000000 R8: 00000000012a3190 R9: 0000000000000000
          R10: 0000000001000000 R11: 0000000000000206 R12: 000000000060db18
          R13: 000000000060db10 R14: 00000000012a3190 R15: 0000000000000000
          ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
      

      The Lustre release is lustre-2.4.2.
      The issue occurred on several OSSs after a first centralized mount of the OSTs was interrupted (shine start ... Ctrl-C) and relaunched.

      I have been able to reproduce the issue on lustre-2.4.2, by launching mount of the same OST target in parallel.

      # mntpoint=/mnt/fs1/ost/4
      # mntdev=/dev/mapper/mpathe
      # umount $mntpoint
      # for i in $(seq 1 10); do dd if=$mntdev of=/dev/null bs=4k count=1M & done ;
      # /bin/mount -t lustre -o errors=panic $mntdev $mntpoint & \
      /bin/mount -t lustre -o errors=panic $mntdev $mntpoint
      

      Launching read I/Os on the device in the background may help reproduce the issue, since ldiskfs_get_sb() calls set_blocksize() (via get_sb_bdev() and sb_set_blocksize()), which serializes with I/Os on the device.

         ...
          /* Prevent starting I/O or mapping the device */
          percpu_down_write(&bdev->bd_block_size_semaphore);
      

      This probably opens a larger window for a potential race between the two mount operations.

      This ticket is certainly related to LU-2134, which is currently unresolved.

      I have reproduced the issue on Lustre 2.5.60 with a slightly different assertion.

      <3>[264188.026108] LustreError: 22665:0:(obd_class.h:954:obd_connect()) Device 1 not setup
      <3>[264188.033999] LustreError: 22665:0:(obd_config.c:615:class_cleanup()) Device 1 not setup
      <3>[264188.042166] LustreError: 22665:0:(obd_mount_server.c:1735:server_fill_super()) Unable to start osd on /dev/ram0: -19
      <3>[264188.052925] LustreError: 22665:0:(obd_mount.c:1335:lustre_fill_super()) Unable to mount  (-19)
      <6>[264189.997203] LDISKFS-fs (ram0): mounted filesystem with ordered data mode. quota=on. Opts: 
      <0>[264190.065537] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) ASSERTION( obd ) failed: 
      <0>[264190.074554] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) LBUG
      <4>[264190.081682] Pid: 22664, comm: mount.lustre
      <4>[264190.085933] 
      <4>[264190.085935] Call Trace:
      <4>[264190.090213]  [<ffffffffa04fd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4>[264190.097358]  [<ffffffffa04fde97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4>[264190.103765]  [<ffffffffa066e522>] server_fill_super+0x1082/0x1620 [obdclass]
      <4>[264190.111014]  [<ffffffffa0605a6e>] ? obd_zombie_is_idle+0x1e/0x70 [obdclass]
      <4>[264190.118183]  [<ffffffffa0639a68>] lustre_fill_super+0x1d8/0x530 [obdclass]
      <4>[264190.125268]  [<ffffffffa0639890>] ? lustre_fill_super+0x0/0x530 [obdclass]
      <4>[264190.132326]  [<ffffffff8118c7bf>] get_sb_nodev+0x5f/0xa0
      <4>[264190.137847]  [<ffffffffa06317b5>] lustre_get_sb+0x25/0x30 [obdclass]
      <4>[264190.144370]  [<ffffffff8118be1b>] vfs_kern_mount+0x7b/0x1b0
      <4>[264190.150108]  [<ffffffff8118bfc2>] do_kern_mount+0x52/0x130
      <4>[264190.155751]  [<ffffffff811acf9b>] do_mount+0x2fb/0x930
      <4>[264190.162326]  [<ffffffff811ad660>] sys_mount+0x90/0xe0
      <4>[264190.167536]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      <4>[264190.173699] 
      

      In both cases, there is a lack of synchronization in osd_start() when the OSD is retrieved with class_name2obd(): the obd device may still be in its initialization phase when the second mount.lustre process reaches this portion of the code.

      static int osd_start(struct lustre_sb_info *lsi, unsigned long mflags)
      {
              struct lustre_mount_data *lmd = lsi->lsi_lmd;
              struct obd_device        *obd;
              struct dt_device_param    p;
              char                      flagstr[16];
              int                       rc;
              ENTRY;
      
              CDEBUG(D_MOUNT,
                     "Attempting to start %s, type=%s, lsifl=%x, mountfl=%lx\n",
                     lsi->lsi_svname, lsi->lsi_osd_type, lsi->lsi_flags, mflags);
      
              sprintf(lsi->lsi_osd_obdname, "%s-osd", lsi->lsi_svname);
              strcpy(lsi->lsi_osd_uuid, lsi->lsi_osd_obdname);
              strcat(lsi->lsi_osd_uuid, "_UUID");
              sprintf(flagstr, "%lu:%lu", mflags, (unsigned long) lmd->lmd_flags);
      
              obd = class_name2obd(lsi->lsi_osd_obdname);
              if (obd == NULL) {
                      rc = lustre_start_simple(lsi->lsi_osd_obdname,
                                               lsi->lsi_osd_type,
                                               lsi->lsi_osd_uuid, lmd->lmd_dev,
                                               flagstr, lsi->lsi_lmd->lmd_opts,
                                               lsi->lsi_svname);
                      if (rc)
                              GOTO(out, rc);
                      obd = class_name2obd(lsi->lsi_osd_obdname);
                      LASSERT(obd);
              }
      
              rc = obd_connect(NULL, &lsi->lsi_osd_exp,
                               obd, &obd->obd_uuid, NULL, NULL);
              if (rc) {
                      obd->obd_force = 1;
                      class_manual_cleanup(obd);
                      lsi->lsi_dt_dev = NULL;
              }
      
              /* XXX: to keep support old components relying on lsi_srv_mnt
               *      we get this info from OSD just started */
              LASSERT(obd->obd_lu_dev);
              lsi->lsi_dt_dev = lu2dt_dev(obd->obd_lu_dev);
              LASSERT(lsi->lsi_dt_dev);
      
              dt_conf_get(NULL, lsi->lsi_dt_dev, &p);
      
              lsi->lsi_srv_mnt = p.ddp_mnt;
      
      out:
              RETURN(rc);
      }
      

      Additional data from the crash dumps is available if needed.

          Activity

            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Ok, but now that I have tried to embed server_fill_super()/osd_start() (the generic Lustre mount operation) within open_bdev_exclusive()/close_bdev_exclusive() protection, osd_mount() fails with -EBUSY, due to the 2nd vfs_kern_mount() call for the same device, this time for the specific underlying file-system type, in osd_mount() for ldiskfs ...

            Also, I have read the source code concerned more carefully, and I have found what causes the difference between the 2 LBUGs when running your reproducer against 2.4.x versus 2.6: it comes from this RETURN() that is missing in the 2.4 version of osd_start():

            static int osd_start(struct lustre_sb_info *lsi, unsigned long mflags)
            {
                    struct lustre_mount_data *lmd = lsi->lsi_lmd;
                    struct obd_device        *obd;
                    struct dt_device_param    p;
                    char                      flagstr[16];
                    int                       rc;
                    ENTRY;
            
                    CDEBUG(D_MOUNT,
                           "Attempting to start %s, type=%s, lsifl=%x, mountfl=%lx\n",
                           lsi->lsi_svname, lsi->lsi_osd_type, lsi->lsi_flags, mflags);
            
                    sprintf(lsi->lsi_osd_obdname, "%s-osd", lsi->lsi_svname);
                    strcpy(lsi->lsi_osd_uuid, lsi->lsi_osd_obdname);
                    strcat(lsi->lsi_osd_uuid, "_UUID");
                    sprintf(flagstr, "%lu:%lu", mflags, (unsigned long) lmd->lmd_flags);
            
                    obd = class_name2obd(lsi->lsi_osd_obdname);
                    if (obd == NULL) {
                            rc = lustre_start_simple(lsi->lsi_osd_obdname,
                                                     lsi->lsi_osd_type,
                                                     lsi->lsi_osd_uuid, lmd->lmd_dev,
                                                     flagstr, lsi->lsi_lmd->lmd_opts,
                                                     lsi->lsi_svname);
                            if (rc)
                                    GOTO(out, rc);
                            obd = class_name2obd(lsi->lsi_osd_obdname);
                            LASSERT(obd);
                    }
            
                    rc = obd_connect(NULL, &lsi->lsi_osd_exp,
                                     obd, &obd->obd_uuid, NULL, NULL);
                    if (rc) {
                            obd->obd_force = 1;
                            class_manual_cleanup(obd);
                            lsi->lsi_dt_dev = NULL;
                            RETURN(rc);   <<<<<<<<<<<<<<<<<<<<<<
                    }
            
                    LASSERT(obd->obd_lu_dev);
            
            ..............
            
            

            which allows a 2nd/extra mount command to continue even after errors during cleanup, and thus to trigger LASSERT(obd->obd_lu_dev), whereas in the 2.6/master case it is the 1st mount command that finally/later triggers LASSERT(obd), likely due to the cleanup actions from the 2nd/extra command ...
            The missing RETURN() in b2_4 was fixed by:

            commit 3e38436dc09097429e1ca1fbfa3ef4981e124715
            Author: Mikhail Pershin <mike.pershin@intel.com>
            Date:   Mon Mar 4 00:06:41 2013 +0400
            
                LU-2059 llog: MGC to use OSD API for backup logs
                
                MGC uses lvfs API to access local llogs blocking removal of old code
                
                - MGS is converted to use OSD API for local llogs
                - llog_is_empty() and llog_backup() are introduced
                - initial OSD start initialize run ctxt so llog can use it too
                - shrink dcache after initial scrub in osd-ldiskfs to avoid holding
                  data of unlinked objects until umount.
                
                Signed-off-by: Mikhail Pershin <mike.pershin@intel.com>
                Change-Id: I137e38abb0e93c12406c6610c0922f4294e67290
            

            BTW, this led me to think of a simpler way to fix the race; it is at http://review.whamcloud.com/11139.


            Humm, it seems that the 2.6.38 kernel introduced some API changes, and the open_bdev_exclusive()/close_bdev_exclusive() calls need to be replaced by blkdev_get_by_path()/blkdev_put() calls with an explicit FMODE_EXCL mode ...
            So I added some compile-time conditional stuff.


            My patch at http://review.whamcloud.com/11122 is an attempt to implement a simple protection mechanism based on Alex's suggestion, thanks for the trick Alex!


            Gregoire,
            Thanks for providing this latest information; it will help me save time in implementing the fix for this problem. I will update the ticket with a patch proposal now.


            Thank you for looking at this Bruno.
            ~ jfc.


            Attaching crash.lustre-2.4.2.txt, which contains crash dump output of the following commands: foreach bt, ps | grep mount, bt -f of the mount.lustre commands, and dmesg.


            Hello Gregoire,
            Thanks for the report and for detailing how this can be reproduced; it looks like it could be done in-house on a test platform. Just in case, can you attach the "foreach bt" output if you have a crash dump available?


            a regular filesystem opens its block device exclusively:

            bdev = open_bdev_exclusive(dev_name, mode, fs_type);
            if (IS_ERR(bdev))
                    return PTR_ERR(bdev);

            this is where the real competition takes place: only the winner goes on to do the actual initialization, etc.
            I guess we need something similar; I don't think we need strong race control inside the initialization?


            People

              bfaccini Bruno Faccini (Inactive)
              pichong Gregoire Pichon