Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: Lustre 2.6.0, Lustre 2.4.2
Description
One of our customers hit a crash when mounting OST targets. Here are the system log and backtrace:
LustreError: 5048:0:(obd_class.h:1008:obd_connect()) Device 35 not setup
LustreError: 5048:0:(obd_config.c:619:class_cleanup()) Device 35 not setup
LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) ASSERTION( obd->obd_lu_dev ) failed:
LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) LBUG

crash> bt
PID: 5048  TASK: ffff88085be42b00  CPU: 7  COMMAND: "mount.lustre"
 #0 [ffff8807c8189ac0] machine_kexec at ffffffff81035f1b
 #1 [ffff8807c8189b20] crash_kexec at ffffffff810c0f12
 #2 [ffff8807c8189bf0] panic at ffffffff8150e884
 #3 [ffff8807c8189c70] lbug_with_loc at ffffffffa047deeb [libcfs]
 #4 [ffff8807c8189c90] server_fill_super at ffffffffa0717673 [obdclass]
 #5 [ffff8807c8189d70] lustre_fill_super at ffffffffa06e76f8 [obdclass]
 #6 [ffff8807c8189da0] get_sb_nodev at ffffffff81184bcf
 #7 [ffff8807c8189de0] lustre_get_sb at ffffffffa06df175 [obdclass]
 #8 [ffff8807c8189e00] vfs_kern_mount at ffffffff8118420b
 #9 [ffff8807c8189e50] do_kern_mount at ffffffff811843b2
#10 [ffff8807c8189ea0] do_mount at ffffffff811a4572
#11 [ffff8807c8189f20] sys_mount at ffffffff811a4c00
#12 [ffff8807c8189f80] system_call_fastpath at ffffffff8100b072
    RIP: 00000030ce4e909a  RSP: 00007fff59ccbb70  RFLAGS: 00010202
    RAX: 00000000000000a5  RBX: ffffffff8100b072  RCX: 00002b58cacf1006
    RDX: 0000000000408bbf  RSI: 00007fff59cd4f28  RDI: 00000000012a3170
    RBP: 0000000000000000  R8:  00000000012a3190  R9:  0000000000000000
    R10: 0000000001000000  R11: 0000000000000206  R12: 000000000060db18
    R13: 000000000060db10  R14: 00000000012a3190  R15: 0000000000000000
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b
Lustre release is lustre-2.4.2.
The issue occurred on several OSSs, after a first centralized mount of the OSTs was interrupted (shine start ... Ctrl-C) and relaunched.
I have been able to reproduce the issue on lustre-2.4.2 by launching mounts of the same OST target in parallel:
# mntpoint=/mnt/fs1/ost/4
# mntdev=/dev/mapper/mpathe
# umount $mntpoint
# for i in $(seq 1 10); do dd if=$mntdev of=/dev/null bs=4k count=1M & done
# /bin/mount -t lustre -o errors=panic $mntdev $mntpoint & \
  /bin/mount -t lustre -o errors=panic $mntdev $mntpoint
Launching read I/Os on the device in the background may help reproduce the issue, since ldiskfs_get_sb() calls set_blocksize() (via get_sb_bdev() and sb_set_blocksize()), which serializes with I/O on the device:
	...
	/* Prevent starting I/O or mapping the device */
	percpu_down_write(&bdev->bd_block_size_semaphore);
This probably opens a larger window for a potential race between the two mount operations.
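For illustration, here is a minimal sketch of that interaction, assuming a kernel that carries the bdev->bd_block_size_semaphore change quoted above; the *_sketch functions and the sketch_do_read() helper are placeholders and not copied from any particular tree. Block-device reads take the semaphore for read, while set_blocksize() takes it for write, so the first mount can stall behind the background dd processes while the second mount proceeds:

	/* Sketch only: why background reads on the device widen the race window. */
	static ssize_t bdev_read_sketch(struct block_device *bdev, char *buf,
					size_t count, loff_t pos)
	{
		ssize_t rc;

		/* readers (the dd processes) hold the semaphore for read */
		percpu_down_read(&bdev->bd_block_size_semaphore);
		rc = sketch_do_read(bdev, buf, count, pos);	/* hypothetical helper */
		percpu_up_read(&bdev->bd_block_size_semaphore);
		return rc;
	}

	static int set_blocksize_sketch(struct block_device *bdev, int size)
	{
		/* the writer must wait until all in-flight reads have drained,
		 * delaying the first mount inside ldiskfs_get_sb() */
		percpu_down_write(&bdev->bd_block_size_semaphore);
		/* ... update the device block size here ... */
		percpu_up_write(&bdev->bd_block_size_semaphore);
		return 0;
	}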
This ticket is certainly related to LU-2134, which is currently unresolved.
I have reproduced the issue on Lustre 2.5.60 with a slightly different assertion.
<3>[264188.026108] LustreError: 22665:0:(obd_class.h:954:obd_connect()) Device 1 not setup
<3>[264188.033999] LustreError: 22665:0:(obd_config.c:615:class_cleanup()) Device 1 not setup
<3>[264188.042166] LustreError: 22665:0:(obd_mount_server.c:1735:server_fill_super()) Unable to start osd on /dev/ram0: -19
<3>[264188.052925] LustreError: 22665:0:(obd_mount.c:1335:lustre_fill_super()) Unable to mount (-19)
<6>[264189.997203] LDISKFS-fs (ram0): mounted filesystem with ordered data mode. quota=on. Opts:
<0>[264190.065537] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) ASSERTION( obd ) failed:
<0>[264190.074554] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) LBUG
<4>[264190.081682] Pid: 22664, comm: mount.lustre
<4>[264190.085933]
<4>[264190.085935] Call Trace:
<4>[264190.090213]  [<ffffffffa04fd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[264190.097358]  [<ffffffffa04fde97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[264190.103765]  [<ffffffffa066e522>] server_fill_super+0x1082/0x1620 [obdclass]
<4>[264190.111014]  [<ffffffffa0605a6e>] ? obd_zombie_is_idle+0x1e/0x70 [obdclass]
<4>[264190.118183]  [<ffffffffa0639a68>] lustre_fill_super+0x1d8/0x530 [obdclass]
<4>[264190.125268]  [<ffffffffa0639890>] ? lustre_fill_super+0x0/0x530 [obdclass]
<4>[264190.132326]  [<ffffffff8118c7bf>] get_sb_nodev+0x5f/0xa0
<4>[264190.137847]  [<ffffffffa06317b5>] lustre_get_sb+0x25/0x30 [obdclass]
<4>[264190.144370]  [<ffffffff8118be1b>] vfs_kern_mount+0x7b/0x1b0
<4>[264190.150108]  [<ffffffff8118bfc2>] do_kern_mount+0x52/0x130
<4>[264190.155751]  [<ffffffff811acf9b>] do_mount+0x2fb/0x930
<4>[264190.162326]  [<ffffffff811ad660>] sys_mount+0x90/0xe0
<4>[264190.167536]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>[264190.173699]
In both cases, there is a lack of synchronization in osd_start() when the OSD obd device is retrieved via class_name2obd(). The obd device may still be in its initialization phase when a second mount.lustre process reaches this portion of the code:
static int osd_start(struct lustre_sb_info *lsi, unsigned long mflags)
{
	struct lustre_mount_data *lmd = lsi->lsi_lmd;
	struct obd_device *obd;
	struct dt_device_param p;
	char flagstr[16];
	int rc;
	ENTRY;

	CDEBUG(D_MOUNT,
	       "Attempting to start %s, type=%s, lsifl=%x, mountfl=%lx\n",
	       lsi->lsi_svname, lsi->lsi_osd_type, lsi->lsi_flags, mflags);

	sprintf(lsi->lsi_osd_obdname, "%s-osd", lsi->lsi_svname);
	strcpy(lsi->lsi_osd_uuid, lsi->lsi_osd_obdname);
	strcat(lsi->lsi_osd_uuid, "_UUID");
	sprintf(flagstr, "%lu:%lu", mflags, (unsigned long) lmd->lmd_flags);

	obd = class_name2obd(lsi->lsi_osd_obdname);
	if (obd == NULL) {
		rc = lustre_start_simple(lsi->lsi_osd_obdname,
					 lsi->lsi_osd_type,
					 lsi->lsi_osd_uuid, lmd->lmd_dev,
					 flagstr, lsi->lsi_lmd->lmd_opts,
					 lsi->lsi_svname);
		if (rc)
			GOTO(out, rc);
		obd = class_name2obd(lsi->lsi_osd_obdname);
		LASSERT(obd);
	}

	rc = obd_connect(NULL, &lsi->lsi_osd_exp, obd, &obd->obd_uuid,
			 NULL, NULL);
	if (rc) {
		obd->obd_force = 1;
		class_manual_cleanup(obd);
		lsi->lsi_dt_dev = NULL;
	}

	/* XXX: to keep support old components relying on lsi_srv_mnt
	 * we get this info from OSD just started */
	LASSERT(obd->obd_lu_dev);
	lsi->lsi_dt_dev = lu2dt_dev(obd->obd_lu_dev);
	LASSERT(lsi->lsi_dt_dev);

	dt_conf_get(NULL, lsi->lsi_dt_dev, &p);
	lsi->lsi_srv_mnt = p.ddp_mnt;

out:
	RETURN(rc);
}
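One possible way to close that window, shown purely as a sketch (this is not necessarily what patch #11139 does, and osd_start_lock is a hypothetical mutex), is to serialize the class_name2obd()/lustre_start_simple() sequence so a second mount.lustre either waits for the first one to finish setting up the OSD or fails cleanly instead of tripping the assertions:

	/* Sketch only: serialize the OSD lookup/start step in osd_start(). */
	static DEFINE_MUTEX(osd_start_lock);

	mutex_lock(&osd_start_lock);
	obd = class_name2obd(lsi->lsi_osd_obdname);
	if (obd == NULL) {
		rc = lustre_start_simple(lsi->lsi_osd_obdname,
					 lsi->lsi_osd_type,
					 lsi->lsi_osd_uuid, lmd->lmd_dev,
					 flagstr, lsi->lsi_lmd->lmd_opts,
					 lsi->lsi_svname);
		if (rc) {
			mutex_unlock(&osd_start_lock);
			GOTO(out, rc);
		}
		obd = class_name2obd(lsi->lsi_osd_obdname);
		LASSERT(obd);
	} else if (!obd->obd_set_up) {
		/* the device exists but another mount has not finished
		 * setting it up yet: fail this mount instead of asserting */
		mutex_unlock(&osd_start_lock);
		GOTO(out, rc = -EBUSY);
	}
	mutex_unlock(&osd_start_lock);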
Additional data from the crash dumps is available if needed.
Attachments
Issue Links
- is related to:
  - LU-5573 Test timeout conf-sanity test_41c (Resolved)
  - LU-6553 Recurrence of LU-5299: obd_mount_server.c:1690:osd_start()) ASSERTION( obd ) failed (Resolved)
  - LU-5606 Interop 2.5.2<->2.6 conf-sanity test_41c: mount MDS failed, Input/output error (Resolved)
  - LU-5921 conf-sanity test_41c: unexpected concurent OST mounts result, rc=0 rc2=1 (Resolved)
Bruno,
I have tested patch #11139 (patch set 6) on the Lustre 2.4.3 Bull branch.
It prevents the crash with the reproducer testcase.
One of the two mount commands fails with the following output message:
Here are the Lustre logs with the D_SUPER ("super") debug log component enabled.
So I think the patch is good and fixes the initial issue. Thanks.
We are now waiting for positive code reviews.
Note that the Lustre 2.4.3 Bull branch now includes the RETURN() call at the end of the obd_connect() error handling in the osd_start() routine. It came from the patch for LU-2059, which was integrated in our repository several weeks ago.
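For reference, the error handling that note refers to would then look roughly like this; the snippet is reconstructed from the osd_start() code quoted in the description plus the LU-2059 note above, not a verbatim quote of that patch:

	rc = obd_connect(NULL, &lsi->lsi_osd_exp, obd, &obd->obd_uuid,
			 NULL, NULL);
	if (rc) {
		obd->obd_force = 1;
		class_manual_cleanup(obd);
		lsi->lsi_dt_dev = NULL;
		/* the RETURN() added by the LU-2059 patch: bail out here
		 * instead of falling through to LASSERT(obd->obd_lu_dev) */
		RETURN(rc);
	}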