
osd_start() LBUG when doing parallel mount of the same target


    Description

      One of our customers hit a crash when mounting OST targets. Here are the system log and backtrace.

      LustreError: 5048:0:(obd_class.h:1008:obd_connect()) Device 35 not setup
      LustreError: 5048:0:(obd_config.c:619:class_cleanup()) Device 35 not setup
      LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) ASSERTION( obd->obd_lu_dev ) failed:
      LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) LBUG
      
      crash> bt
      PID: 5048 TASK: ffff88085be42b00 CPU: 7 COMMAND: "mount.lustre"
       #0 [ffff8807c8189ac0] machine_kexec at ffffffff81035f1b
       #1 [ffff8807c8189b20] crash_kexec at ffffffff810c0f12
       #2 [ffff8807c8189bf0] panic at ffffffff8150e884
       #3 [ffff8807c8189c70] lbug_with_loc at ffffffffa047deeb [libcfs]
       #4 [ffff8807c8189c90] server_fill_super at ffffffffa0717673 [obdclass]
       #5 [ffff8807c8189d70] lustre_fill_super at ffffffffa06e76f8 [obdclass]
       #6 [ffff8807c8189da0] get_sb_nodev at ffffffff81184bcf
       #7 [ffff8807c8189de0] lustre_get_sb at ffffffffa06df175 [obdclass]
       #8 [ffff8807c8189e00] vfs_kern_mount at ffffffff8118420b
       #9 [ffff8807c8189e50] do_kern_mount at ffffffff811843b2
      #10 [ffff8807c8189ea0] do_mount at ffffffff811a4572
      #11 [ffff8807c8189f20] sys_mount at ffffffff811a4c00
      #12 [ffff8807c8189f80] system_call_fastpath at ffffffff8100b072
          RIP: 00000030ce4e909a RSP: 00007fff59ccbb70 RFLAGS: 00010202
          RAX: 00000000000000a5 RBX: ffffffff8100b072 RCX: 00002b58cacf1006
          RDX: 0000000000408bbf RSI: 00007fff59cd4f28 RDI: 00000000012a3170
          RBP: 0000000000000000 R8: 00000000012a3190 R9: 0000000000000000
          R10: 0000000001000000 R11: 0000000000000206 R12: 000000000060db18
          R13: 000000000060db10 R14: 00000000012a3190 R15: 0000000000000000
          ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
      

      The Lustre release is lustre-2.4.2.
      The issue occurred on several OSSs after a first centralized mount of the OSTs was interrupted (shine start ... Ctrl-C) and relaunched.

      I have been able to reproduce the issue on lustre-2.4.2, by launching mount of the same OST target in parallel.

      # mntpoint=/mnt/fs1/ost/4
      # mntdev=/dev/mapper/mpathe
      # umount $mntpoint
      # for i in $(seq 1 10); do dd if=$mntdev of=/dev/null bs=4k count=1M & done ;
      # /bin/mount -t lustre -o errors=panic $mntdev $mntpoint & \
      /bin/mount -t lustre -o errors=panic $mntdev $mntpoint
      

      Launching read I/Os on the device in the background may help reproduce the issue, since ldiskfs_get_sb() calls set_blocksize() (via get_sb_bdev() and sb_set_blocksize()), which serializes with I/Os on the device.

         ...
          /* Prevent starting I/O or mapping the device */
          percpu_down_write(&bdev->bd_block_size_semaphore);
      

      This probably opens a larger window for a potential race between the two mount operations.

      This ticket is certainly related to LU-2134, which is currently unresolved.

      I have reproduced the issue on Lustre 2.5.60 with a slightly different assertion.

      <3>[264188.026108] LustreError: 22665:0:(obd_class.h:954:obd_connect()) Device 1 not setup
      <3>[264188.033999] LustreError: 22665:0:(obd_config.c:615:class_cleanup()) Device 1 not setup
      <3>[264188.042166] LustreError: 22665:0:(obd_mount_server.c:1735:server_fill_super()) Unable to start osd on /dev/ram0: -19
      <3>[264188.052925] LustreError: 22665:0:(obd_mount.c:1335:lustre_fill_super()) Unable to mount  (-19)
      <6>[264189.997203] LDISKFS-fs (ram0): mounted filesystem with ordered data mode. quota=on. Opts: 
      <0>[264190.065537] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) ASSERTION( obd ) failed: 
      <0>[264190.074554] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) LBUG
      <4>[264190.081682] Pid: 22664, comm: mount.lustre
      <4>[264190.085933] 
      <4>[264190.085935] Call Trace:
      <4>[264190.090213]  [<ffffffffa04fd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4>[264190.097358]  [<ffffffffa04fde97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4>[264190.103765]  [<ffffffffa066e522>] server_fill_super+0x1082/0x1620 [obdclass]
      <4>[264190.111014]  [<ffffffffa0605a6e>] ? obd_zombie_is_idle+0x1e/0x70 [obdclass]
      <4>[264190.118183]  [<ffffffffa0639a68>] lustre_fill_super+0x1d8/0x530 [obdclass]
      <4>[264190.125268]  [<ffffffffa0639890>] ? lustre_fill_super+0x0/0x530 [obdclass]
      <4>[264190.132326]  [<ffffffff8118c7bf>] get_sb_nodev+0x5f/0xa0
      <4>[264190.137847]  [<ffffffffa06317b5>] lustre_get_sb+0x25/0x30 [obdclass]
      <4>[264190.144370]  [<ffffffff8118be1b>] vfs_kern_mount+0x7b/0x1b0
      <4>[264190.150108]  [<ffffffff8118bfc2>] do_kern_mount+0x52/0x130
      <4>[264190.155751]  [<ffffffff811acf9b>] do_mount+0x2fb/0x930
      <4>[264190.162326]  [<ffffffff811ad660>] sys_mount+0x90/0xe0
      <4>[264190.167536]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      <4>[264190.173699] 
      

      In both cases, there is a lack of synchronization in osd_start() when the OSD is retrieved with class_name2obd(): the obd device may still be in its initialization phase when the second mount.lustre process reaches this portion of the code.

      static int osd_start(struct lustre_sb_info *lsi, unsigned long mflags)
      {
              struct lustre_mount_data *lmd = lsi->lsi_lmd;
              struct obd_device        *obd;
              struct dt_device_param    p;
              char                      flagstr[16];
              int                       rc;
              ENTRY;
      
              CDEBUG(D_MOUNT,
                     "Attempting to start %s, type=%s, lsifl=%x, mountfl=%lx\n",
                     lsi->lsi_svname, lsi->lsi_osd_type, lsi->lsi_flags, mflags);
      
              sprintf(lsi->lsi_osd_obdname, "%s-osd", lsi->lsi_svname);
              strcpy(lsi->lsi_osd_uuid, lsi->lsi_osd_obdname);
              strcat(lsi->lsi_osd_uuid, "_UUID");
              sprintf(flagstr, "%lu:%lu", mflags, (unsigned long) lmd->lmd_flags);
      
              obd = class_name2obd(lsi->lsi_osd_obdname);
              if (obd == NULL) {
                      rc = lustre_start_simple(lsi->lsi_osd_obdname,
                                               lsi->lsi_osd_type,
                                               lsi->lsi_osd_uuid, lmd->lmd_dev,
                                               flagstr, lsi->lsi_lmd->lmd_opts,
                                               lsi->lsi_svname);
                      if (rc)
                              GOTO(out, rc);
                      obd = class_name2obd(lsi->lsi_osd_obdname);
                      LASSERT(obd);
              }
      
              rc = obd_connect(NULL, &lsi->lsi_osd_exp,
                               obd, &obd->obd_uuid, NULL, NULL);
              if (rc) {
                      obd->obd_force = 1;
                      class_manual_cleanup(obd);
                      lsi->lsi_dt_dev = NULL;
              }
      
              /* XXX: to keep support old components relying on lsi_srv_mnt
               *      we get this info from OSD just started */
              LASSERT(obd->obd_lu_dev);
              lsi->lsi_dt_dev = lu2dt_dev(obd->obd_lu_dev);
              LASSERT(lsi->lsi_dt_dev);
      
              dt_conf_get(NULL, lsi->lsi_dt_dev, &p);
      
              lsi->lsi_srv_mnt = p.ddp_mnt;
      
      out:
              RETURN(rc);
      }
      

      Additional data from the crash dumps is available if needed.

          Activity

            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Ok, but now that I have tried to embed server_fill_super()/osd_start() (the generic Lustre mount operation) within open_bdev_exclusive()/close_bdev_exclusive() protection, osd_mount() fails with -EBUSY, due to the 2nd vfs_kern_mount() call for the same device, this time for the specific underlying file-system type, in osd_mount() for ldiskfs ...

            Also, I have read the source code concerned more carefully, and I have found what causes the difference between the 2 LBUGs when running your reproducer against 2.4.x versus 2.6: it comes from this RETURN() that is missing in the 2.4 version of osd_start():

            static int osd_start(struct lustre_sb_info *lsi, unsigned long mflags)
            {
                    struct lustre_mount_data *lmd = lsi->lsi_lmd;
                    struct obd_device        *obd;
                    struct dt_device_param    p;
                    char                      flagstr[16];
                    int                       rc;
                    ENTRY;
            
                    CDEBUG(D_MOUNT,
                           "Attempting to start %s, type=%s, lsifl=%x, mountfl=%lx\n",
                           lsi->lsi_svname, lsi->lsi_osd_type, lsi->lsi_flags, mflags);
            
                    sprintf(lsi->lsi_osd_obdname, "%s-osd", lsi->lsi_svname);
                    strcpy(lsi->lsi_osd_uuid, lsi->lsi_osd_obdname);
                    strcat(lsi->lsi_osd_uuid, "_UUID");
                    sprintf(flagstr, "%lu:%lu", mflags, (unsigned long) lmd->lmd_flags);
            
                    obd = class_name2obd(lsi->lsi_osd_obdname);
                    if (obd == NULL) {
                            rc = lustre_start_simple(lsi->lsi_osd_obdname,
                                                     lsi->lsi_osd_type,
                                                     lsi->lsi_osd_uuid, lmd->lmd_dev,
                                                     flagstr, lsi->lsi_lmd->lmd_opts,
                                                     lsi->lsi_svname);
                            if (rc)
                                    GOTO(out, rc);
                            obd = class_name2obd(lsi->lsi_osd_obdname);
                            LASSERT(obd);
                    }
            
                    rc = obd_connect(NULL, &lsi->lsi_osd_exp,
                                     obd, &obd->obd_uuid, NULL, NULL);
                    if (rc) {
                            obd->obd_force = 1;
                            class_manual_cleanup(obd);
                            lsi->lsi_dt_dev = NULL;
                            RETURN(rc);   <<<<<<<<<<<<<<<<<<<<<<
                    }
            
                    LASSERT(obd->obd_lu_dev);
            
            ..............
            
            

            which allows a 2nd/extra mount command to continue even after errors during cleanup, and thus to trigger LASSERT(obd->obd_lu_dev), whereas in the 2.6/master case it is the 1st mount command that finally/later triggers LASSERT(obd), likely due to the cleanup actions from the 2nd/extra command ...
            The missing RETURN() in b2_4 was fixed by:

            commit 3e38436dc09097429e1ca1fbfa3ef4981e124715
            Author: Mikhail Pershin <mike.pershin@intel.com>
            Date:   Mon Mar 4 00:06:41 2013 +0400
            
                LU-2059 llog: MGC to use OSD API for backup logs
                
                MGC uses lvfs API to access local llogs blocking removal of old code
                
                - MGS is converted to use OSD API for local llogs
                - llog_is_empty() and llog_backup() are introduced
                - initial OSD start initialize run ctxt so llog can use it too
                - shrink dcache after initial scrub in osd-ldiskfs to avoid holding
                  data of unlinked objects until umount.
                
                Signed-off-by: Mikhail Pershin <mike.pershin@intel.com>
                Change-Id: I137e38abb0e93c12406c6610c0922f4294e67290
            

            BTW, this led me to think of a simpler way to fix the race; it is at http://review.whamcloud.com/11139.


            Humm, it seems that the 2.6.38 kernel introduced some API changes, and the open_bdev_exclusive()/close_bdev_exclusive() calls need to be replaced by blkdev_get_by_path()/blkdev_put() calls with an explicit FMODE_EXCL mode ...
            So I added some compile-time conditional stuff.


            My patch at http://review.whamcloud.com/11122 is an attempt to implement a simple protection mechanism based on Alex's suggestion, thanks for the trick Alex!


            Gregoire,
            Thanks for providing this latest information; it will help me save time in implementing the fix for this problem. I will update the ticket with a patch proposal now.


            Thank you for looking at this Bruno.
            ~ jfc.


            Attaching crash.lustre-2.4.2.txt, which contains crash dump output of the following commands: foreach bt, ps | grep mount, bt -f of the mount.lustre commands, and dmesg.


            Hello Gregoire,
            Thanks for the report and for detailing how this can be reproduced; it looks like it could be done in-house on a test platform. Just in case, can you attach the "foreach bt" output if you have a crash dump available?


            a regular filesystem opens its block device exclusively:

            bdev = open_bdev_exclusive(dev_name, mode, fs_type);
            if (IS_ERR(bdev))
                    return PTR_ERR(bdev);

            this is where the real competition takes place: only the winner goes on to do the actual initialization, etc.
            I guess we need something similar; I don't think we need strong race control inside the initialization?


            People

              bfaccini Bruno Faccini (Inactive)
              pichong Gregoire Pichon