Lustre / LU-5299

osd_start() LBUG when doing parallel mount of the same target


    Description

      One of our customers hits a crash when mounting OST targets. Here is the system log and backtrace.

      LustreError: 5048:0:(obd_class.h:1008:obd_connect()) Device 35 not setup
      LustreError: 5048:0:(obd_config.c:619:class_cleanup()) Device 35 not setup
      LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) ASSERTION( obd->obd_lu_dev ) failed:
      LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) LBUG
      
      crash> bt
      PID: 5048 TASK: ffff88085be42b00 CPU: 7 COMMAND: "mount.lustre"
       #0 [ffff8807c8189ac0] machine_kexec at ffffffff81035f1b
       #1 [ffff8807c8189b20] crash_kexec at ffffffff810c0f12
       #2 [ffff8807c8189bf0] panic at ffffffff8150e884
       #3 [ffff8807c8189c70] lbug_with_loc at ffffffffa047deeb [libcfs]
       #4 [ffff8807c8189c90] server_fill_super at ffffffffa0717673 [obdclass]
       #5 [ffff8807c8189d70] lustre_fill_super at ffffffffa06e76f8 [obdclass]
       #6 [ffff8807c8189da0] get_sb_nodev at ffffffff81184bcf
       #7 [ffff8807c8189de0] lustre_get_sb at ffffffffa06df175 [obdclass]
       #8 [ffff8807c8189e00] vfs_kern_mount at ffffffff8118420b
       #9 [ffff8807c8189e50] do_kern_mount at ffffffff811843b2
      #10 [ffff8807c8189ea0] do_mount at ffffffff811a4572
      #11 [ffff8807c8189f20] sys_mount at ffffffff811a4c00
      #12 [ffff8807c8189f80] system_call_fastpath at ffffffff8100b072
          RIP: 00000030ce4e909a RSP: 00007fff59ccbb70 RFLAGS: 00010202
          RAX: 00000000000000a5 RBX: ffffffff8100b072 RCX: 00002b58cacf1006
          RDX: 0000000000408bbf RSI: 00007fff59cd4f28 RDI: 00000000012a3170
          RBP: 0000000000000000 R8: 00000000012a3190 R9: 0000000000000000
          R10: 0000000001000000 R11: 0000000000000206 R12: 000000000060db18
          R13: 000000000060db10 R14: 00000000012a3190 R15: 0000000000000000
          ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
      

      Lustre release is lustre-2.4.2.
      The issue occurred on several OSSs after a first centralized mount of the OSTs was interrupted (shine start ... Ctrl-C) and relaunched.

      I have been able to reproduce the issue on lustre-2.4.2, by launching mount of the same OST target in parallel.

      # mntpoint=/mnt/fs1/ost/4
      # mntdev=/dev/mapper/mpathe
      # umount $mntpoint
      # for i in $(seq 1 10); do dd if=$mntdev of=/dev/null bs=4k count=1M & done ;
      # /bin/mount -t lustre -o errors=panic $mntdev $mntpoint & \
      /bin/mount -t lustre -o errors=panic $mntdev $mntpoint
      

      Launching read I/Os on the device in the background may help reproduce the issue, since ldiskfs_get_sb() calls set_blocksize() (via get_sb_bdev() and sb_set_blocksize()), which serializes with I/Os on the device:

         ...
          /* Prevent starting I/O or mapping the device */
          percpu_down_write(&bdev->bd_block_size_semaphore);
      

      This probably opens a larger window for a potential race between the two mount operations.

      This ticket is certainly related to LU-2134, which is currently unresolved.

      I have reproduced the issue on Lustre 2.5.60 with a slightly different assertion.

      <3>[264188.026108] LustreError: 22665:0:(obd_class.h:954:obd_connect()) Device 1 not setup
      <3>[264188.033999] LustreError: 22665:0:(obd_config.c:615:class_cleanup()) Device 1 not setup
      <3>[264188.042166] LustreError: 22665:0:(obd_mount_server.c:1735:server_fill_super()) Unable to start osd on /dev/ram0: -19
      <3>[264188.052925] LustreError: 22665:0:(obd_mount.c:1335:lustre_fill_super()) Unable to mount  (-19)
      <6>[264189.997203] LDISKFS-fs (ram0): mounted filesystem with ordered data mode. quota=on. Opts: 
      <0>[264190.065537] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) ASSERTION( obd ) failed: 
      <0>[264190.074554] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) LBUG
      <4>[264190.081682] Pid: 22664, comm: mount.lustre
      <4>[264190.085933] 
      <4>[264190.085935] Call Trace:
      <4>[264190.090213]  [<ffffffffa04fd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4>[264190.097358]  [<ffffffffa04fde97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4>[264190.103765]  [<ffffffffa066e522>] server_fill_super+0x1082/0x1620 [obdclass]
      <4>[264190.111014]  [<ffffffffa0605a6e>] ? obd_zombie_is_idle+0x1e/0x70 [obdclass]
      <4>[264190.118183]  [<ffffffffa0639a68>] lustre_fill_super+0x1d8/0x530 [obdclass]
      <4>[264190.125268]  [<ffffffffa0639890>] ? lustre_fill_super+0x0/0x530 [obdclass]
      <4>[264190.132326]  [<ffffffff8118c7bf>] get_sb_nodev+0x5f/0xa0
      <4>[264190.137847]  [<ffffffffa06317b5>] lustre_get_sb+0x25/0x30 [obdclass]
      <4>[264190.144370]  [<ffffffff8118be1b>] vfs_kern_mount+0x7b/0x1b0
      <4>[264190.150108]  [<ffffffff8118bfc2>] do_kern_mount+0x52/0x130
      <4>[264190.155751]  [<ffffffff811acf9b>] do_mount+0x2fb/0x930
      <4>[264190.162326]  [<ffffffff811ad660>] sys_mount+0x90/0xe0
      <4>[264190.167536]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      <4>[264190.173699] 
      

      In both cases, there is a lack of synchronisation in osd_start() when the OSD device is retrieved with class_name2obd(): the obd device may still be in its initialization phase when a second mount.lustre process reaches this portion of the code.

      static int osd_start(struct lustre_sb_info *lsi, unsigned long mflags)
      {
              struct lustre_mount_data *lmd = lsi->lsi_lmd;
              struct obd_device        *obd;
              struct dt_device_param    p;
              char                      flagstr[16];
              int                       rc;
              ENTRY;
      
              CDEBUG(D_MOUNT,
                     "Attempting to start %s, type=%s, lsifl=%x, mountfl=%lx\n",
                     lsi->lsi_svname, lsi->lsi_osd_type, lsi->lsi_flags, mflags);
      
              sprintf(lsi->lsi_osd_obdname, "%s-osd", lsi->lsi_svname);
              strcpy(lsi->lsi_osd_uuid, lsi->lsi_osd_obdname);
              strcat(lsi->lsi_osd_uuid, "_UUID");
              sprintf(flagstr, "%lu:%lu", mflags, (unsigned long) lmd->lmd_flags);
      
              /* note: a concurrent mounter can find the obd already
               * registered here while the first mounter is still busy
               * setting it up */
              obd = class_name2obd(lsi->lsi_osd_obdname);
              if (obd == NULL) {
                      rc = lustre_start_simple(lsi->lsi_osd_obdname,
                                               lsi->lsi_osd_type,
                                               lsi->lsi_osd_uuid, lmd->lmd_dev,
                                               flagstr, lsi->lsi_lmd->lmd_opts,
                                               lsi->lsi_svname);
                      if (rc)
                              GOTO(out, rc);
                      obd = class_name2obd(lsi->lsi_osd_obdname);
                      LASSERT(obd);
              }
      
              /* on the losing side of the race this fails with -ENODEV
               * ("Device N not setup") since the obd is not yet set up */
              rc = obd_connect(NULL, &lsi->lsi_osd_exp,
                               obd, &obd->obd_uuid, NULL, NULL);
              if (rc) {
                      obd->obd_force = 1;
                      class_manual_cleanup(obd);
                      lsi->lsi_dt_dev = NULL;
              }
      
              /* XXX: to keep support old components relying on lsi_srv_mnt
               *      we get this info from OSD just started */
              /* note: the failed-connect path above does not branch out, so
               * it falls through to this assertion, which is the LBUG seen
               * in the 2.4.2 crash */
              LASSERT(obd->obd_lu_dev);
              lsi->lsi_dt_dev = lu2dt_dev(obd->obd_lu_dev);
              LASSERT(lsi->lsi_dt_dev);
      
              dt_conf_get(NULL, lsi->lsi_dt_dev, &p);
      
              lsi->lsi_srv_mnt = p.ddp_mnt;
      
      out:
              RETURN(rc);
      }
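
      As an illustration only (this is a sketch, not the actual patch), one possible protection is to serialize the lookup-or-start step so that a second mounter only sees the obd once it is fully set up. Here osd_start_mutex is a hypothetical identifier, not one from the Lustre tree, and the connect-failure path would additionally need to GOTO(out, rc) instead of falling through to the LASSERT:

      /* hypothetical global serializing OSD start, for this sketch only */
      static DEFINE_MUTEX(osd_start_mutex);

              mutex_lock(&osd_start_mutex);
              obd = class_name2obd(lsi->lsi_osd_obdname);
              if (obd == NULL) {
                      rc = lustre_start_simple(lsi->lsi_osd_obdname,
                                               lsi->lsi_osd_type,
                                               lsi->lsi_osd_uuid,
                                               lmd->lmd_dev, flagstr,
                                               lsi->lsi_lmd->lmd_opts,
                                               lsi->lsi_svname);
                      if (rc) {
                              mutex_unlock(&osd_start_mutex);
                              GOTO(out, rc);
                      }
                      obd = class_name2obd(lsi->lsi_osd_obdname);
                      LASSERT(obd);
              }
              /* from here on the obd is fully set up, so obd_connect()
               * below can no longer hit the "not setup" window */
              mutex_unlock(&osd_start_mutex);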
      

      Additional data from the crash dumps is available if needed.

      Attachments: crash.lustre-2.4.2.txt

      Issue Links: LU-2134

      Activity


            bfaccini Bruno Faccini (Inactive) added a comment:

            My patch at http://review.whamcloud.com/11122 is an attempt to implement a simple protection mechanism based on Alex's suggestion. Thanks for the trick, Alex!

            bfaccini Bruno Faccini (Inactive) added a comment:

            Gregoire,
            Thanks for providing this latest information; it will save me time in implementing the fix for this problem. I will update the ticket with a patch proposal now.

            jfc John Fuchs-Chesney (Inactive) added a comment:

            Thank you for looking at this, Bruno.
            ~ jfc.

            pichong Gregoire Pichon added a comment:

            Attaching crash.lustre-2.4.2.txt, which contains crash dump output of the following commands: foreach bt, ps | grep mount, bt -f of the mount.lustre tasks, and dmesg.

            bfaccini Bruno Faccini (Inactive) added a comment:

            Hello Gregoire,
            Thanks for the report and for detailing how this can be reproduced; it looks like it could be done in-house on a test platform. Just in case, can you attach the "foreach bt" output if you have a crash dump available?

            bzzz Alex Zhuravlev added a comment:

            A regular filesystem opens its block device exclusively:

                bdev = open_bdev_exclusive(dev_name, mode, fs_type);
                if (IS_ERR(bdev))
                        return PTR_ERR(bdev);

            This is where the real competition takes place: the only winner goes on to do the actual initialization, etc. I guess we need something similar; I don't think we need strong race control inside the initialization itself.
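
            As an illustration of Alex's suggestion (a sketch only, not the actual patch), the exclusive open of the target device could decide the winner before any obd lookup/start. open_bdev_exclusive()/close_bdev_exclusive() are the RHEL6-era kernel APIs quoted above; the placement in the server mount path and the use of the per-mount lsi as the holder are assumptions of this sketch:

                /* sketch: claim the device exclusively before touching the
                 * OSD obd; with a per-mount holder, the loser of the race
                 * gets -EBUSY here instead of racing inside osd_start() */
                struct block_device *bdev;

                bdev = open_bdev_exclusive(lmd->lmd_dev,
                                           FMODE_READ | FMODE_WRITE, lsi);
                if (IS_ERR(bdev))
                        RETURN(PTR_ERR(bdev));

                /* ... start the OSD and mount the target ... */

                close_bdev_exclusive(bdev, FMODE_READ | FMODE_WRITE);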

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: pichong Gregoire Pichon