Lustre / LU-5299

osd_start() LBUG when doing parallel mount of the same target

Details


    Description

      One of our customers hit a crash when mounting OST targets. Here are the system log and backtrace.

      LustreError: 5048:0:(obd_class.h:1008:obd_connect()) Device 35 not setup
      LustreError: 5048:0:(obd_config.c:619:class_cleanup()) Device 35 not setup
      LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) ASSERTION( obd->obd_lu_dev ) failed:
      LustreError: 5048:0:(obd_mount_server.c:1635:osd_start()) LBUG
      
      crash> bt
      PID: 5048 TASK: ffff88085be42b00 CPU: 7 COMMAND: "mount.lustre"
       #0 [ffff8807c8189ac0] machine_kexec at ffffffff81035f1b
       #1 [ffff8807c8189b20] crash_kexec at ffffffff810c0f12
       #2 [ffff8807c8189bf0] panic at ffffffff8150e884
       #3 [ffff8807c8189c70] lbug_with_loc at ffffffffa047deeb [libcfs]
       #4 [ffff8807c8189c90] server_fill_super at ffffffffa0717673 [obdclass]
       #5 [ffff8807c8189d70] lustre_fill_super at ffffffffa06e76f8 [obdclass]
       #6 [ffff8807c8189da0] get_sb_nodev at ffffffff81184bcf
       #7 [ffff8807c8189de0] lustre_get_sb at ffffffffa06df175 [obdclass]
       #8 [ffff8807c8189e00] vfs_kern_mount at ffffffff8118420b
       #9 [ffff8807c8189e50] do_kern_mount at ffffffff811843b2
      #10 [ffff8807c8189ea0] do_mount at ffffffff811a4572
      #11 [ffff8807c8189f20] sys_mount at ffffffff811a4c00
      #12 [ffff8807c8189f80] system_call_fastpath at ffffffff8100b072
          RIP: 00000030ce4e909a RSP: 00007fff59ccbb70 RFLAGS: 00010202
          RAX: 00000000000000a5 RBX: ffffffff8100b072 RCX: 00002b58cacf1006
          RDX: 0000000000408bbf RSI: 00007fff59cd4f28 RDI: 00000000012a3170
          RBP: 0000000000000000 R8: 00000000012a3190 R9: 0000000000000000
          R10: 0000000001000000 R11: 0000000000000206 R12: 000000000060db18
          R13: 000000000060db10 R14: 00000000012a3190 R15: 0000000000000000
          ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
      

      Lustre release is lustre-2.4.2.
      The issue occurred on several OSSs, after a first centralized mount of the OSTs was interrupted (shine start ... Ctrl-C) and relaunched.

      I have been able to reproduce the issue on lustre-2.4.2 by launching mounts of the same OST target in parallel.

      # mntpoint=/mnt/fs1/ost/4
      # mntdev=/dev/mapper/mpathe
      # umount $mntpoint
      # for i in $(seq 1 10); do dd if=$mntdev of=/dev/null bs=4k count=1M & done ;
      # /bin/mount -t lustre -o errors=panic $mntdev $mntpoint & \
      /bin/mount -t lustre -o errors=panic $mntdev $mntpoint
      

      Launching read IOs on the device in the background may help reproduce the issue, since ldiskfs_get_sb() calls set_blocksize() (via get_sb_bdev() and sb_set_blocksize()), which serializes with IOs on the device.

         ...
          /* Prevent starting I/O or mapping the device */
          percpu_down_write(&bdev->bd_block_size_semaphore);
      

      This probably opens a larger window for a potential race between the two mount operations.

      This ticket is certainly related to LU-2134, which is currently unresolved.

      I have reproduced the issue on Lustre 2.5.60 with a slightly different assertion.

      <3>[264188.026108] LustreError: 22665:0:(obd_class.h:954:obd_connect()) Device 1 not setup
      <3>[264188.033999] LustreError: 22665:0:(obd_config.c:615:class_cleanup()) Device 1 not setup
      <3>[264188.042166] LustreError: 22665:0:(obd_mount_server.c:1735:server_fill_super()) Unable to start osd on /dev/ram0: -19
      <3>[264188.052925] LustreError: 22665:0:(obd_mount.c:1335:lustre_fill_super()) Unable to mount  (-19)
      <6>[264189.997203] LDISKFS-fs (ram0): mounted filesystem with ordered data mode. quota=on. Opts: 
      <0>[264190.065537] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) ASSERTION( obd ) failed: 
      <0>[264190.074554] LustreError: 22664:0:(obd_mount_server.c:1690:osd_start()) LBUG
      <4>[264190.081682] Pid: 22664, comm: mount.lustre
      <4>[264190.085933] 
      <4>[264190.085935] Call Trace:
      <4>[264190.090213]  [<ffffffffa04fd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4>[264190.097358]  [<ffffffffa04fde97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4>[264190.103765]  [<ffffffffa066e522>] server_fill_super+0x1082/0x1620 [obdclass]
      <4>[264190.111014]  [<ffffffffa0605a6e>] ? obd_zombie_is_idle+0x1e/0x70 [obdclass]
      <4>[264190.118183]  [<ffffffffa0639a68>] lustre_fill_super+0x1d8/0x530 [obdclass]
      <4>[264190.125268]  [<ffffffffa0639890>] ? lustre_fill_super+0x0/0x530 [obdclass]
      <4>[264190.132326]  [<ffffffff8118c7bf>] get_sb_nodev+0x5f/0xa0
      <4>[264190.137847]  [<ffffffffa06317b5>] lustre_get_sb+0x25/0x30 [obdclass]
      <4>[264190.144370]  [<ffffffff8118be1b>] vfs_kern_mount+0x7b/0x1b0
      <4>[264190.150108]  [<ffffffff8118bfc2>] do_kern_mount+0x52/0x130
      <4>[264190.155751]  [<ffffffff811acf9b>] do_mount+0x2fb/0x930
      <4>[264190.162326]  [<ffffffff811ad660>] sys_mount+0x90/0xe0
      <4>[264190.167536]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      <4>[264190.173699] 
      

      In both cases, there is a lack of synchronization in osd_start() when the osd obd is retrieved from class_name2obd(): the obd device may still be in its initialization phase when a second mount.lustre process reaches this portion of code.

      static int osd_start(struct lustre_sb_info *lsi, unsigned long mflags)
      {
              struct lustre_mount_data *lmd = lsi->lsi_lmd;
              struct obd_device        *obd;
              struct dt_device_param    p;
              char                      flagstr[16];
              int                       rc;
              ENTRY;
      
              CDEBUG(D_MOUNT,
                     "Attempting to start %s, type=%s, lsifl=%x, mountfl=%lx\n",
                     lsi->lsi_svname, lsi->lsi_osd_type, lsi->lsi_flags, mflags);
      
              sprintf(lsi->lsi_osd_obdname, "%s-osd", lsi->lsi_svname);
              strcpy(lsi->lsi_osd_uuid, lsi->lsi_osd_obdname);
              strcat(lsi->lsi_osd_uuid, "_UUID");
              sprintf(flagstr, "%lu:%lu", mflags, (unsigned long) lmd->lmd_flags);
      
              obd = class_name2obd(lsi->lsi_osd_obdname);
              if (obd == NULL) {
                      rc = lustre_start_simple(lsi->lsi_osd_obdname,
                                               lsi->lsi_osd_type,
                                               lsi->lsi_osd_uuid, lmd->lmd_dev,
                                               flagstr, lsi->lsi_lmd->lmd_opts,
                                               lsi->lsi_svname);
                      if (rc)
                              GOTO(out, rc);
                      obd = class_name2obd(lsi->lsi_osd_obdname);
                      LASSERT(obd);
              }
      
              rc = obd_connect(NULL, &lsi->lsi_osd_exp,
                               obd, &obd->obd_uuid, NULL, NULL);
              if (rc) {
                      obd->obd_force = 1;
                      class_manual_cleanup(obd);
                      lsi->lsi_dt_dev = NULL;
              }
      
              /* XXX: to keep support old components relying on lsi_srv_mnt
               *      we get this info from OSD just started */
              LASSERT(obd->obd_lu_dev);
              lsi->lsi_dt_dev = lu2dt_dev(obd->obd_lu_dev);
              LASSERT(lsi->lsi_dt_dev);
      
              dt_conf_get(NULL, lsi->lsi_dt_dev, &p);
      
              lsi->lsi_srv_mnt = p.ddp_mnt;
      
      out:
              RETURN(rc);
      }
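
      To make the race concrete, here is a minimal sketch of how the existing-obd path in osd_start() could be guarded so that a concurrent mount is refused instead of reaching the LASSERT(). This is only an illustration, not the actual http://review.whamcloud.com/11139 change: returning -EALREADY mirrors the behaviour later tested in the comments (mount.lustre exits with 114), but the exact check and error handling in the landed patch may differ, and the nosvc/nomgs case discussed in the comments (a co-located MGT/MDT legally mounting the same device twice) needs special handling.

              obd = class_name2obd(lsi->lsi_osd_obdname);
              if (obd == NULL) {
                      rc = lustre_start_simple(lsi->lsi_osd_obdname,
                                               lsi->lsi_osd_type,
                                               lsi->lsi_osd_uuid, lmd->lmd_dev,
                                               flagstr, lsi->lsi_lmd->lmd_opts,
                                               lsi->lsi_svname);
                      if (rc)
                              GOTO(out, rc);
                      obd = class_name2obd(lsi->lsi_osd_obdname);
                      LASSERT(obd);
              } else {
                      /* The obd was created by a concurrent mount.lustre and
                       * may still be in its setup phase, so obd->obd_lu_dev
                       * cannot be trusted here.  Refuse this mount instead of
                       * racing with the one that created the obd. */
                      CDEBUG(D_MOUNT, "%s already started\n",
                             lsi->lsi_osd_obdname);
                      GOTO(out, rc = -EALREADY);
              }

              rc = obd_connect(NULL, &lsi->lsi_osd_exp,
                               obd, &obd->obd_uuid, NULL, NULL);
              if (rc) {
                      obd->obd_force = 1;
                      class_manual_cleanup(obd);
                      lsi->lsi_dt_dev = NULL;
                      /* return here instead of falling through to the
                       * LASSERT(obd->obd_lu_dev) below (the missing RETURN()
                       * mentioned in the comments, added by LU-2059) */
                      GOTO(out, rc);
              }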
      

      Additional data from the crash dumps is available if needed.

      Attachments

        Issue Links

          Activity

            [LU-5299] osd_start() LBUG when doing parallel mount of the same target
            pjones Peter Jones added a comment -

            Landed for 2.7. We will track landing for a future 2.5.x release separately.


            bfaccini Bruno Faccini (Inactive) added a comment -

            Just pushed patch-set #8 of http://review.whamcloud.com/11139, now including a specific auto-test.

            pichong Gregoire Pichon added a comment -

            Tests of patch-set #7 are good, and the error message is nicer.

            # PATH=/usr/lib/lustre:${PATH} /bin/mount -t lustre -o errors=panic $mntdev $mntpoint & \
            PATH=/usr/lib/lustre:${PATH} /bin/mount -t lustre -o errors=panic $mntdev $mntpoint
            [1] 26427
            mount.lustre: mount /dev/mapper/mpathe at /mnt/fs1/ost/4 failed: Operation already in progress
            The target service is already running. (/dev/mapper/mpathe)
            [1]+  Exit 114                PATH=/usr/lib/lustre:${PATH} /bin/mount -t lustre -o errors=panic $mntdev $mntpoint
            

            And here are the Lustre logs with the "super" debug component enabled.

            00000020:01200004:6.0F:1406718500.823979:0:26430:0:(obd_mount.c:1260:lustre_fill_super()) VFS Op: sb ffff8805f1393800
            00000020:01000004:6.0:1406718500.824002:0:26430:0:(obd_mount.c:832:lmd_print())   mount data:
            00000020:01000004:6.0:1406718500.824003:0:26430:0:(obd_mount.c:835:lmd_print()) device:  /dev/mapper/mpathe
            00000020:01000004:6.0:1406718500.824003:0:26430:0:(obd_mount.c:836:lmd_print()) flags:   0
            00000020:01000004:6.0:1406718500.824004:0:26430:0:(obd_mount.c:839:lmd_print()) options: errors=panic,errors=remount-ro
            00000020:01000004:6.0:1406718500.824005:0:26430:0:(obd_mount.c:1305:lustre_fill_super()) Mounting server from /dev/mapper/mpathe
            00000020:01000004:6.0:1406718500.824006:0:26430:0:(obd_mount_server.c:1622:osd_start()) Attempting to start fs1-OST0004, type=osd-ldiskfs, lsifl=200002, mountfl=0
            00000020:01000004:6.0:1406718500.824045:0:26430:0:(obd_mount.c:192:lustre_start_simple()) Starting obd fs1-OST0004-osd (typ=osd-ldiskfs)
            00000020:01200004:5.0F:1406718500.824138:0:26429:0:(obd_mount.c:1260:lustre_fill_super()) VFS Op: sb ffff880618631000
            00000020:01000004:5.0:1406718500.824157:0:26429:0:(obd_mount.c:832:lmd_print())   mount data:
            00000020:01000004:5.0:1406718500.824158:0:26429:0:(obd_mount.c:835:lmd_print()) device:  /dev/mapper/mpathe
            00000020:01000004:5.0:1406718500.824158:0:26429:0:(obd_mount.c:836:lmd_print()) flags:   0
            00000020:01000004:5.0:1406718500.824159:0:26429:0:(obd_mount.c:839:lmd_print()) options: errors=panic,errors=remount-ro
            00000020:01000004:5.0:1406718500.824159:0:26429:0:(obd_mount.c:1305:lustre_fill_super()) Mounting server from /dev/mapper/mpathe
            00000020:01000004:5.0:1406718500.824160:0:26429:0:(obd_mount_server.c:1622:osd_start()) Attempting to start fs1-OST0004, type=osd-ldiskfs, lsifl=200002, mountfl=0
            00000020:01000004:5.0:1406718500.824162:0:26429:0:(obd_mount_server.c:1641:osd_start()) fs1-OST0004-osd already started
            00000020:00020000:5.0:1406718500.824163:0:26429:0:(obd_class.h:1008:obd_connect()) Device 0 not setup
            00000020:00020000:5.0:1406718500.830701:0:26429:0:(obd_mount_server.c:1698:server_fill_super()) Unable to start osd on /dev/mapper/mpathe: -114
            00000020:01000004:5.0:1406718500.840937:0:26429:0:(obd_mount.c:654:lustre_put_lsi()) put ffff880618631000 1
            00000020:01000004:5.0:1406718500.840938:0:26429:0:(obd_mount.c:604:lustre_free_lsi()) Freeing lsi ffff88060e5ef800
            00000020:00020000:5.0:1406718500.840940:0:26429:0:(obd_mount.c:1324:lustre_fill_super()) Unable to mount  (-114)
            00000020:01000004:4.0F:1406718500.970679:0:26430:0:(obd_mount_server.c:1704:server_fill_super()) Found service fs1-OST0004 on device /dev/mapper/mpathe
            ...
            

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Gregoire,
            Thanks for exposing my patch to your reproducer and providing me with the results/traces.
            This allowed me to find a small problem in the current patch-set #6, and accordingly I pushed a new patch-set #7, because I think it can be enhanced to better return -EALREADY in this situation.
            If this new patch-set passes auto-tests successfully, can you also rerun your reproducer against it? Thanks again, and thanks in advance for your help on this.


            pichong Gregoire Pichon added a comment -

            Bruno,

            I have tested patch #11139 (patch-set #6) on the Lustre 2.4.3 Bull branch.
            It prevents the crash with the reproducer test case.

            One of the two mount commands fails with the following output message:

            mount.lustre: mount /dev/mapper/mpathe at /mnt/fs1/ost/4 failed: No such device
            Are the lustre modules loaded?
            Check /etc/modprobe.conf and /proc/filesystems
            

            Here are the Lustre logs with the D_SUPER ("super") debug component enabled.

            00000020:01200004:6.0F:1406539153.106223:0:30249:0:(obd_mount.c:1260:lustre_fill_super()) VFS Op: sb ffff88063788a000
            00000020:01000004:6.0:1406539153.106249:0:30249:0:(obd_mount.c:832:lmd_print()) mount data:
            00000020:01000004:6.0:1406539153.106250:0:30249:0:(obd_mount.c:835:lmd_print()) device: /dev/mapper/mpathe
            00000020:01000004:6.0:1406539153.106251:0:30249:0:(obd_mount.c:836:lmd_print()) flags: 0
            00000020:01000004:6.0:1406539153.106251:0:30249:0:(obd_mount.c:839:lmd_print()) options: errors=panic,errors=remount-ro
            00000020:01000004:6.0:1406539153.106252:0:30249:0:(obd_mount.c:1305:lustre_fill_super()) Mounting server from /dev/mapper/mpathe
            00000020:01000004:6.0:1406539153.106255:0:30249:0:(obd_mount_server.c:1622:osd_start()) Attempting to start fs1-OST0004, type=osd-ldiskfs, lsifl=200002, mountfl=0
            00000020:01000004:6.0:1406539153.106299:0:30249:0:(obd_mount.c:192:lustre_start_simple()) Starting obd fs1-OST0004-osd (typ=osd-ldiskfs)
            00000020:01200004:4.0F:1406539153.106404:0:30248:0:(obd_mount.c:1260:lustre_fill_super()) VFS Op: sb ffff880632ce7800
            00000020:01000004:4.0:1406539153.106423:0:30248:0:(obd_mount.c:832:lmd_print()) mount data:
            00000020:01000004:4.0:1406539153.106423:0:30248:0:(obd_mount.c:835:lmd_print()) device: /dev/mapper/mpathe
            00000020:01000004:4.0:1406539153.106424:0:30248:0:(obd_mount.c:836:lmd_print()) flags: 0
            00000020:01000004:4.0:1406539153.106425:0:30248:0:(obd_mount.c:839:lmd_print()) options: errors=panic,errors=remount-ro
            00000020:01000004:4.0:1406539153.106425:0:30248:0:(obd_mount.c:1305:lustre_fill_super()) Mounting server from /dev/mapper/mpathe
            00000020:01000004:4.0:1406539153.106426:0:30248:0:(obd_mount_server.c:1622:osd_start()) Attempting to start fs1-OST0004, type=osd-ldiskfs, lsifl=200002, mountfl=0
            00000020:01000004:4.0:1406539153.106428:0:30248:0:(obd_mount_server.c:1641:osd_start()) fs1-OST0004-osd already started
            00000020:00020000:4.0:1406539153.106429:0:30248:0:(obd_class.h:1008:obd_connect()) Device 0 not setup
            00000020:00020000:4.0:1406539153.112948:0:30248:0:(obd_mount_server.c:1698:server_fill_super()) Unable to start osd on /dev/mapper/mpathe: -19
            00000020:01000004:4.0:1406539153.123109:0:30248:0:(obd_mount.c:654:lustre_put_lsi()) put ffff880632ce7800 1
            00000020:01000004:4.0:1406539153.123110:0:30248:0:(obd_mount.c:604:lustre_free_lsi()) Freeing lsi ffff880632980000
            00000020:00020000:4.0:1406539153.123113:0:30248:0:(obd_mount.c:1324:lustre_fill_super()) Unable to mount (-19)
            00000020:01000004:1.0F:1406539158.823188:0:30249:0:(obd_mount_server.c:1704:server_fill_super()) Found service fs1-OST0004 on device /dev/mapper/mpathe
            00000020:01000004:1.0:1406539158.823267:0:30249:0:(obd_mount.c:333:lustre_start_mgc()) Start MGC 'MGC10.3.1.5@o2ib'
            00000020:01000004:1.0:1406539158.823268:0:30249:0:(obd_mount.c:340:lustre_start_mgc()) mgs nids 10.3.1.5@o2ib:10.4.1.5@o2ib1:10.3.1.6@o2ib:10.4.1.6@o2ib1.
            00000020:01000004:1.0:1406539158.823286:0:30249:0:(obd_mount.c:192:lustre_start_simple()) Starting obd MGC10.3.1.5@o2ib (typ=mgc)
            00000020:01000004:3.0F:1406539158.824801:0:30249:0:(obd_mount_server.c:1210:server_start_targets()) starting target fs1-OST0004
            00000020:01000004:3.0:1406539158.824838:0:30249:0:(obd_mount.c:192:lustre_start_simple()) Starting obd OSS (typ=ost)
            00000020:01000004:2.0F:1406539158.830802:0:30249:0:(obd_mount_server.c:307:server_mgc_set_fs()) Set mgc disk for /dev/mapper/mpathe
            00000020:01000004:0.0F:1406539158.882797:0:30249:0:(obd_mount_server.c:1115:server_register_target()) Registration fs1-OST0004, fs=fs1, 10.3.1.10@o2ib, index=0004, flags=0x2
            00000020:01000004:1.0:1406539158.883033:0:30249:0:(obd_mount_server.c:120:server_register_mount()) reg_mnt ffff880639d1f9c0 from fs1-OST0004
            00000020:01000004:5.0F:1406539158.884893:0:30311:0:(obd_mount.c:870:lustre_check_exclusion()) Check exclusion fs1-OST0004 (4) in 0 of /dev/mapper/mpathe
            00000020:01000004:5.0:1406539158.885041:0:30311:0:(obd_mount_server.c:170:server_get_mount()) get_mnt ffff880639d1f9c0 from fs1-OST0004, refs=2
            00000001:00000004:4.0:1406539158.887323:0:30311:0:(tgt_lastrcvd.c:253:tgt_server_data_update()) fs1-OST0004_UUID: mount_count is 44, last_transno is 184683593728
            00000020:01000004:5.0:1406539158.890256:0:30312:0:(obd_mount.c:192:lustre_start_simple()) Starting obd fs1-MDT0000-lwp-OST0004 (typ=lwp)
            00000020:02000000:3.0:1406539158.891493:0:30249:0:(obd_mount_server.c:1808:server_calc_timeout()) fs1-OST0004: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
            00000020:01000004:2.0:1406539158.901448:0:30249:0:(obd_mount_server.c:324:server_mgc_clear_fs()) Unassign mgc disk
            00000020:01000004:2.0:1406539158.901510:0:30249:0:(obd_mount_server.c:1581:server_fill_super_common()) Server sb, dev=23
            00000020:00000004:2.0:1406539158.901513:0:30249:0:(obd_mount.c:1327:lustre_fill_super()) Mount /dev/mapper/mpathe complete
            00010000:02000400:5.0:1406539164.331731:0:30295:0:(ldlm_lib.c:1566:target_start_recovery_timer()) fs1-OST0004: Will be in recovery for at least 2:30, or until 3 clients reconnect
            

            So I think the patch is good and fixes the initial issue. Thanks.
            We are now waiting for positive code reviews.

            Note that the Lustre 2.4.3 Bull branch now includes the RETURN() call at the end of the obd_connect() error handling in the osd_start() routine. It comes from the patch for LU-2059, which was integrated into our repository several weeks ago.


            bfaccini Bruno Faccini (Inactive) added a comment -

            The http://review.whamcloud.com/11139 master patch passed auto-tests successfully. I will now expose it to Gregoire's reproducer.

            Will abandon http://review.whamcloud.com/11122 ...


            bfaccini Bruno Faccini (Inactive) added a comment -

            Hmm, too bad: with http://review.whamcloud.com/11139, conf-sanity/test_41a fails because after a first MDT mount with the nosvc option, the second mount with the nomgs option now fails with -EALREADY, as expected with the patch!

            Do I understand correctly that these options are only intended to start a co-located MGT and MDT separately on the server/MDS side, and that this is the only case of a legal second mount/start attempt for the same device on a server?

            I will try to fix this special case with patch-set #2 of http://review.whamcloud.com/11139.
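
            For reference, the legal double-mount sequence in question looks roughly like this; the device path and mount points below are made up for illustration, and the exact options used by conf-sanity/test_41a may differ slightly:

            # 1st mount of the combined MGT/MDT device: start only the MGS
            mount -t lustre -o nosvc /dev/sdb /mnt/mgs
            # 2nd mount of the same device: start the MDT without starting the MGS again
            mount -t lustre -o nomgs /dev/sdb /mnt/mdt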


            People

              bfaccini Bruno Faccini (Inactive)
              pichong Gregoire Pichon
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: