Lustre / LU-2237

Customer entered an incorrect parameter when enabling quotas; system is down.


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Affects Version/s: Lustre 2.1.1
    • Labels: None
    • Severity: 1
    • Rank: 5295

    Description

      Customer's system is down. This is a very high priority issue.

      Summary: The customer's file system was running fine, and he wanted to add quotas. He enabled them with an invalid option, and afterwards clients could not mount. We have tried to fix the problem but have not been able to.

      Here are the details.

      ==================== STEP 1 ====================

      Customer unmounted the file system and ran the following command on each Lustre target. Note the extra dash before the letter u (i.e. the command should have said 'ug2', but it said '-ug2'):

      tunefs.lustre --param mdt.quota_type=-ug2 /dev/mapper/mapXX # Has a dash

      Customer was able to mount the targets on the Lustre servers; however, clients could not mount.
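
      A quick way to confirm what a target actually recorded, without touching the disk, is tunefs.lustre's --dryrun mode (the device path below is a placeholder):

      tunefs.lustre --dryrun /dev/mapper/mapXX # only prints the stored values; nothing is written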

      ==================== STEP 2 ====================

      Once we noticed that the quota_type parameter had the extra dash, we tried to fix the situation by running the same command without the dash:

      tunefs.lustre --param mdt.quota_type=ug2 /dev/mapper/mapXX # No dash

      tunefs.lustre then showed that BOTH parameters were present; for example, on an OST we saw:

      checking for existing Lustre data: found CONFIGS/mountdata
      Reading CONFIGS/mountdata

      Read previous values:
      Target: xxxxxx-OST0000
      Index: 0
      Lustre FS: xxxxxx
      Mount type: ldiskfs
      Flags: 0x42
      (OST update )
      Persistent mount opts: errors=remount-ro,extents,mballoc
      Parameters: failover.node=xx.yy.zz.244@tcp mgsnode=xx.yy.zz.241@tcp mgsnode=xx.yy.zz.242@tcp mdt.quota_type=-ug2 mdt.quota_type=ug2

      And with tunefs.lustre on the MDT/MGS, we saw:

      checking for existing Lustre data: found CONFIGS/mountdata
      Reading CONFIGS/mountdata

      Read previous values:
      Target: xxxxxx-MDT0000
      Index: 0
      Lustre FS: xxxxxx
      Mount type: ldiskfs
      Flags: 0x445
      (MDT MGS update )
      Persistent mount opts: user_xattr,errors=remount-ro
      Parameters: mgsnode=xx.yy.zz.241@tcp failover.node=xx.yy.zz.242@tcp mdt.quota_type=-ug2 mdt.quota_type=ug2

      Again, we could mount all the Lustre servers, but clients would not mount.
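
      This is consistent with --param appending a new setting rather than replacing the existing one, so both values stayed in the mount data. A minimal sketch of setting only the corrected value on one OST (parameter spellings and addresses mirror the output above; the device path is a placeholder, and a writeconf as described in the later steps would still be needed to regenerate the configuration logs) would be to erase the old parameters and re-specify everything still wanted in one command:

      tunefs.lustre --erase-param --param mdt.quota_type=ug2 --mgsnode=xx.yy.zz.241@tcp --mgsnode=xx.yy.zz.242@tcp --failnode=xx.yy.zz.244@tcp /dev/mapper/mapXX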

      ==================== STEP 3 ====================

      Next, we thought that we could simply remove those parameters. So, for example, from the 2nd MDS (in an active-standby pair), the customer ran:

      tunefs.lustre --erase-param --mgsnode=xx.yy.zz.242@tcp0 --failnode=xx.yy.zz.241@tcp0 --writeconf --fsname=xxxxxx /dev/mapper/map00

      The command and its output can be seen in the file first-try-tunefs.lustre.log.2, but I am also including it here because it shows that the mdt.quota_type parameters now appear to be gone:

      checking for existing Lustre data: found CONFIGS/mountdata
      Reading CONFIGS/mountdata

      Read previous values:
      Target: xxxxxx-MDT0000
      Index: 0
      Lustre FS: xxxxxx
      Mount type: ldiskfs
      Flags: 0x405
      (MDT MGS )
      Persistent mount opts: user_xattr,errors=remount-ro
      Parameters: mgsnode=xx.yy.zz.241@tcp failover.node=xx.yy.zz.242@tcp mdt.quota_type=-ug2 mdt.quota_type=ug2

      Permanent disk data:
      Target: xxxxxx-MDT0000
      Index: 0
      Lustre FS: xxxxxx
      Mount type: ldiskfs
      Flags: 0x545
      (MDT MGS update writeconf )
      Persistent mount opts: user_xattr,errors=remount-ro
      Parameters: mgsnode=xx.yy.zz.242@tcp failover.node=xx.yy.zz.241@tcp

      Writing CONFIGS/mountdata
      RC=0
      ALLRC=0

      We ran similar commands on each OST, e.g.

      tunefs.lustre --erase-param --failnode=xx.yy.zz.244@tcp0 --mgsnode=xx.yy.zz.242@tcp0 --mgsnode=xx.yy.zz.241@tcp0 --writeconf --fsname=xxxxxx /dev/mapper/map00

      The output from these commands can be found in first-try-tunefs.lustre.log.3 and first-try-tunefs.lustre.log.4 (3 & 4 are the two OSSes; each OSS has 6 targets).

      At this point, the customer was NOT ABLE to mount the MDT/MGS.
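
      For reference, the full writeconf regeneration sequence we were trying to follow is, roughly (per the Lustre manual; the mount points and device paths below are placeholders, and the OST commands are repeated for every target on each OSS):

      # with all clients and all targets unmounted:
      tunefs.lustre --writeconf /dev/mapper/map00   # on the MDS, for the MDT/MGS
      tunefs.lustre --writeconf /dev/mapper/mapXX   # on each OSS, for every OST
      mount -t lustre /dev/mapper/map00 /mnt/mdt    # remount the MDT/MGS first
      mount -t lustre /dev/mapper/mapXX /mnt/ostXX  # then the OSTs
      # then the clients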

      ==================== STEP 4 ====================

      Next, we decided to run the command from the first MDS, since that is where we normally format from:

      tunefs.lustre --erase-param --mgsnode=xx.yy.zz.241@tcp0 --failnode=xx.yy.zz.242@tcp0 --writeconf --fsname=xxxxxx /dev/mapper/map00

      The output of this command was:

      checking for existing Lustre data: found CONFIGS/mountdata
      Reading CONFIGS/mountdata

      Read previous values:
      Target: xxxxxx-MDT0000
      Index: 0
      Lustre FS: xxxxxx
      Mount type: ldiskfs
      Flags: 0x405
      (MDT MGS )
      Persistent mount opts: user_xattr,errors=remount-ro
      Parameters: mgsnode=xx.yy.zz.242@tcp failover.node=xx.yy.zz.241@tcp

      Permanent disk data:
      Target: xxxxxx-MDT0000
      Index: 0
      Lustre FS: xxxxxx
      Mount type: ldiskfs
      Flags: 0x545
      (MDT MGS update writeconf )
      Persistent mount opts: user_xattr,errors=remount-ro
      Parameters: mgsnode=xx.yy.zz.241@tcp failover.node=xx.yy.zz.242@tcp

      Writing CONFIGS/mountdata
      RC=0

      We ran similar commands on each OST, e.g.

      tunefs.lustre --erase-param --failnode=xx.yy.zz.244@tcp0 --mgsnode=xx.yy.zz.241@tcp0 --mgsnode=xx.yy.zz.242@tcp0 --writeconf --fsname=xxxxxx /dev/mapper/map00

      The output from these commands can be found in second-try-tunefs.lustre.log.3 and second-try-tunefs.lustre.log.4.

      At this point, the customer was STILL NOT ABLE to mount the MDT/MGS.
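
      One way to inspect what the regenerated configuration on the MGS device now contains (read-only, with the target unmounted; the device path is a placeholder) is to list the CONFIGS directory on the ldiskfs filesystem with debugfs:

      debugfs -c -R 'ls -l /CONFIGS' /dev/mapper/map00 # -c opens the device in read-only (catastrophic) mode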

      ==================== STEP 5 ====================

      The customer thought that the problem might be with MMP (multiple mount protection), so he disabled and then re-enabled the feature:

      tune2fs -O ^mmp /dev/mapper/map00
      tune2fs -O mmp /dev/mapper/map00

      At this point, the customer was STILL NOT ABLE to mount the MDT/MGS.
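
      A quick check of whether MMP is actually in play (and whether the feature round-trip took effect) is to read the superblock; the device path is a placeholder:

      dumpe2fs -h /dev/mapper/map00 | grep -i mmp # shows "mmp" in the feature list plus the MMP block/interval when enabled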

      ==================== LIST OF ATTACHED FILES ====================

      messages.01.gz: gzipped /var/log/messages file from first MDS
      messages.02: /var/log/messages file from second MDS
      messages.03: /var/log/messages file from first OSS
      messages.04: /var/log/messages file from second OSS

      first-try-tunefs.lustre.log.2: tunefs command run from second MDS
      first-try-tunefs.lustre.log.3: tunefs command run from first OSS
      first-try-tunefs.lustre.log.4: tunefs command run from standby OSS

      second-try-tunefs.lustre.log.1: tunefs command run from first MDS
      second-try-tunefs.lustre.log.3: tunefs command run from first OSS
      second-try-tunefs.lustre.log.4: tunefs command run from standby OSS

      ==================== LOGs from MDS ====================

      The log messages in /var/log/messages on the first MDS are:

      Oct 24 20:25:09 ts-xxxxxxxx-01 kernel: LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts:
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts:
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: MGS MGS started
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: 6030:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGCxx.yy.zz.241@tcp->MGCxx.yy.zz.241@tcp_1 netid 90000: select flavor null
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: 6030:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 1 previous similar message
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: 6061:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from 258ee8c7-213d-5dc4-fee0-c9990e7b9461@0@lo t0 exp (null) cur 1351110316 last 0
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: MGCxx.yy.zz.241@tcp: Reactivating import
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: Enabling ACL
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(osd_handler.c:336:osd_iget()) no inode
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(md_local_object.c:433:llo_local_objects_setup()) creating obj [last_rcvd] fid = [0x200000001:0xb:0x0] rc = -13
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(mdt_handler.c:4577:mdt_init0()) Can't init device stack, rc -13
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(obd_config.c:522:class_setup()) setup xxxxxx-MDT0000 failed (-13)
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(obd_config.c:1361:class_config_llog_handler()) Err -13 on cfg command:
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: cmd=cf003 0:xxxxxx-MDT0000 1:xxxxxx-MDT0000_UUID 2:0 3:xxxxxx-MDT0000-mdtlov 4:f
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 15c-8: MGCxx.yy.zz.241@tcp: The configuration from log 'xxxxxx-MDT0000' failed (-13). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(obd_mount.c:1192:server_start_targets()) failed to start server xxxxxx-MDT0000: -13
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(obd_mount.c:1723:server_fill_super()) Unable to start targets: -13
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(obd_config.c:567:class_cleanup()) Device 3 not setup
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: MGS has stopped.
      Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
      Oct 24 20:25:22 ts-xxxxxxxx-01 kernel: Lustre: 6030:0:(client.c:1779:ptlrpc_expire_one_request()) @@@ Request x1416737969930383 sent from MGCxx.yy.zz.241@tcp to NID 0@lo has timed out for slow reply: [sent 1351110316] [real_sent 1351110316] [current 1351110322] [deadline 6s] [delay 0s] req@ffff88060e66ec00 x1416737969930383/t0(0) o-1->MGS@MGCxx.yy.zz.241@tcp_1:26/25 lens 192/192 e 0 to 1 dl 1351110322 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
      Oct 24 20:25:22 ts-xxxxxxxx-01 kernel: Lustre: server umount xxxxxx-MDT0000 complete
      Oct 24 20:25:22 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-13)
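
      The failures above all return rc -13, i.e. -EACCES (Permission denied), hit while the MDT stack tries to create the last_rcvd object. A quick way to translate such codes on any node with a Python interpreter available:

      python -c 'import os; print(os.strerror(13))' # prints: Permission denied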

      Attachments

        1. first-try-tunefs.lustre.log.2 (0.8 kB)
        2. first-try-tunefs.lustre.log.3 (5 kB)
        3. first-try-tunefs.lustre.log.4 (5 kB)
        4. messages.01.gz (1.37 MB)
        5. messages.02 (556 kB)
        6. messages.03 (775 kB)
        7. messages.04 (798 kB)
        8. second-try-tunefs.lustre.log.1 (0.8 kB)
        9. second-try-tunefs.lustre.log.3 (5 kB)
        10. second-try-tunefs.lustre.log.4 (5 kB)
        11. messages.01.after.updating.last_rcvd (21 kB)
        12. mount.debug.after.updating.last_rcvd (4.16 MB)

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: rspellman Roger Spellman (Inactive)
