Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version: Lustre 2.1.1
Environment:
Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4.
Lustre clients are running 2.6.38.2, with special code created for this release, including http://review.whamcloud.com/#change,2170 (patch 8).
Description
Customer's system is down. This is a very high priority issue.
Summary: The customer's file system was running fine until he decided to add quotas. He enabled them with an invalid option, and afterward clients could no longer mount. We have tried to fix the problem and have not been able to.
Here are the details.
==================== STEP 1 ====================
Customer unmounted the file system, and ran the following command on each Lustre target. Note the extra dash before the letter u (i.e. the value should have been 'ug2', but it was '-ug2'):
- tunefs.lustre --param mdt.quota_type=-ug2 /dev/mapper/mapXX # Has a dash
Customer was able to mount the targets on the Lustre servers. However, he could not mount the file system from any client.
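For reference, the client mount attempt looks roughly like the following (the mount point /mnt/xxxxxx is a placeholder; the NIDs are the MGS pair from this report):
- mount -t lustre xx.yy.zz.241@tcp0:xx.yy.zz.242@tcp0:/xxxxxx /mnt/xxxxxx # Client mount against the MGS failover pair
- dmesg | tail # Client-side Lustre errors appear here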
==================== STEP 2 ====================
Once we noticed that quota_type had the extra dash, we tried to fix the situation by re-running the command without the dash. The command we ran was:
- tunefs.lustre --param mdt.quota_type=ug2 /dev/mapper/mapXX # No dash
tunefs.lustre then showed that BOTH parameters were present, i.e. on an OST we saw:
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: xxxxxx-OST0000
Index: 0
Lustre FS: xxxxxx
Mount type: ldiskfs
Flags: 0x42
(OST update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=xx.yy.zz.244@tcp mgsnode=xx.yy.zz.241@tcp mgsnode=xx.yy.zz.242@tcp mdt.quota_type=-ug2 mdt.quota_type=ug2
And with tunefs.lustre on the MDT/MGS, we saw:
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: xxxxxx-MDT0000
Index: 0
Lustre FS: xxxxxx
Mount type: ldiskfs
Flags: 0x445
(MDT MGS update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=xx.yy.zz.241@tcp failover.node=xx.yy.zz.242@tcp mdt.quota_type=-ug2 mdt.quota_type=ug2
Again, we could mount all the Lustre servers, but clients would not mount.
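As far as we understand it, --param appends to the stored parameter list rather than replacing a matching key, which is why both quota_type values appear above; clearing them requires erasing the parameter list and re-adding whatever is still needed, which is what we attempt in STEP 3 below. A safe way to inspect the stored values without changing anything:
- tunefs.lustre --dryrun /dev/mapper/mapXX # Only prints what would be done; nothing is written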
==================== STEP 3 ====================
Next, we thought that we could simply remove those parameters. So, for example, from the 2nd MDS (in an active-standby pair), the customer ran:
- tunefs.lustre --erase-param --mgsnode=xx.yy.zz.242@tcp0 --failnode=xx.yy.zz.241@tcp0 --writeconf --fsname=xxxxxx /dev/mapper/map00
The command and its output can be seen in the file first-try-tunefs.lustre.log.2, but I am including it here because it shows that the mdt.quota_type parameters now appear to be gone.
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: xxxxxx-MDT0000
Index: 0
Lustre FS: xxxxxx
Mount type: ldiskfs
Flags: 0x405
(MDT MGS )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=xx.yy.zz.241@tcp failover.node=xx.yy.zz.242@tcp mdt.quota_type=-ug2 mdt.quota_type=ug2
Permanent disk data:
Target: xxxxxx-MDT0000
Index: 0
Lustre FS: xxxxxx
Mount type: ldiskfs
Flags: 0x545
(MDT MGS update writeconf )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=xx.yy.zz.242@tcp failover.node=xx.yy.zz.241@tcp
Writing CONFIGS/mountdata
RC=0
ALLRC=0
We ran similar commands on each OST, e.g.
- tunefs.lustre --erase-param --failnode=xx.yy.zz.244@tcp0 --mgsnode=xx.yy.zz.242@tcp0 --mgsnode=xx.yy.zz.241@tcp0 --writeconf --fsname=xxxxxx /dev/mapper/map00
The output from these commands can be found in first-try-tunefs.lustre.log.3 and first-try-tunefs.lustre.log.4 (3 & 4 are the two OSSes; each OSS has 6 targets).
At this point, the customer was NOT ABLE to mount the MDT/MGS.
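For reference, the full writeconf procedure as we understand it from the Lustre manual is, in order (the mount points below are placeholders):
- tunefs.lustre --writeconf /dev/mapper/map00 # On the MDS, with all clients and targets unmounted
- tunefs.lustre --writeconf /dev/mapper/mapXX # On each OSS, for every OST
- mount -t lustre /dev/mapper/map00 /mnt/mdt # Mount the MDT/MGS first so the configuration logs are regenerated
- mount -t lustre /dev/mapper/mapXX /mnt/ostX # Then each OST, and finally the clients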
==================== STEP 4 ====================
Next, we decided to run the command from the first MDS, since we normally format from there. We ran:
- tunefs.lustre --erase-param --mgsnode=xx.yy.zz.241@tcp0 --failnode=xx.yy.zz.242@tcp0 --writeconf --fsname=xxxxxx /dev/mapper/map00
The output of this command was:
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: xxxxxx-MDT0000
Index: 0
Lustre FS: xxxxxx
Mount type: ldiskfs
Flags: 0x405
(MDT MGS )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=xx.yy.zz.242@tcp failover.node=xx.yy.zz.241@tcp
Permanent disk data:
Target: xxxxxx-MDT0000
Index: 0
Lustre FS: xxxxxx
Mount type: ldiskfs
Flags: 0x545
(MDT MGS update writeconf )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=xx.yy.zz.241@tcp failover.node=xx.yy.zz.242@tcp
Writing CONFIGS/mountdata
RC=0
We ran similar commands on each OST, e.g.
- tunefs.lustre --erase-param --failnode=xx.yy.zz.244@tcp0 --mgsnode=xx.yy.zz.241@tcp0 --mgsnode=xx.yy.zz.242@tcp0 --writeconf --fsname=xxxxxx /dev/mapper/map00
The output from these commands can be found in second-try-tunefs.lustre.log.3 and second-try-tunefs.lustre.log.4.
At this point, the customer was STILL NOT ABLE to mount the MDT/MGS.
==================== STEP 5 ====================
The customer thought that the problem might be with MMP, so he disabled and re-enabled it:
- tune2fs -O ^mmp /dev/mapper/map00
- tune2fs -O mmp /dev/mapper/map00
At this point, the customer was STILL NOT ABLE to mount the MDT/MGS.
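Whether MMP is actually involved can be checked read-only (a sketch; device name as above):
- dumpe2fs -h /dev/mapper/map00 | grep -i mmp # Lists the mmp feature, MMP block and update interval when enabled
As far as we know, an MMP conflict produces an explicit multiple-mount error at mount time, which does not match the -13 errors in the MDS logs below.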
==================== LIST OF ATTACHED FILES ====================
messages.01.gz: gzipped /var/log/messages file from first MDS
messages.02: /var/log/messages file from second MDS
messages.03: /var/log/messages file from first OSS
messages.04: /var/log/messages file from second OSS
first-try-tunefs.lustre.log.2: tunefs command run from second MDS
first-try-tunefs.lustre.log.3: tunefs command run from first OSS
first-try-tunefs.lustre.log.4: tunefs command run from standby OSS
second-try-tunefs.lustre.log.1: tunefs command run from first MDS
second-try-tunefs.lustre.log.3: tunefs command run from first OSS
second-try-tunefs.lustre.log.4: tunefs command run from standby OSS
==================== LOGS FROM MDS ====================
The log messages in /var/log/messages on the first MDS are:
Oct 24 20:25:09 ts-xxxxxxxx-01 kernel: LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts:
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts:
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: MGS MGS started
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: 6030:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGCxx.yy.zz.241@tcp->MGCxx.yy.zz.241@tcp_1 netid 90000: select flavor null
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: 6030:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 1 previous similar message
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: 6061:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from 258ee8c7-213d-5dc4-fee0-c9990e7b9461@0@lo t0 exp (null) cur 1351110316 last 0
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: MGCxx.yy.zz.241@tcp: Reactivating import
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: Enabling ACL
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(osd_handler.c:336:osd_iget()) no inode
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(md_local_object.c:433:llo_local_objects_setup()) creating obj [last_rcvd] fid = [0x200000001:0xb:0x0] rc = -13
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(mdt_handler.c:4577:mdt_init0()) Can't init device stack, rc -13
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(obd_config.c:522:class_setup()) setup xxxxxx-MDT0000 failed (-13)
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6065:0:(obd_config.c:1361:class_config_llog_handler()) Err -13 on cfg command:
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: cmd=cf003 0:xxxxxx-MDT0000 1:xxxxxx-MDT0000_UUID 2:0 3:xxxxxx-MDT0000-mdtlov 4:f
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 15c-8: MGCxx.yy.zz.241@tcp: The configuration from log 'xxxxxx-MDT0000' failed (-13). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(obd_mount.c:1192:server_start_targets()) failed to start server xxxxxx-MDT0000: -13
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(obd_mount.c:1723:server_fill_super()) Unable to start targets: -13
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(obd_config.c:567:class_cleanup()) Device 3 not setup
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: Lustre: MGS has stopped.
Oct 24 20:25:16 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Oct 24 20:25:22 ts-xxxxxxxx-01 kernel: Lustre: 6030:0:(client.c:1779:ptlrpc_expire_one_request()) @@@ Request x1416737969930383 sent from MGCxx.yy.zz.241@tcp to NID 0@lo has timed out for slow reply: [sent 1351110316] [real_sent 1351110316] [current 1351110322] [deadline 6s] [delay 0s] req@ffff88060e66ec00 x1416737969930383/t0(0) o-1->MGS@MGCxx.yy.zz.241@tcp_1:26/25 lens 192/192 e 0 to 1 dl 1351110322 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Oct 24 20:25:22 ts-xxxxxxxx-01 kernel: Lustre: server umount xxxxxx-MDT0000 complete
Oct 24 20:25:22 ts-xxxxxxxx-01 kernel: LustreError: 6030:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-13)
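For what it is worth, rc = -13 is -EACCES ("Permission denied"), and the failure occurs while the MDT stack sets up the last_rcvd object (osd_iget() reports "no inode" just before). With the target unmounted, the backing ldiskfs objects can be inspected directly; a sketch, with a hypothetical mount point:
- mount -t ldiskfs -o ro /dev/mapper/map00 /mnt/mdt-ldiskfs # Read-only mount of the MDT's backing filesystem
- ls -l /mnt/mdt-ldiskfs/CONFIGS /mnt/mdt-ldiskfs/last_rcvd # Check that the config logs and last_rcvd exist and look sane
- umount /mnt/mdt-ldiskfs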