[LU-8044] class_process_config() no device for: lustre-MDT0021-mdtlov Created: 19/Apr/16  Updated: 14/Jun/18  Resolved: 15/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

TOSS 2 (RHEL 6.7 based)
kernel 2.6.32-573.22.1.1chaos.ch5.4.x86_64
Lustre 2.8.0+patches 2.8-llnl-preview1
zfs-0.6.5.4-1.ch5.4.x86_64
1 MGS - separate server
40 MDTs - each on separate server
10 OSTs - each on separate server
Filesystem name is "lustre"


Attachments: File dk.catalyst240     File dmesg.catalyst240     File ldev.conf     File llog.MDT0000.onMGS     File llog.MDT0021.onMGS     Text File mgs.register_mdts.dk.gz    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On startup for the first time after formatting, the MDT fails to process the config provided by the MGS. The MDT then fails to start.
The config log on the MGS appears to be invalid: it contains more than one setup and modify_mdc_tgts record for one of the other MDTs.

The MDT which fails to start reports:

Lustre: Lustre: Build Version: 2.8.0
LustreError: 11797:0:(obd_config.c:1262:class_process_config()) no device for: lustre-MDT0021-mdtlov
LustreError: 11797:0:(obd_config.c:1666:class_config_llog_handler()) MGC192.168.112.240@o2ib15: cfg command failed: rc = -22
Lustre:    cmd=cf014 0:lustre-MDT0021-mdtlov  1:lustre-MDT0014_UUID  2:20  3:1

LustreError: 15b-f: MGC192.168.112.240@o2ib15: The configuration from log 'lustre-MDT0021'failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 11667:0:(obd_mount_server.c:1309:server_start_targets()) failed to start server lustre-MDT0021: -22
LustreError: 11667:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -22
LustreError: 11667:0:(obd_mount_server.c:1512:server_put_super()) no obd lustre-MDT0021
Lustre: server umount lustre-MDT0021 complete
LustreError: 11667:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-22)
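
For readers decoding the messages above: rc = -22 is a negated Linux errno (EINVAL), and the index embedded in a target name such as lustre-MDT0014 appears to be hexadecimal, which would explain why the failing record carries 2:20 for that target. A minimal Python sketch of both conversions follows. The helper name target_index is hypothetical and is not part of any Lustre tool; the hexadecimal reading is inferred from the records quoted in this ticket.

```python
import errno
import re

def target_index(name: str) -> int:
    """Return the decimal index encoded in a name like 'lustre-MDT0014_UUID'.

    Assumption: the four characters after MDT/OST are hexadecimal, as the
    records in this ticket suggest (MDT0014 is paired with index 20).
    """
    match = re.search(r"-(?:MDT|OST)([0-9a-fA-F]{4})", name)
    if match is None:
        raise ValueError(f"no target index in {name!r}")
    return int(match.group(1), 16)

# errno 22 is EINVAL, so rc = -22 means "invalid argument".
print(errno.errorcode[22])
print(target_index("lustre-MDT0014_UUID"))
print(target_index("lustre-MDT0021-mdtlov"))
```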

The config logs CONFIGS/lustre-MDT* do not all have the same number of records. lustre-MDT0021 has 2 more records than the other 29 MDTs.

The suspicious llog records are:

#04 (152)setup     0:lustre-MDT0014-osp-MDT0021  1:lustre-MDT0014_UUID  2:192.168.113.6@o2ib15
#05 (136)modify_mdc_tgts add 0:lustre-MDT0021-mdtlov  1:lustre-MDT0014_UUID  2:20  3:1
#179 (152)setup     0:lustre-MDT0014-osp-MDT0021  1:lustre-MDT0014_UUID  2:192.168.113.6@o2ib15
#180 (136)modify_mdc_tgts add 0:lustre-MDT0021-mdtlov  1:lustre-MDT0014_UUID  2:20  3:1
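
Duplicates like the ones above can be spotted mechanically in a text dump of the config log. The following is a rough Python sketch, assuming the dump uses the "#NN (size)command args" layout shown in this ticket; the function name find_duplicate_records is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "#04 (152)setup     0:lustre-MDT0014-osp-MDT0021 ..."
RECORD = re.compile(r"^#(\d+)\s+\((\d+)\)(.*)$")

def find_duplicate_records(lines):
    """Map each record body that occurs more than once to its record numbers."""
    seen = defaultdict(list)
    for line in lines:
        match = RECORD.match(line.strip())
        if match is None:
            continue  # skip anything that is not a numbered record
        number, _size, body = match.groups()
        # Collapse runs of whitespace so column alignment does not matter.
        seen[" ".join(body.split())].append(int(number))
    return {body: nums for body, nums in seen.items() if len(nums) > 1}

# The four suspicious records quoted in this ticket.
sample = [
    "#04 (152)setup     0:lustre-MDT0014-osp-MDT0021  1:lustre-MDT0014_UUID  2:192.168.113.6@o2ib15",
    "#05 (136)modify_mdc_tgts add 0:lustre-MDT0021-mdtlov  1:lustre-MDT0014_UUID  2:20  3:1",
    "#179 (152)setup     0:lustre-MDT0014-osp-MDT0021  1:lustre-MDT0014_UUID  2:192.168.113.6@o2ib15",
    "#180 (136)modify_mdc_tgts add 0:lustre-MDT0021-mdtlov  1:lustre-MDT0014_UUID  2:20  3:1",
]
dups = find_duplicate_records(sample)
for body, numbers in dups.items():
    print(numbers, body)
```

Marker records carry a marker number and timestamp, so they would not normally be reported as duplicates by this comparison.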


 Comments   
Comment by Olaf Faaland [ 19/Apr/16 ]

My description made it sound like this happens every time. That's not the case; it happens intermittently.

Comment by Olaf Faaland [ 19/Apr/16 ]

Attached:
config log for MDT0021 from CONFIGS/ on MGS
console log (dmesg) for MDS (catalyst240)
ldev.conf showing what nodes play what roles
lctl dk output from MGS, reflects default debug and subsystem_debug settings

Comment by Di Wang [ 19/Apr/16 ]
#04 (152)setup     0:lustre-MDT0014-osp-MDT0021  1:lustre-MDT0014_UUID  2:192.168.113.6@o2ib15
#05 (136)modify_mdc_tgts add 0:lustre-MDT0021-mdtlov  1:lustre-MDT0014_UUID  2:20  3:1

The OSP setup record appears too early in the log, which does not look right.

Could you please post CONFIGS/lustre-MDT0021 and CONFIGS/lustre-MDT0000 here? Thanks.

Comment by Olaf Faaland [ 19/Apr/16 ]

Config log for MDT0000.
The log for MDT0021 is already attached.

Comment by Olaf Faaland [ 19/Apr/16 ]

Di,

For the next few hours, I can either gather more information for you, or experiment. About 4 hours from now, I'll have to put the nodes back to their production use and the filesystem will be destroyed.

thanks,
Olaf

Comment by Di Wang [ 19/Apr/16 ]

According to the config log, it looks like the OSP (lustre-MDT0014-osp-MDT0021) setup record was added before the "lov setup" record, which is clearly wrong.

#01 (224)marker 865 (flags=0x01, v2.8.0.0) lustre-MDT0014  'add osp' Tue Apr 19 08:55:48 2016-
#02 (088)add_uuid  nid=192.168.113.6@o2ib15(0x5000fc0a87106)  0:  1:192.168.113.6@o2ib15
#03 (144)attach    0:lustre-MDT0014-osp-MDT0021  1:osp  2:lustre-MDT0021-mdtlov_UUID
#04 (152)setup     0:lustre-MDT0014-osp-MDT0021  1:lustre-MDT0014_UUID  2:192.168.113.6@o2ib15
#05 (136)modify_mdc_tgts add 0:lustre-MDT0021-mdtlov  1:lustre-MDT0014_UUID  2:20  3:1
#06 (224)END   marker 865 (flags=0x02, v2.8.0.0) lustre-MDT0014  'add osp' Tue Apr 19 08:55:48 2016-
#07 (224)marker 873 (flags=0x01, v2.8.0.0) lustre-MDT0021  'add mdt' Tue Apr 19 08:55:48 2016-
#08 (120)attach    0:lustre-MDT0021  1:mdt  2:lustre-MDT0021_UUID
#09 (112)mount_option 0:  1:lustre-MDT0021  2:lustre-MDT0021-mdtlov
#10 (160)setup     0:lustre-MDT0021  1:lustre-MDT0021_UUID  2:33  3:lustre-MDT0021-mdtlov  4:f
#11 (224)END   marker 873 (flags=0x02, v2.8.0.0) lustre-MDT0021  'add mdt' Tue Apr 19 08:55:48 2016-

I am checking the debug log on the MGS to see why this happens.

Olaf, could you please try to reproduce the problem with debug level = -1 on the MGS? It will help me figure out what happens there. Thanks.
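
The ordering problem in the excerpt above can also be checked with a small script. This Python sketch flags any record that names a device in slot 0 before an attach record for that device has been seen. It assumes that every device is introduced by an earlier attach record, which matches the records quoted here but is not confirmed for every record type; the function name records_before_attach is hypothetical.

```python
import re

RECORD = re.compile(r"^#(\d+)\s+\(\d+\)(.*)$")
ARG0 = re.compile(r"(?:^|\s)0:(\S*)")  # slot 0 holds the device name

def records_before_attach(lines):
    """Return (record number, device) pairs that use a device before its attach."""
    attached = set()
    problems = []
    for line in lines:
        match = RECORD.match(line.strip())
        if match is None:
            continue
        number, body = int(match.group(1)), match.group(2).strip()
        command = body.split()[0] if body else ""
        if command in ("marker", "END"):
            continue  # markers only bracket a group of records
        arg0 = ARG0.search(body)
        device = arg0.group(1) if arg0 else ""
        if not device:
            continue  # e.g. add_uuid and mount_option leave slot 0 empty
        if command == "attach":
            attached.add(device)
        elif device not in attached:
            problems.append((number, device))
    return problems

# Records #01 to #11 as quoted in this ticket.
excerpt = [
    "#01 (224)marker 865 (flags=0x01, v2.8.0.0) lustre-MDT0014  'add osp' Tue Apr 19 08:55:48 2016-",
    "#02 (088)add_uuid  nid=192.168.113.6@o2ib15(0x5000fc0a87106)  0:  1:192.168.113.6@o2ib15",
    "#03 (144)attach    0:lustre-MDT0014-osp-MDT0021  1:osp  2:lustre-MDT0021-mdtlov_UUID",
    "#04 (152)setup     0:lustre-MDT0014-osp-MDT0021  1:lustre-MDT0014_UUID  2:192.168.113.6@o2ib15",
    "#05 (136)modify_mdc_tgts add 0:lustre-MDT0021-mdtlov  1:lustre-MDT0014_UUID  2:20  3:1",
    "#06 (224)END   marker 865 (flags=0x02, v2.8.0.0) lustre-MDT0014  'add osp' Tue Apr 19 08:55:48 2016-",
    "#07 (224)marker 873 (flags=0x01, v2.8.0.0) lustre-MDT0021  'add mdt' Tue Apr 19 08:55:48 2016-",
    "#08 (120)attach    0:lustre-MDT0021  1:mdt  2:lustre-MDT0021_UUID",
    "#09 (112)mount_option 0:  1:lustre-MDT0021  2:lustre-MDT0021-mdtlov",
    "#10 (160)setup     0:lustre-MDT0021  1:lustre-MDT0021_UUID  2:33  3:lustre-MDT0021-mdtlov  4:f",
    "#11 (224)END   marker 873 (flags=0x02, v2.8.0.0) lustre-MDT0021  'add mdt' Tue Apr 19 08:55:48 2016-",
]
problems = records_before_attach(excerpt)
print(problems)
```

On this excerpt the only flagged record is #05, the same modify_mdc_tgts record that the failing MDT reports with "no device for: lustre-MDT0021-mdtlov".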

Comment by Di Wang [ 19/Apr/16 ]

Ah, it looks like a race when the MGS registers 2 MDTs at the same time. I will cook up a patch.

Comment by Olaf Faaland [ 19/Apr/16 ]

Attached debug log from the MGS with debug = -1, captured while the MDTs were coming up for the first time.

In this log, MDT0002 (on catalyst243, NID 192.168.112.243@o2ib15) encountered the error.

Lustre: Lustre: Build Version: 2.8.0
LustreError: 11826:0:(obd_config.c:1262:class_process_config()) no device for: lustre-MDT0002-mdtlov
LustreError: 11826:0:(obd_config.c:1666:class_config_llog_handler()) MGC192.168.112.240@o2ib15: cfg command failed: rc = -22
Lustre:    cmd=cf014 0:lustre-MDT0002-mdtlov  1:lustre-MDT0023_UUID  2:35  3:1

LustreError: 15b-f: MGC192.168.112.240@o2ib15: The configuration from log 'lustre-MDT0002'failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 11696:0:(obd_mount_server.c:1309:server_start_targets()) failed to start server lustre-MDT0002: -22
LustreError: 11696:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -22
LustreError: 11696:0:(obd_mount_server.c:1512:server_put_super()) no obd lustre-MDT0002
Lustre: server umount lustre-MDT0002 complete
LustreError: 11696:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-22)

Comment by Gerrit Updater [ 19/Apr/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/19658
Subject: LU-8044 mgs: Only add OSP for registered MDT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3ccd18da205192ec0ad527ec88b69793aa5e6670

Comment by Di Wang [ 19/Apr/16 ]

Olaf: the new debug log does not seem to have caught the failure. It was probably captured too late, or debug = -1 made the dk log too big to retain all of the information. In any case, patch 19658 should help here. Please try it when you have another chance. Thanks.

Comment by Joseph Gmitter (Inactive) [ 20/Apr/16 ]

Hi Di,
Assigning to you as I see you have already commented and provided a fix in a new patch.
Thanks.
Joe

Comment by Gerrit Updater [ 14/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19658/
Subject: LU-8044 mgs: Only add OSP for registered MDT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c67a74b55c126ec1be6c195cb2e8cb8c2e6cf868

Comment by Joseph Gmitter (Inactive) [ 15/Jun/16 ]

The patch has landed on master for 2.9.0.
