[LU-1257] OST registration snafu Created: 23/Mar/12 Updated: 27/Apr/15 Resolved: 29/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 5533 |
| Description |
|
Our sysadmins were expanding one of our 2.1 filesystems yesterday and ran into a problem. Because of previous 1.8 problems that made out-of-order OST registration problematic, the admin was using a script to mount the OSTs one at a time, in sequential ID order. With each new OST registration the MGS lock is revoked, and lock timeouts were not infrequent. One OST hit a timeout while communicating with the MGS, and this proved to be a fairly non-recoverable event. I was forced to break out hexedit to get things working again reasonably.

2012-03-22 10:07:58 LustreError: 166-1: MGC172.19.1.100@o2ib100: Connection to MGS (at 172.19.1.100@o2ib100) was lost; in progress operations using this service will fail
2012-03-22 10:07:58 LustreError: 12150:0:(ldlm_request.c:115:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1332435772, 306s ago), entering recovery for MGS@MGC172.19.1.100@o2ib100_0 ns: MGC172.19.1.100@o2ib100 lock: ffff880314ba0480/0x54303899ba77f6ed lrc: 4/1,0 mode: --/CR res: 6517612/0 rrc: 1 type: PLN flags: 0x10000010 remote: 0x1d01b502e0a66a07 expref: -99 pid: 12150 timeout 0
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:1163:server_start_targets()) Required registration failed for lsc-OST0174: -4
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:1719:server_fill_super()) Unable to start targets: -4
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:1508:server_put_super()) no obd lsc-OST0174
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:141:server_deregister_mount()) lsc-OST0174 not registered
2012-03-22 10:07:59 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
2012-03-22 10:07:59 Lustre: server umount lsc-OST0174 complete
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:2160:lustre_fill_super()) Unable to mount (-4)
2012-03-22 10:08:24 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
2012-03-22 10:08:49 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
2012-03-22 10:09:14 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
2012-03-22 10:09:38 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.

The admin reran the mount command several more times and each time got either -4 (EINTR), which corresponds to an instance of a lock timeout as above, or -5 (EIO). When I was called in, I tried the mount myself to start gathering info. At that point it returned -98 (EADDRINUSE), and I saw from the MGS logs that it HAD finally processed the OST's registration:

2012-03-22 11:09:22 LustreError: 140-5: Server lsc-OST0174 requested index 372, but that index is already in use. Use --writeconf to force
2012-03-22 11:09:22 LustreError: 12649:0:(mgs_llog.c:2710:mgs_write_log_target()) Can't get index (-98)
2012-03-22 11:09:22 LustreError: 12649:0:(mgs_handler.c:520:mgs_handle_target_reg()) Failed to write lsc-OST0174 log (-98)

But the OST does not know that the registration succeeded, so its mountdata still has the LDD_F_VIRGIN flag set. Because of that, the MGS will never let the OST connect. That left us two courses of action (to the best of my knowledge):

1. Use --writeconf to regenerate the configuration logs, which would have meant taking the whole filesystem down, or
2. hand-edit the OST's mountdata to clear the LDD_F_VIRGIN flag.
Since we did not want to cause a downtime for the filesystem, we chose the latter. The mount of the OST seemed to mostly go well, and it appears to be functioning fine now, but I did see this error on the MGS/MDS console:

2012-03-22 15:09:48 Lustre: Found index 372 for lsc-OST0174, updating log
2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1019:class_process_config()) no device for: lsc-OST0174-osc
2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1363:class_config_llog_handler()) Err -22 on cfg command:
2012-03-22 15:12:26 Lustre: cmd=cf00b 0:lsc-OST0174-osc 1:172.19.1.127@o2ib100

The OST appears to be working fine, so I am not sure how worried I should be about that llog error. I will attach some console logs to show what was going on when the OST registration failed. |
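The failure mode described above can be modeled in a few lines: the MGS tracks which indices are already recorded, and a target whose mountdata still carries LDD_F_VIRGIN is treated as brand new, so a re-sent registration for an already-recorded index is refused. The sketch below is an illustrative model only, not the actual Lustre source; `mgs_check_index` and the flag value are hypothetical stand-ins.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag value; stands in for the real LDD_F_VIRGIN bit in
 * the target's on-disk mountdata ("this target has never registered"). */
#define LDD_F_VIRGIN 0x01

/* Toy MGS-side index table: true means some target owns this index. */
static bool index_in_use[1024];

/* Model of the decision in the logs above: a virgin target asking for an
 * index that is already recorded gets -EADDRINUSE (-98), while a
 * non-virgin target with a known index is treated as a re-registration
 * ("Found index ..., updating log") and succeeds. */
static int mgs_check_index(int index, unsigned int flags)
{
    if (index_in_use[index]) {
        if (flags & LDD_F_VIRGIN)
            return -EADDRINUSE;  /* "index is already in use" */
        return 0;                /* known target: update the log */
    }
    index_in_use[index] = true;  /* first registration claims the slot */
    return 0;
}
```

This is exactly the trap hit here: the first registration claimed index 372 on the MGS, the reply was lost, and every retry still carried LDD_F_VIRGIN, so the MGS kept answering -98 until the flag was hand-cleared.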
| Comments |
| Comment by Christopher Morrone [ 23/Mar/12 ] |
|
Attached console logs from the MGS/MDS (sumom-mds1) and the OSS (sumom25). It was lsc-OST0174 that had the problem. |
| Comment by Peter Jones [ 26/Mar/12 ] |
|
Niu, could you please help out with this one? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 27/Mar/12 ] |
Hi, Chris

What do you mean by the "previous 1.8 problems that made out-of-order OST registration problematic"? I'm not familiar with that history.

The final error on the MDS console shows that adding a failover connection for lsc-OST0174 (to the lsc-OST0174-osc on the MDS) failed; the failure reason is that the osc hasn't been set up yet. I think this could be caused by a small defect in mgs_write_log_target(): if we find that the ost index has already been used, we do not add the osc for the registering OST to the MDT/client config log (see the comment after the warning message "Found index %d for %s, updating log\n"); however, the failover information will still be written into the MDT/client log in the following mgs_write_log_param(), but with the old 1.8 osc name lsc-OST0174-osc rather than lsc-OST0174-osc-MDT0000.

Actually, in mgs_write_log_ost() we always generate an osc name like xxx-MDT0000; however, in mgs_write_log_add_failnid() we generate the osc name via name_create_mdt_osc(), which checks the FSDB_OSCNAME18 flag of the fsdb. I need to go through this part of the code carefully, but it seems your system was upgraded from 1.8 to 2.1? |
| Comment by Christopher Morrone [ 27/Mar/12 ] |
We were adding new OSTs to an existing filesystem while the file system was on line. So no, writeconf was not used on the MGS.
That's not really important here, except to know perhaps that the OSTs were being mounted one at a time in the order of their pre-assigned OST index. There was an LU ticket about this somewhere I believe, but I can't find it at the moment.
It ran 1.8 in the past, yes. |
| Comment by Niu Yawei (Inactive) [ 28/Mar/12 ] |
|
It seems there is a problem when adding a new OST on a system upgraded from 1.8. The mdt OSC name in 1.8 is fsname-svname-osc, which differs from the mdt OSC name in 2.0, fsname-svname-osc-MDT0000. When a 1.8 system is upgraded to 2.0, the existing mdt OSC entries in the MDT config are preserved, and the 2.0 code detects the old OSC names in the config, so when setting parameters we keep using the old naming style (for instance, when setting a failover node; see mgs_write_log_add_failnid()). However, a newly added OST (with LDD_F_VIRGIN) simply uses the new naming style to create its mdt OSC in the MDT log (see mgs_write_log_ost() -> mgs_write_log_osc_to_lov()). In the end there are two kinds of OSC names in the MDT log: old names for the OSTs added under 1.8, and new names for the OSTs newly added under 2.0. Since the parameter-setting code detects the old OSC names and keeps using the old style, 'lctl conf_param' for the new OST will not work anymore. I think we should probably change the registration code to use the old naming style as well (when there are already old names in the log).

Hi, Chris, could you try 'lctl conf_param' on the newly added OST (lsc-OST0174) to see if it still works? For example, you could try 'lctl conf_param lsc-OST0174.osc.max_dirty=xxx'. Thanks. |
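The naming split Niu describes can be sketched as follows. `mdt_osc_name` is a hypothetical stand-in for the real name_create_mdt_osc() / mgs_write_log_osc_to_lov() logic, with a plain boolean standing in for the FSDB_OSCNAME18 flag; the point is only that the two code paths produce different device names for the same OST.

```c
#include <stdio.h>
#include <stdbool.h>

/* Illustrative only: build the mdt-side OSC device name for an OST.
 * A config log carried over from 1.8 uses "fsname-svname-osc"; the 2.0
 * registration path appends the MDT name. Mixing the two styles in one
 * MDT log is exactly the bug discussed in this ticket. */
static void mdt_osc_name(char *buf, size_t len, const char *fsname,
                         const char *svname, bool oscname18)
{
    if (oscname18)
        snprintf(buf, len, "%s-%s-osc", fsname, svname);          /* 1.8 style */
    else
        snprintf(buf, len, "%s-%s-osc-MDT0000", fsname, svname);  /* 2.0 style */
}
```

With fsname "lsc" and svname "OST0174" the two styles yield "lsc-OST0174-osc" and "lsc-OST0174-osc-MDT0000" — the exact pair of names that collide in the Err -22 console message above.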
| Comment by Christopher Morrone [ 28/Mar/12 ] |
|
Yes, the old OSCs have the old naming convention, and the new ones use the new convention. I agree that the mix of old and new naming is annoying. We should probably track that in a separate ticket though (unless it is somehow implicated in this bug).
There is no osc named "lsc-OST0174", it is named "lsc-OST0174-osc-MDT0000". I am not sure what you are getting at with the conf_param. Would you like me to try using it against lsc-OST0174-osc-MDT0000? |
| Comment by Niu Yawei (Inactive) [ 28/Mar/12 ] |
I'm afraid that using conf_param against the OSC name won't work; the input to conf_param should be $svname.$prefix.$param=$value. If you run 'lctl conf_param lsc-OST0174.osc.max_dirty_mb=10', the max_dirty_mb of all the OSCs for OST0174 should be changed. |
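As a sketch of the argument shape Niu describes ($svname.$prefix.$param=$value, e.g. "lsc-OST0174.osc.max_dirty_mb=10"), a toy tokenizer might look like this. It is illustrative only, not the real lctl parsing code, and it assumes none of the components contain a '.' or '='.

```c
#include <stdio.h>

/* Split "svname.prefix.param=value" into its four components.
 * Returns 0 on success, -1 if the string does not match the shape.
 * Each output buffer must hold at least 64 bytes. */
static int parse_conf_param(const char *arg, char *svname, char *prefix,
                            char *param, char *value)
{
    /* %63[^.] reads up to the next '.', %63[^=] up to the '='. */
    return sscanf(arg, "%63[^.].%63[^.].%63[^=]=%63s",
                  svname, prefix, param, value) == 4 ? 0 : -1;
}
```

Note that the OST service name ("lsc-OST0174") contains hyphens but no dots, which is why the dotted syntax can address the service rather than any one OSC device — the MGS then fans the setting out to every OSC for that OST.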
| Comment by Niu Yawei (Inactive) [ 29/Mar/12 ] |
Hi, Chris

2012-03-22 15:09:48 Lustre: Found index 372 for lsc-OST0174, updating log
2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1019:class_process_config()) no device for: lsc-OST0174-osc
2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1363:class_config_llog_handler()) Err -22 on cfg command:
2012-03-22 15:12:26 Lustre: cmd=cf00b 0:lsc-OST0174-osc 1:172.19.1.127@o2ib100

The MGS/MDS console messages above should be caused by the mixed OSC names in the MDT log. I think these messages are what concerned you, right?

As for the OST registration failures, I saw lots of

2012-03-22 10:08:24 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16

in the OSS log, so I think there might have been some network problems between the MGS and OSS (or the MGS was not ready at that time?), which is why the OST0174 registration failed with EINTR/EIO many times. After the network (or the MGS?) came back to normal, the OST0174 registration updated the config log on the MGS successfully; however, the MGS timed out revoking the config locks from all clients, which caused the OST0174 registration reply to time out. That is why you got EADDRINUSE when you mounted the OST again. Actually, there are lots of lock timeouts and client evictions in the MGS/MDS log; I'm not sure the cluster was in a healthy state at that time. |
| Comment by Christopher Morrone [ 29/Mar/12 ] |
I am only concerned with that as far as it relates to the original problem: a state arose during the addition of OSTs where one OST would never be allowed to complete its registration. To the best of my knowledge, there was no legitimate hardware network problem between the MGS and OSS nodes. The problems were far more likely faults in the Lustre software. For instance, the MGS may have been unresponsive because it was handling the lock revocation for the previous OST's registration when the registration for lsc-OST0174 came in. |
| Comment by Christopher Morrone [ 29/Mar/12 ] |
|
Please see |
| Comment by Christopher Morrone [ 29/Mar/12 ] |
But the osc is not named lsc-OST0174, it is named lsc-OST0174-osc-MDT0000. I think the confusion is that the error message on the MDT talks about "lsc-OST0174" because Lustre added an incorrect "add failnid" entry using the old naming style "lsc-OST0174", even though the osc was already correctly registered as "lsc-OST0174-osc-MDT0000". See this comment. I think that this was a result of me clearing LDD_F_VIRGIN on the OST but leaving LDD_F_UPDATE. For some reason, on the MGS the update used the old osc naming style rather than the new style. That seems like another bug to me. |
| Comment by Build Master (Inactive) [ 01/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Niu Yawei (Inactive) [ 01/Apr/12 ] |
|
There is always the possibility that registration succeeded on the MGS but the registration RPC got an error back on the target (the MGS timed out revoking config locks, or there were other networking problems); then the target will never be able to register with the MGS without erasing all server config logs, because future registrations for this target will always get -EADDRINUSE from the MGS. I think we'd better not treat -EADDRINUSE as a fatal error, but just print error messages on both the MGS and target server consoles and let the user make the decision (to ignore the error, or writeconf to re-register): http://review.whamcloud.com/2433

For the OSC name mismatch problem, I think we should make the registration code follow the same rule as the parameter-updating (conf_param) code does: always use the old OSC naming style for a system upgraded from 1.8: |
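The behavior change proposed here — downgrading -EADDRINUSE from a fatal mount error to a console warning — could be sketched as below. The function name and message are hypothetical stand-ins, not the actual code in review 2433.

```c
#include <errno.h>
#include <stdio.h>

/* Illustrative registration-result handler for the target-side mount path.
 * An -EADDRINUSE answer most likely means an earlier registration attempt
 * succeeded on the MGS but its reply was lost, so instead of failing the
 * mount we warn and continue, leaving --writeconf to the administrator. */
static int handle_register_result(const char *svname, int rc)
{
    if (rc == -EADDRINUSE) {
        fprintf(stderr,
                "%s: index already in use on the MGS; assuming a prior "
                "registration succeeded. Use --writeconf to re-register.\n",
                svname);
        return 0;   /* downgrade to a warning; the mount proceeds */
    }
    return rc;      /* all other errors (e.g. -EINTR, -EIO) remain fatal */
}
```

The design choice being argued over is where the decision lives: this sketch lets the mount continue automatically, whereas Chris argues below that the right answer is to surface the condition to an expert rather than act on it.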
| Comment by Christopher Morrone [ 10/Apr/12 ] |
|
These are all just band-aids for what appears to be a fundamentally bad design for initial registration. Really, the OST should probably generate some kind of random number and use it to identify itself upon first connection. Then if a problem occurs during registration, the MGS will be able to recognize the OST as really the same OST when it connects again and can allow the registration to be replayed. Or something along those lines.

Clearing the virgin flag on the OST requires a level of knowledge of Lustre's internals that few people have. I'm not sure that we should even mention it in a console message. The console message should probably say "consult your Lustre support vendor", and only when an expert has decided that the conditions are really right would they advise using the --clear-virgin option. |
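One way to realize the random-identity idea sketched above: the target generates a token at first mount and presents it on every registration attempt, so the MGS can tell a lost-reply retry apart from a genuine index conflict. All names and structures here are hypothetical; this is a thought-experiment implementation of the comment, not anything in the Lustre tree.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy MGS-side index slot: remembers which target identity owns it. */
struct mgs_slot {
    bool     used;
    uint64_t token;   /* random identity the target generated at first mount */
};

/* If the slot is free, claim it and record the token. If it is taken but
 * the token matches, this is the same target retrying after a lost reply,
 * so replay the registration and succeed. Only a token mismatch — a truly
 * different target asking for the same index — gets -EADDRINUSE. */
static int mgs_register(struct mgs_slot *slots, int index, uint64_t token)
{
    if (slots[index].used) {
        if (slots[index].token == token)
            return 0;           /* same target retrying: replay, succeed */
        return -EADDRINUSE;     /* genuinely conflicting target */
    }
    slots[index].used  = true;
    slots[index].token = token;
    return 0;
}
```

Compared with the virgin-flag scheme, the retry after a lost reply becomes idempotent, which is exactly the property the original incident lacked.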
| Comment by Niu Yawei (Inactive) [ 04/Jun/12 ] |
I agree, but that would be feature-enhancement work, and I'm not sure we have available resources to work on it for now. Let's fix the inconsistent osc name defect first (http://review.whamcloud.com/#change,2432); it's not related to the target registration design. What do you think, Chris? |
| Comment by Christopher Morrone [ 04/Jun/12 ] |
|
That is a good start. |
| Comment by Christopher Morrone [ 12/Nov/12 ] |
|
This needs attention. Change 2432 for master has been sitting for a few months. |
| Comment by Niu Yawei (Inactive) [ 12/Nov/12 ] |
|
Yes, Chris. I just rebased the patch, and will keep it moving forward. Thanks. |
| Comment by Niu Yawei (Inactive) [ 28/Apr/14 ] |
|
Patch 2432 (the inconsistent osc name problem) has landed. For the problem related to the registration design, I'll open another ticket to track it. |
| Comment by Niu Yawei (Inactive) [ 28/Apr/14 ] |
|
LU-4966 has been created to track the improvement. Chris, can this be closed? |
| Comment by Christopher Morrone [ 28/Apr/14 ] |
|
I suppose it can be closed. In the future though, I would prefer that the side issues be fixed in new tickets and we leave the original ticket open to deal with the root issue. |
| Comment by Niu Yawei (Inactive) [ 29/Apr/14 ] |
|
Thank you, Chris. |