[LU-4939] Need to be able to sanely query and change MGS configuration information Created: 22/Apr/14 Updated: 18/Jul/19 Resolved: 02/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.7 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl, patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13651 |
| Description |
|
Is there a sane way to determine what NIDs a server is configured with as determined by the MGS? I know there's lctl list_nids, but that's not what I want. I need to determine which NIDs the MGS has for a given server, as I think the MGS is handing a client a non-existent NID for a server. |
| Comments |
| Comment by Robert Read (Inactive) [ 22/Apr/14 ] |
|
Is there a (sane) way to query the configuration stored on the MGS in general? |
| Comment by Prakash Surya (Inactive) [ 23/Apr/14 ] |
|
Sigh.. hex editor it is??? |
| Comment by Robert Read (Inactive) [ 23/Apr/14 ] |
|
lol! |
| Comment by Niu Yawei (Inactive) [ 23/Apr/14 ] |
|
Hi, Prakash. llog_reader can read the config file on MGS (mount MGS device as ldiskfs or dump out the config files by debugfs first), is this what you want? |
| Comment by Cliff White (Inactive) [ 23/Apr/14 ] |
|
Would lctl get_param osc.*.import give you what you want? connection: |
| Comment by Robert Read (Inactive) [ 23/Apr/14 ] |
|
Cliff, that might work, but I suspect he would like to validate the configuration on the MGS before he starts the filesystem with a potentially broken config. |
| Comment by Prakash Surya (Inactive) [ 23/Apr/14 ] |
|
This client can't mount the MDS, so I'm not sure the lctl get_param will work. Just to give a little more info on my actual problem: from looking at the client lustre log, I see these two lines:
00000020:00000080:18.0:1398123825.044168:0:18652:0:(obd_config.c:1079:class_process_config()) adding mapping from uuid 10.1.1.212@o2ib9 to nid 0x500090a0101d4 (10.1.1.212@o2ib9)
00000020:00000080:18.0:1398123825.044199:0:18652:0:(obd_config.c:1079:class_process_config()) adding mapping from uuid 10.1.1.212@o2ib9 to nid 0x200000a0101d4 (10.1.1.212@tcp)
The MDS is at NID 10.1.1.212@o2ib9, and NID 10.1.1.212@tcp doesn't even exist. The client is on the tcp0 network, and routed through to o2ib9. We can lctl ping in both directions: from the client to the MDS, and from the MDS to the client (so it doesn't look to be a routing issue).
What I think is going on is that somehow (we don't know how) the MGS has "bad" data regarding the MDS NIDs. I think the MGS has both the o2ib9 NID and the tcp NID listed for the MDS. Thus when the tcp client tries to mount, I think this occurs:
1. it connects to the MGS just fine
Mounting from another client on another network (o2ib2) works just fine. My guess is that's because the "good" client just happens to ignore the 10.1.1.212@tcp NID and uses the valid NID of 10.1.1.212@o2ib9.
Granted, this is just from poking at the lustre logs on the "bad" client. I'd really like to dump the lustre MGS config data to know for sure, but obviously that's not easy to do. Sigh. |
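The hex NIDs in the two log lines above can be decoded by hand, which makes it easier to see what the client was told. The following is a minimal sketch, assuming the usual 64-bit LNet NID layout (upper 32 bits are the network, split into a 16-bit LND type and a 16-bit network number; lower 32 bits are the IPv4 address) and assuming LND type 2 is tcp and type 5 is o2ib. The function names `decode_nid` and `nids_by_uuid` are hypothetical helpers, not part of any Lustre tool.

```python
import re

# Assumed LND type numbers: 2 = tcp (socklnd), 5 = o2ib (o2iblnd).
LND_NAMES = {2: "tcp", 5: "o2ib"}

# Matches the "adding mapping" debug-log message quoted in the comment above.
MAPPING_RE = re.compile(r"adding mapping from uuid (\S+) to nid (0x[0-9a-fA-F]+)")


def decode_nid(nid: int) -> str:
    """Turn a 64-bit NID into the familiar address@network string."""
    addr = nid & 0xFFFFFFFF          # lower 32 bits: IPv4 address
    net = (nid >> 32) & 0xFFFFFFFF   # upper 32 bits: network
    lnd_type = net >> 16             # upper half of the network: LND type
    net_num = net & 0xFFFF           # lower half of the network: network number
    ip = ".".join(str((addr >> shift) & 0xFF) for shift in (24, 16, 8, 0))
    name = LND_NAMES.get(lnd_type, f"lnd{lnd_type}")
    # Network number 0 is conventionally printed without a suffix ("tcp").
    suffix = str(net_num) if net_num else ""
    return f"{ip}@{name}{suffix}"


def nids_by_uuid(log_text: str) -> dict:
    """Group every NID the client was handed under the uuid it maps from."""
    mapping = {}
    for uuid, nid_hex in MAPPING_RE.findall(log_text):
        mapping.setdefault(uuid, []).append(decode_nid(int(nid_hex, 16)))
    return mapping
```

Run against a saved client debug log, `nids_by_uuid` would list both NIDs under the single uuid 10.1.1.212@o2ib9, which is the symptom described above: a tcp NID attached to a server that only exists on o2ib9.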
| Comment by Prakash Surya (Inactive) [ 23/Apr/14 ] |
|
We just did a --writeconf on the servers to reset the NIDs and that has resolved the problem. The tcp0 client can mount the filesystem just fine now. Can I make a feature request for a tool to read the configuration information from the MGS? Perhaps the llog_reader is a good place to start (as Niu suggested), but mounting the MGS devices as ldiskfs doesn't seem like a "correct" solution. Proper and "easy to use on a live filesystem" administration tools are really lacking here. |
| Comment by Prakash Surya (Inactive) [ 23/Apr/14 ] |
|
And just to highlight some of the usability issues that we sometimes face with Lustre, in this particular instance:
1. The client mount command just hung without any helpful messages (maybe that's "correct" in this case).
2014-04-21 14:17:54 LustreError: 17123:0:(lmv_obd.c:1292:lmv_statfs()) can't stat MDS #0 (lc2-MDT0000-mdc-ffff880d10805000), error -4
2014-04-21 14:17:54 LustreError: 16722:0:(lov_obd.c:937:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
2014-04-21 14:17:54 LustreError: 16722:0:(lov_obd.c:937:lov_cleanup()) Skipped 26 previous similar messages
2014-04-21 14:17:54 Lustre: Unmounted lc2-client
3. lctl ping worked just fine from client to MDS, and from MDS to client
So, even to a seasoned Lustre admin, everything "seemed" to be configured correctly (and actually worked, depending on the client's network being used). The fact that I had to rely on my past 3 years of Lustre experience and the lustre debug logs to track this down is troubling at best. I'm not trying to bash Lustre here, I just want to try and make my constant frustrations a little more known. |
| Comment by Robert Read (Inactive) [ 23/Apr/14 ] |
|
Actually, in this case trying Cliff's suggestion while the mount command was hung (though with "mdc.*.import" instead of osc) might have shown the bad NID. I agree those error messages are unhelpful. (Not sure I want to know what happens when deathrow=1.) |
| Comment by Niu Yawei (Inactive) [ 24/Apr/14 ] |
You can dump the client config file by debugfs on a live system. (debugfs -c -R "dump CONFIGS/lustre-client tmpfile" $mgs_device) |
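The offline route suggested in this thread (dump the config llog with debugfs, then read it with llog_reader) can be wrapped in a small script. This is only a sketch under stated assumptions: the device path, llog name, and output file below are placeholders, the helper name `offline_dump_commands` is hypothetical, and the commands are only assembled and printed, not executed.

```python
import shlex


def offline_dump_commands(mgs_device: str, llog_name: str, out_file: str) -> list:
    """Build the two commands from the comments above as argv lists."""
    # Step 1: copy the config llog out of the MGS device with debugfs.
    # -c is the flag used in the comment's command; -R runs the single request.
    dump = ["debugfs", "-c", "-R", f"dump CONFIGS/{llog_name} {out_file}", mgs_device]
    # Step 2: decode the dumped llog file with llog_reader.
    read = ["llog_reader", out_file]
    return [dump, read]


# Placeholder values; substitute the real MGS device and filesystem name.
for argv in offline_dump_commands("/dev/sdX", "lustre-client", "/tmp/lustre-client.llog"):
    print(shlex.join(argv))
```

Keeping the commands as argv lists means they could later be passed to `subprocess.run` unchanged, without going through a shell and its quoting rules.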
| Comment by Christopher Morrone [ 24/Apr/14 ] |
So the answer to Prakash's question is: "no, there is no sane way to do that". You basically need to be a Lustre developer to figure anything out about how an existing filesystem is configured, and that is just not acceptable this far into Lustre's development. It is going to make it very difficult for us to remain competitive with other filesystems if we don't start addressing the administration debt in Lustre. I changed the "summary" of this ticket to convert it from a question into a task that needs to be worked. Seeing how our sysadmins couldn't answer this question, and even Prakash struggled to get the information is a really great example of where work is needed for those of us that have been working on Lustre a long time. |
| Comment by Cliff White (Inactive) [ 25/Apr/14 ] |
|
The command I mentioned should be run on the MGS, not on a client. And I would agree with the comments above: we have needed a better way to get config information for quite a while. |
| Comment by Gerrit Updater [ 26/Oct/16 ] |
|
Ben Evans (bevans@cray.com) uploaded a new patch: http://review.whamcloud.com/23395 |
| Comment by Andreas Dilger [ 15/Jun/17 ] |
|
For whatever reason, I don't recall ever seeing this ticket, nor the patch that was pushed against it. There was a command added many years ago "lctl --device %MGS llog_print \$<llog_name>" that will dump the current config llog (e.g. testfs-client or testfs-OST0001) from a running MGS. In 2.6 this was changed to dump the llog in YAML format, for ease of reading/parsing (the pre-2.6 format was a huge mess and mostly unusable) via patch http://review.whamcloud.com/4254
Ideally, the offline llog_reader tool would be improved to also dump the config llogs in the same YAML format (possibly with a --yaml flag to keep compatibility with the old format, if that is considered important).
What is really needed at this point is to be able to take the YAML data as input and apply the tuning parameters to a system after a --writeconf or if the filesystem was reformatted. |
| Comment by Gerrit Updater [ 09/Mar/18 ] |
|
Ben Evans (bevans@cray.com) uploaded a new patch: https://review.whamcloud.com/31620 |
| Comment by Gerrit Updater [ 02/Apr/18 ] |
|
Ben Evans (bevans@cray.com) uploaded a new patch: https://review.whamcloud.com/31846 |
| Comment by Gerrit Updater [ 09/Apr/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31620/ |
| Comment by Gerrit Updater [ 10/Apr/18 ] |
|
Ben Evans (bevans@cray.com) uploaded a new patch: https://review.whamcloud.com/31931 |
| Comment by Gerrit Updater [ 14/Apr/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31931/ |
| Comment by Gerrit Updater [ 02/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/31846/ |
| Comment by Peter Jones [ 02/Nov/18 ] |
|
Landed for 2.12 |
| Comment by Gerrit Updater [ 13/Feb/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34250 |
| Comment by Gerrit Updater [ 15/Feb/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34250/ |