[LU-9990] MDS fails to mount due to (client.c:96:ptlrpc_uuid_to_connection()) cannot find peer MGC10.37.248.196@o2ib1_0! Created: 14/Sep/17 Updated: 06/Nov/17 Resolved: 24/Oct/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James A Simmons | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Latest lustre 2.10.5X running on RHEL7.4 with default OFED. Using IB for LND. |
||
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Recently I started to run into issues with the MDT failing to mount randomly. Now with the latest master the MDT fails to mount every single time. Looking at the debug log I noticed the following error on the MDT: (client.c:96:ptlrpc_uuid_to_connection()) cannot find peer MGC10.37.248.196@o2ib1_0! |
| Comments |
| Comment by James A Simmons [ 14/Sep/17 ] |
|
I attached full debug logs from the MDS/MGS. |
| Comment by Peter Jones [ 14/Sep/17 ] |
|
Dropping severity because I take it that this is not a production service interruption |
| Comment by James A Simmons [ 14/Sep/17 ] |
|
Just our test bed is busted |
| Comment by Peter Jones [ 14/Sep/17 ] |
|
James Could you please elaborate as to the exact commit you are seeing this with and the last commit that you did not see this problem? Peter |
| Comment by John Hammond [ 14/Sep/17 ] |
|
James, could you be more specific about the Lustre version here? (Saying 'latest' does not age well.) Also, can you give a recent Lustre version where you didn't see this issue? Does this still happen if you set lnet_peer_discovery_disabled=1 in the lnet module parameters? |
| Comment by Amir Shehata (Inactive) [ 15/Sep/17 ] |
|
James, can you also try: https://review.whamcloud.com/#/c/29007/ |
| Comment by James A Simmons [ 15/Sep/17 ] |
|
I did, and it still fails to bring up the file system. |
| Comment by Amir Shehata (Inactive) [ 15/Sep/17 ] |
|
in the dump-mds.log I see: Is that issue resolved? Can you also turn on net and neterror when you're mounting and failing, and attach the output. |
| Comment by James A Simmons [ 15/Sep/17 ] |
|
That is due to patch |
| Comment by James A Simmons [ 19/Sep/17 ] |
|
With the revert of |
| Comment by James A Simmons [ 19/Sep/17 ] |
|
Ah, I found what caused this issue. I have a reproducer. It's very simple: add routes that are all down to your YAML config file and you will not be able to mount your file system. We have Cray routers and they have been down recently. So add to your YAML config file something like this: route:
And you will see problems. |
| Comment by Amir Shehata (Inactive) [ 20/Sep/17 ] |
|
I couldn't reproduce this on my local setup. I'm assuming that the downed gateways are not needed for the FS to be mounted, correct? I.e., do you need those routes for communication? If that's not the case, can you please enable net and neterror and dump the logs after the problem happens so I can take a look:
lctl set_param debug=+net
lctl set_param debug=+neterror
lctl dk > log |
| Comment by Alexey Lyashkov [ 20/Sep/17 ] |
|
James, can you confirm - |
| Comment by James A Simmons [ 21/Sep/17 ] |
|
I have been testing without |
| Comment by Alexey Lyashkov [ 21/Sep/17 ] |
|
James, what HW do you use for testing? If it is MLX5, these patches do nothing for you. MLX5 uses a FastReg-only model, while MLX4 supports both FastReg and FMR. I don't have access to IB HW for now, so I can only check with VMs + M-OFED 4.1+ Soft IB. |
| Comment by James A Simmons [ 25/Sep/17 ] |
|
Sorry, I was having a hard time reproducing this problem. It's not the route configuration that breaks lnet but the numa node setting in my lnet.conf. I ended up removing the numa stuff from my config file. If you add numa: to your lnet YAML config file you will see this breakage. |
| Comment by Amir Shehata (Inactive) [ 25/Sep/17 ] |
|
I believe the numa range defaults to 0. When you remove it and you do "lnetctl numa show" do you see a different value for the range? |
| Comment by James A Simmons [ 25/Sep/17 ] |
|
[root@ninja34 ~]# lnetctl global show
Does the numa_range have to be "under" global: in the YAML config file? |
| Comment by Amir Shehata (Inactive) [ 25/Sep/17 ] |
|
So the output is a little strange. It looks like you're using the latest master, but the configuration you're feeding in seems to have been generated from 2.10. numa should be fed in under the global block, as shown in the global show output you pasted above. Now thinking about it, this is a backwards compatibility issue since 2.10 is already out. I'll have to make lnetctl handle the older configuration as well for master.

Do you see a "call back for 'numa' not found" error when you configure with the numa block? I think I know what the problem is. When the parser encounters a problem in the YAML file, it quits and stops configuring the rest of the items. So it could be that when it hits this error it doesn't finish the configuration, leading to the problem you're seeing.

Are you calling "lnetctl import" from a script? If so, I think you should be checking whether the command succeeds or fails. If it fails, you should assume that the node is not configured properly. Can you verify if my theory is correct? |
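The advice about checking whether the import succeeded could be sketched as a small shell wrapper. This is only an illustration: `configure_lnet` and the `LNETCTL` override variable are hypothetical names, and the sketch assumes that `lnetctl import` exits with a non-zero status when the YAML parser gives up partway through.

```shell
#!/bin/sh
# Sketch only: abort node bring-up when the LNet YAML import fails.
# LNETCTL and configure_lnet are hypothetical names for this example.
# Assumption: "lnetctl import" returns non-zero if the parser hits an error.
LNETCTL="${LNETCTL:-lnetctl}"

configure_lnet() {
    conf="$1"
    if ! "$LNETCTL" import "$conf"; then
        # A failed import may leave the node only partially configured.
        echo "LNet import of $conf failed; treat this node as unconfigured" >&2
        return 1
    fi
    return 0
}
```

A startup script might then run `configure_lnet /etc/lnet.conf || exit 1` (the path is an example) before attempting any mount, so that a half-applied configuration is not mistaken for a working one.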
| Comment by James A Simmons [ 26/Sep/17 ] |
|
Yes, I do see a "call back for 'numa' not found" error when I start up. Also, I was using the lnet.conf from my 2.10 setup. I'm running lnetctl import from the command line, not from a script. |
| Comment by Amir Shehata (Inactive) [ 26/Sep/17 ] |
|
OK. I'll make a change to handle the "numa" entry in the YAML file so that master is backwards compatible with 2.10. In the meantime, if there is an error configuring, you should assume that the configuration is not complete and that the node is not really usable. |
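On a build without that change, a pre-flight check along these lines could flag a config file that still carries `numa:` as a top-level key, which is the case that triggered the parser error in this ticket. The function name `has_legacy_numa_block` is made up for this sketch, and the check is a plain text match rather than a real YAML parse.

```shell
#!/bin/sh
# Sketch only: detect a 2.10-style top-level "numa:" block in an LNet config.
# has_legacy_numa_block is a hypothetical helper, not part of lnetctl.
has_legacy_numa_block() {
    # Match "numa:" only at the very start of a line (no indentation),
    # i.e. a top-level YAML key. Keys nested under "global:" are indented
    # and will not match.
    grep -q '^numa:' "$1"
}
```

For example, `has_legacy_numa_block /etc/lnet.conf && echo "move the numa setting under global:"` would warn before the import is attempted (the path is an example).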
| Comment by Amir Shehata (Inactive) [ 05/Oct/17 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29333 |
| Comment by Gerrit Updater [ 24/Oct/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29333/ |
| Comment by Peter Jones [ 24/Oct/17 ] |
|
Landed for 2.11 |
| Comment by Minh Diep [ 01/Nov/17 ] |
|
ashehata said we don't need this for LTS |