Lustre: LU-9990

MDS fails to mount due to (client.c:96:ptlrpc_uuid_to_connection()) cannot find peer MGC10.37.248.196@o2ib1_0!

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.11.0
    • Labels: None
    • Environment: Latest lustre 2.10.5X running on RHEL7.4 with default OFED. Using IB for LND.
    • Severity: 2
    • 9223372036854775807

    Description

      Recently I started to run into issues with the MDT failing to mount randomly. Now with the latest master the MDT fails to mount every single time. Looking at the debug log I noticed the following error on the MDT:

      (client.c:96:ptlrpc_uuid_to_connection()) cannot find peer MGC10.37.248.196@o2ib1_0!

      Attachments

        1. dump-mds.log
          20 kB
        2. dump-mgs.log
          1.70 MB


          Activity

            mdiep Minh Diep added a comment -

            ashehata said we don't need this for LTS

            pjones Peter Jones added a comment -

            Landed for 2.11


gerrit Gerrit Updater added a comment -

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29333/
Subject: LU-9990 lnet: add backwards compatibility for YAML config
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3187d551d538bd8203c7156daaa617620c6569ab
ashehata Amir Shehata (Inactive) added a comment (edited) -

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29333
Subject: LU-9990 lnet: add backwards compatibility for YAML config
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e72cdc1373dfb930eccbc5d9afba215d8368b331

ashehata Amir Shehata (Inactive) added a comment -

OK. I'll make a change to handle the "numa" entry in the YAML file, so that master is backwards compatible with 2.10. In the meantime, if there is an error during configuration, you should assume that the configuration is incomplete and the node is not really usable.

simmonsja James A Simmons added a comment -

Yes, I do see a "call back for 'numa' not found" error when I start up. Also, I was using the lnet.conf from my 2.10 setup. I'm running the lnetctl import from the command line, not a script.

ashehata Amir Shehata (Inactive) added a comment -

So the output is a little strange. It looks like you're using the latest master, but the configuration you're feeding in seems to have been generated from 2.10. numa should be fed in under the global block, as shown in the "global show" output you pasted above. Thinking about it, this is a backwards compatibility issue, since 2.10 is already out. I'll have to make lnetctl handle the older configuration as well for master.

Do you see a "call back for 'numa' not found" error when you configure with the numa block?

I think I know what the problem is. When the parser encounters a problem in the YAML file, it quits and simply stops configuring the rest of the items. So it could be that when it hits this error it doesn't finish the configuration, leading to the problem you're seeing.

Are you calling "lnetctl import" from a script? If so, I think you should be checking whether the command succeeds or fails. If it fails, you should assume that the node is not configured properly.

Can you verify whether my theory is correct?
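The advice above about checking the result of "lnetctl import" can be sketched in shell. This is an illustrative wrapper, not code from the ticket; the function name and the config path passed to it are assumptions:

```shell
# Hypothetical helper: refuse to continue node bring-up when
# "lnetctl import" fails, since a failed import can leave LNet
# only partially configured.
import_lnet_config() {
    conf="$1"
    if ! lnetctl import "$conf"; then
        echo "ERROR: LNet config from $conf did not fully apply; node not usable" >&2
        return 1
    fi
}
```

A bring-up script would call something like import_lnet_config /etc/lnet.conf and abort (or at least skip mounting targets) on a non-zero return, rather than assuming the node is configured.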
simmonsja James A Simmons added a comment (edited) -

[root@ninja34 ~]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1

Does the numa_range have to be "under" global: in the YAML config file?
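To make the two layouts being discussed concrete (a sketch based on this thread, not taken from the ticket's attachments): 2.10 generated a top-level numa block, while on master the same setting lives under global as numa_range, matching the "global show" output above.

```yaml
# 2.10-style layout (top-level block; rejected by master's parser
# before the LU-9990 compatibility patch)
numa:
    range: 0

# master/2.11-style layout (nested under global)
global:
    numa_range: 0
```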

ashehata Amir Shehata (Inactive) added a comment -

I believe the numa range defaults to 0. When you remove it and you do "lnetctl numa show", do you see a different value for the range?

simmonsja James A Simmons added a comment -

Sorry, I was having a hard time reproducing this problem. It's not the route configuration that breaks lnet but the numa node setting in my lnet.conf. I ended up removing the numa stuff from my config file. If you add

numa:
    range: 0

to your lnet YAML config file, you will see this breakage.
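The breakage described in this comment can be reproduced roughly as follows. This is a sketch: the temp-file path is arbitrary, and the || guard only keeps the snippet from aborting on machines where lnetctl is unavailable or the import fails as described:

```shell
# Write a minimal 2.10-style config containing only the numa block...
cat > /tmp/lnet-test.conf <<'EOF'
numa:
    range: 0
EOF
# ...and import it on a master (pre-fix 2.11) node: the parser rejects
# the unknown top-level "numa" key and stops configuring the rest.
lnetctl import /tmp/lnet-test.conf || echo "lnetctl import failed"
```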

People

    Assignee: ashehata Amir Shehata (Inactive)
    Reporter: simmonsja James A Simmons
    Votes: 0
    Watchers: 12
