Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9904

LNetError: 16885:0:(peer.c:1786:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>)

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.11.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      New error seen during test on el6.9. Not seen before. Seen during lustre_rmmod every time.

      Suspect due to recent landings on master.

      Attachments

        Issue Links

          Activity

            [LU-9904] LNetError: 16885:0:(peer.c:1786:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>)

            I would suggest that IOC_LIBCFS_GET_NI error 22: Invalid argument is not a useful error to print if lctl list_nids failed, so that should be fixed up somehow.

            Secondly, I don't see where lctl list_nids is being called, or I'd suggest to avoid calling it if that doesn't make sense. The only places I see it are in sk_nodemap_setup(), which is only called if $SK_S2S is set, and in host_nids_address() which is only sanity.sh::test_217 and not during mount that I can see.

            adilger Andreas Dilger added a comment - I would suggest that IOC_LIBCFS_GET_NI error 22: Invalid argument is not a useful error to print if lctl list_nids failed, so that should be fixed up somehow. Secondly, I don't see where lctl list_nids is being called, or I'd suggest to avoid calling it if that doesn't make sense. The only places I see it are in sk_nodemap_setup() , which is only called if $SK_S2S is set, and in host_nids_address() which is only sanity.sh::test_217 and not during mount that I can see.

            Andreas, the reason for
            IOC_LIBCFS_GET_NI error 22: Invalid argument
            is because an lctl list_nids is being called before the modules are loaded

            testvm.centos7: executing check_logdir /tmp/test_logs/1505518717
            Logging to shared log directory: /tmp/test_logs/1505518717
            testvm.centos7: executing yml_node
            IOC_LIBCFS_GET_NI error 22: Invalid argument <----- call to lctl list_nids
            Client: Lustre version: 2.10.52_98_g8e75219_dirty
            MDS: Lustre version: 2.10.52_98_g8e75219_dirty
            OSS: Lustre version: 2.10.52_98_g8e75219_dirty
            Stopping clients: testvm.centos7 /mnt/lustre (opts:)
            Stopping clients: testvm.centos7 /mnt/lustre2 (opts:)
            Loading modules from /usr/lib64/lustre <--- modules (including lnet) are loaded
            
            ashehata Amir Shehata (Inactive) added a comment - Andreas, the reason for IOC_LIBCFS_GET_NI error 22: Invalid argument is because an lctl list_nids is being called before the modules are loaded testvm.centos7: executing check_logdir /tmp/test_logs/1505518717 Logging to shared log directory: /tmp/test_logs/1505518717 testvm.centos7: executing yml_node IOC_LIBCFS_GET_NI error 22: Invalid argument <----- call to lctl list_nids Client: Lustre version: 2.10.52_98_g8e75219_dirty MDS: Lustre version: 2.10.52_98_g8e75219_dirty OSS: Lustre version: 2.10.52_98_g8e75219_dirty Stopping clients: testvm.centos7 /mnt/lustre (opts:) Stopping clients: testvm.centos7 /mnt/lustre2 (opts:) Loading modules from /usr/lib64/lustre <--- modules (including lnet) are loaded

            Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29026
            Subject: LU-9904 lnet: reduce logging severity
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0732d936aa07e8ee129829e738b82f10189cd7ce

            gerrit Gerrit Updater added a comment - Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29026 Subject: LU-9904 lnet: reduce logging severity Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 0732d936aa07e8ee129829e738b82f10189cd7ce

            I've started hitting this after a recent update as well. We definitely shouldn't be printing "Error" on the console for something that happens during normal operation. I'm also seeing a new error during mount:

            # sh sanity.sh
            client: executing check_logdir /tmp/test_logs/1505452045
            client: ../libcfs/libcfs/libcfs options: 'libcfs_panic_on_lbug=0'
            Logging to shared log directory: /tmp/test_logs/1505452045
            client: executing yml_node
            IOC_LIBCFS_GET_NI error 22: Invalid argument
            

            The mount itself continues fine, but as with the other error, we shouldn't be printing errors to the console/terminal during normal operations, as that tends to confuse users.

            adilger Andreas Dilger added a comment - I've started hitting this after a recent update as well. We definitely shouldn't be printing "Error" on the console for something that happens during normal operation. I'm also seeing a new error during mount: # sh sanity.sh client: executing check_logdir /tmp/test_logs/1505452045 client: ../libcfs/libcfs/libcfs options: 'libcfs_panic_on_lbug=0' Logging to shared log directory: /tmp/test_logs/1505452045 client: executing yml_node IOC_LIBCFS_GET_NI error 22: Invalid argument The mount itself continues fine, but as with the other error, we shouldn't be printing errors to the console/terminal during normal operations, as that tends to confuse users.

            found a set of steps to reproduce without shutdown:

            peer 2:
            modprobe lnet
            lnetctl lnet configure
            lnetctl net add --net tcp --if eth0,eth1
            
            peer1
            modprobe lnet
            lnetctl lnet configure
            lnetctl net add --net tcp --if eth0,eth1
            lnetctl discover 192.168.122.30@tcp # discover peer2
            # in /var/log/messages
            # Aug 24 17:49:54 MRtest01 kernel: LNetError: 3447:0:(peer.c:1786:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>)
            
            ashehata Amir Shehata (Inactive) added a comment - found a set of steps to reproduce without shutdown: peer 2: modprobe lnet lnetctl lnet configure lnetctl net add --net tcp -- if eth0,eth1 peer1 modprobe lnet lnetctl lnet configure lnetctl net add --net tcp -- if eth0,eth1 lnetctl discover 192.168.122.30@tcp # discover peer2 # in / var /log/messages # Aug 24 17:49:54 MRtest01 kernel: LNetError: 3447:0:(peer.c:1786:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>)

            sure. I'll push in a patch.

            ashehata Amir Shehata (Inactive) added a comment - sure. I'll push in a patch.

            Your explanation sounds quite plausible. I think it would be a very good idea to suppress such errors during shutdown. As it is it looks quite alarming and happens at least once on every shutdown, at least for me.

            bogl Bob Glossman (Inactive) added a comment - Your explanation sounds quite plausible. I think it would be a very good idea to suppress such errors during shutdown. As it is it looks quite alarming and happens at least once on every shutdown, at least for me.

            This could occur when there is a shutdown. Basically, there is a new mechanism now where events are pushed to the peers whenever there is a local network interface change. During shutdown, everything is being removed. So that's detected as a change. and some pushes might make it out. When an event is received for that push message the peer is looked up but because of the shutdown it would've been removed, and that error is printed out. We can probably suppress this error when LNet is shutting down. There are no other negative consequences indicated by this message.

            ashehata Amir Shehata (Inactive) added a comment - This could occur when there is a shutdown. Basically, there is a new mechanism now where events are pushed to the peers whenever there is a local network interface change. During shutdown, everything is being removed. So that's detected as a change. and some pushes might make it out. When an event is received for that push message the peer is looked up but because of the shutdown it would've been removed, and that error is printed out. We can probably suppress this error when LNet is shutting down. There are no other negative consequences indicated by this message.

            I'll investigate.

            ashehata Amir Shehata (Inactive) added a comment - I'll investigate.

            People

              ashehata Amir Shehata (Inactive)
              bogl Bob Glossman (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: