[LU-9904] LNetError: 16885:0:(peer.c:1786:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>) Created: 22/Aug/17  Updated: 30/Sep/17  Resolved: 30/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: Bob Glossman (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9480 LNet Dynamic Discovery Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

New error seen during test on el6.9. Not seen before. Seen during lustre_rmmod every time.

Suspect due to recent landings on master.



 Comments   
Comment by Amir Shehata (Inactive) [ 22/Aug/17 ]

I'll investigate.

Comment by Amir Shehata (Inactive) [ 24/Aug/17 ]

This could occur when there is a shutdown. Basically, there is a new mechanism now where events are pushed to the peers whenever there is a local network interface change. During shutdown, everything is being removed. So that's detected as a change. and some pushes might make it out. When an event is received for that push message the peer is looked up but because of the shutdown it would've been removed, and that error is printed out. We can probably suppress this error when LNet is shutting down. There are no other negative consequences indicated by this message.

Comment by Bob Glossman (Inactive) [ 24/Aug/17 ]

Your explanation sounds quite plausible. I think it would be a very good idea to suppress such errors during shutdown. As it is it looks quite alarming and happens at least once on every shutdown, at least for me.

Comment by Amir Shehata (Inactive) [ 24/Aug/17 ]

sure. I'll push in a patch.

Comment by Amir Shehata (Inactive) [ 25/Aug/17 ]

found a set of steps to reproduce without shutdown:

peer 2:
modprobe lnet
lnetctl lnet configure
lnetctl net add --net tcp --if eth0,eth1

peer1
modprobe lnet
lnetctl lnet configure
lnetctl net add --net tcp --if eth0,eth1
lnetctl discover 192.168.122.30@tcp # discover peer2
# in /var/log/messages
# Aug 24 17:49:54 MRtest01 kernel: LNetError: 3447:0:(peer.c:1786:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>)
Comment by Andreas Dilger [ 15/Sep/17 ]

I've started hitting this after a recent update as well. We definitely shouldn't be printing "Error" on the console for something that happens during normal operation. I'm also seeing a new error during mount:

# sh sanity.sh
client: executing check_logdir /tmp/test_logs/1505452045
client: ../libcfs/libcfs/libcfs options: 'libcfs_panic_on_lbug=0'
Logging to shared log directory: /tmp/test_logs/1505452045
client: executing yml_node
IOC_LIBCFS_GET_NI error 22: Invalid argument

The mount itself continues fine, but as with the other error, we shouldn't be printing errors to the console/terminal during normal operations, as that tends to confuse users.

Comment by Gerrit Updater [ 15/Sep/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29026
Subject: LU-9904 lnet: reduce logging severity
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0732d936aa07e8ee129829e738b82f10189cd7ce

Comment by Amir Shehata (Inactive) [ 15/Sep/17 ]

Andreas, the reason for
IOC_LIBCFS_GET_NI error 22: Invalid argument
is because an lctl list_nids is being called before the modules are loaded

testvm.centos7: executing check_logdir /tmp/test_logs/1505518717
Logging to shared log directory: /tmp/test_logs/1505518717
testvm.centos7: executing yml_node
IOC_LIBCFS_GET_NI error 22: Invalid argument <----- call to lctl list_nids
Client: Lustre version: 2.10.52_98_g8e75219_dirty
MDS: Lustre version: 2.10.52_98_g8e75219_dirty
OSS: Lustre version: 2.10.52_98_g8e75219_dirty
Stopping clients: testvm.centos7 /mnt/lustre (opts:)
Stopping clients: testvm.centos7 /mnt/lustre2 (opts:)
Loading modules from /usr/lib64/lustre <--- modules (including lnet) are loaded
Comment by Andreas Dilger [ 16/Sep/17 ]

I would suggest that IOC_LIBCFS_GET_NI error 22: Invalid argument is not a useful error to print if lctl list_nids failed, so that should be fixed up somehow.

Secondly, I don't see where lctl list_nids is being called, or I'd suggest to avoid calling it if that doesn't make sense. The only places I see it are in sk_nodemap_setup(), which is only called if $SK_S2S is set, and in host_nids_address() which is only sanity.sh::test_217 and not during mount that I can see.

Comment by Amir Shehata (Inactive) [ 18/Sep/17 ]

yaml.sh:yml_node_info() gets called before the modules are loaded. in this function

for nw in $(lctl list_nids | grep -v @lo | cut -f 2 -d '@' | uniq); do

This doesn't look like it's something new. I'm not sure exactly how the test framework works to make a change there.

I would rather not remove the error when list_nids fails, because it's useful for debugging. prints out why it failed. I think it makes more sense not to call lctl list_nids from the test unless the modules are loaded first.

Comment by Andreas Dilger [ 21/Sep/17 ]

I'd be happy if you fixed the test script to not call list_nids before the modules are loaded.

Even so, the error message itself doesn't explain anything to the user who runs "lctl list_nids". I'm not suggesting the error should be removed, but rather that it should be improved to be actually useful to the user:

  • include lctl list_nids: at the start, so the reader knows which command is failing
  • expand the message for EINVAL to suggest that the lnet module is not loaded
Comment by Gerrit Updater [ 30/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29026/
Subject: LU-9904 lnet: reduce logging severity
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7fc8037d61b29fa2ac93ab5fb30fccd9b1c0066c

Comment by Peter Jones [ 30/Sep/17 ]

Landed for 2.11.

Generated at Sat Feb 10 02:30:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.