[LU-9904] LNetError: 16885:0:(peer.c:1786:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>) Created: 22/Aug/17 Updated: 30/Sep/17 Resolved: 30/Sep/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Bob Glossman (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
New error seen during test on el6.9. Not seen before. Seen during lustre_rmmod every time. Suspect due to recent landings on master. |
| Comments |
| Comment by Amir Shehata (Inactive) [ 22/Aug/17 ] |
|
I'll investigate. |
| Comment by Amir Shehata (Inactive) [ 24/Aug/17 ] |
|
This could occur when there is a shutdown. Basically, there is a new mechanism now where events are pushed to the peers whenever there is a local network interface change. During shutdown, everything is being removed. So that's detected as a change. and some pushes might make it out. When an event is received for that push message the peer is looked up but because of the shutdown it would've been removed, and that error is printed out. We can probably suppress this error when LNet is shutting down. There are no other negative consequences indicated by this message. |
| Comment by Bob Glossman (Inactive) [ 24/Aug/17 ] |
|
Your explanation sounds quite plausible. I think it would be a very good idea to suppress such errors during shutdown. As it is it looks quite alarming and happens at least once on every shutdown, at least for me. |
| Comment by Amir Shehata (Inactive) [ 24/Aug/17 ] |
|
sure. I'll push in a patch. |
| Comment by Amir Shehata (Inactive) [ 25/Aug/17 ] |
|
found a set of steps to reproduce without shutdown: peer 2: modprobe lnet lnetctl lnet configure lnetctl net add --net tcp --if eth0,eth1 peer1 modprobe lnet lnetctl lnet configure lnetctl net add --net tcp --if eth0,eth1 lnetctl discover 192.168.122.30@tcp # discover peer2 # in /var/log/messages # Aug 24 17:49:54 MRtest01 kernel: LNetError: 3447:0:(peer.c:1786:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>) |
| Comment by Andreas Dilger [ 15/Sep/17 ] |
|
I've started hitting this after a recent update as well. We definitely shouldn't be printing "Error" on the console for something that happens during normal operation. I'm also seeing a new error during mount: # sh sanity.sh client: executing check_logdir /tmp/test_logs/1505452045 client: ../libcfs/libcfs/libcfs options: 'libcfs_panic_on_lbug=0' Logging to shared log directory: /tmp/test_logs/1505452045 client: executing yml_node IOC_LIBCFS_GET_NI error 22: Invalid argument The mount itself continues fine, but as with the other error, we shouldn't be printing errors to the console/terminal during normal operations, as that tends to confuse users. |
| Comment by Gerrit Updater [ 15/Sep/17 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29026 |
| Comment by Amir Shehata (Inactive) [ 15/Sep/17 ] |
|
Andreas, the reason for testvm.centos7: executing check_logdir /tmp/test_logs/1505518717 Logging to shared log directory: /tmp/test_logs/1505518717 testvm.centos7: executing yml_node IOC_LIBCFS_GET_NI error 22: Invalid argument <----- call to lctl list_nids Client: Lustre version: 2.10.52_98_g8e75219_dirty MDS: Lustre version: 2.10.52_98_g8e75219_dirty OSS: Lustre version: 2.10.52_98_g8e75219_dirty Stopping clients: testvm.centos7 /mnt/lustre (opts:) Stopping clients: testvm.centos7 /mnt/lustre2 (opts:) Loading modules from /usr/lib64/lustre <--- modules (including lnet) are loaded |
| Comment by Andreas Dilger [ 16/Sep/17 ] |
|
I would suggest that IOC_LIBCFS_GET_NI error 22: Invalid argument is not a useful error to print if lctl list_nids failed, so that should be fixed up somehow. Secondly, I don't see where lctl list_nids is being called, or I'd suggest to avoid calling it if that doesn't make sense. The only places I see it are in sk_nodemap_setup(), which is only called if $SK_S2S is set, and in host_nids_address() which is only sanity.sh::test_217 and not during mount that I can see. |
| Comment by Amir Shehata (Inactive) [ 18/Sep/17 ] |
|
yaml.sh:yml_node_info() gets called before the modules are loaded. in this function for nw in $(lctl list_nids | grep -v @lo | cut -f 2 -d '@' | uniq); do This doesn't look like it's something new. I'm not sure exactly how the test framework works to make a change there. I would rather not remove the error when list_nids fails, because it's useful for debugging. prints out why it failed. I think it makes more sense not to call lctl list_nids from the test unless the modules are loaded first. |
| Comment by Andreas Dilger [ 21/Sep/17 ] |
|
I'd be happy if you fixed the test script to not call list_nids before the modules are loaded. Even so, the error message itself doesn't explain anything to the user who runs "lctl list_nids". I'm not suggesting the error should be removed, but rather that it should be improved to be actually useful to the user:
|
| Comment by Gerrit Updater [ 30/Sep/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29026/ |
| Comment by Peter Jones [ 30/Sep/17 ] |
|
Landed for 2.11. |