[LU-8223] De-Noise LNet neterr logs so they can be ON by default Created: 31/May/16  Updated: 20/Jun/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Doug Oucharek (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: lnet

Issue Links:
Duplicate
is duplicated by LU-2418 Add Way to Detect Dropped Packets on ... Open
Related
is related to LU-8980 Add tracepoint support to Lustre Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LNet's neterr logs are turned off by default. I have been told this is because they are very noisy. Logically, if these messages are logged very frequently, they are not really errors but normal operation. If they were errors, we should be fixing them.

The big problem here is that when a networking issue occurs in the field, we have little to nothing in the logs to go on. Debugging requires that the problem be easy to reproduce with neterr turned on (not usually the case for production errors); otherwise it becomes a discipline of the mind (i.e. guesswork).

This ticket is for cleaning up the neterr logs so that they report only true errors and can be on by default.



 Comments   
Comment by Doug Oucharek (Inactive) [ 01/Jun/16 ]

I actually consider this a bug and not an improvement. The neterr logs should never have gotten into this unusable state in the first place. Outside of development, they are useless because they are off.

Comment by Gerrit Updater [ 13/Jun/16 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/20769
Subject: LU-8223 lnet: Fix use of NETERR logging
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 155086c0edeb1b938994f1aba7c243ad4d57133d

Comment by James A Simmons [ 05/Jan/17 ]

Do you need to backport this to earlier Lustre versions, or is this a Lustre 2.10 thing? The reason I ask is that the Lustre debugging code is being migrated to tracepoints. With tracepoints this can be addressed, but it wouldn't be backportable.

Comment by Doug Oucharek (Inactive) [ 05/Jan/17 ]

This is really just a 2.10 thing.  I don't expect anyone will want this backported.

When will the migration to tracepoint be taking place?

Comment by James A Simmons [ 05/Jan/17 ]

I already started the tracepoint work. See LU-8980.

Comment by Doug Oucharek (Inactive) [ 06/Jan/17 ]

Ok, I can delay this patch until LU-8980 lands and then update it with any additional changes required.

Generated at Sat Feb 10 02:15:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.