[LU-12838] ptlrpc watchdog ratelimiting is broken Created: 08/Oct/19 Updated: 02/Apr/21 Resolved: 18/Oct/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andreas Dilger | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The ptlrpc thread ratelimiting added in patch https://review.whamcloud.com/33018 "LU-9859 libcfs: add watchdog for ptlrpc service threads" is broken. The kernel always prints:

[29352.393371] Lustre: mdt00_009: service thread pid 18935 was inactive for 72.167 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.

even though no stack trace has been printed before. This is visible in, e.g., sanityn test_104 timeouts on the MDS when testing.

It looks like the __ratelimit() return value is backward from what one would expect from normal English grammar: when "if (__ratelimit())" is true the action should NOT be ratelimited, and vice versa. Trivial patch to follow.

This should be included in 2.13.0, as it was broken in commit v2_12_50-83-gfc9de67 and would make debugging problems reported from the field significantly more complex than necessary. |
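The return-value convention described above can be illustrated with a small userspace model. This is only a sketch, not the Lustre or kernel code: `struct rl_state`, `rl_check()` and `watchdog_fire()` are hypothetical names invented here, and the only thing borrowed from the kernel is the convention that the check returns nonzero when the action may proceed and 0 when it should be suppressed. The window is assumed to start at time 0 and times are plain seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a ratelimit state (not the kernel struct). */
struct rl_state {
    long interval;  /* window length in seconds */
    int  burst;     /* events allowed per window */
    long begin;     /* start of the current window */
    int  printed;   /* events allowed so far in this window */
};

/*
 * Same return convention as the kernel's __ratelimit():
 * nonzero = go ahead (NOT ratelimited), 0 = suppress this event.
 */
static int rl_check(struct rl_state *rs, long now)
{
    assert(rs != NULL);
    if (now - rs->begin >= rs->interval) {
        /* Window expired: start a new one and reset the counter. */
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;
    }
    return 0;
}

/*
 * Hypothetical watchdog callback. With inverted == true it models the
 * broken check described in this ticket; with inverted == false it models
 * the intended behavior. Returns true if a stack trace would be dumped.
 */
static bool watchdog_fire(struct rl_state *rs, long now, bool inverted)
{
    int allowed = rl_check(rs, now);

    if (inverted)
        allowed = !allowed;

    if (allowed) {
        printf("t=%ld: dumping stack trace\n", now);
        return true;
    }
    printf("t=%ld: stack traces are limited to %d per %ld seconds, skipping this one\n",
           now, rs->burst, rs->interval);
    return false;
}
```

With `interval = 300` and `burst = 3`, the correct check dumps the first three traces in a window and skips the rest. The inverted check skips the first three and dumps afterwards, which matches the symptom of the "skipping this one" message appearing before any trace has been printed.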
| Comments |
| Comment by Gerrit Updater [ 08/Oct/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36409 |
| Comment by Gerrit Updater [ 18/Oct/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36409/ |
| Comment by Peter Jones [ 18/Oct/19 ] |
|
Landed for 2.13 |