[LU-17353] cfs_cpu_dead() Lustre: can't support CPU plug-out well now Created: 10/Dec/23 Updated: 10/Dec/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||
| Description |
|
The logs occasionally show CPU cores being deactivated on the system:

[ 200.214057] LNet: 5394:0:(libcfs_cpu.c:1133:cfs_cpu_dead()) Lustre: can't support CPU plug-out well now, performance and stability could be impacted [CPU 32]
[ 200.231635] LNet: 5394:0:(libcfs_cpu.c:1133:cfs_cpu_dead()) Lustre: can't support CPU plug-out well now, performance and stability could be impacted [CPU 33]
[ 200.249606] LNet: 5394:0:(libcfs_cpu.c:1133:cfs_cpu_dead()) Lustre: can't support CPU plug-out well now, performance and stability could be impacted [CPU 34]

I suspect this is mainly a client issue, but it could eventually be hit on servers in a cloud environment. It would be good to handle this situation better than just printing an error message. In particular, if an entire CPT is removed, stop the ptlrpcd threads running on those cores so they don't continue to burn cycles. Also, recompute the CPT count. |
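A minimal user-space sketch of the bookkeeping described above, for discussion only. It assumes at most 64 CPUs tracked in a bitmask; `struct cpt_sketch` and the `cpt_sketch_*` helpers are hypothetical names, not the libcfs API, and nothing here reflects how `cfs_cpu_dead()` is currently implemented. The offline handler reports when a partition has lost its last online CPU (the point at which its ptlrpcd threads could be stopped), and a second helper recomputes the number of partitions that still have a CPU.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_CPTS 8

/* Hypothetical stand-in for a CPU partition table; not the libcfs structure. */
struct cpt_sketch {
	int      ncpts;                     /* partitions configured at setup */
	uint64_t cpt_mask[SKETCH_MAX_CPTS]; /* CPUs assigned to each partition */
	uint64_t online;                    /* CPUs currently online */
};

/*
 * Mark one CPU offline. Returns the index of the partition that just lost
 * its last online CPU (a caller could stop that partition's threads), or
 * -1 if every affected partition still has at least one CPU.
 */
static int cpt_sketch_cpu_dead(struct cpt_sketch *t, unsigned int cpu)
{
	uint64_t bit;
	int i;

	assert(cpu < 64);
	bit = UINT64_C(1) << cpu;
	if (!(t->online & bit))
		return -1; /* already offline, nothing changes */
	t->online &= ~bit;

	for (i = 0; i < t->ncpts; i++) {
		if ((t->cpt_mask[i] & bit) && !(t->cpt_mask[i] & t->online))
			return i;
	}
	return -1;
}

/* Recompute how many partitions still have at least one online CPU. */
static int cpt_sketch_live_count(const struct cpt_sketch *t)
{
	int i, live = 0;

	for (i = 0; i < t->ncpts; i++) {
		if (t->cpt_mask[i] & t->online)
			live++;
	}
	return live;
}
```

With two partitions covering CPUs 0-1 and 2-3, taking CPU 2 offline leaves both partitions live; taking CPU 3 offline as well empties the second partition and drops the live count to one.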
| Comments |
| Comment by Andreas Dilger [ 10/Dec/23 ] |
|
The important question is when the Lustre filesystem is mounted relative to when the cores are removed, and whether the removed cores were initially available at mount time. No message is printed when a new core is added, since that does not affect Lustre at all, and removing those previously added cores is not complex (the console error message could simply be skipped when no Lustre threads are running there). However, removing cores that were previously in use by Lustre/LNet is much more difficult. It would also be much more complex if the cores are not added/removed in numerical order (e.g. core 31 is added, but core 0 is removed). |
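One way to picture the distinction drawn in this comment, as a hedged user-space sketch rather than a proposed patch: keep one mask of CPUs that Lustre/LNet was using at setup time and a separate mask of CPUs currently online. `struct hotplug_sketch` and its helpers are hypothetical names, and the 64-CPU bitmask is an assumption made for brevity. A CPU that comes online later never enters the in-use mask, so removing it again needs no warning; because the masks are indexed by CPU number, the order in which cores come and go does not matter for this check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for illustration; not part of libcfs. */
struct hotplug_sketch {
	uint64_t in_use; /* CPUs Lustre/LNet threads were bound to at setup */
	uint64_t online; /* CPUs currently online */
};

/* A newly added CPU becomes online but is not used by existing threads. */
static void hotplug_sketch_cpu_online(struct hotplug_sketch *h,
				      unsigned int cpu)
{
	assert(cpu < 64);
	h->online |= UINT64_C(1) << cpu; /* in_use is deliberately unchanged */
}

/*
 * Mark a CPU offline. Returns true only if the CPU was in use at setup,
 * i.e. the case where a console warning (or real handling) is warranted.
 */
static bool hotplug_sketch_cpu_dead(struct hotplug_sketch *h,
				    unsigned int cpu)
{
	uint64_t bit;

	assert(cpu < 64);
	bit = UINT64_C(1) << cpu;
	h->online &= ~bit;
	return (h->in_use & bit) != 0;
}
```

In the example from the comment, if only core 0 was present at mount, adding and then removing core 31 returns false (no warning needed), while removing core 0 returns true.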