Change 531ef4d on b2_5_fe, the patch from LU-6060, appears to have introduced a bug that caused a major system outage during our upgrade to 2.5.4 today. The patch was based on 749dc54 from master.
The code makes an assumption that I find puzzling:
First of all, lr_hops == 1 is the default, and lnet has never required hops to be set even on multi-hop systems. You are explicitly introducing a constraint that never existed in the past, and you have failed to communicate that new constraint to your customers.
Adding this new constraint, while having avoid_asym_router_failure enabled by default and hops defaulting to 1, is a bug, plain and simple.
Hops and priority were invented for fine-grained routing. They are not, and never have been, required to be set in multi-hop lnet routing situations. The Lustre manual even shows them as optional.
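To make the distinction concrete, here is a minimal sketch of the two forms of a route entry in a modprobe configuration file. The network names and NIDs are made up for illustration and are not taken from our system; the first form, with no hop count, is the one sites have always been able to use.

```
# Hypothetical example: network names and NIDs are placeholders.

# Route to o2ib0 through a router on tcp0, hop count omitted
# (hops therefore take the default value).
options lnet routes="o2ib0 10.0.0.1@tcp0"

# Same route with an explicit hop count of 2.
options lnet routes="o2ib0 2 10.0.0.1@tcp0"
```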
And speaking of the manual, the documented grammar for setting hops and priority is almost certainly wrong.
Since both hop and priority are numbers, I kind of doubt that there is any way to specify either one without the other. I imagine the grammar is really either this:
Finally, what happened with the review process? I see no reviews listed on the 531ef4d patch, and no reference to the patch on master from which this was backported. And there were additional changes in the backported patch not found in the original, so a review really should have been required.