[LU-2936] nrs_svcpt2nrs()) ASSERTION( (!(hp) || (nrs_svcpt_has_hp(svcpt))) ) failed Created: 09/Mar/13  Updated: 13/Mar/13  Resolved: 13/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: HB

Issue Links:
Related
is related to LU-398 NRS (Network Request Scheduler ) Resolved
Severity: 3
Rank (Obsolete): 7054

 Description   

ORNL did a full scale system test today and one of their OSSes crashed with the above assertion.

It seems we never saw it before because our testing doesn't seriously exercise /proc/fs/lustre/health_check, but it's actually heavily used by a lot of sites.

I was able to reproduce the issue with racer while running this line in parallel:

while :; do cat /proc/fs/lustre/health_check ; done
[305098.783912] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) ASSERTION( (!(hp) || (nrs_svcpt_has_hp(svcpt))) ) failed: 
[305098.784415] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) LBUG
[305098.784682] Pid: 1075, comm: cat
[305098.784881] 
[305098.784881] Call Trace:
[305098.785248]  [<ffffffffa07b7915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[305098.785524]  [<ffffffffa07b7f17>] lbug_with_loc+0x47/0xb0 [libcfs]
[305098.785810]  [<ffffffffa11b0e22>] ptlrpc_nrs_req_poll_nolock+0xc2/0x1c0 [ptlrpc]
[305098.786239]  [<ffffffffa1170696>] ptlrpc_svcpt_health_check+0x56/0x180 [ptlrpc]
[305098.786664]  [<ffffffffa1170812>] ptlrpc_service_health_check+0x52/0x70 [ptlrpc]
[305098.787079]  [<ffffffffa05d61fd>] ost_health_check+0x4d/0x90 [ost]
[305098.787345]  [<ffffffffa0e4c8e7>] obd_proc_read_health+0x2a7/0x3b0 [obdclass]
[305098.792327]  [<ffffffffa0e6f36c>] lprocfs_fops_read+0xec/0x1f0 [obdclass]
[305098.793699]  [<ffffffffa0e6f280>] ? lprocfs_fops_read+0x0/0x1f0 [obdclass]
[305098.793964]  [<ffffffff811e1cc5>] proc_reg_read+0x85/0xc0
[305098.794200]  [<ffffffff8117b9e5>] vfs_read+0xb5/0x1a0
[305098.794429]  [<ffffffff8117bb21>] sys_read+0x51/0x90
[305098.794657]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[305098.794907] 
[305098.992803] Kernel panic - not syncing: LBUG

Crashdump and modules are in /exports/crashdumps/192.168.10.210-2013-03-09-00\:44\:57

The problem was seemingly added along with the NRS code drop.



 Comments   
Comment by Oleg Drokin [ 09/Mar/13 ]

Ok, I now checked the dump and the situation is clear.
ptlrpc_svcpt_health_check calls ptlrpc_nrs_req_poll_nolock twice for every service partition that has anything pending:
once with hp set to true and once with hp set to false.
If it happens to be called for a service that does not have hp ops registered (like ost_create), the assertion trips, because the underlying NRS code seems to assume the caller knows the service type and treats asking for an hp request on a service with no possible hp requests as a no-no. That is a bit strange, considering that it's perfectly OK to check whether a service has any hp requests pending, even for non-hp services.

As such, possible fixes are:
1. Remove the assertion and the requirement that the caller know about the underlying service when trying to fetch requests.
2. In ptlrpc_svcpt_health_check, check that an hp request is actually available before trying to fetch it.
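
A minimal sketch of option 2 as a standalone C model, not the actual patch. The struct and every model_* name are hypothetical stand-ins invented for illustration; only the asserted condition, (!(hp) || (nrs_svcpt_has_hp(svcpt))), is taken from the log above. The poll helper keeps the strict contract, and the health check guards the hp poll so the assertion cannot trip for a non-hp service.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a service partition; NOT the real ptlrpc struct. */
struct model_svcpt {
    bool has_hp;    /* service registered high-priority (hp) handling */
    int  nreqs_reg; /* pending regular requests */
    int  nreqs_hp;  /* pending hp requests */
};

/* Models the idea behind nrs_svcpt_has_hp(). */
static bool model_svcpt_has_hp(const struct model_svcpt *svcpt)
{
    return svcpt->has_hp;
}

/*
 * Models the strict polling contract: asking for an hp request on a
 * partition without hp support is treated as a caller bug.
 */
static bool model_req_poll(const struct model_svcpt *svcpt, bool hp)
{
    assert(!hp || model_svcpt_has_hp(svcpt));
    return hp ? svcpt->nreqs_hp > 0 : svcpt->nreqs_reg > 0;
}

/*
 * Option 2: only poll the hp queue when the partition has one.
 * Returns the number of queues (0..2) that have a pending request.
 */
static int model_health_check(const struct model_svcpt *svcpt)
{
    int pending = 0;

    if (model_svcpt_has_hp(svcpt) && model_req_poll(svcpt, true))
        pending++;
    if (model_req_poll(svcpt, false))
        pending++;

    return pending;
}
```

Calling model_health_check() on a partition with has_hp set to false only ever polls the regular queue, whereas an unguarded model_req_poll(svcpt, true) on the same partition would abort, which is the analogue of the LBUG above.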

Comment by Oleg Drokin [ 09/Mar/13 ]

patch in http://review.whamcloud.com/5665

Comment by Nikitas Angelinas [ 10/Mar/13 ]

As mentioned in Gerrit, this bug is addressed in the NRS follow-up patch as well, but option '2' from the comment above, which you have used, is a better solution. Maybe '1' could be used to improve things in a future patch.

Comment by Jodi Levi (Inactive) [ 13/Mar/13 ]

Patch landed to master.
