Details
- Bug
- Resolution: Fixed
- Blocker
- Lustre 2.4.0
- 3
- 7054
Description
ORNL did a full scale system test today and one of their OSSes crashed with the above assertion.
We apparently never hit this before because our testing does not seriously exercise /proc/fs/lustre/health_check, even though many sites use it heavily.
I was able to reproduce the issue with racer while running this line in parallel:
while :; do cat /proc/fs/lustre/health_check ; done
[305098.783912] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) ASSERTION( (!(hp) || (nrs_svcpt_has_hp(svcpt))) ) failed:
[305098.784415] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) LBUG
[305098.784682] Pid: 1075, comm: cat
[305098.784881]
[305098.784881] Call Trace:
[305098.785248] [<ffffffffa07b7915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[305098.785524] [<ffffffffa07b7f17>] lbug_with_loc+0x47/0xb0 [libcfs]
[305098.785810] [<ffffffffa11b0e22>] ptlrpc_nrs_req_poll_nolock+0xc2/0x1c0 [ptlrpc]
[305098.786239] [<ffffffffa1170696>] ptlrpc_svcpt_health_check+0x56/0x180 [ptlrpc]
[305098.786664] [<ffffffffa1170812>] ptlrpc_service_health_check+0x52/0x70 [ptlrpc]
[305098.787079] [<ffffffffa05d61fd>] ost_health_check+0x4d/0x90 [ost]
[305098.787345] [<ffffffffa0e4c8e7>] obd_proc_read_health+0x2a7/0x3b0 [obdclass]
[305098.792327] [<ffffffffa0e6f36c>] lprocfs_fops_read+0xec/0x1f0 [obdclass]
[305098.793699] [<ffffffffa0e6f280>] ? lprocfs_fops_read+0x0/0x1f0 [obdclass]
[305098.793964] [<ffffffff811e1cc5>] proc_reg_read+0x85/0xc0
[305098.794200] [<ffffffff8117b9e5>] vfs_read+0xb5/0x1a0
[305098.794429] [<ffffffff8117bb21>] sys_read+0x51/0x90
[305098.794657] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[305098.794907]
[305098.992803] Kernel panic - not syncing: LBUG
Crashdump and modules are in /exports/crashdumps/192.168.10.210-2013-03-09-00\:44\:57
The problem was seemingly introduced with the NRS code drop.
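For anyone wanting a bounded variant of the reproducer loop above (so it can be dropped into a test script rather than run until the node panics), here is a minimal sketch. The function name `poll_health` and its two arguments are hypothetical, not part of Lustre or its test suite; it just reads the given file a fixed number of times and stops on the first read error. It still needs racer (or similar load) running in parallel to hit the race.

```shell
#!/bin/sh
# poll_health [FILE] [COUNT]  (hypothetical helper, not part of Lustre)
# Read FILE COUNT times in a row; stop with status 1 on the first failed read.
# FILE defaults to the health_check entry named in this ticket.
poll_health() {
    file=${1:-/proc/fs/lustre/health_check}
    count=${2:-1000}
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$file" || return 1
        i=$((i + 1))
    done
}
```

Usage would be something like `poll_health /proc/fs/lustre/health_check 100000` on the OSS while racer runs. Passing the path as an argument also makes it possible to point the loop at an ordinary file when checking the script itself.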
Issue Links
- is related to LU-398 NRS (Network Request Scheduler) (Resolved)