[LU-2936] nrs_svcpt2nrs()) ASSERTION( (!(hp) || (nrs_svcpt_has_hp(svcpt))) ) failed Created: 09/Mar/13 Updated: 13/Mar/13 Resolved: 13/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Severity: | 3 |
| Rank (Obsolete): | 7054 |
| Description |
|
ORNL did a full-scale system test today, and one of their OSSes crashed with the above assertion.

It seems we never saw it before because we don't seriously test the /proc/fs/lustre/health_check functionality in our testing, but it is actually heavily used by a lot of sites.

I was able to reproduce the issue with racer while running this line in parallel:

while :; do cat /proc/fs/lustre/health_check ; done

[305098.783912] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) ASSERTION( (!(hp) || (nrs_svcpt_has_hp(svcpt))) ) failed:
[305098.784415] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) LBUG
[305098.784682] Pid: 1075, comm: cat
[305098.784881]
[305098.784881] Call Trace:
[305098.785248] [<ffffffffa07b7915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[305098.785524] [<ffffffffa07b7f17>] lbug_with_loc+0x47/0xb0 [libcfs]
[305098.785810] [<ffffffffa11b0e22>] ptlrpc_nrs_req_poll_nolock+0xc2/0x1c0 [ptlrpc]
[305098.786239] [<ffffffffa1170696>] ptlrpc_svcpt_health_check+0x56/0x180 [ptlrpc]
[305098.786664] [<ffffffffa1170812>] ptlrpc_service_health_check+0x52/0x70 [ptlrpc]
[305098.787079] [<ffffffffa05d61fd>] ost_health_check+0x4d/0x90 [ost]
[305098.787345] [<ffffffffa0e4c8e7>] obd_proc_read_health+0x2a7/0x3b0 [obdclass]
[305098.792327] [<ffffffffa0e6f36c>] lprocfs_fops_read+0xec/0x1f0 [obdclass]
[305098.793699] [<ffffffffa0e6f280>] ? lprocfs_fops_read+0x0/0x1f0 [obdclass]
[305098.793964] [<ffffffff811e1cc5>] proc_reg_read+0x85/0xc0
[305098.794200] [<ffffffff8117b9e5>] vfs_read+0xb5/0x1a0
[305098.794429] [<ffffffff8117bb21>] sys_read+0x51/0x90
[305098.794657] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[305098.794907]
[305098.992803] Kernel panic - not syncing: LBUG

Crashdump and modules are in /exports/crashdumps/192.168.10.210-2013-03-09-00\:44\:57

The problem was seemingly added along with the NRS code drop. |
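For anyone who wants to retry the reproducer without leaving an unbounded loop running, the following is a minimal sketch of a bounded variant of the `while :; do cat ...; done` line from the description. The function name `poll_health_check`, its two arguments, and the default read count are assumptions made for illustration; they are not part of Lustre or of the original report.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): read a health-check file a
# fixed number of times, instead of looping forever.
#   $1 = file to poll (assumed default: /proc/fs/lustre/health_check)
#   $2 = number of reads (assumed default: 1000)
poll_health_check() {
    path=${1:-/proc/fs/lustre/health_check}
    count=${2:-1000}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each read goes through the health-check path seen in the call trace.
        cat "$path" || return 1
        i=$((i + 1))
    done
}
```

On an affected server this would be started in the background while the racer workload runs, for example `poll_health_check /proc/fs/lustre/health_check 100000 > /dev/null &`. Whether a given read count is enough to hit the race is not something the report states.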
| Comments |
| Comment by Oleg Drokin [ 09/Mar/13 ] |
|
OK, I have now checked the dump and the situation is clear. As such, the possible fixes are: |
| Comment by Oleg Drokin [ 09/Mar/13 ] |
|
Patch at http://review.whamcloud.com/5665 |
| Comment by Nikitas Angelinas [ 10/Mar/13 ] |
|
As mentioned in Gerrit, this bug is also addressed in the NRS follow-up patch, but option '2' from the comment above, which you have used, is a better solution. Option '1' could perhaps be used to improve things in a future patch. |
| Comment by Jodi Levi (Inactive) [ 13/Mar/13 ] |
|
Patch landed to master. |