[LU-1044] ASSERTION((req->rq_req_swab_mask & (1 << index)) == 0) failed Created: 26/Jan/12 Updated: 05/Mar/12 Resolved: 05/Mar/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
toss 2.0-beta12.ch5 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6476 |
| Description |
|
2012-01-26 10:35:09 LustreError: 15531:0:(lustre_net.h:761:lustre_set_req_swabbed()) ASSERTION((req->rq_req_swab_mask & (1 << index)) == 0) failed |
| Comments |
| Comment by Prakash Surya (Inactive) [ 26/Jan/12 ] |
|
Initially, it appears as though the swab bit in question is being modified between accesses in this code path:
if (ptlrpc_buf_need_swab(req, inout, offset)) {
        lustre_swab_ptlrpc_body(pb);
        ptlrpc_buf_set_swabbed(req, inout, offset);
}
`ptlrpc_buf_need_swab` will only return true if `req->rq_req_swab_mask & (1 << index)` is false. Thus, if it's calling `ptlrpc_buf_set_swabbed`, that must mean that `req->rq_req_swab_mask & (1 << index)` is false when it's executed within `ptlrpc_buf_need_swab`. But then when that same check is executed in `ptlrpc_buf_set_swabbed`, it's true (hence the LBUG). Where is the lock taken on `req` in this code path? The only way I can see this happening is if another thread swabs `req` in between the two accesses above. And that would imply improper lock handling on `req`. |
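The check-then-set reasoning above can be sketched in isolation. This is a minimal model, not the Lustre source: `struct fake_req`, `buf_need_swab`, `buf_set_swabbed`, and `unpack_buf` are hypothetical stand-ins, and the `needs_swab` flag is an assumption standing in for "the peer has the opposite byte order". In a single thread, the assertion in `buf_set_swabbed` cannot fire when it is only reached through the `buf_need_swab` guard, which is why a failure suggests something else set the bit in between.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the request; not the Lustre struct. */
struct fake_req {
    uint32_t swab_mask;  /* bit i set => buffer i has already been swabbed */
    int      needs_swab; /* assumption: nonzero if the peer's byte order differs */
};

/* True only if swabbing is required and buffer `index` is not yet swabbed. */
static int buf_need_swab(const struct fake_req *req, int index)
{
    return req->needs_swab && (req->swab_mask & (1u << index)) == 0;
}

/* Mirrors the failing assertion: the bit must still be clear when it is set. */
static void buf_set_swabbed(struct fake_req *req, int index)
{
    assert((req->swab_mask & (1u << index)) == 0);
    req->swab_mask |= 1u << index;
}

/* Guarded unpack: returns 1 if the buffer was swabbed by this call, else 0. */
static int unpack_buf(struct fake_req *req, int index)
{
    if (!buf_need_swab(req, index))
        return 0;
    /* the field's swabber would run here */
    buf_set_swabbed(req, index);
    return 1;
}
```

Calling `unpack_buf` twice on the same index is harmless in this model: the second call sees the bit set and returns 0 without reaching the assertion.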
| Comment by Prakash Surya (Inactive) [ 26/Jan/12 ] |
|
I was going down the wrong path earlier; I believe the real call to `ptlrpc_buf_set_swabbed` happened here:
        if (ptlrpc_buf_need_swab(pill->rc_req, inout, offset) &&
            swabber != NULL && value != NULL)
                do_swab = 1;
        else
                do_swab = 0;
        if (!(field->rmf_flags & RMF_F_STRUCT_ARRAY)) {
                if (dump && field->rmf_dumper) {
                        CDEBUG(D_RPCTRACE, "Dump of %sfield %s follows\n",
                               do_swab ? "unswabbed " : "", field->rmf_name);
                        field->rmf_dumper(value);
                }
                if (!do_swab)
                        return;
                swabber(value);
                ptlrpc_buf_set_swabbed(pill->rc_req, inout, offset);
                if (dump) {
                        CDEBUG(D_RPCTRACE, "Dump of swabbed field %s "
                               "follows\n", field->rmf_name);
                        field->rmf_dumper(value);
                }
                return;
        }
Although, I think the same logic regarding `req->rq_req_swab_mask` changing in between calls to `ptlrpc_buf_need_swab` and `ptlrpc_buf_set_swabbed` still stands. |
| Comment by Oleg Drokin [ 26/Jan/12 ] |
|
I did a quick search for ldlm_cancel_hpreq_check and I don't see it anywhere in our code, in either 2.x or 1.8.x. Also, did this crash happen on a client? What version was the client running? |
| Comment by Andreas Dilger [ 26/Jan/12 ] |
|
Also, is the client a PPC BlueGene system? |
| Comment by Prakash Surya (Inactive) [ 27/Jan/12 ] |
|
Oleg: The ldlm_cancel_hpreq_check was introduced by one of the
Andreas: I can't say for sure, but after talking with Chris and Ned, I think the consensus is that we do indeed still have a PPC BG machine out in our environment mounting this filesystem. I don't know of any other reason why swabbing would be necessary. Although, I'll check with one of the admins to make sure of this. |
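For context on why a big-endian PPC client matters here, the sketch below shows a plain 32-bit byte swap. `example_swab32` is a hypothetical helper written for illustration, not a Lustre or kernel function. A value written by a host of one byte order must be swapped exactly once to be read correctly on a host of the other; swapping it a second time restores the wrong (wire) order, which is presumably the kind of double processing the per-buffer swab mask is meant to prevent.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit value. */
static uint32_t example_swab32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}
```

Because the swap is its own inverse, applying it twice to the same field silently undoes the conversion rather than producing an obvious error.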
| Comment by Peter Jones [ 27/Jan/12 ] |
|
Jinshan, could you please look into this possible problem with one of the LU-874 patches? Thanks, Peter |
| Comment by Prakash Surya (Inactive) [ 27/Jan/12 ] |
|
And just for completeness, here's the Change-id: If14eff6361f55d2b2eeb2db7146789dda4c32060 |
| Comment by D. Marc Stearman (Inactive) [ 27/Jan/12 ] |
|
I can confirm, we do have PPC bluegene clients mounting this file system. Both IO nodes, and login nodes. |
| Comment by Oleg Drokin [ 27/Jan/12 ] |
|
So, is the crash happening on OST? |
| Comment by Prakash Surya (Inactive) [ 27/Jan/12 ] |
|
Oleg: Yeah, the OSS. |
| Comment by Christopher Morrone [ 27/Jan/12 ] |
|
Oleg: Our branch is available at github.com/chaos/lustre. |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/12 ] |
|
Can you please use this patch (in attachment) as a workaround fix? Because |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/12 ] |
|
Thanks, Oleg, for helping me find this issue. There is a similar issue (
| Comment by Prakash Surya (Inactive) [ 27/Jan/12 ] |
|
Jinshan, since this will be running in a production environment, we are reluctant to apply the patch with it not submitted through Gerrit. Would you be able to merge it into that patch and push a new revision? |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/12 ] |
|
Hi Prakash, please take a look at patch set 10 of http://review.whamcloud.com/#change,1918. I'll urge guys to inspect it. |
| Comment by Prakash Surya (Inactive) [ 27/Jan/12 ] |
|
Thanks, Jinshan. |
| Comment by Christopher Morrone [ 02/Mar/12 ] |
|
No additional instances of this bug. I am ready to believe it was introduced by the earlier revision
| Comment by Peter Jones [ 05/Mar/12 ] |
|
ok then let's close for now and reopen if it reappears |