[LU-3067] ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) Created: 29/Mar/13 Updated: 02/Jul/16 Resolved: 02/Jul/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher J. Walker (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | mn8 | ||
| Environment: |
Scientific Linux [walker@fe02 ~]$ uname -r Patchless client: lustre-client-modules-1.8.9-wc1_2.6.18_348.3.1.el5 Servers are all running: |
||
| Severity: | 3 |
| Rank (Obsolete): | 7466 |
| Description |
|
One of our OSSs had problems writing to disk (due to a RAID card problem). Several clients hit an LBUG and haven't recovered after the OSS reboot.
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
I attach the associated log file, and reproduce some lines of context from /var/log/messages:
Mar 29 05:57:03 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection restored to service lustre_0-OST0027 using nid 10.1.4.12 |
| Comments |
| Comment by Girish Shilamkar (Inactive) [ 03/Apr/13 ] |
|
We have seen this problem with our customer and However this fix added the LASSERT,
list_for_each_entry_safe(oap, tmp, &aa->aa_oaps, oap_rpc_item) {
aa->aa_oa->o_flags &= ~OBD_FL_HAVE_LOCK;
Shouldn't OBD_MD_FLHANDLE be set, as it indicates the presence of a valid lock handle? |
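For readers unfamiliar with the check, the following is a minimal standalone sketch of what the failing assertion tests, not the Lustre code itself. It assumes a much-reduced, hypothetical stand-in for the object attributes structure (`demo_obdo`) and placeholder bit values (`DEMO_MD_FLHANDLE`, `DEMO_FL_HAVE_LOCK`); the real definitions live in the Lustre headers and may differ. It only illustrates that the quoted line clears a bit in `o_flags`, while the assertion inspects a bit in the separate `o_valid` field.

```c
#include <stdint.h>

/* Placeholder bit values for illustration only (assumptions, not the
 * values from the Lustre headers). */
#define DEMO_MD_FLHANDLE  (1ULL << 19) /* stands in for OBD_MD_FLHANDLE, an o_valid bit */
#define DEMO_FL_HAVE_LOCK (1U << 0)    /* stands in for OBD_FL_HAVE_LOCK, an o_flags bit */

/* Hypothetical, much-reduced stand-in for the structure behind aa->aa_oa. */
struct demo_obdo {
    uint64_t o_valid; /* bitmask saying which fields carry valid data */
    uint32_t o_flags; /* per-request flags */
};

/* Mirrors the statement quoted above: clear the "have lock" flag bit. */
static void demo_clear_have_lock(struct demo_obdo *oa)
{
    oa->o_flags &= ~DEMO_FL_HAVE_LOCK;
}

/* Returns 1 when the asserted condition
 * !(o_valid & OBD_MD_FLHANDLE) would hold, 0 when it would fail. */
static int demo_handle_bit_clear(const struct demo_obdo *oa)
{
    return !(oa->o_valid & DEMO_MD_FLHANDLE);
}
```

In this sketch, calling `demo_clear_have_lock()` on an object whose `o_valid` still has the handle bit set leaves `demo_handle_bit_clear()` returning 0, which corresponds to the assertion firing. Whether that is what happens on the affected clients is not established by this ticket.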
| Comment by Kit Westneat (Inactive) [ 05/Apr/13 ] |
|
FYI the customer is Sanger. |
| Comment by Christopher J. Walker (Inactive) [ 05/Apr/13 ] |
|
The problem also occurs when the network connection to an OSS fails (e.g. today, when a colleague managed to take out both of the resilient core switches). |
| Comment by Peter Jones [ 05/Apr/13 ] |
|
Hongchao, could you please comment? Thanks, Peter |
| Comment by Hongchao Zhang [ 08/Apr/13 ] |
|
Hi Girish, the patch is tracked at http://review.whamcloud.com/#change,5971 |
| Comment by Kit Westneat (Inactive) [ 01/May/13 ] |
|
Hi, I was wondering what the status of this issue is. We are running into this bug regularly at NOAA as well. Thanks. |
| Comment by Alex Kulyavtsev [ 09/Aug/13 ] |
|
We ran into the same issue on a 1.8.9 client at FNAL. The trace dump is the same. It happened after a client communication error with the OSS (the router had issues). Is it going to be fixed in 1.8.10 (or 1.8.9.1)? Thanks, Alex. |
| Comment by Craig Prescott [ 09/Jan/14 ] |
|
We occasionally run into this issue. I see the patch set has been rebased at http://review.whamcloud.com/#change,5971 - is it only awaiting approval at this point? |
| Comment by Erich Focht [ 09/Jan/14 ] |
|
We also see this issue at a customer site where we deployed servers with Lustre 2.5.0; the customer's most stable setup seems to be with 1.8.9 clients. Up to this bug! |
| Comment by Frederik Ferner (Inactive) [ 03/Mar/14 ] |
|
Looks like we ran into this bug as well. As far as I can tell, we hit this on a (1.8) client after it had some network issues of unknown type... Is the patch ok to use as it is? |
| Comment by Christopher J. Walker (Inactive) [ 03/Mar/14 ] |
|
We are using it in production and haven't noticed problems. |
| Comment by Peter Jones [ 02/Jul/16 ] |
|
No plans for further 1.8.x releases |