[LU-3067] ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) Created: 29/Mar/13  Updated: 02/Jul/16  Resolved: 02/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.9
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher J. Walker (Inactive) Assignee: Hongchao Zhang
Resolution: Won't Fix Votes: 0
Labels: mn8
Environment:

Scientific Linux

[walker@fe02 ~]$ uname -r
2.6.18-348.3.1.el5

Patchless client:

lustre-client-modules-1.8.9-wc1_2.6.18_348.3.1.el5
lustre-client-1.8.9-wc1_2.6.18_348.3.1.el5

Servers are all running:
[root@sn20 ~]# rpm -qa | grep ^lustre
lustre-modules-1.8.9-wc1_2.6.18_348.1.1.el5_lustre
lustre-1.8.9-wc1_2.6.18_348.1.1.el5_lustre
lustre-ldiskfs-3.1.53-wc1_2.6.18_348.1.1.el5_lustre


Attachments: File lustre-log.1364538010.3004.gz    
Issue Links:
Duplicate
is duplicated by LU-4452 Lustre 1.8.8 client causes kernel panic Resolved
Severity: 3
Rank (Obsolete): 7466

 Description   

One of our OSSs had problems writing to disk (due to a raid card problem).

Several clients hit an LBUG and haven't recovered after the OSS reboot.
The error is:

Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG

I attach the associated log file and reproduce some lines of context from /var/log/messages:

Mar 29 05:57:03 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection restored to service lustre_0-OST0027 using nid 10.1.4.120@tcp.
Mar 29 05:57:03 cn492 kernel: Lustre: Skipped 1 previous similar message
Mar 29 06:09:39 cn492 kernel: Lustre: 3004:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1430341259304767 sent from lustre_0-OST0027-osc-ffff81021c041800 to NID 10.1.4.120@tcp 756s ago has timed out (756s prior to deadline).
Mar 29 06:09:39 cn492 kernel: req@ffff8101145e6800 x1430341259304767/t0 o3->lustre_0-OST0027_UUID@10.1.4.120@tcp:6/4 lens 448/592 e 1 to 1 dl 1364537379 ref 2 fl Rpc:/2/0 rc 0/0
Mar 29 06:09:39 cn492 kernel: Lustre: 3004:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Mar 29 06:09:39 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection to service lustre_0-OST0027 via nid 10.1.4.120@tcp was lost; in progress operations using this service will wait for recovery to complete.
Mar 29 06:09:39 cn492 kernel: Lustre: Skipped 1 previous similar message
Mar 29 06:09:39 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection restored to service lustre_0-OST0027 using nid 10.1.4.120@tcp.
Mar 29 06:09:39 cn492 kernel: Lustre: Skipped 1 previous similar message
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG
Mar 29 06:20:10 cn492 kernel: Pid: 3004, comm: ptlrpcd
Mar 29 06:20:10 cn492 kernel:
Mar 29 06:20:10 cn492 kernel: Call Trace:
Mar 29 06:20:10 cn492 kernel: [<ffffffff885786a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
Mar 29 06:20:10 cn492 kernel: [<ffffffff88578bda>] lbug_with_loc+0x7a/0xd0 [libcfs]
Mar 29 06:20:10 cn492 kernel: [<ffffffff88580fc0>] tracefile_init+0x0/0x110 [libcfs]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8879c7e8>] brw_interpret+0x8e8/0xdb0 [osc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff886d36ac>] after_reply+0xcac/0xe30 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff886d4b0b>] ptlrpc_check_set+0x12db/0x15a0 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8004b396>] try_to_del_timer_sync+0x7f/0x88
Mar 29 06:20:10 cn492 kernel: [<ffffffff887095ad>] ptlrpcd_check+0xdd/0x1f0 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8009a98c>] process_timeout+0x0/0x5
Mar 29 06:20:10 cn492 kernel: [<ffffffff88709ef1>] ptlrpcd+0x1b1/0x259 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8008f3ad>] default_wake_function+0x0/0xe
Mar 29 06:20:10 cn492 kernel: [<ffffffff8005dfc1>] child_rip+0xa/0x11
Mar 29 06:20:10 cn492 kernel: [<ffffffff88709d40>] ptlrpcd+0x0/0x259 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8005dfb7>] child_rip+0x0/0x11
Mar 29 06:20:10 cn492 kernel:
Mar 29 06:20:10 cn492 kernel: LustreError: dumping log to /tmp/lustre-log.1364538010.3004



 Comments   
Comment by Girish Shilamkar (Inactive) [ 03/Apr/13 ]

We have seen this problem with our customer, and while investigating it I stumbled upon the recent changes made to this code:
http://jira.whamcloud.com/browse/LU-2703
That soft lockup issue was fixed in 1.8.9 by moving obd_cancel() (which cancels an ldlm lock held for the brw) out of *_ap_completion. This was done because obd_cancel() is a heavy operation and should not be called while holding a spinlock (cl_loi_list_lock).

However, this fix added the following LASSERT:

list_for_each_entry_safe(oap, tmp, &aa->aa_oaps, oap_rpc_item) {
        list_del_init(&oap->oap_rpc_item);

        aa->aa_oa->o_flags &= ~OBD_FL_HAVE_LOCK;
        osc_ap_completion(cli, aa->aa_oa, oap, 1, rc);
        if (aa->aa_oa->o_flags & OBD_FL_HAVE_LOCK) {
                LASSERT(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE));
                LASSERT(index < aa->aa_handle_count);
                aa->aa_handle[index++] = aa->aa_oa->o_handle;
        }
}

Shouldn't OBD_MD_FLHANDLE be set, since it indicates the presence of a valid lock handle?
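
For anyone less familiar with the LU-2703 change, the general pattern it introduced is: record the lock handles while still holding cl_loi_list_lock, and only issue the heavy obd_cancel() calls after the spinlock has been dropped. Below is a minimal, self-contained sketch of that collect-then-cancel pattern; the names (cancel_handle, complete_requests, list_lock) are placeholders for illustration, not Lustre APIs.

#include <pthread.h>
#include <stdio.h>

#define MAX_HANDLES 16

/* Stands in for cl_loi_list_lock in this sketch. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the expensive obd_cancel(): must never run under list_lock. */
static void cancel_handle(unsigned long cookie)
{
        printf("cancelling lock handle %lu outside the lock\n", cookie);
}

static void complete_requests(const unsigned long *cookies, int count)
{
        unsigned long deferred[MAX_HANDLES];
        int ndeferred = 0;
        int i;

        /* Phase 1: under the lock, only remember which handles need cancelling. */
        pthread_mutex_lock(&list_lock);
        for (i = 0; i < count && ndeferred < MAX_HANDLES; i++)
                deferred[ndeferred++] = cookies[i];
        pthread_mutex_unlock(&list_lock);

        /* Phase 2: the heavy cancel calls run only after the lock is dropped. */
        for (i = 0; i < ndeferred; i++)
                cancel_handle(deferred[i]);
}

int main(void)
{
        unsigned long cookies[] = { 101, 102, 103 };

        complete_requests(cookies, 3);
        return 0;
}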

Comment by Kit Westneat (Inactive) [ 05/Apr/13 ]

FYI the customer is Sanger.

Comment by Christopher J. Walker (Inactive) [ 05/Apr/13 ]

The problem also occurs when the network connection to an OSS fails (e.g. today, when a colleague managed to take out both of the resilient core switches).

Comment by Peter Jones [ 05/Apr/13 ]

Hongchao

Could you please comment?

Thanks

Peter

Comment by Hongchao Zhang [ 08/Apr/13 ]

Hi Girish,
Thanks for your investigation! OBD_MD_FLHANDLE can indeed be used on the client side (OSC).

The patch is tracked at http://review.whamcloud.com/#change,5971
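
As background on why the assertion can legitimately fire: the o_handle field of the obdo is only meaningful when the matching bit (OBD_MD_FLHANDLE) is set in o_valid, so once the client carries a lock handle back into brw_interpret() the negated check in the LASSERT above fails. A tiny self-contained illustration of that valid-bit convention; the struct and flag names below are placeholders, not the Lustre definitions.

#include <stdint.h>
#include <stdio.h>

/* Placeholder flag bit, mirroring the "this field is valid" convention. */
#define MD_FLHANDLE 0x1ULL

struct obj_attr {
        uint64_t valid;   /* bitmask of which fields below carry real data */
        uint64_t handle;  /* lock-handle cookie, meaningful only if MD_FLHANDLE set */
};

static void set_handle(struct obj_attr *oa, uint64_t cookie)
{
        oa->handle = cookie;
        oa->valid |= MD_FLHANDLE;   /* mark the handle field as valid */
}

int main(void)
{
        struct obj_attr oa = { 0 };

        set_handle(&oa, 0xdeadbeef);

        /* The failing LASSERT in the report is the negation of this test:
         * it asserted the bit was NOT set, and crashed once it was. */
        if (oa.valid & MD_FLHANDLE)
                printf("handle 0x%llx is valid\n",
                       (unsigned long long)oa.handle);
        return 0;
}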

Comment by Kit Westneat (Inactive) [ 01/May/13 ]

Hi, I was wondering what the status of this issue is. We are running into this bug regularly at NOAA as well. Thanks.

Comment by Alex Kulyavtsev [ 09/Aug/13 ]

We ran into the same issue on a 1.8.9 client at FNAL. The trace dump is the same. It happened after a client communication error with the OSS (the router had issues). Is it going to be fixed in 1.8.10 (or 1.8.9.1)? Thanks, Alex.

Comment by Craig Prescott [ 09/Jan/14 ]

We occasionally run into this issue. I see the patch set has been rebased at http://review.whamcloud.com/#change,5971 - is it only waiting for approval at this point?

Comment by Erich Focht [ 09/Jan/14 ]

We also see this issue at a customer site where we deployed servers with Lustre 2.5.0; their most stable setup seems to be with 1.8.9 clients, apart from this bug.

Comment by Frederik Ferner (Inactive) [ 03/Mar/14 ]

Looks like we ran into this bug as well. As far as I can tell, we hit it on a (1.8) client after it had some network issues of unknown type... Is the patch OK to use as it is?

Comment by Christopher J. Walker (Inactive) [ 03/Mar/14 ]

We are using it in production and haven't noticed problems.

Comment by Peter Jones [ 02/Jul/16 ]

No plans for further 1.8.x releases
