Details
-
Bug
-
Resolution: Won't Fix
-
Major
-
None
-
Lustre 1.8.9
-
Scientific Linux
[walker@fe02 ~]$ uname -r
2.6.18-348.3.1.el5
Patchless client:
lustre-client-modules-1.8.9-wc1_2.6.18_348.3.1.el5
lustre-client-1.8.9-wc1_2.6.18_348.3.1.el5
Servers are all running:
[root@sn20 ~]# rpm -qa | grep ^lustre
lustre-modules-1.8.9-wc1_2.6.18_348.1.1.el5_lustre
lustre-1.8.9-wc1_2.6.18_348.1.1.el5_lustre
lustre-ldiskfs-3.1.53-wc1_2.6.18_348.1.1.el5_lustre
-
3
-
7466
Description
One of our OSSs had problems writing to disk (due to a RAID card problem).
Several clients hit an LBUG and have not recovered after the OSS was rebooted.
The error is:
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG
I attach the associated log file and reproduce some lines of context from /var/log/messages:
Mar 29 05:57:03 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection restored to service lustre_0-OST0027 using nid 10.1.4.120@tcp.
Mar 29 05:57:03 cn492 kernel: Lustre: Skipped 1 previous similar message
Mar 29 06:09:39 cn492 kernel: Lustre: 3004:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1430341259304767 sent from lustre_0-OST0027-osc-ffff81021c041800 to NID 10.1.4.120@tcp 756s ago has timed out (756s prior to deadline).
Mar 29 06:09:39 cn492 kernel: req@ffff8101145e6800 x1430341259304767/t0 o3->lustre_0-OST0027_UUID@10.1.4.120@tcp:6/4 lens 448/592 e 1 to 1 dl 1364537379 ref 2 fl Rpc:/2/0 rc 0/0
Mar 29 06:09:39 cn492 kernel: Lustre: 3004:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Mar 29 06:09:39 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection to service lustre_0-OST0027 via nid 10.1.4.120@tcp was lost; in progress operations using this service will wait for recovery to complete.
Mar 29 06:09:39 cn492 kernel: Lustre: Skipped 1 previous similar message
Mar 29 06:09:39 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection restored to service lustre_0-OST0027 using nid 10.1.4.120@tcp.
Mar 29 06:09:39 cn492 kernel: Lustre: Skipped 1 previous similar message
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG
Mar 29 06:20:10 cn492 kernel: Pid: 3004, comm: ptlrpcd
Mar 29 06:20:10 cn492 kernel:
Mar 29 06:20:10 cn492 kernel: Call Trace:
Mar 29 06:20:10 cn492 kernel: [<ffffffff885786a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
Mar 29 06:20:10 cn492 kernel: [<ffffffff88578bda>] lbug_with_loc+0x7a/0xd0 [libcfs]
Mar 29 06:20:10 cn492 kernel: [<ffffffff88580fc0>] tracefile_init+0x0/0x110 [libcfs]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8879c7e8>] brw_interpret+0x8e8/0xdb0 [osc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff886d36ac>] after_reply+0xcac/0xe30 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff886d4b0b>] ptlrpc_check_set+0x12db/0x15a0 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8004b396>] try_to_del_timer_sync+0x7f/0x88
Mar 29 06:20:10 cn492 kernel: [<ffffffff887095ad>] ptlrpcd_check+0xdd/0x1f0 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8009a98c>] process_timeout+0x0/0x5
Mar 29 06:20:10 cn492 kernel: [<ffffffff88709ef1>] ptlrpcd+0x1b1/0x259 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8008f3ad>] default_wake_function+0x0/0xe
Mar 29 06:20:10 cn492 kernel: [<ffffffff8005dfc1>] child_rip+0xa/0x11
Mar 29 06:20:10 cn492 kernel: [<ffffffff88709d40>] ptlrpcd+0x0/0x259 [ptlrpc]
Mar 29 06:20:10 cn492 kernel: [<ffffffff8005dfb7>] child_rip+0x0/0x11
Mar 29 06:20:10 cn492 kernel:
Mar 29 06:20:10 cn492 kernel: LustreError: dumping log to /tmp/lustre-log.1364538010.3004
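Since several clients are affected, a small script can help list which nodes logged the LBUG. The following is only a minimal sketch, assuming syslog lines in the same format as the excerpt above; the function name `find_lbugs` and the regular expression are illustrative and not part of any Lustre tooling.

```python
import re

# Matches the LBUG line format seen in the excerpt above, e.g.
#   Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG
# The pattern is an assumption based on these lines only.
LBUG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"LustreError:\s+(?P<pid>\d+):\d+:"
    r"\((?P<src>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+LBUG\s*$"
)


def find_lbugs(lines):
    """Return (host, timestamp, source file, line number, function) per LBUG line."""
    hits = []
    for line in lines:
        m = LBUG_RE.match(line)
        if m:
            hits.append(
                (m["host"], m["ts"], m["src"], int(m["line"]), m["func"])
            )
    return hits


# Sample input copied from the log excerpt in this report.
sample = [
    "Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) "
    "ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed",
    "Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG",
    "Mar 29 06:20:10 cn492 kernel: Pid: 3004, comm: ptlrpcd",
]

for hit in find_lbugs(sample):
    print(hit)
```

To scan a real log, the same function can be fed an open file, for example `find_lbugs(open("/var/log/messages"))`, on each client or on a central syslog host.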
Attachments
Issue Links
- is duplicated by: LU-4452 Lustre 1.8.8 client causes kernel panic (Resolved)
We ran into the same issue with 1.8.9 on a client at FNAL. The trace dump is the same. It happened after a communication error between the client and the OSS (the router had issues). Is it going to be fixed in 1.8.10 (or 1.8.9.1)? Thanks, Alex.