[LU-5169] Lustre client panic during MDS failover Created: 10/Jun/14  Updated: 29/Jan/16  Resolved: 29/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Swapnil Pimpale (Inactive) Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre servers: 2.4.3
Lustre clients: 2.5.1


Issue Links:
Related
is related to LU-5507 sanity-quota test_18: Oops: IP: lustr... Resolved
Severity: 2
Rank (Obsolete): 14248

 Description   

The setup is as follows:

There are two filesystems: pfs2dat2 and pfs2wor2

Clients:
uc1n996
uc1n997

For pfs2dat2:
MDS: pfs2n12/13
OSS: pfs2n14/15

For pfs2wor2:
MDS: pfs2n16/17
OSS: pfs2n18/19/20/21

The two MDSes involved in the failover were pfs2n12 and pfs2n13. The client uc1n996 panicked with the following stack trace:
last sysfs file: /sys/devices/system/cpu/online
CPU 5
Modules linked in: iptable_filter ip_tables
nfs lockd fscache auth_rpcgss nfs_acl sunrpc lmv(U) fld(U) mgc(U) lustre(U)
lov(U) osc(U) mdc(U) fid(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U)
sha512_generic sha256_generic crc32c_intel libcfs(U) ib_ipoib rdma_ucm ib_ucm
ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_multipath vhost_net
macvtap macvlan tun kvm_intel kvm uinput microcode iTCO_wdt
iTCO_vendor_support acpi_pad power_meter dcdbas sg mlx4_ib ib_sa ib_mad
ib_core mlx4_en mlx4_core sb_edac edac_core lpc_ich mfd_core shpchp igb
i2c_algo_bit i2c_core ixgbe dca ptp pps_core mdio xfs exportfs sd_mod
crc_t10dif wmi ahci megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last
unloaded: speedstep_lib]

Pid: 2895, comm: ptlrpcd_rcv Not tainted 2.6.32-431.11.2.el6.x86_64 #1 Dell Inc. PowerEdge R620/0PXXHP
RIP: 0010:[<ffffffffa0708bde>]  [<ffffffffa0708bde>] lustre_msg_get_opc+0xe/0x110 [ptlrpc]
RSP: 0018:ffff88082b5ddc80  EFLAGS: 00010282
RAX: ffff8800a585e208 RBX: 0000000000000000 RCX: ffff8801a22893a0
RDX: 0000000000000002 RSI: 0000000000000000 RDI: 3237323033093932
RBP: ffff88082b5ddc90 R08: 0000000000000000 R09: 00000000fffffffc
R10: 0000000000000002 R11: 0000000000000004 R12: ffff8809421d7000
R13: ffff8800a585e208 R14: 00000032a434f11a R15: ffff8801a22890c8
FS:  0000000000000000(0000) GS:ffff88085c440000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000346b2727d0 CR3: 000000102a8e5000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ptlrpcd_rcv (pid: 2895, threadinfo ffff88082b5dc000, task ffff8808314deaa0)
Stack:
 ffff88082b5ddc90 0000000000000000 ffff88082b5ddcd0 ffffffffa08b6c2d
<d> ffff880563411000 ffff8801a2289000 ffff8801a2289000 ffff88102d915800
<d> ffff8801a22892e0 00000032a434f11a ffff88082b5ddd00 ffffffffa06fd312
Call Trace:
 [<ffffffffa08b6c2d>] mdc_replay_open+0xad/0x420 [mdc]
 [<ffffffffa06fd312>] ptlrpc_replay_interpret+0x142/0x740 [ptlrpc]
 [<ffffffffa06fe994>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
 [<ffffffffa0729ebb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
 [<ffffffffa072a3db>] ptlrpcd+0x20b/0x370 [ptlrpc]
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffffa072a1d0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
 [<ffffffff8109aee6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ae50>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Code: 24 48 48 83 c4 68 4c 89 e0 5b 41 5c 41 5d 41 5e 41 5f c9 c3 45 31 e4 e9 26 ff ff ff 90 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 <81> 7f 08 d3 0b d0 0b 48 89 fb 74 76 c7 05 fc 7e 0a 00 00 01 00
RIP  [<ffffffffa0708bde>] lustre_msg_get_opc+0xe/0x110 [ptlrpc]
RSP <ffff88082b5ddc80>
--[ end trace ee65cdcf6a61aa8a ]--
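
The fault is at lustre_msg_get_opc+0xe, called from mdc_replay_open() while replaying an open after the failover. The faulting bytes in the Code: line decode to "cmpl $0x0bd00bd3, 0x8(%rdi)" (the LUSTRE_MSG_MAGIC_V2 check), and RDI, the message pointer, holds the garbage value 3237323033093932, so the very first dereference of the reply message oopses. Below is a minimal C sketch of that check; it is a simplified illustration based on the 2.x lustre_msg layout, not the verbatim Lustre source, and the _sketch names are hypothetical, used only for this note.

#include <stdint.h>

#define LUSTRE_MSG_MAGIC_V2 0x0BD00BD3	/* magic compared at offset 0x8 */

/* Simplified layout: lm_magic sits at offset 0x8, which matches the
 * faulting instruction "cmpl $0x0bd00bd3, 0x8(%rdi)" decoded above. */
struct lustre_msg_v2_sketch {
	uint32_t lm_bufcount;
	uint32_t lm_secflvr;
	uint32_t lm_magic;
	/* ... remaining header fields and message buffers ... */
};

uint32_t lustre_msg_get_opc_sketch(struct lustre_msg_v2_sketch *msg)
{
	/* The real function first switches on msg->lm_magic.  If the msg
	 * pointer handed in by the replay path is stale or corrupt (as the
	 * RDI value above suggests), this load faults before any opcode
	 * can be returned, giving the oops at lustre_msg_get_opc+0xe. */
	if (msg->lm_magic == LUSTRE_MSG_MAGIC_V2) {
		/* the real code locates the ptlrpc_body buffer and returns pb_opc */
		return 0;
	}

	return 0;	/* the real code logs "incorrect message magic" here */
}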



 Comments   
Comment by Swapnil Pimpale (Inactive) [ 10/Jun/14 ]

I have uploaded the following logs to ftp.whamcloud.com (/uploads/LU-5169)

Client logs: 2014-06-05-ddn_lustre_showall_clients_case_kernel_panic_20140605.tar
Server logs: 2014-06-05-SR31415_es_lustre_showall_2014-06-05_091605.tar.bz2

Comment by Peter Jones [ 11/Jun/14 ]

Hongchao

Could you please assist with this issue?

Thanks

Peter

Comment by Oleg Drokin [ 11/Jun/14 ]

This looks like LU-3333 to me

Comment by Hongchao Zhang [ 12/Jun/14 ]

Are the logs from around the panic available? I can't find them in the uploaded logs. Thanks.
Also, could you please print the source lines corresponding to "lustre_msg_get_opc+0xe"? The code in the ptlrpc module could be different. Thanks.

Comment by Rajeshwaran Ganesan [ 24/Jun/14 ]

There were no logs collected during the panic. Is it possible to get the patch (LU-3333) for 2.5.1?

Comment by Peter Jones [ 24/Jun/14 ]

Perhaps it would make more sense to upgrade to 2.5.2 (due out imminently) in order to get this fix?

Comment by Rajeshwaran Ganesan [ 26/Jun/14 ]

Hello,

The customer is fine with upgrading to 2.5.2. Could you please give me a link to download the very latest 2.5.2 build, so that the following patches are covered:

1. The do_statahead_enter() LBUG patch (http://review.whamcloud.com/10363), reported as included in v2_5_2_RC2.
2. The lovsub_lock_state() LBUG patch (http://review.whamcloud.com/9881), reported as included in v2_5_60_0 and the master branch.

Thanks,
Rajesh

Comment by Peter Jones [ 26/Jun/14 ]

2.5.2 also includes the LU-4558 fix - http://git.whamcloud.com/fs/lustre-release.git/commit/deb1e8aa6836ad073d53bf3e4dd29a2cb5696f2e

The release can be accessed at http://downloads.whamcloud.com/public/lustre/latest-maintenance-release/

Comment by Ian Costello [ 30/Apr/15 ]

Is this a duplicate of LU-5507? It looks like the patch http://review.whamcloud.com/#/c/12667/ would resolve the issue above. I have seen the same problem at the ANU/NCI site.

Comment by Hongchao Zhang [ 30/Apr/15 ]

Yes, it seems to be the same issue! Thanks!

Comment by John Fuchs-Chesney (Inactive) [ 29/Jan/16 ]

We are marking this as resolved/duplicate.

Many thanks,
~ jfc.
