[LU-6596] GPF: RIP [<ffffffffa076924b>] ptlrpc_replay_next+0xdb/0x380 [ptlrpc] Created: 13/May/15  Updated: 02/Nov/15  Resolved: 02/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Bruno Travouillon (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

kernel 2.6.32-504.12.2.el6
lustre-2.5.3.90 w/ some bullpatches on clients and servers


Issue Links:
Duplicate
duplicates LU-6802 sanity test_208 fail: “lease not brok... Resolved
Related
is related to LU-6022 replay-single test 73c hung: RIP: ptl... Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Since upgrading from 2.5.3 to 2.5.3.90, one of our customers has been hitting the following GPF on client nodes while the MDT is in recovery.

<4>general protection fault: 0000 [0000001] SMP
<4>last sysfs file: /sys/module/ipv6/initstate
<4>CPU 21
<4>Modules linked in: iptable_mangle iptable_filter lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_sa(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) mic(U) uinput ipmi_si ipmi_msghandler sg compat(U) lpc_ich mfd_core ioatdma myri10ge igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache ahci sd_mod crc_t10dif dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 10457, comm: ptlrpcd_rcv Not tainted 2.6.32-504.12.2.el6.Bull.72.x86_64 0000001 BULL bullx super-node
<4>RIP: 0010:[<ffffffffa076924b>] [<ffffffffa076924b>] ptlrpc_replay_next+0xdb/0x380 [ptlrpc]
<4>RSP: 0018:ffff88086d375bb0 EFLAGS: 00010296
<4>RAX: 5a5a5a5a5a5a5a5a RBX: ffff88107ad2d800 RCX: ffff8806e2f65d10
<4>RDX: ffff88107ad2d8b0 RSI: ffff88086d375c1c RDI: ffff88107ad2d800
<4>RBP: ffff88086d375be0 R08: 0000000000000000 R09: 0000000000000000
<4>R10: ffff88087ca2fe50 R11: 0000000000000000 R12: 0000000000000000
<4>R13: ffff88107ad2da90 R14: ffff88086d375c1c R15: ffff881e7f308000
<4>FS: 0000000000000000(0000) GS:ffff88089c540000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 00002ba018b2c000 CR3: 0000000001a85000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ptlrpcd_rcv (pid: 10457, threadinfo ffff88086d374000, task ffff88087b2b6040)
<4>Stack:
<4> 0000000000000000 ffff88107ad2d800 ffff880917d81800 ffff88107ad2da90
<4><d> 0000000000000000 ffff880d22697cc0 ffff88086d375c40 ffffffffa078e570
<4><d> 0000000000000000 ffff88107b316078 0000000000000000 ffff880d22697cc0
<4>Call Trace:
<4> [<ffffffffa078e570>] ptlrpc_import_recovery_state_machine+0x360/0xc30 [ptlrpc]
<4> [<ffffffffa078fc69>] ptlrpc_connect_interpret+0x779/0x21d0 [ptlrpc]
<4> [<ffffffffa0784d6b>] ? ptlrpc_pinger_commit_expected+0x1b/0x90 [ptlrpc]
<4> [<ffffffffa076605d>] ptlrpc_check_set+0x31d/0x1c20 [ptlrpc]
<4> [<ffffffff81087fdb>] ? try_to_del_timer_sync+0x7b/0xe0
<4> [<ffffffffa0792613>] ptlrpcd_check+0x533/0x550 [ptlrpc]
<4> [<ffffffffa0792b2b>] ptlrpcd+0x20b/0x370 [ptlrpc]
<4> [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0792920>] ? ptlrpcd+0x0/0x370 [ptlrpc]
<4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: c0 00 00 00 48 8b 00 48 39 c2 48 89 83 c0 00 00 00 75 18 eb 23 0f 1f 00 48 8b 00 48 39 c2 48 89 83 c0 00 00 00 0f 84 8c 00 00 00 <4c> 3b 60 f0 4c 8d b8 f0 fe ff ff 73 e0 4d 85 ff 74 7a f6 83 95
<1>RIP [<ffffffffa076924b>] ptlrpc_replay_next+0xdb/0x380 [ptlrpc]
<4> RSP <ffff88086d375bb0>

Our MDS is an active/passive HA cluster. This GPF can occur during failover or failback of the MDT.

The crash occurred on 05/04, 05/07 and 05/12. During the last occurrence, we lost 200 compute nodes and 2 login nodes.

The stack trace looks like the one in LU-6022.

Could you help us on this one?



 Comments   
Comment by Bruno Travouillon (Inactive) [ 13/May/15 ]

FTR, our build includes the LU-6084 patch ("ptlrpc: prevent request timeout grow due to recovery").

Comment by Peter Jones [ 13/May/15 ]

Bruno F

Could you please advise on this one?

thanks

Peter

Comment by Bruno Faccini (Inactive) [ 18/May/15 ]

Hello Bruno,
Is it possible to get access to one of these clients' crash dumps?
Also, you indicated that this can occur during MDS fail-back, so you may be able to enable full debug on a set (all?) of client nodes in order to capture as much information/as many traces as possible at the time of the next expected occurrence.
Thanks for pointing out that the patch for LU-6084 is already integrated in your distro; can you also detail the full list of patches applied on top of 2.5.3.90, on both the server and client sides?

Comment by Sebastien Buisson (Inactive) [ 20/May/15 ]

Hi Bruno,

As for the patches that we apply on top of 2.5.3.90, here is the list you requested:

Cheers,
Sebastien.

Comment by Bruno Travouillon (Inactive) [ 20/May/15 ]

Hello Bruno,

The customer is a black site. We can schedule a meeting to work on the crash dump on-site if you want to.
You are right; I will instruct the local support team to enable full debug in case of a failback. We will set debug=ALL and slightly increase debug_size.

Comment by Oleg Drokin [ 24/Jul/15 ]

I just hit this (or so I think) during my regular testing of master. I have a crashdump, so that should give us quite a bit of extra info.

<1>[26215.496409] BUG: unable to handle kernel paging request at ffffffffffffffe8
<1>[26215.496730] IP: [<ffffffffa16cf30b>] ptlrpc_replay_next+0xdb/0x370 [ptlrpc]
<4>[26215.497079] PGD 1a27067 PUD 1a28067 PMD 0 
<4>[26215.497358] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[26215.497650] last sysfs file: /sys/devices/system/cpu/possible
<4>[26215.497954] CPU 7 
<4>[26215.497996] Modules linked in: lustre ofd osp lod ost mdt mdd mgs osd_ldiskfs ldiskfs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet libcfs zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate exportfs jbd sha512_generic sha256_generic ext4 jbd2 mbcache virtio_console virtio_balloon i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<4>[26215.500111] 
<4>[26215.500111] Pid: 14755, comm: ptlrpcd_rcv Tainted: P           ---------------    2.6.32-rhe6.6-debug #1 Red Hat KVM
<4>[26215.500111] RIP: 0010:[<ffffffffa16cf30b>]  [<ffffffffa16cf30b>] ptlrpc_replay_next+0xdb/0x370 [ptlrpc]
<4>[26215.500111] RSP: 0018:ffff880098b55b10  EFLAGS: 00010286
<4>[26215.500111] RAX: 0000000000000000 RBX: ffff8800789617f0 RCX: 0000000000000000
<4>[26215.500111] RDX: ffff8800789618b8 RSI: 0000000000000000 RDI: ffff8800b051ef70
<4>[26215.500111] RBP: ffff880098b55b40 R08: 00000000fffffffb R09: 00000000fffffffe
<4>[26215.500111] R10: 0000000000000000 R11: 000000000000005a R12: 0000000000000000
<4>[26215.500111] R13: ffff880078961ad0 R14: ffff880098b55b7c R15: ffff880075689ce8
<4>[26215.500111] FS:  0000000000000000(0000) GS:ffff8800063c0000(0000) knlGS:0000000000000000
<4>[26215.500111] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[26215.500111] CR2: ffffffffffffffe8 CR3: 000000008a94b000 CR4: 00000000000006e0
<4>[26215.500111] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[26215.500111] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[26215.500111] Process ptlrpcd_rcv (pid: 14755, threadinfo ffff880098b54000, task ffff8800b43d02c0)
<4>[26215.500111] Stack:
<4>[26215.500111]  0000000000000000 ffff8800789617f0 0000000000000000 ffffffffa177a633
<4>[26215.500111] <d> ffff8800b03b9ce8 ffff880078961ad0 ffff880098b55ba0 ffffffffa16f3830
<4>[26215.500111] <d> 0000000000000010 ffff880098b55bb0 ffff880098b55b70 ffff8800789617f0
<4>[26215.500111] Call Trace:
<4>[26215.500111]  [<ffffffffa16f3830>] ptlrpc_import_recovery_state_machine+0x360/0xc20 [ptlrpc]
<4>[26215.500111]  [<ffffffffa16f6bac>] ptlrpc_connect_interpret+0xc4c/0x2570 [ptlrpc]
<4>[26215.500111]  [<ffffffffa16ec59a>] ? ptlrpc_update_next_ping+0x4a/0xd0 [ptlrpc]
<4>[26215.500111]  [<ffffffffa16cc2c3>] ptlrpc_check_set+0x613/0x1bf0 [ptlrpc]
<4>[26215.500111]  [<ffffffff81522574>] ? _spin_lock_irqsave+0x24/0x30
<4>[26215.500111]  [<ffffffffa16fa2d3>] ptlrpcd_check+0x3e3/0x630 [ptlrpc]
<4>[26215.500111]  [<ffffffffa16fa83b>] ptlrpcd+0x31b/0x500 [ptlrpc]
<4>[26215.500111]  [<ffffffff81061630>] ? default_wake_function+0x0/0x20
<4>[26215.500111]  [<ffffffffa16fa520>] ? ptlrpcd+0x0/0x500 [ptlrpc]
<4>[26215.500111]  [<ffffffff8109ce4e>] kthread+0x9e/0xc0
<4>[26215.500111]  [<ffffffff8100c24a>] child_rip+0xa/0x20
<4>[26215.500111]  [<ffffffff8109cdb0>] ? kthread+0x0/0xc0
<4>[26215.500111]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
<4>[26215.500111] Code: 48 8b 00 48 39 c2 48 89 83 d8 00 00 00 75 1c eb 24 0f 1f 80 00 00 00 00 48 8b 00 48 39 c2 48 89 83 d8 00 00 00 0f 84 7c 00 00 00 <4c> 3b 60 e8 4c 8d 78 80 73 e3 4d 85 ff 74 6d f6 83 f9 02 00 00 
<1>[26215.500111] RIP  [<ffffffffa16cf30b>] ptlrpc_replay_next+0xdb/0x370 [ptlrpc]
<4>[26215.500111]  RSP <ffff880098b55b10>
<4>[26215.500111] CR2: ffffffffffffffe8
(gdb) l *(ptlrpc_replay_next+0xdb)
0x4a33b is in ptlrpc_replay_next (/home/green/git/lustre-release/lustre/ptlrpc/recover.c:129).
124				while (imp->imp_replay_cursor !=
125				       &imp->imp_committed_list) {
126					req = list_entry(imp->imp_replay_cursor,
127							     struct ptlrpc_request,
128							     rq_replay_list);
129					if (req->rq_transno > last_transno)
130						break;
131	
132					req = NULL;
133					imp->imp_replay_cursor =
(gdb) p (int)0xffffffffffffffe8
$1 = -24
(gdb) p &((struct ptlrpc_request *) 0x0)->rq_transno
$1 = (__u64 *) 0x68
(gdb) p (int)0xffffffffffffffe8 - 0x68
$2 = -128
(gdb) quit
[green@intelbox lustre-release]$ grep 128 ~/bk/linux-2.6.32-504.3.3.el6-debug/include/asm-generic/errno.h 
#define	EKEYREVOKED	128	/* Key has been revoked */

Of course, this -128 might be total garbage from somewhere, and the code has accumulated some differences since 2.5, so this might not be related at all, or it might well be the same issue.
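For reference, the pointer arithmetic in the gdb session above can be replayed with a tiny stand-alone program (illustration only, not Lustre code; the 0x68 offset of rq_transno is taken from the gdb output, everything else is just the subtraction):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* 0xffffffffffffffe8 (the faulting address) interpreted as signed: -24 */
        int64_t fault_addr = -24;
        int64_t rq_transno_off = 0x68;  /* &((struct ptlrpc_request *)0)->rq_transno, from gdb */

        /* The fault happened while reading req->rq_transno, so subtracting
         * the field offset recovers the bogus value that req held. */
        int64_t req = fault_addr - rq_transno_off;

        printf("req = %lld\n", (long long)req);  /* prints -128, i.e. -EKEYREVOKED */
        return 0;
}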

Crashdump is on my box in /exports/crashdumps/192.168.10.221-2015-07-23-18\:24\:10; the tag in my tree is master-20150723.

Comment by Bruno Faccini (Inactive) [ 23/Aug/15 ]

Hmm, according to the crash-dump info and the related source/assembly code, it seems the problem occurs in the following piece of code in ptlrpc_replay_next():

119                         /* Since the imp_committed_list is immutable before
120                          * all of it's requests being replayed, it's safe to
121                          * use a cursor to accelerate the search */
122                         imp->imp_replay_cursor = imp->imp_replay_cursor->next;
123 
124                         while (imp->imp_replay_cursor !=
125                                &imp->imp_committed_list) {
126                                 req = list_entry(imp->imp_replay_cursor,
127                                                      struct ptlrpc_request,
128                                                      rq_replay_list);
129                                 if (req->rq_transno > last_transno)
130                                         break;
131 
132                                 req = NULL;
133                                 imp->imp_replay_cursor =
134                                         imp->imp_replay_cursor->next;
135                         }

because imp->imp_replay_cursor has become corrupted, likely due to a race. We may need to add some mutual exclusion to make this code safe; see the sketch below.
Just to confirm, Bruno, can you check the imp_replay_cursor field of the obd_import struct at address 0xffff88107ad2d800 in your crash dump? If you could also provide the full obd_import struct, that would be very helpful too.
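For illustration only, here is a minimal sketch of what such mutual exclusion around the cursor walk could look like, assuming imp->imp_lock is the lock that serializes access to the import's replay lists (this is a sketch, not the patch that was eventually landed):

        /* Sketch only: the cursor walk from recover.c quoted above, wrapped
         * in imp->imp_lock as one possible form of mutual exclusion.  Same
         * context (imp, req, last_transno) as in ptlrpc_replay_next(). */
        spin_lock(&imp->imp_lock);
        imp->imp_replay_cursor = imp->imp_replay_cursor->next;
        while (imp->imp_replay_cursor != &imp->imp_committed_list) {
                req = list_entry(imp->imp_replay_cursor,
                                 struct ptlrpc_request, rq_replay_list);
                if (req->rq_transno > last_transno)
                        break;

                req = NULL;
                imp->imp_replay_cursor = imp->imp_replay_cursor->next;
        }
        spin_unlock(&imp->imp_lock);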

Comment by Bruno Faccini (Inactive) [ 24/Aug/15 ]

To be complete, the GPF occurs at line:

129                                 if (req->rq_transno > last_transno)

when dereferencing req. This directly follows from imp->imp_replay_cursor having been corrupted, either at the loop initialization line:

122                         imp->imp_replay_cursor = imp->imp_replay_cursor->next;

or inside the while loop, at lines:

133                                 imp->imp_replay_cursor =
134                                         imp->imp_replay_cursor->next;

with an LP_POISON value (Bull's case) when the freed memory has not yet been re-used or overwritten, or with zeros (Oleg's case) when it has already been re-allocated...

The previous imp->imp_replay_cursor pointer value cannot be retrieved from the crash dumps, according to the routine's assembly code, so for now we can only reason about the different possible corruption scenarios.
And since a race does not seem to have occurred, at least based on my analysis of the Lustre debug log extracted from the crash dump in Oleg's case, I have thought about another possibility: the original imp->imp_replay_cursor value, at the start of the current replay/recovery round, could have been out-dated and pointing to an old req...
If this is the case, the recent patches for LU-6802 may avoid this problem, because they change the recovery mechanism to always reset imp->imp_replay_cursor so that replay starts from the beginning of the imp_committed_list list (sketched below).
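Roughly, the approach described above amounts to something like the following (a sketch only; ptlrpc_reset_replay_cursor is a hypothetical helper name used for illustration, not the actual LU-6802 change):

        /* Hypothetical helper, illustration only: before starting a new
         * replay round, forget any cursor left over from a previous round
         * so the walk always begins from the (known-valid) list head
         * instead of from a pointer that may reference a freed request. */
        static void ptlrpc_reset_replay_cursor(struct obd_import *imp)
        {
                spin_lock(&imp->imp_lock);
                imp->imp_replay_cursor = &imp->imp_committed_list;
                spin_unlock(&imp->imp_lock);
        }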

Bruno, I have other questions/requests... Do you still frequently experience this problem during MDT fail-over/fail-back? If yes, could you try integrating the two patches for LU-6802? Also, can you provide the full dmesg (from the "log" crash sub-command output) from your crash dump?

Comment by Bruno Faccini (Inactive) [ 28/Aug/15 ]

I have spent some time on-site looking at a bunch of Lustre client crash dumps taken during recovery after MDS/MDT failure/restart, and here are more facts/details:
_ the problem can occur during MDS fail-over or fail-back operations, but also during an MDS reboot in a non-HA configuration or an MDT target re-start/mount.
_ during a single occurrence, one or multiple (sometimes >200!) Lustre clients can be affected/crash, but there have also been occurrences without any client crash.
_ client crashes are very often GPFs of the same kind as described in this ticket, with multiple different invalid values of imp->imp_replay_cursor, but there are also a few cases of LBUGs or kernel BUG()s that I analyzed on-site; they too come only from imp->imp_replay_cursor having been corrupted with a wrong (e.g., not pointing to a ptlrpc_request struct as expected) but valid (from a kernel virtual-memory addressing point of view) pointer.
_ the configured clients' Lustre debug mask is only WARN+EMERG+ERROR+CONSOLE, so the Lustre debug log extracted from each crash dump does not help identify whether the imp->imp_replay_cursor corruption comes from a race during imp_committed_list access or from use of an out-dated value. Instructions have been given so that, upon a new MDS/MDT re-start occurrence, the full/-1 debug mask is enabled in order to capture the full recovery process.

Based on all of this, I still feel pretty confident that, as I already indicated, the two patches for LU-6802 are likely to be the fix for this ticket.

Comment by Andreas Dilger [ 13/Oct/15 ]

Should this be closed as a duplicate of LU-6802?

Comment by Jinshan Xiong (Inactive) [ 02/Nov/15 ]

LU-6802
