[LU-12853] general protection fault: 0000 RIP: keys_fill Created: 14/Oct/19  Updated: 30/Jan/20  Resolved: 06/Dec/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: Lustre 2.14.0, Lustre 2.12.4

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The reproducer is replay-dual test 25; it hits this only at a low rate.

[  324.426593] general protection fault: 0000 [#1] SMP 
[  324.431986] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) ...
[  324.477340] CPU: 0 PID: 11422 Comm: tgt_recover_0 Tainted: G           OE  ------------   3.10.0-693.21.1.x3.2.152.x86_64 #1
[  324.482771] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  324.486161] task: ffff88009e422f70 ti: ffff88009e428000 task.ti: ffff88009e428000
[  324.490071] RIP: 0010:[<ffffffffc08b77cc>]  [<ffffffffc08b77cc>] keys_fill+0x5c/0x180 [obdclass]
[  324.494522] RSP: 0018:ffff88009e42bad0  EFLAGS: 00010246
[  324.497767] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: ffff88009e42bfd8
[  324.501529] RDX: ffff88009e42baf8 RSI: 0000000000000002 RDI: ffffffffc091d100
[  324.505262] RBP: ffff88009e42baf0 R08: 000000000001b940 R09: ffff88013b001b00
[  324.508969] R10: ffffffffc0ff8a17 R11: ffff88013586f400 R12: ffffffffc091d1c0
[  324.512676] R13: ffff88009e4bb250 R14: 0000000000000014 R15: ffff8800bbbcf4c8
[  324.517164] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[  324.524156] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  324.529233] CR2: 00007fd17b3053e4 CR3: 00000000bb84c000 CR4: 00000000000006f0
[  324.532950] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  324.537805] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  324.543260] Call Trace:
[  324.546087]  [<ffffffffc08bc351>] lu_context_refill+0x41/0x50 [obdclass]
[  324.551161]  [<ffffffffc08bc3e4>] lu_env_refill+0x24/0x30 [obdclass]
[  324.554887]  [<ffffffffc0ff8ab1>] ofd_lvbo_init+0x2b1/0x8ad [ofd]
[  324.559344]  [<ffffffffc0b1e6a0>] ldlm_server_completion_ast+0x600/0x990 [ptlrpc]
[  324.565333]  [<ffffffffc0b1e0a0>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[  324.569117]  [<ffffffffc0aefd08>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[  324.572619]  [<ffffffffc0b38452>] ptlrpc_set_wait+0x72/0x790 [ptlrpc]
[  324.576416]  [<ffffffff811e4d1d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[  324.580625]  [<ffffffffc089a389>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[  324.584984]  [<ffffffffc0aefc60>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[  324.589275]  [<ffffffffc0b2ec92>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
[  324.593275]  [<ffffffffc0af52a5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[  324.597178]  [<ffffffffc0af682e>] __ldlm_reprocess_all+0x10e/0x350 [ptlrpc]
[  324.600564]  [<ffffffffc0af6dd6>] ldlm_reprocess_res+0x26/0x30 [ptlrpc]
[  324.603862]  [<ffffffffc07423f0>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[  324.607537]  [<ffffffffc0af6db0>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[  324.612088]  [<ffffffffc0af6db0>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[  324.617688]  [<ffffffffc0745785>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[  324.622078]  [<ffffffffc0af6e1c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[  324.626371]  [<ffffffffc0b0a3f1>] target_recovery_thread+0xd21/0x11d0 [ptlrpc]
[  324.630650]  [<ffffffffc0b096d0>] ? replay_request_or_update.isra.23+0x8c0/0x8c0 [ptlrpc]
[  324.635262]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[  324.638630]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[  324.642747]  [<ffffffff816c4577>] ret_from_fork+0x77/0xb0
[  324.646459]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[  324.650524] Code: 5b 59 06 00 0f 1f 00 31 db eb 15 0f 1f 40 00 48 83 c
 

I think the situation was as follows:
handle_recovery_req() uses the environment in the following way:

       req->rq_session.lc_thread = thread;
       req->rq_svc_thread = thread;
       req->rq_svc_thread->t_env->le_ses = &req->rq_session;
       /* thread context */
       lu_context_enter(&thread->t_env->le_ctx);
       (void)handler(req);
       lu_context_exit(&thread->t_env->le_ctx); (edited) 

After that, the request could be freed, and env->le_ses would be left pointing at freed memory.
Later, ofd_lvbo_init() took an env from the per-CPU cache and did a refill; lu_context_refill() found le_ses was non-NULL and tried keys_fill() on it, but that memory had already been freed / was invalid:

 ofd_lvbo_init() -> env = lu_env_find(); info = ofd_info(env) ->
     lu_env_refill((void *)env) -> lu_context_refill(env->le_ses) -> keys_fill(ctx)


 Comments   
Comment by Gerrit Updater [ 14/Oct/19 ]

Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/36443
Subject: LU-12853 ptlrpc: zero session enviroment
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5bff99700e3510841f4d7cae6dafe51b383299ce

Comment by Sarah Liu [ 16/Oct/19 ]

soak hit a similar problem on one OSS after running for 6 days (2.12.3 build #64)

[ 7493.833095] Lustre: soaked-OST0000: Connection restored to  (at 192.168.1.111@o2ib)
[ 7493.841688] Lustre: Skipped 9 previous similar messages
[ 7497.798043] Lustre: soaked-OST000c: Will be in recovery for at least 2:30, or until 8 clients reconnect
[ 7499.877838] Lustre: soaked-OST0004: Recovery over after 0:13, of 8 clients 8 recovered and 0 were evicted.
[ 7499.878758] Lustre: soaked-OST0004: deleting orphan objects from 0x400000401:4535723 to 0x400000401:4535860
[ 7499.880166] Lustre: soaked-OST0004: deleting orphan objects from 0x400000400:8181547 to 0x400000400:8183436
[ 7499.911433] general protection fault: 0000 [#1] SMP
[ 7499.917001] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE
)[ 7499.932009] Lustre: soaked-OST0004: deleting orphan objects from 0x400000402:4595225 to 0x400000402:4597207
 libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver[ 7499.947802] Lustre: soaked-OST0004: deleting orphan objects from 0x0:6331191 to 0x0:6331193
 nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) iTCO_
wdt iTCO_vendor_support dm_round_robin sb_edac zfs(POE) intel_powerclamp coretemp zunicode(POE) zavl(POE) intel_rapl iosf_mbi icp(POE) kvm irqbypass crc32_pclm
ul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd zcommon(POE) pcspkr znvpair(POE) spl(OE) ses enclosure lpc_ich joydev ipmi_ssif 
sg mei_me i2c_i801 mei ioatdma ipmi_si wmi ipmi_devintf ipmi_msghandler pcc_cpufreq dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod c
rc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm isci igb ahci drm mlx4_core(OE) libsas libahci ptp mpt2sas pps_core devlink crct10dif_pclmul crct10dif_common raid_class dca crc32c_intel libata mlx_compat(OE) scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit
[ 7500.055381] CPU: 12 PID: 65100 Comm: tgt_recover_4 Kdump: loaded Tainted: P           OE  ------------   3.10.0-1062.1.1.el7_lustre.x86_64 #1
[ 7500.069552] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 7500.082077] task: ffff8e04b9a6e2a0 ti: ffff8e042d19c000 task.ti: ffff8e042d19c000
[ 7500.090430] RIP: 0010:[<ffffffffc1035f4c>]  [<ffffffffc1035f4c>] keys_fill+0x5c/0x180 [obdclass]
[ 7500.100283] RSP: 0018:ffff8e042d19fac0  EFLAGS: 00010246
[ 7500.106212] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: ffff8e042d19ffd8
[ 7500.114176] RDX: ffff8e042d19fae8 RSI: 0000000000000002 RDI: ffffffffc109b180
[ 7500.122141] RBP: ffff8e042d19fae0 R08: 000000000001f0a0 R09: ffff8e023fc07b00
[ 7500.130116] R10: ffffffffc16e5d17 R11: ffff8e0865767400 R12: ffffffffc109b240
[ 7500.138081] R13: ffff8e04923c6120 R14: 0000000000000013 R15: ffff8e08263ca2c8
[ 7500.146047] FS:  0000000000000000(0000) GS:ffff8e08ee300000(0000) knlGS:0000000000000000
[ 7500.155079] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7500.161493] CR2: 00000000006dee20 CR3: 000000072e010000 CR4: 00000000000607e0
[ 7500.169474] Call Trace:
[ 7500.172241]  [<ffffffffc103b021>] lu_context_refill+0x41/0x50 [obdclass]
[ 7500.179766]  [<ffffffffc103b0b4>] lu_env_refill+0x24/0x30 [obdclass]
[ 7500.186871]  [<ffffffffc16e5db1>] ofd_lvbo_init+0x2a1/0x7f0 [ofd]
[ 7500.193748]  [<ffffffffc1339ecf>] ? req_capsule_shrink+0xff/0x240 [ptlrpc]
[ 7500.201467]  [<ffffffffc12ec58d>] ldlm_server_completion_ast+0x5fd/0x980 [ptlrpc]
[ 7500.209864]  [<ffffffffc12ebf90>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[ 7500.218257]  [<ffffffffc12bec58>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[ 7500.226091]  [<ffffffffc13069b2>] ptlrpc_set_wait+0x72/0x790 [ptlrpc]
[ 7500.233277]  [<ffffffff85c2426d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[ 7500.241086]  [<ffffffffc1018a69>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 7500.249092]  [<ffffffffc12bebb0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[ 7500.257199]  [<ffffffffc12fd232>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
[ 7500.264634]  [<ffffffffc12c4065>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[ 7500.272064]  [<ffffffffc12c553f>] __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
[ 7500.279875]  [<ffffffffc12c5ae8>] ldlm_reprocess_res+0x28/0x30 [ptlrpc]
[ 7500.287288]  [<ffffffffc0eddfb0>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[ 7500.295387]  [<ffffffffc12c5ac0>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[ 7500.303790]  [<ffffffffc12c5ac0>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[ 7500.312157]  [<ffffffffc0ee1345>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[ 7500.320258]  [<ffffffffc12c5b2c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[ 7500.328758]  [<ffffffffc12d85b1>] target_recovery_thread+0xcd1/0x1160 [ptlrpc]
[ 7500.336871]  [<ffffffffc12d78e0>] ? replay_request_or_update.isra.24+0x8c0/0x8c0 [ptlrpc]
[ 7500.346026]  [<ffffffff85ac50d1>] kthread+0xd1/0xe0
[ 7500.351473]  [<ffffffff85ac5000>] ? insert_kthread_work+0x40/0x40
[ 7500.358279]  [<ffffffff8618cd37>] ret_from_fork_nospec_begin+0x21/0x21
[ 7500.365567]  [<ffffffff85ac5000>] ? insert_kthread_work+0x40/0x40
[ 7500.372368] Code: 5b 52 06 00 0f 1f 00 31 db eb 15 0f 1f 40 00 48 83 c3 08 48 81 fb 40 01 00 00 0f 84 9f 00 00 00 49 8b 45 10 4c 8b a3 20 31 16 c1 <48> 83 3c 18 00 75 dd 4d 85 e4 74 d8 41 8b 04 24 41 8b 55 00 85 
[ 7500.394254] RIP  [<ffffffffc1035f4c>] keys_fill+0x5c/0x180 [obdclass]
[ 7500.401496]  RSP <ffff8e042d19fac0>
Comment by Gerrit Updater [ 06/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36443/
Subject: LU-12853 ptlrpc: zero session enviroment
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2a620f07e23b3b044f429f049bcc5ffa96f6d844

Comment by Peter Jones [ 06/Dec/19 ]

Landed for 2.14

Comment by Gerrit Updater [ 22/Jan/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37305
Subject: LU-12853 ptlrpc: zero session enviroment
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 31f04e772c1a62bf210bc65ef1acf4ea6db823ee

Comment by Gerrit Updater [ 27/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37305/
Subject: LU-12853 ptlrpc: zero session enviroment
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: f8e915bc0946772ab9c92a8e3c37211716cbd395

Generated at Sat Feb 10 02:56:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.