[LU-12853] general protection fault: 0000 RIP: keys_fill Created: 14/Oct/19 Updated: 30/Jan/20 Resolved: 06/Dec/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The reproducer is replay-dual test_25, which hits this at a low rate.

[ 324.426593] general protection fault: 0000 [#1] SMP
[ 324.431986] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) ...
[ 324.477340] CPU: 0 PID: 11422 Comm: tgt_recover_0 Tainted: G OE ------------ 3.10.0-693.21.1.x3.2.152.x86_64 #1
[ 324.482771] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 324.486161] task: ffff88009e422f70 ti: ffff88009e428000 task.ti: ffff88009e428000
[ 324.490071] RIP: 0010:[<ffffffffc08b77cc>] [<ffffffffc08b77cc>] keys_fill+0x5c/0x180 [obdclass]
[ 324.494522] RSP: 0018:ffff88009e42bad0 EFLAGS: 00010246
[ 324.497767] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: ffff88009e42bfd8
[ 324.501529] RDX: ffff88009e42baf8 RSI: 0000000000000002 RDI: ffffffffc091d100
[ 324.505262] RBP: ffff88009e42baf0 R08: 000000000001b940 R09: ffff88013b001b00
[ 324.508969] R10: ffffffffc0ff8a17 R11: ffff88013586f400 R12: ffffffffc091d1c0
[ 324.512676] R13: ffff88009e4bb250 R14: 0000000000000014 R15: ffff8800bbbcf4c8
[ 324.517164] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[ 324.524156] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 324.529233] CR2: 00007fd17b3053e4 CR3: 00000000bb84c000 CR4: 00000000000006f0
[ 324.532950] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 324.537805] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 324.543260] Call Trace:
[ 324.546087] [<ffffffffc08bc351>] lu_context_refill+0x41/0x50 [obdclass]
[ 324.551161] [<ffffffffc08bc3e4>] lu_env_refill+0x24/0x30 [obdclass]
[ 324.554887] [<ffffffffc0ff8ab1>] ofd_lvbo_init+0x2b1/0x8ad [ofd]
[ 324.559344] [<ffffffffc0b1e6a0>] ldlm_server_completion_ast+0x600/0x990 [ptlrpc]
[ 324.565333] [<ffffffffc0b1e0a0>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[ 324.569117] [<ffffffffc0aefd08>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[ 324.572619] [<ffffffffc0b38452>] ptlrpc_set_wait+0x72/0x790 [ptlrpc]
[ 324.576416] [<ffffffff811e4d1d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[ 324.580625] [<ffffffffc089a389>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 324.584984] [<ffffffffc0aefc60>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[ 324.589275] [<ffffffffc0b2ec92>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
[ 324.593275] [<ffffffffc0af52a5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[ 324.597178] [<ffffffffc0af682e>] __ldlm_reprocess_all+0x10e/0x350 [ptlrpc]
[ 324.600564] [<ffffffffc0af6dd6>] ldlm_reprocess_res+0x26/0x30 [ptlrpc]
[ 324.603862] [<ffffffffc07423f0>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[ 324.607537] [<ffffffffc0af6db0>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[ 324.612088] [<ffffffffc0af6db0>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[ 324.617688] [<ffffffffc0745785>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[ 324.622078] [<ffffffffc0af6e1c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[ 324.626371] [<ffffffffc0b0a3f1>] target_recovery_thread+0xd21/0x11d0 [ptlrpc]
[ 324.630650] [<ffffffffc0b096d0>] ? replay_request_or_update.isra.23+0x8c0/0x8c0 [ptlrpc]
[ 324.635262] [<ffffffff810b4031>] kthread+0xd1/0xe0
[ 324.638630] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[ 324.642747] [<ffffffff816c4577>] ret_from_fork+0x77/0xb0
[ 324.646459] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[ 324.650524] Code: 5b 59 06 00 0f 1f 00 31 db eb 15 0f 1f 40 00 48 83 c

I think the situation was:

req->rq_session.lc_thread = thread;
req->rq_svc_thread = thread;
req->rq_svc_thread->t_env->le_ses = &req->rq_session;
/* thread context */
lu_context_enter(&thread->t_env->le_ctx);
(void)handler(req);
lu_context_exit(&thread->t_env->le_ctx);
After that, the request could be freed, leaving env->le_ses pointing at freed memory (RAX = 5a5a5a5a5a5a5a5a in the register dump is a memory-poison pattern, consistent with a read of freed data). The crash path is:

ofd_lvbo_init()
-> env = lu_env_find(); info = ofd_info(env);
-> lu_env_refill((void *)env)
-> lu_context_refill(env->le_ses)
-> keys_fill(ctxt)
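To make the lifetime problem concrete, here is a minimal user-space sketch of the same pattern. The types are simplified stand-ins, not the real Lustre definitions, and the handler body is elided; it only illustrates why keys_fill() ends up reading poisoned memory:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the Lustre structures. */
struct lu_context { void *lc_thread; };
struct ptlrpc_request { struct lu_context rq_session; };
struct lu_env { struct lu_context *le_ses; };

int main(void)
{
	struct lu_env thread_env = { NULL };
	struct ptlrpc_request *req = malloc(sizeof(*req));

	/* Handler setup: the per-thread env caches a pointer into the request. */
	thread_env.le_ses = &req->rq_session;

	/* handler(req) runs and returns; the request is then finalized.
	 * With a debug allocator the freed object is poisoned (e.g. 0x5a). */
	free(req);

	/* A later lu_env_refill()-style walk dereferences the stale pointer:
	 * this is the use-after-free that keys_fill() trips over. */
	printf("%p\n", thread_env.le_ses->lc_thread);
	return 0;
}

Whatever the exact fix, env->le_ses must not outlive the rq_session it points into; the patches linked in the comments below address this. |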
| Comments |
| Comment by Gerrit Updater [ 14/Oct/19 ] |
|
Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/36443 |
| Comment by Sarah Liu [ 16/Oct/19 ] |
|
Soak testing hit a similar problem on one OSS after running for 6 days (2.12.3 build #64):

[ 7493.833095] Lustre: soaked-OST0000: Connection restored to (at 192.168.1.111@o2ib)
[ 7493.841688] Lustre: Skipped 9 previous similar messages
[ 7497.798043] Lustre: soaked-OST000c: Will be in recovery for at least 2:30, or until 8 clients reconnect
[ 7499.877838] Lustre: soaked-OST0004: Recovery over after 0:13, of 8 clients 8 recovered and 0 were evicted.
[ 7499.878758] Lustre: soaked-OST0004: deleting orphan objects from 0x400000401:4535723 to 0x400000401:4535860
[ 7499.880166] Lustre: soaked-OST0004: deleting orphan objects from 0x400000400:8181547 to 0x400000400:8183436
[ 7499.911433] general protection fault: 0000 [#1] SMP
[ 7499.917001] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE)
[ 7499.932009] Lustre: soaked-OST0004: deleting orphan objects from 0x400000402:4595225 to 0x400000402:4597207
libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver
[ 7499.947802] Lustre: soaked-OST0004: deleting orphan objects from 0x0:6331191 to 0x0:6331193
nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) iTCO_wdt iTCO_vendor_support dm_round_robin sb_edac zfs(POE) intel_powerclamp coretemp zunicode(POE) zavl(POE) intel_rapl iosf_mbi icp(POE) kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd zcommon(POE) pcspkr znvpair(POE) spl(OE) ses enclosure lpc_ich joydev ipmi_ssif sg mei_me i2c_i801 mei ioatdma ipmi_si wmi ipmi_devintf ipmi_msghandler pcc_cpufreq dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm isci igb ahci drm mlx4_core(OE) libsas libahci ptp mpt2sas pps_core devlink crct10dif_pclmul crct10dif_common raid_class dca crc32c_intel libata mlx_compat(OE) scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit
[ 7500.055381] CPU: 12 PID: 65100 Comm: tgt_recover_4 Kdump: loaded Tainted: P OE ------------ 3.10.0-1062.1.1.el7_lustre.x86_64 #1
[ 7500.069552] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 7500.082077] task: ffff8e04b9a6e2a0 ti: ffff8e042d19c000 task.ti: ffff8e042d19c000
[ 7500.090430] RIP: 0010:[<ffffffffc1035f4c>] [<ffffffffc1035f4c>] keys_fill+0x5c/0x180 [obdclass]
[ 7500.100283] RSP: 0018:ffff8e042d19fac0 EFLAGS: 00010246
[ 7500.106212] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: ffff8e042d19ffd8
[ 7500.114176] RDX: ffff8e042d19fae8 RSI: 0000000000000002 RDI: ffffffffc109b180
[ 7500.122141] RBP: ffff8e042d19fae0 R08: 000000000001f0a0 R09: ffff8e023fc07b00
[ 7500.130116] R10: ffffffffc16e5d17 R11: ffff8e0865767400 R12: ffffffffc109b240
[ 7500.138081] R13: ffff8e04923c6120 R14: 0000000000000013 R15: ffff8e08263ca2c8
[ 7500.146047] FS: 0000000000000000(0000) GS:ffff8e08ee300000(0000) knlGS:0000000000000000
[ 7500.155079] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7500.161493] CR2: 00000000006dee20 CR3: 000000072e010000 CR4: 00000000000607e0
[ 7500.169474] Call Trace:
[ 7500.172241] [<ffffffffc103b021>] lu_context_refill+0x41/0x50 [obdclass]
[ 7500.179766] [<ffffffffc103b0b4>] lu_env_refill+0x24/0x30 [obdclass]
[ 7500.186871] [<ffffffffc16e5db1>] ofd_lvbo_init+0x2a1/0x7f0 [ofd]
[ 7500.193748] [<ffffffffc1339ecf>] ? req_capsule_shrink+0xff/0x240 [ptlrpc]
[ 7500.201467] [<ffffffffc12ec58d>] ldlm_server_completion_ast+0x5fd/0x980 [ptlrpc]
[ 7500.209864] [<ffffffffc12ebf90>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[ 7500.218257] [<ffffffffc12bec58>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[ 7500.226091] [<ffffffffc13069b2>] ptlrpc_set_wait+0x72/0x790 [ptlrpc]
[ 7500.233277] [<ffffffff85c2426d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[ 7500.241086] [<ffffffffc1018a69>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 7500.249092] [<ffffffffc12bebb0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[ 7500.257199] [<ffffffffc12fd232>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
[ 7500.264634] [<ffffffffc12c4065>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[ 7500.272064] [<ffffffffc12c553f>] __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
[ 7500.279875] [<ffffffffc12c5ae8>] ldlm_reprocess_res+0x28/0x30 [ptlrpc]
[ 7500.287288] [<ffffffffc0eddfb0>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[ 7500.295387] [<ffffffffc12c5ac0>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[ 7500.303790] [<ffffffffc12c5ac0>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[ 7500.312157] [<ffffffffc0ee1345>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[ 7500.320258] [<ffffffffc12c5b2c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[ 7500.328758] [<ffffffffc12d85b1>] target_recovery_thread+0xcd1/0x1160 [ptlrpc]
[ 7500.336871] [<ffffffffc12d78e0>] ? replay_request_or_update.isra.24+0x8c0/0x8c0 [ptlrpc]
[ 7500.346026] [<ffffffff85ac50d1>] kthread+0xd1/0xe0
[ 7500.351473] [<ffffffff85ac5000>] ? insert_kthread_work+0x40/0x40
[ 7500.358279] [<ffffffff8618cd37>] ret_from_fork_nospec_begin+0x21/0x21
[ 7500.365567] [<ffffffff85ac5000>] ? insert_kthread_work+0x40/0x40
[ 7500.372368] Code: 5b 52 06 00 0f 1f 00 31 db eb 15 0f 1f 40 00 48 83 c3 08 48 81 fb 40 01 00 00 0f 84 9f 00 00 00 49 8b 45 10 4c 8b a3 20 31 16 c1 <48> 83 3c 18 00 75 dd 4d 85 e4 74 d8 41 8b 04 24 41 8b 55 00 85
[ 7500.394254] RIP [<ffffffffc1035f4c>] keys_fill+0x5c/0x180 [obdclass]
[ 7500.401496] RSP <ffff8e042d19fac0>
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct |
| Comment by Gerrit Updater [ 06/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36443/ |
| Comment by Peter Jones [ 06/Dec/19 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 22/Jan/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37305 |
| Comment by Gerrit Updater [ 27/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37305/ |