[LU-8316] BUG: unable to handle kernel NULL pointer dereference at tgt_free_reply_data+0x97/0x330 Created: 22/Jun/16  Updated: 11/May/20  Resolved: 11/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Yang Sheng Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-8199 NULL pointer dereference in tgt_free_... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

System crashed while testing under memory pressure:

[432534.561808] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
[432534.563083] Lustre: Skipped 3 previous similar messages
[432534.593088] BUG: unable to handle kernel NULL pointer dereference at           (null)
[432534.594035] IP: [<ffffffffa07d31f7>] tgt_free_reply_data+0x97/0x330 [ptlrpc]
[432534.594035] PGD 3c7cb067 PUD 3836e067 PMD 0 
[432534.594035] Oops: 0002 [#1] SMP 
[432534.594035] Modules linked in: lustre(OF) ofd(OF) osp(OF) lod(OF) ost(OF) mdt(OF) mdd(OF) mgs(OF) osd_ldiskfs(OF) ldiskfs(OF) lquota(OF) lfsck(OF) obdecho(OF) mgc(OF) lov(OF) osc(OF) mdc(OF) lmv(OF) fid(OF) fld(OF) ptlrpc(OF) obdclass(OF) ksocklnd(OF) lnet(OF) libcfs(OF) loop mbcache jbd2 sha512_generic netconsole sg dm_mirror dm_region_hash dm_log crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd serio_raw virtio_balloon virtio_console dm_mod intel_agp i2c_piix4 intel_gtt nfsd auth_rpcgss nfs_acl lockd sunrpc ip_tables xfs ata_generic libcrc32c virtio_net cirrus syscopyarea sysfillrect sysimgblt virtio_scsi drm_kms_helper virtio_blk ttm drm virtio_pci agpgart ata_piix virtio_ring libata virtio i2c_core [last unloaded: libcfs]
[432534.594035] CPU: 1 PID: 5669 Comm: mdt01_003 Tainted: GF          O--------------   3.10.0-229.7.2.x86_64 #7
[432534.594035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[432534.594035] task: ffff880025081580 ti: ffff88002da2c000 task.ti: ffff88002da2c000
[432534.594035] RIP: 0010:[<ffffffffa07d31f7>]  [<ffffffffa07d31f7>] tgt_free_reply_data+0x97/0x330 [ptlrpc]
[432534.594035] RSP: 0018:ffff88002da2fb90  EFLAGS: 00010293
[432534.594035] RAX: 0000000000000001 RBX: ffff8800133fb8d8 RCX: 0000000000000000
[432534.594035] RDX: 0000000000000000 RSI: ffff88001289f300 RDI: ffff8800133fb8d8
[432534.594035] RBP: ffff88002da2fbd8 R08: ffff8800133fb8d8 R09: 0000000000000000
[432534.594035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[432534.594035] R13: ffff880000e65718 R14: ffff88001289f3f8 R15: ffff88001289f300
[432534.594035] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[432534.594035] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[432534.594035] CR2: 0000000000bfc001 CR3: 000000003c289000 CR4: 00000000001406e0
[432534.594035] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[432534.594035] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[432534.594035] Stack:
[432534.594035]  0000000000000000 ffff88001289f3f8 ffffffff810c486d ffff88002da2fc30
[432534.594035]  ffff8800133fb8d8 ffff88001289f300 ffff88001289ef60 ffff88001289f3f8
[432534.594035]  ffff880000e65718 ffff88002da2fc30 ffffffffa07d34ee 0000000000000246
[432534.594035] Call Trace:
[432534.594035]  [<ffffffff810c486d>] ? trace_hardirqs_on+0xd/0x10
[432534.594035]  [<ffffffffa07d34ee>] tgt_release_reply_data+0x5e/0x180 [ptlrpc]
[432534.594035]  [<ffffffffa07dc128>] tgt_handle_received_xid+0x98/0xe0 [ptlrpc]
[432534.594035]  [<ffffffffa07e1d38>] tgt_request_handle+0xb88/0x1330 [ptlrpc]
[432534.594035]  [<ffffffffa078d591>] ptlrpc_server_handle_request+0x231/0xac0 [ptlrpc]
[432534.594035]  [<ffffffffa078be15>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[432534.594035]  [<ffffffffa0791790>] ptlrpc_main+0xab0/0x1e10 [ptlrpc]
[432534.594035]  [<ffffffff810c486d>] ? trace_hardirqs_on+0xd/0x10
[432534.594035]  [<ffffffff8109b842>] ? finish_task_switch+0x42/0x150
[432534.594035]  [<ffffffffa0790ce0>] ? ptlrpc_register_service+0xe50/0xe50 [ptlrpc]
[432534.594035]  [<ffffffff8109008a>] kthread+0xea/0xf0
[432534.594035]  [<ffffffff8108ffa0>] ? kthread_create_on_node+0x140/0x140
[432534.594035]  [<ffffffff81571258>] ret_from_fork+0x58/0x90
[432534.594035]  [<ffffffff8108ffa0>] ? kthread_create_on_node+0x140/0x140
[432534.594035] Code: c1 fa 1f c1 ea 0c c1 f9 14 41 8d 04 14 25 ff ff 0f 00 29 d0 83 f9 0f 0f 8f 72 02 00 00 49 8b 95 28 04 00 00 48 63 c9 48 8b 14 ca <f0> 0f b3 02 19 c0 85 c0 0f 84 8b 01 00 00 48 85 db 0f 84 1b 02 
[432534.594035] RIP  [<ffffffffa07d31f7>] tgt_free_reply_data+0x97/0x330 [ptlrpc]
[432534.594035]  RSP <ffff88002da2fb90>
[432534.594035] CR2: 0000000000000000
[432534.712915] ---[ end trace 26ac593d02d07dd0 ]---
[432534.714120] Kernel panic - not syncing: Fatal exception

This issue is caused by an incorrect return value on the allocation-failure path below (rc is not set to an error code before the GOTO):

        /* reply_data is supported by MDT targets only for now */
        if (strncmp(obd->obd_type->typ_name, LUSTRE_MDT_NAME, 3) != 0)
                RETURN(0);

        OBD_ALLOC(lut->lut_reply_bitmap,
                  LUT_REPLY_SLOTS_MAX_CHUNKS * sizeof(unsigned long *));
        if (lut->lut_reply_bitmap == NULL)
                GOTO(out, rc);
-----------------------------^^^

        memset(&attr, 0, sizeof(attr));
        attr.la_valid = LA_MODE;

I'll push a patch for it.
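The failure mode above can be illustrated with a minimal user-space sketch. It is not the Lustre code: `struct fake_target`, `alloc_or_fail`, `init_buggy`, and `init_fixed` are hypothetical names, and the assumption that rc is still 0 at the point of the failed allocation is inferred from the snippet and the patch subject ("return -ENOMEM while kmalloc failed"). It only shows why returning an unset rc leaves callers with a NULL bitmap they believe is valid.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the target; only the bitmap pointer is modeled. */
struct fake_target {
	unsigned long **reply_bitmap;
};

/* Hypothetical allocator that can be forced to fail, mimicking memory pressure. */
static void *alloc_or_fail(size_t size, int force_fail)
{
	return force_fail ? NULL : calloc(1, size);
}

/* Buggy pattern: rc is still 0 when the allocation fails. */
static int init_buggy(struct fake_target *t, int force_fail)
{
	int rc = 0;

	t->reply_bitmap = alloc_or_fail(16 * sizeof(unsigned long *), force_fail);
	if (t->reply_bitmap == NULL)
		goto out;	/* returns 0, so the caller assumes success */
out:
	return rc;
}

/* Fixed pattern: report the failure so the caller never uses the NULL bitmap. */
static int init_fixed(struct fake_target *t, int force_fail)
{
	int rc = 0;

	t->reply_bitmap = alloc_or_fail(16 * sizeof(unsigned long *), force_fail);
	if (t->reply_bitmap == NULL) {
		rc = -ENOMEM;
		goto out;
	}
out:
	return rc;
}
```

With a forced allocation failure, `init_buggy` returns 0 while `reply_bitmap` is NULL, so any later code that indexes the bitmap dereferences NULL; `init_fixed` returns -ENOMEM instead, letting the caller abort setup.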



 Comments   
Comment by Gerrit Updater [ 22/Jun/16 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/20918
Subject: LU-8316 tgt: return -ENOMEM while kmalloc failed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bfff1305da9b03766b253631a729a0007fbd6782

Comment by Oleg Drokin [ 23/Jun/16 ]

Must be a dup of LU-8199, which also has a patch?

Comment by Yang Sheng [ 29/Jun/16 ]

Hi, Oleg,

Yes, I think it is almost a dup of LU-8199. But the LU-8199 patch is an improvement patch, while this one is a bug-fix patch. They do not conflict.

Comment by Gerrit Updater [ 05/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20918/
Subject: LU-8316 tgt: return -ENOMEM while kmalloc failed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 36451f3be0bbc7858d848b5b6ee6e9133de9115d

Comment by Yang Sheng [ 06/Jul/16 ]

Patch landed. Close ticket.

Comment by Oleg Drokin [ 24/Apr/18 ]

I just had this hit again on current master-next.

[178184.101361] Lustre: DEBUG MARKER: == replay-single test 39: test recovery from unlink llog (test llog_gen_rec) ========================= 02:06:10 (1524377170)
[178187.227063] Turning device loop0 (0x700000) read-only
[178187.304890] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[178187.327340] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[178189.095109] LustreError: 25219:0:(client.c:1147:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8802930a6c00 x1598423905600144/t0(0) o6->lustre-OST0000-osc-MDT0000@0@lo:28/4 lens 664/432 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[178189.160680] BUG: unable to handle kernel NULL pointer dereference at           (null)
[178189.162062] IP: [<ffffffffa06557d3>] tgt_free_reply_data+0x93/0x370 [ptlrpc]
[178189.163275] PGD 0 
[178189.163867] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[178189.164595] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate mbcache jbd2 syscopyarea sysfillrect sysimgblt ata_generic ttm pata_acpi drm_kms_helper drm ata_piix i2c_piix4 libata pcspkr serio_raw virtio_balloon virtio_blk virtio_console i2c_core floppy nfsd ip_tables rpcsec_gss_krb5 [last unloaded: libcfs]
[178189.172341] CPU: 3 PID: 143 Comm: kworker/3:1 Tainted: P           OE  ------------   3.10.0-debug #2
[178189.176342] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[178189.177281] Workqueue: obd_zombid obd_zombie_exp_cull [obdclass]
[178189.177975] task: ffff88032107e4c0 ti: ffff880321084000 task.ti: ffff880321084000
[178189.182094] RIP: 0010:[<ffffffffa06557d3>]  [<ffffffffa06557d3>] tgt_free_reply_data+0x93/0x370 [ptlrpc]
[178189.183637] RSP: 0018:ffff880321087c68  EFLAGS: 00010293
[178189.184412] RAX: 0000000000000000 RBX: ffff88025d2b5500 RCX: 0000000000000000
[178189.185641] RDX: 0000000000000000 RSI: ffff8800a9644be0 RDI: ffff88025d2b5500
[178189.187161] RBP: ffff880321087cb0 R08: ffff88025d2b5500 R09: 0000000000000000
[178189.188393] R10: 0000000000000000 R11: ffff88028dce37e0 R12: 0000000000000000
[178189.189623] R13: ffff88029383c0b0 R14: ffff88029383c0b0 R15: ffff8800a9644be0
[178189.190921] FS:  0000000000000000(0000) GS:ffff88033e460000(0000) knlGS:0000000000000000
[178189.192189] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[178189.192849] CR2: 0000000000000000 CR3: 0000000001c0e000 CR4: 00000000000006e0
[178189.194304] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[178189.195545] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[178189.196776] Stack:
[178189.197399]  ffff88032107e4c0 ffff8800a9644be0 0000000000000000 ffff880321087fd8
[178189.199036]  ffff88027eaa9400 ffff8800a9644be0 ffff8800a9644be0 ffff88029383c0b0
[178189.201390]  ffff88029383c0b0 ffff880321087d08 ffffffffa0655b38 ffff880321087d10
[178189.207062] Call Trace:
[178189.207804]  [<ffffffffa0655b38>] tgt_release_reply_data+0x88/0x180 [ptlrpc]
[178189.208621]  [<ffffffffa02183d8>] ? cfs_hash_putref+0x2e8/0x500 [libcfs]
[178189.209388]  [<ffffffffa06562e1>] tgt_client_free+0x81/0x360 [ptlrpc]
[178189.210344]  [<ffffffffa0cda13a>] mdt_destroy_export+0x5a/0x200 [mdt]
[178189.211100]  [<ffffffffa0395815>] class_export_destroy+0xe5/0x490 [obdclass]
[178189.211914]  [<ffffffffa0395bd5>] obd_zombie_exp_cull+0x15/0x20 [obdclass]
[178189.212897]  [<ffffffff8109adb6>] process_one_work+0x206/0x5b0
[178189.213660]  [<ffffffff8109ad4b>] ? process_one_work+0x19b/0x5b0
[178189.214358]  [<ffffffff8109b27b>] worker_thread+0x11b/0x3a0
[178189.215037]  [<ffffffff8109b160>] ? process_one_work+0x5b0/0x5b0
[178189.215718]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[178189.216382]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[178189.217103]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[178189.217805]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[178189.251576] Code: 41 0f 49 cc c1 fa 1f c1 ea 0c c1 f9 14 41 8d 04 14 25 ff ff 0f 00 29 d0 83 f9 0f 0f 8f b1 02 00 00 49 8b 95 58 04 00 00 48 63 c9 <48> 8b 14 ca 48 85 d2 0f 84 cf 01 00 00 f0 0f b3 02 19 c0 85 c0 

Comment by Oleg Drokin [ 24/Apr/18 ]

Seems to be still present.

Comment by Oleg Drokin [ 11/May/20 ]

It seems this has not recurred in my testing since Apr 23, 2018.
