[LU-4698] obdfilter-survey test_3a crash on OST: tgt_request_handle Created: 03/Mar/14  Updated: 22/Dec/15  Resolved: 07/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Nathaniel Clark Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: None
Environment:

patch on master, single MDS/MGS, single OSS, two OSTs


Issue Links:
Duplicate
is duplicated by LU-5677 unable to handle kernel NULL pointer ... Closed
Rank (Obsolete): 12920

 Description   

During a "normal" run of obdfilter-survey test 3a, the OST node hit a kernel oops in tgt_request_handle.
OST console log:

Mar  3 13:19:14 oss1 kernel: Lustre: Echo OBD driver; http://www.lustre.org/
Mar  3 13:19:17 oss1 kernel: LustreError: 5569:0:(sec_config.c:93:sptlrpc_target_sec_part()) unknown target ffff88001568e038(obdecho)
Mar  3 13:19:17 oss1 kernel: BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
Mar  3 13:19:17 oss1 kernel: IP: [<ffffffffa15db6a5>] tgt_request_handle+0x125/0xac0 [ptlrpc]
Mar  3 13:19:17 oss1 kernel: PGD 0 
Mar  3 13:19:17 oss1 kernel: Oops: 0000 [#1] SMP 
Mar  3 13:19:17 oss1 kernel: last sysfs file: /sys/module/ptlrpc/initstate
Mar  3 13:19:17 oss1 kernel: CPU 0 
Mar  3 13:19:17 oss1 kernel: Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) nodemap(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) lfsck(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) exportfs jbd sha512_generic sha256_generic crc32c_intel nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipv6 ppdev parport_pc parport zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate btusb bluetooth rfkill snd_ens1371 snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 vmware_balloon sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom mptspi mptscsih mptbase scsi_transport_spi pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
Mar  3 13:19:17 oss1 kernel: 
Mar  3 13:19:17 oss1 kernel: Pid: 5569, comm: ll_ost00_002 Tainted: P           ---------------    2.6.32-431.5.1.el6_lustre.gb1c0d36.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
Mar  3 13:19:17 oss1 kernel: RIP: 0010:[<ffffffffa15db6a5>]  [<ffffffffa15db6a5>] tgt_request_handle+0x125/0xac0 [ptlrpc]
Mar  3 13:19:17 oss1 kernel: RSP: 0018:ffff8800287f7d50  EFLAGS: 00010246
Mar  3 13:19:17 oss1 kernel: RAX: ffff88002bf41400 RBX: ffff88002bdaec00 RCX: 0000000000000001
Mar  3 13:19:17 oss1 kernel: RDX: ffff88002bf41540 RSI: 0000000000000000 RDI: ffff88002bf41540
Mar  3 13:19:17 oss1 kernel: RBP: ffff8800287f7da0 R08: 0000000000000000 R09: ffffffff81645de0
Mar  3 13:19:17 oss1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800287c7240
Mar  3 13:19:17 oss1 kernel: R13: 0000000000000008 R14: ffff88002bdaef70 R15: 0000000000000000
Mar  3 13:19:17 oss1 kernel: FS:  0000000000000000(0000) GS:ffff880003200000(0000) knlGS:0000000000000000
Mar  3 13:19:17 oss1 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Mar  3 13:19:17 oss1 kernel: CR2: 000000000000001c CR3: 0000000027843000 CR4: 00000000000007f0
Mar  3 13:19:17 oss1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar  3 13:19:17 oss1 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar  3 13:19:17 oss1 kernel: Process ll_ost00_002 (pid: 5569, threadinfo ffff8800287f6000, task ffff88002921a080)
Mar  3 13:19:17 oss1 kernel: Stack:
Mar  3 13:19:17 oss1 kernel: 0000000000000010 ffff8800287f7d01 ffff8800287f7d70 ffff880013a7c000
Mar  3 13:19:17 oss1 kernel: <d> ffff880013a7c000 ffff88002b747540 ffff880015218400 ffff88002bdaec00
Mar  3 13:19:17 oss1 kernel: <d> 0000000000000001 ffff88002b6bb200 ffff8800287f7ee0 ffffffffa158a99a
Mar  3 13:19:17 oss1 kernel: Call Trace:
Mar  3 13:19:17 oss1 kernel: [<ffffffffa158a99a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
Mar  3 13:19:17 oss1 kernel: [<ffffffff81528090>] ? thread_return+0x4e/0x76e
Mar  3 13:19:17 oss1 kernel: [<ffffffffa1589c80>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
Mar  3 13:19:17 oss1 kernel: [<ffffffff8109aee6>] kthread+0x96/0xa0
Mar  3 13:19:17 oss1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Mar  3 13:19:17 oss1 kernel: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
Mar  3 13:19:17 oss1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Mar  3 13:19:17 oss1 kernel: Code: 8b 83 48 02 00 00 49 89 44 24 10 48 8b 83 48 02 00 00 f6 80 4d 01 00 00 02 0f 85 4f 03 00 00 49 c7 84 24 88 00 00 00 00 00 00 00 <41> 8b 47 1c 48 89 df 89 45 c4 41 8b 47 18 41 89 84 24 80 00 00 
Mar  3 13:19:17 oss1 kernel: RIP  [<ffffffffa15db6a5>] tgt_request_handle+0x125/0xac0 [ptlrpc]
Mar  3 13:19:17 oss1 kernel: RSP <ffff8800287f7d50>
Mar  3 13:19:17 oss1 kernel: CR2: 000000000000001c
Mar  3 13:19:17 oss1 kernel: ---[ end trace be0e3de8467532c8 ]---
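
The faulting instruction can be cross-checked against the register dump. In the Code: line the bytes at the fault marker are <41> 8b 47 1c, a 32-bit load from a base register plus an 8-bit displacement. The sketch below decodes only that one encoding (optional REX prefix, opcode 0x8b, ModRM with mod = 01 and no SIB byte). The struct and function names are made up for illustration, and this is not a general x86 decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of decoding one "mov r32, [base + disp8]" instruction. */
struct mov_load {
	int    base_reg;  /* base register number (15 means r15) */
	int    dest_reg;  /* destination register number (0 means eax) */
	int8_t disp;      /* signed 8-bit displacement */
};

/*
 * Decode only this narrow form: optional REX prefix, opcode 0x8b,
 * ModRM with mod == 01 and rm != 100 (so no SIB byte), then disp8.
 * Returns 0 on success, -1 if the bytes are not in that form.
 */
static int decode_mov_load(const uint8_t *code, size_t len,
			   struct mov_load *out)
{
	size_t  i = 0;
	uint8_t rex = 0;
	uint8_t modrm;

	if (len > 0 && (code[0] & 0xf0) == 0x40)
		rex = code[i++];
	if (len < i + 3 || code[i] != 0x8b)
		return -1;

	modrm = code[i + 1];
	if ((modrm >> 6) != 1 || (modrm & 7) == 4)
		return -1;

	out->dest_reg = ((modrm >> 3) & 7) | ((rex & 0x4) ? 8 : 0); /* REX.R */
	out->base_reg = (modrm & 7) | ((rex & 0x1) ? 8 : 0);        /* REX.B */
	out->disp = (int8_t)code[i + 2];
	return 0;
}

/* Address touched by the load, given the base register's value. */
static uint64_t fault_address(uint64_t base_value, int8_t disp)
{
	return base_value + (uint64_t)(int64_t)disp;
}
```

Given the four bytes 41 8b 47 1c, this yields base register 15 (R15), destination register 0 (EAX), and displacement 0x1c. With R15 = 0 from the dump, the computed address is 0x1c, which matches CR2 and is consistent with a field read at offset 0x1c through a NULL pointer.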


 Comments   
Comment by Andreas Dilger [ 13/Mar/14 ]

On my local system the faulting address decodes to:

(gdb) list *(tgt_request_handle+0x125)
0xb16c5 is in tgt_request_handle (/usr/src/lustre-head/lustre/ptlrpc/../../lustre/target/tgt_handler.c:629).
624             if (exp_connect_flags(req->rq_export) & OBD_CONNECT_JOBSTATS)
625                     tsi->tsi_jobid = lustre_msg_get_jobid(req->rq_reqmsg);
626             else
627                     tsi->tsi_jobid = NULL;
628
629             request_fail_id = tgt->lut_request_fail_id;
630             tsi->tsi_reply_fail_id = tgt->lut_reply_fail_id;
631
632             h = tgt_handler_find_check(req);
633             if (IS_ERR(h)) {

However, the address should be decoded on the system that actually crashed, since the line mapping may differ between builds. What patch was being tested? The oops in the above case would be the first dereference of the "tgt" pointer (exp->exp_obd->u.obt.obt_lut). Maybe that relates to the patch?
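
To make the pointer chain exp->exp_obd->u.obt.obt_lut concrete, here is a minimal model of it. The structures are heavily simplified stand-ins (the real Lustre definitions carry many more fields, and the field types here are assumed), and export_to_target() is a hypothetical helper name, not a Lustre function. It only illustrates that the chain can legitimately end in NULL when a device has no target attached.

```c
#include <stddef.h>

/* Simplified stand-ins for the Lustre structures named in the comment. */
struct lu_target {
	int lut_reply_fail_id;
	int lut_request_fail_id;
};

struct obd_device_target {
	struct lu_target *obt_lut;   /* NULL if no target is attached */
};

struct obd_device {
	union {
		struct obd_device_target obt;
	} u;
};

struct obd_export {
	struct obd_device *exp_obd;
};

/*
 * Hypothetical helper: walk exp->exp_obd->u.obt.obt_lut and return
 * NULL instead of faulting if a link in the chain is missing.
 */
static struct lu_target *export_to_target(const struct obd_export *exp)
{
	if (exp == NULL || exp->exp_obd == NULL)
		return NULL;
	return exp->exp_obd->u.obt.obt_lut;
}
```

A caller that receives NULL from such a helper and then reads tgt->lut_request_fail_id would fault at a small address equal to the field offset, which is the pattern seen in the oops.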

Comment by Nathaniel Clark [ 11/Apr/14 ]

I decoded it when I hit this again on the tip of master (http://review.whamcloud.com/#/c/9350/4), and the result was the same as Andreas's. It is definitely the dereference of tgt.

Comment by Nathaniel Clark [ 11/Apr/14 ]

http://review.whamcloud.com/9936

Comment by Peter Jones [ 27/Apr/15 ]

Is this patch something we should try to resurrect for 2.8?

Comment by Gerrit Updater [ 07/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9936/
Subject: LU-4698 target: check for NULL tgt before deref
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4b3cea8249337dd6e5037b5fb7e4557efa92de5a
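
For readers without the patch at hand, the idea in the subject line ("check for NULL tgt before deref") can be sketched as follows. This is an illustration of the guard pattern only, not the text of the merged change: struct fail_ids and read_fail_ids() are hypothetical, the lu_target stand-in keeps just the two fields read at tgt_handler.c:629-630, and the choice of 0 as the fallback value is an assumption.

```c
#include <stddef.h>

/* Simplified stand-in with only the two fields read in the listing. */
struct lu_target {
	int lut_reply_fail_id;
	int lut_request_fail_id;
};

/* Hypothetical container for the two values the handler picks up. */
struct fail_ids {
	int request_fail_id;
	int reply_fail_id;
};

/*
 * Read the fail ids only when a target is present; otherwise keep
 * the defaults (0 here, chosen for the sketch) and never touch tgt.
 */
static struct fail_ids read_fail_ids(const struct lu_target *tgt)
{
	struct fail_ids ids = { 0, 0 };

	if (tgt != NULL) {
		ids.request_fail_id = tgt->lut_request_fail_id;
		ids.reply_fail_id = tgt->lut_reply_fail_id;
	}
	return ids;
}
```

With a NULL tgt the function returns the defaults instead of reading through the pointer, so the access that produced CR2 = 0x1c never happens.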
