[LU-6558] replay-single: test_61c, test_90 timeout: nrs_orr_res_get() accessed NULL pointer Created: 04/May/15 Updated: 16/May/16 Resolved: 11/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for nasf <fan.yong@intel.com>
Please provide additional information about the failure here.
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f8be94f2-f1f7-11e4-98d4-5254006e85c2.
14:46:00:BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
14:46:00:IP: [<ffffffffa0ace822>] nrs_orr_res_get+0x102/0xc20 [ptlrpc]
14:46:00:PGD 7c6e5067 PUD 7c6e6067 PMD 0
14:46:00:Oops: 0000 [#1] SMP
14:46:00:last sysfs file: /sys/devices/system/cpu/online
14:46:00:CPU 1
14:46:00:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) microcode virtio_balloon i2c_piix4 i2c_core 8139too 8139cp mii ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
14:46:00:
14:46:00:Pid: 9827, comm: ll_ost_io00_009 Tainted: P --------------- 2.6.32-504.16.2.el6_lustre.gd805a88.x86_64 #1 Red Hat KVM
14:46:00:RIP: 0010:[<ffffffffa0ace822>] [<ffffffffa0ace822>] nrs_orr_res_get+0x102/0xc20 [ptlrpc]
14:46:00:RSP: 0018:ffff880073a4fc20 EFLAGS: 00010246
14:46:00:RAX: 0000000000000000 RBX: ffff88006ff68bc0 RCX: 0000000000000000
14:46:00:RDX: ffff880072ff16c0 RSI: ffff88007cc2bc41 RDI: ffffffffa0b17fb0
14:46:00:RBP: ffff880073a4fca0 R08: 0000000000000000 R09: 0000000000000000
14:46:00:R10: ffff88003d6c0840 R11: 00000000000000c0 R12: ffff880071baa9c0
14:46:00:R13: ffff880073a4fcc8 R14: ffff880072ff17b8 R15: 0000000000000004
14:46:00:FS: 0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
14:46:00:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
14:46:00:CR2: 00000000000000c0 CR3: 000000007c6e4000 CR4: 00000000000006e0
14:46:00:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
14:46:00:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
14:46:00:Process ll_ost_io00_009 (pid: 9827, threadinfo ffff880073a4e000, task ffff8800703ccab0)
14:46:00:Stack:
14:46:00: 0000005600000000 ffffffffa0b2563f ffff880072ff16c0 0010000000000100
14:46:00:<d> 554634a300000001 00000000000a1fd5 0000266300000000 0000000000000286
14:46:00:<d> 0000000000000000 0000000000000000 ffff88007e4e4000 ffff88006ff68bc0
14:46:00:Call Trace:
14:46:00: [<ffffffffa0ac4fb6>] nrs_resource_get+0x56/0x110 [ptlrpc]
14:46:00: [<ffffffffa0ac597b>] nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
14:46:00: [<ffffffffa0ac7fbb>] ptlrpc_nrs_req_initialize+0x3b/0x90 [ptlrpc]
14:46:00: [<ffffffffa0a8e297>] ptlrpc_server_handle_req_in+0x8c7/0xca0 [ptlrpc]
14:46:00: [<ffffffffa0a957a3>] ptlrpc_main+0x9f3/0x1970 [ptlrpc]
14:46:00: [<ffffffffa0a94db0>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
14:46:00: [<ffffffff8109e71e>] kthread+0x9e/0xc0
14:46:00: [<ffffffff8100c20a>] child_rip+0xa/0x20
14:46:00: [<ffffffff8109e680>] ? kthread+0x0/0xc0
14:46:00: [<ffffffff8100c200>] ? child_rip+0x0/0x20
14:46:00:Cod |
| Comments |
| Comment by Andreas Dilger [ 04/May/15 ] |
|
Nikitas, can you please take a look at this problem? |
| Comment by Bob Glossman (Inactive) [ 06/May/15 ] |
|
another in master: |
| Comment by John Hammond [ 06/May/15 ] |
|
I saw a crash dump for this. The request has a NULL export in nrs_orr_key_fill():
[jlhammon@shadow-1 ~]$ xddr2line ./usr/lib/debug/lib/modules/2.6.32-504.16.2.el6_lustre.gd805a88.x86_64/extra/kernel/fs/lustre/ptlrpc.ko.debug nrs_orr_res_get+258
class_server_data
/usr/src/debug/lustre-2.7.52/lustre/include/obd_class.h:313
nrs_orr_key_fill
/usr/src/debug/lustre-2.7.52/lustre/ptlrpc/nrs_orr.c:160
nrs_orr_res_get
/usr/src/debug/lustre-2.7.52/lustre/ptlrpc/nrs_orr.c:855
nrs_orr_res_get()
/**
* Fill in the key for the request; OST FID for ORR policy instances,
* and OST index for TRR policy instances.
*/
855: rc = nrs_orr_key_fill(orrd, nrq, opc, policy->pol_desc->pd_name, &key);
if (rc < 0)
RETURN(rc);
nrs_orr_key_fill()
160: ost_idx = class_server_data(req->rq_export->exp_obd)->lsd_osd_index;
class_server_data()
static inline struct lr_server_data *class_server_data(struct obd_device *obd)
{
313: LASSERT(obd->u.obt.obt_lut);
return &obd->u.obt.obt_lut->lut_lsd;
}
|
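As a side note on the oops itself: the faulting address (CR2) is 00000000000000c0 rather than 0, which is the usual signature of reading a member through a NULL structure pointer, since the CPU faults at NULL plus the member's offset. The sketch below only illustrates that mechanism. The structure `sketch_parent`, its padding, and the function `sketch_fault_address()` are hypothetical; the ticket does not say which Lustre structure member sits at offset 0xc0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical structure: 0xc0 bytes of unrelated fields followed by the
 * pointer member that the faulting instruction tried to load. The real
 * Lustre layout is not given in this ticket.
 */
struct sketch_parent {
	unsigned char  other_fields[0xc0];
	void          *member;
};

/* The layout assumption this sketch relies on, checked at compile time. */
static_assert(offsetof(struct sketch_parent, member) == 0xc0,
	      "member is expected at offset 0xc0 in this sketch");

/*
 * Address the CPU would access when loading 'member' through a structure
 * pointer whose value is 'base'. With base == 0 (a NULL pointer) this is
 * simply the member offset, which is what CR2 would report.
 */
static uintptr_t sketch_fault_address(uintptr_t base)
{
	return base + offsetof(struct sketch_parent, member);
}
```

Under that assumed layout, `sketch_fault_address(0)` evaluates to 0xc0, matching the shape of the address reported in the oops.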
| Comment by Gerrit Updater [ 06/May/15 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/14699 |
| Comment by Andreas Dilger [ 07/May/15 ] |
|
This is causing quite a few test failures. |
| Comment by Nikitas Angelinas [ 07/May/15 ] |
|
I'll do my best to look at this today; just swamped with other things as well atm; sorry. |
| Comment by Andreas Dilger [ 07/May/15 ] |
|
It seems this failure is fallout from landing http://review.whamcloud.com/9286 " The simple path forward would be to disable NRS at the end of each test_77 sub-test, but I'd rather avoid that if keeping NRS running for later tests is giving us better test coverage, as seen by this ticket. |
| Comment by John Hammond [ 08/May/15 ] |
|
llmount.sh
lctl set_param ost.OSS.ost_io.nrs_policies=trr
lctl set_param ost.OSS.*.nrs_trr_supported=reads_and_writes
cd /mnt/lustre
lfs setstripe -c1 -i0 f0
echo XXX > f0
umount /mnt/ost1
sync |
| Comment by Gerrit Updater [ 08/May/15 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/14737 |
| Comment by Nikitas Angelinas [ 09/May/15 ] |
|
I can't figure out how to log in to Gerrit using my stackexchange OpenID, so I am posting here instead. John, could you please consider returning a negative value from nrs_orr_key_fill()? 0 signifies successful completion for this function, but if we exit early due to an invalid obd_export, the nrs_orr_key (OST FID for the ORR policy, and OST idx for the TRR policy) will not be filled in, so continuing to handle the RPC using the ORR/TRR policies would likely not help much with RPC scheduling; returning a negative value should hand the RPC over to the default FIFO policy. The test for rq_export == NULL could also be made above if (nrq->nr_u.orr.or_orr_set || nrq->nr_u.orr.or_trr_set), as a very insignificant optimization. |
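The suggestion above can be sketched in a few lines of standalone C. This is not the landed patch: the `sketch_*` types and functions are simplified, hypothetical stand-ins for the Lustre structures, and `-ENODEV` is an arbitrary choice of negative errno made for illustration only. The point is just that the key-fill step checks the export before dereferencing it and reports failure, so the caller can fall back to FIFO instead of scheduling with an unfilled key.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the Lustre structures. */
struct sketch_obd     { uint32_t osd_index; };
struct sketch_export  { struct sketch_obd *exp_obd; };
struct sketch_request { struct sketch_export *rq_export; };

enum sketch_policy { SKETCH_POLICY_FIFO, SKETCH_POLICY_TRR };

/*
 * Fill the TRR key (the OST index) for a request.
 * Returns 0 on success, or a negative errno when the request has no usable
 * export (for example during shutdown); *ost_idx is left untouched then.
 */
static int sketch_key_fill(const struct sketch_request *req, uint32_t *ost_idx)
{
	assert(req != NULL && ost_idx != NULL);

	/* Check before dereferencing: this is the guard the oops was missing. */
	if (req->rq_export == NULL || req->rq_export->exp_obd == NULL)
		return -ENODEV; /* assumed error code, for illustration only */

	*ost_idx = req->rq_export->exp_obd->osd_index;
	return 0;
}

/* A negative return from key fill hands the request to the fallback policy. */
static enum sketch_policy sketch_pick_policy(const struct sketch_request *req,
					     uint32_t *ost_idx)
{
	return sketch_key_fill(req, ost_idx) < 0 ? SKETCH_POLICY_FIFO
						 : SKETCH_POLICY_TRR;
}
```

Returning 0 in the NULL-export case would instead let the caller proceed with whatever value happened to be in the key, which is the behaviour the comment above argues against.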
| Comment by Andreas Dilger [ 09/May/15 ] |
|
Nikitas, since this is only happening during shutdown, it shouldn't matter what is being returned. |
| Comment by John Hammond [ 10/May/15 ] |
|
Nikitas: Done. BTW, I tried to add you as a reviewer on gerrit but it fails and says "Nikitas Angelinas <nikitas.angelinas@seagate.com> does not identify a registered user or group." |
| Comment by Andreas Dilger [ 10/May/15 ] |
|
It seems there are two different accounts for Nikitas with the same email address. I was able to add him to a different patch by using only the email address, which is a bit tricky because it always autocompletes to include the full name, which produces an error. John, could you give that a try? I'm currently not able to log in myself. |
| Comment by Gerrit Updater [ 11/May/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14737/ |
| Comment by Peter Jones [ 11/May/15 ] |
|
Landed for 2.8 |