[LU-6558] replay-single: test_61c, test_90 timeout: nrs_orr_res_get() accessed NULL pointer Created: 04/May/15  Updated: 16/May/16  Resolved: 11/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-6571 replay-single test_61c: test failed t... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for nasf <fan.yong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f8be94f2-f1f7-11e4-98d4-5254006e85c2.

14:46:00:BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
14:46:00:IP: [<ffffffffa0ace822>] nrs_orr_res_get+0x102/0xc20 [ptlrpc]
14:46:00:PGD 7c6e5067 PUD 7c6e6067 PMD 0 
14:46:00:Oops: 0000 [#1] SMP 
14:46:00:last sysfs file: /sys/devices/system/cpu/online
14:46:00:CPU 1 
14:46:00:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) microcode virtio_balloon i2c_piix4 i2c_core 8139too 8139cp mii ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
14:46:00:
14:46:00:Pid: 9827, comm: ll_ost_io00_009 Tainted: P           ---------------    2.6.32-504.16.2.el6_lustre.gd805a88.x86_64 #1 Red Hat KVM
14:46:00:RIP: 0010:[<ffffffffa0ace822>]  [<ffffffffa0ace822>] nrs_orr_res_get+0x102/0xc20 [ptlrpc]
14:46:00:RSP: 0018:ffff880073a4fc20  EFLAGS: 00010246
14:46:00:RAX: 0000000000000000 RBX: ffff88006ff68bc0 RCX: 0000000000000000
14:46:00:RDX: ffff880072ff16c0 RSI: ffff88007cc2bc41 RDI: ffffffffa0b17fb0
14:46:00:RBP: ffff880073a4fca0 R08: 0000000000000000 R09: 0000000000000000
14:46:00:R10: ffff88003d6c0840 R11: 00000000000000c0 R12: ffff880071baa9c0
14:46:00:R13: ffff880073a4fcc8 R14: ffff880072ff17b8 R15: 0000000000000004
14:46:00:FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
14:46:00:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
14:46:00:CR2: 00000000000000c0 CR3: 000000007c6e4000 CR4: 00000000000006e0
14:46:00:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
14:46:00:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
14:46:00:Process ll_ost_io00_009 (pid: 9827, threadinfo ffff880073a4e000, task ffff8800703ccab0)
14:46:00:Stack:
14:46:00: 0000005600000000 ffffffffa0b2563f ffff880072ff16c0 0010000000000100
14:46:00:<d> 554634a300000001 00000000000a1fd5 0000266300000000 0000000000000286
14:46:00:<d> 0000000000000000 0000000000000000 ffff88007e4e4000 ffff88006ff68bc0
14:46:00:Call Trace:
14:46:00: [<ffffffffa0ac4fb6>] nrs_resource_get+0x56/0x110 [ptlrpc]
14:46:00: [<ffffffffa0ac597b>] nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
14:46:00: [<ffffffffa0ac7fbb>] ptlrpc_nrs_req_initialize+0x3b/0x90 [ptlrpc]
14:46:00: [<ffffffffa0a8e297>] ptlrpc_server_handle_req_in+0x8c7/0xca0 [ptlrpc]
14:46:00: [<ffffffffa0a957a3>] ptlrpc_main+0x9f3/0x1970 [ptlrpc]
14:46:00: [<ffffffffa0a94db0>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
14:46:00: [<ffffffff8109e71e>] kthread+0x9e/0xc0
14:46:00: [<ffffffff8100c20a>] child_rip+0xa/0x20
14:46:00: [<ffffffff8109e680>] ? kthread+0x0/0xc0
14:46:00: [<ffffffff8100c200>] ? child_rip+0x0/0x20
14:46:00:Cod


 Comments   
Comment by Andreas Dilger [ 04/May/15 ]

Nikitas, can you please take a look at this problem?

Comment by Bob Glossman (Inactive) [ 06/May/15 ]

another in master:
https://testing.hpdd.intel.com/test_sets/cc3202ac-f39b-11e4-9186-5254006e85c2

Comment by John Hammond [ 06/May/15 ]

I saw a crash dump for this. The request has a NULL export in nrs_orr_key_fill():

[jlhammon@shadow-1 ~]$ xddr2line ./usr/lib/debug/lib/modules/2.6.32-504.16.2.el6_lustre.gd805a88.x86_64/extra/kernel/fs/lustre/ptlrpc.ko.debug nrs_orr_res_get+258
class_server_data
/usr/src/debug/lustre-2.7.52/lustre/include/obd_class.h:313
nrs_orr_key_fill
/usr/src/debug/lustre-2.7.52/lustre/ptlrpc/nrs_orr.c:160
nrs_orr_res_get
/usr/src/debug/lustre-2.7.52/lustre/ptlrpc/nrs_orr.c:855

nrs_orr_res_get()
        /**
         * Fill in the key for the request; OST FID for ORR policy instances,
         * and OST index for TRR policy instances.
         */
855:    rc = nrs_orr_key_fill(orrd, nrq, opc, policy->pol_desc->pd_name, &key);
        if (rc < 0)
                RETURN(rc);


nrs_orr_key_fill()

160:    ost_idx = class_server_data(req->rq_export->exp_obd)->lsd_osd_index;

class_server_data()
static inline struct lr_server_data *class_server_data(struct obd_device *obd)
{
313:    LASSERT(obd->u.obt.obt_lut);
        return &obd->u.obt.obt_lut->lut_lsd;
}
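
The pointer chain above can be modelled in isolation. The following is a minimal sketch, not the Lustre code: the mock_* structures are simplified stand-ins that keep only the fields on the crashing path, mock_ost_idx_get() is a hypothetical helper, and the choice of -ENOTCONN as the error code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Simplified stand-ins for the structures on the crashing pointer chain.
 * These are NOT the real Lustre definitions; only the fields that
 * nrs_orr_key_fill() walks through are modelled.
 */
struct mock_lsd    { unsigned int lsd_osd_index; };
struct mock_lut    { struct mock_lsd lut_lsd; };
struct mock_obd    { struct mock_lut *obt_lut; };
struct mock_export { struct mock_obd *exp_obd; };
struct mock_req    { struct mock_export *rq_export; };

/*
 * Hypothetical guarded lookup of the OST index.
 * Returns 0 and stores the index in *ost_idx, or -ENOTCONN (an assumed
 * error code) when the request has no export, instead of dereferencing
 * the NULL pointer.
 */
static inline int mock_ost_idx_get(const struct mock_req *req,
				   unsigned int *ost_idx)
{
	assert(req != NULL && ost_idx != NULL);

	/* The crash dump shows this link of the chain was NULL. */
	if (req->rq_export == NULL)
		return -ENOTCONN;

	*ost_idx = req->rq_export->exp_obd->obt_lut->lut_lsd.lsd_osd_index;
	return 0;
}
```

A caller would treat a negative return as "no key available" and leave *ost_idx untouched.
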
Comment by Gerrit Updater [ 06/May/15 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/14699
Subject: LU-6558 nrs: test patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ff98f2354ef56af9f205e7d120fa0c0caa0e614e

Comment by Andreas Dilger [ 07/May/15 ]

This is causing quite a few test failures.

Comment by Nikitas Angelinas [ 07/May/15 ]

I'll do my best to look at this today; just swamped with other things as well atm; sorry.

Comment by Andreas Dilger [ 07/May/15 ]

It seems this failure is fallout from landing http://review.whamcloud.com/9286 "LU-3266 test: regression tests for nrs policies". That patch enables NRS in sanity.sh test_77*, but it does not appear to change the NRS policy back to FIFO afterward, so all of the following tests run with TRR enabled after test_77d. The failures started being hit intermittently on May 2, the day after the NRS tests landed, as new patches were rebased to include that change:
https://testing.hpdd.intel.com/sub_tests/query?utf8=%E2%9C%93&test_set[test_set_script_id]=f6a12204-32c3-11e0-a61c-52540025f9ae&sub_test[sub_test_script_id]=fb61f372-32c3-11e0-a61c-52540025f9ae&sub_test[status]=TIMEOUT&sub_test[query_bugs]=&test_session[test_host]=&test_session[test_group]=&test_session[user_id]=&test_session[query_date]=&test_session[query_recent_period]=2419200&test_node[os_type_id]=&test_node[distribution_type_id]=&test_node[architecture_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=24a6947e-04a9-11e1-bb5f-52540025f9af&test_node_network[network_type_id]=&commit=Update+results

The simple path forward would be to disable NRS at the end of each test_77 sub-test, but I'd rather avoid that if keeping NRS running for later tests gives us better test coverage, as this ticket shows.

Comment by John Hammond [ 08/May/15 ]

llmount.sh
lctl set_param ost.OSS.ost_io.nrs_policies=trr
lctl set_param ost.OSS.*.nrs_trr_supported=reads_and_writes
cd /mnt/lustre
lfs setstripe -c1 -i0 f0
echo XXX > f0
umount /mnt/ost1
sync

Comment by Gerrit Updater [ 08/May/15 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/14737
Subject: LU-6558 ptlrpc: handle NULL export in nrs_orr_key_fill()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b953627757184e722c78e19390acc59eff25c0bd

Comment by Nikitas Angelinas [ 09/May/15 ]

I can't figure out how to log in to Gerrit using my stackexchange OpenID, so I am posting here instead.

John, could you please consider returning a negative value from nrs_orr_key_fill()? 0 signifies successful completion for this function, but if we exit early due to an invalid obd_export, the nrs_orr_key (OST FID for the ORR policy, and OST idx for the TRR policy) will not be filled in, so continuing to handle the RPC with the ORR/TRR policies would likely not help much with RPC scheduling; returning a negative value should hand the RPC over to the default FIFO policy.

The test for rq_export == NULL could also be made above if (nrq->nr_u.orr.or_orr_set || nrq->nr_u.orr.or_trr_set), as a very insignificant optimization.
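
The two suggestions above can be sketched together as follows. This is an illustration under stated assumptions, not the landed patch: all sketch_* names are hypothetical, the single key_set flag stands in for or_orr_set / or_trr_set, -ENOTCONN is an assumed error code, and the fallback to FIFO on a negative return is modelled directly rather than through the real NRS resource code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy identifiers for this sketch only. */
enum sketch_policy { SKETCH_POLICY_FIFO, SKETCH_POLICY_TRR };

struct sketch_export { unsigned int ost_idx; };

struct sketch_req {
	struct sketch_export *rq_export;
	bool key_set;		/* stands in for or_orr_set / or_trr_set */
	unsigned int key;	/* stands in for the nrs_orr_key */
};

/*
 * Fill in the scheduling key. The export test comes first, ahead of the
 * "key already set" early exit, so an unconnected request is always
 * rejected with a negative value rather than reported as success.
 */
static inline int sketch_key_fill(struct sketch_req *req)
{
	assert(req != NULL);

	if (req->rq_export == NULL)
		return -ENOTCONN;	/* assumed error code */

	if (req->key_set)
		return 0;

	req->key = req->rq_export->ost_idx;
	req->key_set = true;
	return 0;
}

/* A negative return from key fill hands the request to the default policy. */
static inline enum sketch_policy sketch_policy_select(struct sketch_req *req)
{
	return sketch_key_fill(req) < 0 ? SKETCH_POLICY_FIFO : SKETCH_POLICY_TRR;
}
```

Returning 0 for the unconnected case would instead leave the request under TRR with an unfilled key, which is the situation the comment above argues against.
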

Comment by Andreas Dilger [ 09/May/15 ]

Nikitas, since this is only happening during shutdown, it shouldn't matter what is being returned.

Comment by John Hammond [ 10/May/15 ]

Nikitas: Done.

BTW, I tried to add you as a reviewer on gerrit but it fails and says "Nikitas Angelinas <nikitas.angelinas@seagate.com> does not identify a registered user or group."

Comment by Andreas Dilger [ 10/May/15 ]

It seems there are two different accounts for Nikitas with the same email address. I was able to add him to a different patch by using only the email address, which is a bit tricky because it always autocompletes to include the full name, which produces an error. John, could you give that a try? I'm currently not able to log in myself.

Comment by Gerrit Updater [ 11/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14737/
Subject: LU-6558 ptlrpc: handle NULL export in nrs_orr_key_fill()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 46af0e8dd04d01679865704e06c1037e4f30f1a3

Comment by Peter Jones [ 11/May/15 ]

Landed for 2.8
