[LU-2981] sanity.sh test_17m test_77i: oops in ptlrpc_server_hpreq_fini Created: 18/Mar/13  Updated: 01/Apr/13  Resolved: 01/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Bob Glossman (Inactive)
Resolution: Duplicate
Votes: 0
Labels: LB

Issue Links:
Related
is related to LU-398 NRS (Network Request Scheduler) Resolved
Severity: 3
Rank (Obsolete): 7264

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/1000ccce-8f3b-11e2-aa82-52540035b04c.

The sub-test test_77i failed with the following error:

22:42:38:BUG: unable to handle kernel NULL pointer dereference at 0000000000000228
22:42:39:IP: [<ffffffffa075dc77>] ptlrpc_server_hpreq_fini+0x27/0x160 [ptlrpc]
22:42:41:Pid: 13574, comm: obd_zombid Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Red Hat KVM
22:42:44:Call Trace:
22:42:44: [<ffffffffa0760dc9>] ptlrpc_unregister_service+0x4a9/0x10b0 [ptlrpc]
22:42:44: [<ffffffffa04402e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:42:44: [<ffffffffa07367b2>] ldlm_cleanup+0x262/0x4f0 [ptlrpc]
22:42:44: [<ffffffffa0736b65>] ldlm_put_ref+0x125/0x1a0 [ptlrpc]
22:42:45: [<ffffffffa072a67d>] client_obd_cleanup+0x4d/0x120 [ptlrpc]
22:42:45: [<ffffffffa0aa7133>] mgc_cleanup+0x53/0x130 [mgc]
22:42:45: [<ffffffffa05ce012>] class_decref+0x212/0x580 [obdclass]
22:42:45: [<ffffffffa04402e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:42:47: [<ffffffffa05abff9>] obd_zombie_impexp_cull+0x309/0x5d0 [obdclass]
22:42:47: [<ffffffffa05ac385>] obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
22:42:47: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
22:42:47: [<ffffffffa05ac2c0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
22:42:48: [<ffffffff8100c0ca>] child_rip+0xa/0x20
22:42:48: [<ffffffffa05ac2c0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
22:42:48: [<ffffffffa05ac2c0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
22:42:48: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Info required for matching: sanity 77i



 Comments   
Comment by Andreas Dilger [ 18/Mar/13 ]

Another hit in ptlrpc_unregister_service(), though this time from an unmount:

https://maloo.whamcloud.com/test_sets/018d9cec-8d58-11e2-bb99-52540035b04c

19:19:04:RIP: 0010:[<ffffffffa07edc27>]  [<ffffffffa07edc27>] ptlrpc_server_hpreq_fini+0x27/0x160 [ptlrpc]
19:19:05:Process umount (pid: 11123, threadinfo ffff880069806000, task ffff88005
19:19:05:Call Trace:
19:19:05: [<ffffffffa07f0d79>] ptlrpc_unregister_service+0x4a9/0x10b0 [ptlrpc]
19:19:05: [<ffffffff81052223>] ? __wake_up+0x53/0x70
19:19:05: [<ffffffffa0de49fe>] mgs_device_fini+0xee/0x5a0 [mgs]
19:19:06: [<ffffffffa06489c7>] class_cleanup+0x577/0xda0 [obdclass]
19:19:06: [<ffffffffa061dd36>] ? class_name2dev+0x56/0xe0 [obdclass]
19:19:06: [<ffffffffa064a2ac>] class_process_config+0x10bc/0x1c80 [obdclass]
19:19:06: [<ffffffffa0643ad3>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
19:19:06: [<ffffffffa064afe9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
19:19:06: [<ffffffffa061dd36>] ? class_name2dev+0x56/0xe0 [obdclass]
19:19:06: [<ffffffffa0657a3d>] server_put_super+0x46d/0xf00 [obdclass]
19:19:06: [<ffffffff811785ab>] generic_shutdown_super+0x5b/0xe0
19:19:06: [<ffffffff81178696>] kill_anon_super+0x16/0x60
19:19:07: [<ffffffffa064ce46>] lustre_kill_super+0x36/0x60 [obdclass]
19:19:07: [<ffffffff81179670>] deactivate_super+0x70/0x90
19:19:07: [<ffffffff811955cf>] mntput_no_expire+0xbf/0x110
19:19:07: [<ffffffff81195f2b>] sys_umount+0x7b/0x3a0
19:19:07: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
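
Both traces fault at ptlrpc_server_hpreq_fini+0x27, called from ptlrpc_unregister_service+0x4a9, and differ only in the caller chain above that. As an illustration only, the Python sketch below shows one way such a comparison could be automated. The names FRAME_RE and parse_frames are hypothetical helpers, not part of Lustre or Maloo, and the regular expression assumes the frame format seen in these el6 console logs. The sample input is a shortened copy of the frames quoted in this ticket.

```python
import re

# Matches frames such as
#   [<ffffffffa0760dc9>] ptlrpc_unregister_service+0x4a9/0x10b0 [ptlrpc]
# A "?" before the symbol marks a frame the kernel flags as unreliable.
# (Assumed format; other kernel versions may print frames differently.)
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text, include_unreliable=False):
    """Return (symbol, offset, module) tuples for each stack frame found."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue
        if m.group("unreliable") and not include_unreliable:
            continue
        frames.append(
            (m.group("symbol"), int(m.group("offset"), 16), m.group("module"))
        )
    return frames


# Shortened copies of the two traces in this ticket.
oops_77i = """\
IP: [<ffffffffa075dc77>] ptlrpc_server_hpreq_fini+0x27/0x160 [ptlrpc]
 [<ffffffffa0760dc9>] ptlrpc_unregister_service+0x4a9/0x10b0 [ptlrpc]
 [<ffffffffa04402e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa07367b2>] ldlm_cleanup+0x262/0x4f0 [ptlrpc]
"""
oops_umount = """\
RIP: 0010:[<ffffffffa07edc27>]  [<ffffffffa07edc27>] ptlrpc_server_hpreq_fini+0x27/0x160 [ptlrpc]
 [<ffffffffa07f0d79>] ptlrpc_unregister_service+0x4a9/0x10b0 [ptlrpc]
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa0de49fe>] mgs_device_fini+0xee/0x5a0 [mgs]
"""

# Frames (symbol, offset, module) that appear in both traces, in call order.
umount_frames = set(parse_frames(oops_umount))
common = [f for f in parse_frames(oops_77i) if f in umount_frames]

for symbol, offset, module in common:
    print(f"{symbol}+{offset:#x} [{module}]")
```

Comparing on symbol and offset rather than on the raw address is deliberate: module load addresses differ between the two runs (ffffffffa075dc77 versus ffffffffa07edc27), while the offset within the function stays the same.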
Comment by Andreas Dilger [ 18/Mar/13 ]

In the past 4 weeks sanity.sh test_77i has failed 10 times and test_17m once, and all of these failures date from 2013-03-12 or later.

Comment by Andreas Dilger [ 18/Mar/13 ]

Could this relate to NRS? The patch http://review.whamcloud.com/5665 was landed on the 12th.

Comment by Nikitas Angelinas [ 18/Mar/13 ]

I think this must be due to the NRS framework follow-up patch itself, as the version that fired those bugs had some important parts missing. I have just updated that patch and this new version should address this ticket.

Comment by Peter Jones [ 22/Mar/13 ]

Thanks, Nikitas! As it seems that you are not around to do so (I appreciate it is late on a Friday in the UK), Bob is going to rebase this patch to avoid the LU-2910 failure, in the hope that we can get a clean test run over the weekend.

Comment by Nikitas Angelinas [ 22/Mar/13 ]

Hi Peter, please proceed to rebase the patch if you want to get a clean test run, though I was planning to refresh the patch at end of day today (so in 5-7 hours from now) after I included some additional changes.

Comment by Peter Jones [ 22/Mar/13 ]

Nikitas

Ah great - that timeframe works fine to get testing completed over the weekend. Obviously we want the version with the latest changes. I guess that I underestimated the end of your work day.

Peter

Comment by Alexander Boyko [ 29/Mar/13 ]

Xyratex has a patch for this issue. I will probably submit it to master in a few days.

Comment by Peter Jones [ 01/Apr/13 ]

OK, so the extra tidy-up from LU-398 that Nikitas believes should resolve this issue has now landed, so I am closing this ticket. We can reopen it if there are still problems with this change in place.
