[LU-7782] sanity-scrub test_2: NULL pointer dereference at 0x10 in lu_context_key_get() on mds2 Created: 17/Feb/16 Updated: 10/Sep/16 Resolved: 10/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f0270640-d4f4-11e5-9e3f-5254006e85c2.

This is testing patch http://review.whamcloud.com/18442 which is changing sanity-scrub.sh scrub_prep() to use test_mkdir -i instead of mkdir and lfs mkdir explicitly for testing, so that it works with the upstream kernel (which doesn't have DNE support).

The sub-test test_2 failed with the following error on MDS2:

08:17:18:LustreError: 29698:0:(client.c:1133:ptlrpc_import_delay_req()) Skipped 40 previous similar messages
08:17:18:BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
08:17:18:IP: [<ffffffffa057bb57>] lu_context_key_get+0x1
08:17:18:CPU 0
08:17:18:Pid: 29699, comm: osp_up2-1 Not tainted 2.6.32-573.12.1.el6_lustre.g93f956d.x86_64 #1 Red Hat KVM
08:17:18:Call Trace:
08:17:18: [<ffffffffa09ee83f>] fld_local_lookup+0x4f/0x290 [fld]
08:17:18: [<ffffffffa09eec83>] fld_server_lookup+0x53/0x330 [fld]
08:17:18: [<ffffffffa0e6e38f>] lod_fld_lookup+0x34f/0x520 [lod]
08:17:18: [<ffffffffa0e84243>] lod_object_init+0x103/0x3c0 [lod]
08:17:18: [<ffffffffa057f198>] lu_object_alloc+0xd8/0x320 [obdclass]
08:17:18: [<ffffffffa0580581>] lu_object_find_try+0x151/0x260 [obdclass]
08:17:18: [<ffffffffa0580741>] lu_object_find_at+0xb1/0xe0 [obdclass]
08:17:18: [<ffffffffa05807af>] lu_object_find_slice+0x1f/0x80 [obdclass]
08:17:18: [<ffffffffa0f79a4e>] osp_trans_stop_cb+0x1be/0x2d0 [osp]
08:17:18: [<ffffffffa0f7b2be>] osp_update_interpret+0x21e/0x4a0 [osp]
08:17:18: [<ffffffffa07900b5>] ptlrpc_check_set+0x615/0x1da0 [ptlrpc]
08:17:18: [<ffffffffa0791b9a>] ptlrpc_set_wait+0x35a/0x960 [ptlrpc]
08:17:18: [<ffffffffa0792221>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
08:17:18: [<ffffffffa0f7b9c6>] osp_send_update_req+0x256/0x850 [osp]
08:17:18: [<ffffffffa0f7c63f>] osp_send_update_thread+0x20f/0x7ac [osp]
08:17:18: [<ffffffff810a0fce>] kthread+0x9e/0xc0

Please provide additional information about the failure here. It shouldn't be possible to cause the node to crash, no matter how the test directories are being created. Info required for matching: sanity-scrub 2 |
| Comments |
| Comment by Andreas Dilger [ 17/Feb/16 ] |
|
This failed in sanity-scrub test_1a also: https://testing.hpdd.intel.com/test_sets/07f70048-d547-11e5-9cc2-5254006e85c2 Info required for matching: sanity-scrub 1a |
| Comment by Andreas Dilger [ 17/Feb/16 ] |
|
Wang Di, Fan Yong, could you please take a look at this? The test shouldn't cause the MDS to crash just because I slightly changed the way the directories are being created (apparently "test_mkdir" is creating a 2-stripe directory rather than a regular directory created by "mkdir"). Running sanity-scrub has failed 3x on the http://review.whamcloud.com/18442 patch so it should be reproducible. |
| Comment by nasf (Inactive) [ 17/Feb/16 ] |
|
I will investigate it. |
| Comment by nasf (Inactive) [ 17/Feb/16 ] |
|
I cannot reproduce the issue in my local environment, but the test logs are clear: they show that the crash happened while unmounting the MDTx after creating some striped directories. At that time, neither file-level backup/restore nor OI scrub was running. File-level backup happens after the umount, and OI scrub happens after the file-level backup/restore. It seems that when the MDTx was unmounted, the OUT RPC for creating the slave MDT-object of the striped directory on the remote MDT had NOT completed yet, and its callback triggered object_init during MDT stack cleanup. If my guess is correct, this issue can be reproduced by repeatedly calling scrub_prep() alone. Di, would you please check the DNE async update logic? Thanks! |
| Comment by Di Wang [ 17/Feb/16 ] |
|
Interesting, the env passed to the request interpret callback is not initialized at all, which should be the reason for this panic. See
/*
 * At least one request is in flight, so no
 * interrupts are allowed. Wait until all
 * complete, or an in-flight req times out.
 */
lwi = LWI_TIMEOUT(cfs_time_seconds(timeout ? timeout : 1),
                  ptlrpc_expired_set, set);
rc = l_wait_event(set->set_waitq, ptlrpc_check_set(NULL, set), &lwi);
|
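A minimal sketch of why a NULL env would fault at address 0x10, as reported in the oops. The structures below are simplified, hypothetical stand-ins written from memory of Lustre's lu_object.h, not the real definitions, and the layout assumes a 64-bit (LP64) build such as the x86_64 kernel in the log. The helper names fault_offset_for_null_env() and key_get_checked() are illustrative only and do not exist in Lustre; the guard is not the fix that landed.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct lu_context.
 * Assumption: the per-key value array follows two 4-byte fields
 * and one pointer, as in the Lustre headers of that era. */
struct lu_context {
        unsigned int   lc_tags;    /* offset 0x0, 4 bytes */
        int            lc_state;   /* offset 0x4, an enum in Lustre */
        void          *lc_thread;  /* offset 0x8, 8 bytes on LP64 */
        void         **lc_value;   /* offset 0x10, per-key value array */
};

/* Hypothetical stand-in for struct lu_env: the context is the first
 * member, so &env->le_ctx is also NULL when env is NULL. */
struct lu_env {
        struct lu_context  le_ctx;
        struct lu_context *le_ses;
};

/* Address that an unguarded read of env->le_ctx.lc_value touches
 * when env == NULL. */
static size_t fault_offset_for_null_env(void)
{
        return offsetof(struct lu_env, le_ctx) +
               offsetof(struct lu_context, lc_value);
}

/* Illustrative guarded lookup: returns NULL instead of dereferencing
 * a missing env or an unallocated value array. */
static void *key_get_checked(const struct lu_env *env, size_t index)
{
        if (env == NULL || env->le_ctx.lc_value == NULL)
                return NULL;
        return env->le_ctx.lc_value[index];
}
```

Under these assumptions, calling fault_offset_for_null_env() gives the offset of lc_value from a NULL env, which lines up with the "NULL pointer dereference at 0000000000000010" in the trace, while key_get_checked(NULL, 0) returns NULL rather than faulting.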
| Comment by Joseph Gmitter (Inactive) [ 17/Feb/16 ] |
|
Reassigning to Di |
| Comment by Gerrit Updater [ 18/Feb/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18493 |
| Comment by James Nunez (Inactive) [ 22/Feb/16 ] |
|
Looks like another instance on sanity-scrub test_10a at 2016-02-21 18:51:44 - https://testing.hpdd.intel.com/test_sets/b5b59d62-d8de-11e5-b4e5-5254006e85c2 |
| Comment by Bruno Faccini (Inactive) [ 25/Feb/16 ] |
|
+1 other instance on sanity-scrub test_10a at https://testing.hpdd.intel.com/test_sets/4e014c2e-db87-11e5-b8c9-5254006e85c2 |
| Comment by Andreas Dilger [ 01/Mar/16 ] |
|
Looks like this patch didn't fix sanity-scrub with my small patch to change to test_mkdir: |
| Comment by Di Wang [ 01/Mar/16 ] |
|
It looks like a different issue, and I did not see an Oops on any MDS node. Fan Yong, could you please check why LFSCK is stuck here? Thanks. |
| Comment by nasf (Inactive) [ 06/Mar/16 ] |
|
The issues found by James and Bruno are the same as the original issue found by Andreas. But the latest failure instance in https://testing.hpdd.intel.com/test_sets/4e014c2e-db87-11e5-b8c9-5254006e85c2 is different. That is because, in the MDT file-level backup/restore case, the current OSD cannot know the correct OI mapping for the slave MDT-object of the striped directory until the OI scrub has completely rebuilt the OI files. I will make a patch to handle it. |
| Comment by Gerrit Updater [ 06/Mar/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18801 |
| Comment by Gerrit Updater [ 14/Mar/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18493/ |
| Comment by Andreas Dilger [ 14/Mar/16 ] |
|
The 18801 patch also needs to land before this can be closed. |
| Comment by Bob Glossman (Inactive) [ 04/Apr/16 ] |
|
Seen in b2_8. I think the fix only went into master after the branch was made. https://testing.hpdd.intel.com/test_sets/b2502df8-f950-11e5-812a-5254006e85c2 |
| Comment by Gerrit Updater [ 04/Apr/16 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/19313 |
| Comment by Gerrit Updater [ 11/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18801/ |
| Comment by Joseph Gmitter (Inactive) [ 13/Jul/16 ] |
|
Patches have landed to master for 2.9.0 |
| Comment by Gerrit Updater [ 14/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/21313 |
| Comment by Gerrit Updater [ 20/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21313/ |
| Comment by Oleg Drokin [ 20/Jul/16 ] |
|
The patch here was reverted because it appears to be causing multiple issues tracked under |
| Comment by Gerrit Updater [ 26/Jul/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21506 |
| Comment by Gerrit Updater [ 10/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21506/ |
| Comment by Peter Jones [ 10/Sep/16 ] |
|
Landed for 2.9 |