[LU-7782] sanity-scrub test_2: NULL pointer dereference at 0x10 in lu_context_key_get() on mds2 Created: 17/Feb/16  Updated: 10/Sep/16  Resolved: 10/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7746 skip test of new functionality on ups... Resolved
is related to LU-7935 MDS crash with NULL pointer dereferen... Resolved
is related to LU-8399 MDT hung at lu_object_find_at during ... Resolved
is related to LU-8416 sanity-scrub test_4c: Auto trigger fu... Resolved
Severity: 3

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f0270640-d4f4-11e5-9e3f-5254006e85c2.

This is testing patch http://review.whamcloud.com/18442 which is changing sanity-scrub.sh scrub_prep() to use test_mkdir -i instead of mkdir and lfs mkdir explicitly for testing, so that it works with the upstream kernel (which doesn't have DNE support).

The sub-test test_2 failed with the following error on MDS2:

08:17:18:LustreError: 29698:0:(client.c:1133:ptlrpc_import_delay_req()) Skipped 40 previous similar messages
08:17:18:BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
08:17:18:IP: [<ffffffffa057bb57>] lu_context_key_get+0x1
08:17:18:CPU 0 
08:17:18:Pid: 29699, comm: osp_up2-1 Not tainted 2.6.32-573.12.1.el6_lustre.g93f956d.x86_64 #1 Red Hat KVM
08:17:18:Call Trace:
08:17:18: [<ffffffffa09ee83f>] fld_local_lookup+0x4f/0x290 [fld]
08:17:18: [<ffffffffa09eec83>] fld_server_lookup+0x53/0x330 [fld]
08:17:18: [<ffffffffa0e6e38f>] lod_fld_lookup+0x34f/0x520 [lod]
08:17:18: [<ffffffffa0e84243>] lod_object_init+0x103/0x3c0 [lod]
08:17:18: [<ffffffffa057f198>] lu_object_alloc+0xd8/0x320 [obdclass]
08:17:18: [<ffffffffa0580581>] lu_object_find_try+0x151/0x260 [obdclass]
08:17:18: [<ffffffffa0580741>] lu_object_find_at+0xb1/0xe0 [obdclass]
08:17:18: [<ffffffffa05807af>] lu_object_find_slice+0x1f/0x80 [obdclass]
08:17:18: [<ffffffffa0f79a4e>] osp_trans_stop_cb+0x1be/0x2d0 [osp]
08:17:18: [<ffffffffa0f7b2be>] osp_update_interpret+0x21e/0x4a0 [osp]
08:17:18: [<ffffffffa07900b5>] ptlrpc_check_set+0x615/0x1da0 [ptlrpc]
08:17:18: [<ffffffffa0791b9a>] ptlrpc_set_wait+0x35a/0x960 [ptlrpc]
08:17:18: [<ffffffffa0792221>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
08:17:18: [<ffffffffa0f7b9c6>] osp_send_update_req+0x256/0x850 [osp]
08:17:18: [<ffffffffa0f7c63f>] osp_send_update_thread+0x20f/0x7ac [osp]
08:17:18: [<ffffffff810a0fce>] kthread+0x9e/0xc0

It shouldn't be possible to cause the node to crash, no matter how the test directories are being created.

Info required for matching: sanity-scrub 2



 Comments   
Comment by Andreas Dilger [ 17/Feb/16 ]

This failed in sanity-scrub test_1a also: https://testing.hpdd.intel.com/test_sets/07f70048-d547-11e5-9cc2-5254006e85c2

Info required for matching: sanity-scrub 1a

Comment by Andreas Dilger [ 17/Feb/16 ]

Wang Di, Fan Yong, could you please take a look at this? The test shouldn't cause the MDS to crash just because I slightly changed the way the directories are being created (apparently "test_mkdir" is creating a 2-stripe directory rather than a regular directory created by "mkdir"). Running sanity-scrub has failed 3x on the http://review.whamcloud.com/18442 patch so it should be reproducible.

Comment by nasf (Inactive) [ 17/Feb/16 ]

I will investigate it.

Comment by nasf (Inactive) [ 17/Feb/16 ]

I cannot reproduce the issue in my local environment, but the test logs are clear: they show that the crash happened when unmounting the MDTx after creating a striped directory. At that time, neither file-level backup/restore nor OI scrub was running. The file-level backup happens after the umount, and the OI scrub happens after the file-level backup/restore.

It seems that when the MDTx was unmounted, the OUT RPC for creating the slave MDT-object of the striped directory on the remote MDT had NOT completed yet, and its callback triggered object_init during MDT stack cleanup. If my guess is correct, this issue can be reproduced by repeatedly calling scrub_prep() alone.

Di, would you please check the DNE async update logic? Thanks!

Comment by Di Wang [ 17/Feb/16 ]

Interesting, env from the request interrupt is not initialized at all, which should be the reason for this panic. See

                        /*
                         * At least one request is in flight, so no
                         * interrupts are allowed. Wait until all
                         * complete, or an in-flight req times out.
                         */
                        lwi = LWI_TIMEOUT(cfs_time_seconds(timeout ? timeout : 1),
                                          ptlrpc_expired_set, set);

                rc = l_wait_event(set->set_waitq, ptlrpc_check_set(NULL, set), &lwi);
Comment by Joseph Gmitter (Inactive) [ 17/Feb/16 ]

Reassigning to Di

Comment by Gerrit Updater [ 18/Feb/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18493
Subject: LU-7782 osp: re-initialize environment if necessary
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1ad165b666050280a28befa6d37e9e0a7e09b9e7

Comment by James Nunez (Inactive) [ 22/Feb/16 ]

Looks like another instance on sanity-scrub test_10a at

2016-02-21 18:51:44 - https://testing.hpdd.intel.com/test_sets/b5b59d62-d8de-11e5-b4e5-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 25/Feb/16 ]

+1 other instance on sanity-scrub test_10a at https://testing.hpdd.intel.com/test_sets/4e014c2e-db87-11e5-b8c9-5254006e85c2

Comment by Andreas Dilger [ 01/Mar/16 ]

Looks like this patch didn't fix sanity-scrub with my small patch to change to test_mkdir:
https://testing.hpdd.intel.com/test_sets/dc48373e-d9d1-11e5-8b17-5254006e85c2

Comment by Di Wang [ 01/Mar/16 ]

It looks like a different issue, and I did not see an Oops on any MDS node. Fan Yong, could you please check why LFSCK is stuck here? Thanks.

Comment by nasf (Inactive) [ 06/Mar/16 ]

The issues found by James and Bruno are the same as the original issue found by Andreas. But the latest failure instance in https://testing.hpdd.intel.com/test_sets/4e014c2e-db87-11e5-b8c9-5254006e85c2 is different. That is because, in the MDT file-level backup/restore case, the current OSD cannot know the correct OI mapping for the slave MDT-object of the striped directory until the OI scrub has completely rebuilt the OI files. I will make a patch to handle it.

Comment by Gerrit Updater [ 06/Mar/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18801
Subject: LU-7782 scrub: handle slave obj of striped directory
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 13c9ae64c78e1036f0464e78b0983f86e2383f1b

Comment by Gerrit Updater [ 14/Mar/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18493/
Subject: LU-7782 osp: save env for update callback
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 97510a3fcb1bc073fe9d45267cb541f3e1406d8d

Comment by Andreas Dilger [ 14/Mar/16 ]

The 18801 patch also needs to land before this can be closed.

Comment by Bob Glossman (Inactive) [ 04/Apr/16 ]

Seen in b2_8. I think the fix only went into master after the branch was made.

https://testing.hpdd.intel.com/test_sets/b2502df8-f950-11e5-812a-5254006e85c2

Comment by Gerrit Updater [ 04/Apr/16 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/19313
Subject: LU-7782 osp: save env for update callback
Project: fs/lustre-release
Branch: b2_8
Current Patch Set: 1
Commit: 079971d9943fa6f218c3b4188f0f6574e97b341d

Comment by Gerrit Updater [ 11/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18801/
Subject: LU-7782 scrub: handle slave obj of striped directory
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 80fe81c5b14835bbd5d751e878edbd00fe90f797

Comment by Joseph Gmitter (Inactive) [ 13/Jul/16 ]

Patches have landed to master for 2.9.0

Comment by Gerrit Updater [ 14/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/21313
Subject: Revert "LU-7782 scrub: handle slave obj of striped directory"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 60f7f0815007dffd709de93698bba3bd2380535c

Comment by Gerrit Updater [ 20/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21313/
Subject: Revert "LU-7782 scrub: handle slave obj of striped directory"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0f37c051158a399f7b00536eeec27f5dbdd54168

Comment by Oleg Drokin [ 20/Jul/16 ]

The patch here was reverted because it appears to be causing multiple issues tracked under LU-8399, LU-8416, and others, and the fix in LU-8399 alone was not enough to resolve them.

Comment by Gerrit Updater [ 26/Jul/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21506
Subject: LU-7782 scrub: handle slave obj of striped directory
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bcde560652d19f66c0ddf650e895d620e87e3537

Comment by Gerrit Updater [ 10/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21506/
Subject: LU-7782 scrub: handle slave obj of striped directory
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 842bda9c5b41eef9e43dc3e00f05767147611677

Comment by Peter Jones [ 10/Sep/16 ]

Landed for 2.9
