[LU-4386] conf-sanity test_50h: LBUG: (import.c:285:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed Created: 14/Dec/13 Updated: 20/Feb/14 Resolved: 18/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1, Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4 | ||
| Environment: |
Lustre client: http://build.whamcloud.com/job/lustre-b2_4/67/ |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 12021 | ||||||||
| Description |
|
While running conf-sanity test 50h, the following LBUG failure occurred on MDS node: 09:07:20:Lustre: Permanently deactivating lustre-OST0000 09:07:20:Lustre: setting import lustre-OST0000_UUID INACTIVE by administrator request 09:07:20:LustreError: 8773:0:(import.c:285:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed: 09:07:20:LustreError: 8773:0:(import.c:285:ptlrpc_invalidate_import()) LBUG 09:07:20:Pid: 8773, comm: llog_process_th 09:07:20: 09:07:20:Call Trace: 09:07:20: [<ffffffffa04ea905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 09:07:20: [<ffffffffa04eaf17>] lbug_with_loc+0x47/0xb0 [libcfs] 09:07:21: [<ffffffffa0808421>] ptlrpc_invalidate_import+0x391/0x8d0 [ptlrpc] 09:07:21: [<ffffffffa04fb5f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 09:07:21: [<ffffffffa0647f59>] ? lprocfs_write_frac_helper+0xa9/0x150 [obdclass] 09:07:21: [<ffffffffa07ddca5>] ptlrpc_set_import_active+0x405/0x6e0 [ptlrpc] 09:07:21: [<ffffffffa09992ae>] osc_wr_active+0xfe/0x120 [osc] 09:07:21: [<ffffffffa065e05e>] class_process_proc_param+0x35e/0x780 [obdclass] 09:07:21: [<ffffffffa098688a>] osc_process_config_base+0x4a/0x60 [osc] 09:07:21: [<ffffffffa099df1f>] osc_cl_process_config+0xaf/0xe0 [osc] 09:07:21: [<ffffffffa065fb53>] class_process_config+0x16d3/0x1c30 [obdclass] 09:07:21: [<ffffffffa0658ef1>] ? lustre_cfg_new+0x391/0x7e0 [obdclass] 09:07:21: [<ffffffffa066115b>] class_config_llog_handler+0x9bb/0x1610 [obdclass] 09:07:22: [<ffffffffa04fb5f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 09:07:22: [<ffffffffa062a1f8>] llog_process_thread+0x888/0xd00 [obdclass] 09:07:22: [<ffffffffa0629970>] ? llog_process_thread+0x0/0xd00 [obdclass] 09:07:22: [<ffffffff8100c14a>] child_rip+0xa/0x20 09:07:22: [<ffffffffa0629970>] ? llog_process_thread+0x0/0xd00 [obdclass] 09:07:22: [<ffffffffa0629970>] ? llog_process_thread+0x0/0xd00 [obdclass] 09:07:22: [<ffffffff8100c140>] ? child_rip+0x0/0x20 09:07:23: 09:07:23:Kernel panic - not syncing: LBUG Maloo report: https://maloo.whamcloud.com/test_sets/838153a6-628c-11e3-bf4a-52540035b04c |
| Comments |
| Comment by Jian Yu [ 14/Dec/13 ] |
|
This is a regression failure caused by Lustre b2_4 build #67 because the failure did not occur before while running interop testing on previous Lustre b2_4 builds with 2.3.0 server. |
| Comment by Peter Jones [ 15/Dec/13 ] |
|
Hongchao What is your assessment of this issue? Thanks Peter |
| Comment by Jian Yu [ 15/Dec/13 ] |
|
There are two patches included in Lustre b2_4 build #67: FYI, I tried to run the interop testing with Lustre 2.3.0 server and the More testing is ongoing. In the meantime, please look into the failure to find out the real cause. Thanks! |
| Comment by Jian Yu [ 16/Dec/13 ] |
|
I triggered Lustre b2_4 build #67 client + 2.3.0 server full group test session again, and the test results showed that the failure did not occur in this session: |
| Comment by Hongchao Zhang [ 17/Dec/13 ] |
|
there could be a race between ptlrpc_invalidate_import and ptlrpc_activate_import, 1, in llog_process_thread 2, during lod_process_config processed "LCFG_LOV_ADD_INA" (initial inactive), lod_add_device was called to connect the target even "active" parameter was false, |
| Comment by Jian Yu [ 19/Dec/13 ] |
|
FYI, the failure did not occur in the Lustre 2.4.2 RC1 client + 2.3.0 server full group test session: |
| Comment by Hongchao Zhang [ 19/Dec/13 ] |
|
the issue has been reproduced locally by adding some delay between operations and restarting MDT. |
| Comment by Hongchao Zhang [ 20/Dec/13 ] |
|
the patch against b2_4 is tracked at http://review.whamcloud.com/#/c/8634/, and it is also needed in b2_3. |
| Comment by Peter Jones [ 20/Dec/13 ] |
|
Thanks Hongchao. Is this change needed on master too? |
| Comment by Hongchao Zhang [ 23/Dec/13 ] |
|
Yes, the master also need this patch. |
| Comment by Sarah Liu [ 27/Dec/13 ] |
|
also hit this issue in master branch under DNE mode https://maloo.whamcloud.com/test_sets/f722c56e-6ec7-11e3-9dbc-52540035b04c |
| Comment by Sarah Liu [ 31/Dec/13 ] |
|
another instance seen in review-dne https://maloo.whamcloud.com/test_sets/7d013360-711a-11e3-b502-52540035b04c |
| Comment by Bob Glossman (Inactive) [ 06/Jan/14 ] |
|
another instance in master review |
| Comment by Andreas Dilger [ 06/Jan/14 ] |
|
I cherry-picked this patch to master (using the handy new "Cherry Pick" option in Gerrit), and updated the commit comment slightly. |
| Comment by Jian Yu [ 09/Jan/14 ] |
|
While verifying patch http://review.whamcloud.com/8734 on Lustre b2_5 branch, conf-sanity test 50h hit the same failure: Lustre: DEBUG MARKER: == conf-sanity test 50h: LU-642: activate deactivated OST ===== 09:05:36 (1389027936)^M Lustre: DEBUG MARKER: mkdir -p /mnt/mds1^M Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1^M Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl /dev/lvm-Role_MDS/P1 /mnt/mds1^M LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: ^M Lustre: MGS: Logs for fs lustre were removed by user request. All servers must be restarted in order to regenerate the logs.^M Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust^M Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null^M Lustre: MGS: Regenerating lustre-OST0000 log by user request.^M Lustre: Permanently deactivating lustre-OST0000^M Lustre: setting import lustre-OST0000_UUID INACTIVE by administrator request^M LustreError: 14612:0:(import.c:284:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed: ^M LustreError: 14612:0:(import.c:284:ptlrpc_invalidate_import()) LBUG^M Pid: 14612, comm: llog_process_th^M ^M Call Trace:^M [<ffffffffa03ef895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M [<ffffffffa03efe97>] lbug_with_loc+0x47/0xb0 [libcfs]^M [<ffffffffa08ac8c7>] ptlrpc_invalidate_import+0x377/0x880 [ptlrpc]^M [<ffffffffa03ff7b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]^M [<ffffffffa064f549>] ? lprocfs_write_frac_helper+0xa9/0x150 [obdclass]^M [<ffffffffa08897a8>] ptlrpc_set_import_active+0x3d8/0x670 [ptlrpc]^M [<ffffffffa12b3f0e>] osp_wr_active+0xfe/0x120 [osp]^M [<ffffffffa066eb06>] class_process_proc_param+0x356/0x770 [obdclass]^M [<ffffffffa12a2b45>] osp_process_config+0x2a5/0x640 [osp]^M [<ffffffffa0671560>] class_process_config+0x1910/0x1ad0 [obdclass]^M [<ffffffffa066a352>] ? lustre_cfg_new+0x312/0x6e0 [obdclass]^M [<ffffffffa067283b>] class_config_llog_handler+0xa2b/0x1880 [obdclass]^M [<ffffffffa03ff7b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]^M [<ffffffffa0634dab>] llog_process_thread+0x8ab/0xd60 [obdclass]^M [<ffffffffa067d43f>] ? keys_fill+0x6f/0x190 [obdclass]^M [<ffffffffa06352a5>] llog_process_thread_daemonize+0x45/0x70 [obdclass]^M [<ffffffffa0635260>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass]^M [<ffffffff81096a36>] kthread+0x96/0xa0^M [<ffffffff8100c0ca>] child_rip+0xa/0x20^M [<ffffffff810969a0>] ? kthread+0x0/0xa0^M [<ffffffff8100c0c0>] ? child_rip+0x0/0x20^M ^M Kernel panic - not syncing: LBUG^M Maloo report: https://maloo.whamcloud.com/test_sets/4a0358ae-7736-11e3-9ce8-52540035b04c This is a regression on Lustre b2_5 branch introduced by build #6 or #7 or #8. |
| Comment by Hongchao Zhang [ 13/Jan/14 ] |
|
Hi Yujian, does the build #6, #7, #8 contain the patch at http://review.whamcloud.com/#/c/8634/? Thanks |
| Comment by Peter Jones [ 13/Jan/14 ] |
|
No it does not. You can click on the links for each build to see which fixes where included in that build at http://build.whamcloud.com/job/lustre-b2_5/ |
| Comment by nasf (Inactive) [ 05/Feb/14 ] |
|
Hit it on master: https://maloo.whamcloud.com/test_sets/a22c319e-8e7f-11e3-b036-52540035b04c |
| Comment by Andreas Dilger [ 10/Feb/14 ] |
|
Unfortunately, the information about what patches are included in build #6, #7, and #8 is now lost, because Jenkins does not cache build information indefinitely. It would have been useful to paste the list of patches for each build into this bug. |
| Comment by Jian Yu [ 15/Feb/14 ] |
|
Lustre client build: http://build.whamcloud.com/job/lustre-b2_4/70/ (2.4.2) The same failure occurred, console log on MDS: Lustre: DEBUG MARKER: /usr/sbin/lctl mark == conf-sanity test 50h: LU-642: activate deactivated OST ===== 10:06:50 \(1392314810\)^M Lustre: DEBUG MARKER: == conf-sanity test 50h: LU-642: activate deactivated OST ===== 10:06:50 (1392314810)^M Lustre: DEBUG MARKER: mkdir -p /mnt/mds1^M Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1^M Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/mds1^M LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: ^M Lustre: MGS: Logs for fs lustre were removed by user request. All servers must be restarted in order to regenerate the logs.^M Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust^M Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null^M Lustre: MGS: Regenerating lustre-OST0000 log by user request.^M Lustre: Permanently deactivating lustre-OST0000^M Lustre: setting import lustre-OST0000_UUID INACTIVE by administrator request^M LustreError: 19638:0:(import.c:284:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed: ^M LustreError: 19638:0:(import.c:284:ptlrpc_invalidate_import()) LBUG^M Pid: 19638, comm: llog_process_th^M ^M Call Trace:^M [<ffffffffa03f7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M [<ffffffffa03f7e97>] lbug_with_loc+0x47/0xb0 [libcfs]^M [<ffffffffa08bc227>] ptlrpc_invalidate_import+0x377/0x880 [ptlrpc]^M [<ffffffffa04077b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]^M [<ffffffffa0685559>] ? lprocfs_write_frac_helper+0xa9/0x150 [obdclass]^M [<ffffffffa08990f8>] ptlrpc_set_import_active+0x3d8/0x670 [ptlrpc]^M [<ffffffffa1307f0e>] osp_wr_active+0xfe/0x120 [osp]^M [<ffffffffa06a368a>] class_process_proc_param+0x34a/0x770 [obdclass]^M [<ffffffffa12f6b45>] osp_process_config+0x2a5/0x640 [osp]^M [<ffffffffa06a7570>] class_process_config+0x1910/0x1ad0 [obdclass]^M [<ffffffffa06a0362>] ? lustre_cfg_new+0x312/0x6e0 [obdclass]^M [<ffffffffa06a884b>] class_config_llog_handler+0xa2b/0x1880 [obdclass]^M [<ffffffffa04077b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]^M [<ffffffffa066adab>] llog_process_thread+0x8ab/0xd60 [obdclass]^M [<ffffffffa06b344f>] ? keys_fill+0x6f/0x190 [obdclass]^M [<ffffffffa066b2a5>] llog_process_thread_daemonize+0x45/0x70 [obdclass]^M [<ffffffffa066b260>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass]^M [<ffffffff81096a36>] kthread+0x96/0xa0^M [<ffffffff8100c0ca>] child_rip+0xa/0x20^M [<ffffffff810969a0>] ? kthread+0x0/0xa0^M [<ffffffff8100c0c0>] ? child_rip+0x0/0x20^M ^M Kernel panic - not syncing: LBUG^M Maloo report: |
| Comment by Jian Yu [ 15/Feb/14 ] |
|
I just cherry-picked the patch to Lustre b2_5 branch (using the "Cherry Pick To" option in Gerrit): http://review.whamcloud.com/9283 |
| Comment by Peter Jones [ 18/Feb/14 ] |
|
Landed for 2.6 |