[LU-4386] conf-sanity test_50h: LBUG: (import.c:285:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed Created: 14/Dec/13  Updated: 20/Feb/14  Resolved: 18/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: mn4
Environment:

Lustre client: http://build.whamcloud.com/job/lustre-b2_4/67/
Lustre server: http://build.whamcloud.com/job/lustre-b2_3/41/ (2.3.0)


Issue Links:
Duplicate
duplicates LU-4636 Test failure on test suite conf-sanit... Closed
Severity: 3
Rank (Obsolete): 12021

 Description   

While running conf-sanity test 50h, the following LBUG failure occurred on MDS node:

09:07:20:Lustre: Permanently deactivating lustre-OST0000
09:07:20:Lustre: setting import lustre-OST0000_UUID INACTIVE by administrator request
09:07:20:LustreError: 8773:0:(import.c:285:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed: 
09:07:20:LustreError: 8773:0:(import.c:285:ptlrpc_invalidate_import()) LBUG
09:07:20:Pid: 8773, comm: llog_process_th
09:07:20:
09:07:20:Call Trace:
09:07:20: [<ffffffffa04ea905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
09:07:20: [<ffffffffa04eaf17>] lbug_with_loc+0x47/0xb0 [libcfs]
09:07:21: [<ffffffffa0808421>] ptlrpc_invalidate_import+0x391/0x8d0 [ptlrpc]
09:07:21: [<ffffffffa04fb5f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
09:07:21: [<ffffffffa0647f59>] ? lprocfs_write_frac_helper+0xa9/0x150 [obdclass]
09:07:21: [<ffffffffa07ddca5>] ptlrpc_set_import_active+0x405/0x6e0 [ptlrpc]
09:07:21: [<ffffffffa09992ae>] osc_wr_active+0xfe/0x120 [osc]
09:07:21: [<ffffffffa065e05e>] class_process_proc_param+0x35e/0x780 [obdclass]
09:07:21: [<ffffffffa098688a>] osc_process_config_base+0x4a/0x60 [osc]
09:07:21: [<ffffffffa099df1f>] osc_cl_process_config+0xaf/0xe0 [osc]
09:07:21: [<ffffffffa065fb53>] class_process_config+0x16d3/0x1c30 [obdclass]
09:07:21: [<ffffffffa0658ef1>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
09:07:21: [<ffffffffa066115b>] class_config_llog_handler+0x9bb/0x1610 [obdclass]
09:07:22: [<ffffffffa04fb5f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
09:07:22: [<ffffffffa062a1f8>] llog_process_thread+0x888/0xd00 [obdclass]
09:07:22: [<ffffffffa0629970>] ? llog_process_thread+0x0/0xd00 [obdclass]
09:07:22: [<ffffffff8100c14a>] child_rip+0xa/0x20
09:07:22: [<ffffffffa0629970>] ? llog_process_thread+0x0/0xd00 [obdclass]
09:07:22: [<ffffffffa0629970>] ? llog_process_thread+0x0/0xd00 [obdclass]
09:07:22: [<ffffffff8100c140>] ? child_rip+0x0/0x20
09:07:23:
09:07:23:Kernel panic - not syncing: LBUG

Maloo report: https://maloo.whamcloud.com/test_sets/838153a6-628c-11e3-bf4a-52540035b04c



 Comments   
Comment by Jian Yu [ 14/Dec/13 ]

This is a regression failure caused by Lustre b2_4 build #67 because the failure did not occur before while running interop testing on previous Lustre b2_4 builds with 2.3.0 server.

Comment by Peter Jones [ 15/Dec/13 ]

Hongchao

What is your assessment of this issue?

Thanks

Peter

Comment by Jian Yu [ 15/Dec/13 ]

There are two patches included in Lustre b2_4 build #67:
LU-4362: http://review.whamcloud.com/8510
LU-3929: http://review.whamcloud.com/8513

FYI, I tried to run the interop testing with Lustre 2.3.0 server and the LU-3929 build as client, conf-sanity passed twice:
https://maloo.whamcloud.com/test_sets/5944e50c-6571-11e3-b9b7-52540035b04c
https://maloo.whamcloud.com/test_sets/025877d6-6553-11e3-9f05-52540035b04c

More testing is ongoing.

In the meantime, please look into the failure to find out the real cause. Thanks!

Comment by Jian Yu [ 16/Dec/13 ]

I triggered Lustre b2_4 build #67 client + 2.3.0 server full group test session again, and the test results showed that the failure did not occur in this session:
https://maloo.whamcloud.com/test_sessions/9ed3c6e4-65c4-11e3-b9b7-52540035b04c

Comment by Hongchao Zhang [ 17/Dec/13 ]

there could be a race between ptlrpc_invalidate_import and ptlrpc_activate_import,

1, in llog_process_thread
osc_process_config_base -> class_process_proc_param -> osc_wr_active -> ptlrpc_set_import_active -> ptlrpc_invalidate_import, which will call
ptlrpc_deactivate_import to deactive the import.

2, during lod_process_config processed "LCFG_LOV_ADD_INA" (initial inactive), lod_add_device was called to connect the target even "active" parameter was false,
and if the connection is successful, ptlrpc_activate_import will be called in ptlrpc_connect_interpret, and if it is just called between ptlrpc_invalidate_import
and the assertion "LASSERT(imp->imp_invalid)", then it will trigger the LBUG.

Comment by Jian Yu [ 19/Dec/13 ]

FYI, the failure did not occur in the Lustre 2.4.2 RC1 client + 2.3.0 server full group test session:
https://maloo.whamcloud.com/test_sessions/2f6f4020-67eb-11e3-846d-52540035b04c

Comment by Hongchao Zhang [ 19/Dec/13 ]

the issue has been reproduced locally by adding some delay between operations and restarting MDT.
the corresponding patch is under creation and test.

Comment by Hongchao Zhang [ 20/Dec/13 ]

the patch against b2_4 is tracked at http://review.whamcloud.com/#/c/8634/, and it is also needed in b2_3.

Comment by Peter Jones [ 20/Dec/13 ]

Thanks Hongchao. Is this change needed on master too?

Comment by Hongchao Zhang [ 23/Dec/13 ]

Yes, the master also need this patch.

Comment by Sarah Liu [ 27/Dec/13 ]

also hit this issue in master branch under DNE mode

https://maloo.whamcloud.com/test_sets/f722c56e-6ec7-11e3-9dbc-52540035b04c

Comment by Sarah Liu [ 31/Dec/13 ]

another instance seen in review-dne

https://maloo.whamcloud.com/test_sets/7d013360-711a-11e3-b502-52540035b04c

Comment by Bob Glossman (Inactive) [ 06/Jan/14 ]

another instance in master review
https://maloo.whamcloud.com/test_sets/fb94c47e-75e5-11e3-b3c0-52540035b04c

Comment by Andreas Dilger [ 06/Jan/14 ]

I cherry-picked this patch to master (using the handy new "Cherry Pick" option in Gerrit), and updated the commit comment slightly.

http://review.whamcloud.com/8747

Comment by Jian Yu [ 09/Jan/14 ]

While verifying patch http://review.whamcloud.com/8734 on Lustre b2_5 branch, conf-sanity test 50h hit the same failure:

Lustre: DEBUG MARKER: == conf-sanity test 50h: LU-642: activate deactivated OST ===== 09:05:36 (1389027936)^M
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1^M
Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1^M
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl                                /dev/lvm-Role_MDS/P1 /mnt/mds1^M
LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: ^M
Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs.^M
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust^M
Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null^M
Lustre: MGS: Regenerating lustre-OST0000 log by user request.^M
Lustre: Permanently deactivating lustre-OST0000^M
Lustre: setting import lustre-OST0000_UUID INACTIVE by administrator request^M
LustreError: 14612:0:(import.c:284:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed: ^M
LustreError: 14612:0:(import.c:284:ptlrpc_invalidate_import()) LBUG^M
Pid: 14612, comm: llog_process_th^M
^M
Call Trace:^M
 [<ffffffffa03ef895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
 [<ffffffffa03efe97>] lbug_with_loc+0x47/0xb0 [libcfs]^M
 [<ffffffffa08ac8c7>] ptlrpc_invalidate_import+0x377/0x880 [ptlrpc]^M
 [<ffffffffa03ff7b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]^M
 [<ffffffffa064f549>] ? lprocfs_write_frac_helper+0xa9/0x150 [obdclass]^M
 [<ffffffffa08897a8>] ptlrpc_set_import_active+0x3d8/0x670 [ptlrpc]^M
 [<ffffffffa12b3f0e>] osp_wr_active+0xfe/0x120 [osp]^M
 [<ffffffffa066eb06>] class_process_proc_param+0x356/0x770 [obdclass]^M
 [<ffffffffa12a2b45>] osp_process_config+0x2a5/0x640 [osp]^M
 [<ffffffffa0671560>] class_process_config+0x1910/0x1ad0 [obdclass]^M
 [<ffffffffa066a352>] ? lustre_cfg_new+0x312/0x6e0 [obdclass]^M
 [<ffffffffa067283b>] class_config_llog_handler+0xa2b/0x1880 [obdclass]^M
 [<ffffffffa03ff7b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]^M
 [<ffffffffa0634dab>] llog_process_thread+0x8ab/0xd60 [obdclass]^M
 [<ffffffffa067d43f>] ? keys_fill+0x6f/0x190 [obdclass]^M
 [<ffffffffa06352a5>] llog_process_thread_daemonize+0x45/0x70 [obdclass]^M
 [<ffffffffa0635260>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass]^M
 [<ffffffff81096a36>] kthread+0x96/0xa0^M
 [<ffffffff8100c0ca>] child_rip+0xa/0x20^M
 [<ffffffff810969a0>] ? kthread+0x0/0xa0^M
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20^M
^M
Kernel panic - not syncing: LBUG^M

Maloo report: https://maloo.whamcloud.com/test_sets/4a0358ae-7736-11e3-9ce8-52540035b04c

This is a regression on Lustre b2_5 branch introduced by build #6 or #7 or #8.

Comment by Hongchao Zhang [ 13/Jan/14 ]

Hi Yujian, does the build #6, #7, #8 contain the patch at http://review.whamcloud.com/#/c/8634/? Thanks

Comment by Peter Jones [ 13/Jan/14 ]

No it does not. You can click on the links for each build to see which fixes where included in that build at http://build.whamcloud.com/job/lustre-b2_5/

Comment by nasf (Inactive) [ 05/Feb/14 ]

Hit it on master:

https://maloo.whamcloud.com/test_sets/a22c319e-8e7f-11e3-b036-52540035b04c

Comment by Andreas Dilger [ 10/Feb/14 ]

Unfortunately, the information about what patches are included in build #6, #7, and #8 is now lost, because Jenkins does not cache build information indefinitely. It would have been useful to paste the list of patches for each build into this bug.

Comment by Jian Yu [ 15/Feb/14 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b2_4/70/ (2.4.2)
Lustre server build: http://build.whamcloud.com/job/lustre-b2_5/21/
Distro/Arch: RHEL6.4/x86_64

The same failure occurred, console log on MDS:

Lustre: DEBUG MARKER: /usr/sbin/lctl mark == conf-sanity test 50h: LU-642: activate deactivated OST  ===== 10:06:50 \(1392314810\)^M
Lustre: DEBUG MARKER: == conf-sanity test 50h: LU-642: activate deactivated OST ===== 10:06:50 (1392314810)^M
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1^M
Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1^M
Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre                                  /dev/lvm-Role_MDS/P1 /mnt/mds1^M
LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: ^M
Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs.^M
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust^M
Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null^M
Lustre: MGS: Regenerating lustre-OST0000 log by user request.^M
Lustre: Permanently deactivating lustre-OST0000^M
Lustre: setting import lustre-OST0000_UUID INACTIVE by administrator request^M
LustreError: 19638:0:(import.c:284:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed: ^M
LustreError: 19638:0:(import.c:284:ptlrpc_invalidate_import()) LBUG^M
Pid: 19638, comm: llog_process_th^M
^M
Call Trace:^M
 [<ffffffffa03f7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
 [<ffffffffa03f7e97>] lbug_with_loc+0x47/0xb0 [libcfs]^M
 [<ffffffffa08bc227>] ptlrpc_invalidate_import+0x377/0x880 [ptlrpc]^M
 [<ffffffffa04077b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]^M
 [<ffffffffa0685559>] ? lprocfs_write_frac_helper+0xa9/0x150 [obdclass]^M
 [<ffffffffa08990f8>] ptlrpc_set_import_active+0x3d8/0x670 [ptlrpc]^M
 [<ffffffffa1307f0e>] osp_wr_active+0xfe/0x120 [osp]^M
 [<ffffffffa06a368a>] class_process_proc_param+0x34a/0x770 [obdclass]^M
 [<ffffffffa12f6b45>] osp_process_config+0x2a5/0x640 [osp]^M
 [<ffffffffa06a7570>] class_process_config+0x1910/0x1ad0 [obdclass]^M
 [<ffffffffa06a0362>] ? lustre_cfg_new+0x312/0x6e0 [obdclass]^M
 [<ffffffffa06a884b>] class_config_llog_handler+0xa2b/0x1880 [obdclass]^M
 [<ffffffffa04077b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]^M
 [<ffffffffa066adab>] llog_process_thread+0x8ab/0xd60 [obdclass]^M
 [<ffffffffa06b344f>] ? keys_fill+0x6f/0x190 [obdclass]^M
 [<ffffffffa066b2a5>] llog_process_thread_daemonize+0x45/0x70 [obdclass]^M
 [<ffffffffa066b260>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass]^M
 [<ffffffff81096a36>] kthread+0x96/0xa0^M
 [<ffffffff8100c0ca>] child_rip+0xa/0x20^M
 [<ffffffff810969a0>] ? kthread+0x0/0xa0^M
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20^M
^M
Kernel panic - not syncing: LBUG^M

Maloo report:
https://maloo.whamcloud.com/test_sets/0d342454-95b7-11e3-bb07-52540035b04c

Comment by Jian Yu [ 15/Feb/14 ]

I just cherry-picked the patch to Lustre b2_5 branch (using the "Cherry Pick To" option in Gerrit): http://review.whamcloud.com/9283

Comment by Peter Jones [ 18/Feb/14 ]

Landed for 2.6

Generated at Sat Feb 10 01:42:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.