[LU-5218] Interop 2.5.1<->2.6 failure on test suite lustre-rsync-test test_1: ASSERTION( index >= 0 && index < LLOG_MAX_CTXTS ) failed Created: 18/Jun/14 Updated: 25/Jun/14 Resolved: 25/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB | ||
| Environment: |
server: lustre-master build # 2901 |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14552 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. It relates to the following test suite run: http://maloo.whamcloud.com/test_sets/c22bdcb4-f4ba-11e3-ae09-52540035b04c. The sub-test test_1 failed with the following error:
MDS console
19:05:56:Lustre: DEBUG MARKER: == lustre-rsync-test test 1: Simple Replication == 19:05:44 (1402797944)
19:05:56:Lustre: DEBUG MARKER: lctl --device lustre-MDT0000 changelog_register -n
19:05:56:Lustre: lustre-MDD0000: changelog on
19:05:56:Lustre: DEBUG MARKER: lctl get_param -n mdd.lustre-MDT0000.changelog_users
19:05:56:Lustre: DEBUG MARKER: dumpe2fs -h /dev/lvm-Role_MDS/P1 2>&1 |
19:05:56: grep -E -q '(ea_inode|large_xattr)'
19:05:56:Lustre: DEBUG MARKER: dumpe2fs -h /dev/lvm-Role_MDS/P1 2>&1
19:05:56:Lustre: DEBUG MARKER: dumpe2fs -h /dev/lvm-Role_MDS/P1 2>&1 |
19:05:56: grep -E -q '(ea_inode|large_xattr)'
19:05:56:LustreError: 8689:0:(lustre_log.h:440:llog_group_get_ctxt()) ASSERTION( index >= 0 && index < LLOG_MAX_CTXTS ) failed:
19:05:56:LustreError: 8689:0:(lustre_log.h:440:llog_group_get_ctxt()) LBUG
19:05:56:Pid: 8689, comm: mdt00_000
19:05:56:
19:05:56:Call Trace:
19:05:56: [<ffffffffa048e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
19:05:56: [<ffffffffa048ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
19:05:56: [<ffffffffa0849868>] llog_origin_handle_open+0x668/0x670 [ptlrpc]
19:05:56: [<ffffffffa088db35>] tgt_llog_open+0x35/0xd0 [ptlrpc]
19:05:56: [<ffffffffa08942cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
19:05:56: [<ffffffffa0843d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
19:05:56: [<ffffffffa0843020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
19:05:56: [<ffffffff8109ab56>] kthread+0x96/0xa0
19:05:56: [<ffffffff8100c20a>] child_rip+0xa/0x20
19:05:56: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
19:05:56: [<ffffffff8100c200>] ? child_rip+0x0/0x20
19:05:56:
19:05:56:Kernel panic - not syncing: LBUG
19:05:56:Pid: 8689, comm: mdt00_000 Not tainted 2.6.32-431.17.1.el6_lustre.g8d5344f.x86_64 #1
19:05:56:Call Trace:
19:05:56: [<ffffffff8152795f>] ? panic+0xa7/0x16f
19:05:56: [<ffffffffa048eeeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
19:05:56: [<ffffffffa0849868>] ? llog_origin_handle_open+0x668/0x670 [ptlrpc]
19:05:56: [<ffffffffa088db35>] ? tgt_llog_open+0x35/0xd0 [ptlrpc]
19:05:56: [<ffffffffa08942cc>] ? tgt_request_handle+0x23c/0xac0 [ptlrpc]
19:05:56: [<ffffffffa0843d3a>] ? ptlrpc_main+0xd1a/0x1980 [ptlrpc]
19:05:56: [<ffffffffa0843020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
19:05:56: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
19:05:56: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
19:05:56: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
19:05:56: [<ffffffff8100c200>] ? child_rip+0x0/0x20
19:05:56:Initializing cgroup subsys cpuset
19:05:56:Initializing cgroup subsys cpu
19:05:56:Linux version 2.6.32-431.17.1.el6_lustre.g8d5344f.x86_64 (jenkins@builder-2- |
| Comments |
| Comment by Oleg Drokin [ 18/Jun/14 ] |
|
With the new crashdump feature now available for maloo runs, please also check that a crashdump is available for this failure and, I guess, use it to see what the index value was. |
| Comment by Mikhail Pershin [ 19/Jun/14 ] |
|
This is related to commit 14d162c5438de959d0ea01fb1b40a7c5dfa764d1:

@@ -208,15 +204,9 @@ enum llog_ctxt_id {
 	LLOG_CONFIG_ORIG_CTXT = 0,
 	LLOG_CONFIG_REPL_CTXT,
 	LLOG_MDS_OST_ORIG_CTXT,
-	LLOG_MDS_OST_REPL_CTXT,
 	LLOG_SIZE_ORIG_CTXT,
 	LLOG_SIZE_REPL_CTXT,
-	LLOG_RD1_ORIG_CTXT,
-	LLOG_RD1_REPL_CTXT,
 	LLOG_TEST_ORIG_CTXT,
-	LLOG_TEST_REPL_CTXT,
-	LLOG_LOVEA_ORIG_CTXT,
-	LLOG_LOVEA_REPL_CTXT,
 	LLOG_CHANGELOG_ORIG_CTXT, /**< changelog generation on mdd */
 	LLOG_CHANGELOG_REPL_CTXT, /**< changelog access on clients */
 	LLOG_CHANGELOG_USER_ORIG_CTXT, /**< for multiple changelog consumers */

Some entries were removed from the enum, so every llog context ID after LLOG_MDS_OST_ORIG_CTXT has had a different value since that commit and is incompatible with all pre-commit Lustre versions. This is critical for changelog and HSM agents in old versions. I am not sure which fix would be better: assigning the old numbers to the remaining entries, or keeping two tables and using one or the other depending on the connected client's version. |
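To make the renumbering concrete, here is a small sketch that writes out both enums as the diff above implies them. The OLD_/NEW_ prefixes and the print_ctxt_shift() helper are made up for illustration and are not Lustre names, and the post-commit enum is cut off where the diff hunk ends, so any later entries are not shown.

```c
#include <assert.h>
#include <stdio.h>

/* Pre-commit numbering: context lines plus the removed ("-") lines. */
enum old_llog_ctxt_id {
	OLD_LLOG_CONFIG_ORIG_CTXT = 0,
	OLD_LLOG_CONFIG_REPL_CTXT,
	OLD_LLOG_MDS_OST_ORIG_CTXT,
	OLD_LLOG_MDS_OST_REPL_CTXT,
	OLD_LLOG_SIZE_ORIG_CTXT,
	OLD_LLOG_SIZE_REPL_CTXT,
	OLD_LLOG_RD1_ORIG_CTXT,
	OLD_LLOG_RD1_REPL_CTXT,
	OLD_LLOG_TEST_ORIG_CTXT,
	OLD_LLOG_TEST_REPL_CTXT,
	OLD_LLOG_LOVEA_ORIG_CTXT,
	OLD_LLOG_LOVEA_REPL_CTXT,
	OLD_LLOG_CHANGELOG_ORIG_CTXT,
	OLD_LLOG_CHANGELOG_REPL_CTXT,
	OLD_LLOG_CHANGELOG_USER_ORIG_CTXT
};

/* Post-commit numbering: context lines only. The hunk ends here, so
 * entries that follow in the real header are deliberately left out. */
enum new_llog_ctxt_id {
	NEW_LLOG_CONFIG_ORIG_CTXT = 0,
	NEW_LLOG_CONFIG_REPL_CTXT,
	NEW_LLOG_MDS_OST_ORIG_CTXT,
	NEW_LLOG_SIZE_ORIG_CTXT,
	NEW_LLOG_SIZE_REPL_CTXT,
	NEW_LLOG_TEST_ORIG_CTXT,
	NEW_LLOG_CHANGELOG_ORIG_CTXT,
	NEW_LLOG_CHANGELOG_REPL_CTXT,
	NEW_LLOG_CHANGELOG_USER_ORIG_CTXT
};

/* Hypothetical helper: print old and new values of the changelog IDs. */
void print_ctxt_shift(void)
{
	printf("CHANGELOG_ORIG      old=%d new=%d\n",
	       OLD_LLOG_CHANGELOG_ORIG_CTXT, NEW_LLOG_CHANGELOG_ORIG_CTXT);
	printf("CHANGELOG_REPL      old=%d new=%d\n",
	       OLD_LLOG_CHANGELOG_REPL_CTXT, NEW_LLOG_CHANGELOG_REPL_CTXT);
	printf("CHANGELOG_USER_ORIG old=%d new=%d\n",
	       OLD_LLOG_CHANGELOG_USER_ORIG_CTXT,
	       NEW_LLOG_CHANGELOG_USER_ORIG_CTXT);
}
```

Calling print_ctxt_shift() from a main() shows, for example, that a pre-commit peer would send 13 for LLOG_CHANGELOG_REPL_CTXT where a post-commit peer expects 7. Whether 13 is the exact index that tripped the assertion in this run is not confirmed in the ticket.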
| Comment by Mikhail Pershin [ 19/Jun/14 ] |
|
The patch is here - http://review.whamcloud.com/10758 I've just reverted the enum values back, added comments, and added a check of the context index received from the wire. |
| Comment by Andreas Dilger [ 19/Jun/14 ] |
|
What a gigantic mess. It surprises me that these llog_ctxt_id values have been in the network protocol since ancient times, but they are not declared in lustre_idl.h and are not checked in wirecheck/wiretest. What is the point of calling llog_open(..., name = CHANGELOG_CATALOG, ...) if it is passing body->lgd_ctxt_idx = LLOG_CHANGELOG_REPL_CTXT and using that to open the log?

For the 2.6 release I think we don't have any option except to move enum llog_ctxt_id into lustre_idl.h and assign the specific (old) values to each of the remaining LLOG_.*_CTXT values. They should also be added to wirecheck/wiretest on b2_5.

It looks like in ancient times on HEAD these values were specifically assigned since commit d2d56f38, but the explicit assignments were removed by a patch from you & Shadow in commit c9842fdc (some time in 1.6 via https://bugzilla.lustre.org/show_bug.cgi?id=13821). That patch removed the LLOG_MD_{ORIG,REPL}_CTXT values along with the explicit assignments, so the resulting values changed. Even worse, the c9842fdc patch changed the values after LLOG_MD_ORIG_CTXT to be different on master and b1_8 (where the enum llog_ctxt_id still has explicit values assigned). Fortunately, b1_8 does not have any HSM support, and we do not use LLOG_TEST_{ORIG,REPL}_CTXT over the network, so the fact that the assigned values at the high end are different does not impact anything.

I think the correct assignments today should be:

enum llog_ctxt_id {
	LLOG_CONFIG_ORIG_CTXT = 0,
	LLOG_CONFIG_REPL_CTXT = 1,
	LLOG_MDS_OST_ORIG_CTXT = 2,
	LLOG_MDS_OST_REPL_CTXT = 3,
	LLOG_SIZE_ORIG_CTXT = 4,
	LLOG_SIZE_REPL_CTXT = 5,
	LLOG_TEST_ORIG_CTXT = 8,
	LLOG_TEST_REPL_CTXT = 9,
	LLOG_CHANGELOG_ORIG_CTXT = 12, /**< changelog generation on mdd */
	LLOG_CHANGELOG_REPL_CTXT = 13, /**< changelog access on clients */
	LLOG_CHANGELOG_USER_ORIG_CTXT = 14, /**< for multiple changelog consumers */
	LLOG_AGENT_ORIG_CTXT = 15, /**< agent requests generation on cdt */
	LLOG_MAX_CTXTS
};

There should be a big comment explaining this and referencing this LU ticket. The LASSERT() should be avoided by having llog_origin_handle_open() check that the value of body->lgd_ctxt_idx is valid before calling llog_get_context(), and return -EPROTO if it is bad. |
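A rough sketch of the two pieces of the proposal above: pinning the on-wire values at compile time, and rejecting a bad index from the network instead of asserting on it. The enum values are copied from the comment; struct demo_ctxt, demo_table, and demo_ctxt_lookup() are hypothetical stand-ins and not the real Lustre structures or the landed patch. Returning -EPROTO for an unassigned slot as well as an out-of-range one is a simplification made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum llog_ctxt_id {
	LLOG_CONFIG_ORIG_CTXT = 0,
	LLOG_CONFIG_REPL_CTXT = 1,
	LLOG_MDS_OST_ORIG_CTXT = 2,
	LLOG_MDS_OST_REPL_CTXT = 3,
	LLOG_SIZE_ORIG_CTXT = 4,
	LLOG_SIZE_REPL_CTXT = 5,
	LLOG_TEST_ORIG_CTXT = 8,
	LLOG_TEST_REPL_CTXT = 9,
	LLOG_CHANGELOG_ORIG_CTXT = 12,
	LLOG_CHANGELOG_REPL_CTXT = 13,
	LLOG_CHANGELOG_USER_ORIG_CTXT = 14,
	LLOG_AGENT_ORIG_CTXT = 15,
	LLOG_MAX_CTXTS
};

/* Compile-time checks in the spirit of wirecheck/wiretest: the build
 * breaks if someone renumbers a value that travels over the network. */
_Static_assert(LLOG_CONFIG_REPL_CTXT == 1, "wire value changed");
_Static_assert(LLOG_CHANGELOG_REPL_CTXT == 13, "wire value changed");
_Static_assert(LLOG_AGENT_ORIG_CTXT == 15, "wire value changed");
_Static_assert(LLOG_MAX_CTXTS == 16, "context table size changed");

/* Hypothetical stand-in for a per-device llog context. */
struct demo_ctxt {
	int id;
};

/* Hypothetical context table; unassigned IDs (6, 7, 10, 11) stay NULL. */
struct demo_ctxt *demo_table[LLOG_MAX_CTXTS];

/* Validate an index taken from a request body before using it.
 * Returns 0 and sets *out on success, -EPROTO on a bad index. */
int demo_ctxt_lookup(unsigned int wire_idx, struct demo_ctxt **out)
{
	if (wire_idx >= LLOG_MAX_CTXTS)
		return -EPROTO;	/* out of range: never index the table */
	if (demo_table[wire_idx] == NULL)
		return -EPROTO;	/* hole in the numbering or not set up */
	*out = demo_table[wire_idx];
	return 0;
}
```

Taking the index as unsigned means a negative value from a misbehaving peer is caught by the same upper-bound comparison, so the handler can return an error to the client rather than hit an LBUG.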
| Comment by Andreas Dilger [ 20/Jun/14 ] |
|
Moved this to be a blocker because without this patch HSM interoperability between the MDS and archive clients will be broken. |
| Comment by Oleg Drokin [ 22/Jun/14 ] |
|
In addition to the proposed patch - we need another patch to fix the assertion on network data, I think. |
| Comment by Mikhail Pershin [ 24/Jun/14 ] |
|
Oleg, what assertion do you mean? |
| Comment by Oleg Drokin [ 25/Jun/14 ] |
|
The assertion that was reported initially here, about the index being out of range. |
| Comment by Jodi Levi (Inactive) [ 25/Jun/14 ] |
|
Patch landed to Master. |
| Comment by Andreas Dilger [ 25/Jun/14 ] |
|
The original LASSERT() cannot be hit with the patch that was landed. The code checks that the requested context is valid before calling down into the code. |