[LU-5218] Interop 2.5.1<->2.6 failure on test suite lustre-rsync-test test_1: ASSERTION( index >= 0 && index < LLOG_MAX_CTXTS ) failed
Created: 18/Jun/14  Updated: 25/Jun/14  Resolved: 25/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Mikhail Pershin
Resolution: Fixed
Votes: 0
Labels: HB
Environment:

server: lustre-master build # 2901
client: 2.5.1


Issue Links:
Duplicate
duplicates LU-5257 Rolling upgrade from 2.4 to master fa... Closed
Related
is related to LU-5230 Interop 2.5.1<->2.6 failure on test s... Resolved
Severity: 3
Rank (Obsolete): 14552

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/c22bdcb4-f4ba-11e3-ae09-52540035b04c.

The sub-test test_1 failed with the following error:

test failed to respond and timed out

MDS console

19:05:56:Lustre: DEBUG MARKER: == lustre-rsync-test test 1: Simple Replication == 19:05:44 (1402797944)
19:05:56:Lustre: DEBUG MARKER: lctl --device lustre-MDT0000 changelog_register -n
19:05:56:Lustre: lustre-MDD0000: changelog on
19:05:56:Lustre: DEBUG MARKER: lctl get_param -n mdd.lustre-MDT0000.changelog_users
19:05:56:Lustre: DEBUG MARKER: dumpe2fs -h /dev/lvm-Role_MDS/P1 2>&1 |
19:05:56:		grep -E -q '(ea_inode|large_xattr)'
19:05:56:Lustre: DEBUG MARKER: dumpe2fs -h /dev/lvm-Role_MDS/P1 2>&1
19:05:56:Lustre: DEBUG MARKER: dumpe2fs -h /dev/lvm-Role_MDS/P1 2>&1 |
19:05:56:		grep -E -q '(ea_inode|large_xattr)'
19:05:56:LustreError: 8689:0:(lustre_log.h:440:llog_group_get_ctxt()) ASSERTION( index >= 0 && index < LLOG_MAX_CTXTS ) failed: 
19:05:56:LustreError: 8689:0:(lustre_log.h:440:llog_group_get_ctxt()) LBUG
19:05:56:Pid: 8689, comm: mdt00_000
19:05:56:
19:05:56:Call Trace:
19:05:56: [<ffffffffa048e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
19:05:56: [<ffffffffa048ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
19:05:56: [<ffffffffa0849868>] llog_origin_handle_open+0x668/0x670 [ptlrpc]
19:05:56: [<ffffffffa088db35>] tgt_llog_open+0x35/0xd0 [ptlrpc]
19:05:56: [<ffffffffa08942cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
19:05:56: [<ffffffffa0843d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
19:05:56: [<ffffffffa0843020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
19:05:56: [<ffffffff8109ab56>] kthread+0x96/0xa0
19:05:56: [<ffffffff8100c20a>] child_rip+0xa/0x20
19:05:56: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
19:05:56: [<ffffffff8100c200>] ? child_rip+0x0/0x20
19:05:56:
19:05:56:Kernel panic - not syncing: LBUG
19:05:56:Pid: 8689, comm: mdt00_000 Not tainted 2.6.32-431.17.1.el6_lustre.g8d5344f.x86_64 #1
19:05:56:Call Trace:
19:05:56: [<ffffffff8152795f>] ? panic+0xa7/0x16f
19:05:56: [<ffffffffa048eeeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
19:05:56: [<ffffffffa0849868>] ? llog_origin_handle_open+0x668/0x670 [ptlrpc]
19:05:56: [<ffffffffa088db35>] ? tgt_llog_open+0x35/0xd0 [ptlrpc]
19:05:56: [<ffffffffa08942cc>] ? tgt_request_handle+0x23c/0xac0 [ptlrpc]
19:05:56: [<ffffffffa0843d3a>] ? ptlrpc_main+0xd1a/0x1980 [ptlrpc]
19:05:56: [<ffffffffa0843020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
19:05:56: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
19:05:56: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
19:05:56: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
19:05:56: [<ffffffff8100c200>] ? child_rip+0x0/0x20
19:05:56:Initializing cgroup subsys cpuset
19:05:56:Initializing cgroup subsys cpu
19:05:56:Linux version 2.6.32-431.17.1.el6_lustre.g8d5344f.x86_64 (jenkins@builder-2-


 Comments   
Comment by Oleg Drokin [ 18/Jun/14 ]

With the new crashdump feature now available for Maloo runs, please also check that a crashdump is available for this failure, and use it to see what the index value was.

Comment by Mikhail Pershin [ 19/Jun/14 ]

This is related to commit 14d162c5438de959d0ea01fb1b40a7c5dfa764d1:

@@ -208,15 +204,9 @@ enum llog_ctxt_id {
        LLOG_CONFIG_ORIG_CTXT  =  0,
        LLOG_CONFIG_REPL_CTXT,
        LLOG_MDS_OST_ORIG_CTXT,
-       LLOG_MDS_OST_REPL_CTXT,
        LLOG_SIZE_ORIG_CTXT,
        LLOG_SIZE_REPL_CTXT,
-       LLOG_RD1_ORIG_CTXT,
-       LLOG_RD1_REPL_CTXT,
        LLOG_TEST_ORIG_CTXT,
-       LLOG_TEST_REPL_CTXT,
-       LLOG_LOVEA_ORIG_CTXT,
-       LLOG_LOVEA_REPL_CTXT,
        LLOG_CHANGELOG_ORIG_CTXT,       /**< changelog generation on mdd */
        LLOG_CHANGELOG_REPL_CTXT,       /**< changelog access on clients */
        LLOG_CHANGELOG_USER_ORIG_CTXT,  /**< for multiple changelog consumers */

Some entries were removed from the enum, so every llog context name after the first removed entry has had a new value since that commit, and those values are incompatible with all pre-commit Lustre versions. This is critical for changelog and HSM agents in old versions. I am not sure which fix would be better: assigning the old numbers to the remaining entries, or keeping two tables and using one or the other depending on the connected client version.
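The renumbering can be reproduced in isolation. The sketch below copies only the entries visible in the diff hunk above into two enums; the `OLD_`/`NEW_` prefixes and the helper `changelog_repl_shift()` are hypothetical names used for illustration, and entries outside the hunk are not modelled.

```c
#include <assert.h>

/* Pre-commit numbering: implicit values, entries as listed in the hunk. */
enum old_llog_ctxt_id {
        OLD_LLOG_CONFIG_ORIG_CTXT = 0,
        OLD_LLOG_CONFIG_REPL_CTXT,
        OLD_LLOG_MDS_OST_ORIG_CTXT,
        OLD_LLOG_MDS_OST_REPL_CTXT,
        OLD_LLOG_SIZE_ORIG_CTXT,
        OLD_LLOG_SIZE_REPL_CTXT,
        OLD_LLOG_RD1_ORIG_CTXT,
        OLD_LLOG_RD1_REPL_CTXT,
        OLD_LLOG_TEST_ORIG_CTXT,
        OLD_LLOG_TEST_REPL_CTXT,
        OLD_LLOG_LOVEA_ORIG_CTXT,
        OLD_LLOG_LOVEA_REPL_CTXT,
        OLD_LLOG_CHANGELOG_ORIG_CTXT,
        OLD_LLOG_CHANGELOG_REPL_CTXT,
        OLD_LLOG_CHANGELOG_USER_ORIG_CTXT,
};

/* Post-commit numbering: six entries removed, values still implicit. */
enum new_llog_ctxt_id {
        NEW_LLOG_CONFIG_ORIG_CTXT = 0,
        NEW_LLOG_CONFIG_REPL_CTXT,
        NEW_LLOG_MDS_OST_ORIG_CTXT,
        NEW_LLOG_SIZE_ORIG_CTXT,
        NEW_LLOG_SIZE_REPL_CTXT,
        NEW_LLOG_TEST_ORIG_CTXT,
        NEW_LLOG_CHANGELOG_ORIG_CTXT,
        NEW_LLOG_CHANGELOG_REPL_CTXT,
        NEW_LLOG_CHANGELOG_USER_ORIG_CTXT,
};

/* The same name maps to different wire values on either side of the commit. */
static_assert(OLD_LLOG_CHANGELOG_REPL_CTXT == 13, "pre-commit changelog access index");
static_assert(NEW_LLOG_CHANGELOG_REPL_CTXT == 7, "post-commit changelog access index");

/* Number of slots by which the changelog access context moved (hypothetical helper). */
int changelog_repl_shift(void)
{
        return OLD_LLOG_CHANGELOG_REPL_CTXT - NEW_LLOG_CHANGELOG_REPL_CTXT;
}
```

Under this numbering, a pre-commit client asking for changelog access sends 13 while a post-commit server expects 7, which is consistent with the out-of-range index reported by the assertion in the MDS console log.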

Comment by Mikhail Pershin [ 19/Jun/14 ]

The patch is here: http://review.whamcloud.com/10758

I've just reverted things back, added comments, and added a check for the context index coming from the wire.

Comment by Andreas Dilger [ 19/Jun/14 ]

What a gigantic mess. It surprises me that these llog_ctxt_id values have been in the network protocol since ancient times, but they are not declared in lustre_idl.h, and are not checked in wirecheck/wiretest. What is the point of calling llog_open(..., name = CHANGELOG_CATALOG, ...) if it is passing body->lgd_ctxt_idx = LLOG_CHANGELOG_REPL_CTXT and using that to open the log?

For the 2.6 release I think we don't have any option except to move enum llog_ctxt_id into lustre_idl.h and assign the specific (old) values to each of the remaining LLOG_.*_CTXT values. They should be added to wirecheck/wiretest on b2_5. It looks like in ancient times on HEAD these values were specifically assigned since commit d2d56f38, but the explicit assignments were removed by a patch from you & Shadow in commit c9842fdc (some time in 1.6 via https://bugzilla.lustre.org/show_bug.cgi?id=13821), which also removed the LLOG_MD_{ORIG,REPL}_CTXT values, so the resulting values changed. Even worse, the c9842fdc patch changed the values after LLOG_MD_ORIG_CTXT to be different on master and b1_8 (where enum llog_ctxt_id still has explicit values assigned). Fortunately, b1_8 does not have any HSM support, and we do not use LLOG_TEST_{ORIG,REPL}_CTXT over the network, so the fact that the assigned values at the high end are different does not impact anything.

I think the correct assignments today should be:

enum llog_ctxt_id {
        LLOG_CONFIG_ORIG_CTXT         =  0,
        LLOG_CONFIG_REPL_CTXT         =  1,
        LLOG_MDS_OST_ORIG_CTXT        =  2,
        LLOG_MDS_OST_REPL_CTXT        =  3,
        LLOG_SIZE_ORIG_CTXT           =  4,
        LLOG_SIZE_REPL_CTXT           =  5,
        LLOG_TEST_ORIG_CTXT           =  8,
        LLOG_TEST_REPL_CTXT           =  9,
        LLOG_CHANGELOG_ORIG_CTXT      = 12, /**< changelog generation on mdd */
        LLOG_CHANGELOG_REPL_CTXT      = 13, /**< changelog access on clients */
        LLOG_CHANGELOG_USER_ORIG_CTXT = 14, /**< for multiple changelog consumers */
        LLOG_AGENT_ORIG_CTXT          = 15, /**< agent requests generation on cdt */
        LLOG_MAX_CTXTS
};
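One way to keep assignments like these from drifting again is to pin them with compile-time checks, in the spirit of the wirecheck/wiretest coverage mentioned above. The following is only a sketch: it uses C11 `static_assert` as a stand-in and is not the actual Lustre wiretest mechanism, and the comment wording is an assumption.

```c
#include <assert.h>

/*
 * llog context indices are sent over the wire, so every value is
 * assigned explicitly and must never be renumbered (see LU-5218).
 * The gaps at 6-7 and 10-11 belong to contexts that were removed.
 */
enum llog_ctxt_id {
        LLOG_CONFIG_ORIG_CTXT         =  0,
        LLOG_CONFIG_REPL_CTXT         =  1,
        LLOG_MDS_OST_ORIG_CTXT        =  2,
        LLOG_MDS_OST_REPL_CTXT        =  3,
        LLOG_SIZE_ORIG_CTXT           =  4,
        LLOG_SIZE_REPL_CTXT           =  5,
        LLOG_TEST_ORIG_CTXT           =  8,
        LLOG_TEST_REPL_CTXT           =  9,
        LLOG_CHANGELOG_ORIG_CTXT      = 12,
        LLOG_CHANGELOG_REPL_CTXT      = 13,
        LLOG_CHANGELOG_USER_ORIG_CTXT = 14,
        LLOG_AGENT_ORIG_CTXT          = 15,
        LLOG_MAX_CTXTS
};

/* The build fails if anyone inserts, removes, or reorders an entry. */
static_assert(LLOG_CONFIG_ORIG_CTXT == 0, "wire value changed");
static_assert(LLOG_CONFIG_REPL_CTXT == 1, "wire value changed");
static_assert(LLOG_MDS_OST_ORIG_CTXT == 2, "wire value changed");
static_assert(LLOG_MDS_OST_REPL_CTXT == 3, "wire value changed");
static_assert(LLOG_SIZE_ORIG_CTXT == 4, "wire value changed");
static_assert(LLOG_SIZE_REPL_CTXT == 5, "wire value changed");
static_assert(LLOG_TEST_ORIG_CTXT == 8, "wire value changed");
static_assert(LLOG_TEST_REPL_CTXT == 9, "wire value changed");
static_assert(LLOG_CHANGELOG_ORIG_CTXT == 12, "wire value changed");
static_assert(LLOG_CHANGELOG_REPL_CTXT == 13, "wire value changed");
static_assert(LLOG_CHANGELOG_USER_ORIG_CTXT == 14, "wire value changed");
static_assert(LLOG_AGENT_ORIG_CTXT == 15, "wire value changed");
static_assert(LLOG_MAX_CTXTS == 16, "table size changed");
```

Because `LLOG_MAX_CTXTS` follows the entry assigned 15, it evaluates to 16, so the retired slots still count toward the size of any table indexed by these values.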

There should be a big comment explaining this and referencing this LU ticket. The LASSERT() should be avoided by having llog_origin_handle_open() check that the value of body->lgd_ctxt_idx is valid before calling llog_get_context(), and return -EPROTO if it is bad.
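The validation described here can be sketched outside the kernel as follows. This is a minimal illustration, not the landed patch: `struct sketch_ctxt`, `struct sketch_llogd_body`, `sketch_ctxt_table`, and `sketch_open_ctxt()` are hypothetical stand-ins, `SKETCH_MAX_CTXTS` assumes the proposed numbering, and returning -ENODEV for an empty slot is an assumption rather than something the ticket states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* LLOG_MAX_CTXTS under the numbering proposed above (assumption). */
#define SKETCH_MAX_CTXTS 16

/* Hypothetical stand-ins for the real context and request body types. */
struct sketch_ctxt {
        int idx;
};

struct sketch_llogd_body {
        uint32_t lgd_ctxt_idx;  /* context index as received from the wire */
};

/* Registered contexts; retired or unused slots stay NULL. */
static struct sketch_ctxt *sketch_ctxt_table[SKETCH_MAX_CTXTS];

/* Look up the context named in a request, rejecting bad wire data. */
int sketch_open_ctxt(const struct sketch_llogd_body *body,
                     struct sketch_ctxt **out)
{
        /* Caller bugs are local invariants, so asserting on them is fine. */
        assert(body != NULL && out != NULL);

        /* The index comes from the network: reject it, never assert on it. */
        if (body->lgd_ctxt_idx >= SKETCH_MAX_CTXTS)
                return -EPROTO;

        /* In range but not registered; this error code is an assumption. */
        if (sketch_ctxt_table[body->lgd_ctxt_idx] == NULL)
                return -ENODEV;

        *out = sketch_ctxt_table[body->lgd_ctxt_idx];
        return 0;
}
```

The design point is the split between the two checks: an assertion is reserved for conditions only local code can violate, while anything a remote peer controls gets an error return, so a mismatched client produces a failed request instead of a server panic.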

Comment by Andreas Dilger [ 20/Jun/14 ]

Moved this to be a blocker because without this patch HSM interoperability between the MDS and archive clients will be broken.

Comment by Oleg Drokin [ 22/Jun/14 ]

In addition to the proposed patch, I think we need another patch to fix the assertion on network data.

Comment by Mikhail Pershin [ 24/Jun/14 ]

Oleg, what assertion do you mean?

Comment by Oleg Drokin [ 25/Jun/14 ]

The assertion that was reported initially here, about the index being out of range.

Comment by Jodi Levi (Inactive) [ 25/Jun/14 ]

Patch landed to Master.

Comment by Andreas Dilger [ 25/Jun/14 ]

The original LASSERT() cannot be hit with the patch that was landed. The code checks that the requested context is valid before calling down into the code.
