[LU-1644] lustre b2_2<->master failure on lustre-initialization-1: ASSERTION( entry->mne_length <= ((1UL) << 12) ) Created: 18/Jul/12  Updated: 19/Dec/18  Resolved: 13/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0, Lustre 2.12.0

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Environment:

server: lustre-b2_2
client: master


Issue Links:
Duplicate
is duplicated by LU-1701 CLONE - mgc_apply_recover_logs() ASSE... Resolved
Related
is related to LU-5329 Remove obsollete nidtbl swabbing code Resolved
Severity: 3
Rank (Obsolete): 4451

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/e20033fc-d075-11e1-9002-52540035b04c.

The sub-test lustre-initialization_1 failed with the following error:

Test system failed to start single suite, so abandoning all hope and giving up

This is the console log from client-1

16:26:00:LustreError: 4164:0:(mgc_request.c:1297:mgc_apply_recover_logs()) ASSERTION( entry->mne_length <= ((1UL) << 12) ) failed: 
16:26:00:LustreError: 4164:0:(mgc_request.c:1297:mgc_apply_recover_logs()) LBUG
16:26:00:Pid: 4164, comm: mount.lustre
16:26:00:
16:26:00:Call Trace:
16:26:00: [<ffffffffa0428905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
16:26:00: [<ffffffffa0428f17>] lbug_with_loc+0x47/0xb0 [libcfs]
16:26:00: [<ffffffffa053e438>] mgc_apply_recover_logs+0x13e8/0x17e0 [mgc]
16:26:00: [<ffffffffa0780426>] ? __req_capsule_get+0x176/0x750 [ptlrpc]
16:26:00: [<ffffffffa0429bae>] ? cfs_free+0xe/0x10 [libcfs]
16:26:00: [<ffffffffa0757cc0>] ? lustre_swab_mgs_config_res+0x0/0x20 [ptlrpc]
16:26:01: [<ffffffffa05413b4>] mgc_process_log+0xe54/0x12f0 [mgc]
16:26:01: [<ffffffffa053a980>] ? mgc_blocking_ast+0x0/0x680 [mgc]
16:26:01: [<ffffffffa0732380>] ? ldlm_completion_ast+0x0/0x730 [ptlrpc]
16:26:01: [<ffffffffa0542d76>] mgc_process_config+0x5c6/0xee0 [mgc]
16:26:01: [<ffffffffa05e80ec>] lustre_process_log+0x25c/0xad0 [obdclass]
16:26:02: [<ffffffff8127f332>] ? __percpu_counter_init+0x62/0x70
16:26:02: [<ffffffffa0a897e0>] ll_fill_super+0xa70/0x1490 [lustre]
16:26:02: [<ffffffffa05f342d>] lustre_fill_super+0x11d/0xfd0 [obdclass]
16:26:02: [<ffffffffa05f3310>] ? lustre_fill_super+0x0/0xfd0 [obdclass]
16:26:02: [<ffffffff8117989f>] get_sb_nodev+0x5f/0xa0
16:26:02: [<ffffffffa05e2cf5>] lustre_get_sb+0x25/0x30 [obdclass]
16:26:02: [<ffffffff811794fb>] vfs_kern_mount+0x7b/0x1b0
16:26:02: [<ffffffff811796a2>] do_kern_mount+0x52/0x130
16:26:02: [<ffffffff81197ce2>] do_mount+0x2d2/0x8d0
16:26:02: [<ffffffff81198370>] sys_mount+0x90/0xe0
16:26:02: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
16:26:02:
16:26:02:Kernel panic - not syncing: LBUG
16:26:02:Pid: 4164, comm: mount.lustre Not tainted 2.6.32-220.17.1.el6.x86_64 #1
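
For orientation, the failed check bounds `mne_length` by `(1UL) << 12`, that is, 4096. The sketch below is a minimal illustration and not Lustre code: the struct `ex_entry`, the 32-bit width of its `length` field, and the helpers `ex_swab32`, `ex_length_ok`, and `ex_require_length` are all assumptions made for this example. It shows how a small length that has been byte-swapped one time too many (or too few) lands far above such a bound, which is the kind of failure the comments on this ticket attribute to swabbing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bound as in the assertion text: (1UL) << 12 == 4096. */
#define EX_MAX_LENGTH (1UL << 12)

/* Hypothetical stand-in for a record that carries a length field. */
struct ex_entry {
	uint32_t length;
};

/* Reverse the byte order of a 32-bit value. */
static uint32_t ex_swab32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) |
	       ((v & 0x0000ff00u) << 8)  |
	       ((v & 0x00ff0000u) >> 8)  |
	       ((v & 0xff000000u) >> 24);
}

/* Non-fatal form of the bound check. */
static bool ex_length_ok(const struct ex_entry *e)
{
	return e->length <= EX_MAX_LENGTH;
}

/* Fail-fast form, loosely analogous to an assertion that ends in LBUG. */
static void ex_require_length(const struct ex_entry *e)
{
	assert(ex_length_ok(e));
}
```

With these definitions, a length of 64 (0x00000040) passes `ex_length_ok`, while its byte-swapped form 0x40000000 does not, so `ex_require_length` would abort on it.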


 Comments   
Comment by Sarah Liu [ 18/Jul/12 ]

Both master vs b2_2 and b2_2 vs master hit this issue:

https://maloo.whamcloud.com/test_sessions/611d3f18-d058-11e1-9002-52540035b04c
https://maloo.whamcloud.com/test_sessions/dc6a13ea-d075-11e1-9002-52540035b04c

Comment by Jodi Levi (Inactive) [ 19/Jul/12 ]

Jinshan,
Would you be able to take a look at this one and assign to yourself if it's appropriate for you to fix?
Thank you!

Comment by Peter Jones [ 20/Jul/12 ]

Jinshan will look into this one

Comment by Jinshan Xiong (Inactive) [ 20/Jul/12 ]

We need to land commit 35a8ed2b2007d89c1f125f01f155232e7f511e98 to 2.2 to fix this problem.

Comment by James A Simmons [ 20/Jul/12 ]

What do you know. I have the patch for LU-1252 for b2_2 already available at http://review.whamcloud.com/#change,3008

Comment by Jinshan Xiong (Inactive) [ 20/Jul/12 ]

Cool, thanks James.

Comment by Peter Jones [ 20/Jul/12 ]

James

Thanks!

Peter

Comment by Jodi Levi (Inactive) [ 26/Jul/12 ]

Jinshan,
Since we do have 2.2 servers, the 2.3 client should not crash. So we should create a patch on master to avoid the crash. Can you please look into this?
Thank you!

Comment by Jinshan Xiong (Inactive) [ 27/Jul/12 ]

Hi Jodi,

Master already has this patch. The problem is that 2.2 was released while we were still working on it. Once we commit this patch to 2.2, we are done.

Jinshan

Comment by Andreas Dilger [ 06/Aug/12 ]

Apparently this patch will crash the 2.2 client, so we cannot retroactively fix that release, and the chance of making a 2.2.1 release is small. We need some mechanism to detect whether the client is handling this swabbing correctly. The best method is to use an OBD_CONNECT flag sent by the 2.3+ clients and checked by the 2.3+ servers to decide how the swabbing needs to be done.

I'm reluctant to use a separate flag just to fix this rare bug. Oleg's suggestion is to re-use an existing OBD_CONNECT flag that is not currently being used for the MGS and that can be deprecated easily in the future. I would suggest OBD_CONNECT_GRANT, a flag that we can also deprecate soon for 2.x clients. Something like:

/* overload OBD_CONNECT_GRANT to fix rare 2.2/2.3 problem with mixed-endian
 * interop swabbing for IR mne_length field.   This can be removed in the
 * future when we don't expect 2.2 clients running with 2.3+ servers.
 * See LU-1644 for details */
#define OBD_CONNECT_MNE_SWAB OBD_CONNECT_GRANT
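
A minimal sketch of the idea above, under stated assumptions: the peer advertises a connect flag, and the receiver uses it to decide whether the length field still needs a byte swap. The names `EX_CONNECT_GRANT`, `EX_CONNECT_MNE_SWAB`, `ex_need_swab`, and `ex_fixup_entry`, the flag value, the 32-bit field width, and the rule "a peer without the flag has applied one swap too many" are illustrative only. This is not the patch that landed for this ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder flag value for illustration; not taken from Lustre headers. */
#define EX_CONNECT_GRANT    0x8ULL
/* Overload the existing flag, as suggested in the comment above. */
#define EX_CONNECT_MNE_SWAB EX_CONNECT_GRANT

/* Hypothetical record with a length field (width is an assumption). */
struct ex_entry {
	uint32_t length;
};

/* Reverse the byte order of a 32-bit value. */
static uint32_t ex_swab32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) |
	       ((v & 0x0000ff00u) << 8)  |
	       ((v & 0x00ff0000u) >> 8)  |
	       ((v & 0xff000000u) >> 24);
}

/*
 * Decide whether the receiver should swap the record.
 * msg_needs_swab: the message arrived in the other byte order.
 * peer_flags:     connect flags advertised by the peer.
 * Assumption: a peer lacking EX_CONNECT_MNE_SWAB applied one extra swap,
 * so the normal decision is inverted for it.
 */
static bool ex_need_swab(bool msg_needs_swab, uint64_t peer_flags)
{
	bool swab = msg_needs_swab;

	if (!(peer_flags & EX_CONNECT_MNE_SWAB))
		swab = !swab;
	return swab;
}

/* Bring the length field into host byte order in place. */
static void ex_fixup_entry(struct ex_entry *e, bool msg_needs_swab,
			   uint64_t peer_flags)
{
	assert(e != NULL);
	if (ex_need_swab(msg_needs_swab, peer_flags))
		e->length = ex_swab32(e->length);
}
```

In this model, a same-endian peer that sets the flag leaves the field untouched, while a same-endian peer without the flag has its extra swap undone by the receiver. Reusing an existing flag bit keeps the workaround easy to retire once the old peers are no longer expected.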

Comment by Jinshan Xiong (Inactive) [ 06/Aug/12 ]

A patch has been pushed to http://review.whamcloud.com/3548 to fix the compatibility problem between 2.2 clients and 2.3+ servers.

Comment by Peter Jones [ 21/Aug/12 ]

Landed for 2.3

Comment by Sarah Liu [ 04/Sep/12 ]

Still hit this error in 2.2 client<->2.3-tag2.2.94 interop testing:
https://maloo.whamcloud.com/test_sessions/1f9fde02-f42e-11e1-b3b2-52540035b04c

Comment by Jinshan Xiong (Inactive) [ 05/Sep/12 ]

I can't even mount a 2.2 client against master servers; it fails with this error:

[root@client-17 ~]# uname -r
2.6.32-220.4.2.el6_lustre.g45b2fe8.x86_64
[root@client-17 ~]# rpm -qa |grep lustre
lustre-2.2.0-2.6.32_220.4.2.el6_lustre.g45b2fe8.x86_64_g25a1427.x86_64
kernel-2.6.32-220.4.2.el6_lustre.g45b2fe8.x86_64
lustre-modules-2.2.0-2.6.32_220.4.2.el6_lustre.g45b2fe8.x86_64_g25a1427.x86_64
lustre-ldiskfs-3.3.0-2.6.32_220.4.2.el6_lustre.g45b2fe8.x86_64_g25a1427.x86_64
[root@client-17 ~]# mount -t lustre client-18@tcp:/lustre /mnt/lustre
mount.lustre: mount client-18@tcp:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
[root@client-17 ~]# dmesg
Lustre: MGC10.10.4.18@tcp: Reactivating import
Lustre: 6833:0:(obd_config.c:1002:class_process_config()) Ignoring unknown param jobid_var=procname_uid
LustreError: 6833:0:(obd_config.c:1362:class_config_llog_handler()) Err -22 on cfg command:
Lustre:    cmd=cf00f 0:(null)  1:sys.jobid_var=procname_uid  2:procname_uid  
LustreError: 15b-f: MGC10.10.4.18@tcp: The configuration from log 'lustre-client'failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 15c-8: MGC10.10.4.18@tcp: The configuration from log 'lustre-client' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 6821:0:(llite_lib.c:978:ll_fill_super()) Unable to process log: -22
LustreError: 6736:0:(lov_obd.c:928:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
LustreError: 6736:0:(lov_obd.c:928:lov_cleanup()) Skipped 3 previous similar messages
LustreError: 6821:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 6821:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff880329563000 umount complete
LustreError: 6821:0:(obd_mount.c:2349:lustre_fill_super()) Unable to mount  (-22)

Is there any secret to doing that successfully?
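
The negative numbers in the log above are kernel-style return codes, that is, negated errno values. The helper below is a small illustration and not part of Lustre; `ex_rc_name` is a made-up name, and it only covers the two codes that appear in this log. On Linux, 22 is `EINVAL` (matching the "Invalid argument" reported by mount.lustre) and 108 is `ESHUTDOWN`.

```c
#include <assert.h>
#include <errno.h>

/* Map a kernel-style return code (0 or negative errno) to a symbolic name. */
static const char *ex_rc_name(int rc)
{
	assert(rc <= 0);	/* kernel-style codes are zero or negative */

	switch (-rc) {
	case 0:
		return "OK";
	case EINVAL:
		return "EINVAL";
#ifdef ESHUTDOWN
	case ESHUTDOWN:
		return "ESHUTDOWN";
#endif
	default:
		return "unknown";
	}
}
```

Calling `ex_rc_name(-22)` yields "EINVAL". The numeric value of `ESHUTDOWN` is platform-specific, so the sketch relies on the `<errno.h>` macro instead of hard-coding 108.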

Comment by Jinshan Xiong (Inactive) [ 06/Sep/12 ]

Patch is at: http://review.whamcloud.com/3897

Comment by Peter Jones [ 13/Sep/12 ]

Landed for 2.3 and 2.4

Comment by Gerrit Updater [ 19/Apr/18 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/32087
Subject: LU-1644 mgc: remove obsolete IR swabbing workaround
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ee1e23c0b535f5726008b0de015cea813815d6de

Comment by Gerrit Updater [ 19/Apr/18 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/32088
Subject: LU-1644 ptlrpc: fix return type of boolean functions
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 764161ac692e9ff99ba48d39032a1adb0ff5351e

Comment by Gerrit Updater [ 06/May/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32087/
Subject: LU-1644 mgc: remove obsolete IR swabbing workaround
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a0c644fde3405bba6752885481f0fdfe05da1bcd

Comment by Gerrit Updater [ 12/May/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32088/
Subject: LU-1644 ptlrpc: fix return type of boolean functions
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e2cac9fb9baf43f48eb58334fb5044cece5395c0
