[LU-1644] lustre b2_2<->master failure on lustre-initialization-1: ASSERTION( entry->mne_length <= ((1UL) << 12) ) Created: 18/Jul/12 Updated: 19/Dec/18 Resolved: 13/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0, Lustre 2.12.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server: lustre-b2_2 |
||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 4451 | ||||||||||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/e20033fc-d075-11e1-9002-52540035b04c. The sub-test lustre-initialization_1 failed with the following error:
This is the console log from client-1:
16:26:00:LustreError: 4164:0:(mgc_request.c:1297:mgc_apply_recover_logs()) ASSERTION( entry->mne_length <= ((1UL) << 12) ) failed:
16:26:00:LustreError: 4164:0:(mgc_request.c:1297:mgc_apply_recover_logs()) LBUG
16:26:00:Pid: 4164, comm: mount.lustre
16:26:00:
16:26:00:Call Trace:
16:26:00: [<ffffffffa0428905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
16:26:00: [<ffffffffa0428f17>] lbug_with_loc+0x47/0xb0 [libcfs]
16:26:00: [<ffffffffa053e438>] mgc_apply_recover_logs+0x13e8/0x17e0 [mgc]
16:26:00: [<ffffffffa0780426>] ? __req_capsule_get+0x176/0x750 [ptlrpc]
16:26:00: [<ffffffffa0429bae>] ? cfs_free+0xe/0x10 [libcfs]
16:26:00: [<ffffffffa0757cc0>] ? lustre_swab_mgs_config_res+0x0/0x20 [ptlrpc]
16:26:01: [<ffffffffa05413b4>] mgc_process_log+0xe54/0x12f0 [mgc]
16:26:01: [<ffffffffa053a980>] ? mgc_blocking_ast+0x0/0x680 [mgc]
16:26:01: [<ffffffffa0732380>] ? ldlm_completion_ast+0x0/0x730 [ptlrpc]
16:26:01: [<ffffffffa0542d76>] mgc_process_config+0x5c6/0xee0 [mgc]
16:26:01: [<ffffffffa05e80ec>] lustre_process_log+0x25c/0xad0 [obdclass]
16:26:02: [<ffffffff8127f332>] ? __percpu_counter_init+0x62/0x70
16:26:02: [<ffffffffa0a897e0>] ll_fill_super+0xa70/0x1490 [lustre]
16:26:02: [<ffffffffa05f342d>] lustre_fill_super+0x11d/0xfd0 [obdclass]
16:26:02: [<ffffffffa05f3310>] ? lustre_fill_super+0x0/0xfd0 [obdclass]
16:26:02: [<ffffffff8117989f>] get_sb_nodev+0x5f/0xa0
16:26:02: [<ffffffffa05e2cf5>] lustre_get_sb+0x25/0x30 [obdclass]
16:26:02: [<ffffffff811794fb>] vfs_kern_mount+0x7b/0x1b0
16:26:02: [<ffffffff811796a2>] do_kern_mount+0x52/0x130
16:26:02: [<ffffffff81197ce2>] do_mount+0x2d2/0x8d0
16:26:02: [<ffffffff81198370>] sys_mount+0x90/0xe0
16:26:02: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
16:26:02:
16:26:02:Kernel panic - not syncing: LBUG
16:26:02:Pid: 4164, comm: mount.lustre Not tainted 2.6.32-220.17.1.el6.x86_64 #1 |
| Comments |
| Comment by Sarah Liu [ 18/Jul/12 ] |
|
Both master vs b2_2 and b2_2 vs master hit this issue: https://maloo.whamcloud.com/test_sessions/611d3f18-d058-11e1-9002-52540035b04c |
| Comment by Jodi Levi (Inactive) [ 19/Jul/12 ] |
|
Jinshan, |
| Comment by Peter Jones [ 20/Jul/12 ] |
|
Jinshan will look into this one |
| Comment by Jinshan Xiong (Inactive) [ 20/Jul/12 ] |
|
We need to land commit 35a8ed2b2007d89c1f125f01f155232e7f511e98 to 2.2 to fix this problem. |
| Comment by James A Simmons [ 20/Jul/12 ] |
|
What do you know. I have the patch for |
| Comment by Jinshan Xiong (Inactive) [ 20/Jul/12 ] |
|
Cool, thanks James. |
| Comment by Peter Jones [ 20/Jul/12 ] |
|
James Thanks! Peter |
| Comment by Jodi Levi (Inactive) [ 26/Jul/12 ] |
|
Jinshan, |
| Comment by Jinshan Xiong (Inactive) [ 27/Jul/12 ] |
|
Hi Jodi, master already has this patch. The problem is that 2.2 was released while we were still working on it. Once we commit this patch to 2.2, we're done. Jinshan |
| Comment by Andreas Dilger [ 06/Aug/12 ] |
|
Apparently this patch will crash the 2.2 client, so we cannot retroactively fix that release, and the chance of making a 2.2.1 release is small. We need to have some mechanism to detect if the client is handling this swabbing correctly. The best method is to use an OBD_CONNECT flag being sent from the 2.3+ clients and checked by the 2.3+ servers to decide how the swabbing needs to be done. I'm reluctant to use a separate flag for just fixing this rare bug. Oleg's suggestion is to re-use an existing OBD_CONNECT flag that is not currently being used for the MGS, which can be deprecated easily in the future. I would suggest OBD_CONNECT_GRANT, which is a flag that we can also soon deprecate for 2.x clients as well. Something like:
/* overload OBD_CONNECT_GRANT to fix rare 2.2/2.3 problem with mixed-endian
 * interop swabbing for IR mne_length field. This can be removed in the
 * future when we don't expect 2.2 clients running with 2.3+ servers.
 * See LU-1644 for details */
#define OBD_CONNECT_MNE_SWAB OBD_CONNECT_GRANT |
| Comment by Jinshan Xiong (Inactive) [ 06/Aug/12 ] |
|
A patch has been pushed to http://review.whamcloud.com/3548 to fix the compatibility problem between 2.2 clients and 2.3+ servers. |
| Comment by Peter Jones [ 21/Aug/12 ] |
|
Landed for 2.3 |
| Comment by Sarah Liu [ 04/Sep/12 ] |
|
Still hit this error in 2.2 client<->2.3-tag2.2.94 interop testing |
| Comment by Jinshan Xiong (Inactive) [ 05/Sep/12 ] |
|
I can't even mount 2.2 clients to master servers, with this error:
[root@client-17 ~]# uname -r
2.6.32-220.4.2.el6_lustre.g45b2fe8.x86_64
[root@client-17 ~]# rpm -qa |grep lustre
lustre-2.2.0-2.6.32_220.4.2.el6_lustre.g45b2fe8.x86_64_g25a1427.x86_64
kernel-2.6.32-220.4.2.el6_lustre.g45b2fe8.x86_64
lustre-modules-2.2.0-2.6.32_220.4.2.el6_lustre.g45b2fe8.x86_64_g25a1427.x86_64
lustre-ldiskfs-3.3.0-2.6.32_220.4.2.el6_lustre.g45b2fe8.x86_64_g25a1427.x86_64
[root@client-17 ~]# mount -t lustre client-18@tcp:/lustre /mnt/lustre
mount.lustre: mount client-18@tcp:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
[root@client-17 ~]# dmesg
Lustre: MGC10.10.4.18@tcp: Reactivating import
Lustre: 6833:0:(obd_config.c:1002:class_process_config()) Ignoring unknown param jobid_var=procname_uid
LustreError: 6833:0:(obd_config.c:1362:class_config_llog_handler()) Err -22 on cfg command:
Lustre: cmd=cf00f 0:(null) 1:sys.jobid_var=procname_uid 2:procname_uid
LustreError: 15b-f: MGC10.10.4.18@tcp: The configuration from log 'lustre-client'failed from the MGS (-22). Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 15c-8: MGC10.10.4.18@tcp: The configuration from log 'lustre-client' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 6821:0:(llite_lib.c:978:ll_fill_super()) Unable to process log: -22
LustreError: 6736:0:(lov_obd.c:928:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
LustreError: 6736:0:(lov_obd.c:928:lov_cleanup()) Skipped 3 previous similar messages
LustreError: 6821:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 6821:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff880329563000 umount complete
LustreError: 6821:0:(obd_mount.c:2349:lustre_fill_super()) Unable to mount (-22)
Any secret to do that successfully? |
| Comment by Jinshan Xiong (Inactive) [ 06/Sep/12 ] |
|
Patch is at: http://review.whamcloud.com/3897 |
| Comment by Peter Jones [ 13/Sep/12 ] |
|
Landed for 2.3 and 2.4 |
| Comment by Gerrit Updater [ 19/Apr/18 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/32087 |
| Comment by Gerrit Updater [ 19/Apr/18 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/32088 |
| Comment by Gerrit Updater [ 06/May/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32087/ |
| Comment by Gerrit Updater [ 12/May/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32088/ |