[LU-1384] MDS Kernel Panic when trying to mount the lustre file system Created: 08/May/12  Updated: 01/Jun/12  Resolved: 01/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.3.0
Fix Version/s: Lustre 2.3.0

Type: Bug Priority: Critical
Reporter: Fabio Verzelloni Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Linux 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64 #1 SMP Tue Apr 24 14:27:35 PDT 2012 x86_64 x86_64 x86_64 GNU/Linux
14 Servers Total
1 MDS + 1 Fail Over ( 300 GB ) AMD Opteron(tm) Processor 6128
12 OSS ( with fail over per each couple ) Sandy Bridge
6 OST per OSS ( 7 TB )


Attachments: error5.png (PNG), kernel_panic.png (PNG), kernel_panic2.png (PNG), kernel_panic3.png (PNG), lfs_check_servers.log, messages1, weisshorn_mkfs.sh
Severity: 3
Rank (Obsolete): 4605

 Description   

After the mkfs of the whole file system I was able to mount it and run a simple 'dd' to create a few files. Once I mounted it on 12 clients running Lustre 1.8.4 and tried to run an IOR benchmark, using 2 nodes for a total of 12 cores, the file system immediately hung and MDS01 had a kernel panic, as follows:
Message from syslogd@mds01 at May 8 12:00:59 ...
kernel:LustreError: 3523:0:(mdd_object.c:635:mdd_big_lmm_get()) ASSERTION( ma->ma_lmm_size > 0 ) failed:

Message from syslogd@mds01 at May 8 12:00:59 ...
kernel:LustreError: 3523:0:(mdd_object.c:635:mdd_big_lmm_get()) LBUG
Write failed: Broken pipe

Heartbeat tried to fail over to the second MDS, but it immediately had a kernel panic too:

Message from syslogd@mds02 at May 8 12:04:05 ...
kernel:LustreError: 3657:0:(mdd_object.c:635:mdd_big_lmm_get()) ASSERTION( ma->ma_lmm_size > 0 ) failed:

Message from syslogd@mds02 at May 8 12:04:05 ...
kernel:LustreError: 3657:0:(mdd_object.c:635:mdd_big_lmm_get()) LBUG
Write failed: Broken pipe

I created the file system as shown in the attached file weisshorn_mkfs.sh.

The SSD LUN is built on an LSI SSD controller with RAID10.

Are there any suggestions or inputs that I can try to fix the problem?
The /var/log/messages file with the kernel messages is also attached.



 Comments   
Comment by Fabio Verzelloni [ 08/May/12 ]

That is the moment of the kernel panic, which happened as soon as I mounted the Lustre FS on the client with 1.8.4.

Comment by Fabio Verzelloni [ 08/May/12 ]

The versions of Lustre on the client side that are killing the MDS are:

lustre-modules-1.8.4-2.6.32.36_0.5_default_201202291115
lustre-client-source-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
lustre-1.8.4-2.6.32.36_0.5_default_201202291115

cray-liblustreconfig0-1.0-1.0400.30000.6.18.gem
cray-lustre-utils-2.3-1.0400.29861.8.1.gem
cray-lustre-cray_gem_s-1.8.4_2.6.32.45_0.3.2_1.0400.6221.1.1-1.0400.30252.1.29
cray-lustre-cray_gem_s-1.8.4_2.6.32.45_0.3.2_1.0400.6221.1.1-1.0400.31443.0.0

Comment by Peter Jones [ 10/May/12 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 10/May/12 ]

As a starting point, the client should never be able to crash the MDS. The MDS code needs to be updated to validate the incoming data and return an error if it is wrong.

A separate case is that the 1.8.4 client will not work correctly with a 2.x server without several patches being applied.

Comment by Peter Jones [ 24/May/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 25/May/12 ]

patch tracking at http://review.whamcloud.com/2905

LU-1384 mdd: validate incoming param to avoid crashing

The MDS crashes when an unsupported 1.8.x client connects to it;
the crash point is

kernel:LustreError: 3657:0:(mdd_object.c:635:mdd_big_lmm_get())
ASSERTION( ma->ma_lmm_size > 0 ) failed

We need to validate the incoming @ma lest an old client crash the MDS.
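
The idea in the commit message, checking a request-derived parameter and returning an error instead of asserting on it, can be illustrated with a minimal sketch. This is not the code from the patch at review.whamcloud.com/2905: `struct ma_sketch` and `ma_sketch_validate()` are hypothetical names, the structure is a simplified stand-in that keeps only the `ma_lmm_size` field named in the assertion plus an assumed buffer pointer, and the choice of `-EINVAL` as the error code is an assumption.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for the attribute structure that the
 * failing assertion refers to as "ma". Only ma_lmm_size appears in the
 * ticket; the buffer pointer is an assumption made for illustration.
 */
struct ma_sketch {
    void *ma_lmm;       /* buffer for striping metadata (assumed) */
    int   ma_lmm_size;  /* buffer size, ultimately derived from the request */
};

/*
 * Validate parameters that depend on client input. Returning an error
 * code lets the caller fail the single request, whereas an assertion on
 * the same condition would take down the whole server.
 */
static int ma_sketch_validate(const struct ma_sketch *ma)
{
    if (ma == NULL)
        return -EINVAL;
    if (ma->ma_lmm == NULL || ma->ma_lmm_size <= 0)
        return -EINVAL;
    return 0;
}
```

A caller would run this check before touching the buffer and propagate a non-zero result back to the client as a request failure, which matches the expectation stated earlier in the thread that a client should never be able to crash the MDS.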

Comment by Peter Jones [ 01/Jun/12 ]

Landed for 2.3
