[LU-7798] ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed: Created: 19/Feb/16  Updated: 07/Oct/16  Resolved: 07/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Frank Heckes (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: soak
Environment:

lola
build: 2.8.50-6-gf9ca359; commit f9ca359284357d145819beb08b316e932f7a3060


Attachments: File console-lola-26.log.bz2     File console-lola-29.log.bz2     File lola-26-vmcore-dmesg.txt.bz2     File lola-29-vmcore-dmesg.txt.bz2     File messages-lola-26.log.bz2     File messages-lola-29.log.bz2    
Issue Links:
Duplicate
duplicates LU-7422 incorrect ENOENT handling in mdt_inte... Resolved
Related
is related to LU-7797 Can't mount zpools after OSS restart Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of build '20160218' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160218). DNE is enabled.
The MDTs were formatted using ldiskfs, the OSTs using ZFS.

Event history:

  • 2016-02-18 16:24:27,115:fsmgmt.fsmgmt:INFO reseting MDS node lola-10
  • 2016-02-18 16:34:04: Lustre client lola-26 crashed with an LBUG
  • 2016-02-18 16:34:06: Lustre client lola-29 crashed with an LBUG

The error message and stack trace are the same for both events:

<6>Lustre: soaked-MDT0004-mdc-ffff880852d1a000: Connection restored to 192.168.1.110@o2ib10 (at 192.168.1.110@o2ib10)
<0>LustreError: 23705:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
<0>LustreError: 23705:0:(llite_lib.c:2295:ll_prep_inode()) LBUG
<4>Pid: 23705, comm: pct
<4>
<4>Call Trace:
<4> [<ffffffffa050b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa050be77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0aa0192>] ll_prep_inode+0x752/0xc40 [lustre]
<4> [<ffffffffa07f6d60>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
<4> [<ffffffffa0ab42b2>] ll_new_node+0x682/0x7f0 [lustre]
<4> [<ffffffffa0ab6b04>] ll_mkdir+0x104/0x220 [lustre]
<4> [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
<4> [<ffffffff8119d759>] vfs_mkdir+0xd9/0x140
<4> [<ffffffff811a04e7>] sys_mkdirat+0xc7/0x1b0
<4> [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
<4> [<ffffffff811a05e8>] sys_mkdir+0x18/0x20
<4> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 23705, comm: pct Not tainted 2.6.32-504.30.3.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff815293fc>] ? panic+0xa7/0x16f
<4> [<ffffffffa050becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0aa0192>] ? ll_prep_inode+0x752/0xc40 [lustre]
<4> [<ffffffffa07f6d60>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
<4> [<ffffffffa0ab42b2>] ? ll_new_node+0x682/0x7f0 [lustre]
<4> [<ffffffffa0ab6b04>] ? ll_mkdir+0x104/0x220 [lustre]
<4> [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
<4> [<ffffffff8119d759>] ? vfs_mkdir+0xd9/0x140
<4> [<ffffffff811a04e7>] ? sys_mkdirat+0xc7/0x1b0
<4> [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
<4> [<ffffffff811a05e8>] ? sys_mkdir+0x18/0x20
<4> [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
  • No errors on the Lustre server nodes can be correlated with the events.

Console logs, messages and vmcore-dmesg.txt files from both clients are attached.
Crash files are available.
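For context: the assertion that fires checks the FID returned in the mdt_body of the MDT reply before the client builds the new inode from it. Very roughly, a FID is "sane" when its sequence and object-id fields describe a real object, so a reply whose body was never filled in (all-zero FID) fails the check. Below is a simplified, illustrative sketch of what the check amounts to; the real fid_is_sane() in lustre_fid.h also accepts IGIF/IDIF and reserved sequences, so this is only an approximation.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Field layout mirrors the real struct lu_fid. */
struct lu_fid {
	uint64_t f_seq;	/* sequence the object belongs to */
	uint32_t f_oid;	/* object id within that sequence */
	uint32_t f_ver;	/* version, normally 0 */
};

/* Simplified stand-in for fid_is_sane(): a reply body the server never
 * filled in leaves the FID all-zero, which is rejected here and is what
 * trips the ASSERTION in ll_prep_inode(). */
static inline bool fid_is_sane_sketch(const struct lu_fid *fid)
{
	return fid != NULL && fid->f_seq != 0 && fid->f_oid != 0 &&
	       fid->f_ver == 0;
}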



 Comments   
Comment by Frank Heckes (Inactive) [ 19/Feb/16 ]

Crash files have been saved to lhn.lola.hpdd.intel.com:/scratch/crashdumps/lu-7798/

Comment by Frank Heckes (Inactive) [ 19/Feb/16 ]

I forgot to mention that the OSS nodes were extended to operate in an active-active failover configuration for disk resources on Feb 17th, 2016. The failover partner node lola-4 can therefore see all of node lola-5's disks and also has its ZFS pools imported.
There's no start-up (boot) wrapper script that prevents the (primary) zpools of the other node from being imported.

Comment by Andreas Dilger [ 19/Feb/16 ]

The client should not crash when accessing bad data from the network. It is very likely that this is a result of bad data being sent from the server, but that is a separate issue. This bug should focus on adding proper error handling to ll_prep_inode() for this case instead of crashing.
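A minimal sketch of such error handling follows, assuming ll_prep_inode() keeps a cleanup label (here called out:) and an rc return value like the surrounding code; this is only an illustration of the shape of the change, not the patch that eventually landed.

	/* hypothetical hardening in ll_prep_inode(): reject the reply
	 * instead of asserting when the server sent an invalid FID */
	if (!fid_is_sane(&md.body->mbo_fid1)) {
		CERROR("invalid FID "DFID" in MDT reply\n",
		       PFID(&md.body->mbo_fid1));
		GOTO(out, rc = -EPROTO);
	}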

Comment by Cliff White (Inactive) [ 10/Mar/16 ]

This bug has re-occurred in 2.8.0-RC5 testing:

Lustre: 3790:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: 3779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1457583412/real 1457583412]  req@ffff880275d579c0 x1528367925364328/t0(0) o38->soaked-MDT0007-mdc-ffff88082f2f4800@192.168.1.111@o2ib10:12/10 lens 520/544 e 0 to 1 dl 1457583433 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3779:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
LustreError: 11-0: soaked-MDT0006-mdc-ffff88082f2f4800: operation mds_reint to node 192.168.1.110@o2ib10 failed: rc = -19
Lustre: soaked-MDT0007-mdc-ffff88082f2f4800: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
LustreError: 3779:0:(client.c:2874:ptlrpc_replay_interpret()) @@@ request replay timed out.
  req@ffff88046cbfb6c0 x1528367924440432/t412317941034(412317941034) o36->soaked-MDT0006-mdc-ffff88082f2f4800@192.168.1.111@o2ib10:12/10 lens 616/192 e 1 to 1 dl 1457583622 ref 2 fl Interpret:EX/4/ffffffff rc -110/-1
Lustre: soaked-MDT0006-mdc-ffff88082f2f4800: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
LustreError: 85659:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
LustreError: 85659:0:(llite_lib.c:2295:ll_prep_inode()) LBUG
Pid: 85659, comm: pct

Call Trace:
 [<ffffffffa045c875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa045ce77>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa09ff192>] ll_prep_inode+0x752/0xc40 [lustre]
 [<ffffffffa0747d60>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
 [<ffffffffa0a132b2>] ll_new_node+0x682/0x7f0 [lustre]
 [<ffffffffa0a15b04>] ll_mkdir+0x104/0x220 [lustre]
 [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff8119d759>] vfs_mkdir+0xd9/0x140
 [<ffffffff811a04e7>] sys_mkdirat+0xc7/0x1b0
 [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
 [<ffffffff811a05e8>] sys_mkdir+0x18/0x20
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

Comment by Cliff White (Inactive) [ 10/Mar/16 ]

Rest of kernel panic

Kernel panic - not syncing: LBUG
Pid: 85659, comm: pct Not tainted 2.6.32-504.30.3.el6.x86_64 #1
Call Trace:
 [<ffffffff815293fc>] ? panic+0xa7/0x16f
 [<ffffffffa045cecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa09ff192>] ? ll_prep_inode+0x752/0xc40 [lustre]
 [<ffffffffa0747d60>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
 [<ffffffffa0a132b2>] ? ll_new_node+0x682/0x7f0 [lustre]
 [<ffffffffa0a15b04>] ? ll_mkdir+0x104/0x220 [lustre]
 [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff8119d759>] ? vfs_mkdir+0xd9/0x140
 [<ffffffff811a04e7>] ? sys_mkdirat+0xc7/0x1b0
 [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
 [<ffffffff811a05e8>] ? sys_mkdir+0x18/0x20
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu

The crash dump has been copied to lola:/scratch/crashdumps/lu-7798/

Comment by Cliff White (Inactive) [ 06/Oct/16 ]

We appear to have hit this again while doing ZFS metadata testing:

Oct  6 22:18:08 spirit-17 kernel: LustreError: 36451:0:(llite_lib.c:2465:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
Oct  6 22:18:08 spirit-17 kernel: LustreError: 36451:0:(llite_lib.c:2465:ll_prep_inode()) LBUG
Oct  6 22:18:08 spirit-17 kernel: Pid: 36451, comm: mdtest
Oct  6 22:18:08 spirit-17 kernel: Call Trace:
Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa08ab853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa08abdf5>] lbug_with_loc+0x45/0xc0 [libcfs]
Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa0de7002>] ll_prep_inode+0xb12/0xe60 [lustre]
Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa0df6c3f>] ? ll_lookup_it+0x64f/0xe20 [lustre]
Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa0df7c2c>] ll_atomic_open+0x81c/0x12c0 [lustre]
Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811ec8a1>] do_last+0xa11/0x1270 
Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811eede2>] path_openat+0xc2/0x490
Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811f05ab>] do_filp_open+0x4b/0xb0
Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811fd147>] ? __alloc_fd+0xa7/0x130
Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811ddf53>] do_sys_open+0xf3/0x1f0
Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811de06e>] SyS_open+0x1e/0x20
Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811de0a6>] SyS_creat+0x16/0x20
Oct  6 22:18:08 spirit-17 kernel: [<ffffffff81646a09>] system_call_fastpath+0x16/0x1b
Oct  6 22:18:08 spirit-17 kernel:
Comment by Peter Jones [ 07/Oct/16 ]

This seems to be a duplicate of LU-7422.
