In the case of DISP_OPEN_CREATE the client expects a valid FID value in the reply when it_status == 0.
When reint_open returns ENOENT the FID is not set, so the client receives a FID filled with zeros. This may cause the following panic:
We faced the issue on a DNE setup. For an unknown reason (possibly a failover) the FLDB on the master MDT did not include the OST sequence ranges.
We hit the panic above every time we tried to create a regular file in a directory located on mdt1.
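For context, a minimal sketch (hypothetical values, but the real struct lu_fid and fid_is_sane() semantics) of why a zero-filled FID in the reply trips the client-side assertion:

/* Illustration only: a reply whose FID was never filled in reaches the
 * client as an all-zero struct lu_fid, which fails fid_is_sane() and so
 * fires the LASSERT() in ll_prep_inode(). */
struct lu_fid fid = { .f_seq = 0, .f_oid = 0, .f_ver = 0 };

if (!fid_is_sane(&fid))         /* fid_is_sane() is false for a zero FID */
        LBUG();                 /* -> ASSERTION( fid_is_sane(...) ) failed */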
Frank Heckes (Inactive)
added a comment (edited) - Used build '20160104' from branch master (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160104).
DNE is enabled and MDSes are in active-active HA configuration. MDTs have been formatted using ldiskfs, OSTs using zfs.
The original error happened again on two client nodes at almost the same time during failback of MDS resources to the primary node, while soak testing the build specified above.
<0>LustreError: 75738:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
<0>LustreError: 75738:0:(llite_lib.c:2295:ll_prep_inode()) LBUG
<4>Pid: 75738, comm: mdtest
...
<0>Kernel panic - not syncing: LBUG
<4>Pid: 75738, comm: mdtest Not tainted 2.6.32-504.30.3.el6.x86_64 #1
Crash dump files have been written for both nodes (lola-26,29) and have been saved to lola-1:/scratch/crashdumps/lu-7422/lola-26-127.0.0.1-2016-01-05-19:02:53 and lola-29-127.0.0.1-2016-01-05-19:02:56. Log files can be provided on demand.
Maloo set -1 because there are 2 test failures:
1. sanity 230f - it is marked as a known issue, LU-7549
2. conf-sanity 51. I don't see how it could be connected with my patch:
Here the error occurred after the remount of the MDTs on the MDS (lola-10) completed successfully (2015-11-26 00:27:36).
Pasted the stack trace once more, as the context seems to be different from the one above.
Nov 26 00:29:48 lola-29 kernel: LustreError: 65535:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
Nov 26 00:29:48 lola-29 kernel: LustreError: 65535:0:(llite_lib.c:2295:ll_prep_inode()) LBUG
Nov 26 00:29:48 lola-29 kernel: Pid: 65535, comm: pct
Nov 26 00:29:48 lola-29 kernel:
Nov 26 00:29:48 lola-29 kernel: Call Trace:
Nov 26 00:29:48 lola-29 kernel: [<ffffffffa050b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Nov 26 00:29:48 lola-29 kernel: [<ffffffffa050be77>] lbug_with_loc+0x47/0xb0 [libcfs]
Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0abdb62>] ll_prep_inode+0x752/0xc40 [lustre]
Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0802c10>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0ad29d2>] ll_new_node+0x682/0x7f0 [lustre]
Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0ad5224>] ll_mkdir+0x104/0x220 [lustre]
Nov 26 00:29:48 lola-29 kernel: [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
Nov 26 00:29:48 lola-29 kernel: [<ffffffff8119d759>] vfs_mkdir+0xd9/0x140
Nov 26 00:29:48 lola-29 kernel: [<ffffffff811a04e7>] sys_mkdirat+0xc7/0x1b0
Nov 26 00:29:48 lola-29 kernel: [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
Nov 26 00:29:48 lola-29 kernel: [<ffffffff811a05e8>] sys_mkdir+0x18/0x20
Nov 26 00:29:48 lola-29 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Nov 26 00:29:48 lola-29 kernel:
Nov 26 00:29:48 lola-29 kernel: LustreError: dumping log to /tmp/lustre-log.1448526588.65535
Chronologically, this event can be correlated with the following error on lola-10:
Sergey Cheremencev
added a comment - The problem was encountered on Lustre 2.5.1.
"The fid_is_sane() check should be skipped if -ENOENT is returned:"
Yes, but the problem here is that 0 is returned instead of -ENOENT:
static int mdt_intent_reint(enum mdt_it_code opcode,
                            struct mdt_thread_info *info,
                            struct ldlm_lock **lockp,
                            __u64 flags)
...
        if (rep->lock_policy_res2 == -ENOENT &&
            mdt_get_disposition(rep, DISP_LOOKUP_NEG))
                rep->lock_policy_res2 = 0;
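One possible server-side adjustment (a sketch only, assuming the squashing above is what hides -ENOENT from the open-create path; not claiming this is the patch that actually landed) is to keep -ENOENT visible when the intent requested creation, so the client never sees it_status == 0 together with an unset FID:

        /* Sketch: only squash -ENOENT for plain negative lookups; when
         * DISP_OPEN_CREATE is set the client expects either a valid FID
         * or a real error, so let -ENOENT pass through in that case. */
        if (rep->lock_policy_res2 == -ENOENT &&
            mdt_get_disposition(rep, DISP_LOOKUP_NEG) &&
            !mdt_get_disposition(rep, DISP_OPEN_CREATE))
                rep->lock_policy_res2 = 0;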
Andreas Dilger
added a comment - What version of Lustre are you testing? The fid_is_sane() check should be skipped if -ENOENT is returned:
int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
                  struct super_block *sb, struct lookup_intent *it)
{
        :
        :
        rc = md_get_lustre_md(sbi->ll_md_exp, req, sbi->ll_dt_exp,
                              sbi->ll_md_exp, &md);
        if (rc != 0)
                GOTO(cleanup, rc);
        :
        :
        /*
         * At this point server returns to client's same fid as client
         * generated for creating. So using ->fid1 is okay here.
         */
        LASSERT(fid_is_sane(&md.body->mbo_fid1));
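A more defensive client-side variant (again only a sketch, reusing the names from the excerpt above; not necessarily the fix taken for this ticket) would fail the request instead of asserting when the reply carries an insane FID:

        /* Sketch: report a protocol error instead of LBUG()ing the whole
         * node when the server replied success but left the FID unset. */
        if (!fid_is_sane(&md.body->mbo_fid1)) {
                CERROR("%s: insane FID "DFID" in reply: rc = %d\n",
                       ll_get_fsname(sb, NULL, 0),
                       PFID(&md.body->mbo_fid1), -EPROTO);
                GOTO(cleanup, rc = -EPROTO);
        }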