[LU-4653] Hit LBUG ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed after upgrade OST from 2.5.0 to 2.6 Created: 20/Feb/14  Updated: 28/May/14  Resolved: 24/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.4.3
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: mn4
Environment:

before upgrade:
server and client are running 2.5.0

upgrade OSS to lustre-master build # 1876
MDS and clients are still running 2.5.0


Issue Links:
Related
is related to LU-4957 :osp_precreate_send()) ASSERTION( osp... Resolved
Severity: 3
Rank (Obsolete): 12727

 Description   

Hit following LBUG when running rolling upgrade testing. Test steps are
1. setup the system with 2.5.0
2. keep the system on and upgrade OSS to lustre-master build #1876(teg-2.5.55)
3. when mounting OST, MDS reboot

Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
LustreError: 2152:0:(lustre_fid.h:725:lu_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x100000000:0x21:0x0], fid2:[0x100000001:0x0:0x0]
LustreError: 2152:0:(lustre_fid.h:725:lu_fid_diff()) LBUG
Pid: 2152, comm: osp-pre-0

Call Trace:
 [<ffffffffa01ff895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa01ffe97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0af896d>] osp_precreate_cleanup_orphans+0x10fd/0x1130 [osp]
 [<ffffffffa0498161>] ? import_at_get_index+0xb1/0xf0 [ptlrpc]
 [<ffffffff81063410>] ? default_wake_function+0x0/0x20
 [<ffffffffa0afac6f>] osp_precreate_thread+0x20f/0x1b00 [osp]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff81063410>] ? default_wake_function+0x0/0x20
 [<ffffffffa0afaa60>] ? osp_precreate_thread+0x0/0x1b00 [osp]
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 2152, comm: osp-pre-0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff8150de58>] ? panic+0xa7/0x16f
 [<ffffffffa01ffeeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0af896d>] ? osp_precreate_cleanup_orphans+0x10fd/0x1130 [osp]
 [<ffffffffa0498161>] ? import_at_get_index+0xb1/0xf0 [ptlrpc]
 [<ffffffff81063410>] ? default_wake_function+0x0/0x20
 [<ffffffffa0afac6f>] ? osp_precreate_thread+0x20f/0x1b00 [osp]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff81063410>] ? default_wake_function+0x0/0x20
 [<ffffffffa0afaa60>] ? osp_precreate_thread+0x0/0x1b00 [osp]
 [<ffffffff81096a36>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-358.18.1.el6_lustre.x86_64 (jenkins@builder-1-sde1-el6-x8664.lab.whamcloud.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Fri Oct 11 16:41:53 PDT 2013
Command line: ro root=UUID=dec021fd-a287-4254-8c2b-0a004dfdde46 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=ttyS0,115200 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off  memmap=exactmap memmap=538K@64K memmap=132562K@49690K elfcorehdr=182252K memmap=64K$0K memmap=38K$602K memmap=104K$920K memmap=8K$3668600K memmap=72K#3668608K memmap=184K#3668680K memmap=263296K$3668864K memmap=2048K$4192256K
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  Centaur CentaurHauls
BIOS-provided physical RAM map:


 Comments   
Comment by Jodi Levi (Inactive) [ 20/Feb/14 ]

Di,
Can you please have a look at this one and comment?
Thank you!

Comment by Di Wang [ 20/Feb/14 ]

Hmm, probably caused by this commit

commit da12d3ba35bbb86c8e5860a5ed161a55f01b69d5
Author: Fan Yong <fan.yong@intel.com>
Date: Sun Nov 24 16:59:00 2013 +0800

LU-3569 ofd: packing ost_idx in IDIF

For a normal FID, we can know on which target the related object
is allocated via querying FLDB; but it is not true for an IDIF.

To locate the OST via the given IDIF, when the IDIF is generated,
we pack the OST index in it. Then for any given FID, in spite of
it is a normal FID or not, we has the method to know which target
it belongs to. That is useful for LFSCK.

For old IDIF, the OST index is not part of the IDIF, means that
different OSTs may have the same IDIFs, that may cause the IFID
in LMA does not match the read FID. Under such case, we need to
make some compatible check to avoid to trigger unexpected.

tgt_validate_obdo() converts the ostid contained in the RPC body
to fid and changes the "struct ost_id" union, then the users can
access ost_id::oi_fid directly without call ostid_to_fid() again.

It also contains some other fixing and cleanup.

Test-Parameters: testlist=sanity-scrub
Signed-off-by: wang di <di.wang@intel.com>
Signed-off-by: Fan Yong <fan.yong@intel.com>
Change-Id: I228f2f6cd9310193a1724046cee15e3b2103c8e2
Reviewed-on: http://review.whamcloud.com/7053
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>

Comment by Di Wang [ 08/Mar/14 ]

http://review.whamcloud.com/9560

Comment by Sarah Liu [ 23/Apr/14 ]

Hit this LBUG and MDS reboot

fat-amd-1.lab.whamcloud.com login: Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
LustreError: 2165:0:(lustre_fid.h:725:lu_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x100000000:0x21:0x0], fid2:[0x100000001:0x0:0x0]
LustreError: 2165:0:(lustre_fid.h:725:lu_fid_diff()) LBUG
Pid: 2165, comm: osp-pre-0

Call Trace:
 [<ffffffffa0216895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0216e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0b1da3d>] osp_precreate_cleanup_orphans+0x10fd/0x1130 [osp]
 [<ffffffffa04b0d81>] ? import_at_get_index+0xb1/0xf0 [ptlrpc]
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b1fd4f>] osp_precreate_thread+0x20f/0x1b00 [osp]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b1fb40>] ? osp_precreate_thread+0x0/0x1b00 [osp]
 [<ffffffff8109aee6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ae50>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 2165, comm: osp-pre-0 Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff81527983>] ? panic+0xa7/0x16f
 [<ffffffffa0216eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0b1da3d>] ? osp_precreate_cleanup_orphans+0x10fd/0x1130 [osp]
 [<ffffffffa04b0d81>] ? import_at_get_index+0xb1/0xf0 [ptlrpc]
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b1fd4f>] ? osp_precreate_thread+0x20f/0x1b00 [osp]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b1fb40>] ? osp_precreate_thread+0x0/0x1b00 [osp]
 [<ffffffff8109aee6>] ? kthread+0x96/0xa0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8109ae50>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Comment by Sarah Liu [ 23/Apr/14 ]

Ah, it is rolling upgrade. MDS still run with 2.5 version. so we need port http://review.whamcloud.com/9560 to 2.5.

Comment by Di Wang [ 23/Apr/14 ]

http://review.whamcloud.com/#/c/10058 b2_5
http://review.whamcloud.com/10059 b2_4

Comment by Sarah Liu [ 24/Apr/14 ]

I tried the patch for b2_5, got following error in the recovery stage after OSS upgrade to 2.6, MDS reboot again

fat-amd-1.lab.whamcloud.com login: root
Password: 
Lustre: lustre-MDT0000: Recovery over after 0:10, of 2 clients 2 recovered and 0 were evicted.
LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) ASSERTION( osp_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed: reply fid [0x100000001:0x0:0x0] pre used fid [0x100000000:0x1e8:0x0]
LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) LBUG
Pid: 2166, comm: osp-pre-0

Call Trace:
 [<ffffffffa0209895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0209e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0b119d7>] osp_precreate_send+0x1a47/0x1b00 [osp]
 [<ffffffffa0491304>] ? lustre_msg_set_timeout+0x74/0xc0 [ptlrpc]
 [<ffffffffa0b11f79>] osp_precreate_thread+0x4e9/0xc50 [osp]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b11a90>] ? osp_precreate_thread+0x0/0xc50 [osp]
Last login: Wed  [<ffffffff8109aee6>] kthread+0x96/0xa0
Apr 23 17:31:38  [<ffffffff8100c20a>] child_rip+0xa/0x20
on ttyS0
 [<ffffffff8109ae50>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 2166, comm: osp-pre-0 Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff81527983>] ? panic+0xa7/0x16f
 [<ffffffffa0209eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0b119d7>] ? osp_precreate_send+0x1a47/0x1b00 [osp]
 [<ffffffffa0491304>] ? lustre_msg_set_timeout+0x74/0xc0 [ptlrpc]
 [<ffffffffa0b11f79>] ? osp_precreate_thread+0x4e9/0xc50 [osp]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b11a90>] ? osp_precreate_thread+0x0/0xc50 [osp]
 [<ffffffff8109aee6>] ? kthread+0x96/0xa0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8109ae50>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Comment by Jodi Levi (Inactive) [ 24/Apr/14 ]

Patch landed to Master. Patches for other branches will be tracked outside of this ticket.

Comment by Di Wang [ 25/Apr/14 ]
Lustre: lustre-MDT0000: Recovery over after 0:10, of 2 clients 2 recovered and 0 were evicted.
LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) ASSERTION( osp_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed: reply fid [0x100000001:0x0:0x0] pre used fid [0x100000000:0x1e8:0x0]
LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) LBUG
Pid: 2166, comm: osp-pre-0

Hmm, this is different issue here, it seems MDT get a lower FID( < its last precreate used FID) from OST during recovery, which is wrong. Though this ASSERT might be improper here. But anyway probably a new ticket?

Comment by Di Wang [ 25/Apr/14 ]

Create a new ticket in LU-4957

Comment by Alex Zhuravlev [ 06/May/14 ]

Di, look at the details:

[0x100000001:0x0:0x0] pre used fid [0x100000000:0x1e8:0x0] LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) LBUG

so, it got a new sequence?

Comment by Di Wang [ 06/May/14 ]

Alex: 0x100000001 is the same sequence as 0x100000000, and we just embedded ost_index here. or you mean sth else? btw: probably discussed this in 4957?

Comment by Alex Zhuravlev [ 12/May/14 ]

Sara, could you try to reproduce this and grab a full debug log please? you can attach that to LU-4957

Comment by Sarah Liu [ 13/May/14 ]

Alex, ok, will do it now

Comment by Andreas Dilger [ 28/May/14 ]

Patch landed for 2.5.2.

Generated at Sat Feb 10 01:44:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.