[LU-4653] Hit LBUG ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed after upgrade OST from 2.5.0 to 2.6 Created: 20/Feb/14 Updated: 28/May/14 Resolved: 24/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.1, Lustre 2.4.3 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sarah Liu | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4 | ||
| Environment: |
before upgrade: upgrade OSS to lustre-master build # 1876 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 12727 | ||||||||
| Description |
|
Hit following LBUG when running rolling upgrade testing. Test steps are Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted. LustreError: 2152:0:(lustre_fid.h:725:lu_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x100000000:0x21:0x0], fid2:[0x100000001:0x0:0x0] LustreError: 2152:0:(lustre_fid.h:725:lu_fid_diff()) LBUG Pid: 2152, comm: osp-pre-0 Call Trace: [<ffffffffa01ff895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa01ffe97>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa0af896d>] osp_precreate_cleanup_orphans+0x10fd/0x1130 [osp] [<ffffffffa0498161>] ? import_at_get_index+0xb1/0xf0 [ptlrpc] [<ffffffff81063410>] ? default_wake_function+0x0/0x20 [<ffffffffa0afac6f>] osp_precreate_thread+0x20f/0x1b00 [osp] [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 [<ffffffff81063410>] ? default_wake_function+0x0/0x20 [<ffffffffa0afaa60>] ? osp_precreate_thread+0x0/0x1b00 [osp] [<ffffffff81096a36>] kthread+0x96/0xa0 [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffff810969a0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Kernel panic - not syncing: LBUG Pid: 2152, comm: osp-pre-0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1 Call Trace: [<ffffffff8150de58>] ? panic+0xa7/0x16f [<ffffffffa01ffeeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs] [<ffffffffa0af896d>] ? osp_precreate_cleanup_orphans+0x10fd/0x1130 [osp] [<ffffffffa0498161>] ? import_at_get_index+0xb1/0xf0 [ptlrpc] [<ffffffff81063410>] ? default_wake_function+0x0/0x20 [<ffffffffa0afac6f>] ? osp_precreate_thread+0x20f/0x1b00 [osp] [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 [<ffffffff81063410>] ? default_wake_function+0x0/0x20 [<ffffffffa0afaa60>] ? osp_precreate_thread+0x0/0x1b00 [osp] [<ffffffff81096a36>] ? kthread+0x96/0xa0 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 [<ffffffff810969a0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Initializing cgroup subsys cpuset Initializing cgroup subsys cpu Linux version 2.6.32-358.18.1.el6_lustre.x86_64 (jenkins@builder-1-sde1-el6-x8664.lab.whamcloud.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Fri Oct 11 16:41:53 PDT 2013 Command line: ro root=UUID=dec021fd-a287-4254-8c2b-0a004dfdde46 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=ttyS0,115200 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off memmap=exactmap memmap=538K@64K memmap=132562K@49690K elfcorehdr=182252K memmap=64K$0K memmap=38K$602K memmap=104K$920K memmap=8K$3668600K memmap=72K#3668608K memmap=184K#3668680K memmap=263296K$3668864K memmap=2048K$4192256K KERNEL supported cpus: Intel GenuineIntel AMD AuthenticAMD Centaur CentaurHauls BIOS-provided physical RAM map: |
| Comments |
| Comment by Jodi Levi (Inactive) [ 20/Feb/14 ] |
|
Di, |
| Comment by Di Wang [ 20/Feb/14 ] |
|
Hmm, probably caused by this commit commit da12d3ba35bbb86c8e5860a5ed161a55f01b69d5 For a normal FID, we can know on which target the related object To locate the OST via the given IDIF, when the IDIF is generated, For old IDIF, the OST index is not part of the IDIF, means that tgt_validate_obdo() converts the ostid contained in the RPC body It also contains some other fixing and cleanup. Test-Parameters: testlist=sanity-scrub |
| Comment by Di Wang [ 08/Mar/14 ] |
| Comment by Sarah Liu [ 23/Apr/14 ] |
|
Hit this LBUG and MDS reboot fat-amd-1.lab.whamcloud.com login: Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted. LustreError: 2165:0:(lustre_fid.h:725:lu_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x100000000:0x21:0x0], fid2:[0x100000001:0x0:0x0] LustreError: 2165:0:(lustre_fid.h:725:lu_fid_diff()) LBUG Pid: 2165, comm: osp-pre-0 Call Trace: [<ffffffffa0216895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa0216e97>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa0b1da3d>] osp_precreate_cleanup_orphans+0x10fd/0x1130 [osp] [<ffffffffa04b0d81>] ? import_at_get_index+0xb1/0xf0 [ptlrpc] [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 [<ffffffffa0b1fd4f>] osp_precreate_thread+0x20f/0x1b00 [osp] [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 [<ffffffffa0b1fb40>] ? osp_precreate_thread+0x0/0x1b00 [osp] [<ffffffff8109aee6>] kthread+0x96/0xa0 [<ffffffff8100c20a>] child_rip+0xa/0x20 [<ffffffff8109ae50>] ? kthread+0x0/0xa0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 Kernel panic - not syncing: LBUG Pid: 2165, comm: osp-pre-0 Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 Call Trace: [<ffffffff81527983>] ? panic+0xa7/0x16f [<ffffffffa0216eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs] [<ffffffffa0b1da3d>] ? osp_precreate_cleanup_orphans+0x10fd/0x1130 [osp] [<ffffffffa04b0d81>] ? import_at_get_index+0xb1/0xf0 [ptlrpc] [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 [<ffffffffa0b1fd4f>] ? osp_precreate_thread+0x20f/0x1b00 [osp] [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 [<ffffffffa0b1fb40>] ? osp_precreate_thread+0x0/0x1b00 [osp] [<ffffffff8109aee6>] ? kthread+0x96/0xa0 [<ffffffff8100c20a>] ? child_rip+0xa/0x20 [<ffffffff8109ae50>] ? kthread+0x0/0xa0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 Initializing cgroup subsys cpuset Initializing cgroup subsys cpu |
| Comment by Sarah Liu [ 23/Apr/14 ] |
|
Ah, it is rolling upgrade. MDS still run with 2.5 version. so we need port http://review.whamcloud.com/9560 to 2.5. |
| Comment by Di Wang [ 23/Apr/14 ] |
|
http://review.whamcloud.com/#/c/10058 b2_5 |
| Comment by Sarah Liu [ 24/Apr/14 ] |
|
I tried the patch for b2_5, got following error in the recovery stage after OSS upgrade to 2.6, MDS reboot again fat-amd-1.lab.whamcloud.com login: root Password: Lustre: lustre-MDT0000: Recovery over after 0:10, of 2 clients 2 recovered and 0 were evicted. LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) ASSERTION( osp_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed: reply fid [0x100000001:0x0:0x0] pre used fid [0x100000000:0x1e8:0x0] LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) LBUG Pid: 2166, comm: osp-pre-0 Call Trace: [<ffffffffa0209895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa0209e97>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa0b119d7>] osp_precreate_send+0x1a47/0x1b00 [osp] [<ffffffffa0491304>] ? lustre_msg_set_timeout+0x74/0xc0 [ptlrpc] [<ffffffffa0b11f79>] osp_precreate_thread+0x4e9/0xc50 [osp] [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 [<ffffffffa0b11a90>] ? osp_precreate_thread+0x0/0xc50 [osp] Last login: Wed [<ffffffff8109aee6>] kthread+0x96/0xa0 Apr 23 17:31:38 [<ffffffff8100c20a>] child_rip+0xa/0x20 on ttyS0 [<ffffffff8109ae50>] ? kthread+0x0/0xa0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 Kernel panic - not syncing: LBUG Pid: 2166, comm: osp-pre-0 Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 Call Trace: [<ffffffff81527983>] ? panic+0xa7/0x16f [<ffffffffa0209eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs] [<ffffffffa0b119d7>] ? osp_precreate_send+0x1a47/0x1b00 [osp] [<ffffffffa0491304>] ? lustre_msg_set_timeout+0x74/0xc0 [ptlrpc] [<ffffffffa0b11f79>] ? osp_precreate_thread+0x4e9/0xc50 [osp] [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 [<ffffffffa0b11a90>] ? osp_precreate_thread+0x0/0xc50 [osp] [<ffffffff8109aee6>] ? kthread+0x96/0xa0 [<ffffffff8100c20a>] ? child_rip+0xa/0x20 [<ffffffff8109ae50>] ? kthread+0x0/0xa0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 Initializing cgroup subsys cpuset Initializing cgroup subsys cpu |
| Comment by Jodi Levi (Inactive) [ 24/Apr/14 ] |
|
Patch landed to Master. Patches for other branches will be tracked outside of this ticket. |
| Comment by Di Wang [ 25/Apr/14 ] |
Lustre: lustre-MDT0000: Recovery over after 0:10, of 2 clients 2 recovered and 0 were evicted. LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) ASSERTION( osp_fid_diff(fid, &d->opd_pre_used_fid) > 0 ) failed: reply fid [0x100000001:0x0:0x0] pre used fid [0x100000000:0x1e8:0x0] LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) LBUG Pid: 2166, comm: osp-pre-0 Hmm, this is different issue here, it seems MDT get a lower FID( < its last precreate used FID) from OST during recovery, which is wrong. Though this ASSERT might be improper here. But anyway probably a new ticket? |
| Comment by Di Wang [ 25/Apr/14 ] |
|
Create a new ticket in |
| Comment by Alex Zhuravlev [ 06/May/14 ] |
|
Di, look at the details: [0x100000001:0x0:0x0] pre used fid [0x100000000:0x1e8:0x0] LustreError: 2166:0:(osp_precreate.c:476:osp_precreate_send()) LBUG so, it got a new sequence? |
| Comment by Di Wang [ 06/May/14 ] |
|
Alex: 0x100000001 is the same sequence as 0x100000000, and we just embedded ost_index here. or you mean sth else? btw: probably discussed this in 4957? |
| Comment by Alex Zhuravlev [ 12/May/14 ] |
|
Sara, could you try to reproduce this and grab a full debug log please? you can attach that to |
| Comment by Sarah Liu [ 13/May/14 ] |
|
Alex, ok, will do it now |
| Comment by Andreas Dilger [ 28/May/14 ] |
|
Patch landed for 2.5.2. |