[LU-4878] fld_server_lookup() ASSERTION( fld->lsf_control_exp ) failed Created: 10/Apr/14  Updated: 15/May/14  Resolved: 15/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Gregoire Pichon Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: mn4

Issue Links:
Related
is related to LU-3126 conf-sanity test_41b: fld_server_look... Resolved
is related to LU-3915 After upgrade from 2.4.0 to 2.5, can ... Resolved
Severity: 3
Rank (Obsolete): 13490

 Description   

The following LBUG appeared at a customer site during the mount process, on all OSSs running Lustre 2.4.3.

LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) ASSERTION( fld->lsf_control_exp ) failed:
LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) LBUG

Pid: 12838, comm: mount.lustre

Call Trace:
 libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 lbug_with_loc+0x47/0xb0 [libcfs]
 fld_server_lookup+0x2f7/0x3d0 [fld]
 osd_fld_lookup+0x71/0x1d0 [osd_ldiskfs]
 osd_remote_fid+0x9a/0x280 [osd_ldiskfs]
 osd_index_ea_lookup+0x521/0x850 [osd_ldiskfs]
 dt_lookup_dir+0x6f/0x130 [obdclass]
 llog_osd_open+0x485/0xc00 [obdclass]
 llog_open+0xba/0x2c0 [obdclass]
 mgc_process_log [mgc]
 mgc_process_config [mgc]
 lustre_process_log [obdclass]
 server_start_targets [obdclass]
 server_fill_super [obdclass]
 lustre_fill_super [obdclass]
 get_sb_nodev
 lustre_get_sb
 vfs_kern_mount
 do_kern_mount
 do_mount
 sys_mount
 system_call_fastpath

This issue seems to be the same as LU-3126, for which a patch has landed in Lustre 2.5. Unfortunately, no patch has been provided for the Lustre 2.4 release.
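
For context, the LU-3126 fix replaces the hard assertion in fld_server_lookup() with a graceful error return when the OST is not yet connected to MDT0. Below is a minimal, self-contained user-space C sketch of that pattern, not the actual Lustre code; the structures are simplified stand-ins, and only the names fld_server_lookup and lsf_control_exp come from the traces above. Note that -EIO is -5, matching the "rc = -5" messages reported later in this ticket.

#include <errno.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct obd_export;                        /* opaque: connection to MDT0 */
struct lu_server_fld {
        struct obd_export *lsf_control_exp;
};

/*
 * Sketch of the LU-3126 idea: instead of asserting that the control
 * export exists (and LBUG'ing the OSS during mount when it does not),
 * report the condition and return -EIO so the caller can cope.
 */
static int fld_server_lookup(struct lu_server_fld *fld, unsigned long long seq)
{
        if (fld->lsf_control_exp == NULL) {
                fprintf(stderr, "lookup %#llx, but not connected to MDT0 yet: rc = %d\n",
                        seq, -EIO);
                return -EIO;
        }
        /* ... perform the FLD lookup through the control export ... */
        return 0;
}

int main(void)
{
        struct lu_server_fld fld = { .lsf_control_exp = NULL };
        printf("rc = %d\n", fld_server_lookup(&fld, 0x7d));  /* rc = -5 */
        return 0;
}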



 Comments   
Comment by Bruno Faccini (Inactive) [ 10/Apr/14 ]

Hello Gregoire,
Did you mean that this same LBUG occurred on all OSSs of the same FS at OST mount time?
And yes, you are right, the LU-3126 patch has not been back-ported to b2_4, but that is mainly because the issue was thought unlikely to happen ...

Comment by Bob Glossman (Inactive) [ 10/Apr/14 ]

LU-3126 backport to b2_4:
http://review.whamcloud.com/9929

Comment by Antoine Percher [ 10/Apr/14 ]

Yes, we hit the same LBUG on all OSSs, with or without abort-recov, when mounting all OSTs at the same time or when we try to mount just one OST manually.
And it is only with 2.4.3 with some additional patches; it works fine on 2.4.2 with some other additional patches.

Comment by Gregoire Pichon [ 14/Apr/14 ]

I have tested patch #9929 posted in Gerrit. Unfortunately, the OSS still crashes when mounting an OST.

Here is the stack:

<3>LustreError: 6577:0:(fld_handler.c:174:fld_server_lookup()) srv-fs_pv-OST0000: lookup 0x7d, but not connects to MDT0 yet: rc = -5.
<3>LustreError: 6577:0:(osd_handler.c:2135:osd_fld_lookup()) fs_pv-OST0000-osd: cannot find FLD range for 0x7d: rc = -5
<3>LustreError: 6577:0:(osd_handler.c:3344:osd_mdt_seq_exists()) fs_pv-OST0000-osd: Can not lookup fld for 0x7d
<0>LustreError: 6577:0:(osd_handler.c:2651:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed: 
<0>LustreError: 6577:0:(osd_handler.c:2651:osd_object_ref_del()) LBUG
<4>Pid: 6577, comm: mount.lustre
<4>
<4>Call Trace:
<4> [<ffffffffa0d57895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0d57e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa04197e7>] osd_object_ref_del+0x1e7/0x220 [osd_ldiskfs]
<4> [<ffffffffa0ec1fee>] llog_osd_destroy+0x48e/0xb20 [obdclass]
<4> [<ffffffffa0e91d61>] llog_destroy+0x51/0x170 [obdclass]
<4> [<ffffffffa0e96b34>] llog_erase+0x1c4/0x1e0 [obdclass]
<4> [<ffffffffa0e97401>] llog_backup+0x231/0x500 [obdclass]
<4> [<ffffffffa049ad66>] mgc_process_log+0x1636/0x18f0 [mgc]
<4> [<ffffffffa049c514>] mgc_process_config+0x594/0xed0 [mgc]
<4> [<ffffffffa0ede64c>] lustre_process_log+0x25c/0xaa0 [obdclass]
<4> [<ffffffffa0f126d3>] server_start_targets+0x1833/0x19c0 [obdclass]
<4> [<ffffffffa0f1340c>] server_fill_super+0xbac/0x1660 [obdclass]
<4> [<ffffffffa0ee3d68>] lustre_fill_super+0x1d8/0x530 [obdclass]
<4> [<ffffffff8118c7df>] get_sb_nodev+0x5f/0xa0
<4> [<ffffffffa0edb3b5>] lustre_get_sb+0x25/0x30 [obdclass]
<4> [<ffffffff8118be3b>] vfs_kern_mount+0x7b/0x1b0
<4> [<ffffffff8118bfe2>] do_kern_mount+0x52/0x130
<4> [<ffffffff811acfeb>] do_mount+0x2fb/0x930
<4> [<ffffffff811ad6b0>] sys_mount+0x90/0xe0
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>
Comment by Bruno Faccini (Inactive) [ 15/Apr/14 ]

Hello Gregoire,
Thanks for the update.
But concerning the new and different stack trace you reported, in my opinion it looks more like a different problem, one that has already been reported in LU-3915.
So you may now give a try to my b2_4 back-port of master patch #7673 from LU-3915, which I just pushed at http://review.whamcloud.com/9958.
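
For reference, the idea behind "don't assert on possible upgrade" is to tolerate on-disk llog state left by older code instead of LBUG'ing on ASSERTION( inode->i_nlink > 0 ). Here is a hedged, simplified user-space C sketch of that pattern, not the actual #7673 patch; the struct and the message are stand-ins, and only the names osd_object_ref_del and i_nlink come from the trace above.

#include <stdio.h>

/* Simplified stand-in for the kernel inode field involved. */
struct inode {
        unsigned int i_nlink;
};

/*
 * Sketch of the "don't assert on possible upgrade" idea: an llog file
 * written by pre-#5049 code may reach this point with i_nlink already 0,
 * so warn and skip the decrement instead of crashing the OSS.
 */
static void osd_object_ref_del(struct inode *inode)
{
        if (inode->i_nlink == 0) {
                fprintf(stderr, "nlink already 0, possibly an upgraded llog; skipping\n");
                return;
        }
        inode->i_nlink--;
}

int main(void)
{
        struct inode old_llog = { .i_nlink = 0 };
        osd_object_ref_del(&old_llog);  /* survives instead of LBUG'ing */
        return 0;
}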

Comment by Gregoire Pichon [ 15/Apr/14 ]

The LBUG I hit when testing patch #9929 has the same stack trace as LU-3915, which reports an OSS crash when mounting OSTs after an upgrade from 2.4.0 to 2.5.

In my case, I was upgrading from 2.4.2 to 2.4.3 with a few additional patches, including #5049 "LU-2059 llog: MGC to use OSD API for backup logs".

Patch #7673 "LU-3915 osd-ldiskfs: don't assert on possible upgrade" seems to fix the upgrade issue caused by #5049. So I probably need to add #7673 to the list of patches on top of 2.4.3, do you agree?

Comment by Bruno Faccini (Inactive) [ 15/Apr/14 ]

You may have missed my previous update, which already confirmed what you finally found!
So yes, I agree that you can add #7673, or its back-port, on top of your 2.4.3 version that also includes #5049 ...

Comment by Gregoire Pichon [ 15/Apr/14 ]

Thanks for the backport, Bruno. Our comments interleaved!

I have tested a Lustre 2.4.3 version with both additional patches:

  • #9929 LU-3126 osd: remove fld lookup during configuration
  • #9958 LU-4878 osd-ldiskfs: don't assert on possible upgrade (backport of LU-3915)

The OSS is able to start without any problem and the filesystem is operational.

I am now waiting for these patches to be fully approved and Maloo tested so they can be delivered to the customer.

Comment by Bruno Faccini (Inactive) [ 16/Apr/14 ]

Hello Gregoire,
I am not sure that my patch #9958 will finally be fully accepted and landed to b2_4 ... The main reason is that #5049 is itself still not in b2_4 and may never be, in which case #9958 is not necessary, as Mike commented in the patch with that reason!
This points to a limitation of the process when people decide to add more patches on top of the releases we have tested for regressions and interoperability...

Comment by Gregoire Pichon [ 16/Apr/14 ]

Hello Bruno,

Actually, the patch #5049 "LU-2059 llog: MGC to use OSD API for backup logs" has been integrated by Bull on top of release 2.4.x because the customer hit the LBUG ASSERTION(cli->cl_mgc_configs_dir) described in LU-2959. As mentioned in that ticket, the LBUG is fixed by patch #5049.

These problems occurred in the Lustre 2.4.x release and need to be addressed.

Comment by Bruno Faccini (Inactive) [ 16/Apr/14 ]

Gregoire, don't misunderstand me, I did not mean that you added patches without good reason, only that by doing so you fall outside our regression/interop testing process.
Concerning the fact that you added #5049 due to LU-2959, that may help #5049 and #9958 to finally land ...

Comment by Bruno Faccini (Inactive) [ 05/May/14 ]

Hello Gregoire,
Since patch #9958 is planned for 2.4 integration, do you agree that we close/resolve this issue as fixed?

Comment by Gregoire Pichon [ 06/May/14 ]

Hi Bruno,

Yes, this ticket can be closed since our tests have shown the issue is fixed with patches #9929 and #9958.
I hope both patches are planned for integration in 2.4 if a new version is released.

Comment by Peter Jones [ 15/May/14 ]

Yes, this would be under consideration for 2.4.4.
