[LU-2842] osd_handler.c:3183:osd_remote_fid()) lustre-MDT0000-osd: Can not lookup fld for [0x1100000002000004:0x2000010:0x3b000000] Created: 20/Feb/13  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6880

 Description   

Hit this crash during racer:

[49497.030405] LustreError: 7303:0:(fld_handler.c:170:fld_server_lookup()) srv-lustre-MDT0000: Can not found the seq 0x1100000002000004
[49497.030968] LustreError: 7303:0:(osd_handler.c:2008:osd_fld_lookup()) lustre-MDT0000-osd can not find [0x1100000002000004:0x2000010:0x3b000000]: rc = -5
[49497.031828] LustreError: 7303:0:(osd_handler.c:3183:osd_remote_fid()) lustre-MDT0000-osd: Can not lookup fld for [0x1100000002000004:0x2000010:0x3b000000]
[49497.352963] LustreError: 25008:0:(mdt_open.c:1639:mdt_reint_open()) LBUG
[49497.353281] Pid: 25008, comm: mdt00_008
[49497.353514] 
[49497.353514] Call Trace:
[49497.353924]  [<ffffffffa0e7e915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[49497.354241]  [<ffffffffa0e7ef17>] lbug_with_loc+0x47/0xb0 [libcfs]
[49497.354525]  [<ffffffffa0b98a59>] mdt_reint_open+0x17f9/0x1ac0 [mdt]
[49497.354877]  [<ffffffffa06a9efc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
[49497.355205]  [<ffffffffa0584ec0>] ? lu_ucred+0x20/0x30 [obdclass]
[49497.355497]  [<ffffffffa0584ec0>] ? lu_ucred+0x20/0x30 [obdclass]
[49497.355800]  [<ffffffffa0b83031>] mdt_reint_rec+0x41/0xe0 [mdt]
[49497.356821]  [<ffffffffa0b7c693>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
[49497.357116]  [<ffffffffa0b7cc4d>] mdt_intent_reint+0x1ed/0x4f0 [mdt]
[49497.357401]  [<ffffffffa0b7833e>] mdt_intent_policy+0x3ae/0x750 [mdt]
[49497.357704]  [<ffffffffa066270a>] ldlm_lock_enqueue+0x2ea/0x870 [ptlrpc]
[49497.358023]  [<ffffffffa0688e87>] ldlm_handle_enqueue0+0x4f7/0x1090 [ptlrpc]
[49497.358342]  [<ffffffffa0b78816>] mdt_enqueue+0x46/0x110 [mdt]
[49497.358617]  [<ffffffffa0b6e068>] mdt_handle_common+0x628/0x1620 [mdt]
[49497.358941]  [<ffffffffa0ba4c15>] mds_regular_handle+0x15/0x20 [mdt]
[49497.359250]  [<ffffffffa06b8e23>] ptlrpc_server_handle_request+0x453/0xe50 [ptlrpc]
[49497.359737]  [<ffffffffa0e7f65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[49497.360071]  [<ffffffffa06b1eb1>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[49497.360358]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[49497.360657]  [<ffffffffa06bb8ed>] ptlrpc_main+0xafd/0x17f0 [ptlrpc]
[49497.361035]  [<ffffffffa01979c0>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
[49497.361348]  [<ffffffffa06badf0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[49497.361625]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[49497.361907]  [<ffffffffa06badf0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[49497.362259]  [<ffffffffa06badf0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[49497.362732]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[49497.363124] 
[49497.418002] Kernel panic - not syncing: LBUG

Crashdump and modules are in /exports/crashdumps/192.168.10.218-2013-02-20-13\:18\:58



 Comments   
Comment by Di Wang [ 21/Feb/13 ]

It seems to me there are two problems here.
1. strange FID
80000000:00020000:2.0:1361384333.629030:0:7303:0:(fld_handler.c:170:fld_server_lookup()) srv-lustre-MDT0000: Can not found the seq 0x1100000002000004
00000004:00020000:2.0:1361384333.629594:0:7303:0:(osd_handler.c:2008:osd_fld_lookup()) lustre-MDT0000-osd can not find [0x1100000002000004:0x2000010:0x3b000000]: rc = -5
00000004:00020000:2.0:1361384333.630454:0:7303:0:(osd_handler.c:3183:osd_remote_fid()) lustre-MDT0000-osd: Can not lookup fld for [0x1100000002000004:0x2000010:0x3b000000]
00000004:00200000:2.0:1361384333.631097:0:7303:0:(mdt_handler.c:411:mdt_pack_attr2body()) [0x200000403:0xd50:0x0]: returning size 2
00000004:00080000:2.0:1361384333.631101:0:7303:0:(mdt_handler.c:2676:mdt_save_lock()) request = ffff88001afc7840 reply state = ffff880031426bf0 transno = 4298345256

According to the debug log, this is probably the reint create path. Interestingly, the OSD sees this strange FID during the create, but the FID is back to normal by the time the attributes are packed (see the decomposition sketch below). I checked the code path but have not found anything wrong yet; I will dig more.

2. The LBUG seems to mean the filesystem got corrupted during racer, i.e. the name entry exists but the object has been destroyed on disk.
This needs further inspection as well.
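
To make the "strange FID" concrete, here is a small standalone C snippet (illustrative only; lu_fid_like simply mirrors the public seq/oid/ver layout of struct lu_fid, it is not Lustre source) that decomposes the FID from the log next to a normal one from the same trace:

/*
 * Illustration only: fld_server_lookup() failed because no MDT owns the
 * sequence 0x1100000002000004, which is why osd_fld_lookup() returned
 * rc = -5 (-EIO); note also that the version field is normally 0.
 */
#include <stdio.h>
#include <stdint.h>

struct lu_fid_like {
	uint64_t f_seq;		/* sequence; owned by exactly one server */
	uint32_t f_oid;		/* object id within the sequence */
	uint32_t f_ver;		/* version; normally 0 */
};

int main(void)
{
	/* the FID reported by osd_fld_lookup()/osd_remote_fid() */
	struct lu_fid_like bad = { 0x1100000002000004ULL, 0x2000010, 0x3b000000 };
	/* a normal FID from the same debug log (mdt_pack_attr2body) */
	struct lu_fid_like ok  = { 0x200000403ULL, 0xd50, 0x0 };

	printf("bad  FID [%#llx:%#x:%#x]\n", (unsigned long long)bad.f_seq,
	       (unsigned)bad.f_oid, (unsigned)bad.f_ver);
	printf("good FID [%#llx:%#x:%#x]\n", (unsigned long long)ok.f_seq,
	       (unsigned)ok.f_oid, (unsigned)ok.f_ver);
	return 0;
}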

Oleg, is it easy to reproduce on your node?

Comment by Brian Behlendorf [ 11/Sep/13 ]

After updating one of our production Lustre server clusters from 2.1 to 2.4, our MDS is repeatedly crashing every few minutes due to this same LBUG. The clients are all still running Lustre 2.1. Can you suggest a workaround for this?

2013-09-11 08:01:12 Sep 11 08:01:12 sumom-mds1 kernel: LustreError: 5258:0:(mdt_open.c:1685:mdt_reint_open()) LBUG
2013-09-11 08:01:12 Call Trace:
2013-09-11 08:01:12  [<ffffffffa05108f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-09-11 08:01:12  [<ffffffffa0510ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-09-11 08:01:12  [<ffffffffa10b43f9>] mdt_reint_open+0x19d9/0x2110 [mdt]
2013-09-11 08:01:12  [<ffffffffa052d82e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
2013-09-11 08:01:12  [<ffffffffa09e716c>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
2013-09-11 08:01:12  [<ffffffffa087e0c0>] ? lu_ucred+0x20/0x30 [obdclass]
2013-09-11 08:01:12  [<ffffffffa109e881>] mdt_reint_rec+0x41/0xe0 [mdt]
2013-09-11 08:01:13  [<ffffffffa1083ae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
2013-09-11 08:01:13  [<ffffffffa108406d>] mdt_intent_reint+0x1ed/0x520 [mdt]
2013-09-11 08:01:13  [<ffffffffa1081f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
2013-09-11 08:01:13  [<ffffffffa099e7e1>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
2013-09-11 08:01:13  [<ffffffffa09c52bf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
2013-09-11 08:01:13  [<ffffffffa10823a6>] mdt_enqueue+0x46/0xe0 [mdt]
2013-09-11 08:01:13  [<ffffffffa1088a98>] mdt_handle_common+0x648/0x1660 [mdt]
2013-09-11 08:01:13  [<ffffffffa10c2145>] mds_regular_handle+0x15/0x20 [mdt]
2013-09-11 08:01:13  [<ffffffffa09f7768>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
2013-09-11 08:01:13  [<ffffffffa051163e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-09-11 08:01:13  [<ffffffffa0522e0f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2013-09-11 08:01:13  [<ffffffffa09eeac9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-09-11 08:01:13  [<ffffffff81055be3>] ? __wake_up+0x53/0x70
2013-09-11 08:01:13  [<ffffffffa09f8afe>] ptlrpc_main+0xace/0x1700 [ptlrpc]
2013-09-11 08:01:13  [<ffffffffa09f8030>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-09-11 08:01:13  [<ffffffff8100c10a>] child_rip+0xa/0x20
2013-09-11 08:01:13  [<ffffffffa09f8030>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-09-11 08:01:13  [<ffffffffa09f8030>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-09-11 08:01:13  [<ffffffff8100c100>] ? child_rip+0x0/0x2
Comment by Oleg Drokin [ 11/Sep/13 ]

Do you get the preceding messages about not being able to find the seq and the like? What is before the LBUG in the dmesg?

Comment by Prakash Surya (Inactive) [ 11/Sep/13 ]

Here's what I see in one of the crash dumps:

SPL: Loaded module v0.6.2-1.2
zunicode: module license 'CDDL' taints kernel.
Disabling lock debugging due to kernel taint
ZFS: Loaded module v0.6.2-1.2, ZFS pool version 5000, ZFS filesystem version 5
Lustre: Lustre: Build Version: 2.4.0-16chaos-16chaos--PRISTINE-2.6.32-358.14.1.2chaos.ch5.1.1.x86_64
LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=off. Opts: 
Lustre: MGS: Not available for connect from 192.168.117.247@o2ib5 (not set up)
Lustre: lsc-MDT0000: Not available for connect from 192.168.112.24@o2ib10 (not set up)
Lustre: 4113:0:(mdt_handler.c:4947:mdt_process_config()) For interoperability, skip this mdd.quota_type. It is obsolete.
LNet: No route to 12345-192.168.120.6@o2ib11 via 172.19.1.100@o2ib100 (all routers down)
LustreError: 11-0: lsc-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: lsc-MDT0000: Will be in recovery for at least 5:00, or until 3830 clients reconnect.
Lustre: MDS lsc-MDT0000 disabled MDS fid capability.
Lustre: MDS lsc-MDT0000 disabled OSS fid capability.
LustreError: 4232:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x24d18001:0x3db440f0:0x0]/islam5->[0x24d98604:0x2a32454:0x0] cr_flags=0104200200001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff880226dd5000 x1443453865661288/t0(463856618502) o101->f45d6fab-2c9c-6b39-0090-4935fbe03e32@192.168.115.87@o2ib10:0/0 lens 568/1176 e 0 to 0 dl 1378921403 ref 1 fl Interpret:/4/0 rc 0/0
LustreError: 4232:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x24d18001:0x3db440f0:0x0]/larson->[0x28300425:0xd1780cb4:0x0] cr_flags=0104200200001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff88022728b000 x1443453867322343/t0(472447519997) o101->f45d6fab-2c9c-6b39-0090-4935fbe03e32@192.168.115.87@o2ib10:0/0 lens 568/1176 e 0 to 0 dl 1378921403 ref 1 fl Interpret:/4/0 rc 0/0
LustreError: 4079:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x28300407:0xd1780cae:0x0]/adduct->[0x285092ac:0x81a52545:0x0] cr_flags=04200000001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff8808254acc00 x1443268507795602/t0(463906685275) o101->bb3cec34-8f59-54c4-a0f2-539b8244c70c@172.19.1.82@o2ib100:0/0 lens 568/1176 e 0 to 0 dl 1378921411 ref 1 fl Interpret:/4/0 rc 0/0
LustreError: 5205:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x24d18001:0x3db440f0:0x0]/brendar->[0x250d8ddb:0x69987943:0x0] cr_flags=0104200200001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff88041caab800 x1443362147837609/t0(463881208837) o101->2dcf901c-5e5a-63be-828f-0314fb0d6174@192.168.112.7@o2ib6:0/0 lens 568/1176 e 0 to 0 dl 1378921418 ref 1 fl Interpret:/4/0 rc 0/0
LustreError: 5205:0:(mdt_open.c:1497:mdt_reint_open()) Skipped 4 previous similar messages
LustreError: 5200:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x24d18001:0x3db440f0:0x0]/richmond->[0x24d1800f:0xc0099bc5:0x0] cr_flags=0100000200001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff880226474000 x1444113199520858/t0(472448344517) o101->5a896d53-cd0e-fcb9-cf2c-1c7184a04634@172.16.65.8@tcp:0/0 lens 512/1112 e 0 to 0 dl 1378921420 ref 1 fl Interpret:/4/0 rc 0/0
LustreError: 5200:0:(mdt_open.c:1497:mdt_reint_open()) Skipped 1 previous similar message
LustreError: 4076:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x24d18001:0x3db440f0:0x0]/czgeorge->[0x24d411fd:0x6c7ac8b4:0x0] cr_flags=0100000200001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff8806279e9000 x1443268514063617/t0(472448533844) o101->2fe10220-1774-604e-b177-43f288d6b1b7@172.19.1.83@o2ib100:0/0 lens 512/1112 e 0 to 0 dl 1378921426 ref 1 fl Interpret:/4/0 rc 0/0
LustreError: 4076:0:(mdt_open.c:1497:mdt_reint_open()) Skipped 4 previous similar messages
LustreError: 4077:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x24d18001:0x3db440f0:0x0]/parish3->[0x28300647:0xd1780d3c:0x0] cr_flags=0104200200001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff88041be9f400 x1443297457705490/t0(463924688500) o101->e1f90f17-8641-6fe9-ede3-4ac44578f37c@192.168.115.229@o2ib5:0/0 lens 568/1176 e 0 to 0 dl 1378921437 ref 1 fl Interpret:/4/0 rc 0/0
LustreError: 4077:0:(mdt_open.c:1497:mdt_reint_open()) Skipped 5 previous similar messages
Lustre: lsc-MDT0000: Denying connection for new client b1b6e953-5b49-5831-6cb7-0fa5fbf2e1df (at 172.16.66.54@tcp), waiting for all 3830 known clients (2944 recovered, 130 in progress, and 0 evicted) to recover in 4:16
LustreError: 5208:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x24d18001:0x3db440f0:0x0]/parish3->[0x28300647:0xd1780d3c:0x0] cr_flags=0104200200001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff880626410400 x1443298032940907/t0(463873525150) o101->36f7a362-df87-73f0-4238-4b3a4ebc100f@192.168.113.81@o2ib5:0/0 lens 568/1176 e 0 to 0 dl 1378921453 ref 1 fl Interpret:/4/0 rc 0/0
LustreError: 5208:0:(mdt_open.c:1497:mdt_reint_open()) Skipped 7 previous similar messages
Lustre: lsc-MDT0000: Denying connection for new client b1b6e953-5b49-5831-6cb7-0fa5fbf2e1df (at 172.16.66.54@tcp), waiting for all 3830 known clients (3577 recovered, 228 in progress, and 0 evicted) to recover in 3:51
Lustre: lsc-MDT0000: Recovery over after 1:14, of 3830 clients 3830 recovered and 0 were evicted.
LustreError: 5351:0:(mdt_open.c:1685:mdt_reint_open()) LBUG
Pid: 5351, comm: mdt01_023

Call Trace:
 [<ffffffffa05088f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0508ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa10ac3f9>] mdt_reint_open+0x19d9/0x2110 [mdt]
 [<ffffffffa052582e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
 [<ffffffffa09df16c>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
 [<ffffffffa08760c0>] ? lu_ucred+0x20/0x30 [obdclass]
 [<ffffffffa1096881>] mdt_reint_rec+0x41/0xe0 [mdt]
 [<ffffffffa107bae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
 [<ffffffffa107c06d>] mdt_intent_reint+0x1ed/0x520 [mdt]
 [<ffffffffa1079f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
 [<ffffffffa09967e1>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
 [<ffffffffa09bd2bf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
 [<ffffffffa107a3a6>] mdt_enqueue+0x46/0xe0 [mdt]
 [<ffffffffa1080a98>] mdt_handle_common+0x648/0x1660 [mdt]
 [<ffffffffa10ba145>] mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa09ef768>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa050963e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa051ae0f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa09e6ac9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81055be3>] ? __wake_up+0x53/0x70
 [<ffffffffa09f0afe>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa09f0030>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c10a>] child_rip+0xa/0x20
 [<ffffffffa09f0030>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa09f0030>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c100>] ? child_rip+0x0/0x20
Comment by Oleg Drokin [ 11/Sep/13 ]

OK, this patch http://review.whamcloud.com/7617 (it will also apply to b2_4) should fix the crash.
The underlying error condition of a name with an invalid FID will remain, but at least we'll know the affected name, and you might even be able to try lfsck to fix it.
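
For context, a rough, hedged sketch of the shape of fix being described; this is not the actual content of change 7617, and the names (info, child, name, child_fid, the out_child label) are placeholders for whatever mdt_reint_open() has in scope:

/* Hedged sketch only -- NOT the actual diff from change 7617.  The idea:
 * when the looked-up child object turns out not to exist on disk even
 * though its name does, report the name and FID and fail the open with
 * an error instead of LBUG()ing, so the server stays up and the bad
 * entry can later be repaired (e.g. with lfsck). */
if (!mdt_object_exists(child)) {
	CERROR("%s: name '%s' refers to missing object "DFID": rc = %d\n",
	       mdt_obd_name(info->mti_mdt), name, PFID(child_fid), -ESTALE);
	GOTO(out_child, result = -ESTALE);	/* previously: LBUG() */
}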

Comment by Di Wang [ 11/Sep/13 ]

Hmm, even if it is an invalid or stale FID, mdt_object_find should return an error instead of returning a non-existent object and hitting the LBUG later, because it is not supposed to create an object in this case. So I would guess something is wrong in osd_fid_lookup; a rough sketch of the missing check is below. Fang Yong, please have a look.
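
A minimal sketch of the kind of validation being suggested, under assumptions: this is not the actual osd_fid_lookup() code, and dev_name is a placeholder for however the OSD device name would be obtained.

/* Hedged sketch, not actual OSD code: reject an obviously bad FID up
 * front in osd_fid_lookup()/mdt_object_find() rather than handing the
 * caller a non-existent object it will later trip an LBUG over.
 * fid_is_sane() is the standard Lustre FID sanity check. */
if (!fid_is_sane(fid)) {
	CERROR("%s: refusing lookup of invalid FID "DFID": rc = %d\n",
	       dev_name, PFID(fid), -EINVAL);	/* dev_name: placeholder */
	return -EINVAL;
}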

Comment by Christopher Morrone [ 11/Sep/13 ]

The patch did indeed stop the crashing. However, we have over a dozen top-level directories that have entries in the Lustre root directory but are inaccessible. So we have addressed the symptom, but not the root of the problem.

To the best of my knowledge, these directories were all there under 2.1, and we only began having trouble with 2.4.

Comment by Di Wang [ 11/Sep/13 ]

Yeah, there are probably other problems. Could you please post the error messages (after the patch) here, so we can know what names/FIDs are involved?

Comment by Christopher Morrone [ 11/Sep/13 ]

I decided to start a new ticket after all for the missing directories problem. See LU-3934.

Comment by Peter Jones [ 12/Sep/13 ]

Bobbie

Could you please keep an eye on this patch? Oleg is traveling next week.

Thanks

Peter

Comment by Bobbie Lind (Inactive) [ 17/Sep/13 ]

It is recommended that this ticket and its patch be closed, since the issue is being taken care of in LU-3934 with patch http://review.whamcloud.com/#/c/7625/

Comment by Peter Jones [ 21/Sep/13 ]

I'm not sure it is appropriate to close the ticket, as the originally reported issue on Oleg's system does not seem likely to share a root cause with the upgrade situation LLNL encountered, which was covered under LU-3934. However, as the patch provided under this ticket is not suitable for landing and the original issue was a one-off occurrence, I think we can drop the priority of this ticket and move it back to the unassigned pool.
