[LU-5366] BUG 6063: lock collide during recovery Created: 18/Jul/14  Updated: 11/Dec/14  Resolved: 11/Dec/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Philip B Curtis Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

RHEL 6.5


Attachments: HTML File lustre_logs_atlas-oss3b5     Text File lustrekernel.txt     File mds_errors_after-recovery.out.rtf    
Severity: 2
Rank (Obsolete): 14965

 Description   

At 16:34 today, one of our mds nodes hit an LBUG that appears to be LU-5294.

Jul 17 16:34:09 atlas-mds3.ccs.ornl.gov kernel: [794253.417021] LustreError: 15235:0:(lu_object.h:867:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
Jul 17 16:34:09 atlas-mds3.ccs.ornl.gov kernel: [794253.430991] LustreError: 15235:0:(lu_object.h:867:lu_object_attr()) LBUG
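
(For reference: the assertion means attributes were requested on an lu_object whose header lacks the LOHA_EXISTS bit, i.e. an object not confirmed to exist on disk. Below is a minimal sketch of what that check amounts to, based only on the assertion text above; it is not the exact Lustre source.)

/* Illustrative sketch only, derived from the assertion text above;
 * not the exact Lustre code. LOHA_EXISTS marks an in-memory object
 * whose backing on-disk object is known to exist. */
static inline __u32 lu_object_attr_sketch(const struct lu_object *o)
{
        if (((o)->lo_header->loh_attr & LOHA_EXISTS) == 0)
                LBUG();        /* the LASSERT/LBUG hit at 16:34 above */
        return o->lo_header->loh_attr;
}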

We performed a crash dump and the MDS rebooted. We entered recovery at 17:54, and at 19:24 the time remaining reached 0 but it was still in Recovering status. We have been getting these messages from the MDS:

[ 8689.325886] LustreError: 19309:0:(ldlm_lockd.c:878:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-atlas2-MDT0000_UUID lock: ffff881d06e30900/0xf35a0587ba982321 lrc: 3/0,0 m0
[ 8689.364338] LustreError: 19309:0:(ldlm_lockd.c:878:ldlm_server_blocking_ast()) Skipped 2 previous similar messages
[ 8739.604058] Lustre: atlas2-MDT0000: Denying connection for new client 7c9cecb7-6c21-5cab-c1b6-5ab153ff6158 (at 8173@gni100), waiting for all 20156 known clients (19103 recovered, 1032 in progress, and 21 6
[ 8739.627327] Lustre: Skipped 18 previous similar messages
[ 8973.664327] Lustre: atlas2-MDT0000: Client 1d186399-bf63-fee6-2b63-8fddd9e7fba3 (at 83@gni2) reconnecting, waiting for 20156 clients in recovery for 0:32
[ 8973.679915] Lustre: Skipped 374 previous similar messages
[ 8973.685115] Lustre: atlas2-MDT0000: Client f27002de-ab6c-80a4-91a7-0d2824607322 (at 80@gni2) refused reconnection, still busy with 1 active RPCs
[ 8973.685117] Lustre: Skipped 374 previous similar messages
[ 9065.583029] Lustre: atlas2-MDT0000: recovery is timed out, evict stale exports
[ 9413.230945] Lustre: atlas2-MDT0000: Denying connection for new client 1cbe0fe6-3a30-b20e-2504-2e3665d3e188 (at 11386@gni100), waiting for all 20156 known clients (19126 recovered, 1008 in progress, and 220
[ 9413.254413] Lustre: Skipped 24 previous similar messages
[ 9441.943231] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 376s: evicting client at 11681@gni100 ns: mdt-atlas2-MDT0000_UUID lock: ffff881d07c7f240/0xf0
[ 9441.985270] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) Skipped 2 previous similar messages
[ 9442.001611] Lustre: atlas2-MDT0000: recovery is timed out, evict stale exports
[ 9442.024632] LustreError: 19309:0:(ldlm_lockd.c:878:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-atlas2-MDT0000_UUID lock: ffff883fc7b21240/0xf35a0587ba988598 lrc: 3/0,0 m0
[ 9442.062793] LustreError: 19309:0:(ldlm_lockd.c:878:ldlm_server_blocking_ast()) Skipped 1 previous similar message
[ 9574.246608] Lustre: atlas2-MDT0000: Client f27002de-ab6c-80a4-91a7-0d2824607322 (at 80@gni2) reconnecting, waiting for 20156 clients in recovery for 3:04
[ 9574.262195] Lustre: Skipped 368 previous similar messages
[ 9574.268355] Lustre: atlas2-MDT0000: Client f27002de-ab6c-80a4-91a7-0d2824607322 (at 80@gni2) refused reconnection, still busy with 1 active RPCs
[ 9574.283062] Lustre: Skipped 368 previous similar messages
[ 9818.311480] Lustre: atlas2-MDT0000: recovery is timed out, evict stale exports



 Comments   
Comment by Philip B Curtis [ 18/Jul/14 ]

Sorry, the first LU-5294 should have been LU-5285.

Comment by Peter Jones [ 18/Jul/14 ]

Bobijam is looking into this issue

Comment by Philip B Curtis [ 18/Jul/14 ]

The mds just reported that it finished recovery at 21:58. We will wait to hear back to make sure it is safe to release to users. We wouldn't want it to hit that LBUG again in 10 minutes.

Comment by Zhenyu Xu [ 18/Jul/14 ]

So your MDS does not have the LU-5285 patch, is that right?

Comment by Philip B Curtis [ 18/Jul/14 ]

That is correct. We do not have the LU-5285 patch.

Comment by Blake Caldwell [ 18/Jul/14 ]

We are on 2.4.3 with patches and are stuck there as our base for the near-term future. Is a backport to 2.4.3 possible?

Comment by Zhenyu Xu [ 18/Jul/14 ]

Yes, I'm doing the backport for b2_4.

Comment by Philip B Curtis [ 18/Jul/14 ]

We have been seeing these errors that started shortly after recovery reported that it had finished.

Comment by Zhenyu Xu [ 18/Jul/14 ]

These error messages indicate that the MDS is doing object pre-create reservation on the OSTs.

Comment by Philip B Curtis [ 18/Jul/14 ]

I should also note that the fs is unresponsive at this time due to the lock timeouts, but it has not LBUG'd yet.

Comment by Zhenyu Xu [ 18/Jul/14 ]

I suspect the MDS's OSP devices haven't successfully connected to the OSTs yet. Can you check the MDS's connection messages for the OSTs?

The LU-5285 backport for b2_4 is http://review.whamcloud.com/11136, and for b2_5 it is http://review.whamcloud.com/11137.

Comment by Philip B Curtis [ 18/Jul/14 ]

I'll include the lustrekernel log of what has been happening since the LBUG initially happened in case that is helpful.

Comment by Blake Caldwell [ 18/Jul/14 ]

It looks like the osps are good. All 1008 are showing as UP! The syslog messages are attached too.
1007 UP osp atlas2-OST03e8-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5
1008 UP osp atlas2-OST03e9-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5
1009 UP osp atlas2-OST03ea-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5
1010 UP osp atlas2-OST03eb-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5
1011 UP osp atlas2-OST03ec-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5
1012 UP osp atlas2-OST03ed-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5
1013 UP osp atlas2-OST03ee-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5
1014 UP osp atlas2-OST03ef-osc-MDT0000 atlas2-MDT0000-mdtlov_UUID 5

Comment by Zhenyu Xu [ 18/Jul/14 ]

In the log, I saw a deadlock stack trace like the one in LU-2173:

Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.381954] INFO: task mdt00_000:15203 blocked for more than 120 seconds.
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.389665] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.398633] mdt00_000     D 0000000000000002     0 15203      2 0x00000000
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.406483]  ffff881e85227560 0000000000000046 ffff880000001000 00000000e028a510
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.415032]  ffff884051f26440 04c88840d1000000 ffff881e6a06a980 10000000000004c8
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.423606]  ffff88205015fab8 ffff881e85227fd8 000000000000fb88 ffff88205015fab8
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.432161] Call Trace:
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.435008]  [<ffffffff81277817>] ? kobject_put+0x27/0x60
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.441161]  [<ffffffff8150eb75>] rwsem_down_failed_common+0x95/0x1d0
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.448474]  [<ffffffff811b5b2e>] ? bh_lru_install+0x16e/0x1a0
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.455112]  [<ffffffff8150ecd3>] rwsem_down_write_failed+0x23/0x30
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.462252]  [<ffffffff81281d13>] call_rwsem_down_write_failed+0x13/0x20
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.469886]  [<ffffffff8150e1d2>] ? down_write+0x32/0x40
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.475981]  [<ffffffffa0e0bca5>] lod_alloc_qos.clone.0+0x175/0x1180 [lod]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.483819]  [<ffffffffa0bc10af>] ? qsd_op_begin+0x5f/0xb40 [lquota]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.491046]  [<ffffffffa0e0e70a>] lod_qos_prep_create+0x74a/0x1b14 [lod]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.498675]  [<ffffffffa0882ad2>] ? fld_server_lookup+0x72/0x3d0 [fld]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.506117]  [<ffffffffa0e090db>] lod_declare_striped_object+0x14b/0x880 [lod]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.514461]  [<ffffffffa0c9cccb>] ? osd_xattr_get+0x21b/0x2d0 [osd_ldiskfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.522374]  [<ffffffffa0e09d21>] lod_declare_object_create+0x511/0x7a0 [lod]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.530511]  [<ffffffffa0b1a8cf>] mdd_declare_object_create_internal+0xbf/0x1f0 [mdd]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.539511]  [<ffffffffa0b29ffe>] mdd_declare_create+0x4e/0x870 [mdd]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.546842]  [<ffffffffa0b287ff>] ? mdd_linkea_prepare+0x23f/0x430 [mdd]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.554487]  [<ffffffffa0b2afe5>] mdd_create+0x7c5/0x1790 [mdd]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.561269]  [<ffffffffa0c9cb47>] ? osd_xattr_get+0x97/0x2d0 [osd_ldiskfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.569122]  [<ffffffffa0d6863e>] mdt_reint_open+0x13ae/0x21c0 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.576366]  [<ffffffffa03d583e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.584811]  [<ffffffffa06e8f6c>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.592927]  [<ffffffffa0d52cd1>] mdt_reint_rec+0x41/0xe0 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.599699]  [<ffffffffa0d37af3>] mdt_reint_internal+0x4c3/0x780 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.607141]  [<ffffffffa0d38080>] mdt_intent_reint+0x1f0/0x530 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.614413]  [<ffffffffa0d35f2e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.621774]  [<ffffffffa06a0841>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.629424]  [<ffffffffa06c72cf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.637467]  [<ffffffffa0d363b6>] mdt_enqueue+0x46/0xe0 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.644012]  [<ffffffffa0d3ab57>] mdt_handle_common+0x647/0x16d0 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.651447]  [<ffffffffa06e9d4c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.659756]  [<ffffffffa0d76a55>] mds_regular_handle+0x15/0x20 [mdt]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.667020]  [<ffffffffa06f9568>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.675847]  [<ffffffffa03b95de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.682999]  [<ffffffffa03cad9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.690743]  [<ffffffffa06f08c9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.698478]  [<ffffffff81063b80>] ? default_wake_function+0x0/0x20
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.705548]  [<ffffffffa06fa8fe>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.712718]  [<ffffffffa06f9e30>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.719882]  [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.725563]  [<ffffffffa06f9e30>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.732711]  [<ffffffffa06f9e30>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.739844]  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Bzzz, can you take a look?

Comment by Blake Caldwell [ 18/Jul/14 ]

We've built Lustre with the 2.4 LU-5285 backport, but are holding off on deploying it pending investigation of the deadlock causing the MDT unresponsiveness. Thanks.

Comment by John Fuchs-Chesney (Inactive) [ 18/Jul/14 ]

Bobijam,
Can you please say again who you want to take a look at this?
I do not know who you mean by 'Bzzz'
Thank you,
~ jfc.

Comment by Zhenyu Xu [ 18/Jul/14 ]

bzzz refers to Alex Zhuravlev

Comment by Philip B Curtis [ 18/Jul/14 ]

Any updates on this line of investigation?

Comment by Zhenyu Xu [ 18/Jul/14 ]

I haven't found the root cause yet.

Comment by Alex Zhuravlev [ 18/Jul/14 ]

Those traces were dumped after the recovery. We aren't allocating new objects via normal precreation during recovery; we trust the clients to provide the striping.

Comment by Alex Zhuravlev [ 18/Jul/14 ]

Can we have a look at the logs from atlas2-OST024c, please?
Those traces might be a sign of a very slow precreation process when a single file needs many stripes. That could explain a missing process with the semaphore held.
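
To make that suspicion concrete, here is a generic sketch of the contention pattern, written with plain kernel rwsem primitives; the lock and helper names below are illustrative assumptions, not the actual lod code.

/* Generic sketch of the suspected contention; names are illustrative.
 * One thread takes the QoS allocation semaphore and then spends a long
 * time waiting on per-OST precreates for a widely striped file; every
 * other create (such as the mdt00_000 thread above, stuck in
 * rwsem_down_write_failed via lod_alloc_qos) piles up behind it. */
#include <linux/rwsem.h>

static DECLARE_RWSEM(qos_rw_sem);                /* illustrative lock */

static void wait_for_ost_precreate(int ost_idx); /* hypothetical slow RPC wait */

static void slow_widely_striped_create(int stripe_count)
{
        int i;

        down_write(&qos_rw_sem);       /* held across the whole allocation */
        for (i = 0; i < stripe_count; i++)
                wait_for_ost_precreate(i);
        up_write(&qos_rw_sem);
}

static void other_create(void)
{
        down_write(&qos_rw_sem);       /* blocks here -> hung-task warning */
        /* ... allocate stripes ... */
        up_write(&qos_rw_sem);
}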

Comment by Blake Caldwell [ 18/Jul/14 ]

Uploaded logs from the OSS serving atlas2-OST024c. They cover from before the LBUG until the present, 7-18 14:22 EST.

Comment by Jason Hill (Inactive) [ 23/Jul/14 ]

With the upload last week, has there been any progress towards a root cause?


-Jason

Comment by Oleg Drokin [ 24/Jul/14 ]

So the only way I can think of for the "bug 6063" message to print would be if some code somewhere ignored the "replay" flag in the RPC and attempted to return a new lock to the client as part of a replay RPC.

Interestingly, I do not see any asserts for this in the code so if this is what really happened, we'd never know unless it manifests itself in this way.
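
If such a check were added, it might look roughly like the following; this is a sketch under the assumption above, with illustrative function and variable names, not the actual ldlm code.

/* Sketch of the kind of guard Oleg describes as missing; names are
 * illustrative. MSG_REPLAY is the flag carried by RPCs that replay
 * state during recovery. */
static void check_replay_enqueue_sketch(struct ptlrpc_request *req,
                                        bool lock_is_new)
{
        if (lustre_msg_get_flags(req->rq_reqmsg) & MSG_REPLAY) {
                /* A replay RPC should re-attach to the lock the client
                 * already holds; handing back a brand-new lock here is
                 * what later surfaces as "BUG 6063: lock collide during
                 * recovery" when the blocking AST finds two locks on one
                 * resource. */
                LASSERT(!lock_is_new);
        }
}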

Comment by James Nunez (Inactive) [ 11/Dec/14 ]

Please reopen this ticket if you experience this issue again.
