[LU-5366] BUG 6063: lock collide during recovery Created: 18/Jul/14 Updated: 11/Dec/14 Resolved: 11/Dec/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Philip B Curtis | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 6.5 |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 14965 |
| Description |
|
At 16:34 today, one of our mds nodes hit an LBUG that appears to be Jul 17 16:34:09 atlas-mds3.ccs.ornl.gov kernel: [794253.417021] LustreError: 15235:0:(lu_object.h:867:lu_object_attr()) ASSERTION( ((o) We performed a crash dump and the mds rebooted. We entered recovery at 17:54 and at 19:24 the time remaining reached 0 but it was still in Recovering status. We have been getting these messages from the mds. [ 8689.325886] LustreError: 19309:0:(ldlm_lockd.c:878:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-atlas2-MDT0000_UUID lock: ffff881d06e30900/0xf35a0587ba982321 lrc: 3/0,0 m0 |
| Comments |
| Comment by Philip B Curtis [ 18/Jul/14 ] |
| Comment by Peter Jones [ 18/Jul/14 ] |
|
Bobijam is looking into this issue |
| Comment by Philip B Curtis [ 18/Jul/14 ] |
|
The mds just reported that it finished recovery at 21:58. We will wait to hear back to make sure it is safe to release to users. We wouldn't want it to hit that LBUG again in 10 minutes. |
| Comment by Zhenyu Xu [ 18/Jul/14 ] |
|
So your MDS does not contain |
| Comment by Philip B Curtis [ 18/Jul/14 ] |
|
That is correct. We do not have the |
| Comment by Blake Caldwell [ 18/Jul/14 ] |
|
We are on 2.4.3 with patches and are stuck there for a base in the near term future. Is a backport to 2.4.3 possible? |
| Comment by Zhenyu Xu [ 18/Jul/14 ] |
|
yes, I'm doing the back port for b2_4 |
| Comment by Philip B Curtis [ 18/Jul/14 ] |
|
We have been seeing these errors that started shortly after recovery reported that it had finished. |
| Comment by Zhenyu Xu [ 18/Jul/14 ] |
|
this error messages reporting that the MDS is doing the object pre-create reservation on OSTs |
| Comment by Philip B Curtis [ 18/Jul/14 ] |
|
I should also note that the fs is unresponsive at this time due to the lock timeouts, but it has not LBUG'd yet. |
| Comment by Zhenyu Xu [ 18/Jul/14 ] |
|
I suspect it's MDS's osp hasn't successfully connected to OSTs yet. Can you check the MDS connection to OSTs messages?
|
| Comment by Philip B Curtis [ 18/Jul/14 ] |
|
I'll include the lustrekernel log of what has been happening since the LBUG initially happened in case that is helpful. |
| Comment by Blake Caldwell [ 18/Jul/14 ] |
|
It looks like osps are good All 1008 are showing as UP! The syslog messages are attached too. |
| Comment by Zhenyu Xu [ 18/Jul/14 ] |
|
in the log, I saw a deadlock stack trace like Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.381954] INFO: task mdt00_000:15203 blocked for more than 120 seconds. Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.389665] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.398633] mdt00_000 D 0000000000000002 0 15203 2 0x00000000 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.406483] ffff881e85227560 0000000000000046 ffff880000001000 00000000e028a510 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.415032] ffff884051f26440 04c88840d1000000 ffff881e6a06a980 10000000000004c8 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.423606] ffff88205015fab8 ffff881e85227fd8 000000000000fb88 ffff88205015fab8 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.432161] Call Trace: Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.435008] [<ffffffff81277817>] ? kobject_put+0x27/0x60 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.441161] [<ffffffff8150eb75>] rwsem_down_failed_common+0x95/0x1d0 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.448474] [<ffffffff811b5b2e>] ? bh_lru_install+0x16e/0x1a0 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.455112] [<ffffffff8150ecd3>] rwsem_down_write_failed+0x23/0x30 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.462252] [<ffffffff81281d13>] call_rwsem_down_write_failed+0x13/0x20 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.469886] [<ffffffff8150e1d2>] ? down_write+0x32/0x40 Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.475981] [<ffffffffa0e0bca5>] lod_alloc_qos.clone.0+0x175/0x1180 [lod] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.483819] [<ffffffffa0bc10af>] ? qsd_op_begin+0x5f/0xb40 [lquota] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.491046] [<ffffffffa0e0e70a>] lod_qos_prep_create+0x74a/0x1b14 [lod] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.498675] [<ffffffffa0882ad2>] ? fld_server_lookup+0x72/0x3d0 [fld] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.506117] [<ffffffffa0e090db>] lod_declare_striped_object+0x14b/0x880 [lod] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.514461] [<ffffffffa0c9cccb>] ? osd_xattr_get+0x21b/0x2d0 [osd_ldiskfs] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.522374] [<ffffffffa0e09d21>] lod_declare_object_create+0x511/0x7a0 [lod] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.530511] [<ffffffffa0b1a8cf>] mdd_declare_object_create_internal+0xbf/0x1f0 [mdd] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.539511] [<ffffffffa0b29ffe>] mdd_declare_create+0x4e/0x870 [mdd] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.546842] [<ffffffffa0b287ff>] ? mdd_linkea_prepare+0x23f/0x430 [mdd] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.554487] [<ffffffffa0b2afe5>] mdd_create+0x7c5/0x1790 [mdd] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.561269] [<ffffffffa0c9cb47>] ? osd_xattr_get+0x97/0x2d0 [osd_ldiskfs] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.569122] [<ffffffffa0d6863e>] mdt_reint_open+0x13ae/0x21c0 [mdt] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.576366] [<ffffffffa03d583e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.584811] [<ffffffffa06e8f6c>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.592927] [<ffffffffa0d52cd1>] mdt_reint_rec+0x41/0xe0 [mdt] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.599699] [<ffffffffa0d37af3>] mdt_reint_internal+0x4c3/0x780 [mdt] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.607141] [<ffffffffa0d38080>] mdt_intent_reint+0x1f0/0x530 [mdt] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.614413] [<ffffffffa0d35f2e>] mdt_intent_policy+0x39e/0x720 [mdt] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.621774] [<ffffffffa06a0841>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.629424] [<ffffffffa06c72cf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.637467] [<ffffffffa0d363b6>] mdt_enqueue+0x46/0xe0 [mdt] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.644012] [<ffffffffa0d3ab57>] mdt_handle_common+0x647/0x16d0 [mdt] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.651447] [<ffffffffa06e9d4c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.659756] [<ffffffffa0d76a55>] mds_regular_handle+0x15/0x20 [mdt] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.667020] [<ffffffffa06f9568>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.675847] [<ffffffffa03b95de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.682999] [<ffffffffa03cad9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.690743] [<ffffffffa06f08c9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jul 17 22:01:56 atlas-mds3.ccs.ornl.gov kernel: [15498.698478] [<ffffffff81063b80>] ? default_wake_function+0x0/0x20 Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.705548] [<ffffffffa06fa8fe>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.712718] [<ffffffffa06f9e30>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.719882] [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.725563] [<ffffffffa06f9e30>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.732711] [<ffffffffa06f9e30>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jul 17 22:01:57 atlas-mds3.ccs.ornl.gov kernel: [15498.739844] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Bzzz, can you take a look? |
| Comment by Blake Caldwell [ 18/Jul/14 ] |
|
We've built a lustre build with 2.4 |
| Comment by John Fuchs-Chesney (Inactive) [ 18/Jul/14 ] |
|
Bobijam, |
| Comment by Zhenyu Xu [ 18/Jul/14 ] |
|
bzzz refers to Alex Zhuravlev |
| Comment by Philip B Curtis [ 18/Jul/14 ] |
|
Any updates on this line of investigation? |
| Comment by Zhenyu Xu [ 18/Jul/14 ] |
|
I haven't found the root cause yet. |
| Comment by Alex Zhuravlev [ 18/Jul/14 ] |
|
those traces were dumped after the recovery. we aren't allocating new objects via normal precreation during recovery, we trust the clients to provide striping. |
| Comment by Alex Zhuravlev [ 18/Jul/14 ] |
|
can we have a look at the logs from atlas2-OST024c, please? |
| Comment by Blake Caldwell [ 18/Jul/14 ] |
|
Uploaded logs from oss with atlas2-OST024c. Start time is before the LBUG until present 7-18 14:22 EST. |
| Comment by Jason Hill (Inactive) [ 23/Jul/14 ] |
|
With the upload last week has there been any progress towards a root cause? – |
| Comment by Oleg Drokin [ 24/Jul/14 ] |
|
So the only way I can think of for the "bug 6063" message to print would be if some of the code somewhere ignored the "replay" flag in the RPC and attempted to return a new lock to the client as part of replay RPC. Interestingly, I do not see any asserts for this in the code so if this is what really happened, we'd never know unless it manifests itself in this way. |
| Comment by James Nunez (Inactive) [ 11/Dec/14 ] |
|
Please reopen this ticket if you experience this issue again. |