[LU-4183] MDS crash when running fsx with NFS export Created: 29/Oct/13 Updated: 14/Aug/14 Resolved: 14/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre 2.5.0-RC1, RHEL 6, OpenSFS cluster with one MDS/MGS, two OSSs with two OSTs each, one node with Lustre mounted/NFS server exporting Lustre, three Lustre clients NFS mounting the file system |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 11320 | ||||||||||||
| Description |
|
Running three instances of fsx writing to an NFS mounted Lustre file system with a single NFS server, the MDS crashes and reboots. From the MDS console: Message from syslogd@c03 at Oct 29 13:04:27 ... kernel:LustreError: 16448:0:(osd_internal.h:963:osd_trans_exec_op()) ASSERTION( rb < OSD_OT_MAX ) failed: rb = 12 Message from syslogd@c03 at Oct 29 13:04:27 ... kernel:LustreError: 16448:0:(osd_internal.h:963:osd_trans_exec_op()) LBUG From the crash dump: crash> bt PID: 16448 TASK: ffff880738300040 CPU: 4 COMMAND: "mdt00_005" #0 [ffff8806565c1628] machine_kexec at ffffffff81035d6b #1 [ffff8806565c1688] crash_kexec at ffffffff810c0e22 #2 [ffff8806565c1758] panic at ffffffff8150de5f #3 [ffff8806565c17d8] lbug_with_loc at ffffffffa04f0eeb [libcfs] #4 [ffff8806565c17f8] osd_punch at ffffffffa0d12243 [osd_ldiskfs] #5 [ffff8806565c1848] llog_osd_write_blob at ffffffffa0639050 [obdclass] #6 [ffff8806565c18b8] llog_osd_write_rec at ffffffffa063d04e [obdclass] #7 [ffff8806565c1968] llog_write_rec at ffffffffa0610428 [obdclass] #8 [ffff8806565c19c8] llog_cat_add_rec at ffffffffa0619299 [obdclass] #9 [ffff8806565c1a38] llog_add at ffffffffa0610221 [obdclass] #10 [ffff8806565c1a88] mdd_changelog_store at ffffffffa0f54463 [mdd] #11 [ffff8806565c1af8] mdd_changelog_data_store at ffffffffa0f465c6 [mdd] #12 [ffff8806565c1b48] mdd_attr_set at ffffffffa0f4c9b1 [mdd] #13 [ffff8806565c1bc8] mdt_attr_set at ffffffffa0e1cc28 [mdt] #14 [ffff8806565c1c18] mdt_reint_setattr at ffffffffa0e1d4dd [mdt] #15 [ffff8806565c1c88] mdt_reint_rec at ffffffffa0e16eb1 [mdt] #16 [ffff8806565c1ca8] mdt_reint_internal at ffffffffa0dfec93 [mdt] #17 [ffff8806565c1ce8] mdt_reint at ffffffffa0dfef94 [mdt] #18 [ffff8806565c1d08] mdt_handle_common at ffffffffa0e01a8a [mdt] #19 [ffff8806565c1d58] mds_regular_handle at ffffffffa0e3bc55 [mdt] #20 [ffff8806565c1d68] ptlrpc_server_handle_request at ffffffffa07f2e25 [ptlrpc] #21 [ffff8806565c1e48] ptlrpc_main at ffffffffa07f418d [ptlrpc] #22 [ffff8806565c1ee8] kthread at ffffffff81096a36 #23 [ffff8806565c1f48] kernel_thread at ffffffff8100c0ca On the OSTs, osd-ldiskfs.track_declares_assert=1 From the crash dump, here is the system information: crash> sys
KERNEL: /home/build/kernel/rpmbuild/BUILD/kernel-2.6.32.358.18.1.el6_lustre/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2013-10-29-13:05:16/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Tue Oct 29 13:04:30 2013
UPTIME: 3 days, 22:31:58
LOAD AVERAGE: 0.17, 0.06, 0.02
TASKS: 362
NODENAME: c03
RELEASE: 2.6.32-358.18.1.el6_lustre.x86_64
VERSION: #1 SMP Fri Oct 11 16:41:53 PDT 2013
MACHINE: x86_64 (2399 Mhz)
MEMORY: 32 GB
PANIC: "Kernel panic - not syncing: LBUG"
Two of the three fsx commands were: |
| Comments |
| Comment by James Nunez (Inactive) [ 29/Oct/13 ] |
|
Looking at the log in the crash dump: crash> log ... LustreError: 16448:0:(osd_io.c:1065:osd_ldiskfs_write_record()) sda3: error reading offset 1249280 (block 305): rc = -28 LustreError: 16448:0:(llog_osd.c:156:llog_osd_write_blob()) scratch-MDT0000-osd: error writing log record: rc = -28 LustreError: 16448:0:(osd_io.c:1065:osd_ldiskfs_write_record()) sda3: error reading offset 1249280 (block 305): rc = -28 LustreError: 16448:0:(llog_osd.c:156:llog_osd_write_blob()) scratch-MDT0000-osd: error writing log record: rc = -28 LustreError: 16448:0:(osd_internal.h:963:osd_trans_exec_op()) ASSERTION( rb < OSD_OT_MAX ) failed: rb = 12 LustreError: 16448:0:(osd_internal.h:963:osd_trans_exec_op()) LBUG Pid: 16448, comm: mdt00_005 Call Trace: [<ffffffffa04f0895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa04f0e97>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa0d12243>] osd_punch+0x613/0x620 [osd_ldiskfs] [<ffffffffa0639050>] llog_osd_write_blob+0x1e0/0x850 [obdclass] [<ffffffffa063d04e>] llog_osd_write_rec+0xb5e/0x1370 [obdclass] [<ffffffffa0610428>] llog_write_rec+0xc8/0x290 [obdclass] [<ffffffffa0619299>] llog_cat_add_rec+0x2b9/0x480 [obdclass] [<ffffffffa0610221>] llog_add+0x91/0x1d0 [obdclass] [<ffffffffa0f54463>] mdd_changelog_store+0xf3/0x260 [mdd] [<ffffffffa0f465c6>] mdd_changelog_data_store+0x1a6/0x280 [mdd] [<ffffffffa0f4c9b1>] mdd_attr_set+0xeb1/0x1470 [mdd] [<ffffffffa0e1cc28>] mdt_attr_set+0x268/0x560 [mdt] [<ffffffffa0e1d4dd>] mdt_reint_setattr+0x5bd/0xcf0 [mdt] [<ffffffffa07e2f9e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc] [<ffffffffa0e16eb1>] mdt_reint_rec+0x41/0xe0 [mdt] [<ffffffffa0dfec93>] mdt_reint_internal+0x4c3/0x780 [mdt] [<ffffffffa0dfef94>] mdt_reint+0x44/0xe0 [mdt] [<ffffffffa0e01a8a>] mdt_handle_common+0x52a/0x1470 [mdt] [<ffffffffa0e3bc55>] mds_regular_handle+0x15/0x20 [mdt] [<ffffffffa07f2e25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc] [<ffffffffa050227f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] [<ffffffffa07ea4c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc] [<ffffffffa07f418d>] ptlrpc_main+0xaed/0x1740 [ptlrpc] [<ffffffffa07f36a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc] [<ffffffff81096a36>] kthread+0x96/0xa0 [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffff810969a0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Kernel panic - not syncing: LBUG Pid: 16448, comm: mdt00_005 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1 Call Trace: [<ffffffff8150de58>] ? panic+0xa7/0x16f [<ffffffffa04f0eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs] [<ffffffffa0d12243>] ? osd_punch+0x613/0x620 [osd_ldiskfs] [<ffffffffa0639050>] ? llog_osd_write_blob+0x1e0/0x850 [obdclass] [<ffffffffa063d04e>] ? llog_osd_write_rec+0xb5e/0x1370 [obdclass] [<ffffffffa0610428>] ? llog_write_rec+0xc8/0x290 [obdclass] [<ffffffffa0619299>] ? llog_cat_add_rec+0x2b9/0x480 [obdclass] [<ffffffffa0610221>] ? llog_add+0x91/0x1d0 [obdclass] [<ffffffffa0f54463>] ? mdd_changelog_store+0xf3/0x260 [mdd] [<ffffffffa0f465c6>] ? mdd_changelog_data_store+0x1a6/0x280 [mdd] [<ffffffffa0f4c9b1>] ? mdd_attr_set+0xeb1/0x1470 [mdd] [<ffffffffa0e1cc28>] ? mdt_attr_set+0x268/0x560 [mdt] [<ffffffffa0e1d4dd>] ? mdt_reint_setattr+0x5bd/0xcf0 [mdt] [<ffffffffa07e2f9e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc] [<ffffffffa0e16eb1>] ? mdt_reint_rec+0x41/0xe0 [mdt] [<ffffffffa0dfec93>] ? mdt_reint_internal+0x4c3/0x780 [mdt] [<ffffffffa0dfef94>] ? mdt_reint+0x44/0xe0 [mdt] [<ffffffffa0e01a8a>] ? mdt_handle_common+0x52a/0x1470 [mdt] [<ffffffffa0e3bc55>] ? mds_regular_handle+0x15/0x20 [mdt] [<ffffffffa07f2e25>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc] [<ffffffffa050227f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] [<ffffffffa07ea4c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc] [<ffffffffa07f418d>] ? ptlrpc_main+0xaed/0x1740 [ptlrpc] [<ffffffffa07f36a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc] [<ffffffff81096a36>] ? kthread+0x96/0xa0 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 [<ffffffff810969a0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
| Comment by Andreas Dilger [ 29/Oct/13 ] |
|
It looks like this is a bug in llog_osd_write_blob() calling dt_punch() when a write fails: rc = dt_record_write(env, o, &lgi->lgi_buf, off, th);
if (rc)
CERROR("%s: error writing log tail: rc = %d\n",
o->do_lu.lo_dev->ld_obd->obd_name, rc);
out_unlock:
dt_write_unlock(env, o);
out:
/* cleanup the content written above */
if (rc) {
dt_punch(env, o, lgi->lgi_attr.la_size, OBD_OBJECT_EOF, th,
BYPASS_CAPA);
dt_attr_set(env, o, &lgi->lgi_attr, th, BYPASS_CAPA);
}
There are two separate classes of problems that I can see here:
|
| Comment by Oleg Drokin [ 25/Jan/14 ] |
|
I hit this too today, Also due to ENOSPC, apparently. nfs was not involved (mainline kernel lustre client). <4>[ 5963.714103] Lustre: DEBUG MARKER: == sanity test 103: acl test ========================================================================= 00:14:02 (1390626842) <0>[ 5964.573079] LustreError: 21998:0:(osd_internal.h:959:osd_trans_exec_op()) ASSERTION( rb < OSD_OT_MAX ) failed: rb = 12 <0>[ 5964.573451] LustreError: 21998:0:(osd_internal.h:959:osd_trans_exec_op()) LBUG <4>[ 5964.573755] Pid: 21998, comm: mdt00_003 <4>[ 5964.573910] <4>[ 5964.573910] Call Trace: <4>[ 5964.574184] [<ffffffffa041a8a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] <4>[ 5964.574386] [<ffffffffa041aea7>] lbug_with_loc+0x47/0xb0 [libcfs] <4>[ 5964.574580] [<ffffffffa0978ad3>] osd_punch+0x613/0x620 [osd_ldiskfs] <4>[ 5964.574787] [<ffffffffa0577d80>] llog_osd_write_blob+0x1e0/0x850 [obdclass] <4>[ 5964.574993] [<ffffffffa057bdad>] llog_osd_write_rec+0xadd/0x13a0 [obdclass] <4>[ 5964.575191] [<ffffffffa0365412>] ? do_get_write_access+0x3d2/0x550 [jbd2] <4>[ 5964.575407] [<ffffffffa056a4f8>] llog_write_rec+0xc8/0x290 [obdclass] <4>[ 5964.575602] [<ffffffffa05729e9>] llog_cat_add_rec+0x2b9/0x480 [obdclass] <4>[ 5964.575800] [<ffffffffa056a2f1>] llog_add+0x91/0x1d0 [obdclass] <4>[ 5964.575984] [<ffffffffa0e8c946>] osp_sync_add_rec+0x246/0xa10 [osp] <4>[ 5964.576173] [<ffffffffa0cdb8e8>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs] <4>[ 5964.576482] [<ffffffffa0e8d1bb>] osp_sync_add+0x7b/0x80 [osp] <4>[ 5964.576679] [<ffffffffa0e4fe20>] ? lod_xattr_set+0xc0/0x7e0 [lod] <4>[ 5964.576876] [<ffffffffa0e7d493>] osp_attr_set+0x193/0x1d0 [osp] <4>[ 5964.577057] [<ffffffffa0e4c837>] lod_attr_set+0x247/0x450 [lod] <4>[ 5964.577241] [<ffffffffa0d410d8>] mdd_acl_set+0x668/0x8d0 [mdd] <4>[ 5964.577850] [<ffffffffa0d28372>] mdd_xattr_set+0x192/0xbd0 [mdd] <4>[ 5964.578059] [<ffffffffa07f0304>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc] <4>[ 5964.578372] [<ffffffffa0d88dca>] ? mdt_version_save+0x8a/0x1a0 [mdt] <4>[ 5964.578561] [<ffffffffa0d8ffaf>] mdt_reint_setxattr+0x73f/0x1850 [mdt] <4>[ 5964.578763] [<ffffffffa05dbc20>] ? lu_ucred+0x20/0x30 [obdclass] <4>[ 5964.578947] [<ffffffffa0d6a405>] ? mdt_ucred+0x15/0x20 [mdt] <4>[ 5964.579127] [<ffffffffa0d8457c>] ? mdt_root_squash+0x2c/0x410 [mdt] <4>[ 5964.579317] [<ffffffffa0d883b1>] mdt_reint_rec+0x41/0xe0 [mdt] <4>[ 5964.579499] [<ffffffffa0d6ed33>] mdt_reint_internal+0x4c3/0x780 [mdt] <4>[ 5964.579687] [<ffffffffa0d6f57b>] mdt_reint+0x6b/0x120 [mdt] <4>[ 5964.579883] [<ffffffffa084ebec>] tgt_request_handle+0x23c/0xac0 [ptlrpc] <4>[ 5964.580089] [<ffffffffa07ffb18>] ptlrpc_main+0xcc8/0x1940 [ptlrpc] <4>[ 5964.580288] [<ffffffffa07fee50>] ? ptlrpc_main+0x0/0x1940 [ptlrpc] <4>[ 5964.580476] [<ffffffff81094726>] kthread+0x96/0xa0 <4>[ 5964.580657] [<ffffffff8100c10a>] child_rip+0xa/0x20 <4>[ 5964.580837] [<ffffffff81094690>] ? kthread+0x0/0xa0 <4>[ 5964.581002] [<ffffffff8100c100>] ? child_rip+0x0/0x20 <4>[ 5964.581168] <0>[ 5964.581889] Kernel panic - not syncing: LBUG |
| Comment by Andreas Dilger [ 17/Apr/14 ] |
|
This will likely be fixed with |
| Comment by Andreas Dilger [ 22/Apr/14 ] |
|
The patch is actually being developed under |
| Comment by Jodi Levi (Inactive) [ 03/Jun/14 ] |
|
Reopening to remove fix version as this is a duplicate |