[LU-2856] Kernel panic - not syncing: LBUG - ASSERTION( lh->mlh_pdo_hash != 0 ) Created: 24/Feb/13 Updated: 11/Jul/14 Resolved: 11/Jul/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.4 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Julien Paret | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
RHEL6 Update 3 |
||
| Attachments: |
|
| Epic/Theme: | Stability |
| Severity: | 3 |
| Rank (Obsolete): | 6918 |
| Description |
|
In a fresh install of RHEL 6 Update 3 (x86_64), on 1 physical server (1 MDT / 8 OST) with a local mount (over eth0) and 1 NFS export, I have a kernel panic:
crash> log
LustreError: 4085:0:(mdt_reint.c:916:mdt_pdir_hash_lock()) ASSERTION( lh->mlh_pdo_hash != 0 ) failed:
LustreError: 4085:0:(mdt_reint.c:916:mdt_pdir_hash_lock()) LBUG
Pid: 4085, comm: mdt_04
Call Trace:
[<ffffffffa03347f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0334e07>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0c65c6b>] mdt_reint_rename+0x1acb/0x1d70 [mdt]
[<ffffffffa05a5520>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
[<ffffffffa05a68c0>] ? ldlm_completion_ast+0x0/0x720 [ptlrpc]
[<ffffffffa034b18e>] ? upcall_cache_get_entry+0x28e/0x944 [libcfs]
[<ffffffffa0c5c4ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
[<ffffffffa0c60c51>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0c57ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0c582b4>] mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0c4c772>] mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0c4d665>] mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa05ddc5e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa05dd010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa05dd010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa05dd010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Kernel panic - not syncing: LBUG
Pid: 4085, comm: mdt_04 Not tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff814fdcba>] ? panic+0xa0/0x168
[<ffffffffa0334e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[<ffffffffa0c65c6b>] ? mdt_reint_rename+0x1acb/0x1d70 [mdt]
[<ffffffffa05a5520>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
[<ffffffffa05a68c0>] ? ldlm_completion_ast+0x0/0x720 [ptlrpc]
[<ffffffffa034b18e>] ? upcall_cache_get_entry+0x28e/0x944 [libcfs]
[<ffffffffa0c5c4ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
[<ffffffffa0c60c51>] ? mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0c57ed4>] ? mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0c582b4>] ? mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0c4c772>] ? mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0c4d665>] ? mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa05ddc5e>] ? ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa05dd010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffffa05dd010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa05dd010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
crash> bt -l
PID: 4085 TASK: ffff8806283c0040 CPU: 6 COMMAND: "mdt_04"
#0 [ffff8805ea1d59b8] machine_kexec at ffffffff8103284b
/usr/src/debug/kernel-2.6.32-279.14.1.el6/linux-2.6.32-279.14.1.el6_lustre.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
#1 [ffff8805ea1d5a18] crash_kexec at ffffffff810ba982
/usr/src/debug/kernel-2.6.32-279.14.1.el6/linux-2.6.32-279.14.1.el6_lustre.x86_64/kernel/kexec.c: 1106
#2 [ffff8805ea1d5ae8] panic at ffffffff814fdcc1
/usr/src/debug/kernel-2.6.32-279.14.1.el6/linux-2.6.32-279.14.1.el6_lustre.x86_64/kernel/panic.c: 103
#3 [ffff8805ea1d5b68] lbug_with_loc at ffffffffa0334e5b [libcfs]
#4 [ffff8805ea1d5b88] mdt_reint_rename at ffffffffa0c65c6b [mdt]
#5 [ffff8805ea1d5cd8] mdt_reint_rec at ffffffffa0c60c51 [mdt]
#6 [ffff8805ea1d5cf8] mdt_reint_internal at ffffffffa0c57ed4 [mdt]
#7 [ffff8805ea1d5d48] mdt_reint at ffffffffa0c582b4 [mdt]
#8 [ffff8805ea1d5d68] mdt_handle_common at ffffffffa0c4c772 [mdt]
#9 [ffff8805ea1d5db8] mdt_regular_handle at ffffffffa0c4d665 [mdt]
#10 [ffff8805ea1d5dc8] ptlrpc_main at ffffffffa05ddc5e [ptlrpc]
#11 [ffff8805ea1d5f48] kernel_thread at ffffffff8100c14a
/usr/src/debug///////////////////////////////////////////////////////////////////////////////////////////kernel-2.6.32-279.14.1.el6/linux-2.6.32-279.14.1.el6_lustre.x86_64/arch/x86/kernel/entry_64.S: 1213
|
| Comments |
| Comment by Julien Paret [ 25/Feb/13 ] |
|
Good morning. Once again, the same problem: I have deleted my NFS export for this server. In this case I have only one Lustre client connected over Ethernet / TCP. This client exports the Lustre volume over NFS. There is no problem on the Lustre client, but the server failed. Here is some information:
KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/vmlinux
DUMPFILE: vmcore.flat [PARTIAL DUMP]
CPUS: 8
DATE: Mon Feb 25 01:30:43 2013
UPTIME: 02:45:04
LOAD AVERAGE: 1.57, 1.36, 1.13
TASKS: 725
NODENAME: X.X.X.X
RELEASE: 2.6.32-279.14.1.el6_lustre.x86_64
VERSION: #1 SMP Fri Dec 14 23:22:17 PST 2012
MACHINE: x86_64 (1795 Mhz)
MEMORY: 48 GB
PANIC: "Kernel panic - not syncing: LBUG"
PID: 4277
COMMAND: "mdt_04"
TASK: ffff880629a98ae0 [THREAD_INFO: ffff8805f9b02000]
CPU: 6
STATE: TASK_RUNNING (PANIC)
crash> bt -l
PID: 4277 TASK: ffff880629a98ae0 CPU: 6 COMMAND: "mdt_04"
#0 [ffff8805f9b039b8] machine_kexec at ffffffff8103284b
/usr/src/debug/kernel-2.6.32-279.14.1.el6/linux-2.6.32-279.14.1.el6_lustre.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
#1 [ffff8805f9b03a18] crash_kexec at ffffffff810ba982
/usr/src/debug/kernel-2.6.32-279.14.1.el6/linux-2.6.32-279.14.1.el6_lustre.x86_64/kernel/kexec.c: 1106
#2 [ffff8805f9b03ae8] panic at ffffffff814fdcc1
/usr/src/debug/kernel-2.6.32-279.14.1.el6/linux-2.6.32-279.14.1.el6_lustre.x86_64/kernel/panic.c: 103
#3 [ffff8805f9b03b68] lbug_with_loc at ffffffffa032be5b [libcfs]
#4 [ffff8805f9b03b88] mdt_reint_rename at ffffffffa0c54c6b [mdt]
#5 [ffff8805f9b03cd8] mdt_reint_rec at ffffffffa0c4fc51 [mdt]
#6 [ffff8805f9b03cf8] mdt_reint_internal at ffffffffa0c46ed4 [mdt]
#7 [ffff8805f9b03d48] mdt_reint at ffffffffa0c472b4 [mdt]
#8 [ffff8805f9b03d68] mdt_handle_common at ffffffffa0c3b772 [mdt]
#9 [ffff8805f9b03db8] mdt_regular_handle at ffffffffa0c3c665 [mdt]
#10 [ffff8805f9b03dc8] ptlrpc_main at ffffffffa05d4c5e [ptlrpc]
#11 [ffff8805f9b03f48] kernel_thread at ffffffff8100c14a
/usr/src/debug///////////////////////////////////////////////////////////////////////////////////////////kernel-2.6.32-279.14.1.el6/linux-2.6.32-279.14.1.el6_lustre.x86_64/arch/x86/kernel/entry_64.S: 1213
crash>
LustreError: 4277:0:(mdt_reint.c:916:mdt_pdir_hash_lock()) ASSERTION( lh->mlh_pdo_hash != 0 ) failed:
LustreError: 4277:0:(mdt_reint.c:916:mdt_pdir_hash_lock()) LBUG
Pid: 4277, comm: mdt_04
Call Trace:
[<ffffffffa032b7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa032be07>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0c54c6b>] mdt_reint_rename+0x1acb/0x1d70 [mdt]
[<ffffffffa059c520>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
[<ffffffffa059d8c0>] ? ldlm_completion_ast+0x0/0x720 [ptlrpc]
[<ffffffffa034218e>] ? upcall_cache_get_entry+0x28e/0x944 [libcfs]
[<ffffffffa0c4b4ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
[<ffffffffa0c4fc51>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0c46ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0c472b4>] mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0c3b772>] mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0c3c665>] mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa05d4c5e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa05d4010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa05d4010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa05d4010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Kernel panic - not syncing: LBUG
Pid: 4277, comm: mdt_04 Not tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff814fdcba>] ? panic+0xa0/0x168
[<ffffffffa032be5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[<ffffffffa0c54c6b>] ? mdt_reint_rename+0x1acb/0x1d70 [mdt]
[<ffffffffa059c520>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
[<ffffffffa059d8c0>] ? ldlm_completion_ast+0x0/0x720 [ptlrpc]
[<ffffffffa034218e>] ? upcall_cache_get_entry+0x28e/0x944 [libcfs]
[<ffffffffa0c4b4ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
[<ffffffffa0c4fc51>] ? mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0c46ed4>] ? mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0c472b4>] ? mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0c3b772>] ? mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0c3c665>] ? mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa05d4c5e>] ? ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa05d4010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffffa05d4010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa05d4010>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
crash>
crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 12347405 47.1 GB ----
FREE 66403 259.4 MB 0% of TOTAL MEM
USED 12281002 46.8 GB 99% of TOTAL MEM
SHARED 3829842 14.6 GB 31% of TOTAL MEM
BUFFERS 3918599 14.9 GB 31% of TOTAL MEM
CACHED 3458749 13.2 GB 28% of TOTAL MEM
SLAB 4572891 17.4 GB 37% of TOTAL MEM
TOTAL SWAP 13107198 50 GB ----
SWAP USED 0 0 0% of TOTAL SWAP
SWAP FREE 13107198 50 GB 100% of TOTAL SWAP
How can I help you? Do you need more information? |
| Comment by Julien Paret [ 28/Feb/13 ] |
|
Hello. In order to simplify the configuration, I now have one MDT and 8 OSTs on my server and no longer use any NFS export. I use a Lustre client and rsync to write files to my volume. Again, two kernel panics over the last few days; the backtrace is the same. Any idea on this problem? |
| Comment by Julien Paret [ 05/Apr/13 ] |
|
Hello. In order to follow best practices, I'm currently using 3 servers and 1 Lustre client. I'm on RHEL6U3 and use the latest maintenance release, 2.1.5. I have 80 TB of space. After 1.5 or 2 days of filling the solution with data, I get this error on the MDS: ASSERTION( lh->mlh_pdo_hash != 0 ), followed by a kernel panic. I have the crash file from the kernel panic; here is some information:
crash> log
...
LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts:
Lustre: Lustre: Build Version: RC1--PRISTINE-2.6.32-279.19.1.el6_lustre.x86_64
Lustre: Added LNI 10.67.39.5@tcp [8/256/0/180]
Lustre: Accept secure, port 988
Lustre: Lustre OSC module (ffffffffa09c1060).
Lustre: Lustre LOV module (ffffffffa0a55280).
Lustre: Lustre client module (ffffffffa0b41940).
LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts:
LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts:
Lustre: MGS MGS started
Lustre: 5797:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 4e5913b4-ae4a-057f-76ac-bc04028bf184@0@lo t0 exp (null) cur 1364923454 last 0
Lustre: MGC10.67.39.5@tcp: Reactivating import
Lustre: Enabling ACL
Lustre: data1-MDT0000: new disk, initializing
Lustre: 5797:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from b31dcf84-09d3-06bb-f076-1ea0d725ee50@10.67.39.4@tcp t0 exp (null) cur 1364923585 last 0
Lustre: MDS mdd_obd-data1-MDT0000: data1-OST0000_UUID now active, resetting orphans
Lustre: MDS mdd_obd-data1-MDT0000: data1-OST0001_UUID now active, resetting orphans
Lustre: MDS mdd_obd-data1-MDT0000: data1-OST0002_UUID now active, resetting orphans
Lustre: MDS mdd_obd-data1-MDT0000: data1-OST0003_UUID now active, resetting orphans
Lustre: MDS mdd_obd-data1-MDT0000: data1-OST0004_UUID now active, resetting orphans
Lustre: MDS mdd_obd-data1-MDT0000: data1-OST0005_UUID now active, resetting orphans
Lustre: MDS mdd_obd-data1-MDT0000: data1-OST0006_UUID now active, resetting orphans
Lustre: MDS mdd_obd-data1-MDT0000: data1-OST0007_UUID now active, resetting orphans
Lustre: MDS mdd_obd-data1-MDT0000: data1-OST0008_UUID now active, resetting orphans
Lustre: 5797:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from ed95164a-a7ef-6d37-ffdd-4409023d7fd5@10.67.39.6@tcp t0 exp (null) cur 1364923987 last 0
Lustre: ctl-data1-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:0
LustreError: 5815:0:(mdt_reint.c:916:mdt_pdir_hash_lock()) ASSERTION( lh->mlh_pdo_hash != 0 ) failed:
LustreError: 5815:0:(mdt_reint.c:916:mdt_pdir_hash_lock()) LBUG
Pid: 5815, comm: mdt_01
Call Trace:
[<ffffffffa043b785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa043bd97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0cfac6b>] mdt_reint_rename+0x1acb/0x1d70 [mdt]
[<ffffffffa06b8520>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
[<ffffffffa06b98c0>] ? ldlm_completion_ast+0x0/0x720 [ptlrpc]
[<ffffffffa045211e>] ? upcall_cache_get_entry+0x28e/0x944 [libcfs]
[<ffffffffa0cf14ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
[<ffffffffa0cf5c51>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0ceced4>] mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0ced2b4>] mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0ce1772>] mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0ce2665>] mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa06f0bae>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa06eff60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa06eff60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa06eff60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Kernel panic - not syncing: LBUG
Pid: 5815, comm: mdt_01 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff814e9811>] ? panic+0xa0/0x168
[<ffffffffa043bdeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[<ffffffffa0cfac6b>] ? mdt_reint_rename+0x1acb/0x1d70 [mdt]
[<ffffffffa06b8520>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
[<ffffffffa06b98c0>] ? ldlm_completion_ast+0x0/0x720 [ptlrpc]
[<ffffffffa045211e>] ? upcall_cache_get_entry+0x28e/0x944 [libcfs]
[<ffffffffa0cf14ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
[<ffffffffa0cf5c51>] ? mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0ceced4>] ? mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0ced2b4>] ? mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0ce1772>] ? mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0ce2665>] ? mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa06f0bae>] ? ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa06eff60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0ca>] ? child_rip+0xa/0x20
[<ffffffffa06eff60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa06eff60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
crash> bt -l
PID: 5815 TASK: ffff8805e64baae0 CPU: 3 COMMAND: "mdt_01"
#0 [ffff8805e64c19b8] machine_kexec at ffffffff81031f7b
/usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
#1 [ffff8805e64c1a18] crash_kexec at ffffffff810b8c22
/usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre.x86_64/kernel/kexec.c: 1106
#2 [ffff8805e64c1ae8] panic at ffffffff814e9818
/usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre.x86_64/kernel/panic.c: 103
#3 [ffff8805e64c1b68] lbug_with_loc at ffffffffa043bdeb [libcfs]
#4 [ffff8805e64c1b88] mdt_reint_rename at ffffffffa0cfac6b [mdt]
#5 [ffff8805e64c1cd8] mdt_reint_rec at ffffffffa0cf5c51 [mdt]
#6 [ffff8805e64c1cf8] mdt_reint_internal at ffffffffa0ceced4 [mdt]
#7 [ffff8805e64c1d48] mdt_reint at ffffffffa0ced2b4 [mdt]
#8 [ffff8805e64c1d68] mdt_handle_common at ffffffffa0ce1772 [mdt]
#9 [ffff8805e64c1db8] mdt_regular_handle at ffffffffa0ce2665 [mdt]
#10 [ffff8805e64c1dc8] ptlrpc_main at ffffffffa06f0bae [ptlrpc]
#11 [ffff8805e64c1f48] kernel_thread at ffffffff8100c0ca
/usr/src/debug////////////////////////////////////////////////////////////////////////////////////////////////kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre.x86_64/arch/x86/kernel/entry_64.S: 121
crash> sys
KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux
DUMPFILE: vmcore.flat [PARTIAL DUMP]
CPUS: 8
DATE: Fri Apr 5 05:36:47 2013
UPTIME: 2 days, 10:16:31
LOAD AVERAGE: 1.62, 1.18, 0.68
TASKS: 319
NODENAME: X.X.X.X
RELEASE: 2.6.32-279.19.1.el6_lustre.x86_64
VERSION: #1 SMP Wed Mar 20 16:37:18 PDT 2013
MACHINE: x86_64 (2266 Mhz)
MEMORY: 24 GB
PANIC: "Kernel panic - not syncing: LBUG"
|
| Comment by Daniel Kobras (Inactive) [ 24/Apr/13 ] |
|
I've had a look into this: the problem is caused by a bug in the PDO MDT code that incorrectly assumes full_name_hash() returns only non-zero values. The reporter's filesystem, however, contains a filename that happens to hash to 0x00000000 with full_name_hash(). When rsync'ing this file, rsync first creates and updates a temporary file, which is then rename()d to the problematic filename. The rename() step always hits the LASSERT() mentioned in the original report. (Unfortunately, I cannot disclose the troublesome filename because it is considered confidential.)

lustre/mdt/mdt_handler.c::mdt_lock_pdo_init() uses the (assumed) special value mlh_pdo_hash = 0 to indicate an invalid filename (NULL or ""). This is used in lustre/mdt/mdt_handler.c::mdt_object_lock0() to decide whether a lock has to be taken on a specific inode. Also, lustre/ldlm/ldlm_resource.c::ldlm_res_hop_fid_hash() special-cases a zero PDO hash value and uses a different algorithm for its own hash generation. This means that, while bogus, the LASSERT() in lustre/mdt/mdt_reint.c::mdt_pdir_hash_lock() is there for a reason and cannot simply be dropped without first modifying the other functions mentioned above.

What does seem to work fine, though, is to modify the hash generation so that it never returns zero for any filename. While full_name_hash() is used in multiple places in the code, those uses are all disjoint as far as I can tell, and only the MDT/PDO implementation makes the non-zero assumption about the hash value. Hence, modifying just the hash generation in lustre/mdt/mdt_handler.c::mdt_lock_pdo_init() should be a safe and sufficient workaround, and it has fixed the issue on the reporter's filesystem. I'm attaching a patch for Lustre 2.1, but all later versions also appear to suffer from this bug. |
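For illustration only, and not the patch actually under review below: a minimal, self-contained sketch of the workaround idea described in the comment above, namely wrapping the name hash so that it can never produce the reserved value 0. The toy_name_hash() stand-in and the demo program are hypothetical; the real code uses the kernel's full_name_hash() inside mdt_lock_pdo_init().

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-in for the kernel's full_name_hash(); like the real
     * function, it can legitimately return 0 for some inputs. */
    static unsigned int toy_name_hash(const char *name, size_t len)
    {
            unsigned int hash = 0;
            while (len--)
                    hash = hash * 31 + (unsigned char)*name++;
            return hash;
    }

    /* Workaround idea: the PDO code reserves hash value 0 to mean "no
     * per-name lock" (see mdt_object_lock0() and ldlm_res_hop_fid_hash()),
     * so remap a genuine zero hash onto an arbitrary non-zero bucket before
     * storing it in mlh_pdo_hash. */
    static unsigned int pdo_name_hash(const char *name, size_t len)
    {
            unsigned int hash = toy_name_hash(name, len);
            return hash != 0 ? hash : 1;
    }

    int main(void)
    {
            const char *names[] = { "file.tmp", ".rsync-tmp.abc123", "target" };
            for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
                    printf("%-20s raw=%#010x pdo=%#010x\n", names[i],
                           toy_name_hash(names[i], strlen(names[i])),
                           pdo_name_hash(names[i], strlen(names[i])));
            return 0;
    }

With such a remapping, a filename whose hash really is 0x00000000 simply shares a hash bucket with value 1; the only cost is a slightly coarser PDO lock for that one name, while the LASSERT() in mdt_pdir_hash_lock() can no longer fire.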
| Comment by Daniel Kobras (Inactive) [ 24/Apr/13 ] |
|
Possible workaround tested on b2_1 |
| Comment by Daniel Kobras (Inactive) [ 25/Apr/13 ] |
|
Change for b2_1 is at http://review.whamcloud.com/#change,6162 |
| Comment by Daniel Kobras (Inactive) [ 25/Apr/13 ] |
|
Change for master is at http://review.whamcloud.com/#change,6166 |
| Comment by Keith Mannthey (Inactive) [ 30/Apr/13 ] |
|
The master change has landed. The b2_1 patch is still not committed. |