[LU-1080] mds-survey crash Created: 08/Feb/12 Updated: 21/Feb/12 Resolved: 21/Feb/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0 |
| Fix Version/s: | Lustre 2.2.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Richard Henwood (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Master |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 6468 | ||||||||
| Description |
|
Running on a real machine:
$ mkfs.lustre --fsname=survey --mdt --index=0 /dev/sda3
$ mount -t lustre /dev/sda3 /mnt
$ thrhi=64 file_count=200000 sh mds-survey
then crash:
Build Version: jenkins-arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel-4610-g614
Lustre: Added LNI 10.45.1.8@tcp [8/256/0/180]
Lustre: Accept all, port 988
LDISKFS-fs (sda3): recovery complete
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. Opts:
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. Opts:
Lustre: MGC10.45.1.8@tcp: Reactivating import
Lustre: survey-MDT0000: used disk, loading
Lustre: Echo OBD driver; http://www.lustre.org/
LustreError: 1821:0:(echo_client.c:1810:echo_md_destroy_internal()) Can not unlink child tests: rc = -39
LustreError: 1823:0:(echo_client.c:1810:echo_md_destroy_internal()) Can not unlink child tests1: rc = -39
LustreError: 1831:0:(osd_handler.c:2294:osd_object_ref_del()) ASSERTION((oh)->ot_declare_ref_del > 0) failed
LustreError: 1831:0:(osd_handler.c:2294:osd_object_ref_del()) LBUG
Pid: 1831, comm: lctl
Call Trace:
[<ffffffffa038e855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa038ee95>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa0399d96>] libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa0a1781a>] osd_object_ref_del+0x14a/0x180 [osd_ldiskfs]
[<ffffffffa096ecbb>] __mdd_ref_del+0x5b/0xa0 [mdd]
[<ffffffffa09777a2>] mdd_create+0x1ae2/0x2470 [mdd]
[<ffffffffa051190d>] ? htable_lookup+0xed/0x190 [obdclass]
[<ffffffffa041b5a9>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
[<ffffffff81275894>] ? vsnprintf+0x484/0x5f0
[<ffffffffa0a6822b>] echo_md_create_internal+0xab/0x4b0 [obdecho]
[<ffffffff81275a40>] ? sprintf+0x40/0x50
[<ffffffffa0a6ff40>] echo_md_handler+0x1380/0x1dd0 [obdecho]
[<ffffffffa040d87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa0a75ae6>] echo_client_iocontrol+0x1c86/0x2a30 [obdecho]
[<ffffffff81127e77>] ? ____pagevec_lru_add+0x167/0x180
[<ffffffffa040da13>] ? cfs_alloc+0x63/0x90 [libcfs]
[<ffffffffa04c0f52>] ? obd_ioctl_getdata+0x172/0x1060 [obdclass]
[<ffffffffa04d6264>] class_handle_ioctl+0x14d4/0x2340 [obdclass]
[<ffffffff8120d5df>] ? security_inode_permission+0x1f/0x30
[<ffffffffa04c0313>] obd_class_ioctl+0x53/0x240 [obdclass]
[<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81189342>] vfs_ioctl+0x22/0xa0
[<ffffffff811894c9>] ? do_vfs_ioctl+0x69/0x580
[<ffffffff811894e4>] do_vfs_ioctl+0x84/0x580
[<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81189a61>] sys_ioctl+0x81/0xa0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Kernel panic - not syncing: LBUG
Pid: 7014, comm: lctl Not tainted 2.6.32-220.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff814ec701>] ? panic+0x78/0x143
[<ffffffffa040ceeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
[<ffffffffa0417d96>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa0a1781a>] ? osd_object_ref_del+0x14a/0x180 [osd_ldiskfs]
[<ffffffffa096ecbb>] ? __mdd_ref_del+0x5b/0xa0 [mdd]
[<ffffffffa09777a2>] ? mdd_create+0x1ae2/0x2470 [mdd]
[<ffffffffa051190d>] ? htable_lookup+0xed/0x190 [obdclass]
[<ffffffffa041b5a9>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
[<ffffffff81275894>] ? vsnprintf+0x484/0x5f0
[<ffffffffa0a6822b>] ? echo_md_create_internal+0xab/0x4b0 [obdecho]
[<ffffffff81275a40>] ? sprintf+0x40/0x50
[<ffffffffa0a6ff40>] ? echo_md_handler+0x1380/0x1dd0 [obdecho]
[<ffffffffa040d87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa0a75ae6>] ? echo_client_iocontrol+0x1c86/0x2a30 [obdecho]
[<ffffffff81127e77>] ? ____pagevec_lru_add+0x167/0x180
[<ffffffffa040da13>] ? cfs_alloc+0x63/0x90 [libcfs]
[<ffffffffa04c0f52>] ? obd_ioctl_getdata+0x172/0x1060 [obdclass]
[<ffffffffa04d6264>] ? class_handle_ioctl+0x14d4/0x2340 [obdclass]
[<ffffffff8120d5df>] ? security_inode_permission+0x1f/0x30
[<ffffffffa04c0313>] ? obd_class_ioctl+0x53/0x240 [obdclass]
[<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81189342>] ? vfs_ioctl+0x22/0xa0
[<ffffffff811894c9>] ? do_vfs_ioctl+0x69/0x580
[<ffffffff811894e4>] ? do_vfs_ioctl+0x84/0x580
[<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81189a61>] ? sys_ioctl+0x81/0xa0
[<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Richard Henwood (Inactive) [ 08/Feb/12 ] |
# rpm -qa | grep lustre
lustre-ldiskfs-3.3.0-2.6.32_220.el6_lustre.x86_64_g61f62a1.x86_64
kernel-2.6.32-131.17.1.el6_lustre.g60f4e35.x86_64
kernel-2.6.32-220.el6_lustre.x86_64
lustre-modules-2.1.55-2.6.32_220.el6_lustre.x86_64_g61f62a1.x86_64
lustre-tests-2.1.55-2.6.32_220.el6_lustre.x86_64_g61f62a1.x86_64
kernel-firmware-2.6.32-220.el6_lustre.x86_64
lustre-2.1.55-2.6.32_220.el6_lustre.x86_64_g61f62a1.x86_64
mds-survey and libecho are from: |
| Comment by Di Wang [ 09/Feb/12 ] |
|
This seems to be related to the recent OSD API change rather than to an mds-survey bug. Richard, it looks like mdd_create failed somewhere. Could you find the debug log for the LBUG? |
| Comment by Di Wang [ 09/Feb/12 ] |
|
Assign this to Alex. |
| Comment by Andreas Dilger [ 13/Feb/12 ] |
|
Alex, do you have any ideas on how this might be fixed? |
| Comment by Alex Zhuravlev [ 13/Feb/12 ] |
|
sorry, still thinking how to solve this easily ... |
| Comment by Niu Yawei (Inactive) [ 16/Feb/12 ] |
|
Since we can't declare undo operations, I've removed this LASSERT in |
| Comment by Alex Zhuravlev [ 17/Feb/12 ] |
|
Well, we can declare undo ops. That could be the easiest solution, but it results in more credits. BTW, any idea why [DTO_INDEX_DELETE] = 16? ldiskfs never shrinks a directory, nor does it update neighbor blocks during entry removal. |
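The trade-off described in this comment can be illustrated with a toy model of a declare-then-execute transaction handle. This is only a sketch under assumptions, not Lustre code: the `TxHandle` class, the `create` function, and the one-credit cost per operation are hypothetical, and the check in `execute` merely mimics the spirit of `ASSERTION((oh)->ot_declare_ref_del > 0)` from the crash log. It assumes the failing `ref_del` in the trace is an undo step inside a failed create.

```python
class TxHandle:
    """Toy transaction handle: every operation must be declared before it runs."""

    def __init__(self):
        self.declared = {}  # operation name -> outstanding declarations
        self.credits = 0    # journal credits reserved by the declarations

    def declare(self, op, credits):
        # Reserve credits up front, before the transaction starts.
        self.declared[op] = self.declared.get(op, 0) + 1
        self.credits += credits

    def execute(self, op):
        # Hypothetical analogue of ASSERTION((oh)->ot_declare_ref_del > 0).
        if self.declared.get(op, 0) <= 0:
            raise AssertionError(op + " executed without a declaration")
        self.declared[op] -= 1


def create(th, fail_midway, declare_undo):
    """Hypothetical create path: take a reference, undo it if a later step fails."""
    th.declare("ref_add", 1)      # assumed cost: 1 credit
    if declare_undo:
        th.declare("ref_del", 1)  # extra credit reserved only for a possible undo
    th.execute("ref_add")
    if fail_midway:
        th.execute("ref_del")     # undo path; trips the check if it was not declared
        return "rolled back"
    return "created"


if __name__ == "__main__":
    th = TxHandle()
    print(create(th, fail_midway=True, declare_undo=True), "credits:", th.credits)
    try:
        create(TxHandle(), fail_midway=True, declare_undo=False)
    except AssertionError as err:
        print("assertion:", err)
```

In this model, declaring the undo operation keeps the check intact at the price of reserving credits that a successful create never uses, which is the "more credits" cost mentioned above.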
| Comment by Niu Yawei (Inactive) [ 17/Feb/12 ] |
I have no idea why it was 16. You are the ext3/4 expert; I believe you are right. |
| Comment by Alex Zhuravlev [ 17/Feb/12 ] |
|
I'm beginning to think that if deletion costs 1 credit, then 1..2 more credits are needed to set nlink. |
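One possible reading of this estimate, written out as arithmetic. The constant and function names are hypothetical; only the numbers 16, 1, and 1..2 come from the thread, and whether they combine this way in the real credit tables is an assumption.

```python
# Numbers quoted in the thread; the names and the way they are combined are assumptions.
INDEX_DELETE_CREDITS_QUOTED = 16  # "[DTO_INDEX_DELETE] = 16"
ENTRY_REMOVAL_CREDITS = 1         # "deletion costs 1 credit"
NLINK_UPDATE_CREDITS = (1, 2)     # "1..2 more credits to set nlink"


def undo_credit_range(entry_removal=ENTRY_REMOVAL_CREDITS,
                      nlink_update=NLINK_UPDATE_CREDITS):
    """Return (low, high) credits for removing an entry and then updating nlink."""
    low, high = nlink_update
    return entry_removal + low, entry_removal + high


if __name__ == "__main__":
    print("with 1-credit deletion:", undo_credit_range())
    print("with the quoted 16:", undo_credit_range(INDEX_DELETE_CREDITS_QUOTED))
```

Under this reading, declaring the undo would add 2 to 3 credits per create if entry removal really costs 1 credit, against 17 to 18 if the quoted value of 16 is kept.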
| Comment by Peter Jones [ 19/Feb/12 ] |
|
Andreas/Johann, could one of you please comment on this 2.2 blocker? Thanks, Peter |
| Comment by Richard Henwood (Inactive) [ 20/Feb/12 ] |
|
FYI: I have been running a more recent Lustre and I have not been able to reproduce this issue.
# rpm -qa | grep lustre
lustre-ldiskfs-3.3.0-2.6.32_220.el6_lustre.gfd1c51d.x86_64_g0204171.x86_64
kernel-2.6.32-220.el6_lustre.gfd1c51d.x86_64
lustre-modules-2.1.55-2.6.32_220.el6_lustre.gfd1c51d.x86_64_g0204171.x86_64
kernel-firmware-2.6.32-220.el6_lustre.gfd1c51d.x86_64
lustre-2.1.55-2.6.32_220.el6_lustre.gfd1c51d.x86_64_g0204171.x86_64
These rpms are from build 480: |
| Comment by Andreas Dilger [ 21/Feb/12 ] |
|
|