[LU-1291] Test failure on test suite replay-single 44c Created: 07/Apr/12 Updated: 21/May/13 Resolved: 21/May/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.4, Lustre 2.1.5 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.6 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4604 |
| Description |
|
This issue was created by maloo for Oleg Drokin <green@whamcloud.com>. It relates to the following test suite run: https://maloo.whamcloud.com/test_sets/667d0d32-80d8-11e1-997d-525400d2bfa6.

The MDT crashed:

07:51:48:Lustre: DEBUG MARKER: == replay-single test 44c: race in target handle connect ============================================= 04:51:46 (1333799506) |
| Comments |
| Comment by Oleg Drokin [ 07/Apr/12 ] |
|
Alex advises that this code is in the process of being deprecated, and planned changes already scheduled for landing will remove this particular piece of code anyway. |
| Comment by Andreas Dilger [ 11/Apr/12 ] |
|
Unfortunately, while this code is in the process of being deleted, there is no way that change will land in time for the 2.3 release, so a short-term solution is needed. This test is failing today (quite frequently: about 30% of full test runs and 11% of all runs) and is preventing other patches from landing. The first failure was on April 1, but this also coincides with the move to Indiana, so it isn't clear whether the problem was introduced by a commit or by the environment change. |
| Comment by Peter Jones [ 15/Apr/12 ] |
|
YangSheng, could you please look into this one? Thanks, Peter |
| Comment by Andreas Dilger [ 19/Apr/12 ] |
|
https://maloo.whamcloud.com/test_sessions/58a0aa04-884c-11e1-b9ba-525400d2bfa6 |
| Comment by Zhenyu Xu [ 23/Apr/12 ] |
|
https://maloo.whamcloud.com/test_sets/e1f9521e-8b18-11e1-ac4e-525400d2bfa6 |
| Comment by Andreas Dilger [ 07/May/12 ] |
|
https://maloo.whamcloud.com/test_sets/baf56548-9826-11e1-9351-525400d2bfa6 |
| Comment by Andreas Dilger [ 07/May/12 ] |
|
Currently showing about 11% of all runs failing with this bug. |
| Comment by Yang Sheng [ 14/May/12 ] |
|
Patch uploaded to: http://review.whamcloud.com/2708 |
| Comment by Bob Glossman (Inactive) [ 18/May/12 ] |
|
https://maloo.whamcloud.com/test_sets/8acef8ca-a0be-11e1-b20c-52540035b04c |
| Comment by Xuezhao Liu [ 20/May/12 ] |
|
https://maloo.whamcloud.com/test_sessions/b4d8a1e4-a261-11e1-bba6-52540035b04c |
| Comment by nasf (Inactive) [ 20/May/12 ] |
|
another failure instance: https://maloo.whamcloud.com/sub_tests/3e66a266-a1b4-11e1-abdc-52540035b04c |
| Comment by Hiroya Nozaki [ 21/May/12 ] |
|
I think the patch of |
| Comment by Jinshan Xiong (Inactive) [ 24/May/12 ] |
|
another hit: https://maloo.whamcloud.com/test_sets/ca49e520-a59e-11e1-8432-52540035b04c |
| Comment by Peter Jones [ 31/May/12 ] |
|
Landed for 2.3 |
| Comment by Ian Colle (Inactive) [ 22/Jun/12 ] |
|
Hit it again: https://maloo.whamcloud.com/test_sets/6eb7d33e-bafc-11e1-b98a-52540035b04c |
| Comment by Oleg Drokin [ 25/Jun/12 ] |
|
Ian, this hit you mention is not the real one. We discussed it with Liang and it seems he based his |
| Comment by Yang Sheng [ 20/Mar/13 ] |
|
Hit on 2.1.4 again:

20:32:14:Code: 48 8b 42 10 48 83 b8 c0 01 00 00 00 0f 85 a6 00 00 00 f6 05
20:32:14:RIP [<ffffffff8b706d38>] :mds:__mds_lov_synchronize+0x2718/0x2950
20:32:14: RSP <ffff810301a7de10>
20:32:14: <0>Kernel panic - not syncing: Fatal exception
20:32:14: <1>Unable to handle kernel NULL pointer dereference at 0000000000000006 RIP:
20:32:14: [<0000000000000006>]
20:32:14:PGD 0
20:32:14:Oops: 0010 [2] SMP
20:32:14:last sysfs file: /block/sda/queue/max_sectors_kb
20:32:14:CPU 2
20:32:14:Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) jbd2 crc16 nfs nfsd exportfs nfs_acl auth_rpcgss autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_tcp bnx2i cnic uio cxgb3i libcxgbi iw_cxgb3 cxgb3 libiscsi_tcp ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds_rdma rds_tcp rds ib_sdp ib_ipoib ipoib_helper rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 xfrm_nalgo crypto_api ib_sa dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_ib ib_mad ib_core mlx4_en joydev sg shpchp mlx4_core igb i2c_i801 8021q i2c_core dca tpm_tis tpm tpm_bios i7core_edac serio_raw edac_mc pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
20:32:14:Pid: 14126, comm: ll_sync_00 Tainted: G -------------------- 2.6.18-348.3.1.el5_lustre.gec09f95 #1
20:32:16:RIP: 0010:[<0000000000000006>] [<0000000000000006>]
20:32:16:RSP: 0018:ffff81010b78bf90 EFLAGS: 00010006
20:32:16:RAX: ffff810301a7dfd8 RBX: 00000000000000ff RCX: 0000000000000000
20:32:16:RDX: 000000000000372e RSI: 00000000000000ff RDI: 00000000000000ff
20:32:16:RBP: ffff810301a7db40 R08: 0000000000000004 R09: 000000000000003c
20:32:16:R10: ffff810301a7db18 R11: 000000000000027f R12: 0000000000000006
20:32:16:R13: 00000000ffffff03 R14: ffffffff802b81dc R15: ffff81033f896628
20:32:17:FS: 00002ab605e036e0(0000) GS:ffff81010b70b140(0000) knlGS:0000000000000000
20:32:17:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
20:32:17:CR2: 0000000000000006 CR3: 0000000000201000 CR4: 00000000000006a0
20:32:17:Process ll_sync_00 (pid: 14126, threadinfo ffff810301a7c000, task ffff8102fc8c1080)
20:32:17:Stack: ffffffff80023692 ffffffff802b7f1f 0000000000000000 0000000000000296
20:32:17: ffffffff8005dc32 ffff810301a7db40 <EOI> ffff8102fc8c1080 000000000000027f
20:32:17: ffff810301a7db18 000000000000003c 0000000000000004 00000000000000ff
20:32:17:Call Trace:
20:32:17: <IRQ> [<ffffffff80023692>] smp_call_function_interrupt+0x57/0x75
20:32:17: [<ffffffff8005dc32>] call_function_interrupt+0x66/0x6c
20:32:17: <EOI> [<ffffffff80077a02>] smp_send_stop+0x9e/0xa4
20:32:17: [<ffffffff800779d0>] smp_send_stop+0x6c/0xa4
20:32:17: [<ffffffff80093f0c>] panic+0xbd/0x200
20:32:17: [<ffffffff8003ddf8>] lock_timer_base+0x1b/0x3c
20:32:17: [<ffffffff80064b30>] _spin_unlock_irqrestore+0x8/0x9
20:32:17: [<ffffffff80064fda>] oops_end+0x51/0x53
20:32:17: [<ffffffff8006c73d>] die+0x3a/0x44
20:32:17: [<ffffffff8006557f>] do_general_protection+0xfe/0x107
20:32:17: [<ffffffff8005ddf9>] error_exit+0x0/0x84
20:32:17: [<ffffffff8b706d38>] :mds:__mds_lov_synchronize+0x2718/0x2950
20:32:17: [<ffffffff800957a6>] exit_fs+0x25/0xbb
20:32:17: [<ffffffff8b707dc0>] :mds:mds_lov_synchronize+0x0/0x80
20:32:17: [<ffffffff8b707e28>] :mds:mds_lov_synchronize+0x68/0x80
20:32:17: [<ffffffff8006e18d>] syscall_trace_leave+0x2f/0xd1
20:32:18: [<ffffffff8005dfc1>] child_rip+0xa/0x11
20:32:18: [<ffffffff8b707dc0>] :mds:mds_lov_synchronize+0x0/0x80
20:32:18: [<ffffffff8b707dc0>] :mds:mds_lov_synchronize+0x0/0x80
20:32:18: [<ffffffff8005dfb7>] child_rip+0x0/0x11
20:32:18:
20:32:18:
20:32:18:Code: Bad RIP value.
20:32:18:RIP [<0000000000000006>]
20:32:18: RSP <ffff81010b78bf90>
20:32:18:CR2: 0000000000000006

https://maloo.whamcloud.com/test_logs/966d490e-9139-11e2-8eff-52540035b04c

I'll port the patch to b2_1. |
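Porting a landed fix to a maintenance branch like b2_1 is usually done with `git cherry-pick`. Below is a minimal local sketch of that workflow; the repository layout, the `porter` identity, and the file name `core.c` are illustrative assumptions, and in practice the commit would be fetched from the Gerrit change at review.whamcloud.com/2708 rather than created locally:

```shell
# Simulate a development branch carrying a fix, then port it to b2_1.
# (Stand-in for fetching the real change from Gerrit and cherry-picking it.)
set -e
workdir=$(mktemp -d) && cd "$workdir"
git init -q
git config user.email porter@example.com   # commit identity for the demo
git config user.name porter
echo base > core.c && git add core.c && git commit -qm "shared base"
echo fixed > core.c && git commit -qam "LU-1291 fix on the dev branch"
fix_commit=$(git rev-parse HEAD)            # commit to be ported
git checkout -qb b2_1 HEAD~1                # maintenance branch lacks the fix
git cherry-pick "$fix_commit"               # apply the same change to b2_1
cat core.c                                  # prints "fixed"
```

With a real Gerrit change one would instead `git fetch` the change ref (of the form `refs/changes/NN/2708/<patch-set>`) and cherry-pick `FETCH_HEAD`, resolving any conflicts by hand before pushing the ported patch for separate review on the b2_1 branch.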
| Comment by Yang Sheng [ 21/Mar/13 ] |
|
b2_1 patch: http://review.whamcloud.com/5789 |
| Comment by Jian Yu [ 22/Mar/13 ] |
|
Lustre Client: 1.8.9-wc1, Lustre Server: v2_1_5_RC1. conf-sanity test 58 failed to unmount the MDS and hit the following issue:

02:19:25:Lustre: DEBUG MARKER: umount -d -f /mnt/mds
02:19:25:LustreError: 9855:0:(ldlm_lib.c:1987:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
02:19:25:Lustre: 9531:0:(ldlm_lib.c:1669:target_recovery_overseer()) recovery is aborted, evict exports in recovery
02:19:25:Lustre: 9531:0:(ldlm_lib.c:1669:target_recovery_overseer()) Skipped 1 previous similar message
02:19:26:Lustre: lustre-MDT0000: sending delayed replies to recovered clients
02:19:26:LustreError: 9531:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x8000d:0x7515da77: rc -116
02:19:27:LustreError: 9531:0:(llog_obd.c:220:llog_setup_named()) obd lustre-OST0000-osc-MDT0000 ctxt 2 lop_setup=ffffffffa05dfca0 failed -116
02:19:27:LustreError: 9531:0:(osc_request.c:4231:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
02:19:27:LustreError: 9531:0:(osc_request.c:4248:__osc_llog_init()) osc 'lustre-OST0000-osc-MDT0000' tgt 'mdd_obd-lustre-MDT0000' catid ffff880039ef3af0 rc=-116
02:19:27:LustreError: 9531:0:(osc_request.c:4250:__osc_llog_init()) logid 0x8000d:0x7515da77
02:19:27:LustreError: 9531:0:(osc_request.c:4278:osc_llog_init()) rc: -116
02:19:27:LustreError: 9531:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 0 osc 'lustre-OST0000-osc-MDT0000' tgt 'mdd_obd-lustre-MDT0000' (rc=-116)
02:19:27:LustreError: 9856:0:(lov_log.c:160:lov_llog_origin_connect()) error osc_llog_connect tgt 0 (-107)
02:19:27:LustreError: 9856:0:(mds_lov.c:832:__mds_lov_synchronize()) lustre-OST0000_UUID failed at llog_origin_connect: -107
02:19:27:LustreError: 9856:0:(mds_lov.c:861:__mds_lov_synchronize()) sync lustre-OST0000_UUID failed -107
02:19:27:LustreError: 9856:0:(mds_lov.c:865:__mds_lov_synchronize()) deactivating lustre-OST0000_UUID
02:19:27:general protection fault: 0000 [#1] SMP
02:19:27:last sysfs file: /sys/devices/system/cpu/possible
02:19:27:CPU 0
02:19:27:Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) jbd2 nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
02:19:27:
02:19:27:Pid: 9856, comm: ll_sync_00 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Red Hat KVM
02:19:27:RIP: 0010:[<ffffffffa0992504>] [<ffffffffa0992504>] lov_notify+0x94/0x16b0 [lov]
02:19:27:RSP: 0018:ffff880054dc1db0 EFLAGS: 00010286
02:19:27:RAX: 5a5a5a5a5a5a5a5a RBX: ffff88005a1b24f8 RCX: 0000000000000004
02:19:27:RDX: ffff88007c3c40b8 RSI: ffff880037a56040 RDI: ffffffffa09d8e50
02:19:27:RBP: ffff880054dc1e20 R08: ffff880054dc0000 R09: 00000000ffffffff
02:19:28:R10: 000000005a5a5a5a R11: 0000000000000000 R12: ffff88007c3c40b8
02:19:28:R13: 0000000000000003 R14: 0000000000000000 R15: ffff88007c3c40b8
02:19:28:FS: 00007f3c3a2ed700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
02:19:28:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
02:19:28:CR2: 00007fb455ec7000 CR3: 000000007a314000 CR4: 00000000000006f0
02:19:28:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
02:19:28:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
02:19:28:Process ll_sync_00 (pid: 9856, threadinfo ffff880054dc0000, task ffff880066d8aaa0)
02:19:28:Stack:
02:19:28: ffffffff81c01a80 0000000000000000 0100000000000286 ffff88005a1b2c68
02:19:28:<d> ffff88007c3c40b8 000000037c3e6d40 ffff8800537c84c8 000000007c3c40b8
02:19:28:<d> ffff880054dc1e50 ffff88005a1b24f8 ffff88007c3c40b8 0000000000000003
02:19:28:Call Trace:
02:19:28: [<ffffffffa0a28b25>] obd_notify.clone.0+0xa5/0x370 [mds]
02:19:28: [<ffffffffa0a2da15>] __mds_lov_synchronize+0xd85/0x1780 [mds]
02:19:28: [<ffffffff811a5f13>] ? copy_fs_struct+0x83/0x90
02:19:28: [<ffffffffa0a2ed80>] ? mds_lov_synchronize+0x0/0x102 [mds]
02:19:28: [<ffffffffa0a2eddf>] mds_lov_synchronize+0x5f/0x102 [mds]
02:19:28: [<ffffffff8100c0ca>] child_rip+0xa/0x20
02:19:28: [<ffffffffa0a2ed80>] ? mds_lov_synchronize+0x0/0x102 [mds]
02:19:28: [<ffffffffa0a2ed80>] ? mds_lov_synchronize+0x0/0x102 [mds]
02:19:28: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/699d23e0-92a6-11e2-b7d5-52540035b04c |
| Comment by Peter Jones [ 20/May/13 ] |
|
The b2_1 version of this patch has now landed for 2.1.6. What work is still outstanding? |
| Comment by Yang Sheng [ 21/May/13 ] |
|
It can be closed for now. |