[LU-1291] Test failure on test suite replay-single 44c Created: 07/Apr/12  Updated: 21/May/13  Resolved: 21/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.4, Lustre 2.1.5
Fix Version/s: Lustre 2.3.0, Lustre 2.1.6

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-1391 lov device is released while some osc... Resolved
Severity: 3
Rank (Obsolete): 4604

 Description   

This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/667d0d32-80d8-11e1-997d-525400d2bfa6.

MDT crashed:

07:51:48:Lustre: DEBUG MARKER: == replay-single test 44c: race in target handle connect ============================================= 04:51:46 (1333799506)
07:51:49:Turning device dm-0 (0xfd00000) read-only
07:51:50:Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
07:52:19:Lustre: lustre-OST0003-osc-MDT0000: Connection to service lustre-OST0003 via nid 172.29.3.36@tcp was lost; in progress operations using this service will wait for recovery to complete.
07:52:20:LustreError: 8320:0:(osc_create.c:614:osc_create()) lustre-OST0003-osc-MDT0000: oscc recovery failed: -11
07:52:20:LustreError: 8320:0:(lov_obd.c:1068:lov_clear_orphans()) error in orphan recovery on OST idx 3/7: rc = -11
07:52:20:LustreError: 8320:0:(mds_lov.c:884:__mds_lov_synchronize()) lustre-OST0003_UUID failed at mds_lov_clear_orphans: -11
07:52:20:LustreError: 8320:0:(mds_lov.c:905:__mds_lov_synchronize()) lustre-OST0003_UUID sync failed -11, deactivating
07:52:20:general protection fault: 0000 [#1] SMP
07:52:20:last sysfs file: /sys/devices/system/cpu/possible
07:52:20:CPU 0
07:52:21:Modules linked in: nfs fscache cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: obdecho]
07:52:21:
07:52:21:Pid: 8320, comm: ll_sync_03 Not tainted 2.6.32-220.4.2.el6_lustre.gf957caa.x86_64 #1 Red Hat KVM
07:52:21:RIP: 0010:[<ffffffffa08da03a>] [<ffffffffa08da03a>] lov_notify+0x1ba/0x1040 [lov]
07:52:21:RSP: 0018:ffff88007c625dc0 EFLAGS: 00010246
07:52:21:RAX: 5a5a5a5a5a5a5a5a RBX: ffff88007bdf42b8 RCX: 0000000000000000
07:52:21:RDX: 0000000000000000 RSI: ffff88005ad54ac0 RDI: ffff88007bdf4868
07:52:21:RBP: ffff88007c625e20 R08: 000000005a5a5a5a R09: 0000000000000000
07:52:21:R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
07:52:21:R13: 0000000000000003 R14: 0000000000000000 R15: ffff8800716228f8
07:52:21:FS: 00007f0f8a81e700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
07:52:21:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
07:52:21:CR2: 00007ff18a036000 CR3: 0000000071be4000 CR4: 00000000000006f0
07:52:21:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
07:52:21:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
07:52:22:Process ll_sync_03 (pid: 8320, threadinfo ffff88007c624000, task ffff88007c3fe100)
07:52:22:Stack:
07:52:22: 0000000000000286 01000000fffffff5 ffff8800716223b8 000000036a958c00
07:52:22:<0> ffff8800580664d8 00000000716223b8 ffff88007c625e50 ffff88007bdf42b8
07:52:23:<0> ffff8800716223b8 0000000000000003 ffff8800580664d8 ffff8800716223b8
07:52:23:Call Trace:
07:52:23: [<ffffffffa07ef93c>] obd_notify.clone.0+0xac/0x290 [mds]
07:52:23: [<ffffffffa07f2f74>] __mds_lov_synchronize+0x2c4/0x12e0 [mds]
07:52:23: [<ffffffff811a6433>] ? copy_fs_struct+0x83/0x90
07:52:23: [<ffffffffa07f47a0>] ? mds_lov_synchronize+0x0/0xb0 [mds]
07:52:23: [<ffffffffa07f47ff>] mds_lov_synchronize+0x5f/0xb0 [mds]
07:52:23: [<ffffffff81003330>] ? xen_emergency_restart+0x20/0x30
07:52:23: [<ffffffff8100c14a>] child_rip+0xa/0x20
07:52:23: [<ffffffffa07f47a0>] ? mds_lov_synchronize+0x0/0xb0 [mds]
07:52:23: [<ffffffffa07f47a0>] ? mds_lov_synchronize+0x0/0xb0 [mds]
07:52:23: [<ffffffff8100c140>] ? child_rip+0x0/0x20
07:52:24:Code: 20 05 00 00 49 81 c7 40 05 00 00 45 85 c0 0f 84 ed 05 00 00 45 31 f6 45 31 e4 0f 1f 80 00 00 00 00 48 8b 83 78 05 00 00 49 63 d4 <48> 8b 04 d0 48 85 c0 74 67 48 83 78 40 00 74 60 f6 05 d3 05 b8
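
Note on the registers above: RAX (and R08) hold the repeating 0x5a byte pattern, which is typical of freed-and-poisoned memory, so lov_notify appears to be dereferencing a LOV target array that was already released while the ll_sync thread spawned by __mds_lov_synchronize was still using it; that matches the summary of the linked LU-1391 ("lov device is released while some osc..."). The fragment below is a minimal userspace sketch, not the actual Lustre code and not the fix in http://review.whamcloud.com/2708, of the reference-counting guard that conventionally closes this kind of teardown race; all names in it (fake_device, dev_get, dev_put, sync_worker) are hypothetical.

/*
 * Minimal userspace sketch (NOT Lustre code) of the suspected failure mode:
 * an asynchronous sync thread keeps a raw pointer to a device structure
 * while the teardown path frees it, so the thread later dereferences
 * poisoned (0x5a5a...) memory.  Taking a reference on the worker's behalf
 * before starting it, and dropping it when the worker finishes, is one
 * conventional way to prevent the premature free.
 *
 * Build with: gcc -pthread sketch.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct fake_device {
        atomic_int refcount;
        int        active;      /* stands in for the LOV target state */
};

static struct fake_device *dev_get(struct fake_device *d)
{
        atomic_fetch_add(&d->refcount, 1);
        return d;
}

static void dev_put(struct fake_device *d)
{
        /* Free only when the last user is gone. */
        if (atomic_fetch_sub(&d->refcount, 1) == 1)
                free(d);
}

/* Stand-in for the ll_sync / __mds_lov_synchronize worker thread. */
static void *sync_worker(void *arg)
{
        struct fake_device *d = arg;    /* reference taken by the creator */

        sleep(1);                       /* simulate slow orphan recovery  */
        printf("notify: device active=%d\n", d->active);
        dev_put(d);                     /* worker drops its reference     */
        return NULL;
}

int main(void)
{
        struct fake_device *d = calloc(1, sizeof(*d));
        pthread_t tid;

        atomic_init(&d->refcount, 1);   /* owner's reference */
        d->active = 1;

        /* Take a reference on behalf of the worker BEFORE starting it. */
        pthread_create(&tid, NULL, sync_worker, dev_get(d));

        /*
         * Teardown path: dropping the owner's reference no longer frees the
         * structure out from under the worker, because the worker still
         * holds its own reference.
         */
        dev_put(d);

        pthread_join(tid, NULL);
        return 0;
}

Without the per-worker reference, freeing the structure in the teardown path while sync_worker is still sleeping would reproduce the freed-pointer dereference pattern seen in the trace.
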



 Comments   
Comment by Oleg Drokin [ 07/Apr/12 ]

Alex advises that this code is in the process of being deprecated, and planned changes already scheduled for landing remove this particular piece of code anyway.

Comment by Andreas Dilger [ 11/Apr/12 ]

Unfortunately, while this code is in the process of being deleted, there is no way that the removal will land for the 2.3 release, so some short-term solution is needed.

This test is failing today (quite frequently: about 30% of full runs and 11% of all runs) and is preventing other patches from landing. The first failure was on April 1, but that also coincides with the move to Indiana, so it isn't clear whether the problem was introduced by a commit or by the environment change.

Comment by Peter Jones [ 15/Apr/12 ]

YangSheng

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 19/Apr/12 ]

https://maloo.whamcloud.com/test_sessions/58a0aa04-884c-11e1-b9ba-525400d2bfa6

Comment by Zhenyu Xu [ 23/Apr/12 ]

https://maloo.whamcloud.com/test_sets/e1f9521e-8b18-11e1-ac4e-525400d2bfa6

Comment by Andreas Dilger [ 07/May/12 ]

https://maloo.whamcloud.com/test_sets/baf56548-9826-11e1-9351-525400d2bfa6

Comment by Andreas Dilger [ 07/May/12 ]

Currently showing about 11% of all runs failing with this bug.

Comment by Yang Sheng [ 14/May/12 ]

Patch uploaded to: http://review.whamcloud.com/2708

Comment by Bob Glossman (Inactive) [ 18/May/12 ]

https://maloo.whamcloud.com/test_sets/8acef8ca-a0be-11e1-b20c-52540035b04c

Comment by Xuezhao Liu [ 20/May/12 ]

https://maloo.whamcloud.com/test_sessions/b4d8a1e4-a261-11e1-bba6-52540035b04c

Comment by nasf (Inactive) [ 20/May/12 ]

another failure instance:

https://maloo.whamcloud.com/sub_tests/3e66a266-a1b4-11e1-abdc-52540035b04c

Comment by Hiroya Nozaki [ 21/May/12 ]

I think the patch for LU-1391 could help this issue to some extent, though that one is for the 1.8 version.

Comment by Jinshan Xiong (Inactive) [ 24/May/12 ]

another hit: https://maloo.whamcloud.com/test_sets/ca49e520-a59e-11e1-8432-52540035b04c

Comment by Peter Jones [ 31/May/12 ]

Landed for 2.3

Comment by Ian Colle (Inactive) [ 22/Jun/12 ]

Hit it again: https://maloo.whamcloud.com/test_sets/6eb7d33e-bafc-11e1-b98a-52540035b04c

Comment by Oleg Drokin [ 25/Jun/12 ]

Ian, the hit you mention is not the real one.

We discussed it with Liang, and it seems he based his LU-56 patches on an older Lustre snapshot, so some fixes are missing there, including the one for this bug.

Comment by Yang Sheng [ 20/Mar/13 ]

Hit on 2.1.4 again

20:32:14:Code: 48 8b 42 10 48 83 b8 c0 01 00 00 00 0f 85 a6 00 00 00 f6 05 
20:32:14:RIP  [<ffffffff8b706d38>] :mds:__mds_lov_synchronize+0x2718/0x2950
20:32:14: RSP <ffff810301a7de10>
20:32:14: <0>Kernel panic - not syncing: Fatal exception
20:32:14: <1>Unable to handle kernel NULL pointer dereference at 0000000000000006 RIP: 
20:32:14: [<0000000000000006>]
20:32:14:PGD 0 
20:32:14:Oops: 0010 [2] SMP 
20:32:14:last sysfs file: /block/sda/queue/max_sectors_kb
20:32:14:CPU 2 
20:32:14:Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) jbd2 crc16 nfs nfsd exportfs nfs_acl auth_rpcgss autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_tcp bnx2i cnic uio cxgb3i libcxgbi iw_cxgb3 cxgb3 libiscsi_tcp ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds_rdma rds_tcp rds ib_sdp ib_ipoib ipoib_helper rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 xfrm_nalgo crypto_api ib_sa dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_ib ib_mad ib_core mlx4_en joydev sg shpchp mlx4_core igb i2c_i801 8021q i2c_core dca tpm_tis tpm tpm_bios i7core_edac serio_raw edac_mc pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
20:32:14:Pid: 14126, comm: ll_sync_00 Tainted: G     --------------------    2.6.18-348.3.1.el5_lustre.gec09f95 #1
20:32:16:RIP: 0010:[<0000000000000006>]  [<0000000000000006>]
20:32:16:RSP: 0018:ffff81010b78bf90  EFLAGS: 00010006
20:32:16:RAX: ffff810301a7dfd8 RBX: 00000000000000ff RCX: 0000000000000000
20:32:16:RDX: 000000000000372e RSI: 00000000000000ff RDI: 00000000000000ff
20:32:16:RBP: ffff810301a7db40 R08: 0000000000000004 R09: 000000000000003c
20:32:16:R10: ffff810301a7db18 R11: 000000000000027f R12: 0000000000000006
20:32:16:R13: 00000000ffffff03 R14: ffffffff802b81dc R15: ffff81033f896628
20:32:17:FS:  00002ab605e036e0(0000) GS:ffff81010b70b140(0000) knlGS:0000000000000000
20:32:17:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
20:32:17:CR2: 0000000000000006 CR3: 0000000000201000 CR4: 00000000000006a0
20:32:17:Process ll_sync_00 (pid: 14126, threadinfo ffff810301a7c000, task ffff8102fc8c1080)
20:32:17:Stack:  ffffffff80023692 ffffffff802b7f1f 0000000000000000 0000000000000296
20:32:17: ffffffff8005dc32 ffff810301a7db40 <EOI>  ffff8102fc8c1080 000000000000027f
20:32:17: ffff810301a7db18 000000000000003c 0000000000000004 00000000000000ff
20:32:17:Call Trace:
20:32:17: <IRQ>  [<ffffffff80023692>] smp_call_function_interrupt+0x57/0x75
20:32:17: [<ffffffff8005dc32>] call_function_interrupt+0x66/0x6c
20:32:17: <EOI>  [<ffffffff80077a02>] smp_send_stop+0x9e/0xa4
20:32:17: [<ffffffff800779d0>] smp_send_stop+0x6c/0xa4
20:32:17: [<ffffffff80093f0c>] panic+0xbd/0x200
20:32:17: [<ffffffff8003ddf8>] lock_timer_base+0x1b/0x3c
20:32:17: [<ffffffff80064b30>] _spin_unlock_irqrestore+0x8/0x9
20:32:17: [<ffffffff80064fda>] oops_end+0x51/0x53
20:32:17: [<ffffffff8006c73d>] die+0x3a/0x44
20:32:17: [<ffffffff8006557f>] do_general_protection+0xfe/0x107
20:32:17: [<ffffffff8005ddf9>] error_exit+0x0/0x84
20:32:17: [<ffffffff8b706d38>] :mds:__mds_lov_synchronize+0x2718/0x2950
20:32:17: [<ffffffff800957a6>] exit_fs+0x25/0xbb
20:32:17: [<ffffffff8b707dc0>] :mds:mds_lov_synchronize+0x0/0x80
20:32:17: [<ffffffff8b707e28>] :mds:mds_lov_synchronize+0x68/0x80
20:32:17: [<ffffffff8006e18d>] syscall_trace_leave+0x2f/0xd1
20:32:18: [<ffffffff8005dfc1>] child_rip+0xa/0x11
20:32:18: [<ffffffff8b707dc0>] :mds:mds_lov_synchronize+0x0/0x80
20:32:18: [<ffffffff8b707dc0>] :mds:mds_lov_synchronize+0x0/0x80
20:32:18: [<ffffffff8005dfb7>] child_rip+0x0/0x11
20:32:18:
20:32:18:
20:32:18:Code:  Bad RIP value.
20:32:18:RIP  [<0000000000000006>]
20:32:18: RSP <ffff81010b78bf90>
20:32:18:CR2: 0000000000000006

https://maloo.whamcloud.com/test_logs/966d490e-9139-11e2-8eff-52540035b04c

I'll port the patch to b2_1.

Comment by Yang Sheng [ 21/Mar/13 ]

b2_1 patch: http://review.whamcloud.com/5789

Comment by Jian Yu [ 22/Mar/13 ]

Lustre Client: 1.8.9-wc1
Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/258/

Lustre Server: v2_1_5_RC1
Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_1/191/

conf-sanity test 58 failed to unmount MDS and hit the following issue:

02:19:25:Lustre: DEBUG MARKER: umount -d -f /mnt/mds
02:19:25:LustreError: 9855:0:(ldlm_lib.c:1987:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
02:19:25:Lustre: 9531:0:(ldlm_lib.c:1669:target_recovery_overseer()) recovery is aborted, evict exports in recovery
02:19:25:Lustre: 9531:0:(ldlm_lib.c:1669:target_recovery_overseer()) Skipped 1 previous similar message
02:19:26:Lustre: lustre-MDT0000: sending delayed replies to recovered clients
02:19:26:LustreError: 9531:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x8000d:0x7515da77: rc -116
02:19:27:LustreError: 9531:0:(llog_obd.c:220:llog_setup_named()) obd lustre-OST0000-osc-MDT0000 ctxt 2 lop_setup=ffffffffa05dfca0 failed -116
02:19:27:LustreError: 9531:0:(osc_request.c:4231:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
02:19:27:LustreError: 9531:0:(osc_request.c:4248:__osc_llog_init()) osc 'lustre-OST0000-osc-MDT0000' tgt 'mdd_obd-lustre-MDT0000' catid ffff880039ef3af0 rc=-116
02:19:27:LustreError: 9531:0:(osc_request.c:4250:__osc_llog_init()) logid 0x8000d:0x7515da77
02:19:27:LustreError: 9531:0:(osc_request.c:4278:osc_llog_init()) rc: -116
02:19:27:LustreError: 9531:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 0 osc 'lustre-OST0000-osc-MDT0000' tgt 'mdd_obd-lustre-MDT0000' (rc=-116)
02:19:27:LustreError: 9856:0:(lov_log.c:160:lov_llog_origin_connect()) error osc_llog_connect tgt 0 (-107)
02:19:27:LustreError: 9856:0:(mds_lov.c:832:__mds_lov_synchronize()) lustre-OST0000_UUID failed at llog_origin_connect: -107
02:19:27:LustreError: 9856:0:(mds_lov.c:861:__mds_lov_synchronize()) sync lustre-OST0000_UUID failed -107
02:19:27:LustreError: 9856:0:(mds_lov.c:865:__mds_lov_synchronize()) deactivating lustre-OST0000_UUID
02:19:27:general protection fault: 0000 [#1] SMP 
02:19:27:last sysfs file: /sys/devices/system/cpu/possible
02:19:27:CPU 0 
02:19:27:Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) jbd2 nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
02:19:27:
02:19:27:Pid: 9856, comm: ll_sync_00 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Red Hat KVM
02:19:27:RIP: 0010:[<ffffffffa0992504>]  [<ffffffffa0992504>] lov_notify+0x94/0x16b0 [lov]
02:19:27:RSP: 0018:ffff880054dc1db0  EFLAGS: 00010286
02:19:27:RAX: 5a5a5a5a5a5a5a5a RBX: ffff88005a1b24f8 RCX: 0000000000000004
02:19:27:RDX: ffff88007c3c40b8 RSI: ffff880037a56040 RDI: ffffffffa09d8e50
02:19:27:RBP: ffff880054dc1e20 R08: ffff880054dc0000 R09: 00000000ffffffff
02:19:28:R10: 000000005a5a5a5a R11: 0000000000000000 R12: ffff88007c3c40b8
02:19:28:R13: 0000000000000003 R14: 0000000000000000 R15: ffff88007c3c40b8
02:19:28:FS:  00007f3c3a2ed700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
02:19:28:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
02:19:28:CR2: 00007fb455ec7000 CR3: 000000007a314000 CR4: 00000000000006f0
02:19:28:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
02:19:28:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
02:19:28:Process ll_sync_00 (pid: 9856, threadinfo ffff880054dc0000, task ffff880066d8aaa0)
02:19:28:Stack:
02:19:28: ffffffff81c01a80 0000000000000000 0100000000000286 ffff88005a1b2c68
02:19:28:<d> ffff88007c3c40b8 000000037c3e6d40 ffff8800537c84c8 000000007c3c40b8
02:19:28:<d> ffff880054dc1e50 ffff88005a1b24f8 ffff88007c3c40b8 0000000000000003
02:19:28:Call Trace:
02:19:28: [<ffffffffa0a28b25>] obd_notify.clone.0+0xa5/0x370 [mds]
02:19:28: [<ffffffffa0a2da15>] __mds_lov_synchronize+0xd85/0x1780 [mds]
02:19:28: [<ffffffff811a5f13>] ? copy_fs_struct+0x83/0x90
02:19:28: [<ffffffffa0a2ed80>] ? mds_lov_synchronize+0x0/0x102 [mds]
02:19:28: [<ffffffffa0a2eddf>] mds_lov_synchronize+0x5f/0x102 [mds]
02:19:28: [<ffffffff8100c0ca>] child_rip+0xa/0x20
02:19:28: [<ffffffffa0a2ed80>] ? mds_lov_synchronize+0x0/0x102 [mds]
02:19:28: [<ffffffffa0a2ed80>] ? mds_lov_synchronize+0x0/0x102 [mds]
02:19:28: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/699d23e0-92a6-11e2-b7d5-52540035b04c

Comment by Peter Jones [ 20/May/13 ]

The b2_1 version of this patch has now landed for 2.1.6. What work is still outstanding?

Comment by Yang Sheng [ 21/May/13 ]

It can be closed for now.
