[LU-6639] sanity-lfsck test_2e: cpu softlock

| Created: | 25/May/15 | Updated: | 19/Jul/15 | Resolved: | 19/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | dne2 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/2db6ece8-024e-11e5-9144-5254006e85c2.

The sub-test test_2e failed with the following error:

test failed to respond and timed out

Although I have only seen this in DNE2 testing, I cannot see how it would be related to the patch. In any case, I am creating the ticket first to see whether it is reproducible.

12:47:07:Lustre: DEBUG MARKER: == sanity-lfsck test 2e: namespace LFSCK can verify remote object linkEA == 12:43:12 (1432471392)
12:47:07:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1603
12:47:07:Lustre: *** cfs_fail_loc=1603, val=0***
12:47:07:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
12:47:07:Lustre: DEBUG MARKER: /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -t namespace -r -A
12:47:07:Lustre: 25171:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432471394/real 1432471394] req@ffff88007b587cc0 x1502045982093880/t0(0) o1101->lustre-MDT0001-osp-MDT0000@10.1.5.252@tcp:24/4 lens 320/224 e 4 to 1 dl 1432471501 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
12:47:07:Lustre: 25171:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 24 previous similar messages
12:47:07:Lustre: lustre-MDT0001-osp-MDT0000: Connection to lustre-MDT0001 (at 10.1.5.252@tcp) was lost; in progress operations using this service will wait for recovery to complete
12:47:07:Lustre: Skipped 2 previous similar messages
12:47:07:Lustre: MGS: haven't heard from client a8120cf8-cd69-035f-f0a0-2847eb8fd18a (at 10.1.5.251@tcp) in 53 seconds. I think it's dead, and I am evicting it. exp ffff880059ba1400, cur 1432471543 expire 1432471513 last 1432471490
12:47:07:BUG: soft lockup - CPU#0 stuck for 67s! [ll_evictor:21676]
12:47:07:Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
12:47:07:CPU 0
12:47:07:Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
12:47:07:
12:47:07:Pid: 21676, comm: ll_evictor Not tainted 2.6.32-504.16.2.el6_lustre.g2f99b7f.x86_64 #1 Red Hat KVM
12:47:07:RIP: 0010:[<ffffffff8152dc8b>] [<ffffffff8152dc8b>] _spin_lock_bh+0x2b/0x40
12:47:07:RSP: 0018:ffff88006a551b60 EFLAGS: 00000206
12:47:07:RAX: 0000000000000001 RBX: ffff88006a551b70 RCX: 0000000000000000
12:47:07:RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa091bc30
12:47:07:RBP: ffffffff8100bb8e R08: 00000000ffffff0a R09: 00000000fffffffe
12:47:07:R10: 0000000000000000 R11: 000000000000005a R12: ffffffffa0567da0
12:47:07:R13: 0060000000000020 R14: ffffffffa08aae77 R15: 0000000a00000004
12:47:07:FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
12:47:07:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
12:47:07:CR2: 00007fb01ff4e000 CR3: 0000000059bd8000 CR4: 00000000000006f0
12:47:07:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
12:47:07:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
12:47:07:Process ll_evictor (pid: 21676, threadinfo ffff88006a550000, task ffff880058faf520)
12:47:07:Stack:
12:47:07: ffffffffa08c1aa8 ffff8800662c3900 ffff88006a551b90 ffffffffa07e9b39
12:47:07:<d> ffff8800662c3900 ffff88005aafec00 ffff88006a551bb0 ffffffffa07c3f68
12:47:07:<d> ffff8800662c3900 ffff88006a551ce0 ffff88006a551c00 ffffffffa07c57a0
12:47:07:Call Trace:
12:47:07: [<ffffffffa07e9b39>] ? ldlm_del_waiting_lock+0x29/0x220 [ptlrpc]
12:47:07: [<ffffffffa07c3f68>] ? ldlm_lock_cancel+0x178/0x200 [ptlrpc]
12:47:07: [<ffffffffa07c57a0>] ? ldlm_cancel_locks_for_export_cb+0xb0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a3dcc>] ? cfs_hash_for_each_relax+0x17c/0x350 [libcfs]
12:47:07: [<ffffffffa05d2e15>] ? uuid_export_put_locked+0x15/0x20 [obdclass]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a55be>] ? cfs_hash_for_each_empty+0xfe/0x1e0 [libcfs]
12:47:07: [<ffffffffa07bf8af>] ? ldlm_cancel_locks_for_export+0x2f/0x40 [ptlrpc]
12:47:07: [<ffffffffa07cfec4>] ? server_disconnect_export+0x64/0x1a0 [ptlrpc]
12:47:07: [<ffffffffa0eb8fa0>] ? mdt_obd_disconnect+0x50/0x500 [mdt]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa05b835d>] ? class_fail_export+0x23d/0x540 [obdclass]
12:47:07: [<ffffffffa08234b5>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
12:47:07: [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
12:47:07: [<ffffffffa0823270>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
12:47:07: [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
12:47:07: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
12:47:07: [<ffffffff8109e680>] ? kthread+0x0/0xc0
12:47:07: [<ffffffff8100c200>] ? child_rip+0x0/0x20
12:47:07:Code: 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb e8 2a f0 b4 ff b8 00 00 01 00 f0 0f c1 03 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3b 00 75 f4 eb df 48 83 c4 08 5b c9 c3 66 90
12:47:07:Call Trace:
12:47:07: [<ffffffffa07e9b39>] ? ldlm_del_waiting_lock+0x29/0x220 [ptlrpc]
12:47:07: [<ffffffffa07c3f68>] ? ldlm_lock_cancel+0x178/0x200 [ptlrpc]
12:47:07: [<ffffffffa07c57a0>] ? ldlm_cancel_locks_for_export_cb+0xb0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a3dcc>] ? cfs_hash_for_each_relax+0x17c/0x350 [libcfs]
12:47:07: [<ffffffffa05d2e15>] ? uuid_export_put_locked+0x15/0x20 [obdclass]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a55be>] ? cfs_hash_for_each_empty+0xfe/0x1e0 [libcfs]
12:47:07: [<ffffffffa07bf8af>] ? ldlm_cancel_locks_for_export+0x2f/0x40 [ptlrpc]
12:47:07: [<ffffffffa07cfec4>] ? server_disconnect_export+0x64/0x1a0 [ptlrpc]
12:47:07: [<ffffffffa0eb8fa0>] ? mdt_obd_disconnect+0x50/0x500 [mdt]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa05b835d>] ? class_fail_export+0x23d/0x540 [obdclass]
12:47:07: [<ffffffffa08234b5>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
12:47:07: [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
12:47:07: [<ffffffffa0823270>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
12:47:07: [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
12:47:07: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
12:47:07: [<ffffffff8109e680>] ? kthread+0x0/0xc0
12:47:07: [<ffffffff8100c200>] ? child_rip+0x0/0x20
12:47:07:Kernel panic - not syncing: softlockup: hung tasks
12:47:07:Pid: 21676, comm: ll_evictor Not tainted 2.6.32-504.16.2.el6_lustre.g2f99b7f.x86_64 #1
12:47:07:Call Trace:
12:47:07: <IRQ> [<ffffffff81529fbc>] ? panic+0xa7/0x16f
12:47:07: [<ffffffff810ea5e0>] ? watchdog_timer_fn+0x0/0x1e0
12:47:07: [<ffffffff810ea7aa>] ? watchdog_timer_fn+0x1ca/0x1e0
12:47:07: [<ffffffff810a343e>] ? __run_hrtimer+0x8e/0x1a0
12:47:07: [<ffffffff810aa83f>] ? ktime_get_update_offsets+0x4f/0xd0
12:47:07: [<ffffffff810a37a6>] ? hrtimer_interrupt+0xe6/0x260
12:47:07: [<ffffffff8103422d>] ? local_apic_timer_interrupt+0x3d/0x70
12:47:07: [<ffffffff81534905>] ? smp_apic_timer_interrupt+0x45/0x60
12:47:07: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
12:47:07: <EOI> [<ffffffff8152dc8b>] ? _spin_lock_bh+0x2b/0x40
12:47:07: [<ffffffffa07e9b39>] ? ldlm_del_waiting_lock+0x29/0x220 [ptlrpc]
12:47:07: [<ffffffffa07c3f68>] ? ldlm_lock_cancel+0x178/0x200 [ptlrpc]
12:47:07: [<ffffffffa07c57a0>] ? ldlm_cancel_locks_for_export_cb+0xb0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a3dcc>] ? cfs_hash_for_each_relax+0x17c/0x350 [libcfs]
12:47:07: [<ffffffffa05d2e15>] ? uuid_export_put_locked+0x15/0x20 [obdclass]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a55be>] ? cfs_hash_for_each_empty+0xfe/0x1e0 [libcfs]
12:47:07: [<ffffffffa07bf8af>] ? ldlm_cancel_locks_for_export+0x2f/0x40 [ptlrpc]
12:47:07: [<ffffffffa07cfec4>] ? server_disconnect_export+0x64/0x1a0 [ptlrpc]
12:47:07: [<ffffffffa0eb8fa0>] ? mdt_obd_disconnect+0x50/0x500 [mdt]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa05b835d>] ? class_fail_export+0x23d/0x540 [obdclass]
12:47:07: [<ffffffffa08234b5>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
12:47:07: [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
12:47:07: [<ffffffffa0823270>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
12:47:07: [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
12:47:07: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
12:47:07: [<ffffffff8109e680>] ? kthread+0x0/0xc0
12:47:07: [<ffffffff8100c200>] ? child_rip+0x0/0x20 |
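
For context: the trace shows the ping evictor stuck in _spin_lock_bh() called from ldlm_del_waiting_lock() while ldlm_cancel_locks_for_export() walks the evicted export's lock hash, i.e. a shared spinlock appears to be taken and released once per lock being cancelled. Below is a minimal userspace sketch of that contention pattern, assuming hypothetical names throughout; pthread spinlocks stand in for the kernel's, and the key difference is that a kernel thread in _spin_lock_bh() spins without scheduling, which is exactly what the soft-lockup watchdog detects.

/*
 * Illustration only -- a userspace sketch of the contention pattern in
 * the trace above, not Lustre code; all names are hypothetical.  One
 * thread (the "evictor") takes and drops a single shared spinlock once
 * per lock it cancels while another thread works under the same lock.
 * Userspace threads get preempted; a kernel thread spinning this way
 * does not, so sustained contention shows up as "soft lockup - CPU#0
 * stuck".
 */
#include <pthread.h>
#include <stdio.h>

#define NLOCKS 50000000UL               /* many locks to cancel */

static pthread_spinlock_t waiting_locks_spinlock;   /* one shared lock */

/* stand-in for per-lock cancellation under the shared lock */
static void *evictor_main(void *arg)
{
        (void)arg;
        for (unsigned long i = 0; i < NLOCKS; i++) {
                pthread_spin_lock(&waiting_locks_spinlock);
                /* ... remove lock i from the waiting list ... */
                pthread_spin_unlock(&waiting_locks_spinlock);
        }
        return NULL;
}

/* stand-in for other users of the same lock (timeout handling etc.) */
static void *lock_user_main(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_spin_lock(&waiting_locks_spinlock);
                /* ... scan/update the waiting list ... */
                pthread_spin_unlock(&waiting_locks_spinlock);
        }
        return NULL;
}

int main(void)
{
        pthread_t evictor, user;

        pthread_spin_init(&waiting_locks_spinlock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&user, NULL, lock_user_main, NULL);
        pthread_create(&evictor, NULL, evictor_main, NULL);
        pthread_join(evictor, NULL);    /* progress depends on winning the lock */
        printf("evictor finished cancelling %lu locks\n", NLOCKS);
        return 0;
}

If the export holds many locks, and cfs_hash_for_each_empty() re-walks the hash until it is empty, the evictor can plausibly spend the whole watchdog interval inside this lock/unlock loop.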
| Comments |
| Comment by nasf (Inactive) [ 27/May/15 ] |
|
Some error messages on the MDS2:

May 24 12:43:13 shadow-42vm7 kernel: Lustre: DEBUG MARKER: == sanity-lfsck test 2e: namespace LFSCK can verify remote object linkEA == 12:43:12 (1432471392)
May 24 12:43:13 shadow-42vm7 xinetd[1535]: EXIT: mshell status=0 pid=31338 duration=0(sec)
May 24 12:45:01 shadow-42vm7 kernel: Lustre: lustre-MDT0001: Client lustre-MDT0000-mdtlov_UUID (at 10.1.5.248@tcp) reconnecting
May 24 12:45:01 shadow-42vm7 kernel: Lustre: Skipped 1 previous similar message
May 24 12:45:02 shadow-42vm7 kernel: LustreError: 166-1: MGC10.1.5.248@tcp: Connection to MGS (at 10.1.5.248@tcp) was lost; in progress operations using this service will fail
May 24 12:45:02 shadow-42vm7 kernel: LustreError: Skipped 5 previous similar messages
May 24 12:45:08 shadow-42vm7 kernel: Lustre: lustre-MDT0003: Client lustre-MDT0000-mdtlov_UUID (at 10.1.5.248@tcp) reconnecting
May 24 12:45:08 shadow-42vm7 kernel: Lustre: Skipped 1 previous similar message
May 24 12:45:42 shadow-42vm7 kernel: LNetError: 26731:0:(socklnd.c:1660:ksocknal_destroy_conn()) Completing partial receive from 12345-10.1.5.248@tcp[1], ip 10.1.5.248:7988, with error, wanted: 104, left: 104, last alive is 34 secs ago
May 24 12:45:42 shadow-42vm7 kernel: LustreError: 26731:0:(events.c:306:request_in_callback()) event type 2, status -5, service mdt_out
May 24 12:45:42 shadow-42vm7 kernel: LustreError: 30449:0:(pack_generic.c:587:__lustre_unpack_msg()) message length 0 too small for magic/version check
May 24 12:45:42 shadow-42vm7 kernel: LustreError: 30449:0:(sec.c:2084:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.1.5.248@tcp x1502045982094672
May 24 12:45:50 shadow-42vm7 kernel: Lustre: lustre-MDT0002: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 10.1.5.248@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff88006662ac00, cur 1432471550 expire 1432471520 last 1432471501

This means the connection between MDS1 and MDS2 failed at some point, which can also be seen on the MDS1 side. Then, during/after the reconnection, MDS2 received a bad RPC message (__lustre_unpack_msg). What would cause such a short message? Perhaps it is a clue to the lockup on the MDS1 side. |
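
On the "message length 0 too small" line: pack_generic.c:__lustre_unpack_msg() refuses to unpack any request whose buffer is too short to even contain the magic/version fields, and the receive that ksocknal_destroy_conn() completed "with error" handed up 0 bytes. Below is a simplified sketch of that kind of up-front length check; the struct layout and names are assumptions for illustration, not the actual Lustre definitions.

/*
 * Illustration only -- simplified sketch of the sanity check that emits
 * "message length %d too small for magic/version check"; struct fields
 * and names here are assumptions, not the real struct lustre_msg.
 */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct msg_hdr {                /* simplified stand-in for struct lustre_msg */
        uint32_t lm_bufcount;
        uint32_t lm_secflvr;
        uint32_t lm_magic;      /* wire magic, checked before anything else */
        uint32_t lm_repsize;
};

static int unpack_msg(const struct msg_hdr *m, size_t len)
{
        /* need the buffer to reach at least through lm_magic to inspect it */
        size_t need = offsetof(struct msg_hdr, lm_magic) + sizeof(m->lm_magic);

        if (len < need) {
                fprintf(stderr,
                        "message length %zu too small for magic/version check\n",
                        len);
                return -EINVAL;
        }
        /* ... magic/version checks and byte-swapping would follow ... */
        return 0;
}

int main(void)
{
        /* a zero-length receive, as in the MDS2 log, is rejected up front */
        return unpack_msg(NULL, 0) == -EINVAL ? 0 : 1;
}

So the zero-length message is consistent with the aborted partial receive reported by the socklnd line just above it, rather than with corruption that got past the checks.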
| Comment by Di Wang [ 19/Jul/15 ] |
|
This has not happened for almost 2 months. |