[LU-6639] sanity-lfsck test_2e: cpu softlock Created: 25/May/15  Updated: 19/Jul/15  Resolved: 19/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: dne2

Issue Links:
Related
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/2db6ece8-024e-11e5-9144-5254006e85c2.

The sub-test test_2e failed with the following error:

test failed to respond and timed out

Although I have only seen this in DNE2 testing, I cannot tell that it is related to the patch. In any case, I am creating this ticket first to see whether it is reproducible.

12:47:07:Lustre: DEBUG MARKER: == sanity-lfsck test 2e: namespace LFSCK can verify remote object linkEA == 12:43:12 (1432471392)
12:47:07:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1603
12:47:07:Lustre: *** cfs_fail_loc=1603, val=0***
12:47:07:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
12:47:07:Lustre: DEBUG MARKER: /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -t namespace -r -A
12:47:07:Lustre: 25171:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432471394/real 1432471394]  req@ffff88007b587cc0 x1502045982093880/t0(0) o1101->lustre-MDT0001-osp-MDT0000@10.1.5.252@tcp:24/4 lens 320/224 e 4 to 1 dl 1432471501 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
12:47:07:Lustre: 25171:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 24 previous similar messages
12:47:07:Lustre: lustre-MDT0001-osp-MDT0000: Connection to lustre-MDT0001 (at 10.1.5.252@tcp) was lost; in progress operations using this service will wait for recovery to complete
12:47:07:Lustre: Skipped 2 previous similar messages
12:47:07:Lustre: MGS: haven't heard from client a8120cf8-cd69-035f-f0a0-2847eb8fd18a (at 10.1.5.251@tcp) in 53 seconds. I think it's dead, and I am evicting it. exp ffff880059ba1400, cur 1432471543 expire 1432471513 last 1432471490
12:47:07:BUG: soft lockup - CPU#0 stuck for 67s! [ll_evictor:21676]
12:47:07:Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
12:47:07:CPU 0 
12:47:07:Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
12:47:07:
12:47:07:Pid: 21676, comm: ll_evictor Not tainted 2.6.32-504.16.2.el6_lustre.g2f99b7f.x86_64 #1 Red Hat KVM
12:47:07:RIP: 0010:[<ffffffff8152dc8b>]  [<ffffffff8152dc8b>] _spin_lock_bh+0x2b/0x40
12:47:07:RSP: 0018:ffff88006a551b60  EFLAGS: 00000206
12:47:07:RAX: 0000000000000001 RBX: ffff88006a551b70 RCX: 0000000000000000
12:47:07:RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa091bc30
12:47:07:RBP: ffffffff8100bb8e R08: 00000000ffffff0a R09: 00000000fffffffe
12:47:07:R10: 0000000000000000 R11: 000000000000005a R12: ffffffffa0567da0
12:47:07:R13: 0060000000000020 R14: ffffffffa08aae77 R15: 0000000a00000004
12:47:07:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
12:47:07:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
12:47:07:CR2: 00007fb01ff4e000 CR3: 0000000059bd8000 CR4: 00000000000006f0
12:47:07:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
12:47:07:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
12:47:07:Process ll_evictor (pid: 21676, threadinfo ffff88006a550000, task ffff880058faf520)
12:47:07:Stack:
12:47:07: ffffffffa08c1aa8 ffff8800662c3900 ffff88006a551b90 ffffffffa07e9b39
12:47:07:<d> ffff8800662c3900 ffff88005aafec00 ffff88006a551bb0 ffffffffa07c3f68
12:47:07:<d> ffff8800662c3900 ffff88006a551ce0 ffff88006a551c00 ffffffffa07c57a0
12:47:07:Call Trace:
12:47:07: [<ffffffffa07e9b39>] ? ldlm_del_waiting_lock+0x29/0x220 [ptlrpc]
12:47:07: [<ffffffffa07c3f68>] ? ldlm_lock_cancel+0x178/0x200 [ptlrpc]
12:47:07: [<ffffffffa07c57a0>] ? ldlm_cancel_locks_for_export_cb+0xb0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a3dcc>] ? cfs_hash_for_each_relax+0x17c/0x350 [libcfs]
12:47:07: [<ffffffffa05d2e15>] ? uuid_export_put_locked+0x15/0x20 [obdclass]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a55be>] ? cfs_hash_for_each_empty+0xfe/0x1e0 [libcfs]
12:47:07: [<ffffffffa07bf8af>] ? ldlm_cancel_locks_for_export+0x2f/0x40 [ptlrpc]
12:47:07: [<ffffffffa07cfec4>] ? server_disconnect_export+0x64/0x1a0 [ptlrpc]
12:47:07: [<ffffffffa0eb8fa0>] ? mdt_obd_disconnect+0x50/0x500 [mdt]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa05b835d>] ? class_fail_export+0x23d/0x540 [obdclass]
12:47:07: [<ffffffffa08234b5>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
12:47:07: [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
12:47:07: [<ffffffffa0823270>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
12:47:07: [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
12:47:07: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
12:47:07: [<ffffffff8109e680>] ? kthread+0x0/0xc0
12:47:07: [<ffffffff8100c200>] ? child_rip+0x0/0x20
12:47:07:Code: 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb e8 2a f0 b4 ff b8 00 00 01 00 f0 0f c1 03 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3b 00 75 f4 eb df 48 83 c4 08 5b c9 c3 66 90 
12:47:07:Call Trace:
12:47:07: [<ffffffffa07e9b39>] ? ldlm_del_waiting_lock+0x29/0x220 [ptlrpc]
12:47:07: [<ffffffffa07c3f68>] ? ldlm_lock_cancel+0x178/0x200 [ptlrpc]
12:47:07: [<ffffffffa07c57a0>] ? ldlm_cancel_locks_for_export_cb+0xb0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a3dcc>] ? cfs_hash_for_each_relax+0x17c/0x350 [libcfs]
12:47:07: [<ffffffffa05d2e15>] ? uuid_export_put_locked+0x15/0x20 [obdclass]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a55be>] ? cfs_hash_for_each_empty+0xfe/0x1e0 [libcfs]
12:47:07: [<ffffffffa07bf8af>] ? ldlm_cancel_locks_for_export+0x2f/0x40 [ptlrpc]
12:47:07: [<ffffffffa07cfec4>] ? server_disconnect_export+0x64/0x1a0 [ptlrpc]
12:47:07: [<ffffffffa0eb8fa0>] ? mdt_obd_disconnect+0x50/0x500 [mdt]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa05b835d>] ? class_fail_export+0x23d/0x540 [obdclass]
12:47:07: [<ffffffffa08234b5>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
12:47:07: [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
12:47:07: [<ffffffffa0823270>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
12:47:07: [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
12:47:07: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
12:47:07: [<ffffffff8109e680>] ? kthread+0x0/0xc0
12:47:07: [<ffffffff8100c200>] ? child_rip+0x0/0x20
12:47:07:Kernel panic - not syncing: softlockup: hung tasks
12:47:07:Pid: 21676, comm: ll_evictor Not tainted 2.6.32-504.16.2.el6_lustre.g2f99b7f.x86_64 #1
12:47:07:Call Trace:
12:47:07: <IRQ>  [<ffffffff81529fbc>] ? panic+0xa7/0x16f
12:47:07: [<ffffffff810ea5e0>] ? watchdog_timer_fn+0x0/0x1e0
12:47:07: [<ffffffff810ea7aa>] ? watchdog_timer_fn+0x1ca/0x1e0
12:47:07: [<ffffffff810a343e>] ? __run_hrtimer+0x8e/0x1a0
12:47:07: [<ffffffff810aa83f>] ? ktime_get_update_offsets+0x4f/0xd0
12:47:07: [<ffffffff810a37a6>] ? hrtimer_interrupt+0xe6/0x260
12:47:07: [<ffffffff8103422d>] ? local_apic_timer_interrupt+0x3d/0x70
12:47:07: [<ffffffff81534905>] ? smp_apic_timer_interrupt+0x45/0x60
12:47:07: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
12:47:07: <EOI>  [<ffffffff8152dc8b>] ? _spin_lock_bh+0x2b/0x40
12:47:07: [<ffffffffa07e9b39>] ? ldlm_del_waiting_lock+0x29/0x220 [ptlrpc]
12:47:07: [<ffffffffa07c3f68>] ? ldlm_lock_cancel+0x178/0x200 [ptlrpc]
12:47:07: [<ffffffffa07c57a0>] ? ldlm_cancel_locks_for_export_cb+0xb0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a3dcc>] ? cfs_hash_for_each_relax+0x17c/0x350 [libcfs]
12:47:07: [<ffffffffa05d2e15>] ? uuid_export_put_locked+0x15/0x20 [obdclass]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa07c56f0>] ? ldlm_cancel_locks_for_export_cb+0x0/0x450 [ptlrpc]
12:47:07: [<ffffffffa04a55be>] ? cfs_hash_for_each_empty+0xfe/0x1e0 [libcfs]
12:47:07: [<ffffffffa07bf8af>] ? ldlm_cancel_locks_for_export+0x2f/0x40 [ptlrpc]
12:47:07: [<ffffffffa07cfec4>] ? server_disconnect_export+0x64/0x1a0 [ptlrpc]
12:47:07: [<ffffffffa0eb8fa0>] ? mdt_obd_disconnect+0x50/0x500 [mdt]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa049dc51>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:47:07: [<ffffffffa05b835d>] ? class_fail_export+0x23d/0x540 [obdclass]
12:47:07: [<ffffffffa08234b5>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
12:47:07: [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
12:47:07: [<ffffffffa0823270>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
12:47:07: [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
12:47:07: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
12:47:07: [<ffffffff8109e680>] ? kthread+0x0/0xc0
12:47:07: [<ffffffff8100c200>] ? child_rip+0x0/0x20

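The call trace above shows the ll_evictor thread spinning in _spin_lock_bh() inside ldlm_del_waiting_lock() while cancelling all LDLM locks for the evicted export; spinning on one CPU for longer than the watchdog threshold (67s here) is exactly what the kernel reports as a soft lockup. The standalone userspace program below is only a rough illustration of that detection pattern, not Lustre or kernel code: a pthread spinlock stands in for the contended waiting-locks spinlock, and a plain watchdog thread stands in for the kernel's per-CPU softlockup watchdog.

/*
 * Hypothetical userspace illustration of the soft-lockup pattern seen above:
 * the "evictor" thread must take a shared spinlock once per lock it cancels
 * (standing in for the spinlock taken in ldlm_del_waiting_lock()), while
 * another thread holds that spinlock for a long time.  A watchdog thread
 * notices that the evictor has made no progress for longer than a threshold,
 * which is the condition the kernel reports as "soft lockup - CPU#N stuck".
 * Illustration only; build with: cc -pthread demo.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_spinlock_t waiting_locks_lock;   /* stand-in for the contended lock */
static atomic_long last_progress;               /* last time the evictor advanced  */
#define WATCHDOG_THRESH 5                       /* seconds (the kernel uses ~60s)  */

static void *lock_holder(void *arg)             /* simulates a stuck lock owner    */
{
    pthread_spin_lock(&waiting_locks_lock);
    sleep(30);                                  /* hold the lock far too long      */
    pthread_spin_unlock(&waiting_locks_lock);
    return NULL;
}

static void *evictor(void *arg)                 /* cancels locks for a dead export */
{
    for (int i = 0; i < 1000; i++) {
        atomic_store(&last_progress, (long)time(NULL));
        pthread_spin_lock(&waiting_locks_lock); /* busy-spins here, burning a CPU  */
        /* ... per-lock cancel work would go here ... */
        pthread_spin_unlock(&waiting_locks_lock);
    }
    return NULL;
}

static void *watchdog(void *arg)                /* per-CPU watchdog stand-in       */
{
    for (;;) {
        sleep(1);
        long stall = (long)time(NULL) - atomic_load(&last_progress);
        if (stall > WATCHDOG_THRESH)
            printf("BUG: soft lockup - evictor stuck for %lds!\n", stall);
    }
    return NULL;
}

int main(void)
{
    pthread_t h, e, w;

    pthread_spin_init(&waiting_locks_lock, PTHREAD_PROCESS_PRIVATE);
    atomic_store(&last_progress, (long)time(NULL));
    pthread_create(&h, NULL, lock_holder, NULL);
    sleep(1);                                   /* let the holder grab the lock    */
    pthread_create(&e, NULL, evictor, NULL);
    pthread_create(&w, NULL, watchdog, NULL);
    pthread_join(e, NULL);
    pthread_join(h, NULL);
    return 0;
}
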

 Comments   
Comment by nasf (Inactive) [ 27/May/15 ]

Some error messages on MDS2:

May 24 12:43:13 shadow-42vm7 kernel: Lustre: DEBUG MARKER: == sanity-lfsck test 2e: namespace LFSCK can verify remote object linkEA == 12:43:12 (1432471392)
May 24 12:43:13 shadow-42vm7 xinetd[1535]: EXIT: mshell status=0 pid=31338 duration=0(sec)
May 24 12:45:01 shadow-42vm7 kernel: Lustre: lustre-MDT0001: Client lustre-MDT0000-mdtlov_UUID (at 10.1.5.248@tcp) reconnecting
May 24 12:45:01 shadow-42vm7 kernel: Lustre: Skipped 1 previous similar message
May 24 12:45:02 shadow-42vm7 kernel: LustreError: 166-1: MGC10.1.5.248@tcp: Connection to MGS (at 10.1.5.248@tcp) was lost; in progress operations using this service will fail
May 24 12:45:02 shadow-42vm7 kernel: LustreError: Skipped 5 previous similar messages
May 24 12:45:08 shadow-42vm7 kernel: Lustre: lustre-MDT0003: Client lustre-MDT0000-mdtlov_UUID (at 10.1.5.248@tcp) reconnecting
May 24 12:45:08 shadow-42vm7 kernel: Lustre: Skipped 1 previous similar message
May 24 12:45:42 shadow-42vm7 kernel: LNetError: 26731:0:(socklnd.c:1660:ksocknal_destroy_conn()) Completing partial receive from 12345-10.1.5.248@tcp[1], ip 10.1.5.248:7988, with error, wanted: 104, left: 104, last alive is 34 secs ago
May 24 12:45:42 shadow-42vm7 kernel: LustreError: 26731:0:(events.c:306:request_in_callback()) event type 2, status -5, service mdt_out
May 24 12:45:42 shadow-42vm7 kernel: LustreError: 30449:0:(pack_generic.c:587:__lustre_unpack_msg()) message length 0 too small for magic/version check
May 24 12:45:42 shadow-42vm7 kernel: LustreError: 30449:0:(sec.c:2084:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.1.5.248@tcp x1502045982094672
May 24 12:45:50 shadow-42vm7 kernel: Lustre: lustre-MDT0002: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 10.1.5.248@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff88006662ac00, cur 1432471550 expire 1432471520 last 1432471501

This means the connection between MDS1 and MDS2 failed at some point, which can also be seen on MDS1. Then, during/after the reconnection, MDS2 received a bad RPC message (__lustre_unpack_msg). What would cause such a short message? It may be a clue to the lockup on the MDS1 side.
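For reference, the unpack error quoted above fires when an incoming buffer is too short to even contain the message magic, so a truncated receive (such as the "Completing partial receive ... with error" a few lines earlier) arrives with length 0 and is dropped before any further parsing. The sketch below is a simplified, hypothetical illustration of that kind of guard; it is not the actual __lustre_unpack_msg() code, and the header struct and magic value are invented for the example.

/*
 * Simplified, hypothetical sketch of a "magic/version" guard similar in
 * spirit to the one in __lustre_unpack_msg(): before anything else, the
 * unpacker must be able to read the message magic, so a truncated buffer
 * (len == 0 here) is rejected immediately.  The struct below is invented
 * for illustration and does not match the real struct lustre_msg layout.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MSG_MAGIC 0x0BD00BD3u              /* placeholder magic for the demo */

struct demo_msg_hdr {
    uint32_t dm_bufcount;
    uint32_t dm_secflvr;
    uint32_t dm_magic;
};

/* Returns 0 on success, a negative errno-style value on a bad message. */
static int demo_unpack_msg(const void *buf, size_t len)
{
    struct demo_msg_hdr hdr;

    if (len < sizeof(hdr)) {
        fprintf(stderr,
                "message length %zu too small for magic/version check\n", len);
        return -22;                             /* -EINVAL */
    }
    memcpy(&hdr, buf, sizeof(hdr));
    if (hdr.dm_magic != DEMO_MSG_MAGIC) {
        fprintf(stderr, "bad magic %#x\n", (unsigned)hdr.dm_magic);
        return -71;                             /* -EPROTO */
    }
    return 0;
}

int main(void)
{
    /* A partial receive that completed "with error" hands over 0 bytes. */
    demo_unpack_msg(NULL, 0);
    return 0;
}
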

Comment by Di Wang [ 19/Jul/15 ]

This has not happened for almost two months.
