[LU-2703] racer: BUG: soft lockup - CPU#0 stuck for 67s! [dd:1404] Created: 29/Jan/13 Updated: 18/Feb/13 Resolved: 18/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9 |
| Fix Version/s: | Lustre 1.8.9 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
Lustre Branch: b1_8 |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6296 |
| Description |
|
While running the racer test, the following issue occurred on one of the two clients:

00:11:31:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-28vm1,client-28vm2.lab.whamcloud.com DURATION=900 == 00:11:30 (1359447090)
00:11:32:Lustre: DEBUG MARKER: DURATION=900 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer
00:24:12:BUG: soft lockup - CPU#0 stuck for 67s! [dd:1404]
00:24:12:Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
00:24:12:CPU 0
00:24:12:Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
00:24:12:
00:24:12:Pid: 1404, comm: dd Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Red Hat KVM
00:24:12:RIP: 0010:[<ffffffff814ec53e>] [<ffffffff814ec53e>] _spin_lock+0x1e/0x30
00:24:12:RSP: 0018:ffff880030ce16a8 EFLAGS: 00000206
00:24:13:RAX: 0000000000000001 RBX: ffff880030ce16a8 RCX: ffff8800783c6ba0
00:24:13:RDX: 0000000000000000 RSI: ffff88004ecec7c0 RDI: ffff88007b510a1c
00:24:13:RBP: ffffffff8100bb8e R08: 0000000000000102 R09: 0000000000000000
00:24:14:R10: 0000000003b5e000 R11: 000000000000000e R12: ffffffffa058315f
00:24:14:R13: ffff880030ce1638 R14: 0000000003b5e000 R15: 0000000003b5efff
00:24:14:FS: 00007f288786d700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
00:24:14:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
00:24:14:CR2: 00000036526cd710 CR3: 000000003d0ee000 CR4: 00000000000006f0
00:24:14:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
00:24:14:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
00:24:14:Process dd (pid: 1404, threadinfo ffff880030ce0000, task ffff88007d082aa0)
00:24:14:Stack:
00:24:14: ffff880030ce17f8 ffffffffa0275889 00000000002537dc ffffea000115a9f0
00:24:14:<d> ffff88003e877950 0000000000000000 0000000003b5e000 0000000003b5efff
00:24:14:<d> ffff880030ce1998 ffffffffa079bddd f869fda2cf0897a1 0000000000000000
00:24:14:Call Trace:
00:24:15: [<ffffffffa0275889>] ? osc_queue_async_io+0x399/0x1140 [osc]
00:24:15: [<ffffffffa079bddd>] ? ll_prepare_write+0x50d/0x1230 [lustre]
00:24:15: [<ffffffffa072adce>] ? lov_stripe_offset+0x28e/0x340 [lov]
00:24:15: [<ffffffffa072a8db>] ? lov_tgt_seq_show+0x26b/0x300 [lov]
00:24:16: [<ffffffffa070d0a9>] ? lov_queue_async_io+0x149/0x4a0 [lov]
00:24:16: [<ffffffffa0795780>] ? queue_or_sync_write+0x160/0xda0 [lustre]
00:24:16: [<ffffffffa07a2c2b>] ? ll_stats_ops_tally+0x6b/0xd0 [lustre]
00:24:16: [<ffffffffa079cde5>] ? ll_commit_write+0x2e5/0x750 [lustre]
00:24:16: [<ffffffffa07b4333>] ? ll_write_begin+0x83/0x210 [lustre]
00:24:16: [<ffffffffa07b4280>] ? ll_write_end+0x30/0x60 [lustre]
00:24:16: [<ffffffff811107fa>] ? generic_file_buffered_write+0x18a/0x2e0
00:24:16: [<ffffffff81070f97>] ? current_fs_time+0x27/0x30
00:24:16: [<ffffffff81112130>] ? __generic_file_aio_write+0x250/0x480
00:24:16: [<ffffffffa0765dba>] ? ll_file_get_tree_lock_iov+0x14a/0x810 [lustre]
00:24:16: [<ffffffff811123cf>] ? generic_file_aio_write+0x6f/0xe0
00:24:16: [<ffffffffa0772449>] ? ll_file_aio_write+0xa19/0x1c60 [lustre]
00:24:16: [<ffffffffa0773760>] ? ll_file_write+0xd0/0xf0 [lustre]
00:24:16: [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80
00:24:16: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
00:24:16: [<ffffffff8120ca26>] ? security_file_permission+0x16/0x20
00:24:16: [<ffffffff8117646d>] ? rw_verify_area+0x5d/0xc0
00:24:16: [<ffffffff81176588>] ? vfs_write+0xb8/0x1a0
00:24:16: [<ffffffff81176e81>] ? sys_write+0x51/0x90
00:24:16: [<ffffffff810d3a75>] ? __audit_syscall_exit+0x265/0x290
00:24:16: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
00:24:16:Code: 00 00 00 01 74 05 e8 72 8c d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 3e 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89

Maloo report: https://maloo.whamcloud.com/test_sets/83fa5d12-6a23-11e2-85d4-52540035b04c |
| Comments |
| Comment by Jian Yu [ 29/Jan/13 ] |
|
This is a regression issue introduced after http://review.whamcloud.com/5132 was landed. |
| Comment by Peter Jones [ 30/Jan/13 ] |
|
Bobijam, could you please look into this one? Thanks. Peter |
| Comment by Zhenyu Xu [ 30/Jan/13 ] |
|
This specific case (https://maloo.whamcloud.com/test_sets/83fa5d12-6a23-11e2-85d4-52540035b04c) is due to the MDS journal being dead for some unknown reason. Dead processes on the MDS:

01:17:06:kjournald D ffff810002536420 0 392 23 412 383 (L-TLB)
01:17:06: ffff81007f10bca0 0000000000000046 0000000000011220 0000000000000046
01:17:06: 0000000000000220 000000000000000a ffff810037fe30c0 ffffffff8031db60
01:17:06: 0000178e3069efc5 000000000000edfb ffff810037fe32a8 000000008006e6b3
01:17:06:Call Trace:
01:17:06: [<ffffffff8006ed48>] do_gettimeofday+0x40/0x90
01:17:06: [<ffffffff800155b4>] sync_buffer+0x0/0x3f
01:17:06: [<ffffffff800637de>] io_schedule+0x3f/0x67
01:17:06: [<ffffffff800155ef>] sync_buffer+0x3b/0x3f
01:17:06: [<ffffffff80063a0a>] __wait_on_bit+0x40/0x6e
01:17:06: [<ffffffff800155b4>] sync_buffer+0x0/0x3f
01:17:06: [<ffffffff80063aa4>] out_of_line_wait_on_bit+0x6c/0x78
01:17:06: [<ffffffff800a3c15>] wake_bit_function+0x0/0x23
01:17:06: [<ffffffff8803331c>] :jbd:journal_commit_transaction+0xa7f/0x132b
01:17:06: [<ffffffff8003dde5>] lock_timer_base+0x1b/0x3c
01:17:06: [<ffffffff88037489>] :jbd:kjournald+0xc1/0x213
01:17:06: [<ffffffff800a3be7>] autoremove_wake_function+0x0/0x2e
01:17:06: [<ffffffff800a39cf>] keventd_create_kthread+0x0/0xc4
01:17:06: [<ffffffff880373c8>] :jbd:kjournald+0x0/0x213
01:17:06: [<ffffffff800a39cf>] keventd_create_kthread+0x0/0xc4
01:17:06: [<ffffffff80032c45>] kthread+0xfe/0x132
01:17:06: [<ffffffff8005dfc1>] child_rip+0xa/0x11
01:17:06: [<ffffffff800a39cf>] keventd_create_kthread+0x0/0xc4
01:17:06: [<ffffffff80032b47>] kthread+0x0/0x132
01:17:06: [<ffffffff8005dfb7>] child_rip+0x0/0x11
01:17:07:syslogd D ffff810037fe30c0 0 1607 1 1610 1585 (NOTLB)
01:17:07: ffff8100741bdd98 0000000000000082 00000000741bdd18 0000000000000001
01:17:07: 0000000000000000 0000000000000009 ffff81007ec287a0 ffff810037fe30c0
01:17:07: 0000178e2ecd14cd 000000000001708d ffff81007ec28988 0000000000000092
01:17:07:Call Trace:
01:17:07: [<ffffffff8002e4db>] __wake_up+0x38/0x4f
01:17:07: [<ffffffff8803682d>] :jbd:log_wait_commit+0xa3/0xf5
01:17:07: [<ffffffff800a3be7>] autoremove_wake_function+0x0/0x2e
01:17:07: [<ffffffff880307f8>] :jbd:journal_stop+0x22a/0x259
01:17:07: [<ffffffff8002ff63>] __writeback_single_inode+0x1dd/0x31c
01:17:07: [<ffffffff800e4b09>] do_readv_writev+0x26e/0x291
01:17:07: [<ffffffff800f8a7a>] sync_inode+0x24/0x33
01:17:07: [<ffffffff8804d52e>] :ext3:ext3_sync_file+0xce/0xf8
01:17:07: [<ffffffff800504b1>] do_fsync+0x52/0xa4
01:17:07: [<ffffffff800e5424>] __do_fsync+0x23/0x36
01:17:07: [<ffffffff8005d29e>] tracesys+0xd5/0xdf
01:17:07: |
| Comment by Johann Lombardi (Inactive) [ 31/Jan/13 ] |
|
syslogd is waiting for kjournald, and kjournald is waiting for I/O to complete. What do you mean by the journal being dead? From the original report, it seems that we are stuck on a spinlock for 67s, so I tend to think that someone is doing something it shouldn't (e.g. sleeping, lock ordering issue) while holding a spinlock. |
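For reference, a minimal sketch of the class of bug being suspected here, with hypothetical names (this is not Lustre code): anything that can sleep, such as a GFP_KERNEL allocation or cond_resched(), must not run while a spinlock is held, otherwise every CPU spinning on that lock is stuck until the sleeper is rescheduled, which is exactly what the soft lockup watchdog reports.

#include <linux/spinlock.h>
#include <linux/slab.h>

static DEFINE_SPINLOCK(demo_lock);	/* hypothetical lock, stands in for cl_loi_list_lock */

static void buggy_path(void)
{
	void *buf;

	spin_lock(&demo_lock);
	buf = kmalloc(4096, GFP_KERNEL);	/* BUG: GFP_KERNEL may sleep under the spinlock */
	/* ... use buf ... */
	kfree(buf);
	spin_unlock(&demo_lock);
}

static void safer_path(void)
{
	void *buf;

	spin_lock(&demo_lock);
	buf = kmalloc(4096, GFP_ATOMIC);	/* OK: atomic allocation never sleeps */
	/* ... use buf ... */
	kfree(buf);
	spin_unlock(&demo_lock);
}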
| Comment by Jian Yu [ 31/Jan/13 ] |
|
Lustre Branch: b1_8

This issue occurred consistently on Lustre b1_8 build #252: https://maloo.whamcloud.com/test_sets/0fe4438e-6bc7-11e2-b2a5-52540035b04c |
| Comment by Johann Lombardi (Inactive) [ 01/Feb/13 ] |
|
Hm, we are looping in osc_announce_cached() in the last report, while it was osc_queue_async_io() in the original one. I suspect the issue is related to mishandling of the cl_loi_list_lock. Unfortunately, the sysrq-t output in Maloo is full of soft lockup messages. I think we should try to collect a crash dump. |
| Comment by Zhenyu Xu [ 01/Feb/13 ] |
|
For the record, I could not reproduce it with an RHEL5.9 server and client (both using 2.6.18-348.1.1.el5). |
| Comment by Jian Yu [ 04/Feb/13 ] |
|
Hi Bobi and Johann,

I reproduced this issue manually on Toro client-12vm[1-4] against Lustre b1_8 build #252 and got the kernel dump file of client-12vm1.

Client1: client-12vm1 (RHEL6.3 2.6.32-279.19.1.el6.x86_64)
Client2: client-12vm2 (RHEL6.3 2.6.32-279.19.1.el6.x86_64)
MGS/MDS: client-12vm3 (RHEL5.9 2.6.18-348.1.1.el5_lustre.g3480bb0)
OSS: client-12vm4 (RHEL5.9 2.6.18-348.1.1.el5_lustre.g3480bb0)

The dump file is /scratch/logs/1.8.9/
I also put the kernel-debuginfo-* packages into that directory. |
| Comment by Jian Yu [ 04/Feb/13 ] |
|
FYI, client-12vm[1-4] remain set up for debugging. |
| Comment by Zhenyu Xu [ 04/Feb/13 ] |
|
osc_update_grant() is trying to spin_lock the cl_loi_list_lock, but I haven't found where the lock is being held at the moment.

crash> bt -a
crash> ps | grep dd
crash> bt 17462
crash> dis 0xffffffffa0275e41 20 |
| Comment by Johann Lombardi (Inactive) [ 04/Feb/13 ] |
|
Another possibility is that we forgot to unlock the spinlock somewhere... |
| Comment by Keith Mannthey (Inactive) [ 04/Feb/13 ] |
|
I had started setting up a local environment to help capture the crash, but I see that first step is already covered. I am now working to debug the spinlock. It will be good to see who touched the lock last. |
| Comment by Keith Mannthey (Inactive) [ 04/Feb/13 ] |
|
Sorry, no hard data from my local debug setup yet; I wasted some time with a 1.8 build environment when I should have just grabbed the 1.8.9 RPMs from the build. In general, the plan for the Master Client is to ensure that when the spinlock is acquired, the place and task are recorded, so that when everyone is stuck waiting, the lock can be examined to see who touched it last and from where. With any luck the race window is not so small that adding a debug spinlock makes the issue stop reproducing. Given the current state of master with respect to such a change, is there a version of master that is known to work? |
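A hypothetical sketch of the instrumentation idea described above (not an actual patch, names are invented): wrap the contended spinlock so each acquisition records the owning task and the call site, which can then be read out of a crash dump while every CPU is spinning.

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/sched.h>

struct debug_spinlock {
	spinlock_t		 lock;
	struct task_struct	*owner;		/* task that last took the lock */
	unsigned long		 acquire_ip;	/* call site of the last acquisition */
};

static inline void debug_spin_lock(struct debug_spinlock *dl)
{
	spin_lock(&dl->lock);
	dl->owner = current;			/* record who holds the lock */
	dl->acquire_ip = _RET_IP_;		/* and from where it was taken */
}

static inline void debug_spin_unlock(struct debug_spinlock *dl)
{
	dl->owner = NULL;
	dl->acquire_ip = 0;
	spin_unlock(&dl->lock);
}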
| Comment by Zhenyu Xu [ 05/Feb/13 ] |
|
I think I've found who holds the cl_loi_list_lock and then goes to sleep.

crash> ps | grep ptlrpcd

Will push a patch for review. |
| Comment by Johann Lombardi (Inactive) [ 05/Feb/13 ] |
|
I think master has a patch for this already (it prevents kmalloc on this path). |
| Comment by Zhenyu Xu [ 05/Feb/13 ] |
|
this issue is a little related to |
| Comment by Zhenyu Xu [ 05/Feb/13 ] |
|
Patch tracking at http://review.whamcloud.com/5272

Commit message:

LU-2703 osc: ptlrpcd scheduled with a spinlock

When memory is stringent, there is a chance that the ptlrpcd calling path reschedules waiting for memory to become available while holding a spinlock; the backtrace is as follows:

PID: 1808 TASK: ffff880037f0f540 CPU: 0 COMMAND: "ptlrpcd"
 0 [ffff88007c881620] schedule at ffffffff814e9c02
 1 [ffff88007c8816e8] __cond_resched at ffffffff8106118a
 2 [ffff88007c881708] _cond_resched at ffffffff814ea610
 3 [ffff88007c881718] __kmalloc at ffffffff8115e8b0
 4 [ffff88007c881768] cfs_alloc at ffffffffa039e701 [libcfs]
 5 [ffff88007c881798] ldlm_bl_to_thread at ffffffffa051b2e9 [ptlrpc]
 6 [ffff88007c8818c8] ldlm_bl_to_thread_lock at ffffffffa051b8eb [ptlrpc]
 7 [ffff88007c881918] ldlm_lock_decref_internal at ffffffffa04f7f9d [ptlrpc]
 8 [ffff88007c881998] ldlm_lock_decref at ffffffffa04f8fd9 [ptlrpc]
 9 [ffff88007c8819e8] osc_cancel at ffffffffa026962e [osc]
10 [ffff88007c881a38] lov_cancel at ffffffffa0684e0b [lov]
11 [ffff88007c881b58] ll_ap_completion at ffffffffa0701669 [lustre]
12 [ffff88007c881c18] lov_ap_completion at ffffffffa0677b68 [lov]
13 [ffff88007c881c48] osc_ap_completion at ffffffffa02683b3 [osc]  ====> here ptlrpcd holds the cl_loi_list_lock
14 [ffff88007c881cb8] brw_interpret at ffffffffa0279429 [osc]
15 [ffff88007c881d58] ptlrpc_check_set at ffffffffa0530a9a [ptlrpc]
16 [ffff88007c881e38] ptlrpcd_check at ffffffffa05677ad [ptlrpc]
17 [ffff88007c881e98] ptlrpcd at ffffffffa0567a50 [ptlrpc]
18 [ffff88007c881f48] kernel_thread at ffffffff8100c0ca

This patch unlocks the spinlock before calling osc_ap_completion(). |
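In general terms, the fix follows the usual pattern of dropping a spinlock around any callee that may allocate memory and therefore sleep. A minimal sketch of that pattern, with hypothetical names (this is not the actual http://review.whamcloud.com/5272 change):

#include <linux/spinlock.h>

struct completion_arg;				/* hypothetical payload type */
void complete_async_page(struct completion_arg *arg);	/* hypothetical callee that may sleep */

static void interpret_reply(spinlock_t *loi_list_lock, struct completion_arg *arg)
{
	spin_lock(loi_list_lock);
	/* ... bookkeeping that genuinely needs the lock ... */
	spin_unlock(loi_list_lock);

	/* may kmalloc(GFP_KERNEL) and reschedule, so it must be called unlocked */
	complete_async_page(arg);

	spin_lock(loi_list_lock);
	/* ... continue under the lock, revalidating any state that may have changed ... */
	spin_unlock(loi_list_lock);
}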
| Comment by Johann Lombardi (Inactive) [ 05/Feb/13 ] |
|
This issue was actually introduced by commit 29309746c4049aa7da6cde4cb9a44ec0df2b1af3 (bugzilla 21252), which was a fix for bugzilla 16774. I think nobody uses this feature on 1.8 (it is disabled by default anyway). Maybe we could just revert bugzilla 21252, as well as some part of bugzilla 16774, to restore the default readahead locking behavior? |
| Comment by Hongchao Zhang [ 06/Feb/13 ] |
|
The obd_cancel call in ll_ap_completion is too heavyweight to be made while holding a spinlock, as it can easily cause this kind of issue. |
| Comment by Hongchao Zhang [ 06/Feb/13 ] |
|
The patch is tracked at http://review.whamcloud.com/#change,5285 |
| Comment by Bruno Faccini (Inactive) [ 09/Feb/13 ] |
|
I am not sure, but it seems to me that some of the test failures may come from the memory leak detected/indicated at obdclass module unload. In fact, it is the memory accounting that seems to be wrong, reporting more memory freed than allocated ("leaked: 18446744073709551612", i.e. 0xFFFFFFFFFFFFFFFC, which is -4). This problem may have been induced by patch set #5, where the new macro PPGA_SIZE() plays with the pointer size to determine what unit size to alloc/free in osc_build_ppga()/osc_release_ppga(). But this can fail (notably on archs like i686 where the pointer size is smaller than sizeof(struct lustre_handle) = 64 bits) if not all allocations use osc_build_ppga(), as in osc_build_req(). |
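To illustrate the suspected accounting underflow with a hypothetical example (the 4-byte and 8-byte unit sizes are assumptions for a 32-bit arch, not taken from the actual patch): if the allocation path accounts one unit size per entry but the release path accounts a larger one, the "allocated minus freed" counter wraps and the leak checker prints a huge unsigned value.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t tracked = 0;	/* bytes currently accounted as allocated */

	tracked += 4;		/* alloc path accounts sizeof(void *) = 4 on i686 */
	tracked -= 8;		/* free path accounts sizeof(struct lustre_handle) = 8 */

	/* prints: leaked: 18446744073709551612 (0xFFFFFFFFFFFFFFFC) */
	printf("leaked: %llu (0x%llX)\n",
	       (unsigned long long)tracked, (unsigned long long)tracked);
	return 0;
}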