
racer: BUG: soft lockup - CPU#0 stuck for 67s! [dd:1404]

Details


    Description

      While running the racer test, the following issue occurred on one of the two clients:

      00:11:31:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-28vm1,client-28vm2.lab.whamcloud.com DURATION=900 == 00:11:30 (1359447090)
      00:11:32:Lustre: DEBUG MARKER: DURATION=900 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
      00:24:12:BUG: soft lockup - CPU#0 stuck for 67s! [dd:1404]
      00:24:12:Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      00:24:12:CPU 0 
      00:24:12:Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      00:24:12:
      00:24:12:Pid: 1404, comm: dd Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Red Hat KVM
      00:24:12:RIP: 0010:[<ffffffff814ec53e>]  [<ffffffff814ec53e>] _spin_lock+0x1e/0x30
      00:24:12:RSP: 0018:ffff880030ce16a8  EFLAGS: 00000206
      00:24:13:RAX: 0000000000000001 RBX: ffff880030ce16a8 RCX: ffff8800783c6ba0
      00:24:13:RDX: 0000000000000000 RSI: ffff88004ecec7c0 RDI: ffff88007b510a1c
      00:24:13:RBP: ffffffff8100bb8e R08: 0000000000000102 R09: 0000000000000000
      00:24:14:R10: 0000000003b5e000 R11: 000000000000000e R12: ffffffffa058315f
      00:24:14:R13: ffff880030ce1638 R14: 0000000003b5e000 R15: 0000000003b5efff
      00:24:14:FS:  00007f288786d700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
      00:24:14:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      00:24:14:CR2: 00000036526cd710 CR3: 000000003d0ee000 CR4: 00000000000006f0
      00:24:14:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      00:24:14:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      00:24:14:Process dd (pid: 1404, threadinfo ffff880030ce0000, task ffff88007d082aa0)
      00:24:14:Stack:
      00:24:14: ffff880030ce17f8 ffffffffa0275889 00000000002537dc ffffea000115a9f0
      00:24:14:<d> ffff88003e877950 0000000000000000 0000000003b5e000 0000000003b5efff
      00:24:14:<d> ffff880030ce1998 ffffffffa079bddd f869fda2cf0897a1 0000000000000000
      00:24:14:Call Trace:
      00:24:15: [<ffffffffa0275889>] ? osc_queue_async_io+0x399/0x1140 [osc]
      00:24:15: [<ffffffffa079bddd>] ? ll_prepare_write+0x50d/0x1230 [lustre]
      00:24:15: [<ffffffffa072adce>] ? lov_stripe_offset+0x28e/0x340 [lov]
      00:24:15: [<ffffffffa072a8db>] ? lov_tgt_seq_show+0x26b/0x300 [lov]
      00:24:16: [<ffffffffa070d0a9>] ? lov_queue_async_io+0x149/0x4a0 [lov]
      00:24:16: [<ffffffffa0795780>] ? queue_or_sync_write+0x160/0xda0 [lustre]
      00:24:16: [<ffffffffa07a2c2b>] ? ll_stats_ops_tally+0x6b/0xd0 [lustre]
      00:24:16: [<ffffffffa079cde5>] ? ll_commit_write+0x2e5/0x750 [lustre]
      00:24:16: [<ffffffffa07b4333>] ? ll_write_begin+0x83/0x210 [lustre]
      00:24:16: [<ffffffffa07b4280>] ? ll_write_end+0x30/0x60 [lustre]
      00:24:16: [<ffffffff811107fa>] ? generic_file_buffered_write+0x18a/0x2e0
      00:24:16: [<ffffffff81070f97>] ? current_fs_time+0x27/0x30
      00:24:16: [<ffffffff81112130>] ? __generic_file_aio_write+0x250/0x480
      00:24:16: [<ffffffffa0765dba>] ? ll_file_get_tree_lock_iov+0x14a/0x810 [lustre]
      00:24:16: [<ffffffff811123cf>] ? generic_file_aio_write+0x6f/0xe0
      00:24:16: [<ffffffffa0772449>] ? ll_file_aio_write+0xa19/0x1c60 [lustre]
      00:24:16: [<ffffffffa0773760>] ? ll_file_write+0xd0/0xf0 [lustre]
      00:24:16: [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80
      00:24:16: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
      00:24:16: [<ffffffff8120ca26>] ? security_file_permission+0x16/0x20
      00:24:16: [<ffffffff8117646d>] ? rw_verify_area+0x5d/0xc0
      00:24:16: [<ffffffff81176588>] ? vfs_write+0xb8/0x1a0
      00:24:16: [<ffffffff81176e81>] ? sys_write+0x51/0x90
      00:24:16: [<ffffffff810d3a75>] ? __audit_syscall_exit+0x265/0x290
      00:24:16: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      00:24:16:Code: 00 00 00 01 74 05 e8 72 8c d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 3e 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
      

      Maloo report: https://maloo.whamcloud.com/test_sets/83fa5d12-6a23-11e2-85d4-52540035b04c

          Activity

            [LU-2703] racer: BUG: soft lockup - CPU#0 stuck for 67s! [dd:1404]

            johann Johann Lombardi (Inactive) added a comment -

            Another possibility is that we forgot to unlock the spinlock somewhere ...
            bobijam Zhenyu Xu added a comment -

            osc_update_grant is trying to spin_lock cl_loi_list_lock, but I haven't found where the lock is being held at the moment.

            crash> bt -a
            PID: 3789 TASK: ffff88007c73a040 CPU: 0 COMMAND: "dd"
            [exception RIP: _spin_lock+38]
            RIP: ffffffff814ec546 RSP: ffff88003318b388 RFLAGS: 00000206
            RAX: 0000000000000001 RBX: ffff88007ca5c180 RCX: 0000000000000030
            RDX: 0000000000000000 RSI: ffff88007ca5c180 RDI: ffff880079b7069c
            RBP: ffff88003318b388 R8: 0000000000000000 R9: 0000000000001000
            R10: ffff880037528310 R11: 000000000000000f R12: ffff880079b705d0
            R13: ffff880079b7069c R14: ffff880079b705d0 R15: 0000000000001000
            CS: 0010 SS: 0018
            #0 [ffff88003318b390] osc_update_grant at ffffffffa02667fe [osc]
            #1 [ffff88003318b3e0] osc_brw_fini_request at ffffffffa027727f [osc]
            #2 [ffff88003318b500] osc_brw at ffffffffa0278931 [osc]
            #3 [ffff88003318b640] lov_brw at ffffffffa06879c9 [lov]
            #4 [ffff88003318b6f0] ll_prepare_write at ffffffffa070a3fe [lustre]
            #5 [ffff88003318b9a0] ll_write_begin at ffffffffa0722333 [lustre]
            #6 [ffff88003318ba20] generic_file_buffered_write at ffffffff81110793
            #7 [ffff88003318baf0] __generic_file_aio_write at ffffffff81112130
            #8 [ffff88003318bbb0] generic_file_aio_write at ffffffff811123cf
            #9 [ffff88003318bc00] ll_file_aio_write at ffffffffa06e0449 [lustre]
            #10 [ffff88003318bdd0] ll_file_write at ffffffffa06e1760 [lustre]
            #11 [ffff88003318bef0] vfs_write at ffffffff81176588
            #12 [ffff88003318bf30] sys_write at ffffffff81176e81
            #13 [ffff88003318bf80] system_call_fastpath at ffffffff8100b072
            crash> dis osc_update_grant
            0xffffffffa02667d0 <osc_update_grant>: push %rbp
            0xffffffffa02667d1 <osc_update_grant+1>: mov %rsp,%rbp
            0xffffffffa02667d4 <osc_update_grant+4>: sub $0x40,%rsp
            0xffffffffa02667d8 <osc_update_grant+8>: mov %rbx,-0x18(%rbp)
            0xffffffffa02667dc <osc_update_grant+12>: mov %r12,-0x10(%rbp)
            0xffffffffa02667e0 <osc_update_grant+16>: mov %r13,-0x8(%rbp)
            0xffffffffa02667e4 <osc_update_grant+20>: nopl 0x0(%rax,%rax,1)
            0xffffffffa02667e9 <osc_update_grant+25>: lea 0xcc(%rdi),%r13
            0xffffffffa02667f0 <osc_update_grant+32>: mov %rdi,%r12
            0xffffffffa02667f3 <osc_update_grant+35>: mov %rsi,%rbx
            0xffffffffa02667f6 <osc_update_grant+38>: mov %r13,%rdi
            0xffffffffa02667f9 <osc_update_grant+41>: callq 0xffffffff814ec520 <_spin_lock> =====> client_obd_list_lock(&cli->cl_loi_list_lock);
            0xffffffffa02667fe <osc_update_grant+46>: testb $0x20,0x14833f(%rip) # 0xffffffffa03aeb44
            0xffffffffa0266805 <osc_update_grant+53>: je 0xffffffffa0266810
            0xffffffffa0266807 <osc_update_grant+55>: testb $0x8,0x148332(%rip) # 0xffffffffa03aeb40
            0xffffffffa026680e <osc_update_grant+62>: jne 0xffffffffa0266840
            0xffffffffa0266810 <osc_update_grant+64>: mov (%rbx),%rax
            0xffffffffa0266813 <osc_update_grant+67>: test $0x8000000,%eax
            0xffffffffa0266818 <osc_update_grant+72>: je 0xffffffffa0266823
            0xffffffffa026681a <osc_update_grant+74>: mov 0x48(%rbx),%rax

            crash> ps | grep dd
            2 0 0 ffff88007dc84aa0 IN 0.0 0 0 [kthreadd]
            13 2 0 ffff88007dd55500 RU 0.0 0 0 [sync_supers]
            14 2 0 ffff88007dd54aa0 RU 0.0 0 0 [bdi-default]
            15 2 0 ffff88007dd54040 IN 0.0 0 0 [kintegrityd/0]
            903 2 0 ffff880037b84080 RU 0.0 0 0 [ib_addr]
            1415 1387 0 ffff880037978080 IN 0.1 20220 1076 hald-addon-inpu
            1431 1387 0 ffff880037ffb500 IN 0.0 17804 1036 hald-addon-acpi
            > 3789 6414 0 ffff88007c73a040 RU 0.0 105180 664 dd
            6401 6394 0 ffff880079bdd500 IN 0.1 106096 1244 file_rename.sh
            15786 6398 0 ffff88007bef1500 IN 0.0 105180 664 dd
            17462 6406 0 ffff88007d7ed540 IN 0.0 105180 664 dd
            18320 6404 0 ffff880079cdd500 UN 0.0 115700 992 ls
            crash> bt 15786
            PID: 15786 TASK: ffff88007bef1500 CPU: 0 COMMAND: "dd"
            #0 [ffff880078d715e8] schedule at ffffffff814e9c02
            #1 [ffff880078d716b0] osc_queue_async_io at ffffffffa0275e72 [osc]
            #2 [ffff880078d71800] lov_queue_async_io at ffffffffa067b0a9 [lov]
            #3 [ffff880078d71880] queue_or_sync_write at ffffffffa0703780 [lustre]
            #4 [ffff880078d71960] ll_commit_write at ffffffffa070ade5 [lustre]
            #5 [ffff880078d719f0] ll_write_end at ffffffffa0722280 [lustre]
            #6 [ffff880078d71a20] generic_file_buffered_write at ffffffff811107fa
            #7 [ffff880078d71af0] __generic_file_aio_write at ffffffff81112130
            #8 [ffff880078d71bb0] generic_file_aio_write at ffffffff811123cf
            #9 [ffff880078d71c00] ll_file_aio_write at ffffffffa06e0449 [lustre]
            #10 [ffff880078d71dd0] ll_file_write at ffffffffa06e1760 [lustre]
            #11 [ffff880078d71ef0] vfs_write at ffffffff81176588
            #12 [ffff880078d71f30] sys_write at ffffffff81176e81
            #13 [ffff880078d71f80] system_call_fastpath at ffffffff8100b072

            crash> bt 17462
            PID: 17462 TASK: ffff88007d7ed540 CPU: 0 COMMAND: "dd"
            #0 [ffff8800784d35e8] schedule at ffffffff814e9c02
            #1 [ffff8800784d36b0] osc_queue_async_io at ffffffffa0275e72 [osc]
            #2 [ffff8800784d3800] lov_queue_async_io at ffffffffa067b0a9 [lov]
            #3 [ffff8800784d3880] queue_or_sync_write at ffffffffa0703780 [lustre]
            #4 [ffff8800784d3960] ll_commit_write at ffffffffa070ade5 [lustre]
            #5 [ffff8800784d39f0] ll_write_end at ffffffffa0722280 [lustre]
            #6 [ffff8800784d3a20] generic_file_buffered_write at ffffffff811107fa
            #7 [ffff8800784d3af0] __generic_file_aio_write at ffffffff81112130
            #8 [ffff8800784d3bb0] generic_file_aio_write at ffffffff811123cf
            #9 [ffff8800784d3c00] ll_file_aio_write at ffffffffa06e0449 [lustre]
            #10 [ffff8800784d3dd0] ll_file_write at ffffffffa06e1760 [lustre]
            #11 [ffff8800784d3ef0] vfs_write at ffffffff81176588
            #12 [ffff8800784d3f30] sys_write at ffffffff81176e81
            #13 [ffff8800784d3f80] system_call_fastpath at ffffffff8100b072
            RIP: 00000036d9edae60 RSP: 00007fffce21ace0 RFLAGS: 00000202
            RAX: 0000000000000001 RBX: ffffffff8100b072 RCX: 00000036d9edae60
            RDX: 0000000000000400 RSI: 0000000000ed9000 RDI: 0000000000000001
            RBP: 0000000000ed9000 R8: 00000036da18eee8 R9: 0000000000000001
            R10: 0000000000002403 R11: 0000000000000246 R12: 0000000000ed8fff
            R13: 0000000000000000 R14: 0000000000000400 R15: 0000000000000000
            ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

            crash> dis 0xffffffffa0275e41 20
            0xffffffffa0275e41 <osc_queue_async_io+2385>: mov -0xd8(%rbp),%rdx
            0xffffffffa0275e48 <osc_queue_async_io+2392>: mov %rbx,-0xb8(%rbp)
            0xffffffffa0275e4f <osc_queue_async_io+2399>: mov %rdx,%rbx
            0xffffffffa0275e52 <osc_queue_async_io+2402>: mov $0x1,%eax
            0xffffffffa0275e57 <osc_queue_async_io+2407>: xchg %rax,(%rbx)
            0xffffffffa0275e5a <osc_queue_async_io+2410>: mov -0xa8(%rbp),%rdi
            0xffffffffa0275e61 <osc_queue_async_io+2417>: mov %r13,%rsi
            0xffffffffa0275e64 <osc_queue_async_io+2420>: callq 0xffffffffa0266890 <ocw_granted>
            0xffffffffa0275e69 <osc_queue_async_io+2425>: test %eax,%eax
            0xffffffffa0275e6b <osc_queue_async_io+2427>: jne 0xffffffffa0275e92
            0xffffffffa0275e6d <osc_queue_async_io+2429>: callq 0xffffffff814e9850 <schedule> =====> l_wait_event(ocw.ocw_waitq, ocw_granted(cli, &ocw), &lwi);
            0xffffffffa0275e72 <osc_queue_async_io+2434>: mov -0xa8(%rbp),%rdi
            0xffffffffa0275e79 <osc_queue_async_io+2441>: mov %r13,%rsi
            0xffffffffa0275e7c <osc_queue_async_io+2444>: callq 0xffffffffa0266890 <ocw_granted>
            0xffffffffa0275e81 <osc_queue_async_io+2449>: test %eax,%eax
            0xffffffffa0275e83 <osc_queue_async_io+2451>: jne 0xffffffffa0275e92
            0xffffffffa0275e85 <osc_queue_async_io+2453>: callq 0xffffffffa039e250 <cfs_signal_pending>
            0xffffffffa0275e8a <osc_queue_async_io+2458>: test %eax,%eax
            0xffffffffa0275e8c <osc_queue_async_io+2460>: nopl 0x0(%rax)

            yujian Jian Yu added a comment -

            FYI, client-12vm[1-4] are being kept for debugging.

            yujian Jian Yu added a comment -

            Hi Bobi and Johann,

            I reproduced this issue manually on Toro client-12vm[1-4] against Lustre b1_8 build #252 and got the kernel dump file of client-12vm1.

            Client1: client-12vm1 (RHEL6.3 2.6.32-279.19.1.el6.x86_64)
            Client2: client-12vm2 (RHEL6.3 2.6.32-279.19.1.el6.x86_64)
            MGS/MDS: client-12vm3 (RHEL5.9 2.6.18-348.1.1.el5_lustre.g3480bb0)
            OSS: client-12vm4     (RHEL5.9 2.6.18-348.1.1.el5_lustre.g3480bb0)  
            

            The dump file is /scratch/logs/1.8.9/LU-2703/client-12vm1.dump on brent node.

            I also put the kernel-debuginfo-* packages into that directory.

            bobijam Zhenyu Xu added a comment -

            For the record, I could not reproduce it with a RHEL5.9 server and client (both using 2.6.18-348.1.1.el5).


            johann Johann Lombardi (Inactive) added a comment -

            Hm, we are looping in osc_announce_cached() in the last report, and it was osc_queue_async_io() in the original one. I suspect the issue is related to mishandling of the cl_loi_list_lock. Unfortunately, the sysrq-t output in Maloo is full of soft lockup messages. I think we should try to collect a crash dump.
            yujian Jian Yu added a comment -

            Lustre Branch: b1_8
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/252
            Distro/Arch: RHEL6.3 (client), RHEL5.9 (server)

            This issue occurred consistently on Lustre b1_8 build #252: https://maloo.whamcloud.com/test_sets/0fe4438e-6bc7-11e2-b2a5-52540035b04c


            johann Johann Lombardi (Inactive) added a comment -

            syslogd is waiting for kjournald, and kjournald is waiting for I/O to complete. What do you mean by "the journal is dead"?

            From the original report, it seems that we are stuck on a spinlock for 67s, so I tend to think that someone is doing something they shouldn't (e.g. sleeping, or a lock-ordering issue) while holding a spinlock.
            bobijam Zhenyu Xu added a comment -

            This specific case (https://maloo.whamcloud.com/test_sets/83fa5d12-6a23-11e2-85d4-52540035b04c) is due to the MDS journal being dead for some unknown reason.

            Dead processes on the MDS:
            01:17:06:kjournald     D ffff810002536420     0   392     23           412   383 (L-TLB)
            01:17:06: ffff81007f10bca0 0000000000000046 0000000000011220 0000000000000046
            01:17:06: 0000000000000220 000000000000000a ffff810037fe30c0 ffffffff8031db60
            01:17:06: 0000178e3069efc5 000000000000edfb ffff810037fe32a8 000000008006e6b3
            01:17:06:Call Trace:
            01:17:06: [<ffffffff8006ed48>] do_gettimeofday+0x40/0x90
            01:17:06: [<ffffffff800155b4>] sync_buffer+0x0/0x3f
            01:17:06: [<ffffffff800637de>] io_schedule+0x3f/0x67
            01:17:06: [<ffffffff800155ef>] sync_buffer+0x3b/0x3f
            01:17:06: [<ffffffff80063a0a>] __wait_on_bit+0x40/0x6e
            01:17:06: [<ffffffff800155b4>] sync_buffer+0x0/0x3f
            01:17:06: [<ffffffff80063aa4>] out_of_line_wait_on_bit+0x6c/0x78
            01:17:06: [<ffffffff800a3c15>] wake_bit_function+0x0/0x23
            01:17:06: [<ffffffff8803331c>] :jbd:journal_commit_transaction+0xa7f/0x132b
            01:17:06: [<ffffffff8003dde5>] lock_timer_base+0x1b/0x3c
            01:17:06: [<ffffffff88037489>] :jbd:kjournald+0xc1/0x213
            01:17:06: [<ffffffff800a3be7>] autoremove_wake_function+0x0/0x2e
            01:17:06: [<ffffffff800a39cf>] keventd_create_kthread+0x0/0xc4
            01:17:06: [<ffffffff880373c8>] :jbd:kjournald+0x0/0x213
            01:17:06: [<ffffffff800a39cf>] keventd_create_kthread+0x0/0xc4
            01:17:06: [<ffffffff80032c45>] kthread+0xfe/0x132
            01:17:06: [<ffffffff8005dfc1>] child_rip+0xa/0x11
            01:17:06: [<ffffffff800a39cf>] keventd_create_kthread+0x0/0xc4
            01:17:06: [<ffffffff80032b47>] kthread+0x0/0x132
            01:17:06: [<ffffffff8005dfb7>] child_rip+0x0/0x11
            
            01:17:07:syslogd       D ffff810037fe30c0     0  1607      1          1610  1585 (NOTLB)
            01:17:07: ffff8100741bdd98 0000000000000082 00000000741bdd18 0000000000000001
            01:17:07: 0000000000000000 0000000000000009 ffff81007ec287a0 ffff810037fe30c0
            01:17:07: 0000178e2ecd14cd 000000000001708d ffff81007ec28988 0000000000000092
            01:17:07:Call Trace:
            01:17:07: [<ffffffff8002e4db>] __wake_up+0x38/0x4f
            01:17:07: [<ffffffff8803682d>] :jbd:log_wait_commit+0xa3/0xf5
            01:17:07: [<ffffffff800a3be7>] autoremove_wake_function+0x0/0x2e
            01:17:07: [<ffffffff880307f8>] :jbd:journal_stop+0x22a/0x259
            01:17:07: [<ffffffff8002ff63>] __writeback_single_inode+0x1dd/0x31c
            01:17:07: [<ffffffff800e4b09>] do_readv_writev+0x26e/0x291
            01:17:07: [<ffffffff800f8a7a>] sync_inode+0x24/0x33
            01:17:07: [<ffffffff8804d52e>] :ext3:ext3_sync_file+0xce/0xf8
            01:17:07: [<ffffffff800504b1>] do_fsync+0x52/0xa4
            01:17:07: [<ffffffff800e5424>] __do_fsync+0x23/0x36
            01:17:07: [<ffffffff8005d29e>] tracesys+0xd5/0xdf
            01:17:07:
            
            pjones Peter Jones added a comment -

            Bobijam

            Could you please look into this one?

            Thanks

            Peter

            yujian Jian Yu added a comment -

            This is a regression introduced after http://review.whamcloud.com/5132 landed.


            People

              bobijam Zhenyu Xu
              yujian Jian Yu