[LU-2451] recovery-small test_24b: BUG: soft lockup - CPU#0 stuck for 67s! [ll_imp_inval:4791] Created: 10/Dec/12 Updated: 22/Dec/15 Resolved: 22/Dec/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: | Lustre Branch: b2_1 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 5791 |
| Description |
|
The recovery-small test 24b hung due to the following soft lockup issue on client:

18:45:08:Lustre: DEBUG MARKER: == recovery-small test 24b: test dirty page discard due to client eviction == 18:45:02 (1354934702)
18:45:08:Lustre: DEBUG MARKER: cancel_lru_locks osc start
18:45:08:Lustre: DEBUG MARKER: cancel_lru_locks osc stop
18:45:08:LustreError: 11-0: an error occurred while communicating with 10.10.4.151@tcp. The ost_write operation failed with -107
18:45:08:LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
18:45:09:Lustre: 20511:0:(llite_lib.c:2285:ll_dirty_page_discard_warn()) dirty page discard: 10.10.4.150@tcp:/lustre/fid: [0x200000401:0x42:0x0]//d0.recovery-small/d24/f24b-1 may get corrupted (rc -4)
18:45:09:LustreError: 20511:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88007b678c00 x1420751262452328/t0(0) o4->lustre-OST0000-osc-ffff88007b271800@10.10.4.151@tcp:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
18:46:20:BUG: soft lockup - CPU#0 stuck for 67s! [ll_imp_inval:4791]
18:46:20:Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
18:46:22:CPU 0
18:46:22:Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
18:46:22:
18:46:22:Pid: 4791, comm: ll_imp_inval Not tainted 2.6.32-279.14.1.el6.x86_64 #1 Red Hat KVM
18:46:22:RIP: 0010:[<ffffffff8150098e>] [<ffffffff8150098e>] _spin_lock+0x1e/0x30
18:46:22:RSP: 0018:ffff88007b60de00 EFLAGS: 00000206
18:46:22:RAX: 0000000000000001 RBX: ffff88007b60de00 RCX: 0000000000000000
18:46:22:RDX: 0000000000000000 RSI: ffffffffa04dfce6 RDI: ffff88007c616734
18:46:22:RBP: ffffffff8100bc0e R08: 00000000ffffff0a R09: 0000000000000000
18:46:23:R10: 000000000000000f R11: 000000000000000f R12: 0000000000000010
18:46:23:R13: ffffffffa0848ef1 R14: ffff88007b60ddc0 R15: 0000000000000000
18:46:24:FS: 00007f65fc14a700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
18:46:24:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
18:46:24:CR2: 000000000208b3c0 CR3: 000000003796c000 CR4: 00000000000006f0
18:46:24:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
18:46:24:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
18:46:24:Process ll_imp_inval (pid: 4791, threadinfo ffff88007b60c000, task ffff88007b90b500)
18:46:24:Stack:
18:46:24: ffff88007b60de60 ffffffffa090dead 0000000000000010 ffff880000808003
18:46:24:<d> ffff88007b60de30 000000018109252c ffff88007b60de60 ffff880076109800
18:46:24:<d> ffff88007c616138 ffffffff00000100 ebc0de0100000000 0000000000004e1a
18:46:24:Call Trace:
18:46:25: [<ffffffffa090dead>] ? osc_import_event+0x3ad/0x1470 [osc]
18:46:25: [<ffffffffa0687b01>] ? ptlrpc_invalidate_import+0x2d1/0x910 [ptlrpc]
18:46:25: [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
18:46:25: [<ffffffff811abaef>] ? unshare_fs_struct+0x5f/0xb0
18:46:25: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:25: [<ffffffffa06883af>] ? ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc]
18:46:25: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
18:46:25: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:25: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:25: [<ffffffff8100c140>] ? child_rip+0x0/0x20
18:46:26:Code: 00 00 00 01 74 05 e8 62 e0 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 3e 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
18:46:26:Call Trace:
18:46:26: [<ffffffffa090dead>] ? osc_import_event+0x3ad/0x1470 [osc]
18:46:26: [<ffffffffa0687b01>] ? ptlrpc_invalidate_import+0x2d1/0x910 [ptlrpc]
18:46:26: [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
18:46:26: [<ffffffff811abaef>] ? unshare_fs_struct+0x5f/0xb0
18:46:26: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:26: [<ffffffffa06883af>] ? ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc]
18:46:26: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
18:46:27: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:27: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:27: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/1d70d272-41da-11e2-adcf-52540035b04c |
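To make the failure mode concrete: ll_imp_inval is spinning in _spin_lock (called from osc_import_event) on a spinlock that another thread is holding across slow work, and because a spin-waiter never schedules, the kernel's soft-lockup watchdog flags the CPU after roughly a minute. The following is a minimal userspace analogue of that pattern, a sketch only, using pthread primitives and hypothetical names rather than any Lustre code:

/* Userspace analogue only (pthreads); illustrates the contention pattern,
 * not the actual Lustre code. Build with: cc -pthread lockspin.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_spinlock_t lock;

static void *holder(void *arg)
{
	(void)arg;
	pthread_spin_lock(&lock);
	/* Slow work while holding a spinlock: everything else spins. */
	sleep(70);
	pthread_spin_unlock(&lock);
	return NULL;
}

static void *waiter(void *arg)
{
	(void)arg;
	/* Burns a full CPU until the holder lets go; in kernel context this
	 * thread never schedules and trips the soft-lockup watchdog after
	 * about 60 seconds ("BUG: soft lockup - CPU#N stuck for 67s!"). */
	pthread_spin_lock(&lock);
	pthread_spin_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t h, w;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&h, NULL, holder, NULL);
	sleep(1);			/* make sure the holder wins the lock */
	pthread_create(&w, NULL, waiter, NULL);
	pthread_join(h, NULL);
	pthread_join(w, NULL);
	return 0;
}

Run it and the waiter thread pins one CPU at 100% until the holder releases the lock; in kernel context that is exactly the "CPU#0 stuck for 67s" condition reported above.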
| Comments |
| Comment by Niu Yawei (Inactive) [ 11/Dec/12 ] |
|
Probably someone was holding the cl_loi_list_lock for too long and caused the soft lockup; unfortunately, there isn't any debug log or full stack trace from the client. test_24b is a newly added test, so it's hard to say whether this is a long-standing problem or was caused by recent commits. |
| Comment by Jinshan Xiong (Inactive) [ 11/Dec/12 ] |
|
Liang has fixed a similar issue, but I'm not sure whether it has landed. BTW, this may not be the loi_list_lock, because otherwise debug info would have been printed. |
| Comment by Niu Yawei (Inactive) [ 12/Dec/12 ] |
|
I think you were mentioning the
Looks like CLIENT_OBD_LIST_LOCK_DEBUG isn't enabled by default in b2_1; let's enable it to see if we can capture any useful log. |
| Comment by Niu Yawei (Inactive) [ 12/Dec/12 ] |
|
Enable CLIENT_OBD_LIST_LOCK_DEBUG by default: http://review.whamcloud.com/4806 |
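For readers unfamiliar with that debug option: conceptually it wraps the client-obd list spinlock so that each acquisition records who took it and when, and a waiter that has been spinning past a threshold can name the holder; the "lock ... was acquired by <ptlrpcd-brw:5975:brw_interpret:2299> for 5 seconds" output later in this ticket is that kind of report. Below is a minimal userspace sketch of the idea, using pthread primitives and hypothetical names, not the actual Lustre implementation:

/*
 * Illustrative userspace analogue of a debug-lock wrapper (hypothetical
 * names, pthread primitives -- NOT the actual Lustre code): record who
 * holds the lock and since when, and have a waiter that spins past a
 * threshold print the holder.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define LOCK_WARN_SECONDS 5

struct dbg_lock {
	pthread_spinlock_t lock;
	const char *holder_func;	/* function that currently holds it */
	int holder_line;
	time_t taken_at;		/* when it was taken */
};

static void dbg_lock_init(struct dbg_lock *l)
{
	pthread_spin_init(&l->lock, PTHREAD_PROCESS_PRIVATE);
	l->holder_func = NULL;
}

/* Always called through the macro below so __func__/__LINE__ get recorded. */
static void dbg_lock_acquire(struct dbg_lock *l, const char *func, int line)
{
	time_t start = time(NULL);

	while (pthread_spin_trylock(&l->lock) != 0) {
		if (time(NULL) - start > LOCK_WARN_SECONDS) {
			fprintf(stderr,
				"lock %p was acquired by <%s:%d> %ld seconds ago\n",
				(void *)l,
				l->holder_func ? l->holder_func : "unknown",
				l->holder_line,
				(long)(time(NULL) - l->taken_at));
			start = time(NULL);	/* rate-limit the warning */
		}
	}
	l->holder_func = func;
	l->holder_line = line;
	l->taken_at = time(NULL);
}

static void dbg_lock_release(struct dbg_lock *l)
{
	l->holder_func = NULL;
	pthread_spin_unlock(&l->lock);
}

#define dbg_lock(l)	dbg_lock_acquire((l), __func__, __LINE__)
#define dbg_unlock(l)	dbg_lock_release((l))

Reading holder_func from the waiter without holding the lock is racy, but for a diagnostic whose only job is to point at the long holder that is an acceptable trade-off; the real option additionally dumps both the holder's and the waiter's stacks, as the 19/Dec/12 log below shows.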
| Comment by Niu Yawei (Inactive) [ 12/Dec/12 ] |
|
This bug has only happened once on b2_1; maybe we should lower the priority and see whether it can be reproduced in the future (hopefully with a debug log and full stack trace). |
| Comment by Jian Yu [ 17/Dec/12 ] |
|
Lustre Tag: v2_1_4_RC1
The same issue occurred again: |
| Comment by Jian Yu [ 18/Dec/12 ] |
|
RHEL6.3/x86_64 (2.3.0 Server + 2.1.4 Client): |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/12 ] |
|
will fix it in |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/12 ] |
|
It still makes sense to land patch 4806 so it will be easier to find this kind of issue. |
| Comment by Jian Yu [ 19/Dec/12 ] |
|
Testing on Patch Set 2 build http://build.whamcloud.com/job/lustre-reviews/11215/ failed. After running recovery-small tests 24a and 24b, removing the test directories failed and unmounting the client hung. The console log on the client showed:

Lustre: DEBUG MARKER: /usr/sbin/lctl mark == recovery-small test 24b: test dirty page discard due to client eviction == 22:23:00 \(1355898180\)
Lustre: DEBUG MARKER: == recovery-small test 24b: test dirty page discard due to client eviction == 22:23:00 (1355898180)
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks osc stop
LustreError: 11-0: an error occurred while communicating with 10.10.4.209@tcp. The ost_write operation failed with -107
Lustre: lustre-OST0000-osc-ffff8800740c6000: Connection to lustre-OST0000 (at 10.10.4.209@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
Lustre: 5975:0:(llite_lib.c:2287:ll_dirty_page_discard_warn()) dirty page discard: 10.10.4.208@tcp:/lustre/fid: [0x200000400:0x9:0x0]//d0.recovery-small/d24/f24b-1 may get corrupted (rc -4)
LustreError: 5975:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8800745a1400 x1421761918796156/t0(0) o4->lustre-OST0000-osc-ffff8800740c6000@10.10.4.209@tcp:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
Lustre: ll_imp_inval:11228: lock ffff8800745b8738 was acquired by <ptlrpcd-brw:5975:brw_interpret:2299> for 5 seconds.
Lustre: ====== for process holding the lock =====
Pid: 5975, comm: ptlrpcd-brw
Call Trace:
[<ffffffff8106244a>] __cond_resched+0x2a/0x40
[<ffffffff814fea60>] _cond_resched+0x30/0x40
[<ffffffff811272e5>] __alloc_pages_nodemask+0x165/0x940
[<ffffffffa03ad831>] ? libcfs_debug_vmsg2+0x4e1/0xb60 [libcfs]
[<ffffffff8115c7ea>] alloc_pages_current+0xaa/0x110
[<ffffffff8112572e>] __get_free_pages+0xe/0x50
[<ffffffffa093b252>] ll_dirty_page_discard_warn+0x42/0x220 [lustre]
[<ffffffffa03adef1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa0938d6b>] ? ll_queue_done_writing+0xfb/0x300 [lustre]
[<ffffffffa0978dec>] vvp_page_completion_common+0xcc/0xf0 [lustre]
[<ffffffffa0978e5b>] vvp_page_completion_write_common+0x4b/0x60 [lustre]
[<ffffffffa097982e>] vvp_page_completion_write+0x10e/0x230 [lustre]
[<ffffffffa04f937f>] cl_page_completion+0xff/0x690 [obdclass]
[<ffffffffa0790c92>] osc_completion+0x162/0x3d0 [osc]
[<ffffffffa0774904>] osc_ap_completion+0x134/0x330 [osc]
[<ffffffffa0787e71>] brw_interpret+0x6e1/0x12d0 [osc]
[<ffffffffa0648a2b>] ? ptlrpc_unregister_bulk+0x1fb/0xae0 [ptlrpc]
[<ffffffffa0641d4b>] ptlrpc_check_set+0x29b/0x1b00 [ptlrpc]
[<ffffffffa03adef1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa0672fd0>] ptlrpcd_check+0x1a0/0x230 [ptlrpc]
[<ffffffffa067326e>] ptlrpcd+0x20e/0x370 [ptlrpc]
[<ffffffff810602c0>] ? default_wake_function+0x0/0x20
[<ffffffffa0673060>] ? ptlrpcd+0x0/0x370 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa0673060>] ? ptlrpcd+0x0/0x370 [ptlrpc]
[<ffffffffa0673060>] ? ptlrpcd+0x0/0x370 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Lustre: ====== for current process =====
Pid: 11228, comm: ll_imp_inval
Call Trace:
[<ffffffffa03a27f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0784d4f>] osc_import_event+0x55f/0x1cc0 [osc]
[<ffffffffa066dc49>] ptlrpc_invalidate_import+0x2a9/0x8d0 [ptlrpc]
[<ffffffff810602c0>] ? default_wake_function+0x0/0x20
[<ffffffff811abaef>] ? unshare_fs_struct+0x5f/0xb0
[<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
[<ffffffffa066e4df>] ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
[<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Lustre: ====== end =======
Lustre: 5975:0:(llite_lib.c:2287:ll_dirty_page_discard_warn()) dirty page discard: 10.10.4.208@tcp:/lustre/fid: [0x200000400:0xa:0x0]//d0.recovery-small/d24/f24b-2 may get corrupted (rc -108)
LustreError: 5977:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88007d243800 x1421761918796168/t0(0) o101->lustre-OST0000-osc-ffff8800740c6000@10.10.4.209@tcp:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Lustre: DEBUG MARKER: df /mnt/lustre; uname -n >> /mnt/lustre/recon
LustreError: 11297:0:(namei.c:1128:ll_objects_destroy()) obd destroy objid 0xb error -5
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 2>/dev/null || true
Lustre: DEBUG MARKER: /usr/sbin/lctl mark == recovery-small recovery-small.sh test complete, duration 65 sec == 22:23:07 \(1355898187\)
Lustre: DEBUG MARKER: == recovery-small recovery-small.sh test complete, duration 65 sec == 22:23:07 (1355898187)
LustreError: 11636:0:(namei.c:1128:ll_objects_destroy()) obd destroy objid 0x5 error -5
Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-small : @@@@@@ FAIL: remove sub-test dirs failed
Lustre: DEBUG MARKER: recovery-small : @@@@@@ FAIL: remove sub-test dirs failed
Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/yujian/test_logs/2012-12-18/221902/recovery-small..debug_log.$(hostname -s).1355898187.log;
dmesg > /home/yujian/test_logs/2012-12-18/221902/recovery-small..dmesg.$(hostname -s).1355898187.log
Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
if [ $running -ne 0 ] ; then
echo Stopping client $(hostname) /mnt/lustre opts:;
lsof -t /mnt/lustre || need_kill=no;
if [ x != x -a x$need_kill != xno ]; then
pids=$(lsof -t /mnt/lustre | sort -u);
INFO: task ll_imp_inval:11228 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ll_imp_inval D 0000000000000000 0 11228 2 0x00000080
ffff880073f91d20 0000000000000046 ffffffffa079ccdc 0000000000000000
ffff880073f91d90 000000008106c04f ffffffffa07ab370 ffff88007d2f9449
ffff88007433dab8 ffff880073f91fd8 000000000000fb88 ffff88007433dab8
Call Trace:
[<ffffffff814fee92>] schedule_timeout+0x192/0x2e0
[<ffffffff8107e1c0>] ? process_timeout+0x0/0x10
[<ffffffffa03a35ff>] cfs_pause+0x1f/0x30 [libcfs]
[<ffffffffa0784dad>] osc_import_event+0x5bd/0x1cc0 [osc]
[<ffffffffa066dc49>] ptlrpc_invalidate_import+0x2a9/0x8d0 [ptlrpc]
[<ffffffff810602c0>] ? default_wake_function+0x0/0x20
[<ffffffff811abaef>] ? unshare_fs_struct+0x5f/0xb0
[<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
[<ffffffffa066e4df>] ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
[<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
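The "====== for process holding the lock =====" stack above is the telling part: brw_interpret() appears to still hold the lock while ll_dirty_page_discard_warn() allocates memory via __get_free_pages(), and the allocator reschedules (__cond_resched), so ll_imp_inval has to spin in osc_import_event() the whole time. As a minimal userspace sketch of that anti-pattern and the usual shape of the fix (allocate before taking the lock), with hypothetical names and assuming nothing about the actual patch:

/*
 * Minimal userspace sketch (hypothetical names, NOT the actual patch) of
 * the anti-pattern visible in the holder's stack above: a potentially
 * blocking allocation performed while a spinlock is held, versus the
 * usual fix of allocating first and holding the lock only for the short,
 * non-blocking part.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_spinlock_t list_lock;	/* stands in for the client-obd list lock;
					 * pthread_spin_init() omitted for brevity */

/* Anti-pattern: slow, possibly blocking allocation under the spinlock. */
static void completion_handler_bad(void)
{
	pthread_spin_lock(&list_lock);
	char *buf = malloc(4096);	/* may block; waiters keep spinning */
	if (buf) {
		snprintf(buf, 4096, "dirty page discard warning");
		free(buf);
	}
	pthread_spin_unlock(&list_lock);
}

/* Fix: do the allocation before taking the lock. */
static void completion_handler_good(void)
{
	char *buf = malloc(4096);	/* allocate outside the lock */

	pthread_spin_lock(&list_lock);
	/* ... only short, non-blocking bookkeeping here ... */
	pthread_spin_unlock(&list_lock);

	if (buf) {
		snprintf(buf, 4096, "dirty page discard warning");
		free(buf);
	}
}

In kernel context the same idea applies: either move the allocation out from under the spinlock or use a non-sleeping allocation, so that the import-invalidation thread can make progress instead of spinning.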
|
| Comment by Niu Yawei (Inactive) [ 19/Dec/12 ] |
|
Hi Yujian, did you try the fix by Jinshan in |
| Comment by Jian Yu [ 19/Dec/12 ] |
|
bq. Hi, Yujian, did you try the fix by Jinshan in
Yes, I just finished verifying the patch http://review.whamcloud.com/4866 for |
| Comment by Andreas Dilger [ 19/Dec/12 ] |
|
Duplicate of |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/12 ] |
|
Reopening the ticket because I want to land patch 4806. |
| Comment by Andreas Dilger [ 29/Sep/15 ] |
|
Should http://review.whamcloud.com/4806 be abandoned and this ticket be closed?