[LU-2451] recovery-small test_24b: BUG: soft lockup - CPU#0 stuck for 67s! [ll_imp_inval:4791] Created: 10/Dec/12  Updated: 22/Dec/15  Resolved: 22/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Niu Yawei (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/148
Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.14.1.el6)
Network: TCP (1GigE)


Issue Links:
Duplicate
duplicates LU-2505 lfsck: BUG: soft lockup - CPU#0 stuck... Resolved
Severity: 3
Rank (Obsolete): 5791

 Description   

The recovery-small test 24b hung due to the following soft lockup on the client:

18:45:08:Lustre: DEBUG MARKER: == recovery-small test 24b: test dirty page discard due to client eviction == 18:45:02 (1354934702)
18:45:08:Lustre: DEBUG MARKER: cancel_lru_locks osc start
18:45:08:Lustre: DEBUG MARKER: cancel_lru_locks osc stop
18:45:08:LustreError: 11-0: an error occurred while communicating with 10.10.4.151@tcp. The ost_write operation failed with -107
18:45:08:LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
18:45:09:Lustre: 20511:0:(llite_lib.c:2285:ll_dirty_page_discard_warn()) dirty page discard: 10.10.4.150@tcp:/lustre/fid: [0x200000401:0x42:0x0]//d0.recovery-small/d24/f24b-1 may get corrupted (rc -4)
18:45:09:LustreError: 20511:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff88007b678c00 x1420751262452328/t0(0) o4->lustre-OST0000-osc-ffff88007b271800@10.10.4.151@tcp:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
18:46:20:BUG: soft lockup - CPU#0 stuck for 67s! [ll_imp_inval:4791]
18:46:20:Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
18:46:22:CPU 0 
18:46:22:Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
18:46:22:
18:46:22:Pid: 4791, comm: ll_imp_inval Not tainted 2.6.32-279.14.1.el6.x86_64 #1 Red Hat KVM
18:46:22:RIP: 0010:[<ffffffff8150098e>]  [<ffffffff8150098e>] _spin_lock+0x1e/0x30
18:46:22:RSP: 0018:ffff88007b60de00  EFLAGS: 00000206
18:46:22:RAX: 0000000000000001 RBX: ffff88007b60de00 RCX: 0000000000000000
18:46:22:RDX: 0000000000000000 RSI: ffffffffa04dfce6 RDI: ffff88007c616734
18:46:22:RBP: ffffffff8100bc0e R08: 00000000ffffff0a R09: 0000000000000000
18:46:23:R10: 000000000000000f R11: 000000000000000f R12: 0000000000000010
18:46:23:R13: ffffffffa0848ef1 R14: ffff88007b60ddc0 R15: 0000000000000000
18:46:24:FS:  00007f65fc14a700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
18:46:24:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
18:46:24:CR2: 000000000208b3c0 CR3: 000000003796c000 CR4: 00000000000006f0
18:46:24:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
18:46:24:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
18:46:24:Process ll_imp_inval (pid: 4791, threadinfo ffff88007b60c000, task ffff88007b90b500)
18:46:24:Stack:
18:46:24: ffff88007b60de60 ffffffffa090dead 0000000000000010 ffff880000808003
18:46:24:<d> ffff88007b60de30 000000018109252c ffff88007b60de60 ffff880076109800
18:46:24:<d> ffff88007c616138 ffffffff00000100 ebc0de0100000000 0000000000004e1a
18:46:24:Call Trace:
18:46:25: [<ffffffffa090dead>] ? osc_import_event+0x3ad/0x1470 [osc]
18:46:25: [<ffffffffa0687b01>] ? ptlrpc_invalidate_import+0x2d1/0x910 [ptlrpc]
18:46:25: [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
18:46:25: [<ffffffff811abaef>] ? unshare_fs_struct+0x5f/0xb0
18:46:25: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:25: [<ffffffffa06883af>] ? ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc]
18:46:25: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
18:46:25: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:25: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:25: [<ffffffff8100c140>] ? child_rip+0x0/0x20
18:46:26:Code: 00 00 00 01 74 05 e8 62 e0 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 3e 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 
18:46:26:Call Trace:
18:46:26: [<ffffffffa090dead>] ? osc_import_event+0x3ad/0x1470 [osc]
18:46:26: [<ffffffffa0687b01>] ? ptlrpc_invalidate_import+0x2d1/0x910 [ptlrpc]
18:46:26: [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
18:46:26: [<ffffffff811abaef>] ? unshare_fs_struct+0x5f/0xb0
18:46:26: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:26: [<ffffffffa06883af>] ? ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc]
18:46:26: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
18:46:27: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:27: [<ffffffffa0688360>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
18:46:27: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/1d70d272-41da-11e2-adcf-52540035b04c



 Comments   
Comment by Niu Yawei (Inactive) [ 11/Dec/12 ]

Probably someone was holding the cl_loi_list_lock for too long, which caused the soft lockup; unfortunately, there isn't any debug log or full stack trace from the client.

test_24b is a newly added test, so it's hard to say whether this is a long-standing problem or something caused by recent commits.
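
For illustration, the suspected failure mode looks like this (a minimal sketch in plain kernel C, not the Lustre source; the helper name is hypothetical):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);      /* stands in for cl_loi_list_lock */

void do_long_running_work(void);        /* hypothetical slow operation */

/* Thread A: holds the spinlock across a long-running operation. */
static void lock_holder(void)
{
        spin_lock(&demo_lock);
        do_long_running_work();         /* anything slow here starves waiters */
        spin_unlock(&demo_lock);
}

/* Thread B (here, ll_imp_inval): spins in _spin_lock() with preemption
 * disabled; after about a minute the watchdog fires "BUG: soft lockup",
 * with RIP in _spin_lock() exactly as in the trace above. */
static void lock_waiter(void)
{
        spin_lock(&demo_lock);          /* stuck until thread A releases */
        spin_unlock(&demo_lock);
}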

Comment by Jinshan Xiong (Inactive) [ 11/Dec/12 ]

Liang has fixed a similar issue, but I'm not sure whether it has landed. BTW, this may not be the loi_list_lock, because otherwise debug info would have been printed.

Comment by Niu Yawei (Inactive) [ 12/Dec/12 ]

I think you were referring to LU-2263 and LU-2327; they are not fixed yet.

It looks like CLIENT_OBD_LIST_LOCK_DEBUG isn't enabled by default on b2_1; let's enable it to see whether we can capture any useful logs.

Comment by Niu Yawei (Inactive) [ 12/Dec/12 ]

enable CLIENT_OBD_LIST_LOCK_DEBUG by default: http://review.whamcloud.com/4806
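
For reference, a CLIENT_OBD_LIST_LOCK_DEBUG-style wrapper works roughly as sketched below (simplified and illustrative; the real implementation lives in the Lustre tree): the lock remembers its holder and acquisition site, so a waiter that has spun too long can print who is holding it instead of soft-locking silently.

#include <linux/spinlock.h>
#include <linux/jiffies.h>
#include <linux/sched.h>

struct dbg_lock {
        spinlock_t          lock;
        struct task_struct *task;       /* current holder, NULL when free */
        const char         *func;       /* call site that took the lock */
        int                 line;
        unsigned long       time;       /* jiffies when acquired */
};

static void dbg_lock_acquire(struct dbg_lock *l, const char *func, int line)
{
        unsigned long start = jiffies;

        while (!spin_trylock(&l->lock)) {
                if (time_after(jiffies, start + 5 * HZ) && l->task != NULL) {
                        /* emits "<comm:pid:func:line> ... for N seconds" */
                        printk(KERN_WARNING "lock %p acquired by <%s:%d:%s:%d> for %lu seconds\n",
                               l, l->task->comm, l->task->pid, l->func,
                               l->line, (jiffies - l->time) / HZ);
                        start = jiffies;        /* rate-limit the warning */
                }
                cpu_relax();
        }
        l->task = current;
        l->func = func;
        l->line = line;
        l->time = jiffies;
}

static void dbg_lock_release(struct dbg_lock *l)
{
        l->task = NULL;
        spin_unlock(&l->lock);
}

A message of exactly this shape ("lock ... was acquired by <ptlrpcd-brw:5975:brw_interpret:2299> for 5 seconds") shows up in the log from the patched build further down.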

Comment by Niu Yawei (Inactive) [ 12/Dec/12 ]

This bug has only happened once on b2_1; maybe we should lower the priority and see if it can be reproduced in the future (hopefully with a debug log and full stack trace).

Comment by Jian Yu [ 17/Dec/12 ]

Lustre Tag: v2_1_4_RC1
Distro/Arch: RHEL6.3/x86_64

The same issue occurred again:
https://maloo.whamcloud.com/test_sets/670ac2da-4829-11e2-a406-52540035b04c

Comment by Jian Yu [ 18/Dec/12 ]

RHEL6.3/x86_64 (2.3.0 Server + 2.1.4 Client):
https://maloo.whamcloud.com/test_sets/bd3b0534-486f-11e2-8cdc-52540035b04c

Comment by Jinshan Xiong (Inactive) [ 19/Dec/12 ]

Will fix it in LU-2505.

Comment by Jinshan Xiong (Inactive) [ 19/Dec/12 ]

It still makes sense to land patch 4806, so it will be easier to find this kind of issue in the future.

Comment by Jian Yu [ 19/Dec/12 ]

enable CLIENT_OBD_LIST_LOCK_DEBUG by default: http://review.whamcloud.com/4806

Testing on the Patch Set 2 build http://build.whamcloud.com/job/lustre-reviews/11215/ failed.

After running recovery-small tests 24a and 24b, removing the test directories failed and unmounting the client hung.

The console log on the client showed:

Lustre: DEBUG MARKER: /usr/sbin/lctl mark == recovery-small test 24b: test dirty page discard due to client eviction == 22:23:00 \(1355898180\)
Lustre: DEBUG MARKER: == recovery-small test 24b: test dirty page discard due to client eviction == 22:23:00 (1355898180)
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks osc stop
LustreError: 11-0: an error occurred while communicating with 10.10.4.209@tcp. The ost_write operation failed with -107
Lustre: lustre-OST0000-osc-ffff8800740c6000: Connection to lustre-OST0000 (at 10.10.4.209@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
Lustre: 5975:0:(llite_lib.c:2287:ll_dirty_page_discard_warn()) dirty page discard: 10.10.4.208@tcp:/lustre/fid: [0x200000400:0x9:0x0]//d0.recovery-small/d24/f24b-1 may get corrupted (rc -4)
LustreError: 5975:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8800745a1400 x1421761918796156/t0(0) o4->lustre-OST0000-osc-ffff8800740c6000@10.10.4.209@tcp:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
Lustre: ll_imp_inval:11228: lock ffff8800745b8738 was acquired by <ptlrpcd-brw:5975:brw_interpret:2299> for 5 seconds.
Lustre: ====== for process holding the lock =====
Pid: 5975, comm: ptlrpcd-brw

Call Trace:
 [<ffffffff8106244a>] __cond_resched+0x2a/0x40
 [<ffffffff814fea60>] _cond_resched+0x30/0x40
 [<ffffffff811272e5>] __alloc_pages_nodemask+0x165/0x940
 [<ffffffffa03ad831>] ? libcfs_debug_vmsg2+0x4e1/0xb60 [libcfs]
 [<ffffffff8115c7ea>] alloc_pages_current+0xaa/0x110
 [<ffffffff8112572e>] __get_free_pages+0xe/0x50
 [<ffffffffa093b252>] ll_dirty_page_discard_warn+0x42/0x220 [lustre]
 [<ffffffffa03adef1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0938d6b>] ? ll_queue_done_writing+0xfb/0x300 [lustre]
 [<ffffffffa0978dec>] vvp_page_completion_common+0xcc/0xf0 [lustre]
 [<ffffffffa0978e5b>] vvp_page_completion_write_common+0x4b/0x60 [lustre]
 [<ffffffffa097982e>] vvp_page_completion_write+0x10e/0x230 [lustre]
 [<ffffffffa04f937f>] cl_page_completion+0xff/0x690 [obdclass]
 [<ffffffffa0790c92>] osc_completion+0x162/0x3d0 [osc]
 [<ffffffffa0774904>] osc_ap_completion+0x134/0x330 [osc]
 [<ffffffffa0787e71>] brw_interpret+0x6e1/0x12d0 [osc]
 [<ffffffffa0648a2b>] ? ptlrpc_unregister_bulk+0x1fb/0xae0 [ptlrpc]
 [<ffffffffa0641d4b>] ptlrpc_check_set+0x29b/0x1b00 [ptlrpc]
 [<ffffffffa03adef1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0672fd0>] ptlrpcd_check+0x1a0/0x230 [ptlrpc]
 [<ffffffffa067326e>] ptlrpcd+0x20e/0x370 [ptlrpc]
 [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0673060>] ? ptlrpcd+0x0/0x370 [ptlrpc]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa0673060>] ? ptlrpcd+0x0/0x370 [ptlrpc]
 [<ffffffffa0673060>] ? ptlrpcd+0x0/0x370 [ptlrpc]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
 
Lustre: ====== for current process =====
Pid: 11228, comm: ll_imp_inval

Call Trace:
 [<ffffffffa03a27f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0784d4f>] osc_import_event+0x55f/0x1cc0 [osc]
 [<ffffffffa066dc49>] ptlrpc_invalidate_import+0x2a9/0x8d0 [ptlrpc]
 [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
 [<ffffffff811abaef>] ? unshare_fs_struct+0x5f/0xb0
 [<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
 [<ffffffffa066e4df>] ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
 [<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

Lustre: ====== end =======
Lustre: 5975:0:(llite_lib.c:2287:ll_dirty_page_discard_warn()) dirty page discard: 10.10.4.208@tcp:/lustre/fid: [0x200000400:0xa:0x0]//d0.recovery-small/d24/f24b-2 may get corrupted (rc -108)
LustreError: 5977:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff88007d243800 x1421761918796168/t0(0) o101->lustre-OST0000-osc-ffff8800740c6000@10.10.4.209@tcp:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Lustre: DEBUG MARKER: df /mnt/lustre; uname -n >> /mnt/lustre/recon
LustreError: 11297:0:(namei.c:1128:ll_objects_destroy()) obd destroy objid 0xb error -5
Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 2>/dev/null || true
Lustre: DEBUG MARKER: /usr/sbin/lctl mark == recovery-small recovery-small.sh test complete, duration 65 sec == 22:23:07 \(1355898187\)
Lustre: DEBUG MARKER: == recovery-small recovery-small.sh test complete, duration 65 sec == 22:23:07 (1355898187)
LustreError: 11636:0:(namei.c:1128:ll_objects_destroy()) obd destroy objid 0x5 error -5
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small : @@@@@@ FAIL: remove sub-test dirs failed
Lustre: DEBUG MARKER: recovery-small : @@@@@@ FAIL: remove sub-test dirs failed
Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/yujian/test_logs/2012-12-18/221902/recovery-small..debug_log.$(hostname -s).1355898187.log;
         dmesg > /home/yujian/test_logs/2012-12-18/221902/recovery-small..dmesg.$(hostname -s).1355898187.log
Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
if [ $running -ne 0 ] ; then
echo Stopping client $(hostname) /mnt/lustre opts:;
lsof -t /mnt/lustre || need_kill=no;
if [ x != x -a x$need_kill != xno ]; then
    pids=$(lsof -t /mnt/lustre | sort -u);

INFO: task ll_imp_inval:11228 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ll_imp_inval  D 0000000000000000     0 11228      2 0x00000080
 ffff880073f91d20 0000000000000046 ffffffffa079ccdc 0000000000000000
 ffff880073f91d90 000000008106c04f ffffffffa07ab370 ffff88007d2f9449
 ffff88007433dab8 ffff880073f91fd8 000000000000fb88 ffff88007433dab8
Call Trace:
 [<ffffffff814fee92>] schedule_timeout+0x192/0x2e0
 [<ffffffff8107e1c0>] ? process_timeout+0x0/0x10
 [<ffffffffa03a35ff>] cfs_pause+0x1f/0x30 [libcfs]
 [<ffffffffa0784dad>] osc_import_event+0x5bd/0x1cc0 [osc]
 [<ffffffffa066dc49>] ptlrpc_invalidate_import+0x2a9/0x8d0 [ptlrpc]
 [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
 [<ffffffff811abaef>] ? unshare_fs_struct+0x5f/0xb0
 [<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
 [<ffffffffa066e4df>] ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
 [<ffffffffa066e490>] ? ptlrpc_invalidate_import_thread+0x0/0x2f0 [ptlrpc]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
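
The pair of traces above pins down the lockup: ptlrpcd-brw takes the client-obd list lock in brw_interpret() and, still holding it, ll_dirty_page_discard_warn() allocates memory via __get_free_pages(GFP_KERNEL), which may reschedule inside __alloc_pages_nodemask(); meanwhile ll_imp_inval waits for the same lock in osc_import_event(). A sketch of the pattern and one common way to restructure it (illustrative names only; not necessarily what the LU-2505 fix does):

#include <linux/spinlock.h>
#include <linux/gfp.h>

struct demo_obj {
        spinlock_t list_lock;   /* stands in for the client-obd list lock */
};

void format_discard_warning(struct demo_obj *obj, char *buf);  /* hypothetical */

/* Problem: a GFP_KERNEL allocation can sleep, so doing it under a
 * spinlock can stall every waiter on that lock indefinitely. */
static void completion_bad(struct demo_obj *obj)
{
        char *buf;

        spin_lock(&obj->list_lock);
        buf = (char *)__get_free_page(GFP_KERNEL);      /* BAD: may sleep */
        if (buf) {
                format_discard_warning(obj, buf);
                free_page((unsigned long)buf);
        }
        spin_unlock(&obj->list_lock);
}

/* One fix pattern: move the sleeping allocation out from under the lock
 * (GFP_ATOMIC is the fallback if it truly must stay inside). */
static void completion_good(struct demo_obj *obj)
{
        char *buf = (char *)__get_free_page(GFP_KERNEL); /* may sleep: fine here */

        spin_lock(&obj->list_lock);
        /* only non-sleeping, lock-protected bookkeeping here */
        spin_unlock(&obj->list_lock);

        if (buf) {
                format_discard_warning(obj, buf);
                free_page((unsigned long)buf);
        }
}
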
Comment by Niu Yawei (Inactive) [ 19/Dec/12 ]

Hi, Yujian, did you try the fix by Jinshan in LU-2505? I think this is a duplicate of LU-2505.

Comment by Jian Yu [ 19/Dec/12 ]

> Hi, Yujian, did you try the fix by Jinshan in LU-2505? I think this is a duplicate of LU-2505.

Yes, I just finished verifying patch http://review.whamcloud.com/4866 for LU-2505. Both recovery-small test 24 and lfsck passed:
https://maloo.whamcloud.com/test_sets/20b2cac6-49b4-11e2-bef8-52540035b04c
https://maloo.whamcloud.com/test_sets/24c38b8c-49b4-11e2-bef8-52540035b04c

Comment by Andreas Dilger [ 19/Dec/12 ]

Duplicate of LU-2505.

Comment by Jinshan Xiong (Inactive) [ 19/Dec/12 ]

Reopening the ticket because I want to land patch 4806.

Comment by Andreas Dilger [ 29/Sep/15 ]

Should http://review.whamcloud.com/4806 be abandoned and this ticket be closed?
