Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Critical
Description
This test seems to hang for me relatively frequently:
== recovery-small test 66: lock enqueue re-send vs client eviction == 21:24:18 (1438219458)
fail_loc=0x8000030c
  File: `/mnt/lustre/f66.recovery-small'
  Size: 0            Blocks: 0          IO Block: 4194304   regular empty file
Device: 2c54f966h/743766374d    Inode: 144115809252278274   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-07-29 21:24:19.000000000 -0400
Modify: 2015-07-29 21:24:19.000000000 -0400
Change: 2015-07-29 21:24:19.000000000 -0400
fail_loc=0
fail_loc=0x80000136
mdc.lustre-MDT0000-mdc-ffff8800815b17f0.import=connection=192.168.10.211@tcp
./../tests/test-framework.sh: line 2399: /mnt/lustre/recon: Input/output error
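For reference, the fail_loc lines in the output above are Lustre's fault-injection points being armed and cleared by the test script. A minimal sketch of doing the same by hand (assuming a standard lctl installation; treating the 0x80000000 bit as the usual "fire once" flag is my reading of the convention, not something stated in this report):

# arm a one-shot fault-injection point on the target node
lctl set_param fail_loc=0x8000030c
# ... perform the operation that should trip the injected failure ...
# clear the injection point again
lctl set_param fail_loc=0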
In dmesg we get:
[195769.388166] Lustre: DEBUG MARKER: == recovery-small test 66: lock enqueue re-send vs client eviction == 21:24:18 (1438219458)
[195773.859644] Lustre: 17831:0:(genops.c:1491:obd_export_evict_by_uuid()) lustre-MDT0000: evicting f797195c-b579-5409-6da6-f79d52d8f961 at adminstrative request
[195773.860464] Lustre: 17831:0:(genops.c:1491:obd_export_evict_by_uuid()) Skipped 1 previous similar message
[195773.906764] LustreError: 17610:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 136 sleeping for 40000ms
[195809.905048] Lustre: lustre-OST0001: haven't heard from client 9007d4b0-ffa9-1973-fae0-b8ec6df9deb5 (at 0@lo) in 48 seconds. I think it's dead, and I am evicting it. exp ffff88003bea07f0, cur 1438219499 expire 1438219469 last 1438219451
[195813.904060] LustreError: 17610:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 136 awake
[195813.904062] LNet: Service thread pid 17610 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[195813.904064] Pid: 17610, comm: mdt01_003
[195813.904065]
[195813.904065] Call Trace:
[195813.904075]  [<ffffffff81522574>] ? _spin_lock_irqsave+0x24/0x30
[195813.904079]  [<ffffffff810866a2>] ? del_timer_sync+0x22/0x30
[195813.904081]  [<ffffffff8151f279>] ? schedule_timeout+0x199/0x2e0
[195813.904091]  [<ffffffffa0909c41>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[195813.904102]  [<ffffffffa0906c4c>] ? __cfs_fail_timeout_set+0x13c/0x160 [libcfs]
[195813.904116]  [<ffffffffa0c113a5>] ? mdt_getattr_name_lock+0x475/0x1910 [mdt]
[195813.904125]  [<ffffffffa0c16cd3>] ? old_init_ucred+0x1e3/0x390 [mdt]
[195813.904133]  [<ffffffffa0c12d62>] ? mdt_intent_getattr+0x292/0x470 [mdt]
[195813.904141]  [<ffffffffa0c02674>] ? mdt_intent_policy+0x494/0xc40 [mdt]
[195813.904169]  [<ffffffffa053ef6f>] ? ldlm_lock_enqueue+0x12f/0x860 [ptlrpc]
[195813.904194]  [<ffffffffa056ab97>] ? ldlm_handle_enqueue0+0x7e7/0x1520 [ptlrpc]
[195813.904227]  [<ffffffffa05f1111>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
[195813.904256]  [<ffffffffa05f1d22>] ? tgt_request_handle+0xa42/0x1230 [ptlrpc]
[195813.904282]  [<ffffffffa059d5e4>] ? ptlrpc_main+0xd74/0x1850 [ptlrpc]
[195813.904284]  [<ffffffff8152240e>] ? _spin_unlock_irq+0xe/0x20
[195813.904287]  [<ffffffff81061829>] ? finish_task_switch+0x59/0xe0
[195813.904289]  [<ffffffff8151e3ee>] ? thread_return+0x4e/0x7e0
[195813.904314]  [<ffffffffa059c870>] ? ptlrpc_main+0x0/0x1850 [ptlrpc]
[195813.904317]  [<ffffffff8109ce4e>] ? kthread+0x9e/0xc0
[195813.904319]  [<ffffffff8100c24a>] ? child_rip+0xa/0x20
[195813.904321]  [<ffffffff8109cdb0>] ? kthread+0x0/0xc0
[195813.904322]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
[195813.904323]
[195813.904325] LustreError: dumping log to /tmp/lustre-log.1438219503.17610
[195813.913595] LNet: Service thread pid 17610 completed after 40.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
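In case anyone wants to try to reproduce this, a minimal sketch of rerunning just this test with the in-tree test framework (assuming an already-configured test setup; ONLY=66 is the framework's standard way of selecting a single test, not something taken from this report):

cd lustre/tests
# run only test_66 of recovery-small against the configured test filesystem
ONLY=66 bash recovery-small.sh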