[LU-6927] recovery_small test_66 failure Created: 30/Jul/15 Updated: 10/Oct/21 Resolved: 10/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This seems to be hanging for me relatively frequently:

== recovery-small test 66: lock enqueue re-send vs client eviction == 21:24:18 (1438219458)
fail_loc=0x8000030c
File: `/mnt/lustre/f66.recovery-small'
Size: 0 Blocks: 0 IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d Inode: 144115809252278274 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2015-07-29 21:24:19.000000000 -0400
Modify: 2015-07-29 21:24:19.000000000 -0400
Change: 2015-07-29 21:24:19.000000000 -0400
fail_loc=0
fail_loc=0x80000136
mdc.lustre-MDT0000-mdc-ffff8800815b17f0.import=connection=192.168.10.211@tcp
./../tests/test-framework.sh: line 2399: /mnt/lustre/recon: Input/output error

In dmesg we get:

[195769.388166] Lustre: DEBUG MARKER: == recovery-small test 66: lock enqueue re-send vs client eviction == 21:24:18 (1438219458)
[195773.859644] Lustre: 17831:0:(genops.c:1491:obd_export_evict_by_uuid()) lustre-MDT0000: evicting f797195c-b579-5409-6da6-f79d52d8f961 at adminstrative request
[195773.860464] Lustre: 17831:0:(genops.c:1491:obd_export_evict_by_uuid()) Skipped 1 previous similar message
[195773.906764] LustreError: 17610:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 136 sleeping for 40000ms
[195809.905048] Lustre: lustre-OST0001: haven't heard from client 9007d4b0-ffa9-1973-fae0-b8ec6df9deb5 (at 0@lo) in 48 seconds. I think it's dead, and I am evicting it. exp ffff88003bea07f0, cur 1438219499 expire 1438219469 last 1438219451
[195813.904060] LustreError: 17610:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 136 awake
[195813.904062] LNet: Service thread pid 17610 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[195813.904064] Pid: 17610, comm: mdt01_003
[195813.904065]
[195813.904065] Call Trace:
[195813.904075] [<ffffffff81522574>] ? _spin_lock_irqsave+0x24/0x30
[195813.904079] [<ffffffff810866a2>] ? del_timer_sync+0x22/0x30
[195813.904081] [<ffffffff8151f279>] ? schedule_timeout+0x199/0x2e0
[195813.904091] [<ffffffffa0909c41>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[195813.904102] [<ffffffffa0906c4c>] ? __cfs_fail_timeout_set+0x13c/0x160 [libcfs]
[195813.904116] [<ffffffffa0c113a5>] ? mdt_getattr_name_lock+0x475/0x1910 [mdt]
[195813.904125] [<ffffffffa0c16cd3>] ? old_init_ucred+0x1e3/0x390 [mdt]
[195813.904133] [<ffffffffa0c12d62>] ? mdt_intent_getattr+0x292/0x470 [mdt]
[195813.904141] [<ffffffffa0c02674>] ? mdt_intent_policy+0x494/0xc40 [mdt]
[195813.904169] [<ffffffffa053ef6f>] ? ldlm_lock_enqueue+0x12f/0x860 [ptlrpc]
[195813.904194] [<ffffffffa056ab97>] ? ldlm_handle_enqueue0+0x7e7/0x1520 [ptlrpc]
[195813.904227] [<ffffffffa05f1111>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
[195813.904256] [<ffffffffa05f1d22>] ? tgt_request_handle+0xa42/0x1230 [ptlrpc]
[195813.904282] [<ffffffffa059d5e4>] ? ptlrpc_main+0xd74/0x1850 [ptlrpc]
[195813.904284] [<ffffffff8152240e>] ? _spin_unlock_irq+0xe/0x20
[195813.904287] [<ffffffff81061829>] ? finish_task_switch+0x59/0xe0
[195813.904289] [<ffffffff8151e3ee>] ? thread_return+0x4e/0x7e0
[195813.904314] [<ffffffffa059c870>] ? ptlrpc_main+0x0/0x1850 [ptlrpc]
[195813.904317] [<ffffffff8109ce4e>] ? kthread+0x9e/0xc0
[195813.904319] [<ffffffff8100c24a>] ? child_rip+0xa/0x20
[195813.904321] [<ffffffff8109cdb0>] ? kthread+0x0/0xc0
[195813.904322] [<ffffffff8100c240>] ? child_rip+0x0/0x20
[195813.904323]
[195813.904325] LustreError: dumping log to /tmp/lustre-log.1438219503.17610
[195813.913595] LNet: Service thread pid 17610 completed after 40.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). |
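For context on the 40 s stall visible in the log above: the test arms a fail_loc on the MDS, and the matching injection point in the MDT getattr/enqueue path sleeps until the timeout expires. Below is a minimal, self-contained C sketch of that fail_loc-timeout idea; it is not the libcfs implementation, and the names (fail_loc, fail_timeout) and the mask check are illustrative assumptions only.

#include <stdio.h>
#include <unistd.h>

static unsigned int fail_loc;  /* in the real test this is set via "lctl set_param fail_loc=..." */

/* If an injection is armed for this site id, stall the calling thread. */
static void fail_timeout(unsigned int id, unsigned int ms)
{
    if ((fail_loc & 0xffff) != id)          /* nothing armed for this site */
        return;
    printf("cfs_fail_timeout id %x sleeping for %ums\n", id, ms);
    usleep(ms * 1000);                      /* the service thread stalls here */
    printf("cfs_fail_timeout id %x awake\n", id);
}

int main(void)
{
    fail_loc = 0x80000136;                  /* what test_66 sets on the MDS */
    fail_timeout(0x136, 500);               /* stand-in for the 40000 ms sleep */
    return 0;
}

While the injected sleep holds the lock on the server, the client's enqueue times out and is re-sent, which is the race the test is trying to provoke against the eviction.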
| Comments |
| Comment by Oleg Drokin [ 30/Jul/15 ] |
|
On the client we can see this in the process list:

17837 pts/0 S+ 0:00 sh -c df /mnt/lustre; uname -n >> /mnt/lustre/recon
17838 pts/0 S+ 0:01 df /mnt/lustre

This is inside test-framework's client_reconnect_try() call. |
| Comment by Oleg Drokin [ 30/Jul/15 ] |
|
I tried to do touch /mnt/lustre/file and that ended up in an I/O error too, and I got this extra dmesg stuff:

[240838.112047] Lustre: 17905:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1438264520/real 1438264520] req@ffff8800999fcce8 x1508080398450504/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 872/1208 e 0 to 1 dl 1438264527 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
[240838.113580] Lustre: 17905:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 89 previous similar messages
[240838.114146] Lustre: lustre-MDT0000-mdc-ffff8800815b17f0: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[240838.115003] Lustre: Skipped 10 previous similar messages
[240838.115661] Lustre: lustre-MDT0000: Client f797195c-b579-5409-6da6-f79d52d8f961 (at 0@lo) reconnecting
[240838.116431] Lustre: Skipped 13 previous similar messages
[240838.116910] Lustre: lustre-MDT0000-mdc-ffff8800815b17f0: Connection restored to lustre-MDT0000 (at 0@lo)
[240838.117511] Lustre: Skipped 12 previous similar messages
[240871.112044] LNet: Service thread pid 17527 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[240871.112880] Pid: 17527, comm: mdt00_000
[240871.113119]
[240871.113119] Call Trace:
[240871.113612] [<ffffffffa05396d0>] ? _ldlm_lock_debug+0x300/0x690 [ptlrpc]
[240871.113940] [<ffffffff8151f271>] schedule_timeout+0x191/0x2e0
[240871.114250] [<ffffffff81085b40>] ? process_timeout+0x0/0x10
[240871.114570] [<ffffffffa055a8c0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc]
[240871.115063] [<ffffffffa055f509>] ldlm_completion_ast+0x609/0x9b0 [ptlrpc]
[240871.115362] [<ffffffff81061630>] ? default_wake_function+0x0/0x20
[240871.115661] [<ffffffffa055e86e>] ldlm_cli_enqueue_local+0x21e/0x8b0 [ptlrpc]
[240871.116143] [<ffffffffa055ef00>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
[240871.116452] [<ffffffffa0bfe0d0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
[240871.116841] [<ffffffffa0c0a814>] mdt_object_local_lock+0x3c4/0xa80 [mdt]
[240871.117140] [<ffffffffa0bfe0d0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
[240871.117451] [<ffffffffa055ef00>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
[240871.117752] [<ffffffffa0c0b20b>] mdt_object_lock_internal+0x5b/0x290 [mdt]
[240871.118061] [<ffffffffa0c0b501>] mdt_object_lock+0x11/0x20 [mdt]
[240871.118370] [<ffffffffa0c2f5f2>] mdt_reint_open+0x772/0x2d80 [mdt]
[240871.118656] [<ffffffffa0c1a6ed>] mdt_reint_rec+0x5d/0x200 [mdt]
[240871.118939] [<ffffffffa0c03bfb>] mdt_reint_internal+0x62b/0xa40 [mdt]
[240871.119341] [<ffffffffa0c04206>] mdt_intent_reint+0x1f6/0x430 [mdt]
[240871.119631] [<ffffffffa0c02674>] mdt_intent_policy+0x494/0xc40 [mdt]
[240871.119934] [<ffffffffa053ef6f>] ldlm_lock_enqueue+0x12f/0x860 [ptlrpc]
[240871.123439] [<ffffffffa056ab97>] ldlm_handle_enqueue0+0x7e7/0x1520 [ptlrpc]
[240871.123802] [<ffffffffa05f1111>] tgt_enqueue+0x61/0x230 [ptlrpc]
[240871.124125] [<ffffffffa05f1d22>] tgt_request_handle+0xa42/0x1230 [ptlrpc]
[240871.124479] [<ffffffffa059d5e4>] ptlrpc_main+0xd74/0x1850 [ptlrpc]
[240871.124852] [<ffffffffa059c870>] ? ptlrpc_main+0x0/0x1850 [ptlrpc]
[240871.125169] [<ffffffff8109ce4e>] kthread+0x9e/0xc0
[240871.125430] [<ffffffff8100c24a>] child_rip+0xa/0x20
[240871.125684] [<ffffffff8109cdb0>] ? kthread+0x0/0xc0
[240871.125939] [<ffffffff8100c240>] ? child_rip+0x0/0x20
[240871.126200]
[240871.126445] LustreError: dumping log to /tmp/lustre-log.1438264560.17527
[240931.816060] LustreError: 0:0:(ldlm_lockd.c:350:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 0@lo ns: mdt-lustre-MDT0000_UUID lock: ffff88002deffdb8/0x91588c5d002f9ff4 lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0 bits 0x13 rrc: 3 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0x91588c5d002f9fd1 expref: 7 pid: 17610 timeout: 4355125074 lvb_type: 0
[240931.817853] LustreError: 0:0:(ldlm_lockd.c:350:waiting_locks_callback()) Skipped 1 previous similar message
[240931.819475] LustreError: 17527:0:(ldlm_lockd.c:1390:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8800768887f0 ns: mdt-lustre-MDT0000_UUID lock: ffff8800b4683db8/0x91588c5d002fa03a lrc: 3/0,0 mode: CR/CR res: [0x2000090a2:0x1:0x0].0 bits 0x9 rrc: 1 type: IBT flags: 0x50200000000000 nid: 0@lo remote: 0x91588c5d002fa010 expref: 4 pid: 17527 timeout: 0 lvb_type: 0
[240931.828351] LNet: Service thread pid 17527 completed after 100.70s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[240931.829116] LustreError: 11-0: lustre-MDT0000-mdc-ffff8800815b17f0: operation ldlm_enqueue to node 0@lo failed: rc = -107
[240931.829119] LustreError: Skipped 16 previous similar messages
[240931.829125] Lustre: lustre-MDT0000-mdc-ffff8800815b17f0: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[240931.829127] Lustre: Skipped 1 previous similar message
[240931.830325] LustreError: 167-0: lustre-MDT0000-mdc-ffff8800815b17f0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[240931.830328] LustreError: Skipped 16 previous similar messages
[240931.831571] LustreError: 17905:0:(file.c:3104:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -108
[240931.831574] LustreError: 17905:0:(file.c:3104:ll_inode_revalidate_fini()) Skipped 1 previous similar message
[240951.828046] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
[240951.828643] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88006c1fece8 x1508080398126124/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 696/1080 e 1 to 0 dl 1438219484 ref 2 fl Unregistering:ES/2/ffffffff rc -5/-1
[240951.829800] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
[240971.828046] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
[240971.828657] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88006c1fece8 x1508080398126124/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 696/1080 e 1 to 0 dl 1438219484 ref 2 fl Unregistering:ES/2/ffffffff rc -5/-1
[240971.829654] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
[240985.116518] Lustre: lustre-MDT0000: haven't heard from client f797195c-b579-5409-6da6-f79d52d8f961 (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff88008c3027f0, cur 1438264674 expire 1438264644 last 1438264621
[240991.828054] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
[240991.828618] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88006c1fece8 x1508080398126124/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 696/1080 e 1 to 0 dl 1438219484 ref 2 fl Unregistering:ES/2/ffffffff rc -5/-1
[240991.829520] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
[241011.828067] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
[241011.828670] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88006c1fece8 x1508080398126124/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 696/1080 e 1 to 0 dl 1438219484 ref 2 fl Unregistering:ES/2/ffffffff rc -5/-1
[241011.829831] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
[241031.828076] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
[241031.828680] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88006c1fece8 x1508080398126124/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 696/1080 e 1 to 0 dl 1438219484 ref 2 fl Unregistering:ES/2/ffffffff rc -5/-1
[241031.829941] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
[241051.828068] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
[241051.828672] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88006c1fece8 x1508080398126124/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 696/1080 e 1 to 0 dl 1438219484 ref 2 fl Unregistering:ES/2/ffffffff rc -5/-1
[241051.829840] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
[241071.828052] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
[241071.828553] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88006c1fece8 x1508080398126124/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 696/1080 e 1 to 0 dl 1438219484 ref 2 fl Unregistering:ES/2/ffffffff rc -5/-1
[241071.829411] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
[241111.828063] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
[241111.828578] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) Skipped 1 previous similar message
[241111.829146] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88006c1fece8 x1508080398126124/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 696/1080 e 1 to 0 dl 1438219484 ref 2 fl Unregistering:ES/2/ffffffff rc -5/-1
[241111.830368] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) Skipped 1 previous similar message
[241111.830925] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
[241111.831846] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) Skipped 1 previous similar message |
| Comment by Peter Jones [ 30/Jul/15 ] |
|
Amir, Oleg is wondering whether this is an LNet issue. He has a node wedged in this manner on his test cluster that you could look at if that helps (limited time offer - maybe for the next day or so). Peter |
| Comment by Oleg Drokin [ 30/Jul/15 ] |
|
Contact me on Skype for login credentials. |
| Comment by Oleg Drokin [ 14/Sep/15 ] |
|
Adding a bit of the discussion we had with Amir as background, in case this gets reassigned:

[9/9/15 1:20:39 PM] Amir Shehata: Hey Oleg, I'm looking at LU-6927... Wondering why you think this is an LNet issue? The stack trace doesn't show LNet functions. What do you suspect is causing the thread to be slow?
[9/9/15 1:21:59 PM] Oleg Drokin: [240951.828046] LustreError: 17909:0:(import.c:332:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
[240951.828643] LustreError: 17909:0:(import.c:358:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88006c1fece8 x1508080398126124/t0(0) o101->lustre-MDT0000-mdc-ffff8800815b17f0@0@lo:12/10 lens 696/1080 e 1 to 0 dl 1438219484 ref 2 fl Unregistering:ES/2/ffffffff rc -5/-1
[240951.829800] LustreError: 17909:0:(import.c:373:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting th
[9/9/15 1:22:08 PM] Oleg Drokin: so we are trying to unregister a message from network and cannot
[9/9/15 1:37:49 PM] Amir Shehata: hmm... I'm not familiar with that Unregistering concept. As far as I understand, LNet is message based. It sends a message and waits for a response. It has its timeouts and ptlrpc has its own timeouts
[9/9/15 1:38:41 PM] Oleg Drokin: yes. So in this instance ptlrpc registered a buffer for sending or receiving. Now we want to stop the service, so we are trying to unregister the buffer
[9/9/15 1:39:01 PM] Oleg Drokin: and we cannot - we hit our timeout waiting for acknowledgement from LNet that the buffer has been unregistered
[9/9/15 1:39:38 PM] Amir Shehata: hmm.. Ok I'll take a look at ptlrpc_invalidate_import() to see what I can gather
[9/9/15 2:02:54 PM] Amir Shehata: so what I see is that there is a list: imp_sending_list, which requests get added to when sending, but I don't see where the request is removed from that list when it's completed (or aborted)
[9/9/15 3:40:51 PM] Oleg Drokin: /* Request already may be not on sending or delaying list. This
* may happen in the case of marking it erroneous for the case
* ptlrpc_import_delay_req(req, status) find it impossible to
* allow sending this rpc and returns *status != 0. */
if (!list_empty(&req->rq_list)) {
list_del_init(&req->rq_list);
atomic_dec(&imp->imp_inflight);
[9/9/15 3:40:56 PM] Oleg Drokin: here it's removed once it was sent
[9/9/15 3:41:06 PM] Oleg Drokin: there are a couple more places where it is removed if there was an error sending
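To make the exchange above concrete, here is a rough, self-contained C sketch of the wait that the "Unregistering" messages correspond to. It is not the real ptlrpc/LNet code; the type and helper names (fake_request, wait_for_unlink) are made up for illustration. The idea: a request that registered a network buffer stays on the import's sending list until an LNet event confirms the buffer was unlinked, and every timed-out wait produces the rc = -110 / "still on sending list" messages seen in the logs.

#include <stdbool.h>
#include <stdio.h>

struct fake_request {
    bool unlinked;          /* set by the (simulated) LNet unlink event */
    int  on_sending_list;   /* 1 while still linked on the import's sending list */
};

/* One bounded wait; returns 0 if the callback arrived, -110 on timeout. */
static int wait_for_unlink(struct fake_request *req, int max_ticks)
{
    for (int t = 0; t < max_ticks; t++) {
        if (req->unlinked)
            return 0;
        /* the real code sleeps here waiting for the LNet event callback */
    }
    return -110;
}

int main(void)
{
    struct fake_request req = { .unlinked = false, .on_sending_list = 1 };
    int attempts = 0;

    while (wait_for_unlink(&req, 10) == -110) {
        printf("rc = -110 waiting for callback; request still on sending list\n");
        if (++attempts == 3)
            req.unlinked = true;  /* simulate the unlink event finally firing;
                                   * in the hang reported here it never does */
    }

    req.on_sending_list = 0;      /* only now is it safe to take it off the list */
    printf("buffer unregistered after %d retries\n", attempts);
    return 0;
}

In the wedged node the confirmation apparently never arrives, so the invalidate loop keeps logging the same triplet every 20 seconds; the simulation flips the flag after a few retries only so that it terminates.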
|