Lustre / LU-13692

MDS slow/hung threads at mdt_object_local_lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.4
    • Labels: None
    • Environment: clients running l2.12.3
    • Severity: 2

    Description

      MDS threads are hung or slow, and the watchdog is dumping stack traces:

      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377038] Lustre: MGS: Connection restored to 09397a7f-3fe2-2dc8-d25a-74d157cb2008 (at 10.151.50.72@o2ib)
      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377044] Lustre: Skipped 277 previous similar messages
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981]   req@ffff8979f9e16300 x1668974534442592/t1531807577764(0) o36->67852282-a3ed-5acb-a9e2-3cae43fe0406@10.151.0.201@o2ib:5/0 lens 488/3152 e 0 to 0 dl 1592342570 ref 2 fl Interpret:/0/0 rc 0/0
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.952277] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 134 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.333617] LNet: Service thread pid 14118 was inactive for 551.86s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.389992] LNet: Skipped 3 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407185] Pid: 14118, comm: mdt00_115 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407190] Call Trace:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407202]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412473]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412496]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412506]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412516]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412530]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412541]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412552]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412562]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412571]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412623]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412659]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412691]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412696]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412699]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412722]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412725] LustreError: dumping log to /tmp/lustre-log.1592342774.14118
      Jun 16 14:26:14 nbp8-mds1 sec[2849]: SEC_EVENT |msg lustre hung thread
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.178343] LNet: Service thread pid 8575 was inactive for 552.75s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234458] Pid: 8575, comm: mdt00_043 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234459] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.234472]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239700]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239749]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239763]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239773]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239785]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239797]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239809]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239819]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239829]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239874]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239909]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239945]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239952]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239958]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239982]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Pid: 14083, comm: mdt00_088 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240020]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240050]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240060]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240071]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240082]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240093]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240104]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240116]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240127]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240137]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240174]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240207]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240240]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240244]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240247]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240251]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Pid: 10527, comm: mdt00_069 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240285]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240313]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240325]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240335]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240344]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240357]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240369]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240380]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240389]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240400]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240439]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240478]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240510]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240514]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240516]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240520]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Pid: 14084, comm: mdt00_089 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240554]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240585]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240598]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240610]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240620]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240631]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240647]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240657]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240666]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240677]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240716]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240749]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240781]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240784]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240788]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240792]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240796] LNet: Service thread pid 8568 was inactive for 552.81s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
       

      Attachments

        1. lustre-log.1592343544.8525
          4.75 MB
        2. messages
          51 kB
        3. vmcore-dmesg.txt
          1021 kB

        Issue Links

          Activity

            [LU-13692] MDS slow/hung threads at mdt_object_local_lock

            We do run with 'health_sensitivity=0'. I saved the "net show" and "stats show" output. These were not cleared beforehand; I'll try to re-run and clear the stats before the test.

            ftp uploads/LU-13692/router.netshow.2020_09_11

            I was able to re-test today with router logs; see ftp:/uploads/LU-13692/LU-13692_patch4_2020_09_11.tgz

            What I saw: the server sends the RPC.

             --- SERVER ---
            00000100:00100000:7.0:1599839141.481820:0:8199:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.6.236@o2ib417, seq: 2361
            00000100:00100000:7.0:1599839141.481824:0:8199:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_012:9308d1b4-ec54-b743-7579-5daa9b33d4cd+88:4780:x1677516826795584:12345-10.141.6.236@o2ib417:101
            00010000:00010000:7.0:1599839141.507240:0:8199:0:(ldlm_lock.c:1073:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 3/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x50200400000000 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 0
            00010000:00010000:7.0:1599839141.507253:0:8199:0:(ldlm_lock.c:681:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 3/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x40200400000000 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 0
            00010000:00010000:7.0:1599839141.507263:0:8199:0:(ldlm_lockd.c:903:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 4/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x50200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 0
            00010000:00010000:7.0:1599839141.507269:0:8199:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 412, AT: on) ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 5/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x70200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 1350 lvb_type: 0
            00000100:00100000:7.0:1599839141.507276:0:8199:0:(client.c:1630:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt00_012:nbptest3-MDT0000_UUID:8199:1677552166027200:10.141.6.236@o2ib417:104
            00000400:00000200:7.0:1599839141.507285:0:8199:0:(lib-move.c:4684:LNetPut()) LNetPut -> 12345-10.141.6.236@o2ib417
            00000400:00000200:7.0:1599839141.507288:0:8199:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source ANY to NMR:  10.141.6.236@o2ib417 routed destination
            00000400:00000200:7.0:1599839141.507296:0:8199:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.151.27.53@o2ib:<?>) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.151.26.234@o2ib) : PUT try# 0
            00000100:00000200:7.0:1599839141.507889:0:8199:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff99b5661a0900 x1677552166027200/t0(0) o104->nbptest3-MDT0000@10.141.6.236@o2ib417:15/16 lens 296/224 e 0 to 0 dl 1599839764 ref 1 fl Rpc:R/0/ffffffff rc 0/-1
            00000100:00000200:7.0:1599839141.507895:0:8199:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff99b5661a0900 x1677552166027200/t0(0) o104->nbptest3-MDT0000@10.141.6.236@o2ib417:15/16 lens 296/224 e 0 to 0 dl 1599839764 ref 1 fl Rpc:R/0/ffffffff rc 0/-1
            00000100:00100000:7.0:1599839141.507904:0:8199:0:(client.c:2060:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt00_012:nbptest3-MDT0000_UUID:8199:1677552166027200:10.141.6.236@o2ib417:104
            00010000:00010000:7.0:1599839141.507912:0:8199:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30f740/0x8e13058be0aaddf2 lrc: 0/0,0 mode: --/EX res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x44000000000000 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 3
            00010000:00010000:7.0:1599839141.507918:0:8199:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### not re-adding to wait list(timeout: 412, AT: on) ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 4/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x70200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 96 pid: 8199 timeout: 1350 lvb_type: 0
            00010000:00010000:7.0:1599839141.507924:0:8199:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 4/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x60200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 96 pid: 8199 timeout: 1350 lvb_type: 0
            00010000:00000200:7.0:1599839141.507935:0:8199:0:(ldlm_lib.c:2967:target_send_reply_msg()) @@@ sending reply  req@ffff99b5702d2d00 x1677516826795584/t201863467188(0) o101->9308d1b4-ec54-b743-7579-5daa9b33d4cd@10.141.6.236@o2ib417:131/0 lens 376/1032 e 0 to 0 dl 1599839846 ref 1 fl Interpret:/0/0 rc 0/0
            00000400:00000200:7.0:1599839141.507944:0:8199:0:(lib-move.c:4684:LNetPut()) LNetPut -> 12345-10.141.6.236@o2ib417
            00000400:00000200:7.0:1599839141.507946:0:8199:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source Specified: 10.151.27.53@o2ib to NMR:  10.141.6.236@o2ib417 routed destination
            00000400:00000200:7.0:1599839141.507955:0:8199:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.151.27.53@o2ib:10.151.27.53@o2ib) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.151.26.236@o2ib) : PUT try# 0
            00000100:00100000:7.0:1599839141.507968:0:8199:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_012:9308d1b4-ec54-b743-7579-5daa9b33d4cd+96:4780:x1677516826795584:12345-10.141.6.236@o2ib417:101 Request processed in 26145us (26195us total) trans 201863467188 rc 0/0
            00000100:00100000:7.0:1599839141.507973:0:8199:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.141.6.236@o2ib417, seq: 2361
            

            The router gets the request and tries to send it out; I wasn't sure whether it actually makes it out of the router.

            00000800:00000200:71.0:1599839141.508041:0:25250:0:(o2iblnd_cb.c:343:kiblnd_handle_rx()) Received d1[0] from 10.151.27.53@o2ib
            00000400:00000200:71.0:1599839141.508045:0:25250:0:(lib-move.c:4190:lnet_parse()) TRACE: 10.141.6.236@o2ib417(10.151.26.236@o2ib) <- 10.151.27.53@o2ib : PUT - routed
            00000400:00000200:71.0:1599839141.508050:0:25250:0:(lib-msg.c:918:lnet_is_health_check()) health check = 0, status = 0, hstatus = 0
            00000400:00000200:71.0:1599839141.508053:0:25250:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source ANY to NMR:  10.141.6.236@o2ib417 local destination
            00000400:00000200:71.0:1599839141.508061:0:25250:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.141.26.236@o2ib417:<?>) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.141.6.236@o2ib417) : PUT try# 0
            00000800:00000200:71.0:1599839141.508065:0:25250:0:(o2iblnd_cb.c:1663:kiblnd_send()) sending 1000 bytes in 1 frags to 12345-10.141.6.236@o2ib417
            00000800:00000200:71.0:1599839141.508071:0:25250:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff9c5a323c9d80] -> 10.141.6.236@o2ib417 (2) version: 12
            00000800:00000200:71.0:1599839141.508072:0:25250:0:(o2iblnd_cb.c:1543:kiblnd_launch_tx()) conn[ffff9c4700f2ea00] (68)++
            00000800:00000200:71.0:1599839141.508074:0:25250:0:(o2iblnd_cb.c:1316:kiblnd_queue_tx_locked()) conn[ffff9c4700f2ea00] (69)++
            00000800:00000200:71.0:1599839141.508077:0:25250:0:(o2iblnd_cb.c:1549:kiblnd_launch_tx()) conn[ffff9c4700f2ea00] (70)--
            00000800:00000200:71.0:1599839141.508078:0:25250:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffff9c5b223f4200] (69)++
            00000800:00000200:71.0:1599839141.508080:0:25250:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffff9c5b223f4200] (70)--
            00000800:00000200:71.0:1599839141.508081:0:25250:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9c5b223f4200] (69)--
            00000800:00000200:71.0:1599839141.508457:0:25250:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (70)++
            00000800:00000200:71.0:1599839141.508461:0:25250:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (71)--
            00000800:00000200:71.0:1599839141.508462:0:25250:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (70)++
            

            There is no completion for conn[ffff9c5ad91c0200].
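A refcount trace like the one above can be checked mechanically: count the `(n)++` increments against the `(n)--` decrements for the suspect conn pointer in the decoded debug log. This is a minimal sketch; the heredoc lines are stand-ins for real `lctl dk` output, and with real logs you would point LOG at the decoded file instead.

```shell
# Sketch: balance refcount ++/-- events for one conn pointer in a decoded
# Lustre debug log. Sample lines below are copied from the router log above.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
00000800:00000200:71.0:1599839141.508457:0:25250:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (70)++
00000800:00000200:71.0:1599839141.508461:0:25250:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (71)--
00000800:00000200:71.0:1599839141.508462:0:25250:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (70)++
EOF
# Count increments vs decrements for the conn in question; an unbalanced
# count is consistent with the missing completion noted above.
incs=$(grep -c 'conn\[ffff9c5ad91c0200\].*++' "$LOG")
decs=$(grep -c 'conn\[ffff9c5ad91c0200\].*--' "$LOG")
echo "incs=$incs decs=$decs"
rm -f "$LOG"
```

On these three sample lines the counts come out unbalanced (two increments, one decrement), matching the "no completion" observation.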

            mhanafi Mahmoud Hanafi added a comment

            I was talking with Oleg; there are a few things we can collect/do to see if there are issues on the LNet/networking side.

            First of all, let's grab the LNet statistics from the node (server/routers/clients with the problem), to see if there are drops:

            lnetctl net show -v 4
            lnetctl stats show 

            Oleg was saying that he sees the message depart the server but never arrive at the client. There are no router logs.

            If we get the above stats from the router, we can see if there are drops happening on the routers. We can also monitor the stats on some interval and see if the errors/drops increase at the same time frame as the problem. We can do the same type of monitoring on the servers and clients. I think if we're able to correlate LNet errors with the time frame of the problem, we'll have some lead to go on. We can at least look at the type of errors which are happening during that time and see how we handle them.
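The interval monitoring suggested above can be sketched as a small loop that timestamps each `lnetctl` snapshot, so drops/errors can later be correlated with the hang window. This is a hypothetical helper, not part of the ticket: the variable names, the sample count, and the `LNETCTL` override (which lets the loop be exercised on a machine without Lustre) are all mine; on a real router you would raise INTERVAL/SAMPLES, e.g. INTERVAL=60 SAMPLES=60.

```shell
# Sketch: periodic LNet stats snapshots with timestamps for correlation.
# LNETCTL/INTERVAL/SAMPLES/OUT are overridable; defaults are demo-friendly.
LNETCTL=${LNETCTL:-lnetctl}
INTERVAL=${INTERVAL:-0}          # raise to e.g. 60 on a real node
SAMPLES=${SAMPLES:-2}            # raise to cover the whole test window
OUT=${OUT:-$(mktemp)}
for i in $(seq 1 "$SAMPLES"); do
    {
        echo "=== $(date -u '+%Y-%m-%d %H:%M:%S') sample $i ==="
        $LNETCTL stats show
        $LNETCTL net show -v 4
    } >> "$OUT" 2>/dev/null
    if [ "$i" -lt "$SAMPLES" ]; then
        sleep "$INTERVAL"
    fi
done
```

Diffing successive samples in OUT then shows whether drop/error counters move during the same window as the hung threads.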

            The other thing I'm thinking is to try and reduce the complexity of LNet. Can we try disabling health, if we haven't already:

            lnetctl set health_sensitivity 0 

            This avoids any resends at the LNet layer.
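One way to act on the monitoring suggestion above is to snapshot `lnetctl stats show` on an interval and diff the counters between snapshots. A sketch follows; the counter names `drop_count` and `errors` are the ones I'd expect in the statistics block, so adjust them to whatever your lnetctl version actually reports:

```python
import re

def parse_lnet_stats(text):
    """Parse the 'key: value' lines of lnetctl output into {counter: int}."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([a-z_]+):\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def counter_deltas(before, after, keys=("drop_count", "errors")):
    """Report how much each interesting counter grew between two snapshots."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}
```

Collecting a snapshot per interval (e.g. from cron) on servers, routers, and clients and diffing them shows whether drops/errors grow in the same time frame as the hangs, which is the correlation described below.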

            ashehata Amir Shehata (Inactive) added a comment -
            mhanafi Mahmoud Hanafi added a comment - - edited

            There is nothing in the logs on that router. I guess I need to run debug trace on the routers, clients, and server.

            Router debug logs may be really big...

            I don't understand why these two clients sent an RPC with the same xid.

            00000100:00100000:0.0:1599585761.680712:0:15434:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_071:64f13ca0-a798-8be7-76ba-6d1f388635bd+88:6761:x1676790192962432:12345-10.141.6.236@o2ib417:101
            00000100:00000200:0.0:1599585761.680714:0:15434:0:(service.c:2145:ptlrpc_server_handle_request()) got req 1676790192962432
            00000100:00100000:8.0:1599585761.683587:0:15309:0:(service.c:1989:ptlrpc_server_handle_req_in()) got req x1676790192962432
            00000100:00100000:8.0:1599585761.683601:0:15309:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_039:673a32db-c752-20f9-dd67-f80a34f5fd8d+90:6813:x1676790192962432:12345-10.141.6.235@o2ib417:101
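Duplicate xids like the two above can be found mechanically by grouping the "Handling RPC" lines on the xid field and flagging any xid seen from more than one NID. This is an illustrative parser keyed to the pname:cluuid+ref:pid:xid:nid:opc layout shown in these logs:

```python
import re
from collections import defaultdict

# Matches the fields of a ptlrpc "Handling RPC" debug line.
RPC_RE = re.compile(
    r"Handling RPC pname:cluuid\+ref:pid:xid:nid:opc "
    r"(?P<pname>[^:]+):(?P<cluuid>[^:]+):(?P<pid>\d+):"
    r"(?P<xid>x\d+):(?P<nid>[^:]+):(?P<opc>\d+)"
)

def duplicate_xids(log_lines):
    """Map each xid to the set of client NIDs using it; return only duplicates."""
    seen = defaultdict(set)
    for line in log_lines:
        m = RPC_RE.search(line)
        if m:
            seen[m.group("xid")].add(m.group("nid"))
    return {xid: nids for xid, nids in seen.items() if len(nids) > 1}
```

Run over a full server debug dump, this would show how often the same xid arrives from distinct NIDs, and from which client pairs.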
            
            green Oleg Drokin added a comment -

            Thank you, this is a good trace.

            Now we see the lost RPC, and we see it being sent off from the server.

            The thing is, it's not sent to the client but to an intermediate router (which is not in the hostlist either):

            00000400:00000200:1.0:1599585761.690684:0:15434:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.151.27.53@o2ib:10.151.27.53@o2ib) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.151.26.246@o2ib) : PUT try# 0
            00000800:00000200:1.0:1599585761.690687:0:15434:0:(o2iblnd_cb.c:1663:kiblnd_send()) sending 1000 bytes in 1 frags to 12345-10.151.26.246@o2ib

            So this router (10.151.26.246) is the node that ate the message then I imagine? Any useful messages there?
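The next hop is visible in the TRACE line itself: reading the rightmost NID inside the destination parentheses as the next hop (which is how the router is identified above), it can be extracted mechanically when sifting large dumps. A hypothetical helper:

```python
import re

# "TRACE: src(...) -> final_dst(dst_nid:next_hop_nid) : PUT try# 0"
# The rightmost NID in the destination parentheses is taken as the next hop.
TRACE_RE = re.compile(
    r"TRACE: (?P<src>[^\s(]+)\([^)]*\) -> "
    r"(?P<dst>[^\s(]+)\((?P<dst_nid>[^:]+):(?P<next_hop>[^)]+)\)"
)

def next_hop(trace_line):
    """Return (src, final_dst, next_hop_nid) from an lnet_handle_send TRACE line."""
    m = TRACE_RE.search(trace_line)
    return (m.group("src"), m.group("dst"), m.group("next_hop")) if m else None
```

Grepping the server logs with this would list every router a lost message was handed to, narrowing which router node to inspect.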


            With Patch#4 we see hung threads at the start of the first iteration of the MPI run. Without any patches it would take a number of iterations before we would see hung threads. I re-ran things today with peer_credits=128 to make sure we don't have any credit issues. As with the last set of logs, I am setting the debug_daemon size very large so as not to miss anything; I am not sure why we missed the first eviction. This set of logs shouldn't be missing anything.
            /uploads/LU-13692/LU-13692_patch4_2020_09_08.tgz

            green Oleg Drokin added a comment -

            I checked for the first eviction, and that seems to have been removed from the client log by the time it was dumped, so it's not possible to see what was going on on the client at the time.

            I see there's still a bunch of stuck threads after that; I need to see if I can find out why. Probably another lost AST, but it's not a given that the client log (whatever it was) survived either, I guess.


            Uploaded logs running with patch#4.

            debug="ioctl neterror net warning dlmtrace error emerg ha rpctrace config console lfsck"

            ftp:/uploads/LU-13692/LU-13692_patch4_20200902.tgz

            green Oleg Drokin added a comment -

            Patchset 4, if you can, please.

            I agree we are having at least two issues at hand.


            I think we have two issues here: the first is the LDLM deadlock, and the second is the credit issue.
            I am going to retest with different credit values. Should I test with Patch#2 or Patch#3?

            green Oleg Drokin added a comment -

            Andreas, the thread is sleeping in ptlrpc_set_wait in ptlrpc, so waking it up from ldlm is going to be very nontrivial. We can find the thread easily, as it's recorded in the lock, but then we need to determine the ptlrpc set, which is privately allocated and referenced only on that thread's stack.

            green Oleg Drokin added a comment -

            I discussed this with Amir, and he thinks your situation resembles a case where there are not enough peer credits, so messages are dropped.

            I checked the logs and I do see this:

            00000800:00000200:5.0:1595883687.321977:0:15687:0:(o2iblnd_cb.c:900:kiblnd_post_tx_locked()) 10.151.26.238@o2ib: no credits
            

            but the date is strange: it's from July. There are no newer messages, but this is only output at trace level "net", which you have not been including later (I guess you were not removing older server.dk files?)

            He says you need to check the peer_credits_hiw and peer_credits values; peer_credits_hiw should be set to 1/2 of peer_credits to ensure proper reclaim of credits for certain types of messages (and growing peer_credits might not be a bad idea, depending on what your settings are now).

            "lnetctl net show -v 4" and "lnetctl peer show -v" should show some related stats and the min credits.
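For reference, the peer credits tuning described above is set through the ko2iblnd module options. A hypothetical /etc/modprobe.d/ko2iblnd.conf fragment following the "hiw = half of peer_credits" rule might look like this (the values are examples only, not a recommendation for this site):

```
# Example only: peer_credits_hiw at 1/2 of peer_credits,
# per the credit-reclaim guidance above.
options ko2iblnd peer_credits=128 peer_credits_hiw=64
```

The resulting effective values, and the min (low-water) credit counters showing whether the peer ever ran out, can then be checked with the lnetctl commands above.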


            People

              green Oleg Drokin
              mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 15