[LU-8573] IOR: niobuf.c:319:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed Created: 31/Aug/16  Updated: 02/Dec/18  Resolved: 01/Dec/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.10.3
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre-master 2.8.56_68_gd4a4c07 build 3430, RHEL7, Spirit performance cluster Revision: d4a4c0795b3befb87d47a5bf441adeba3b1c36f8


Attachments: File client.spirit-28.log.gz     Text File lu-8573-syslog.tar     Text File lu-8573.console.tar     Text File mds.sprit-3.log.gz     Text File oss.aeon-1.log.gz     File oss.aeon-2.log.gz     File spirit-11.log.gz     File spirit-14-lustre-log.20160928_2029-4-lu-8573.bz2     File spirit-15-lustre-log.20160928_2029-4-lu-8573.bz2     File spirit-16-lustre-log.20160928_2029-4-lu-8573.bz2     File spirit-17.log.gz     File spirit-20.log.gz     Text File spirit-3-lustre-log.20160928_2029-4-lu-8573.bz2     File spirit-33.log-20160904     File spirit-4.log.gz     Text File spirit-5.callback.txt.gz     File spirit-5.log.gz     Text File spirit-6.callback.lustre.log.gz     File spirit-6.log.gz     Text File spirit-aeon-1-lustre-log.20160928_2029-4-lu-8573.bz2     Text File spirit-aeon-2-lustre-log.20160928_2029-4-lu-8573.bz2     File sprit-11.callbacks.lustre.log.gz    
Issue Links:
Duplicate
is duplicated by LU-11647 niobuf.c:330:ptlrpc_register_bulk()) ... Resolved
is duplicated by LU-11692 lustre kernel panic - (niobuf.c:330:... Resolved
Related
is related to LU-7650 ko2iblnd map_on_demand can't negotita... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Attempted IOR test on both sets of OSTs.
No progress made on test.
Client and OST logs are full of:
OST

LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 3, status -113, desc ffff88104dc53200
LustreError: 9071:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff880b0c956200
LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 5, status -113, desc ffff880b13e95200
LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 3, status -113, desc ffff880b13e95200
LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 5, status -113, desc ffff881032b5a800
LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 3, status -113, desc ffff881032b5a800

Client

[12378.313197] LustreError: 3138:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807442f9c00
[12378.313223] LustreError: 3136:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807442f9800
[12378.337649] LustreError: 3136:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8810102c4200
[12378.337671] LustreError: 3136:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880f55547400
[12378.337677] LustreError: 3138:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880729970000
[12378.362167] LustreError: 3137:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880804b78800
[12378.374324] LustreError: 3137:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807299e5200
[12378.410886] LustreError: 3137:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880f53ce1c00

Eventually, one client LBUGS

[12377.942289] LustreError: 3153:0:(niobuf.c:319:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed: 
[12378.564210] LustreError: 3153:0:(niobuf.c:319:ptlrpc_register_bulk()) LBUG
[12378.564210] LustreError: 3153:0:(niobuf.c:319:ptlrpc_register_bulk()) LBUG
[12378.571890] Pid: 3153, comm: ptlrpcd_01_02
[12378.576468]
Call Trace:
[12378.580868]  [<ffffffffa08557d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[12378.588652]  [<ffffffffa0855d75>] lbug_with_loc+0x45/0xc0 [libcfs]
[12378.595597]  [<ffffffffa0c3e661>] ptlrpc_register_bulk+0x831/0x9c0 [ptlrpc]
[12378.603389]  [<ffffffffa08cace2>] ? LNetMDUnlink+0xe2/0x180 [lnet]
[12378.610322]  [<ffffffffa0c6be76>] ? sptlrpc_import_sec_ref+0x36/0x40 [ptlrpc]
[12378.618321]  [<ffffffffa0c3f1af>] ptl_send_rpc+0x1ff/0xda0 [ptlrpc]
[12378.625361]  [<ffffffffa0c39256>] ptlrpc_check_set.part.23+0x1896/0x1dd0 [ptlrpc]
[12378.633743]  [<ffffffffa0c397eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[12378.640976]  [<ffffffffa0c643fb>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
[12378.648095]  [<ffffffffa0c647ab>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[12378.654613]  [<ffffffff810b8910>] ? default_wake_function+0x0/0x20
[12378.661550]  [<ffffffffa0c644f0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
[12378.668067]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[12378.673515]  [<ffffffff810a5a60>] ? kthread+0x0/0xe0
[12378.679058]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[12378.685086]  [<ffffffff810a5a60>] ? kthread+0x0/0xe0
[12378.690626]
[12378.692396] Kernel panic - not syncing: LBUG
[12378.697164] CPU: 9 PID: 3153 Comm: ptlrpcd_01_02 Tainted: G           OE  ------------   3.10.0-327.28.3.el7.x86_64 #1
[12378.709098] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
[12378.720548]  ffffffffa0872def 000000003210ab53 ffff881000c1ba80 ffffffff81636453
[12378.728852]  ffff881000c1bb00 ffffffff8162fce7 ffffffff00000008 ffff881000c1bb10
[12378.737165]  ffff881000c1bab0 000000003210ab53 ffffffffa0c90d30 0000000000000246
[12378.745463] Call Trace:
[12378.748204]  [<ffffffff81636453>] dump_stack+0x19/0x1b
[12378.753939]  [<ffffffff8162fce7>] panic+0xd8/0x1e7
[12378.759295]  [<ffffffffa0855ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
[12378.766235]  [<ffffffffa0c3e661>] ptlrpc_register_bulk+0x831/0x9c0 [ptlrpc]
[12378.774009]  [<ffffffffa08cace2>] ? LNetMDUnlink+0xe2/0x180 [lnet]
[12378.780949]  [<ffffffffa0c6be76>] ? sptlrpc_import_sec_ref+0x36/0x40 [ptlrpc]
[12378.788950]  [<ffffffffa0c3f1af>] ptl_send_rpc+0x1ff/0xda0 [ptlrpc]
[12378.795981]  [<ffffffffa0c39256>] ptlrpc_check_set.part.23+0x1896/0x1dd0 [ptlrpc]
[12378.804368]  [<ffffffffa0c397eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[12378.811596]  [<ffffffffa0c643fb>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
[12378.818725]  [<ffffffffa0c647ab>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[12378.825237]  [<ffffffff810b8910>] ? wake_up_state+0x20/0x20
[12378.831492]  [<ffffffffa0c644f0>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
[12378.838778]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[12378.844224]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[12378.851510]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[12378.857534]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140

vmcore is available on Spirit for analysis



 Comments   
Comment by Oleg Drokin [ 31/Aug/16 ]

What's the path to vmcore and the debuginfo kernel and modules to make use of the vmcore

Comment by Cliff White (Inactive) [ 31/Aug/16 ]

The vmcore is in /scratch/dumps/spirit-33* on any Spirit node. The related RPMS are at /scratch/rpms or /scratch/hudson.

Comment by Cliff White (Inactive) [ 02/Sep/16 ]

File a DCO ticket requesting access, include your public key (DSA/RSA) The login is passwordless.
The spirit head node (where you login) has all the relevant data under /scratch
/scratch/dumps/<hostname>
/scratch/rpms/ or /scratch/hudson has the RPMS for each install. I am going on sabbatical, Frank Heckes is taking over and should be able to help you.

Comment by Gerrit Updater [ 08/Sep/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/22378
Subject: LU-8573 ptlrpc: always unregister bulk
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5b1832e89bd0c9432592602149e8496e7c5d4fdf

Comment by Frank Heckes (Inactive) [ 28/Sep/16 ]

the error also occurs running https://build.hpdd.intel.com/job/lustre-reviews/41665/ (http://review.whamcloud.com/#/c/22739/ )
Test was running on centOS-7.2 using patch set 2 of original fix #22738
I'll attach debug logs later (approximately in 1 hour)

P.S.: Sorry for using your account on spirit to push the change

Comment by Hongchao Zhang [ 29/Sep/16 ]

Hi Frank,
Where do you put the debug logs?
Thanks!

Comment by Cliff White (Inactive) [ 03/Oct/16 ]

Here are the debug logs from the OSTs, MDS and couple of clients

Comment by Cliff White (Inactive) [ 03/Oct/16 ]

Syslog and console messages from the run

Comment by Hongchao Zhang [ 09/Oct/16 ]

Hi Cliff,

Thanks for the logs!
I'm sorry that I can't find the logs of the client which encountered the "ASSERTION( desc->bd_md_count == 0 )" in the attached logs,
could you attach the logs (syslog, console and debug log) of the client which encountered the ASSERTION?
Thanks!

Comment by Cliff White (Inactive) [ 10/Oct/16 ]

Here is the system log with the assertion, It looks like the debug logs are no longer on the system

Comment by Hongchao Zhang [ 11/Oct/16 ]

Hi Cliff,

In the newly attach log of spirit-33, the assertion is occurred at Aug 31, and there is no recent assertion.

Aug 31 01:36:14 spirit-33 kernel: Pid: 3153, comm: ptlrpcd_01_02
Aug 31 01:36:14 spirit-33 kernel: #012Call Trace:
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa08557d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa0855d75>] lbug_with_loc+0x45/0xc0 [libcfs]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa0c3e661>] ptlrpc_register_bulk+0x831/0x9c0 [ptlrpc]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa08cace2>] ? LNetMDUnlink+0xe2/0x180 [lnet]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa0c6be76>] ? sptlrpc_import_sec_ref+0x36/0x40 [ptlrpc]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa0c3f1af>] ptl_send_rpc+0x1ff/0xda0 [ptlrpc]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa0c39256>] ptlrpc_check_set.part.23+0x1896/0x1dd0 [ptlrpc]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa0c397eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa0c643fb>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa0c647ab>] ptlrpcd+0x2bb/0x560 [ptlrpc]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffff810b8910>] ? default_wake_function+0x0/0x20
Aug 31 01:36:14 spirit-33 kernel: [<ffffffffa0c644f0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
Aug 31 01:36:14 spirit-33 kernel: [<ffffffff810a5b2f>] kthread+0xcf/0xe0
Aug 31 01:36:14 spirit-33 kernel: [<ffffffff810a5a60>] ? kthread+0x0/0xe0
Aug 31 01:36:14 spirit-33 kernel: [<ffffffff81646a98>] ret_from_fork+0x58/0x90
Aug 31 01:36:14 spirit-33 kernel: [<ffffffff810a5a60>] ? kthread+0x0/0xe0
Comment by Cliff White (Inactive) [ 11/Oct/16 ]

Yes, August 31 was the last time this test was run.

Comment by Hongchao Zhang [ 12/Oct/16 ]

But Frank said it should run around Sept 29.

Frank Heckes added a comment - 29/Sep/16 1:48 AM

the error also occurs running https://build.hpdd.intel.com/job/lustre-reviews/41665/ (http://review.whamcloud.com/#/c/22739/ )
Test was running on centOS-7.2 using patch set 2 of original fix #22738

Comment by Cliff White (Inactive) [ 17/Oct/16 ]

Test again with latest master, still have this issue. LDEV-510 has more details

Comment by Cliff White (Inactive) [ 17/Oct/16 ]

Repeated test with latest master, same result. Nodes are currently halted with this issue, can be examined.

Oct 17 17:13:27 spirit-5 kernel: Lustre: DEBUG MARKER: == parallel-scale test iorfpp: iorfpp ================================================================ 17:13:46 (1476724426)
Oct 17 17:13:27 spirit-5 sshd[23824]: Received disconnect from 10.10.1.11: 11: disconnected by user
Oct 17 17:13:27 spirit-5 sshd[23824]: pam_unix(sshd:session): session closed for user root
Oct 17 17:13:27 spirit-5 systemd-logind: Removed session 94.
Oct 17 17:13:31 spirit-5 kernel: Lustre: lustre-OST000c: Client 2af8c060-dda3-f8ed-5532-73d48d5bec75 (at 192.168.1.21@o2ib) reconnecting
Oct 17 17:13:31 spirit-5 kernel: Lustre: lustre-OST000c: Connection restored to 78757673-8c34-2ea0-d2f0-d29a18f40898 (at 192.168.1.16@o2ib)
Oct 17 17:13:31 spirit-5 kernel: Lustre: Skipped 64 previous similar messages
Oct 17 17:13:31 spirit-5 kernel: LustreError: 19542:0:(ldlm_lib.c:3184:target_bulk_io()) @@@ bulk WRITE failed: rc -107  req@ffff88074fd57c50 x1548457026272736/t0(0) o4->b765b4c7-de22-04f2-d85c-19567447ab65@192.168.1.15@o2ib:-1/-1 lens 608/448 e 0 to 0 dl 1476724421 ref 1 fl Interpret:/0/0 rc 0/0
Oct 17 17:13:31 spirit-5 kernel: LustreError: 19542:0:(ldlm_lib.c:3184:target_bulk_io()) Skipped 12 previous similar messages
Oct 17 17:13:31 spirit-5 kernel: Lustre: lustre-OST0000: Bulk IO write error with b765b4c7-de22-04f2-d85c-19567447ab65 (at 192.168.1.15@o2ib), client will retry: rc -107
Oct 17 17:13:31 spirit-5 kernel: Lustre: Skipped 145 previous similar messages
Oct 17 17:13:31 spirit-5 kernel: LustreError: 19363:0:(ldlm_lib.c:3234:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff88074fd53050 x1548457087089584/t0(0) o4->2af8c060-dda3-f8ed-5532-73d48d5bec75@192.168.1.21@o2ib:-1/-1 lens 608/448 e 0 to 0 dl 1476724421 ref 1 fl Interpret:/0/0 rc 0/0
Oct 17 17:13:31 spirit-5 kernel: LustreError: 19363:0:(ldlm_lib.c:3234:target_bulk_io()) Skipped 1319 previous similar messages
...
Oct 17 17:14:28 spirit-5 kernel: LNetError: 14218:0:(o2iblnd_cb.c:3131:kiblnd_check_txs_locked()) Timed out tx: active_txs, 7 seconds
Oct 17 17:14:28 spirit-5 kernel: LNetError: 14218:0:(o2iblnd_cb.c:3131:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Oct 17 17:14:28 spirit-5 kernel: LNetError: 14218:0:(o2iblnd_cb.c:3194:kiblnd_check_conns()) Timed out RDMA with 192.168.1.11@o2ib (1): c: 103, oc: 0, rc: 124
Oct 17 17:14:28 spirit-5 kernel: LNetError: 14218:0:(o2iblnd_cb.c:3194:kiblnd_check_conns()) Skipped 3 previous similar messages
Oct 17 17:14:28 spirit-5 kernel: LustreError: 14218:0:(events.c:446:server_bulk_callback()) event type 5, status -103, desc ffff880faf285a00
Oct 17 17:14:28 spirit-5 kernel: LustreError: 14218:0:(events.c:446:server_bulk_callback()) event type 3, status -103, desc ffff880faf285a00
Oct 17 17:14:28 spirit-5 kernel: LustreError: 14218:0:(events.c:446:server_bulk_callback()) event type 5, status -103, desc ffff88084d834e00
Oct 17 17:14:28 spirit-5 kernel: LustreError: 14218:0:(events.c:446:server_bulk_callback()) event type 3, status -103, desc ffff88084d834e00

Client side

Oct 17 17:14:47 spirit-11 kernel: LNetError: 762:0:(o2iblnd_cb.c:1074:kiblnd_init_rdma()) RDMA is too large for peer 192.168.1.5@o2ib (131072), src size: 1024000 dst size: 1024000
Oct 17 17:14:47 spirit-11 kernel: LNetError: 762:0:(o2iblnd_cb.c:1698:kiblnd_reply()) Can't setup rdma for GET from 192.168.1.5@o2ib: -90
Oct 17 17:14:47 spirit-11 kernel: LNetError: 762:0:(o2iblnd_cb.c:1698:kiblnd_reply()) Skipped 10 previous similar messages
Oct 17 17:14:47 spirit-11 kernel: LustreError: 762:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88080cef7a00
Oct 17 17:14:47 spirit-11 kernel: Lustre: 4800:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1476724429/real 1476724429]  req@ffff8807c3730300 x1548457007399104/t0(0) o4->lustre-OST000c-osc-ffff880036709000@192.168.1.5@o2ib:6/4 lens 608/448 e 6 to 1 dl 1476724517 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
Oct 17 17:14:47 spirit-11 kernel: LustreError: 4787:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88080c746600
Oct 17 17:14:47 spirit-11 kernel: Lustre: lustre-OST0012-osc-ffff880036709000: Connection to lustre-OST0012 (at 192.168.1.5@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 17 17:14:47 spirit-11 kernel: Lustre: Skipped 9 previous similar messages
Oct 17 17:14:47 spirit-11 kernel: LustreError: 4785:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff880ffa415800
Oct 17 17:14:47 spirit-11 kernel: Lustre: lustre-OST0012-osc-ffff880036709000: Connection restored to 192.168.1.5@o2ib (at 192.168.1.5@o2ib)
Oct 17 17:14:47 spirit-11 kernel: Lustre: Skipped 9 previous similar messages
Oct 17 17:14:47 spirit-11 kernel: Lustre: 4800:0:(client.c:2111:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
Oct 17 17:14:47 spirit-11 kernel: LustreError: 4791:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8807c360b800
Oct 17 17:14:47 spirit-11 kernel: LustreError: 4788:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8807c360ba00
Oct 17 17:14:47 spirit-11 kernel: LustreError: 4791:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8807c3711e00
Oct 17 17:14:47 spirit-11 kernel: LustreError: 4791:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88080cdc2a00
Oct 17 17:14:47 spirit-11 kernel: LustreError: 4789:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88080c014400
Comment by Cliff White (Inactive) [ 17/Oct/16 ]

Lustre logs dumped from client (spirit-11) and OSS nodes with debug=-1. Logs dumped by hand after failure.

Comment by Hongchao Zhang [ 18/Oct/16 ]

Hi Cliff,

Thanks!
Could you please attach the logs of the node where the "ASSERTION( desc->bd_md_count == 0 )" is encountered? I can't find it in the attached logs (spirit-5,6,11).

Comment by Cliff White (Inactive) [ 18/Oct/16 ]

The logs were taken several hours after the failure, if the ASSERT isn't in those logs, then it must have been over-written. I can re-run the test, and you can login to the nodes if you wish.

Comment by Hongchao Zhang [ 24/Oct/16 ]

Hi Cliff,

I saw there are several tests of "parallel-scale test iorfpp: iorfpp" after "Oct 18 00:00:00" at spirit-11. Do you still use spirit-11 to test
this issue? there is no "client_bulk_callback" errors in the following tests and no "ASSERTION" was found in the /var/log/messages
(the messages after Sep 26).

Comment by nasf (Inactive) [ 24/Oct/16 ]

With the patch http://review.whamcloud.com/#/c/22739/ applied, there should be no "ASSERTION( desc->bd_md_count == 0 )" any longer. Because the patch deregisters the the RPC's MD unconditionally before such ASSERTION() check. That is almost equal to removing the ASSERTION(). But in fact, such ASSERTION() means we assume that the RPC environment should be sane, no need to force cleanup via deregister RPC's MD. So we should find out why the RPC environment is not sane at that time, not only simply 'bypass' such ASSERTION().

On the other hand, it is NOT to say the patch http://review.whamcloud.com/#/c/22739/ is meaningless, because if we still trigger the ASSERTION() with the patch applied, that means there are other logic paths not handled properly.

Comment by Alex Zhuravlev [ 24/Oct/16 ]

I'm not familiar with the related code (still studying..), but it looks like ptlrpc_unregister_bulk() is missing in some codepath..

Comment by Gerrit Updater [ 25/Oct/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/23354
Subject: LU-8573 ptlrpc: cleanup bulks for ptlrpc_register_bulk failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 683841cc2ea9ee0d16f9fe1c5baf77ed668c15fb

Comment by Cliff White (Inactive) [ 26/Oct/16 ]

IOR continues to fail
Server logs from start of tests

Oct 26 18:45:05 spirit-aeon-1 kernel: Lustre: zfstest-OST0002: Client de07d034-8d22-f659-1cbd-768d0030c923 (at 192.168.1.26@o2ib) reconnecting
Oct 26 18:45:05 spirit-aeon-1 kernel: Lustre: zfstest-OST0002: Connection restored to de07d034-8d22-f659-1cbd-768d0030c923 (at 192.168.1.26@o2ib)
Oct 26 18:45:05 spirit-aeon-1 kernel: Lustre: Skipped 9 previous similar messages
Oct 26 18:45:05 spirit-aeon-1 kernel: LustreError: 29216:0:(ldlm_lib.c:3183:target_bulk_io()) @@@ bulk WRITE failed: rc -107  req@ffff8808089b4850 x1549270578154064/t0(0) o4->7522edd5-eea9-2c67-68cd-2b45f3964653@192.168.1.17@o2ib:-1/-1 lens 608/448 e 0 to 0 dl 1477507516 ref 1 fl Interpret:/0/0 rc 0/0
Oct 26 18:45:05 spirit-aeon-1 kernel: Lustre: zfstest-OST0002: Bulk IO write error with 7522edd5-eea9-2c67-68cd-2b45f3964653 (at 192.168.1.17@o2ib), client will retry: rc -107
Oct 26 18:45:05 spirit-aeon-1 kernel: Lustre: zfstest-OST0000: Client ec385605-4e60-f4be-b60f-cd3e019315dd (at 192.168.1.21@o2ib) reconnecting
Oct 26 18:45:05 spirit-aeon-1 kernel: Lustre: Skipped 23 previous similar messages
Oct 26 18:45:06 spirit-aeon-1 kernel: LustreError: 25234:0:(ldlm_lib.c:3233:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff8808089b7850 x1549270581336640/t0(0) o4->de07d034-8d22-f659-1cbd-768d0030c923@192.168.1.26@o2ib:-1/-1 lens 608/448 e 0 to 0 dl 1477507516 ref 1 fl Interpret:/0/0 rc 0/0
Oct 26 18:45:06 spirit-aeon-1 kernel: LustreError: 25234:0:(ldlm_lib.c:3233:target_bulk_io()) Skipped 2 previous similar messages
Oct 26 18:45:06 spirit-aeon-1 kernel: Lustre: zfstest-OST0002: Connection restored to 027d9577-68a8-de83-149f-38f3670e21f2 (at 192.168.1.14@o2ib)
Oct 26 18:45:06 spirit-aeon-1 kernel: Lustre: Skipped 27 previous similar messages
Oct 26 18:45:06 spirit-aeon-1 kernel: LustreError: 29238:0:(ldlm_lib.c:3233:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff880808853850 x1549270577110928/t0(0) o4->77fffece-ca00-9432-9fd5-857674816ee4@192.168.1.28@o2ib:-1/-1 lens 608/448 e 0 to 0 dl 1477507516 ref 1 fl Interpret:/0/0 rc 0/0
Oct 26 18:45:06 spirit-aeon-1 kernel: Lustre: zfstest-OST0000: Client de07d034-8d22-f659-1cbd-768d0030c923 (at 192.168.1.26@o2ib) reconnecting
Oct 26 18:45:06 spirit-aeon-1 kernel: Lustre: Skipped 19 previous similar messages
Oct 26 18:45:06 spirit-aeon-1 kernel: LustreError: 29238:0:(ldlm_lib.c:3233:target_bulk_io()) Skipped 24 previous similar messages
Oct 26 18:45:07 spirit-aeon-1 kernel: LustreError: 29274:0:(ldlm_lib.c:3183:target_bulk_io()) @@@ bulk WRITE failed: rc -107  req@ffff88007fd76850 x1549270578161152/t0(0) o4->1b29c57b-535c-46b6-486f-7405da7c1796@192.168.1.18@o2ib:-1/-1 lens 608/448 e 0 to 0 dl 1477507516 ref 1 fl Interpret:/2/0 rc 0/0
Oct 26 18:45:07 spirit-aeon-1 kernel: Lustre: zfstest-OST0000: Bulk IO write error with 1b29c57b-535c-46b6-486f-7405da7c1796 (at 192.168.1.18@o2ib), client will retry: rc -107
Oct 26 18:45:07 spirit-aeon-1 kernel: LustreError: 29261:0:(ldlm_lib.c:3233:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff88079810d050 x1549270582366336/t0(0) o4->f71ccf2b-e961-3680-30e4-be748de9fb17@192.168.1.27@o2ib:-1/-1 lens 608/448 e 0 to 0 dl 1477507516 ref 1 fl Interpret:/2/0 rc 0/0
Oct 26 18:45:07 spirit-aeon-1 kernel: LustreError: 29261:0:(ldlm_lib.c:3233:target_bulk_io()) Skipped 21 previous similar messages
Oct 26 18:45:08 spirit-aeon-1 kernel: Lustre: zfstest-OST0000: Connection restored to 51f91233-8032-15d2-cd74-03a850dab34b (at 192.168.1.11@o2ib)
Oct 26 18:45:08 spirit-aeon-1 kernel: Lustre: Skipped 38 previous similar messages
Oct 26 18:45:08 spirit-aeon-1 kernel: Lustre: zfstest-OST0000: Client 7f61e263-a95b-3193-8481-b48b561e6e29 (at 192.168.1.25@o2ib) reconnecting
Oct 26 18:45:08 spirit-aeon-1 kernel: Lustre: Skipped 35 previous similar messages
Oct 26 18:45:10 spirit-aeon-1 kernel: LustreError: 29306:0:(ldlm_lib.c:3233:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff88072a949850 x1549270584459216/t0(0) o4->b38521f1-8de6-5f02-305a-3c91f4caeb2d@192.168.1.24@o2ib:-1/-1 lens 608/448 e 0 to 0 dl 1477507518 ref 1 fl Interpret:/2/0 rc 0/0
Oct 26 18:45:10 spirit-aeon-1 kernel: LustreError: 29306:0:(ldlm_lib.c:3233:target_bulk_io()) Skipped 44 previous similar messages
Oct 26 18:45:12 spirit-aeon-1 kernel: Lustre: zfstest-OST0000: Connection restored to a8341b8c-a880-ca64-9830-6416dc1b988f (at 192.168.1.15@o2ib)
Oct 26 18:45:12 spirit-aeon-1 kernel: Lustre: Skipped 54 previous similar messages
Oct 26 18:45:12 spirit-aeon-1 kernel: Lustre: zfstest-OST0000: Client de07d034-8d22-f659-1cbd-768d0030c923 (at 192.168.1.26@o2ib) reconnecting
Oct 26 18:45:12 spirit-aeon-1 kernel: Lustre: Skipped 56 previous similar messages
Oct 26 18:45:14 spirit-aeon-1 kernel: LustreError: 29376:0:(ldlm_lib.c:3233:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff88075d19bc50 x1549270582366368/t0(0) o4->f71ccf2b-e961-3680-30e4-be748de9fb17@192.168.1.27@o2ib:-1/-1 lens 608/448 e 0 to 0 dl 1477507522 ref 1 fl Interpret:/2/0 rc 0/0
Oct 26 18:45:14 spirit-aeon-1 kernel: LustreError: 29376:0:(ldlm_lib.c:3233:target_bulk_io()) Skipped 68 previous similar messages
Oct 26 18:45:56 spirit-aeon-1 kernel: LNetError: 21200:0:(o2iblnd_cb.c:3131:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Oct 26 18:45:56 spirit-aeon-1 kernel: LNetError: 21200:0:(o2iblnd_cb.c:3194:kiblnd_check_conns()) Timed out RDMA with 192.168.1.11@o2ib (6): c: 127, oc: 0, rc: 94
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 21200:0:(events.c:446:server_bulk_callback()) event type 5, status -103, desc ffff8804c83c8000
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 21200:0:(events.c:446:server_bulk_callback()) event type 3, status -103, desc ffff8804c83c8000
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 21200:0:(events.c:446:server_bulk_callback()) event type 5, status -103, desc ffff8804c83c8000
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 21200:0:(events.c:446:server_bulk_callback()) event type 3, status -103, desc ffff8804c83c8000
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 21200:0:(events.c:446:server_bulk_callback()) event type 5, status -103, desc ffff8804c83c8000
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 21200:0:(events.c:446:server_bulk_callback()) event type 3, status -103, desc ffff8804c83c8000
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 21200:0:(events.c:446:server_bulk_callback()) event type 5, status -103, desc ffff8804c83c8000
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 21200:0:(events.c:446:server_bulk_callback()) event type 3, status -103, desc ffff8804c83c8000
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 21200:0:(events.c:446:server_bulk_callback()) event type 5, status -103, desc ffff88054023b400
Oct 26 18:45:56 spirit-aeon-1 kernel: Lustre: zfstest-OST0002: Bulk IO write error with 51f91233-8032-15d2-cd74-03a850dab34b (at 192.168.1.11@o2ib), client will retry: rc -110
Oct 26 18:45:56 spirit-aeon-1 kernel: LustreError: 212
Comment by Cliff White (Inactive) [ 26/Oct/16 ]

Client-side messages

Oct 26 18:45:02 spirit-11 kernel: Lustre: DEBUG MARKER: == parallel-scale test iorfpp: iorfpp ================================================================ 18:45:02 (1477507502)
Oct 26 18:45:02 spirit-11 sshd[21261]: Received disconnect from 10.10.1.11: 11: disconnected by user
Oct 26 18:45:02 spirit-11 sshd[21261]: pam_unix(sshd:session): session closed for user root
Oct 26 18:45:02 spirit-11 systemd-logind: Removed session 42.
Oct 26 18:45:05 spirit-11 kernel: LNetError: 3587:0:(o2iblnd_cb.c:1074:kiblnd_init_rdma()) RDMA is too large for peer 192.168.2.2@o2ib (131072), src size: 1048576 dst size: 1048576
Oct 26 18:45:05 spirit-11 kernel: LNetError: 3586:0:(o2iblnd_cb.c:1698:kiblnd_reply()) Can't setup rdma for GET from 192.168.2.2@o2ib: -90
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3586:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d076800
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3586:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074e00
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3585:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074e00
Oct 26 18:45:05 spirit-11 kernel: Lustre: 3616:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1477507505/real 1477507505]  req@ffff88077d8d6600 x1549270577199552/t0(0) o4->zfstest-OST0001-osc-ffff880ff9be7000@192.168.2.2@o2ib:6/4 lens 608/448 e 0 to 1 dl 1477507517 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Oct 26 18:45:05 spirit-11 kernel: Lustre: zfstest-OST0001-osc-ffff880ff9be7000: Connection to zfstest-OST0001 (at 192.168.2.2@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 26 18:45:05 spirit-11 kernel: Lustre: zfstest-OST0001-osc-ffff880ff9be7000: Connection restored to 192.168.2.2@o2ib (at 192.168.2.2@o2ib)
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3582:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3582:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3581:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3584:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400
Oct 26 18:45:05 spirit-11 kernel: LNetError: 3587:0:(o2iblnd_cb.c:1074:kiblnd_init_rdma()) Skipped 7 previous similar messages
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3587:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d076800
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3588:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88003666fc00
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3586:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88003666fc00
Oct 26 18:45:05 spirit-11 kernel: LustreError: 3585:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88003666fc00
Comment by Cliff White (Inactive) [ 27/Oct/16 ]

Lustre logs from both OSS nodes. backfstype was ZFS for this test, error can be reproduced on either ZFS or ldiskfs

Comment by Cliff White (Inactive) [ 27/Oct/16 ]

MDS and sample client lustre logs.

Comment by Oleg Drokin [ 27/Oct/16 ]

I think the IO errors aside, it's still important to fix the assertion since twe don't want to have the crashes.
Now if io erors happen to be due to network errors later - then we can dig into this separately, but at least we won't be crashing anything.

Comment by Cliff White (Inactive) [ 27/Oct/16 ]

One note: the IOR test makes no progress due to these errors, it may not be hitting an ASSERT, but the code is completely un-useable.

Comment by Cliff White (Inactive) [ 27/Oct/16 ]

Repeated the fault with ldiskfs, set debug=-1 and dumped lustre logs after the hang occurred.
MDS = spirit-4 OSS= spirit-5/6, all others are clients.

Comment by Cliff White (Inactive) [ 28/Oct/16 ]

Received patch from Alex, "Doug thinks that can be introduced by LU-7650.. I just pushed a patch to revert that: http://review.whamcloud.com/23439"
I was able to successfully complete IOR runs on both sets of servers, will retry ZFS today.

Comment by Gerrit Updater [ 28/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23439/
Subject: LU-8573 lnet: Revert LU-7650 patches
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 59fbfa5f3fa925f676636f1b78c986d05cd295bc

Comment by Peter Jones [ 29/Oct/16 ]

Landed for 2.9

Comment by Hongchao Zhang [ 14/Sep/18 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: http://review.whamcloud.com/33167
Subject: LU-8573 ptlrpc: race with reply_in_callback
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ac6b543a08033ddb5879706c0b3c941d3cdfbef0

Comment by Andreas Dilger [ 01/Dec/18 ]

Closing this ticket again, and using LU-11647 for the new patch. This avoids confusing the fix version for the patch.

Generated at Sat Feb 10 02:18:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.