[LU-11826] Cannot send after transport endpoint shutdown Created: 24/Dec/18  Updated: 28/Feb/19  Resolved: 23/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Campbell Mcleay (Inactive) Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.4


Issue Links:
Related
is related to LU-10945 Race between sending bl ast and lock ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When running multiple rm's of files, we get the following error in the shell:

/bin/rm: cannot remove '</some/file/path>': Cannot send after transport endpoint shutdown

These coincide with the following error in /var/log/messages:

Dec 24 11:13:09 foxtrot2 kernel: LustreError: 11-0: foxtrot-MDT0000-mdc-ffff883ff6b12800: operation mds_close to node 10.21.22.10@tcp failed: rc = -107
Dec 24 11:13:09 foxtrot2 kernel: Lustre: foxtrot-MDT0000-mdc-ffff883ff6b12800: Connection to foxtrot-MDT0000 (at 10.21.22.10@tcp) was lost; in progress operations using this service will wait for recovery to complete
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 167-0: foxtrot-MDT0000-mdc-ffff883ff6b12800: This client was evicted by foxtrot-MDT0000; in progress operations using this service will fail.
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 3598:0:(mdc_locks.c:1211:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -5
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 3598:0:(mdc_locks.c:1211:mdc_intent_getattr_async_interpret()) Skipped 37 previous similar messages
Dec 24 11:13:09 foxtrot2 kernel: LustreError: Skipped 50 previous similar messages
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39322:0:(llite_lib.c:1512:ll_md_setattr()) md_setattr fails: rc = -5
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 38248:0:(file.c:172:ll_close_inode_openhandle()) foxtrot-clilmv-ffff883ff6b12800: inode [0x200030875:0x5d11:0x0] mdc close failed: rc = -107
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 38248:0:(file.c:172:ll_close_inode_openhandle()) Skipped 743 previous similar messages
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 41760:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x2000302ba:0x103db:0x0] error -108.
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 41760:0:(vvp_io.c:1474:vvp_io_init()) Skipped 310070 previous similar messages
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 44300:0:(mdc_request.c:1329:mdc_read_page()) foxtrot-MDT0000-mdc-ffff883ff6b12800: [0x20002cfcf:0x5a20:0x0] lock enqueue fails: rc = -108
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39322:0:(llite_lib.c:1512:ll_md_setattr()) Skipped 5 previous similar messages
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 12816:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x200030766:0x18539:0x0] error -108.
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39252:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x2000302ba:0x10403:0x0] error -108.
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39252:0:(vvp_io.c:1474:vvp_io_init()) Skipped 143616 previous similar messages
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 44302:0:(file.c:172:ll_close_inode_openhandle()) foxtrot-clilmv-ffff883ff6b12800: inode [0x20000070c:0x2ea9:0x0] mdc close failed: rc = -108
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 44302:0:(file.c:172:ll_close_inode_openhandle()) Skipped 815 previous similar messages
Dec 24 11:13:09 foxtrot2 kernel: LustreError: 12816:0:(vvp_io.c:1474:vvp_io_init()) Skipped 2986 previous similar messages
Dec 24 11:13:10 foxtrot2 kernel: Lustre: foxtrot-MDT0000-mdc-ffff883ff6b12800: Connection restored to 10.21.22.10@tcp (at 10.21.22.10@tcp)


 Comments   
Comment by Campbell Mcleay (Inactive) [ 24/Dec/18 ]

Just a note to say that the files it complains about when it says it can't remove them are not protected by read-only ACLs or file modes, and do exist on the filesystem.

Comment by Peter Jones [ 24/Dec/18 ]

Emoly

What do you suggest here?

Thanks

Peter

Comment by Oleg Drokin [ 24/Dec/18 ]

From the client excerpt provided it looks like the client was evicted by the servers. Can you please provide mds-side logs from that same time?

Comment by Campbell Mcleay (Inactive) [ 02/Jan/19 ]

Hi Oleg,

Sorry for the delay in replying. There appear to be errors in the MDS logs at that time too:

Dec 24 11:13:09 fmds1 kernel: LustreError: 17227:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.21.22.31@tcp) failed to reply to blocking AST (req@ffff88236d22d700 x1615613399909376 status 0 rc -110), evict it ns: mdt-foxtrot-MDT0000_UUID lock: ffff8820a62b8a00/0x13866bfbe99ef389 lrc: 4/0,0 mode: PR/PR res: [0x20002c7e2:0x4352:0x0].0x0 bits 0x1b rrc: 2 type: IBT flags: 0x60200400000020 nid: 10.21.22.31@tcp remote: 0xbf690c3f5f1d9519 expref: 169058 pid: 87340 timeout: 9220366206 lvb_type: 0
Dec 24 11:13:09 fmds1 kernel: LustreError: 138-a: foxtrot-MDT0000: A client on nid 10.21.22.31@tcp was evicted due to a lock blocking callback time out: rc -110
Dec 24 11:13:09 fmds1 kernel: LustreError: 19974:0:(ldlm_lockd.c:1398:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8820f2829800 ns: mdt-foxtrot-MDT0000_UUID lock: ffff88243a769c00/0x13866bfbe99f2569 lrc: 3/0,0 mode: PR/PR res: [0x2000226c3:0xeea1:0x0].0x0 bits 0x1b rrc: 1 type: IBT flags: 0x50200000000000 nid: 10.21.22.31@tcp remote: 0xbf690c3f5f1d993a expref: 168919 pid: 19974 timeout: 0 lvb_type: 0
Dec 24 11:13:09 fmds1 kernel: LustreError: 20543:0:(ldlm_lib.c:3180:target_bulk_io()) @@@ bulk READ failed: rc 107 req@ffff88236c2f8c00 x1620532433667344/t0(0) o37>5bfdc7b1-5f4a-3415-a424-ef332bafeb5b@10.21.22.31@tcp:-1/-1 lens 568/440 e 0 to 0 dl 1545650004 ref 1 fl Interpret:/0/0 rc 0/0
Dec 24 11:13:09 fmds1 kernel: Lustre: foxtrot-MDT0000: Connection restored to (at 10.21.22.31@tcp)
Dec 24 11:13:09 fmds1 kernel: LustreError: 19974:0:(ldlm_lockd.c:1398:ldlm_handle_enqueue0()) Skipped 30 previous similar messages
Dec 24 11:13:11 fmds1 kernel: LustreError: 98802:0:(client.c:1164:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88231848e900 x1615613399983344/t0(0) o104->foxtrot-MDT0000@10.21.22.31@tcp:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Dec 24 11:13:11 fmds1 kernel: LustreError: 98802:0:(client.c:1164:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Dec 24 11:13:11 fmds1 kernel: LustreError: 98802:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.21.22.31@tcp) failed to reply to blocking AST (req@ffff88231848e900 x1615613399983344 status 0 rc -5), evict it ns: mdt-foxtrot-MDT0000_UUID lock: ffff882550cebc00/0x13866bfbe6684d33 lrc: 4/0,0 mode: PR/PR res: [0x200030873:0x1e93f:0x0].0x0 bits 0x2 rrc: 4 type: IBT flags: 0x60000400000020 nid: 10.21.22.31@tcp remote: 0xbf690c3f5eec06fe expref: 98520 pid: 20548 timeout: 9220368047 lvb_type: 0
Dec 24 11:13:11 fmds1 kernel: LustreError: 138-a: foxtrot-MDT0000: A client on nid 10.21.22.31@tcp was evicted due to a lock blocking callback time out: rc -5

 

Regards,

Campbell

Comment by Campbell Mcleay (Inactive) [ 08/Jan/19 ]

Hi Oleg,

Any thoughts?

Regards,

Campbell

Comment by Campbell Mcleay (Inactive) [ 10/Jan/19 ]

Can anyone help with this? Thanks

Comment by Oleg Drokin [ 10/Jan/19 ]

So the server-side messages tell us that:

1. The client failed to reply to a blocking AST (a special message that the Lustre server sends to ask a client to release some lock due to a conflict). The timeout for these low-overhead messages is pretty small, at 5-7 seconds, so if there is a momentary lapse in connectivity, for example, this might happen.
2. The server evicted the client so as not to hold up the rest of the nodes waiting for a potentially dead client.
3. The client turned out to be not as dead as it seemed and reconnected.
4. Another blocking AST that we had scheduled for this client was interrupted due to the eviction and the client was evicted once more.

The part in #4 does not sound entirely correct, but it did not change anything in the general flow of things, since the original eviction is what messed everything up, and that most likely happened due to some hiccup on the network.

To add: if you see this every time you do large rm's, it's possible you are just overstretching some bit of the network infrastructure. You can collect some Lustre debug logs and we can see which part exactly, I guess.

Comment by Campbell Mcleay (Inactive) [ 11/Jan/19 ]

Hi Oleg,

I'll prepare a bunch of rm jobs and turn on debugging. This may take a few days for me to get done.

Thanks,

Campbell

Comment by Campbell Mcleay (Inactive) [ 15/Jan/19 ]

Hi Oleg,

Can you please let me know which of the following debug options I should enable (from /proc/sys/lnet/subsystem_debug):

undefined mdc mds osc ost class log llite rpc mgmt lnet lnd pinger filter echo ldlm lov lquota osd lfsck snapshot lmv sec gss mgc mgs fid fld

Presumably, it is just:

echo "<subsystem1> <subsystem2> <subsystem3>" > /proc/sys/lnet/debug

Should I just do this on the MDS or on the clients as well? Same subsystems options on both?

 

Thanks,

Campbell

Comment by Oleg Drokin [ 16/Jan/19 ]

You could also use the "+..." notation to just add things to what you have there.
Don't change anything in subsystem_debug; it's already fully enabled. You only need to adjust the debug setting.
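
You can check what is currently set first, for example with:

lctl get_param debug subsystem_debug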

In your case you are best advised to increase the debug log size to like 1G (note this will consume 1G of RAM)

lctl set_param debug_mb=1024

and then add dlmtrace, info and rpctrace with

lctl set_param debug="+dlmtrace +info +rpctrace"

Run these two commands on all the clients you expect the problem might hit and then also on the MDS.

once the problem hits, run "lctl dk >/tmp/${HOSTNAME}-lustre.log" on all nodes affected, collect the files and upload them somewhere.
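
For convenience, the whole sequence on one node looks something like this (the output path is just an example):

lctl set_param debug_mb=1024                        # 1G debug buffer (consumes 1G of RAM)
lctl set_param debug="+dlmtrace +info +rpctrace"    # add lock and RPC tracing on top of the defaults
# ... reproduce the problem, then on each affected node:
lctl dk > /tmp/${HOSTNAME}-lustre.log               # dump the debug buffer to a text file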

Thanks.

Comment by Campbell Mcleay (Inactive) [ 17/Jan/19 ]

Hi Oleg,

There was a disconnection at around 22:27 last night, about 5 hours after I started the rm's. Unfortunately, I was not able to dump the log until I got in this morning, and the timestamp at the beginning of the client logs is after this time; however, the MDS debug log has that period covered. I'll restart the rm's and see if I can capture the error. Should I increase the debug_mb value on the clients? In the meantime, I can send you the MDS log, which, compressed, is about 22MB. Is there somewhere I can send it?

Regards,

Campbell

Comment by Peter Jones [ 17/Jan/19 ]

Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to be attached to a JIRA ticket. For these cases, Whamcloud provides an anonymous write-only FTP upload service. In order to use this service, you'll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory using your Jira issue as a name.

In the following example, there are three debug logs in a single directory and the JIRA issue LU-4242 has been created. After completing the upload, please update the relevant issue with a note mentioning the upload, so that our engineers know where to find your logs.

$ ls -lh
total 333M
-rw-r--r-- 1 mjmac mjmac  98M Feb 23 17:36 mds-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug
$ ncftp ftp.whamcloud.com
NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 99.96.190.235...
(vsFTPd 2.2.2)
Logging in...
Login successful.
Logged in to ftp.whamcloud.com.
ncftp / > cd uploads
Directory successfully changed.
ncftp /uploads > mkdir LU-4242
ncftp /uploads > cd LU-4242
Directory successfully changed.
ncftp /uploads/LU-4242 > put *
mds-debug: 97.66 MB 11.22 MB/s
oss-00-debug: 117.19 MB 11.16 MB/s
oss-01-debug: 117.48 MB 11.18 MB/s
ncftp /uploads/LU-4242 >

Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created, nor will you (or anyone other than Whamcloud staff) be able to see or read them.

Comment by Campbell Mcleay (Inactive) [ 17/Jan/19 ]

Hi,

I've uploaded the MDS log (uploads/LU-11826/fmds1-lustre.log.gz) to ftp.whamcloud.com. The time of the disconnection was 1547661781 in Unix time (16th Jan 22:27:22). I'm running another rm in the hope I can capture some useful logs on the clients as well.

Regards,

Campbell

Comment by Campbell Mcleay (Inactive) [ 21/Jan/19 ]

Hi,

Did you find anything of interest in the log I sent? Or do you need the client log? I'm finding it difficult to capture the log on the client as it may take a few days to trigger it and I can't watch it all the time. Even with 4GB of memory assigned to the debug log size, it only seems to capture 5 minutes worth (though the log is 3GB in size)

Regards,

Campbell

Comment by Oleg Drokin [ 22/Jan/19 ]

it's weird that your logs literally start with an eviction:

00010000:00020000:51.1F:1545412061.748721:0:0:0:(ldlm_lockd.c:333:waiting_locks_callback()) ### lock callback timer expired after 0s: evicting client at 10.21.22.32@tcp  ns: mdt-foxtrot-MDT0000_UUID lock: ffff88268bb39400/0x13866bc8bdb565ba lrc: 4/0,0 mode: PR/PR res: [0x20002a5ed:0x16aa8:0x0].0x0 bits 0x1b rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.21.22.32@tcp remote: 0xc849fe3bc093e756 expref: 63841 pid: 20017 timeout: 0 lvb_type: 0

It makes no sense to run days of logs on the clients waiting for the eviction, but you can set the "dump logs on eviction" flag so that every time a client is evicted it dumps some logs; the first dump will contain all the history.
You can do this by setting dump_on_eviction=1 with lctl set_param on every client and on the MDT.
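
That is, on every client and on the MDT:

lctl set_param dump_on_eviction=1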

Also, I don't see any evictions at the 1547661781 timestamp in the logs. The preceding evictions are at 1545663407, which is clearly too long ago, and the next evictions are at 1547669576, which is 2+ hours away from your specified time (timezone calculation problem?) and does not match the eviction profile we saw in the initial report (AST timeout). In fact, at no time do we see the "failed to reply to blocking AST" message in the MDT log.

Unfortunately the logs you gathered have already become somewhat fragmented, so it would be really great if you could capture something from clients and servers using the dump-on-eviction method above (Lustre has three-tier logging, so even when logs start at a particular timestamp that does not mean there are no gaps between those entries).

Comment by Oleg Drokin [ 22/Jan/19 ]

I just noticed that you only capture 5 minutes of log in 3G; that might be enough, since a normal lock timeout is around 200s and the AST timeout is only about 7s.

Comment by Campbell Mcleay (Inactive) [ 22/Jan/19 ]

Thank you very much Oleg, that's a very handy parameter. I have set this on the clients and the MDS and resumed the deletes. Might take a little while for it to trigger, will send the results through if it disconnects again.

Comment by Campbell Mcleay (Inactive) [ 22/Jan/19 ]

Hi Oleg,

Two of the three clients lost connection with the MDS. Have dumped the logs out and ftp'ed them up. File names are:

fmds1-lustre.log.220119.gz

foxtrot2-lustre.log.220119.gz

foxtrot3-lustre.log.220119.gz

Regards,

Campbell

Comment by Campbell Mcleay (Inactive) [ 28/Jan/19 ]

Hi Oleg,

Did you get a chance to take a look at those logs, and if so, were you able to determine what the issue was? 

Kind regards,

Campbell

Comment by Oleg Drokin [ 29/Jan/19 ]

hm... How were these logs produced? I don't see any evictions in there?

I guess I failed to mention that when you have dump on eviction set up, /tmp/lustre-log-TIMESTAMP files are dropped for every event; they are binary, and you need to pass them through "lctl df" to turn them into text. It's those files that we need. Hopefully they are still there?

The lctl df step is not strictly necessary; we can run that ourselves.
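
For reference, the conversion itself is just something like this (the timestamp in the file name is only an example):

lctl df /tmp/lustre-log-1548163854 /tmp/lustre-log-1548163854.txt    # convert the binary dump to text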

Comment by Campbell Mcleay (Inactive) [ 29/Jan/19 ]

Hi Oleg,

Have uploaded the logs parsed via lctl df as well.

Thanks,

Campbell

Comment by Oleg Drokin [ 29/Jan/19 ]

Thank you for the logs; these ones are helpful, though I wonder what version you are really running, since some of the messages in there don't appear to match my copy of 2.10.6.

Anyway, the issue you are hitting appears to be LU-10945; the telltale message in your logs is this:

00010000:00010000:27.0:1548163854.268356:0:4761:0:(ldlm_lockd.c:1685:ldlm_request_cancel()) ### server cancels blocked lock after 1548163854s ns: mdt-foxtrot-MDT0000_UUID lock: ffff8822c3821200/0x13866dfbc82358f1 lrc: 4/0,0 mode: PR/PR res: [0x20002f6d0:0x15746:0x0].0x0 bits 0x1b rrc: 5 type: IBT flags: 0x40200000000020 nid: 10.21.22.32@tcp remote: 0xc849fe70b80233f8 expref: 220164 pid: 7423 timeout: 0 lvb_type: 0

The patch https://review.whamcloud.com/#/c/32133/3 should help you, except it does not apply to the b2_10 tree; I'll make a port.

Do you have the ability to self-build Lustre with the patch (only the MDS and OSSes would need the patched code)?

Comment by Campbell Mcleay (Inactive) [ 30/Jan/19 ]

Hi Oleg,

Thanks, we'll try the patch, so let's go ahead with that.

Kind regards,

Campbell

Comment by Oleg Drokin [ 30/Jan/19 ]

The ported patch is here: https://review.whamcloud.com/#/c/34131/

I hoped it would be done testing by now, but apparently we have some test-system slowness and results take a while to become available.
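
If you want to build it yourself, the rough sequence is something like the following (the fetch refspec below is only an example; copy the exact one from the "Download" box on the review page, and configure may need to be pointed at your kernel source):

git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout b2_10
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/31/34131/1   # example refspec for change 34131
git cherry-pick FETCH_HEAD
sh autogen.sh
./configure          # e.g. --with-linux=<path to your kernel source> if needed
make rpms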

Comment by Campbell Mcleay (Inactive) [ 01/Feb/19 ]

Hi Oleg,

Have built the server packages from that source tree you linked. Upon installation, I got a lot of warnings about llite_lloop.ko needing various unknown symbols, but I read somewhere that this package is obsolete - need I worry about this?

Will start some deletes and see how it goes.

Regards,

 

Campbell

Comment by Oleg Drokin [ 01/Feb/19 ]

Yes, llite_lloop.ko is not really used nowadays, so you should not worry too much about it. Please let me know how it goes; also, if the problems persist, please collect the logs as before.

Comment by Peter Jones [ 16/Feb/19 ]

How are things shaping up with the patch cmcl? Ok to consider this ticket closed?

Comment by Peter Jones [ 23/Feb/19 ]

Disappointing to not hear explicit feedback on the effectiveness of the patch but I suppose no news is good news...

Comment by Campbell Mcleay (Inactive) [ 28/Feb/19 ]

Hi Peter,

 

My apologies, I missed your last message. We did extensive testing with parallel deletes and there were no 'transport endpoint shutdown' messages. We're still getting evictions from OSTs but that is a separate issue. So I think we can consider the patch a success in fixing that issue.

Kind regards,

Campbell

Comment by Peter Jones [ 28/Feb/19 ]

Ah good - thanks for confirming! We have included this fix in 2.10.7, but I am loath to include fixes that we don't know serve a purpose.
