[LU-15915] /bin/rm: fts_read failed: Cannot send after transport endpoint shutdown Created: 06/Jun/22 Updated: 25/Nov/22 Resolved: 19/Nov/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Dneg (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I am running a large number of deletes on clients and after a while they get evicted. The error on the client is:

/bin/rm: fts_read failed: Cannot send after transport endpoint shutdown

On the MDS, the error is:

Jun 6 19:28:59 fmds1 kernel: LustreError: 9744:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.21.22.31@tcp ns: mdt-foxtrot-MDT0000_UUID lock: ffff94f72a408480/0xb4442ee3e798319c lrc: 3/0,0 mode: PR/PR res: [0x20009b3c6:0x29eb:0x0].0x0 bits 0x20/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.21.22.31@tcp remote: 0x40ff70b2e6a5419f expref: 147862 pid: 61992 timeout: 6578337 lvb_type: 0

I'm running maybe 10-15 recursive rm jobs on 3 clients, so 30-45 in total at once. I've set debugging params as follows on the clients and the MDS:

lctl set_param debug_mb=1024
lctl set_param debug="+dlmtrace +info +rpctrace"
lctl set_param dump_on_eviction=1

Lustre version is 2.12.8_6_g5457c37 |
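As a side note, the binary dump that dump_on_eviction writes on the evicted node can be converted to text with lctl before uploading. A minimal sketch, assuming the default /tmp/lustre-log.<timestamp> dump location (the exact filename below is hypothetical):

# Convert the hypothetical binary debug dump to readable text for analysis.
lctl debug_file /tmp/lustre-log.1654538939 /tmp/lustre-log.1654538939.txt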
| Comments |
| Comment by Dneg (Inactive) [ 06/Jun/22 ] |
|
should one of the clients get evicted again, I'll upload the logs |
| Comment by Dneg (Inactive) [ 07/Jun/22 ] |
|
Several evictions last night. I've uploaded the logs to ftp.whamcloud.com/uploads/ |
| Comment by Dneg (Inactive) [ 07/Jun/22 ] |
|
Could we please raise the priority on this? Thanks |
| Comment by Peter Bojanic [ 30/Sep/22 ] |
|
Are you still experiencing this issue? Can you please attach an hour or two of console logs preceding the client eviction from the MDS and from the affected clients? Also, the Lustre kernel debug logs if they are available. |
| Comment by Andreas Dilger [ 30/Sep/22 ] |
|
Based on the description of the problematic workload (concurrent "rm -r" from many clients), it is possible that this relates to a previously reported issue. The patch https://review.whamcloud.com/47215 addresses it. Would you be interested in testing this simple patch on the clients that are running the "rm -r" tasks? It is very low risk and unlikely to cause any problems, but it is not certain whether it will solve this problem. |
| Comment by Dneg (Inactive) [ 04/Oct/22 ] |
|
Our current workaround is to set lru_size to 128 (ldlm.namespaces.*.lru_size=128), on the principle that having fewer locks allows the client and MDS to communicate all the locks in a timely manner. This seems to have reduced the evictions, though it means a lot more getattr calls to the MDS, and things are slower.

How would we apply this patch? Should I download the SRPMs, apply the patch, and build? If so, which SRPMs need rebuilding? Can I patch the same release (2.12.8_6_g5457c37) as we're currently using? And is it client only, or for the servers as well? |
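For reference, a minimal sketch of the lru_size workaround described above; the value 128 is simply the one quoted here, and a suitable value is workload-dependent:

# Cap the client-side DLM lock LRU on all namespaces (runtime setting only, not persistent).
lctl set_param ldlm.namespaces.*.lru_size=128
# Confirm the current value.
lctl get_param ldlm.namespaces.*.lru_size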
| Comment by Andreas Dilger [ 04/Oct/22 ] |
|
The patch is client-only, and should apply to the source you are currently using, after which you can rebuild the client RPMs. My assumption, based on the build version in use on your clients, is that you already build your own client RPMs, so this should not be any different. |
| Comment by Dneg (Inactive) [ 04/Oct/22 ] |
|
Thanks Andreas. Actually, we use Whamcloud-supplied rpms, so I'll download the corresponding srpm(s) |
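A rough sketch of one way to build patched client RPMs from source rather than the SRPM; the checkout tag and patch filename below are assumptions, and configure options depend on the local kernel and build environment:

# Assumes build dependencies and a matching kernel-devel are already installed.
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout 2.12.8                  # or the exact commit matching 2.12.8_6_g5457c37
patch -p1 < ~/patch-from-gerrit.diff # patch file previously downloaded from the review
sh autogen.sh
./configure --disable-server         # client-only build
make rpms                            # produces the lustre-client RPMs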
| Comment by Andreas Dilger [ 05/Oct/22 ] |
|
I've cherry-picked the patch to the 2.12 branch as https://review.whamcloud.com/48764. |
| Comment by Peter Jones [ 05/Oct/22 ] |
|
dneg I'm not sure how familiar you are with navigating the Whamcloud development infrastructure, but you should now be able to grab SRPMs from Jenkins - https://build.whamcloud.com/job/lustre-reviews/89806/ |
| Comment by Dneg (Inactive) [ 06/Oct/22 ] |
|
Hi Peter, I found the RPMs from that link (e.g., https://build.whamcloud.com/job/lustre-reviews/89806/arch=x86_64,build_type=client,distro=el7.9,ib_stack=inkernel/artifact/artifacts/RPMS/x86_64/kmod-lustre-client-2.12.9_11_gf855161-1.el7.x86_64.rpm, etc.), but unless you recommend using that later version (2.12.9), I was going to stick with the version we're running: I've downloaded the source RPMs for it, applied the patch Andreas linked, and built the client RPMs. Let me know if that is OK. Thanks, |
| Comment by Peter Jones [ 06/Oct/22 ] |
|
Yes that is fine - I was just trying to make things easier for you. |
| Comment by Peter Jones [ 14/Oct/22 ] |
|
Hi dneg just checking in to see whether you have tried the suggested patch yet... |
| Comment by Dneg (Inactive) [ 19/Oct/22 ] |
|
Hi Peter, I've just managed to apply the patched RPMs to the clients, and have increased lru_size back up to 10000 (from the 128 we had it set to) across the cluster. It took a long time, as I had to wait for all the long-running backups on these clients to finish, and I had to do them one by one. We should get some results through in the next couple of days, and I'll let you know either way. |
| Comment by Peter Jones [ 19/Oct/22 ] |
|
Great - thanks! |
| Comment by Dneg (Inactive) [ 20/Oct/22 ] |
|
Got three evictions last night on one of the three patched clients:

Oct 20 01:54:22 foxtrot2 kernel: LustreError: 167-0: foxtrot-MDT0000-mdc-ffff99c032e95800: This client was evicted by foxtrot-MDT0000; in progress operations using this service will fail.
Oct 20 03:02:17 foxtrot2 kernel: LustreError: 167-0: foxtrot-MDT0000-mdc-ffff99c032e95800: This client was evicted by foxtrot-MDT0000; in progress operations using this service will fail.
Oct 20 03:38:09 foxtrot2 kernel: LustreError: 167-0: foxtrot-MDT0000-mdc-ffff99c032e95800: This client was evicted by foxtrot-MDT0000; in progress operations using this service will fail. |
| Comment by Dneg (Inactive) [ 21/Oct/22 ] |
|
3 more evictions last night, two on one client, one on another. |
| Comment by Oleg Drokin [ 25/Oct/22 ] |
|
I just realized you did not include the server-side logs, so we don't know which locks the client was supposed to release but did not release in time. Can you please provide the MDS logs from the timeframe when those lustre-log files you uploaded were created? |
| Comment by Dneg (Inactive) [ 26/Oct/22 ] |
|
Hi Oleg, I've uploaded the logs from the mds from the earlier June evictions, as well as the latest batch, to ftp.whamcloud.com/uploads/ Campbell |
| Comment by Dneg (Inactive) [ 28/Oct/22 ] |
|
Dropped the lru_size to 128 as we were still getting evictions and backups were failing. |
| Comment by Oleg Drokin [ 01/Nov/22 ] |
|
These (June) logs look very familiar. I think there was a similar report some time ago where a lock cancel was triggering other lock cancels? I can't readily find that report to refresh my memory, but on the surface it sounds like the same issue.

In the old logs this is how it unfolds:

00010000:00010000:9.0:1654595833.299427:0:3716:0:(ldlm_lockd.c:1775:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: foxtrot-MDT0000-mdc-ffff8acff2c13800 lock: ffff8af441314d80/0x7960025dca1d30a8 lrc: 2/0,0 mode: PR/PR res: [0x20009b4d8:0x14677:0x0].0x0 bits 0x20/0x0 rrc: 3 type: IBT flags: 0x400000000000 nid: local remote: 0xb4442ee9c835c045 expref: -99 pid: 35329 timeout: 0 lvb_type: 0
00010000:00010000:9.0:1654595833.299485:0:3716:0:(ldlm_request.c:1150:ldlm_cli_cancel_local()) ### client-side cancel ns: foxtrot-MDT0000-mdc-ffff8acff2c13800 lock: ffff8af441314d80/0x7960025dca1d30a8 lrc: 3/0,0 mode: PR/PR res: [0x20009b4d8:0x14677:0x0].0x0 bits 0x20/0x20 rrc: 3 type: IBT flags: 0x408400000000 nid: local remote: 0xb4442ee9c835c045 expref: -99 pid: 35329 timeout: 0 lvb_type: 0
00010000:00000040:9.0:1654595833.299489:0:3716:0:(ldlm_resource.c:1601:ldlm_resource_putref()) putref res: ffff8b0921172e40 count: 2
00010000:00000040:24.0:1654595833.299492:0:35567:0:(ldlm_resource.c:1601:ldlm_resource_putref()) putref res: ffff8afaf44dac00 count: 1
00000020:00000040:9.0:1654595833.299493:0:3716:0:(lustre_handles.c:113:class_handle_unhash_nolock()) removing object ffff8af441314d80 with handle 0x7960025dca1d30a8 from hash

... lots and lots of other locks are being collected to be sent in the cancel RPC ...

00010000:00010000:4.0:1654595933.168282:0:3716:0:(ldlm_lockd.c:1800:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: foxtrot-MDT0000-mdc-ffff8acff2c13800 lock: ffff8af441314d80/0x7960025dca1d30a8 lrc: 1/0,0 mode: --/PR res: [0x20009b4d8:0x14677:0x0].0x0 bits 0x20/0x20 rrc: 3 type: IBT flags: 0x4c09400000000 nid: local remote: 0xb4442ee9c835c045 expref: -99 pid: 35329 timeout: 0 lvb_type: 0

So from this we see we already traversed ldlm_cli_cancel_local->ldlm_lock_cancel->ldlm_lock_destroy_nolock->class_handle_unhash_nolock, i.e. the lock was totally destroyed, so it's not a matter of the cancel thread never getting to it (as in the other report).

Anyway, looking closer at this thread (which is named ldlm_bl_126) we actually see this:

00000020:00100000:11.0:1654595833.301719:0:3716:0:(genops.c:2379:obd_get_mod_rpc_slot()) foxtrot-MDT0000-mdc-ffff8acff2c13800: sleeping for a modify RPC slot opc 35, max 7
00000100:00100000:4.0:1654595933.167133:0:3716:0:(client.c:2096:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_126:22bdcfdc-5f27-ca71-fcf9-840efc5add00:3716:1733599145210368:10.21.22.10@tcp:35

So somewhere along the way of canceling multiple locks, it suddenly met one that required a modify RPC slot that we did not have. opc 35 is MDS_CLOSE, so we were canceling an open lock and could not send the close, stopping this whole canceling thread in its tracks. So this is actually a known issue with an existing fix. |
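For context on the "sleeping for a modify RPC slot ... max 7" line above, the per-MDC modify RPC slot limit can be inspected on a client with lctl; this is purely illustrative and not part of the fix:

# Modify-RPC slot limit for each MDC (the "max 7" in the trace corresponds to this value).
lctl get_param mdc.*.max_mod_rpcs_in_flight
# Overall RPCs-in-flight limit for the same targets, for comparison.
lctl get_param mdc.*.max_rpcs_in_flight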
| Comment by Dneg (Inactive) [ 01/Nov/22 ] |
|
Hi Oleg, I have applied the patch to lustre/obdclass/genops.c (there was just the one at https://review.whamcloud.com/changes/fs%2Flustre-release~45850/revisions/1/patch?zip&path=lustre%2Fobdclass%2Fgenops.c, correct?) and have built new client rpms. I'll install them on the clients over the next few days, then bump up the lru_size across the cluster and let you know the result. |
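A quick sketch for confirming which build is actually loaded on each client after the install (package names are taken from the earlier comments and may differ locally):

# Installed client packages.
rpm -q lustre-client kmod-lustre-client
# Version reported by the loaded Lustre modules.
lctl get_param version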
| Comment by Peter Jones [ 11/Nov/22 ] |
|
Hey Campbell, just checking in to see how things are progressing. Peter |
| Comment by Dneg (Inactive) [ 11/Nov/22 ] |
|
Hi Peter, I was just about to post an update. No evictions since the patch was applied earlier in the week (Tuesday), so good news on that front. Will keep an eye on it over the weekend. We do get the odd soft lockup (e.g., Nov 9 03:11:25 foxtrot3 kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]). I can open a separate ticket for that issue if you like. |
| Comment by Dneg (Inactive) [ 11/Nov/22 ] |
|
I saw https://jira.whamcloud.com/browse/LU-15742; we already have lru_size at 10000 and ldlm.namespaces.*.lru_max_age=60000. |
| Comment by Dneg (Inactive) [ 17/Nov/22 ] |
|
Looking good, still no evictions after a week. |
| Comment by Peter Jones [ 19/Nov/22 ] |
|
Great! Then let's mark this as a duplicate of that issue. |
| Comment by Dneg (Inactive) [ 25/Nov/22 ] |
|
Thanks Peter, can you tell me which releases (in particular, 2.15.x and 2.12.x) have this genops.c patch? |
| Comment by Peter Jones [ 25/Nov/22 ] |
|
This fix was in 2.15.0 and will be in 2.12.10 (if we do one) |