  Lustre / LU-15915

/bin/rm: fts_read failed: Cannot send after transport endpoint shutdown

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.8
    • Severity: 3

    Description

      I am running a large number of deletes on clients, and after a while they get evicted. The error on the client is:

      /bin/rm: fts_read failed: Cannot send after transport endpoint shutdown
      

      On the MDS, the error is:

      Jun  6 19:28:59 fmds1 kernel: LustreError: 9744:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.21.22.31@tcp  ns: mdt-foxtrot-MDT0000_UUID lock: ffff94f72a408480/0xb4442ee3e798319c lrc: 3/0,0 mode: PR/PR res: [0x20009b3c6:0x29eb:0x0].0x0 bits 0x20/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.21.22.31@tcp remote: 0x40ff70b2e6a5419f expref: 147862 pid: 61992 timeout: 6578337 lvb_type: 0
      

      I'm running maybe 10-15 recursive rm processes on each of 3 clients, so 30-45 in total at once.
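
      A minimal sketch of the kind of delete workload in question (the mount point, directory names, and process count below are illustrative, not taken from this ticket):

      # Launch roughly a dozen recursive deletes in parallel on one client;
      # the same was done on each of the 3 clients.
      for i in $(seq 1 12); do
          rm -rf /mnt/foxtrot/scratch/tree_$i &
      done
      wait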

      I've set debugging params as follows:

      lctl set_param debug_mb=1024
      lctl set_param debug="+dlmtrace +info +rpctrace"
      lctl set_param dump_on_eviction=1
      

      on clients and the MDS.
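
      In case it helps with collecting logs: with dump_on_eviction=1 the client writes a binary debug dump when it is evicted, and the in-memory buffer can also be dumped by hand. A hedged sketch (the dump file name below is hypothetical; the default dump location is /tmp/lustre-log.<timestamp> unless debug_path was changed):

      # Dump the in-memory debug buffer manually on a client or the MDS.
      lctl dk /tmp/lustre-debug.$(hostname).txt
      # Convert an automatically written binary eviction dump to text.
      lctl debug_file /tmp/lustre-log.1654595933.3716 /tmp/lustre-log.txt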

      Lustre version is 2.12.8_6_g5457c37

      Attachments

        Issue Links

          Activity

            [LU-15915] /bin/rm: fts_read failed: Cannot send after transport endpoint shutdown
            pjones Peter Jones added a comment -

            This fix was in 2.15.0 and will be in 2.12.10 (if we do one)


            dneg Dneg (Inactive) added a comment -

            Thanks Peter, can you tell me which releases (in particular, 2.15.x and 2.12.x) have this genops.c patch?
            pjones Peter Jones added a comment -

            Great! Then let's mark this as a duplicate of LU-14741. It would be better to track the soft lockup issue under a new ticket.


            dneg Dneg (Inactive) added a comment -

            Looking good, still no evictions after a week.
            dneg Dneg (Inactive) added a comment - - edited

            I saw https://jira.whamcloud.com/browse/LU-15742; we already have lru_size at 10000 and ldlm.namespaces.*.lru_max_age=60000.
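
            For reference, a sketch of how those LDLM LRU settings can be checked and applied on a client (values mirror the ones quoted above; the set of namespaces present depends on the filesystem):

            # Current per-namespace lock LRU settings on this client.
            lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age
            # Values referenced in this ticket.
            lctl set_param ldlm.namespaces.*.lru_size=10000
            lctl set_param ldlm.namespaces.*.lru_max_age=60000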

            dneg Dneg (Inactive) added a comment - - edited

            Hi Peter, I was just about to post an update. No evictions since the patch was applied earlier in the week (Tuesday), so good news on that front. I will keep an eye on it over the weekend. We do get the odd soft lockup (e.g., Nov 9 03:11:25 foxtrot3 kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]). I can open a separate ticket for that issue if you like.

            pjones Peter Jones added a comment -

            Hey Campbell

            Just checking in to see how things are progressing

            Peter

            dneg Dneg (Inactive) added a comment - - edited

            Hi Oleg, I have applied the patch to lustre/obdclass/genops.c (there was just the one at https://review.whamcloud.com/changes/fs%2Flustre-release~45850/revisions/1/patch?zip&path=lustre%2Fobdclass%2Fgenops.c, correct?) and have built new client rpms. I'll install them on the clients over the next few days, then bump up the lru_size across the cluster and let you know the result.
            Thanks,
            Campbell
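
            For anyone following along, a rough sketch of the client rebuild workflow described above (the patch file name and configure options are assumptions about this site's build, not details confirmed in the ticket):

            # Apply the single-file genops.c patch from review 45850 to a
            # 2.12.8 client source tree, then build client-only RPMs.
            cd lustre-release
            git apply ~/genops-45850.patch
            sh autogen.sh
            ./configure --disable-server
            make rpms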

            green Oleg Drokin added a comment -

            So these (June) logs look very familiar; I think there was a similar report some time ago where a lock cancel was triggering other lock cancels? Can't quite find the report readily to refresh my memory, though...

            On the surface it sounds like LU-15821, but I think there was another path where, as we do a cancel, we also want to pack up some more locks that are ready to be cancelled (I cannot verify what your current client version is to see whether it really got the fix, and I don't have more recent debug logs to confirm whether the behavior is the same or different now).

            In the old logs this is how it unfolds:

            00010000:00010000:9.0:1654595833.299427:0:3716:0:(ldlm_lockd.c:1775:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: foxtrot-MDT0000-mdc-ffff8acff2c13800 lock: ffff8af441314d80/0x7960025dca1d30a8 lrc: 2/0,0 mode: PR/PR res: [0x20009b4d8:0x14677:0x0].0x0 bits 0x20/0x0 rrc: 3 type: IBT flags: 0x400000000000 nid: local remote: 0xb4442ee9c835c045 expref: -99 pid: 35329 timeout: 0 lvb_type: 0
            00010000:00010000:9.0:1654595833.299485:0:3716:0:(ldlm_request.c:1150:ldlm_cli_cancel_local()) ### client-side cancel ns: foxtrot-MDT0000-mdc-ffff8acff2c13800 lock: ffff8af441314d80/0x7960025dca1d30a8 lrc: 3/0,0 mode: PR/PR res: [0x20009b4d8:0x14677:0x0].0x0 bits 0x20/0x20 rrc: 3 type: IBT flags: 0x408400000000 nid: local remote: 0xb4442ee9c835c045 expref: -99 pid: 35329 timeout: 0 lvb_type: 0
            00010000:00000040:9.0:1654595833.299489:0:3716:0:(ldlm_resource.c:1601:ldlm_resource_putref()) putref res: ffff8b0921172e40 count: 2
            00010000:00000040:24.0:1654595833.299492:0:35567:0:(ldlm_resource.c:1601:ldlm_resource_putref()) putref res: ffff8afaf44dac00 count: 1
            00000020:00000040:9.0:1654595833.299493:0:3716:0:(lustre_handles.c:113:class_handle_unhash_nolock()) removing object ffff8af441314d80 with handle 0x7960025dca1d30a8 from hash 
            ...
            lots and lots of other locks are being collected to be sent in the cancel RPC
            ...
            00010000:00010000:4.0:1654595933.168282:0:3716:0:(ldlm_lockd.c:1800:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: foxtrot-MDT0000-mdc-ffff8acff2c13800 lock: ffff8af441314d80/0x7960025dca1d30a8 lrc: 1/0,0 mode: --/PR res: [0x20009b4d8:0x14677:0x0].0x0 bits 0x20/0x20 rrc: 3 type: IBT flags: 0x4c09400000000 nid: local remote: 0xb4442ee9c835c045 expref: -99 pid: 35329 timeout: 0 lvb_type: 0

            So from this we see we already traversed ldlm_cli_cancel_local->ldlm_lock_cancel->ldlm_lock_destroy_nolock->class_handle_unhash_nolock - i.e., the lock was totally destroyed, and it's not a matter of the cancel thread never even getting to it as in LU-15821.

            Anyway, looking closer at this thread (which is named ldlm_bl_126), we actually see this:

            00000020:00100000:11.0:1654595833.301719:0:3716:0:(genops.c:2379:obd_get_mod_rpc_slot()) foxtrot-MDT0000-mdc-ffff8acff2c13800: sleeping for a modify RPC slot opc 35, max 7
            00000100:00100000:4.0:1654595933.167133:0:3716:0:(client.c:2096:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_126:22bdcfdc-5f27-ca71-fcf9-840efc5add00:3716:1733599145210368:10.21.22.10@tcp:35 

            So somewhere along the way of canceling multiple locks, it suddenly met one that required a modify RPC slot that we did not have. opc 35 is MDS_CLOSE, so we were canceling an open lock and could not send the close, stopping this whole canceling thread in its tracks.
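
            A hedged way to inspect the modify-RPC-slot limit being hit here on a client (these are the standard mdc tunables; the "max 7" in the obd_get_mod_rpc_slot line above corresponds to max_mod_rpcs_in_flight):

            # Maximum number of modifying RPCs (e.g., opc 35 / MDS_CLOSE) the
            # client may have in flight per MDC, and the overall RPC limit.
            lctl get_param mdc.*.max_mod_rpcs_in_flight
            lctl get_param mdc.*.max_rpcs_in_flight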

            So this is actually LU-14741, which was never included in b2_12; the backported patch is here: https://review.whamcloud.com/c/fs/lustre-release/+/45850


            dneg Dneg (Inactive) added a comment -

            Dropped the lru_size to 128 as we were still getting evictions and backups were failing.

            People

              Assignee: laisiyao Lai Siyao
              Reporter: dneg Dneg (Inactive)
              Votes: 0
              Watchers: 6