[LU-15915] /bin/rm: fts_read failed: Cannot send after transport endpoint shutdown

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.8
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      I am running a large number of deletes on clients, and after a while the clients get evicted. The error on the client is:

      /bin/rm: fts_read failed: Cannot send after transport endpoint shutdown
      

      On the MDS, the error is:

      Jun  6 19:28:59 fmds1 kernel: LustreError: 9744:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.21.22.31@tcp  ns: mdt-foxtrot-MDT0000_UUID lock: ffff94f72a408480/0xb4442ee3e798319c lrc: 3/0,0 mode: PR/PR res: [0x20009b3c6:0x29eb:0x0].0x0 bits 0x20/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.21.22.31@tcp remote: 0x40ff70b2e6a5419f expref: 147862 pid: 61992 timeout: 6578337 lvb_type: 0
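
When many such messages accumulate on the MDS, it can help to reduce each one to the client, the wait time, the namespace, and the resource. The following is a minimal sketch, assuming every eviction line has the same field layout as the single message quoted above; `summarize_evictions` is a hypothetical helper name, not part of Lustre.

```shell
# Hypothetical helper: read kernel log lines on stdin and print one summary
# line per "lock callback timer expired" eviction. The field order
# (wait time, client NID, ns:, res:) is assumed from the message quoted above.
summarize_evictions() {
    sed -n 's/.*lock callback timer expired after \([0-9]*\)s: evicting client at \([^ ]*\) .*ns: \([^ ]*\) .*res: \(\[[^]]*]\).*/client=\2 waited=\1s ns=\3 res=\4/p'
}
```

For example, feeding the kernel log of the MDS through it (`journalctl -k | summarize_evictions | sort | uniq -c`, or the equivalent with the syslog file in use) would show whether the evictions cluster on one client or one resource. Lines that do not match the pattern are silently skipped.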
      

      I'm running maybe 10-15 recursive rm on 3 clients, so 30-45 in total at once.

      I've set debugging params as follows:

      lctl set_param debug_mb=1024
      lctl set_param debug="+dlmtrace +info +rpctrace"
      lctl set_param dump_on_eviction=1
      

      on clients and the MDS.
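
Since the same three settings have to be applied on every client and on the MDS, they can be wrapped in one function. This is only a sketch: `apply_lustre_debug` and the `RUN` variable are hypothetical names introduced here, while the `lctl set_param` lines are exactly the ones listed above.

```shell
# Sketch: apply the three debug settings from this report on the local node.
# RUN is a hypothetical knob: with RUN=echo the commands are printed instead
# of executed, which allows a dry run on a host without Lustre loaded.
apply_lustre_debug() {
    ${RUN:-} lctl set_param debug_mb=1024 &&
    ${RUN:-} lctl set_param debug="+dlmtrace +info +rpctrace" &&
    ${RUN:-} lctl set_param dump_on_eviction=1
}
```

Afterwards, `lctl get_param debug debug_mb dump_on_eviction` should confirm that the values took effect on that node.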

      Lustre version is 2.12.8_6_g5457c37

          Activity

            pjones Peter Jones added a comment -

            This fix was in 2.15.0 and will be in 2.12.10 (if we do one)


            dneg Dneg (Inactive) added a comment -

            Thanks Peter, can you tell me which releases (in particular, 2.15.x and 2.12.x) have this genops.c patch?

            pjones Peter Jones added a comment -

            Great! Then let's mark this as a duplicate of LU-14741. It would be better to track the soft lockup issue under a new ticket.


            dneg Dneg (Inactive) added a comment -

            Looking good, still no evictions after a week.

            dneg Dneg (Inactive) added a comment - - edited

            I saw https://jira.whamcloud.com/browse/LU-15742. We already have lru_size at 10000 and ldlm.namespaces.*.lru_max_age=60000.
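
One way to confirm that every namespace on a node really carries these values is to compare the `name=value` lines printed by `lctl get_param` against the expected setting. The sketch below assumes that output format; `check_param_value` is a hypothetical helper name.

```shell
# Sketch: read "name=value" lines on stdin (the format assumed for
# `lctl get_param` output) and report every parameter whose name ends in
# ".SUFFIX" but whose value differs from EXPECTED.
# Usage: check_param_value SUFFIX EXPECTED
# Exit status is 1 if any mismatch was found, 0 otherwise.
check_param_value() {
    awk -F= -v suffix="$1" -v want="$2" '
        # keep only parameters whose name ends in "." followed by the suffix
        substr($1, length($1) - length(suffix)) == "." suffix {
            if ($2 != want) {
                print $1 " is " $2 ", expected " want
                bad = 1
            }
        }
        END { exit bad + 0 }
    '
}
```

On a client this could be run as `lctl get_param 'ldlm.namespaces.*.lru_size' | check_param_value lru_size 10000`, and likewise for `lru_max_age` with 60000.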

            dneg Dneg (Inactive) added a comment - - edited

            Hi Peter, was just about to post an update. No evictions since the patch was applied earlier in the week (Tuesday), so good news on that front. Will keep an eye on it over the weekend. We get the odd soft lockup (e.g., Nov 9 03:11:25 foxtrot3 kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]). I can open a separate ticket for that issue if you like.

            pjones Peter Jones added a comment -

            Hey Campbell

            Just checking in to see how things are progressing

            Peter

            dneg Dneg (Inactive) added a comment - - edited

            Hi Oleg, I have applied the patch to lustre/obdclass/genops.c (there was just the one at https://review.whamcloud.com/changes/fs%2Flustre-release~45850/revisions/1/patch?zip&path=lustre%2Fobdclass%2Fgenops.c, correct?) and have built new client rpms. I'll install them on the clients over the next few days, then bump up the lru_size across the cluster and let you know the result.
            Thanks,
            Campbell


            People

              Assignee: laisiyao Lai Siyao
              Reporter: dneg Dneg (Inactive)
              Votes: 0
              Watchers: 6