
Cannot send after transport endpoint shutdown

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.10.6
    • None
    • CentOS 7.4
    • 3
    • 9223372036854775807

    Description

      When running multiple rm commands on files, we get the following error in the shell:

      /bin/rm: cannot remove '</some/file/path>': Cannot send after transport endpoint shutdown

      These coincide with the following error in /var/log/messages:

       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 11-0: foxtrot-MDT0000-mdc-ffff883ff6b12800: operation mds_close to node 10.21.22.10@tcp failed: rc = -107
       Dec 24 11:13:09 foxtrot2 kernel: Lustre: foxtrot-MDT0000-mdc-ffff883ff6b12800: Connection to foxtrot-MDT0000 (at 10.21.22.10@tcp) was lost; in progress operations using this service will wait for recovery to complete
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 167-0: foxtrot-MDT0000-mdc-ffff883ff6b12800: This client was evicted by foxtrot-MDT0000; in progress operations using this service will fail.
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 3598:0:(mdc_locks.c:1211:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -5
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 3598:0:(mdc_locks.c:1211:mdc_intent_getattr_async_interpret()) Skipped 37 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: Skipped 50 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39322:0:(llite_lib.c:1512:ll_md_setattr()) md_setattr fails: rc = -5
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 38248:0:(file.c:172:ll_close_inode_openhandle()) foxtrot-clilmv-ffff883ff6b12800: inode [0x200030875:0x5d11:0x0] mdc close failed: rc = -107
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 38248:0:(file.c:172:ll_close_inode_openhandle()) Skipped 743 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 41760:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x2000302ba:0x103db:0x0] error -108.
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 41760:0:(vvp_io.c:1474:vvp_io_init()) Skipped 310070 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 44300:0:(mdc_request.c:1329:mdc_read_page()) foxtrot-MDT0000-mdc-ffff883ff6b12800: [0x20002cfcf:0x5a20:0x0] lock enqueue fails: rc = -108
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39322:0:(llite_lib.c:1512:ll_md_setattr()) Skipped 5 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 12816:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x200030766:0x18539:0x0] error -108.
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39252:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x2000302ba:0x10403:0x0] error -108.
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39252:0:(vvp_io.c:1474:vvp_io_init()) Skipped 143616 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 44302:0:(file.c:172:ll_close_inode_openhandle()) foxtrot-clilmv-ffff883ff6b12800: inode [0x20000070c:0x2ea9:0x0] mdc close failed: rc = -108
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 44302:0:(file.c:172:ll_close_inode_openhandle()) Skipped 815 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 12816:0:(vvp_io.c:1474:vvp_io_init()) Skipped 2986 previous similar messages
       Dec 24 11:13:10 foxtrot2 kernel: Lustre: foxtrot-MDT0000-mdc-ffff883ff6b12800: Connection restored to 10.21.22.10@tcp (at 10.21.22.10@tcp)
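
The return codes in the log above are negated Linux errno values: 5 is EIO, 107 is ENOTCONN, and 108 is ESHUTDOWN, whose message text is the "Cannot send after transport endpoint shutdown" string that rm prints. As a rough aid for reading logs like this one, the sketch below tallies the codes found in a list of syslog lines. The function name `tally_return_codes` and the regular expression are illustrative assumptions, not part of Lustre or its tooling, and the pattern only covers the three message shapes visible in this excerpt.

```python
import re
from collections import Counter

# Linux errno values seen in the log above. They are hard-coded so the
# sketch gives the same answer regardless of the platform it runs on.
ERRNO_NAMES = {
    5: "EIO (Input/output error)",
    107: "ENOTCONN (Transport endpoint is not connected)",
    108: "ESHUTDOWN (Cannot send after transport endpoint shutdown)",
}

# Assumption: matches only the shapes seen in this excerpt, i.e.
# "rc = -107", "error -108." and "ldlm_cli_enqueue_fini: -5".
RC_PATTERN = re.compile(r"(?:rc = |error |_fini: )-(\d+)")


def tally_return_codes(lines):
    """Count how often each (positive) errno value appears in the lines."""
    counts = Counter()
    for line in lines:
        match = RC_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Four lines copied from the log above; the last one carries no code.
sample = [
    "Dec 24 11:13:09 foxtrot2 kernel: LustreError: 11-0: foxtrot-MDT0000-mdc-ffff883ff6b12800: operation mds_close to node 10.21.22.10@tcp failed: rc = -107",
    "Dec 24 11:13:09 foxtrot2 kernel: LustreError: 41760:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x2000302ba:0x103db:0x0] error -108.",
    "Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39322:0:(llite_lib.c:1512:ll_md_setattr()) md_setattr fails: rc = -5",
    "Dec 24 11:13:09 foxtrot2 kernel: LustreError: 41760:0:(vvp_io.c:1474:vvp_io_init()) Skipped 310070 previous similar messages",
]

for code, count in sorted(tally_return_codes(sample).items()):
    print(code, ERRNO_NAMES.get(code, "unknown"), count)
```

Note that the "Skipped N previous similar messages" lines mean the kernel log under-reports how many times each error occurred, so a tally like this is a lower bound.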
      

          Activity

            [LU-11826] Cannot send after transport endpoint shutdown
            pjones Peter Jones added a comment -

            Ah good - thanks for confirming! We have included this fix in 2.10.7 but I am loath to include fixes that we don't know serve a purpose.


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi Peter,

            My apologies, I missed your last message. We did extensive testing with parallel deletes and there were no 'transport endpoint shutdown' messages. We're still getting evictions from OSTs but that is a separate issue. So I think we can consider the patch a success in fixing the original issue.

            Kind regards,

            Campbell

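
The thread does not say how the parallel deletes were run. Purely as an illustration, a test of that kind could look like the sketch below: it unlinks a list of paths from several threads and separates out the failures whose errno is ESHUTDOWN, the error reported in this ticket. The helper names `remove_one` and `parallel_delete` and the worker count are hypothetical, and this is not the reporter's actual procedure.

```python
import errno
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def remove_one(path):
    """Unlink one file; return None on success or the errno on failure."""
    try:
        os.unlink(path)
        return None
    except OSError as exc:
        return exc.errno


def parallel_delete(paths, workers=8):
    """Delete paths concurrently.

    Returns (failures, shutdowns): failures is a list of (path, errno)
    pairs, shutdowns is the subset of paths that failed with ESHUTDOWN.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(remove_one, paths))
    failures = [(p, rc) for p, rc in zip(paths, results) if rc is not None]
    shutdowns = [p for p, rc in failures if rc == errno.ESHUTDOWN]
    return failures, shutdowns


if __name__ == "__main__":
    # Demo on a scratch directory; on a real system the paths would live
    # on the Lustre mount under test.
    scratch = tempfile.mkdtemp()
    demo_paths = []
    for i in range(100):
        path = os.path.join(scratch, f"file{i}")
        open(path, "w").close()
        demo_paths.append(path)
    failed, shut = parallel_delete(demo_paths)
    print(f"{len(failed)} failures, {len(shut)} with ESHUTDOWN")
    os.rmdir(scratch)
```

Threads are enough here because unlink is I/O bound. Several independent rm processes, as in the original report, would exercise the client in a similar but not identical way.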
            pjones Peter Jones added a comment -

            Disappointing to not hear explicit feedback on the effectiveness of the patch but I suppose no news is good news...

            pjones Peter Jones added a comment -

            How are things shaping up with the patch, cmcl? OK to consider this ticket closed?

            green Oleg Drokin added a comment -

            Yes, llite_lloop.ko is not really used nowadays, so you should not worry too much about it. Please let me know how it goes; if the problems persist, please collect the logs as before.


            cmcl Campbell Mcleay (Inactive) added a comment - edited

            Hi Oleg,

            Have built the server packages from that source tree you linked. Upon installation, I got a lot of warnings about llite_lloop.ko needing various unknown symbols, but I read somewhere that this package is obsolete - need I worry about this?

            Will start some deletes and see how it goes.

            Regards,

            Campbell

            green Oleg Drokin added a comment -

            The ported patch is here: https://review.whamcloud.com/#/c/34131/

            I hoped it would be done testing by now, but apparently we have some test system slowness where results take a while to become available.


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi Oleg,

            Thanks, we'll try the patch, so let's go ahead with that.

            Kind regards,

            Campbell

            green Oleg Drokin added a comment -

            Thank you for the logs; these ones are helpful, though I wonder which version you are actually running, since some of the messages in there don't appear to match my copy of 2.10.6.

            Anyway, the issue you are hitting appears to be LU-10945; the telltale message in your logs is this:

            00010000:00010000:27.0:1548163854.268356:0:4761:0:(ldlm_lockd.c:1685:ldlm_request_cancel()) ### server cancels blocked lock after 1548163854s ns: mdt-foxtrot-MDT0000_UUID lock: ffff8822c3821200/0x13866dfbc82358f1 lrc: 4/0,0 mode: PR/PR res: [0x20002f6d0:0x15746:0x0].0x0 bits 0x1b rrc: 5 type: IBT flags: 0x40200000000020 nid: 10.21.22.32@tcp remote: 0xc849fe70b80233f8 expref: 220164 pid: 7423 timeout: 0 lvb_type: 0
            

            The patch https://review.whamcloud.com/#/c/32133/3 should help you, except that it does not apply to the b2_10 tree, so I'll make a port.

            Do you have the ability to self-build Lustre with the patch (only MDS and OSSes would need the patched code)?
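
For anyone checking their own logs for the same signature, the sketch below scans decoded debug-log text (for example, the output of lctl df mentioned elsewhere in this thread) for the "server cancels blocked lock after" message and pulls out the reported wait, the resource, and the client NID. The function name `find_blocked_lock_cancels` and the regular expression are assumptions based only on the single line quoted above, so other Lustre versions may format the message differently.

```python
import re

# The message Oleg identifies as the telltale sign in the comment above.
TELLTALE = "server cancels blocked lock after"

# Assumption: field layout inferred from the one quoted log line only.
FIELDS = re.compile(
    r"after (?P<waited>\d+)s .*?res: (?P<res>\[[^\]]+\]).*?nid: (?P<nid>\S+)"
)


def find_blocked_lock_cancels(lines):
    """Return a dict of waited/res/nid for each telltale line found."""
    hits = []
    for line in lines:
        if TELLTALE not in line:
            continue
        match = FIELDS.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# The line quoted in the comment above, used here as sample input.
quoted_line = (
    "00010000:00010000:27.0:1548163854.268356:0:4761:0:"
    "(ldlm_lockd.c:1685:ldlm_request_cancel()) ### server cancels blocked "
    "lock after 1548163854s ns: mdt-foxtrot-MDT0000_UUID lock: "
    "ffff8822c3821200/0x13866dfbc82358f1 lrc: 4/0,0 mode: PR/PR res: "
    "[0x20002f6d0:0x15746:0x0].0x0 bits 0x1b rrc: 5 type: IBT flags: "
    "0x40200000000020 nid: 10.21.22.32@tcp remote: 0xc849fe70b80233f8 "
    "expref: 220164 pid: 7423 timeout: 0 lvb_type: 0"
)

for hit in find_blocked_lock_cancels([quoted_line]):
    print(hit["nid"], hit["res"], hit["waited"])
```

Grouping the hits by NID would show which clients are involved, which may help when correlating server-side messages with client evictions.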


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi Oleg,

            Have uploaded the logs parsed via lctl df as well.

            Thanks,

            Campbell

            People

              green Oleg Drokin
              cmcl Campbell Mcleay (Inactive)
              Votes: 0
              Watchers: 4
