Project: Lustre

LU-11826: Cannot send after transport endpoint shutdown

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.10.6
    • Environment: CentOS 7.4

    Description

      When running multiple rm's of files, we get the following error in the shell:

      /bin/rm: cannot remove '</some/file/path>': Cannot send after transport endpoint shutdown

      These coincide with the following error in /var/log/messages:

       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 11-0: foxtrot-MDT0000-mdc-ffff883ff6b12800: operation mds_close to node 10.21.22.10@tcp failed: rc = -107
       Dec 24 11:13:09 foxtrot2 kernel: Lustre: foxtrot-MDT0000-mdc-ffff883ff6b12800: Connection to foxtrot-MDT0000 (at 10.21.22.10@tcp) was lost; in progress operations using this service will wait for recovery to complete
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 167-0: foxtrot-MDT0000-mdc-ffff883ff6b12800: This client was evicted by foxtrot-MDT0000; in progress operations using this service will fail.
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 3598:0:(mdc_locks.c:1211:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -5
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 3598:0:(mdc_locks.c:1211:mdc_intent_getattr_async_interpret()) Skipped 37 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: Skipped 50 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39322:0:(llite_lib.c:1512:ll_md_setattr()) md_setattr fails: rc = -5
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 38248:0:(file.c:172:ll_close_inode_openhandle()) foxtrot-clilmv-ffff883ff6b12800: inode [0x200030875:0x5d11:0x0] mdc close failed: rc = -107
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 38248:0:(file.c:172:ll_close_inode_openhandle()) Skipped 743 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 41760:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x2000302ba:0x103db:0x0] error -108.
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 41760:0:(vvp_io.c:1474:vvp_io_init()) Skipped 310070 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 44300:0:(mdc_request.c:1329:mdc_read_page()) foxtrot-MDT0000-mdc-ffff883ff6b12800: [0x20002cfcf:0x5a20:0x0] lock enqueue fails: rc = -108
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39322:0:(llite_lib.c:1512:ll_md_setattr()) Skipped 5 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 12816:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x200030766:0x18539:0x0] error -108.
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39252:0:(vvp_io.c:1474:vvp_io_init()) foxtrot: refresh file layout [0x2000302ba:0x10403:0x0] error -108.
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 39252:0:(vvp_io.c:1474:vvp_io_init()) Skipped 143616 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 44302:0:(file.c:172:ll_close_inode_openhandle()) foxtrot-clilmv-ffff883ff6b12800: inode [0x20000070c:0x2ea9:0x0] mdc close failed: rc = -108
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 44302:0:(file.c:172:ll_close_inode_openhandle()) Skipped 815 previous similar messages
       Dec 24 11:13:09 foxtrot2 kernel: LustreError: 12816:0:(vvp_io.c:1474:vvp_io_init()) Skipped 2986 previous similar messages
       Dec 24 11:13:10 foxtrot2 kernel: Lustre: foxtrot-MDT0000-mdc-ffff883ff6b12800: Connection restored to 10.21.22.10@tcp (at 10.21.22.10@tcp)
      

      Activity

            Hi,

            Did you find anything of interest in the log I sent? Or do you need the client log? I'm finding it difficult to capture the log on the client, as it may take a few days to trigger the problem and I can't watch it all the time. Even with 4GB of memory assigned to the debug log size, it only seems to capture about 5 minutes' worth (though the log is 3GB in size).

            Regards,

            Campbell
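            One way to keep the capture window from wrapping before anyone can dump it is to stream the debug buffer to disk instead of holding it in RAM. A sketch, assuming `lctl debug_daemon` is available in this Lustre release; the output path and size are examples, not anything from this ticket:

```shell
#!/bin/sh
# Sketch: continuously write the Lustre debug buffer to a file so a
# short in-memory buffer does not wrap before the event is dumped.
# Assumes `lctl debug_daemon` support; path and size are examples.
DUMP_FILE=/var/tmp/$(hostname)-lustre-debug
DUMP_SIZE_MB=4096   # upper bound for the on-disk binary log

if command -v lctl >/dev/null 2>&1; then
    # Start the daemon; it appends binary debug records until stopped.
    lctl debug_daemon start "$DUMP_FILE" "$DUMP_SIZE_MB"
    # ... reproduce the problem, then:
    # lctl debug_daemon stop
    # lctl debug_file "$DUMP_FILE" "$DUMP_FILE.txt"   # convert to text
fi
echo "daemon target: $DUMP_FILE ($DUMP_SIZE_MB MB)"
```

            The resulting binary file can then be converted with `lctl debug_file` and compressed before upload.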

            cmcl Campbell Mcleay (Inactive) added a comment

            Hi,

            I've uploaded the MDS log (uploads/LU-11826/fmds1-lustre.log.gz) to ftp.whamcloud.com. The time of the disconnection was 1547661781 in Unix time (16th Jan 22:27:22). I'm running another rm in the hope that I can capture some useful logs on the clients as well.

            Regards,

            Campbell

            cmcl Campbell Mcleay (Inactive) added a comment
            pjones Peter Jones added a comment -

            Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to be attached to a JIRA ticket. For these cases, Whamcloud provides an anonymous write-only FTP upload service. In order to use this service, you'll need an FTP client (e.g. ncftp or ftp) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory named after your JIRA issue.

            In the following example, there are three debug logs in a single directory and the JIRA issue LU-4242 has been created. After completing the upload, please update the relevant issue with a note mentioning the upload, so that our engineers know where to find your logs.

            $ ls -lh
            total 333M
            -rw-r--r-- 1 mjmac mjmac  98M Feb 23 17:36 mds-debug
            -rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug
            -rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug
            $ ncftp ftp.whamcloud.com
            NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
            Connecting to 99.96.190.235...
            (vsFTPd 2.2.2)
            Logging in...
            Login successful.
            Logged in to ftp.whamcloud.com.
            ncftp / > cd uploads
            Directory successfully changed.
            ncftp /uploads > mkdir LU-4242
            ncftp /uploads > cd LU-4242
            Directory successfully changed.
            ncftp /uploads/LU-4242 > put *
            mds-debug: 97.66 MB 11.22 MB/s
            oss-00-debug: 117.19 MB 11.16 MB/s
            oss-01-debug: 117.48 MB 11.18 MB/s
            ncftp /uploads/LU-4242 >
            

            Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created, nor will you (or anyone other than Whamcloud staff) be able to see or read them.
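            The interactive session above can also be done non-interactively. A sketch using `ncftpput` from the same ncftp package; LU-4242 is the example issue number from the comment above, and the `*-debug` glob is assumed to match your log files:

```shell
#!/bin/sh
# Sketch: non-interactive upload of debug logs to the Whamcloud
# write-only FTP area. LU-4242 is the example issue from the comment
# above; substitute your own ticket number.
ISSUE=LU-4242
REMOTE_DIR=/uploads/$ISSUE

if command -v ncftpput >/dev/null 2>&1; then
    # -m creates the remote directory first; login defaults to anonymous.
    ncftpput -m ftp.whamcloud.com "$REMOTE_DIR" ./*-debug
fi
echo "would upload to $REMOTE_DIR"
```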


            Hi Oleg,

            There was a disconnection at around 22:27 last night, about 5 hours after I started the rm's. Unfortunately, I was not able to dump the log until I got in this morning, and the timestamp at the beginning of the client logs is after this time; however, the MDS debug log has that period covered. I'll restart the rm's and see if I can capture the error. Should I increase the debug_mb value on the clients? In the meantime, I can send you the MDS log, which, compressed, is about 22MB. Is there somewhere I can send it?

            Regards,

            Campbell

            cmcl Campbell Mcleay (Inactive) added a comment
            green Oleg Drokin added a comment -

            You could also use the "+..." notation to just add flags to what you have there.
            Don't change anything in subsystem_debug; it's already fully enabled. You only need to adjust the debug setting.

            In your case, you are best advised to increase the debug log size to something like 1G (note this will consume 1G of RAM):

            lctl set_param debug_mb=1024
            

            and then add dlmtrace info and rpctrace with

            lctl set_param debug="+dlmtrace +info +rpctrace"
            

            Run these two commands on all the clients you expect the problem might hit and then also on the MDS.

            Once the problem hits, run "lctl dk >/tmp/${HOSTNAME}-lustre.log" on all affected nodes, collect the files, and upload them somewhere.
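            Taken together, the procedure above can be sketched as a single script to run on each client and on the MDS (a sketch of the steps in this comment, nothing more):

```shell
#!/bin/sh
# Sketch of the capture procedure described above: enlarge the debug
# buffer, enable the extra debug flags, and (after the problem hits)
# dump the buffer to a per-host file. Run on each client and the MDS.
DEBUG_MB=1024
DEBUG_FLAGS="+dlmtrace +info +rpctrace"
OUT=/tmp/$(hostname)-lustre.log

if command -v lctl >/dev/null 2>&1; then
    lctl set_param debug_mb=$DEBUG_MB
    lctl set_param debug="$DEBUG_FLAGS"
    # After reproducing the eviction, dump the kernel debug buffer:
    # lctl dk > "$OUT"
fi
echo "dump target: $OUT"
```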

            Thanks.


            Hi Oleg,

            Can you please let me know which of the following debug options I should enable (from /proc/sys/lnet/subsystem_debug):

            undefined mdc mds osc ost class log llite rpc mgmt lnet lnd pinger filter echo ldlm lov lquota osd lfsck snapshot lmv sec gss mgc mgs fid fld

            Presumably, it is just:

            echo "<subsystem1> <subsystem2> <subsystem3>" > /proc/sys/lnet/debug

            Should I just do this on the MDS or on the clients as well? Same subsystems options on both?


            Thanks,

            Campbell

            cmcl Campbell Mcleay (Inactive) added a comment

            Hi Oleg,

            I'll prepare a bunch of rm jobs and turn on debugging. This may take a few days for me to get done.

            Thanks,

            Campbell

            cmcl Campbell Mcleay (Inactive) added a comment
            green Oleg Drokin added a comment (edited)

            So the server-side messages tell us that:

            1. The client failed to reply to a blocking AST (a special message that the Lustre server sends to ask a client to release a lock due to a conflict). The timeout for these low-overhead messages is pretty small, at 5-7 seconds, so if there is a momentary lapse in connectivity, for example, this might happen.
            2. The server evicted the client so as not to hold up the rest of the nodes waiting for a potentially dead client.
            3. The client turned out to be not as dead as it seemed and reconnected.
            4. Another blocking AST that we had scheduled for this client was interrupted due to the eviction, and the client was evicted once more.

            Part #4 does not sound entirely correct, but it did not change anything in the general flow of things, since the original eviction is what messed everything up, and it most likely happened due to some hiccup on the network.

            To add: if you see this every time you do large rm's, it's possible you are just overstretching some part of the network infrastructure. You can collect some Lustre debug logs and we can see which part exactly, I guess.
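            The timeout-then-evict behaviour in steps 1-2 can be illustrated with a toy model (purely illustrative, not Lustre code; the class, method names, and delays are invented):

```python
# Toy model of steps 1-2 above: a server sends a blocking callback and
# evicts the client if no reply arrives within a short timeout.
# Purely illustrative; names and values are invented.
class Server:
    AST_TIMEOUT = 7  # seconds; Lustre's window here is a small 5-7 s

    def __init__(self):
        self.evicted = set()

    def blocking_ast(self, client, reply_delay):
        """Ask `client` to drop a lock; evict it if the reply is late."""
        if reply_delay > self.AST_TIMEOUT:
            self.evicted.add(client)  # step 2: evict the unresponsive client
            return "evicted"
        return "lock released"

server = Server()
print(server.blocking_ast("client-a", reply_delay=2))   # healthy client
print(server.blocking_ast("client-b", reply_delay=30))  # network hiccup
```

            The point of the model: even a client that is perfectly healthy a moment later gets evicted if one reply misses the short window, which matches the "not as dead as it seemed" reconnect in step 3.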


            Can anyone help with this? Thanks

            cmcl Campbell Mcleay (Inactive) added a comment

            Hi Oleg,

            Any thoughts?

            Regards,

            Campbell

            cmcl Campbell Mcleay (Inactive) added a comment

            Hi Oleg,

            Sorry for the delay in replying. There appear to be errors in the MDS logs at that time too:

            Dec 24 11:13:09 fmds1 kernel: LustreError: 17227:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.21.22.31@tcp) failed to reply to blocking AST (req@ffff88236d22d700 x1615613399909376 status 0 rc -110), evict it ns: mdt-foxtrot-MDT0000_UUID lock: ffff8820a62b8a00/0x13866bfbe99ef389 lrc: 4/0,0 mode: PR/PR res: [0x20002c7e2:0x4352:0x0].0x0 bits 0x1b rrc: 2 type: IBT flags: 0x60200400000020 nid: 10.21.22.31@tcp remote: 0xbf690c3f5f1d9519 expref: 169058 pid: 87340 timeout: 9220366206 lvb_type: 0
            Dec 24 11:13:09 fmds1 kernel: LustreError: 138-a: foxtrot-MDT0000: A client on nid 10.21.22.31@tcp was evicted due to a lock blocking callback time out: rc -110
            Dec 24 11:13:09 fmds1 kernel: LustreError: 19974:0:(ldlm_lockd.c:1398:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8820f2829800 ns: mdt-foxtrot-MDT0000_UUID lock: ffff88243a769c00/0x13866bfbe99f2569 lrc: 3/0,0 mode: PR/PR res: [0x2000226c3:0xeea1:0x0].0x0 bits 0x1b rrc: 1 type: IBT flags: 0x50200000000000 nid: 10.21.22.31@tcp remote: 0xbf690c3f5f1d993a expref: 168919 pid: 19974 timeout: 0 lvb_type: 0
            Dec 24 11:13:09 fmds1 kernel: LustreError: 20543:0:(ldlm_lib.c:3180:target_bulk_io()) @@@ bulk READ failed: rc 107 req@ffff88236c2f8c00 x1620532433667344/t0(0) o37>5bfdc7b1-5f4a-3415-a424-ef332bafeb5b@10.21.22.31@tcp:-1/-1 lens 568/440 e 0 to 0 dl 1545650004 ref 1 fl Interpret:/0/0 rc 0/0
            Dec 24 11:13:09 fmds1 kernel: Lustre: foxtrot-MDT0000: Connection restored to (at 10.21.22.31@tcp)
            Dec 24 11:13:09 fmds1 kernel: LustreError: 19974:0:(ldlm_lockd.c:1398:ldlm_handle_enqueue0()) Skipped 30 previous similar messages
            Dec 24 11:13:11 fmds1 kernel: LustreError: 98802:0:(client.c:1164:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88231848e900 x1615613399983344/t0(0) o104->foxtrot-MDT0000@10.21.22.31@tcp:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
            Dec 24 11:13:11 fmds1 kernel: LustreError: 98802:0:(client.c:1164:ptlrpc_import_delay_req()) Skipped 1 previous similar message
            Dec 24 11:13:11 fmds1 kernel: LustreError: 98802:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.21.22.31@tcp) failed to reply to blocking AST (req@ffff88231848e900 x1615613399983344 status 0 rc -5), evict it ns: mdt-foxtrot-MDT0000_UUID lock: ffff882550cebc00/0x13866bfbe6684d33 lrc: 4/0,0 mode: PR/PR res: [0x200030873:0x1e93f:0x0].0x0 bits 0x2 rrc: 4 type: IBT flags: 0x60000400000020 nid: 10.21.22.31@tcp remote: 0xbf690c3f5eec06fe expref: 98520 pid: 20548 timeout: 9220368047 lvb_type: 0
            Dec 24 11:13:11 fmds1 kernel: LustreError: 138-a: foxtrot-MDT0000: A client on nid 10.21.22.31@tcp was evicted due to a lock blocking callback time out: rc -5


            Regards,

            Campbell

            cmcl Campbell Mcleay (Inactive) added a comment

            People

              green Oleg Drokin
              cmcl Campbell Mcleay (Inactive)