LU-6702: shutting down OSTs in parallel with MDT(s)


Details

    • Type: Question/Request
    • Resolution: Unresolved
    • Priority: Major

    Description

      When shutting down OSTs and MDTs in parallel, we see some OSTs that shut down quite quickly:

      Jun  9 10:45:56 eagle-8.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST002e
      Jun  9 10:45:56 eagle-8.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST002e complete
      Jun  9 10:45:57 eagle-8.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST0002
      Jun  9 10:45:57 eagle-8.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST0002 complete
      

      And yet in other cases, some OSTs being shut down get hung up on timeouts, seemingly while trying to talk to the MDT:

      Jun  9 10:45:57 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST000c
      Jun  9 10:45:58 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST000c_UUID: not available for connect from 10.100.4.47@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      Jun  9 10:45:58 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 52 previous similar messages
      Jun  9 10:45:58 eagle-18.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST000c complete
      Jun  9 10:46:00 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST0038
      Jun  9 10:46:00 eagle-18.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST0038 complete
      Jun  9 10:46:29 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1585:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433871982/real 1433871982]  req@ffff8802d3ce1800 x1497165981090008/t0(0) o400->testfs-MDT0000-lwp-OST0022@10.100.4.2@tcp:12/10 lens 224/224 e 0 to 1 dl 1433871989 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Jun  9 10:46:29 eagle-18.eagle.hpdd.intel.com kernel: Lustre: testfs-MDT0000-lwp-OST004d: Connection to testfs-MDT0000 (at 10.100.4.2@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Jun  9 10:46:29 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1585:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      Jun  9 10:47:14 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST000c_UUID: not available for connect from 10.100.4.54@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      Jun  9 10:47:14 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 35 previous similar messages
      Jun  9 10:47:55 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433872064/real 1433872064]  req@ffff880101e93800 x1497165981090052/t0(0) o38->testfs-MDT0000-lwp-OST0022@10.100.4.1@tcp:12/10 lens 400/544 e 0 to 1 dl 1433872075 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Jun  9 10:47:55 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
      Jun  9 10:49:44 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST0023_UUID: not available for connect from 10.100.4.54@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      Jun  9 10:49:44 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 77 previous similar messages
      Jun  9 10:51:05 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433872239/real 1433872239]  req@ffff88028aab4800 x1497165981090128/t0(0) o38->testfs-MDT0000-lwp-OST0022@10.100.4.1@tcp:12/10 lens 400/544 e 0 to 1 dl 1433872265 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Jun  9 10:51:05 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
      Jun  9 10:54:47 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST004e_UUID: not available for connect from 10.100.4.33@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      Jun  9 10:54:47 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST000c_UUID: not available for connect from 10.100.4.33@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      Jun  9 10:54:47 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 167 previous similar messages
      Jun  9 10:56:20 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433872539/real 1433872539]  req@ffff8803250dc800 x1497165981090224/t0(0) o38->testfs-MDT0000-lwp-OST0022@10.100.4.1@tcp:12/10 lens 400/544 e 0 to 1 dl 1433872580 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Jun  9 10:56:20 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
      Jun  9 11:01:01 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST0022
      Jun  9 11:01:01 eagle-18.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST004d complete
      Jun  9 11:01:01 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Skipped 1 previous similar message
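
      For context, the shutdown in question was driven in parallel across the server nodes, roughly along the lines of the sketch below (the pdsh group names are hypothetical, not the actual commands used):

      # Unmount every Lustre target on the MDS and OSS nodes concurrently;
      # "-a -t lustre" unmounts all mounted Lustre targets on each node.
      pdsh -g mds,oss 'umount -a -t lustre'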
      

      Apparently (if my log reading is not too rusty) the OSTs that hung up while being stopped were timing out trying to communicate with the MDT, presumably because the MDT beat those OSTs to the stopped state. Is my analysis here accurate? If so, a couple of questions:

      What is this connection from the OST to the MDT being used for?

      Is this a connection that the OST initiates to the MDT or vice versa?
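
      For what it's worth, the connection in the timeout messages shows up as a local "lwp" device on the OSS (testfs-MDT0000-lwp-OST0022 above, which I take to be a light-weight proxy connection from the OST to the MDT). Assuming standard lctl behaviour, something like the following should list those devices on a running OSS:

      # Show the local Lustre device stack and pick out the lwp devices,
      # i.e. the OST-to-MDT connections seen in the timeout messages above.
      lctl dl | grep lwp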

      I had always understood that the ideal order for shutting down Lustre was to stop the MDT(s) first and then the OST(s), so as not to leave the MDT up and handing out references to OSTs that are no longer able to service requests. If that understanding is correct, how does it square with the timeouts seen while shutting down an OST after the MDT is already down?
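
      (For clarity, the order I had understood to be ideal is the fully serialized one sketched below; hostnames are hypothetical:)

      # Stop the MDT(s) first...
      pdsh -w mds01 'umount -a -t lustre'
      # ...then stop the OSTs once the MDT is down.
      pdsh -w 'oss[01-18]' 'umount -a -t lustre'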


          People

            Assignee: WC Triage (wc-triage)
            Reporter: Brian Murrell (brian, Inactive)
