[LU-6702] shutting down OSTs in parallel with MDT(s) Created: 09/Jun/15 Updated: 12/Jun/15 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Major |
| Reporter: | Brian Murrell (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Description |
|
When shutting down OSTs and MDTs in parallel, we see some OSTs that shut down quite quickly:

Jun 9 10:45:56 eagle-8.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST002e
Jun 9 10:45:56 eagle-8.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST002e complete
Jun 9 10:45:57 eagle-8.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST0002
Jun 9 10:45:57 eagle-8.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST0002 complete

And yet in other cases, some OSTs hang on timeouts during shutdown, seemingly on requests to the MDT:

Jun 9 10:45:57 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST000c
Jun 9 10:45:58 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST000c_UUID: not available for connect from 10.100.4.47@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 9 10:45:58 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 52 previous similar messages
Jun 9 10:45:58 eagle-18.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST000c complete
Jun 9 10:46:00 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST0038
Jun 9 10:46:00 eagle-18.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST0038 complete
Jun 9 10:46:29 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1585:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433871982/real 1433871982] req@ffff8802d3ce1800 x1497165981090008/t0(0) o400->testfs-MDT0000-lwp-OST0022@10.100.4.2@tcp:12/10 lens 224/224 e 0 to 1 dl 1433871989 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 9 10:46:29 eagle-18.eagle.hpdd.intel.com kernel: Lustre: testfs-MDT0000-lwp-OST004d: Connection to testfs-MDT0000 (at 10.100.4.2@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jun 9 10:46:29 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1585:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jun 9 10:47:14 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST000c_UUID: not available for connect from 10.100.4.54@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 9 10:47:14 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 35 previous similar messages
Jun 9 10:47:55 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433872064/real 1433872064] req@ffff880101e93800 x1497165981090052/t0(0) o38->testfs-MDT0000-lwp-OST0022@10.100.4.1@tcp:12/10 lens 400/544 e 0 to 1 dl 1433872075 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 9 10:47:55 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Jun 9 10:49:44 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST0023_UUID: not available for connect from 10.100.4.54@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 9 10:49:44 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 77 previous similar messages
Jun 9 10:51:05 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433872239/real 1433872239] req@ffff88028aab4800 x1497165981090128/t0(0) o38->testfs-MDT0000-lwp-OST0022@10.100.4.1@tcp:12/10 lens 400/544 e 0 to 1 dl 1433872265 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 9 10:51:05 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Jun 9 10:54:47 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST004e_UUID: not available for connect from 10.100.4.33@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 9 10:54:47 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST000c_UUID: not available for connect from 10.100.4.33@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 9 10:54:47 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 167 previous similar messages
Jun 9 10:56:20 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433872539/real 1433872539] req@ffff8803250dc800 x1497165981090224/t0(0) o38->testfs-MDT0000-lwp-OST0022@10.100.4.1@tcp:12/10 lens 400/544 e 0 to 1 dl 1433872580 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 9 10:56:20 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Jun 9 11:01:01 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST0022
Jun 9 11:01:01 eagle-18.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST004d complete
Jun 9 11:01:01 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Skipped 1 previous similar message

Apparently (if my log reading is not too rusty) the OSTs that hung while being stopped got timeouts trying to communicate with the MDT, presumably because the MDT beat those OSTs to the stopped state. Is my analysis here accurate? If so, a couple of questions:

1. What is this connection from the OST to the MDT being used for?
2. Is this a connection that the OST initiates to the MDT, or vice versa?

I had always understood that the ideal order for shutting down Lustre was to shut down the MDT(s) first and then the OST(s), so as not to leave the MDT up and running providing references to OSTs that are no longer up and able to service requests. If that understanding is correct, how does it square with the timeouts seen while trying to shut down an OST after the MDT is already down?
|
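If the analysis above is correct (OST umount stalls once the MDT is gone), one way to sidestep the timeouts is to serialize the shutdown so that every OST is unmounted while the MDT is still up. A minimal shell sketch follows; the MDS hostname and the mount points (eagle-mds, /mnt/ost*, /mnt/mdt) are illustrative placeholders, not taken from this cluster:

    #!/bin/sh
    # Hypothetical serialized shutdown: unmount every OST while the MDT
    # is still running, so each OST can tear down its connection to the
    # MDT cleanly, then stop the MDT last.
    for oss in eagle-8 eagle-18; do
        ssh "$oss" 'for m in /mnt/ost*; do umount "$m"; done'
    done
    # Only after all OST umounts have returned do we stop the MDT.
    ssh eagle-mds 'umount /mnt/mdt'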
| Comments |
| Comment by Andreas Dilger [ 10/Jun/15 ] |
|
Brian,
The OSS->MDS connection is needed for quota and FLDB service, and is separate from the MDS->OSS connection. |
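As context for that answer, this separate connection is visible from the OSS side as lwp ("light weight proxy") devices in the Lustre device list; a quick sketch, where the device name follows the testfs-MDT0000-lwp-OST0022 pattern seen in the logs above:

    # On an OSS, list the configured Lustre devices and pick out the
    # lwp devices, i.e. the per-OST connections back to the MDT:
    lctl dl | grep lwp
    # Illustrative output: one device per local OST, named along the
    # lines of testfs-MDT0000-lwp-OST0022, carrying the OSS->MDS
    # traffic (quota, FLDB) described in the comment above.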