Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- None
- Affects Version: Lustre 2.5.2
- None
- Environment: RHEL6 servers, RHEL6 clients; servers connected to both IB and ethernet, clients connected either to both IB and ethernet or to ethernet only
- Severity: 3
- 15574
Description
After our active MDS became completely unresponsive earlier, we attempted to fail over to the second MDS. This appeared to succeed: the MGS and MDT mounted successfully, as far as we can tell all clients reconnected, and recovery completed. However, at this stage any operation on the file system (for example ls) on any client connected only via ethernet either hung or returned I/O errors, while all clients using IB were operating normally.
We then discovered that there seemed to be a problem between the MDT and all OSTs, as lctl get_param lod.lustre03-MDT0000-mdtlov.target_obd came back empty. Failing back to the (now rebooted) previous MDS worked and the file system is now operating normally again.
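For reference, this is the check that came back empty; the expected healthy output shown in the comments is from memory, so the exact format may differ slightly:

# Run on the active MDS: list the OSTs visible to the MDT's LOD/LOV layer.
# On a healthy MDS this should print one line per OST, roughly:
#   0: lustre03-OST0000_UUID ACTIVE
#   1: lustre03-OST0001_UUID ACTIVE
# After the failover it printed nothing at all.
lctl get_param lod.lustre03-MDT0000-mdtlov.target_obd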
Sample errors in syslog on one of the ethernet-only clients while ls /mnt/lustre03 was returning I/O errors:
Sep 4 09:56:18 cs04r-sc-serv-06 kernel: Lustre: MGC172.23.144.1@tcp: Connection restored to MGS (at 172.23.144.2@tcp)
Sep 4 09:57:58 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 09:58:23 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 09:58:48 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 09:59:13 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 09:59:38 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:00:03 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:00:28 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:01:18 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:01:18 cs04r-sc-serv-06 kernel: LustreError: Skipped 1 previous similar message
Sep 4 10:02:33 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:02:33 cs04r-sc-serv-06 kernel: LustreError: Skipped 2 previous similar messages
Sep 4 10:05:03 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:05:03 cs04r-sc-serv-06 kernel: LustreError: Skipped 5 previous similar messages
Sep 4 10:09:38 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:09:38 cs04r-sc-serv-06 kernel: LustreError: Skipped 10 previous similar messages
Sep 4 10:33:15 cs04r-sc-serv-06 kernel: LustreError: 32662:0:(dir.c:422:ll_get_dir_page()) read cache page: [0xe900001:0x3b1189d1:0x0] at 0: rc -4
Sep 4 10:33:15 cs04r-sc-serv-06 kernel: LustreError: 32662:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -4
Sep 4 10:34:00 cs04r-sc-serv-06 kernel: LustreError: 32717:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:34:00 cs04r-sc-serv-06 kernel: LustreError: 32717:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:44 cs04r-sc-serv-06 kernel: LustreError: 487:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -4
Sep 4 10:37:44 cs04r-sc-serv-06 kernel: LustreError: 487:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 879 previous similar messages
Sep 4 10:37:57 cs04r-sc-serv-06 kernel: LustreError: 508:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:57 cs04r-sc-serv-06 kernel: LustreError: 508:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:58 cs04r-sc-serv-06 kernel: LustreError: 510:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:59 cs04r-sc-serv-06 kernel: LustreError: 512:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:59 cs04r-sc-serv-06 kernel: LustreError: 512:0:(dir.c:584:ll_dir_read()) Skipped 1 previous similar message
Sep 4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:398:ll_get_dir_page()) Skipped 2 previous similar messages
Sep 4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:584:ll_dir_read()) Skipped 1 previous similar message
Sep 4 10:47:19 cs04r-sc-serv-06 kernel: LustreError: 1122:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:47:19 cs04r-sc-serv-06 kernel: LustreError: 1122:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
I'll attach the full MDT syslog as a file, starting with the mount and ending when we unmounted again to fail back to the previous MDS.
Note that IB and LNet over IB were added to this file system recently, following the instructions in the manual on changing server NIDs: unmounting everything, completely unloading the Lustre modules on the servers, running tunefs.lustre --writeconf --erase-param with the new NIDs, and then mounting the MGS, MDT, and OSTs in that order. (Some ethernet-only clients might still have been mounted during this, but the client I used to test while it wasn't working had certainly been unmounted at that point and rebooted a few times since.)
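For completeness, the sequence we followed looked roughly like the sketch below; device paths and the o2ib addresses are placeholders rather than our exact values, and only the tcp NIDs are the ones visible in the logs above:

# On every client, then on the servers (OSTs, MDT, MGT): unmount and unload Lustre.
umount /mnt/lustre03        # on the clients
umount /mnt/mdt /mnt/mgs    # and the OST mount points on their servers
lustre_rmmod

# Rewrite the configuration logs with the new NIDs; this step is repeated for
# the MGT, the MDT and every OST with their respective parameters.
tunefs.lustre --writeconf --erase-param \
    --mgsnode=172.23.144.1@o2ib,172.23.144.1@tcp \
    --mgsnode=172.23.144.2@o2ib,172.23.144.2@tcp \
    /dev/vg_mds03/mdt       # placeholder device path

# Remount in order: MGS first, then MDT, then the OSTs.
mount -t lustre /dev/vg_mds03/mgt /mnt/mgs
mount -t lustre /dev/vg_mds03/mdt /mnt/mdt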
We are currently concerned that this will happen again if we have to do another failover of the MDT, so we want to solve this. Let us know what other information we should provide.
Attachments
Issue Links
- is related to LU-5585: MDS became unresponsive, clients hanging until MDS fail over (Resolved)
I've done a few more tests on this file system while I can (planned maintenance, nearly over now).
I'll try to summarise the results here; hopefully they'll be useful for something (at the very least they'll remind us what has been tested if we come back to this later).
In this file system we have one MGT and one MDT; both share the same disk backend and sit on the same LVM VG as separate LVs. Two MDS servers (cs04r-sc-mds03-01 and cs04r-sc-mds03-02) can access this storage, and both have LNet configured to use tcp and o2ib. The MDT is configured to reach the MGS on either server via two mgsnode parameters, each listing both o2ib and tcp IP addresses, as sketched below.
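To illustrate that layout (a sketch only; interface names, device paths and the o2ib addresses are placeholders, and the printed parameter format is approximate):

# LNet on both fabrics, e.g. in /etc/modprobe.d/lustre.conf on both MDS nodes
options lnet networks="o2ib0(ib0),tcp0(eth0)"

# The stored mgsnode parameters on the MDT: each --mgsnode lists both NIDs of
# one MGS node (comma-separated), and the second --mgsnode is the failover peer.
tunefs.lustre --dryrun /dev/vg_mds03/mdt
#   ... mgsnode=172.23.144.1@o2ib,172.23.144.1@tcp
#       mgsnode=172.23.144.2@o2ib,172.23.144.2@tcp ...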
When the MGT and then the MDT are mounted, in this order, on cs04r-sc-mds03-01, all seems to be well: no messages in syslog about failing to get MGS log params or anything else.
When the MGT and then the MDT are mounted, in this order, on cs04r-sc-mds03-02, we get the messages about failing to get MGS log params, but other than the first time the MDT appears to be working fine.
Mounting the MGT on cs04r-sc-mds03-01 and later mounting the MDT on cs04r-sc-mds03-02 also works fine, with no errors in syslog.
Mounting the MGT on cs04r-sc-mds03-02 and later mounting the MDT on cs04r-sc-mds03-01 generates the messages about failing to get MGS log params on cs04r-sc-mds03-01.
So, it seems the MGS works correctly when the MGT is mounted on cs04r-sc-mds03-01 but not when it is mounted on cs04r-sc-mds03-02.
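For reference, each of the tests above was just a combination of mounts like the following (mount points and LV names are placeholders), followed by watching syslog on the MDS that mounts the MDT:

# Example: the failing combination with the MGT on -02 and the MDT on -01.
# On cs04r-sc-mds03-02:
mount -t lustre /dev/vg_mds03/mgt /mnt/mgs
# On cs04r-sc-mds03-01:
mount -t lustre /dev/vg_mds03/mdt /mnt/mdt
# Watch for the "failure to get MGS log params" style messages:
tail -f /var/log/messages | grep -i lustre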