Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- None
- Affects Version: Lustre 2.5.2
- None
- Environment: RHEL6 servers, RHEL6 clients; servers connected to both IB and ethernet, clients connected either to both IB and ethernet or to ethernet only
- Severity: 3
- 15574
Description
After our active MDS became completely unresponsive earlier, we attempted to fail over to the second MDS. This appeared to succeed: the MGS and MDT mounted successfully, as far as we can tell all clients reconnected, and recovery completed. However, at this stage any operation on the file system (for example ls) on any client connected only via ethernet either hung or returned I/O errors, while all clients using IB were operating normally.
We then discovered that there seemed to be a problem between the MDT and all OSTs, as lctl get_param lod.lustre03-MDT0000-mdtlov.target_obd came back empty. Failing back to the (now rebooted) previous MDS worked and the file system is now operating normally again.
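For reference, this is the check that came back empty; the expected healthy output shown in the comments is from memory, so the exact format may differ slightly:

# Run on the active MDS: list the OSTs visible to the MDT's LOD/LOV layer.
# On a healthy MDS this should print one line per OST, roughly:
#   0: lustre03-OST0000_UUID ACTIVE
#   1: lustre03-OST0001_UUID ACTIVE
# After the failover it printed nothing at all.
lctl get_param lod.lustre03-MDT0000-mdtlov.target_obd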
Sample errors in syslog on one of the ethernet-only clients while ls /mnt/lustre03 was returning I/O errors:
Sep 4 09:56:18 cs04r-sc-serv-06 kernel: Lustre: MGC172.23.144.1@tcp: Connection restored to MGS (at 172.23.144.2@tcp)
Sep 4 09:57:58 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 09:58:23 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 09:58:48 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 09:59:13 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 09:59:38 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:00:03 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:00:28 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:01:18 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:01:18 cs04r-sc-serv-06 kernel: LustreError: Skipped 1 previous similar message
Sep 4 10:02:33 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:02:33 cs04r-sc-serv-06 kernel: LustreError: Skipped 2 previous similar messages
Sep 4 10:05:03 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:05:03 cs04r-sc-serv-06 kernel: LustreError: Skipped 5 previous similar messages
Sep 4 10:09:38 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
Sep 4 10:09:38 cs04r-sc-serv-06 kernel: LustreError: Skipped 10 previous similar messages
Sep 4 10:33:15 cs04r-sc-serv-06 kernel: LustreError: 32662:0:(dir.c:422:ll_get_dir_page()) read cache page: [0xe900001:0x3b1189d1:0x0] at 0: rc -4
Sep 4 10:33:15 cs04r-sc-serv-06 kernel: LustreError: 32662:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -4
Sep 4 10:34:00 cs04r-sc-serv-06 kernel: LustreError: 32717:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:34:00 cs04r-sc-serv-06 kernel: LustreError: 32717:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:44 cs04r-sc-serv-06 kernel: LustreError: 487:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -4
Sep 4 10:37:44 cs04r-sc-serv-06 kernel: LustreError: 487:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 879 previous similar messages
Sep 4 10:37:57 cs04r-sc-serv-06 kernel: LustreError: 508:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:57 cs04r-sc-serv-06 kernel: LustreError: 508:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:58 cs04r-sc-serv-06 kernel: LustreError: 510:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:59 cs04r-sc-serv-06 kernel: LustreError: 512:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:37:59 cs04r-sc-serv-06 kernel: LustreError: 512:0:(dir.c:584:ll_dir_read()) Skipped 1 previous similar message
Sep 4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:398:ll_get_dir_page()) Skipped 2 previous similar messages
Sep 4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:584:ll_dir_read()) Skipped 1 previous similar message
Sep 4 10:47:19 cs04r-sc-serv-06 kernel: LustreError: 1122:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
Sep 4 10:47:19 cs04r-sc-serv-06 kernel: LustreError: 1122:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
I'll attach the full MDT syslog as a file, starting with the mount and ending when we unmounted again to fail back to the previous MDS.
Note that IB and LNet over IB were added to this file system recently, following the instructions in the manual on changing server NIDs: unmounting everything, completely unloading the Lustre modules on the servers, running tunefs.lustre --writeconf --erase-param with the new NIDs, and then mounting the MGS, MDT, and OSTs in that order. (Some ethernet-only clients might still have been mounted during this, but the client I used to test while it wasn't working had certainly been unmounted at that point and rebooted a few times since.)
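For completeness, the sequence we followed looked roughly like the sketch below; device paths and the o2ib addresses are placeholders rather than our exact values, and only the tcp NIDs are the ones visible in the logs above:

# On every client, then on the servers (OSTs, MDT, MGT): unmount and unload Lustre.
umount /mnt/lustre03        # on the clients
umount /mnt/mdt /mnt/mgs    # and the OST mount points on their servers
lustre_rmmod

# Rewrite the configuration logs with the new NIDs; this step is repeated for
# the MGT, the MDT and every OST with their respective parameters.
tunefs.lustre --writeconf --erase-param \
    --mgsnode=172.23.144.1@o2ib,172.23.144.1@tcp \
    --mgsnode=172.23.144.2@o2ib,172.23.144.2@tcp \
    /dev/vg_mds03/mdt       # placeholder device path

# Remount in order: MGS first, then MDT, then the OSTs.
mount -t lustre /dev/vg_mds03/mgt /mnt/mgs
mount -t lustre /dev/vg_mds03/mdt /mnt/mdt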
We are currently concerned that this will happen again if we have to do another failover of the MDT, so we want to solve this. Let us know what other information we should provide.
Attachments
Issue Links
- is related to LU-5585: MDS became unresponsive, clients hanging until MDS fail over (Resolved)
I've done a few more tests on this file system while I can (planned maintenance, nearly over now).
I'll try to summarise the results here; hopefully they'll be useful for something (at the very least they'll remind us what has been tested if we come back to this later).
In this file system we have one MGT and one MDT; both share the same disk backend and sit on the same LVM VG as separate LVs. Two MDS servers (cs04r-sc-mds03-01 and cs04r-sc-mds03-02) can access this storage, and both have LNet configured to use tcp and o2ib. The MDT is configured to reach the MGS on either server via two mgsnode parameters, each listing both o2ib and tcp IP addresses, as sketched below.
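To illustrate that layout (a sketch only; interface names, device paths and the o2ib addresses are placeholders, and the printed parameter format is approximate):

# LNet on both fabrics, e.g. in /etc/modprobe.d/lustre.conf on both MDS nodes
options lnet networks="o2ib0(ib0),tcp0(eth0)"

# The stored mgsnode parameters on the MDT: each --mgsnode lists both NIDs of
# one MGS node (comma-separated), and the second --mgsnode is the failover peer.
tunefs.lustre --dryrun /dev/vg_mds03/mdt
#   ... mgsnode=172.23.144.1@o2ib,172.23.144.1@tcp
#       mgsnode=172.23.144.2@o2ib,172.23.144.2@tcp ...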
When the MGT and then the MDT are mounted, in this order, on cs04r-sc-mds03-01, all seems to be well: no messages in syslog about failing to get MGS log params or anything else.
When the MGT and then the MDT are mounted, in this order, on cs04r-sc-mds03-02, we get the messages about failing to get MGS log params, but other than the first time the MDT appears to be working fine.
Mounting the MGT on cs04r-sc-mds03-01 and later mounting the MDT on cs04r-sc-mds03-02 also works fine, with no errors in syslog.
Mounting the MGT on cs04r-sc-mds03-02 and later mounting the MDT on cs04r-sc-mds03-01 generates the messages about failing to get MGS log params on cs04r-sc-mds03-01.
So, it seems the MGS works correctly when the MGT is mounted on cs04r-sc-mds03-01 but not when it is mounted on cs04r-sc-mds03-02.
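For reference, each of the tests above was just a combination of mounts like the following (mount points and LV names are placeholders), followed by watching syslog on the MDS that mounts the MDT:

# Example: the failing combination with the MGT on -02 and the MDT on -01.
# On cs04r-sc-mds03-02:
mount -t lustre /dev/vg_mds03/mgt /mnt/mgs
# On cs04r-sc-mds03-01:
mount -t lustre /dev/vg_mds03/mdt /mnt/mdt
# Watch for the "failure to get MGS log params" style messages:
tail -f /var/log/messages | grep -i lustre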