  1. Lustre
  2. LU-5583

clients receive IO error after MDT failover

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.5.2
    • None
    • RHEL6 server, RHEL6 clients, servers connected to IB and ethernet, clients can be either connected to IB and ethernet or just ethernet
    • 3
    • 15574

    Description

      After our active MDS became completely unresponsive earlier, we attempted to fail over to the second MDS. This appeared to succeed: the MGS and MDT mounted successfully and, as far as we can tell, all clients reconnected and recovery completed. At this stage, however, any operation on the file system (for example ls) on any client connected only via ethernet either hung or returned I/O errors, while all clients using IB were operating normally.

      We then discovered that there seemed to be a problem between the MDT and all OSTs, as lctl get_param lod.lustre03-MDT0000-mdtlov.target_obd came back empty. Failing back to the (now rebooted) previous MDT worked and the file system is now operating normally again.
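
The empty target_obd check described above could be scripted roughly as follows. This is only a sketch: it assumes the usual "index: UUID STATE" line format for target_obd output, and the function names and sample OST UUIDs are hypothetical, not taken from this system.

```python
import re

# Assumed line format of target_obd output: "<index>: <UUID> <STATE>"
TARGET_LINE = re.compile(r"^\s*(\d+):\s+(\S+)\s+(\S+)\s*$")


def parse_target_obd(output):
    """Return (index, uuid, state) tuples for every target line in the output."""
    targets = []
    for line in output.splitlines():
        match = TARGET_LINE.match(line)
        if match:
            targets.append((int(match.group(1)), match.group(2), match.group(3)))
    return targets


def mdt_sees_osts(output):
    """True if at least one target is listed, False for the empty case seen here."""
    return len(parse_target_obd(output)) > 0


# Hypothetical sample of a healthy listing (OST names are illustrative only).
healthy = "0: lustre03-OST0000_UUID ACTIVE\n1: lustre03-OST0001_UUID ACTIVE\n"
print(mdt_sees_osts(healthy))
print(mdt_sees_osts(""))
```

The text passed in would be whatever lctl get_param lod.lustre03-MDT0000-mdtlov.target_obd prints on the MDS; a parameter-name header line, if present, is ignored because it does not match the assumed target format.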

      Sample errors in syslog on one of the ethernet-only clients while ls /mnt/lustre03 was returning I/O errors:

      Sep  4 09:56:18 cs04r-sc-serv-06 kernel: Lustre: MGC172.23.144.1@tcp: Connection restored to MGS (at 172.23.144.2@tcp)
      Sep  4 09:57:58 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 09:58:23 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 09:58:48 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 09:59:13 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 09:59:38 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 10:00:03 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 10:00:28 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 10:01:18 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 10:01:18 cs04r-sc-serv-06 kernel: LustreError: Skipped 1 previous similar message
      Sep  4 10:02:33 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 10:02:33 cs04r-sc-serv-06 kernel: LustreError: Skipped 2 previous similar messages
      Sep  4 10:05:03 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 10:05:03 cs04r-sc-serv-06 kernel: LustreError: Skipped 5 previous similar messages
      Sep  4 10:09:38 cs04r-sc-serv-06 kernel: LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.
      Sep  4 10:09:38 cs04r-sc-serv-06 kernel: LustreError: Skipped 10 previous similar messages
      Sep  4 10:33:15 cs04r-sc-serv-06 kernel: LustreError: 32662:0:(dir.c:422:ll_get_dir_page()) read cache page: [0xe900001:0x3b1189d1:0x0] at 0: rc -4
      Sep  4 10:33:15 cs04r-sc-serv-06 kernel: LustreError: 32662:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -4
      Sep  4 10:34:00 cs04r-sc-serv-06 kernel: LustreError: 32717:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
      Sep  4 10:34:00 cs04r-sc-serv-06 kernel: LustreError: 32717:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
      Sep  4 10:37:44 cs04r-sc-serv-06 kernel: LustreError: 487:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -4
      Sep  4 10:37:44 cs04r-sc-serv-06 kernel: LustreError: 487:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 879 previous similar messages
      Sep  4 10:37:57 cs04r-sc-serv-06 kernel: LustreError: 508:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
      Sep  4 10:37:57 cs04r-sc-serv-06 kernel: LustreError: 508:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
      Sep  4 10:37:58 cs04r-sc-serv-06 kernel: LustreError: 510:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
      Sep  4 10:37:59 cs04r-sc-serv-06 kernel: LustreError: 512:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
      Sep  4 10:37:59 cs04r-sc-serv-06 kernel: LustreError: 512:0:(dir.c:584:ll_dir_read()) Skipped 1 previous similar message
      Sep  4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
      Sep  4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:398:ll_get_dir_page()) Skipped 2 previous similar messages
      Sep  4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
      Sep  4 10:43:34 cs04r-sc-serv-06 kernel: LustreError: 875:0:(dir.c:584:ll_dir_read()) Skipped 1 previous similar message
      Sep  4 10:47:19 cs04r-sc-serv-06 kernel: LustreError: 1122:0:(dir.c:398:ll_get_dir_page()) dir page locate: [0xe900001:0x3b1189d1:0x0] at 0: rc -5
      Sep  4 10:47:19 cs04r-sc-serv-06 kernel: LustreError: 1122:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5
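
For reference, the negative return codes in the excerpt above are Linux errno values: -16 is EBUSY, -4 is EINTR and -5 is EIO. A minimal sketch for tallying them from a syslog excerpt might look like this; the function name and the regular expression are assumptions written for the message shapes shown here, not part of Lustre.

```python
import re
from collections import Counter

# Linux errno values that appear in the excerpt (from errno.h).
ERRNO_NAMES = {4: "EINTR", 5: "EIO", 16: "EBUSY"}

# Matches trailing "failed with -16.", "rc -5" and "ldlm_cli_enqueue: -4".
CODE_RE = re.compile(r"(?:\bfailed with|\brc|ldlm_cli_enqueue:)\s+-(\d+)\.?\s*$")


def summarize_error_codes(log_text):
    """Count log lines per errno name; unknown codes are reported numerically."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CODE_RE.search(line)
        if match:
            code = int(match.group(1))
            counts[ERRNO_NAMES.get(code, "errno %d" % code)] += 1
    return counts


# Three message bodies copied from the excerpt, plus one line that carries no code.
sample = "\n".join([
    "LustreError: 11-0: lustre03-MDT0000-mdc-ffff880073fec800: Communicating with 172.23.144.2@tcp, operation mds_connect failed with -16.",
    "LustreError: 32662:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -4",
    "LustreError: 32717:0:(dir.c:584:ll_dir_read()) error reading dir [0xe900001:0x3b1189d1:0x0] at 0: rc -5",
    "LustreError: Skipped 1 previous similar message",
])
print(summarize_error_codes(sample))
```

Note that "Skipped N previous similar messages" lines are not expanded, so the counts are a lower bound on how often each error occurred.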
      

      I'll attach the full MDT syslog as a file, covering the period from the mount until we unmounted again to fail back to the previous MDT.

      Note that IB and LNet over IB were added to this file system recently, following the instructions in the manual on changing server NIDs, including unmounting everything, unloading the Lustre modules on the servers completely, running tunefs.lustre --writeconf --erase-param with the new NIDs etc., and mounting the MGS, MDT and OSTs, in this order. (Some ethernet-only clients might still have been up during this, but the client I used for testing while it wasn't working had certainly been unmounted then and has been rebooted a few times since.)
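
Since only the ethernet-only clients were affected, one sanity check after a NID change like the one above is whether every server lists a NID on each LNet network that clients use. The sketch below assumes NIDs of the form address@network; the helper names are hypothetical, 172.23.144.2@tcp is taken from the log above, and the o2ib address is purely illustrative.

```python
def split_nid(nid):
    """Split an LNet NID such as '172.23.144.2@tcp' into (address, network)."""
    address, sep, network = nid.partition("@")
    if not sep or not address or not network:
        raise ValueError("not a valid NID: %r" % nid)
    return address, network


def missing_networks(server_nids, required_networks):
    """Return the required LNet networks for which the server lists no NID."""
    present = {split_nid(nid)[1] for nid in server_nids}
    return sorted(set(required_networks) - present)


# The tcp NID appears in the log; the o2ib NID is a hypothetical placeholder.
print(missing_networks(["172.23.144.2@tcp", "10.0.0.2@o2ib"], ["tcp", "o2ib"]))
print(missing_networks(["10.0.0.2@o2ib"], ["tcp", "o2ib"]))
```

This only compares strings and says nothing about what the MGS configuration logs actually contain; it is meant as a quick cross-check of the NID lists passed to tunefs.lustre for each server.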

      We are concerned that this will happen again if we have to do another failover of the MDT, so we want to solve this. Let us know what other information we should provide.

            People

              bobijam Zhenyu Xu
              ferner Frederik Ferner (Inactive)