Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2965

MDS evicted OST after attempting recovery.

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.4.0
    • 3
    • 7230

    Attachments

      1. c4-0c0s5n0.log
        33 kB
      2. client.log
        4 kB
      3. mds-test-shot.log
        8 kB
      4. mds-test-shot-whole.log
        1.27 MB
      5. oss-test-shot.log
        8 kB
      6. spare.log
        869 kB

      Activity

        [LU-2965] MDS evicted OST after attempting recovery.
        green Oleg Drokin added a comment -

        I guess the assumption clients stopped pinging is because they were evicted for not pinging

        green Oleg Drokin added a comment - I guess the assumption clients stopped pinging is because they were evicted for not pinging

        10.36.227.198 was our 2.4 non cray client. I attached the log (spare.log) here.

        simmonsja James A Simmons added a comment - 10.36.227.198 was our 2.4 non cray client. I attached the log (spare.log) here.

        James,

        According to the oss-test-shot.log, only one client was evicted by the OSS around 21:30: "61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib)". It seems c4-0c0s5n0 was "1546@gni", not the o2ib client. Is the log on the o2ib client still available?

        Also, how did you infer that some clients had stopped pinging the servers, please?

        liwei Li Wei (Inactive) added a comment - James, According to the oss-test-shot.log, only one client was evicted by the OSS around 21:30: "61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib)". It seems c4-0c0s5n0 was "1546@gni", not the o2ib client. Is the log on the o2ib client still available? Also, how did you infer that some clients had stopped pinging the servers, please?

        Looking at it logs at 175 seconds after every client recovered some stopped pinging the servers.

        simmonsja James A Simmons added a comment - Looking at it logs at 175 seconds after every client recovered some stopped pinging the servers.

        Nope. I didn't test suppress pings at that time. I started the suppress pings test 2 hours later than when all the recover problems happened.

        simmonsja James A Simmons added a comment - Nope. I didn't test suppress pings at that time. I started the suppress pings test 2 hours later than when all the recover problems happened.
        green Oleg Drokin added a comment -

        I just confirmed with James that the actual problem is "even though recovery completed just fine after a server failure, clients were evicted later for some reason".

        Given the timing of 1354 seconds - that's exactly how much passed since recovery - appears the clients were not pinging?
        Were you already playing with suppress pinging at the time, I wonder?

        green Oleg Drokin added a comment - I just confirmed with James that the actual problem is "even though recovery completed just fine after a server failure, clients were evicted later for some reason". Given the timing of 1354 seconds - that's exactly how much passed since recovery - appears the clients were not pinging? Were you already playing with suppress pinging at the time, I wonder?
        bobijam Zhenyu Xu added a comment -

        the OSS has completed recover around 21:06, but 25 minutes later, MDS and OSS all reports that they lost connection with some clients

        oss log

        Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.540784] Lustre: routed1-OST00eb: Recovery over after 7:32, of 8982 clients 8982 recovered and 0 were evicted.
        Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.569906] Lustre: Skipped 5 previous similar messages
        Mar 8 21:31:27 widow-oss8b4 kernel: [ 2428.640322] Lustre: routed1-OST00eb: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1354 seconds. I think it's dead, and I am evicting it. exp ffff880397414000, cur 1362796287 expire 1362795387 last 1362794933

        and all logs provided do not show details during the 25 minutes, could something happened about the clients' network?

        bobijam Zhenyu Xu added a comment - the OSS has completed recover around 21:06, but 25 minutes later, MDS and OSS all reports that they lost connection with some clients oss log Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.540784] Lustre: routed1-OST00eb: Recovery over after 7:32, of 8982 clients 8982 recovered and 0 were evicted. Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.569906] Lustre: Skipped 5 previous similar messages Mar 8 21:31:27 widow-oss8b4 kernel: [ 2428.640322] Lustre: routed1-OST00eb: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1354 seconds. I think it's dead, and I am evicting it. exp ffff880397414000, cur 1362796287 expire 1362795387 last 1362794933 and all logs provided do not show details during the 25 minutes, could something happened about the clients' network?

        We saw the problems around 21:07 for the client evicts. The MDS had these errors which lead to the evictions.

        Mar 8 21:12:54 widow-mds1 kernel: [27048.704079] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 3167@gni ns: mdt-ffff880e55b20000 lock: ffff880c3f748600/0x4453d516b35bd568 lrc: 3/0,0 mode: CR/CR res: 8589939214/48084 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 3167@gni remote: 0x1f865dc509f72511 expref: 11 pid: 28552 timeout: 4321715043 lvb_type: 0

        To make clear what happened I have attached the MDS logs for the entire test shot.

        simmonsja James A Simmons added a comment - We saw the problems around 21:07 for the client evicts. The MDS had these errors which lead to the evictions. Mar 8 21:12:54 widow-mds1 kernel: [27048.704079] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 3167@gni ns: mdt-ffff880e55b20000 lock: ffff880c3f748600/0x4453d516b35bd568 lrc: 3/0,0 mode: CR/CR res: 8589939214/48084 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 3167@gni remote: 0x1f865dc509f72511 expref: 11 pid: 28552 timeout: 4321715043 lvb_type: 0 To make clear what happened I have attached the MDS logs for the entire test shot.
        bobijam Zhenyu Xu added a comment -

        Can you specify the timestamp of the troubled recovery process, from the c4-0c0s5n0.log

        [2013-03-08 21:03:28][c4-0c0s5n0]Lustre: 7635:0:(client.c:1866:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
        [2013-03-08 21:06:04][c4-0c0s5n0]Lustre: routed1-OST00eb-osc-ffff88081e7e4000: Connection restored to routed1-OST00eb (at 10.36.227.92@o2ib)
        [2013-03-08 21:06:04][c4-0c0s5n0]Lustre: routed1-OST008b-osc-ffff88081e7e4000: Connection restored to routed1-OST008b (at 10.36.227.92@o2ib)
        [2013-03-08 21:28:08][c4-0c0s5n0]LustreError: 11-0: routed1-MDT0000-mdc-ffff88081e7e4000: Communicating with 10.36.227.211@o2ib, operation obd_ping failed with -107.
        [2013-03-08 21:28:08][c4-0c0s5n0]Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection to routed1-MDT0000 (at 10.36.227.211@o2ib) was lost; in progress operations using this service will wait for recovery to complete
        [2013-03-08 21:28:08][c4-0c0s5n0]LustreError: 167-0: routed1-MDT0000-mdc-ffff88081e7e4000: This client was evicted by routed1-MDT0000; in progress operations using this service will fail.
        [2013-03-08 21:28:08][c4-0c0s5n0]Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection restored to routed1-MDT0000 (at 10.36.227.211@o2ib)

        I only see that the client restored connection to OST00[eb|8b] at 2013-03-08 21:06:04, and 24 minutes later lost connection with MDS, and restored to it again soon. Then at 22:32 it is umounted (I think by manually).

        bobijam Zhenyu Xu added a comment - Can you specify the timestamp of the troubled recovery process, from the c4-0c0s5n0.log [2013-03-08 21:03:28] [c4-0c0s5n0] Lustre: 7635:0:(client.c:1866:ptlrpc_expire_one_request()) Skipped 5 previous similar messages [2013-03-08 21:06:04] [c4-0c0s5n0] Lustre: routed1-OST00eb-osc-ffff88081e7e4000: Connection restored to routed1-OST00eb (at 10.36.227.92@o2ib) [2013-03-08 21:06:04] [c4-0c0s5n0] Lustre: routed1-OST008b-osc-ffff88081e7e4000: Connection restored to routed1-OST008b (at 10.36.227.92@o2ib) [2013-03-08 21:28:08] [c4-0c0s5n0] LustreError: 11-0: routed1-MDT0000-mdc-ffff88081e7e4000: Communicating with 10.36.227.211@o2ib, operation obd_ping failed with -107. [2013-03-08 21:28:08] [c4-0c0s5n0] Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection to routed1-MDT0000 (at 10.36.227.211@o2ib) was lost; in progress operations using this service will wait for recovery to complete [2013-03-08 21:28:08] [c4-0c0s5n0] LustreError: 167-0: routed1-MDT0000-mdc-ffff88081e7e4000: This client was evicted by routed1-MDT0000; in progress operations using this service will fail. [2013-03-08 21:28:08] [c4-0c0s5n0] Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection restored to routed1-MDT0000 (at 10.36.227.211@o2ib) I only see that the client restored connection to OST00 [eb|8b] at 2013-03-08 21:06:04, and 24 minutes later lost connection with MDS, and restored to it again soon. Then at 22:32 it is umounted (I think by manually).

        I attached a log for one of the nodes that was evicted. Due to these problems recovery took the full 30 minutes instead of finishing quickly.

        simmonsja James A Simmons added a comment - I attached a log for one of the nodes that was evicted. Due to these problems recovery took the full 30 minutes instead of finishing quickly.
        bobijam Zhenyu Xu added a comment -

        Can specify a client example for me?

        bobijam Zhenyu Xu added a comment - Can specify a client example for me?

        People

          bobijam Zhenyu Xu
          simmonsja James A Simmons
          Votes:
          0 Vote for this issue
          Watchers:
          9 Start watching this issue

          Dates

            Created:
            Updated:
            Resolved: