Lustre / LU-7321

LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Environment: LLNL lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64
      ZFS Backend
      BG/Q clients
    • Severity: 3

    Description

      On one of the file systems that has BlueGene/Q clients, I am seeing these log messages:

      vestai:/g/g0/marc# tail /var/log/lustre.log
      Sep 16 09:25:32 vesta56 kernel: LustreError: 137188:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 4 previous similar messages
      Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.11@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff8807adc3eb40/0xef9b0828fb0cdca lrc: 5/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.11@o2ib500 remote: 0x4ceffd8a2ed5c33a expref: 829 pid: 87731 timeout: 9654300922 lvb_type: 0
      Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 3 previous similar messages
      Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.12@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff88064a4bc880/0xef9b0828fb0dd59 lrc: 6/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.12@o2ib500 remote: 0xdec93cb7eb4eca85 expref: 790 pid: 137212 timeout: 9654349575 lvb_type: 0
      Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 8 previous similar messages
      Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) ### not waiting on destroyed lock (bug 5653) ns: filter-fsv-OST0001_UUID lock: ffff880e14fefd40/0xca78e45d33428446 lrc: 1/0,0 mode: --/PW res: [0x4d27886:0x0:0x0].0 rrc: 6 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x74801080010020 nid: 172.20.17.2@o2ib500 remote: 0xa0f4c9e1f12378c4 expref: 1893 pid: 10293 timeout: 0 lvb_type: 0
      Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) Skipped 2 previous similar messages
      Sep 16 09:30:59 vesta20 kernel: LustreError: 61820:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff88047e3127c0/0x1475582a881a7576 lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.9@o2ib500 remote: 0xa29fd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654614287 lvb_type: 0
      Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff8806b70d6380/0x1475582a881a7f4e lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60001080010020 nid: 172.20.17.9@o2ib500 remote: 0x79abd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654630757 lvb_type: 0
      Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 27 previous similar messages
      vestai:/g/g0/marc# 
      

      There were over 3000 messages printed to the logs that day, not counting the suppressed messages:
      vestai:/g/g0/marc# zgrep -c "returned -5 from glimpse AST" /var/log/lustre.log-20150915.gz
      3232
      vestai:/g/g0/marc#
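
      For completeness, the counts hidden by the "Skipped N previous similar messages" lines can be totalled as well. This is only a sketch: the log path and rotation naming are taken from the zgrep example above, and the awk field handling assumes the syslog format shown in the excerpt.

      # Sum the N in "Skipped N previous similar messages" lines emitted by
      # ldlm_handle_ast_error() in one rotated log (path/name assumed as above)
      zgrep "ldlm_handle_ast_error.*Skipped" /var/log/lustre.log-20150915.gz \
        | awk '{ for (i = 1; i <= NF; i++) if ($i == "Skipped") sum += $(i+1) } END { print sum+0 }'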

      These messages only show up with NIDs related to BlueGene/Q clients, and I don't see them on file systems that do not have BlueGene/Q clients. I don't see anything in the console or syslog logs on the clients.

      On the BG/Q clients we are running:
      vulcanio14-ib0@root: cat /proc/fs/lustre/version
      lustre: 2.5.4
      kernel: patchless_client
      build: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.8.2.bgq.3blueos.V1R2M3.bl2.2_1.ppc64

      We saw spurious messages years ago, but after running a Splunk query of our logs it looks like the rate has been ramping up since late 2014.
      Number of messages per month (not counting the "x similar messages" lines):
      10/14: 172
      11/14: 8,043
      12/14: 15,002
      1/15: 17,191
      2/15: 36,258
      3/15: 53,145
      4/15: 69,215
      5/15: 96,518
      6/15: 54,504
      7/15: 91,304
      8/15: 144,691
      9/15: 236,491
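
      For reference, roughly the same per-month tally could be produced without Splunk, assuming the rotated logs keep the date in the filename as in the zgrep example above (lustre.log-YYYYMMDD.gz); the glob and sed pattern below are illustrative only.

      # Count "returned -5 from glimpse AST" hits per month across rotated logs
      for f in /var/log/lustre.log-*.gz; do
          month=$(basename "$f" | sed 's/.*-\([0-9]\{6\}\)[0-9]\{2\}\.gz/\1/')   # -> YYYYMM
          printf '%s %s\n' "$month" "$(zgrep -c "returned -5 from glimpse AST" "$f")"
      done | awk '{ sum[$1] += $2 } END { for (m in sum) print m, sum[m] }' | sort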

      Attachments

        1. debug_log.vesta29.LU-7321.gz
          2.45 MB
        2. debug_log.vesta-mds1.LU-7321.gz
          177 kB
        3. debug_log.vulcanio17
          557 kB
        4. dump-client.log
          186 kB
        5. dump-oss.log.gz
          12.38 MB


          Activity

            [LU-7321] LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST
            simmonsja James A Simmons added a comment - - edited

            Okay, I was wrong. I do still get this error. It's just that before I saw it on both the clients and the servers; now I only see it on the server side. Do you have any jobs fail due to this?

            morrone Christopher Morrone (Inactive) added a comment -

            Yes, it is Lustre 2.5.5 + many patches. I believe it is Mellanox. We don't use DLC, yaml, or change the peer credits.
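
            As an aside, the "defaults" claim above could be verified on a node with something like the following; the parameter paths are the usual module-parameter locations for the o2ib LND and may vary between Lustre/OFED builds.

            # Inspect the o2ib LND credit settings and the configured NIDs
            cat /sys/module/ko2iblnd/parameters/peer_credits
            cat /sys/module/ko2iblnd/parameters/concurrent_sends
            lctl list_nids
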
            simmonsja James A Simmons added a comment - - edited

            I fixed up my LNet DLC YAML configuration file and applied LU-8507 so it configures properly with my pre-2.9 version, and the problem went away. The issue is a symptom of a less-than-ideal LNet configuration. Are you still on Lustre 2.5? Is this with the Mellanox stack? Also, are you connecting to x86 server nodes?

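
            For readers unfamiliar with DLC, the round trip being described is roughly the one sketched below; lnetctl is the DLC front end, the o2ib/ib0 names are placeholders, and exact behaviour on a pre-2.9 tree depends on the LU-8507 backport.

            lnetctl lnet configure                  # initialise LNet
            lnetctl net add --net o2ib --if ib0     # add the o2ib network on ib0
            lnetctl export > /etc/lnet.conf         # dump the running config as YAML
            lnetctl import < /etc/lnet.conf         # re-apply the (edited) YAML later
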
            simmonsja James A Simmons added a comment - - edited

            The most recent version of Lustre (2.8.56) also displays this problem with PPC8. I can easily reproduce it with an IOR single-shared-file (SSF) run on a single node.

            Do you need debugging info from both the clients and the server?

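
            For the record, an IOR SSF reproducer of the sort mentioned above would look roughly like this; the rank count, sizes, and Lustre path are made up, and the key point is the absence of -F so all ranks share one file.

            # Single-node, single-shared-file IOR run (hypothetical parameters)
            mpirun -np 16 ior -a POSIX -w -r -e -b 1g -t 1m \
                -o /lustre/fsv/iortest/ssf_file     # no -F => single shared file
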
            pjones Peter Jones added a comment -

            Sorry - wrong ticket

            pjones Peter Jones added a comment -

            Hongchao

            Could you please assist with this issue?

            Peter


            utopiabound Nathaniel Clark added a comment -

            We have extended PPC testing in autotest (back in Feb 2016), but there have been no failures since then.
            green Oleg Drokin added a comment -

            Debug logs at these levels from a server with a history of hitting the error message in question should be enough, since guessing which client will hit it next is hard, if possible at all.

            I assume you might be wary of the added performance impact of these increased debug levels, so logs from only one server should be enough, even though I personally think rpctrace and ha are fairly low-key levels without too much chatter.


            utopiabound Nathaniel Clark added a comment -

            Can you try to get some debug logs with the following levels enabled: rpctrace ha

            This will help nail down whether the error is coming back from the client or is due to a connectivity error.
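
            One way those levels might be enabled and captured on a single OSS is sketched below, using the standard lctl debug interface; the buffer size and output path are arbitrary choices.

            lctl set_param debug_mb=512          # enlarge the kernel debug buffer
            lctl set_param debug=+rpctrace
            lctl set_param debug=+ha
            lctl clear                           # empty the buffer before reproducing
            # ... wait for a "returned -5 from glimpse AST" console message ...
            lctl dk /tmp/debug_log.$(hostname)   # dump (and clear) the debug buffer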

            gerrit Gerrit Updater added a comment -

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/17672
            Subject: LU-7321 test: FOR TEST ONLY
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5d8700e425773b3738844b477b2fe74d6de85b08
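
            If anyone wants to try the test patch locally, it can be fetched with the usual Gerrit ref convention (refs/changes/<last two digits>/<change>/<patchset>); the anonymous-HTTP remote URL below is an assumption based on the review link above.

            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/72/17672/1
            git cherry-pick FETCH_HEAD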

            utopiabound Nathaniel Clark added a comment -

            Let me see if I can pull in someone more familiar with this code who might have an idea of where the issue lies.

            People

              Assignee:
              green Oleg Drokin
              Reporter:
              marc@llnl.gov D. Marc Stearman (Inactive)
              Votes:
              0
              Watchers:
              12
