Lustre / LU-7321

LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Environment: LLNL lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64, ZFS backend, BG/Q clients
    • Severity: 3

    Description

      On the file systems that have BlueGene/Q clients, I am seeing these log messages:

      vestai:/g/g0/marc# tail /var/log/lustre.log
      Sep 16 09:25:32 vesta56 kernel: LustreError: 137188:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 4 previous similar messages
      Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.11@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff8807adc3eb40/0xef9b0828fb0cdca lrc: 5/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.11@o2ib500 remote: 0x4ceffd8a2ed5c33a expref: 829 pid: 87731 timeout: 9654300922 lvb_type: 0
      Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 3 previous similar messages
      Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.12@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff88064a4bc880/0xef9b0828fb0dd59 lrc: 6/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.12@o2ib500 remote: 0xdec93cb7eb4eca85 expref: 790 pid: 137212 timeout: 9654349575 lvb_type: 0
      Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 8 previous similar messages
      Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) ### not waiting on destroyed lock (bug 5653) ns: filter-fsv-OST0001_UUID lock: ffff880e14fefd40/0xca78e45d33428446 lrc: 1/0,0 mode: --/PW res: [0x4d27886:0x0:0x0].0 rrc: 6 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x74801080010020 nid: 172.20.17.2@o2ib500 remote: 0xa0f4c9e1f12378c4 expref: 1893 pid: 10293 timeout: 0 lvb_type: 0
      Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) Skipped 2 previous similar messages
      Sep 16 09:30:59 vesta20 kernel: LustreError: 61820:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff88047e3127c0/0x1475582a881a7576 lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.9@o2ib500 remote: 0xa29fd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654614287 lvb_type: 0
      Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff8806b70d6380/0x1475582a881a7f4e lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60001080010020 nid: 172.20.17.9@o2ib500 remote: 0x79abd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654630757 lvb_type: 0
      Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 27 previous similar messages
      vestai:/g/g0/marc# 
      

      There were over 3000 messages printed to the logs that day, not counting the suppressed messages:
      vestai:/g/g0/marc# zgrep -c "returned -5 from glimpse AST" /var/log/lustre.log-20150915.gz
      3232
      vestai:/g/g0/marc#

      These messages only show up with NIDs related to BlueGene/Q clients, and I don't see them on file systems that do not have BlueGene/Q clients. I don't see anything in the console or syslog logs on the clients.
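
      For reference, a quick tally of the offending NIDs backs up that observation. The command below is only a sketch against the rotated log shown earlier (the sed expression just pulls the "nid ...@o2ib500" field out of the message text); it is not something taken from the production systems:
      zgrep -h "returned -5 from glimpse AST" /var/log/lustre.log-20150915.gz \
          | sed -n 's/.*client (nid \([^)]*\)).*/\1/p' \
          | sort | uniq -c | sort -rn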

      On the BG/Q clients we are running:
      vulcanio14-ib0@root: cat /proc/fs/lustre/version
      lustre: 2.5.4
      kernel: patchless_client
      build: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.8.2.bgq.3blueos.V1R2M3.bl2.2_1.ppc64

      We saw spurious messages like these years ago, but a Splunk query of our logs shows the rate ramping up since late 2014 (see the sketch after the list).
      Number of messages per month (not counting the "Skipped x previous similar messages" lines):
      10/14: 172
      11/14: 8,043
      12/14: 15,002
      1/15: 17,191
      2/15: 36,258
      3/15: 53,145
      4/15: 69,215
      5/15: 96,518
      6/15: 54,504
      7/15: 91,304
      8/15: 144,691
      9/15: 236,491
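
      A rough shell equivalent of that Splunk query could look like the sketch below. The rotated-log naming (lustre.log-YYYYMMDD.gz) is taken from the zgrep example above; everything else is an assumption rather than the query that was actually run:
      for f in /var/log/lustre.log-*.gz; do
          month=$(echo "$f" | sed 's/.*lustre\.log-\([0-9]\{6\}\).*/\1/')   # YYYYMM from the file name
          echo "$month $(zgrep -c 'returned -5 from glimpse AST' "$f")"
      done | awk '{ sum[$1] += $2 } END { for (m in sum) print m, sum[m] }' | sort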

      Attachments

        1. debug_log.vesta29.LU-7321.gz (2.45 MB)
        2. debug_log.vesta-mds1.LU-7321.gz (177 kB)
        3. debug_log.vulcanio17 (557 kB)
        4. dump-client.log (186 kB)
        5. dump-oss.log.gz (12.38 MB)


          Activity


            simmonsja James A Simmons added a comment -

            Do people still see this? I haven't done large-scale testing yet, but I was wondering whether it is still observed.

            simmonsja James A Simmons added a comment -

            So this is causing my applications to fail. I have no idea how to fix that. Any ideas?

            utopiabound Nathaniel Clark added a comment -

            From the OSS dump:

            00010000:00010000:19.0:1472585788.794238:0:19880:0:(ldlm_lock.c:662:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-sultan-OST001b_UUID lock: ffff8808c87bb300/0xca29cea60124a3ed lrc: 2/0,0 mode: PW/PW res: [0x1931e3:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 2738880512->2739929087) flags: 0x40000000000000 nid: 10.37.202.11@o2ib1 remote: 0x48e7a088193305c expref: 6 pid: 11322 timeout: 0 lvb_type: 0
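
            One way to pull the related ldlm traffic for that resource out of the attached dump is a simple filter like the one below; the resource id comes from the excerpt above and the file name from the attachment list, but this is only a sketch, not an analysis that was actually performed:

            zgrep "0x1931e3:0x0:0x0" dump-oss.log.gz | grep -E "blocking AST|glimpse|ldlm_handle_ast_error"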

            simmonsja James A Simmons added a comment -

            Here is an lctl dump from both my client and an OSS server.
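
            The ticket does not record how those dumps were captured; a minimal sketch of the usual procedure, assuming the dlmtrace debug mask is the one of interest, is:

            lctl set_param debug=+dlmtrace      # enable DLM tracing on the node of interest
            # ... reproduce the "returned -5 from glimpse AST" messages ...
            lctl dk /tmp/dump-oss.log           # flush the kernel debug buffer to a file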
            simmonsja James A Simmons added a comment - edited

            Okay, I was wrong; I do still get this error. Before, I saw it on both the clients and the servers; now I only see it on the server side. Do you have any jobs fail because of this?

            morrone Christopher Morrone (Inactive) added a comment -

            Yes, it is Lustre 2.5.5 + many patches. I believe it is Mellanox. We don't use DLC or yaml, and we don't change the peer credits.
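
            For reference, one quick way to confirm that the o2iblnd defaults are still in effect is to read the module parameters directly. The parameter names below are the standard ko2iblnd ones; treat this as a sketch rather than output from the ticket:

            cat /sys/module/ko2iblnd/parameters/peer_credits
            cat /sys/module/ko2iblnd/parameters/concurrent_sends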
            simmonsja James A Simmons added a comment - edited

            I fixed up my LNet DLC yaml configuration file and applied LU-8507 so that it configures properly with my pre-2.9 version, and the problem went away. The issue is a symptom of a less-than-ideal LNet configuration. Are you still on Lustre 2.5? Is this with the Mellanox stack? Also, are you connecting to x86 server nodes?
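
            A minimal sketch of applying an LNet configuration through the DLC tools is shown below; the network name, interface, and tunable values are placeholders, not the configuration that was actually used:

            # contents of /tmp/lnet.conf (placeholder values):
            #   net:
            #       - net type: o2ib1
            #         local NI(s):
            #           - interfaces:
            #                 0: ib0
            #             tunables:
            #                 peer_credits: 32
            #                 credits: 256
            lnetctl lnet configure            # start LNet via the DLC interface
            lnetctl import /tmp/lnet.conf     # apply the yaml configuration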
            simmonsja James A Simmons added a comment - edited

            The most recent version of Lustre (2.8.56) also displays this problem with ppc8. I can easily reproduce it with an IOR ssf (single-shared-file) run on a single node.

            Do you need debugging info from both the clients and the server?
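
            A sketch of that kind of single-node, single-shared-file IOR run is below; the launcher, rank count, transfer/block sizes, and target path are placeholders rather than the actual reproducer:

            srun -N 1 -n 16 ior -a POSIX -w -r -t 1m -b 1g -o /lustre/sultan/testfile
            # no -F (file-per-process) flag, so all ranks write to a single shared file (ssf)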
            pjones Peter Jones added a comment -

            Sorry - wrong ticket
            pjones Peter Jones added a comment -

            Hongchao

            Could you please assist with this issue?

            Peter

            People

              Assignee: green Oleg Drokin
              Reporter: marc@llnl.gov D. Marc Stearman (Inactive)
              Votes: 0
              Watchers: 12
