Lustre / LU-7321

LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: Lustre 2.5.3
    • Environment: LLNL lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64
      ZFS Backend
      BG/Q clients
    • Severity: 3

    Description

      On the file systems that have BlueGene/Q clients, I am seeing these log messages:

      vestai:/g/g0/marc# tail /var/log/lustre.log
      Sep 16 09:25:32 vesta56 kernel: LustreError: 137188:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 4 previous similar messages
      Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.11@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff8807adc3eb40/0xef9b0828fb0cdca lrc: 5/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.11@o2ib500 remote: 0x4ceffd8a2ed5c33a expref: 829 pid: 87731 timeout: 9654300922 lvb_type: 0
      Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 3 previous similar messages
      Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.12@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff88064a4bc880/0xef9b0828fb0dd59 lrc: 6/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.12@o2ib500 remote: 0xdec93cb7eb4eca85 expref: 790 pid: 137212 timeout: 9654349575 lvb_type: 0
      Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 8 previous similar messages
      Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) ### not waiting on destroyed lock (bug 5653) ns: filter-fsv-OST0001_UUID lock: ffff880e14fefd40/0xca78e45d33428446 lrc: 1/0,0 mode: --/PW res: [0x4d27886:0x0:0x0].0 rrc: 6 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x74801080010020 nid: 172.20.17.2@o2ib500 remote: 0xa0f4c9e1f12378c4 expref: 1893 pid: 10293 timeout: 0 lvb_type: 0
      Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) Skipped 2 previous similar messages
      Sep 16 09:30:59 vesta20 kernel: LustreError: 61820:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff88047e3127c0/0x1475582a881a7576 lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.9@o2ib500 remote: 0xa29fd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654614287 lvb_type: 0
      Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff8806b70d6380/0x1475582a881a7f4e lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60001080010020 nid: 172.20.17.9@o2ib500 remote: 0x79abd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654630757 lvb_type: 0
      Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 27 previous similar messages
      vestai:/g/g0/marc# 
      

      There were over 3000 messages printed to the logs that day, not counting the suppressed messages:
      vestai:/g/g0/marc# zgrep -c "returned -5 from glimpse AST" /var/log/lustre.log-20150915.gz
      3232
      vestai:/g/g0/marc#

      These messages only show up with NIDs related to BlueGene/Q clients, and I don't see them on file systems that do not have BlueGene/Q clients. I don't see anything in the console or syslog logs on the clients.
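
      One way to check that correlation directly is to tally the client NIDs embedded in the error lines themselves. This is only a sketch; the log path is the same one used above and the command has not been run here:

      # count occurrences of each client NID in the glimpse AST errors
      zgrep -oh "client (nid [^)]*)" /var/log/lustre.log* | sort | uniq -c | sort -rn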

      On the BG/Q clients we are running:
      vulcanio14-ib0@root: cat /proc/fs/lustre/version
      lustre: 2.5.4
      kernel: patchless_client
      build: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.8.2.bgq.3blueos.V1R2M3.bl2.2_1.ppc64

      We saw spurious messages years ago, but after doing a Splunk query of our logs it looks like the rate has been ramping up since late 2014. Number of messages per month (not counting the "x similar messages" suppressions); a rough shell equivalent is sketched after the list:
      10/14: 172
      11/14: 8,043
      12/14: 15,002
      1/15: 17,191
      2/15: 36,258
      3/15: 53,145
      4/15: 69,215
      5/15: 96,518
      6/15: 54,504
      7/15: 91,304
      8/15: 144,691
      9/15: 236,491
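
      For reference, roughly the same per-month picture can be pulled without Splunk by counting matches in each rotated archive. This is a sketch only; the wildcard and path are assumptions based on the zgrep above:

      for f in /var/log/lustre.log-*.gz; do
          printf "%s %s\n" "$f" "$(zgrep -c 'returned -5 from glimpse AST' "$f")"
      done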

      Attachments

        1. debug_log.vesta29.LU-7321.gz
          2.45 MB
        2. debug_log.vesta-mds1.LU-7321.gz
          177 kB
        3. debug_log.vulcanio17
          557 kB
        4. dump-client.log
          186 kB
        5. dump-oss.log.gz
          12.38 MB

        Issue Links

          Activity

            [LU-7321] LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST
            pjones Peter Jones added a comment -

            Hongchao

            Could you please assist with this issue?

            Peter


            utopiabound Nathaniel Clark added a comment -

            We have extended PPC testing in autotest (back in Feb 2016), but there have been no failures since then.
            green Oleg Drokin added a comment -

            These levels should be enough from a server that has a history of hitting the error message in question, since guessing which client would hit it next is hard, if possible at all.

            I assume you might be wary of the added performance impact of these increased debug levels, hence only one server should be enough, even though I personally think rpctrace and ha are fairly low-key levels with not too much chatter.


            utopiabound Nathaniel Clark added a comment -

            Can you try to get some debug logs with the following levels enabled: rpctrace ha

            This will help nail down whether the error is coming back from the client or is due to a connectivity error.
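
            For reference, enabling the requested levels on one affected OSS and capturing a dump would look roughly like the following. These are illustrative commands; the buffer size and output path are placeholders, not values prescribed in this ticket:

            # add rpctrace and ha to the existing debug mask on the OSS
            lctl set_param debug=+rpctrace
            lctl set_param debug=+ha
            # optionally enlarge the in-kernel debug buffer so the window of interest is kept
            lctl set_param debug_mb=512
            # after the error recurs, dump the kernel debug buffer to a file
            lctl dk /tmp/debug_log.$(hostname).LU-7321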

            gerrit Gerrit Updater added a comment -

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/17672
            Subject: LU-7321 test: FOR TEST ONLY
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5d8700e425773b3738844b477b2fe74d6de85b08

            utopiabound Nathaniel Clark added a comment -

            Let me see if I can pull in someone more familiar with this code who might have an idea where the issue would be.

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            I was looking at the debug log files. You need to pay attention to the timestamps. We do system maintenance on Thursdays on Vulcan (the BG/Q cluster), so you should expect network errors on Dec 03 and Dec 10 due to the I/O nodes being rebooted. The logs get dumped when syslog sees the error message come by, so these dumps may have days or weeks of information in them. The messages are happening quite frequently, and I don't believe we are tearing down network connections on each of these instances.

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            And just to add, if it helps, the I/O node has been up for 6 days:
            [root@vulcansn1:~]# mrsh vulcanio17 uptime
            15:20:58 up 6 days, 6:02, 1 user, load average: 3.39, 5.00, 5.64
            [root@vulcansn1:~]# mrsh vulcanio17 cat /proc/sys/lnet/debug
            ioctl neterror warning error emerg config console
            [root@vulcansn1:~]#

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            I'm not sure if the logs have cycled around, but here is the log from vulcanio17, which I just dumped. I'm not sure which field is the timestamp, but it seems like there are no messages since it last mounted.
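
            If it helps with reading the dump: in the ASCII form of a Lustre debug log, the fourth colon-separated field should be the timestamp as seconds.microseconds since the epoch. The snippet below is only a sketch, using the attachment name from this ticket:

            # print the timestamp of the last record, then convert it to a readable date
            awk -F: 'END { print int($4) }' debug_log.vulcanio17
            date -d @"$(awk -F: 'END { print int($4) }' debug_log.vulcanio17)"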

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            I'm not seeing any errors on the IB fabric that would account for this. Could there be some swabbing issues in the IB LND code? We only see these errors on the file systems connected to PPC clients (the BlueGene BG/Q I/O nodes and login nodes). 172.20.17.17@o2ib500 is vulcanio17 (an I/O node on BG/Q).

            utopiabound Nathaniel Clark added a comment -

            It looks like there are a lot of IB errors in the attached debug logs (these messages appear in both):

            16471:0:(o2iblnd_cb.c:1011:kiblnd_tx_complete()) Tx -> 172.20.17.17@o2ib500 cookie 0xbca6a70ba sending 1 waiting 0: failed 12
            16471:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 172.20.17.17@o2ib500: error -5(waiting)

            Could these be the source of the errors?

            The error that is printed on the client is coming up from the network stack.
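
            One way to test that hypothesis would be to interleave the o2iblnd errors with the failed glimpse ASTs from the same dump and compare timestamps. This is a sketch; it assumes the attached dump-oss.log.gz has been uncompressed and is already in ASCII form:

            gunzip -k dump-oss.log.gz
            # merge both message types and sort by the timestamp field (field 4)
            grep -E "kiblnd_tx_complete|kiblnd_close_conn_locked|ldlm_handle_ast_error" dump-oss.log \
                | sort -t: -k4,4n | less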

            People

              Assignee: green Oleg Drokin
              Reporter: marc@llnl.gov D. Marc Stearman (Inactive)
              Votes: 0
              Watchers: 12
