Lustre / LU-7321

LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: Lustre 2.5.3
    • Environment: LLNL lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64
      ZFS Backend
      BG/Q clients
    • Severity: 3

    Description

      On the file systems that have BlueGene/Q clients, I am seeing these log messages:

      vestai:/g/g0/marc# tail /var/log/lustre.log
      Sep 16 09:25:32 vesta56 kernel: LustreError: 137188:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 4 previous similar messages
      Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.11@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff8807adc3eb40/0xef9b0828fb0cdca lrc: 5/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.11@o2ib500 remote: 0x4ceffd8a2ed5c33a expref: 829 pid: 87731 timeout: 9654300922 lvb_type: 0
      Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 3 previous similar messages
      Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.12@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff88064a4bc880/0xef9b0828fb0dd59 lrc: 6/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.12@o2ib500 remote: 0xdec93cb7eb4eca85 expref: 790 pid: 137212 timeout: 9654349575 lvb_type: 0
      Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 8 previous similar messages
      Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) ### not waiting on destroyed lock (bug 5653) ns: filter-fsv-OST0001_UUID lock: ffff880e14fefd40/0xca78e45d33428446 lrc: 1/0,0 mode: --/PW res: [0x4d27886:0x0:0x0].0 rrc: 6 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x74801080010020 nid: 172.20.17.2@o2ib500 remote: 0xa0f4c9e1f12378c4 expref: 1893 pid: 10293 timeout: 0 lvb_type: 0
      Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) Skipped 2 previous similar messages
      Sep 16 09:30:59 vesta20 kernel: LustreError: 61820:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff88047e3127c0/0x1475582a881a7576 lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.9@o2ib500 remote: 0xa29fd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654614287 lvb_type: 0
      Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff8806b70d6380/0x1475582a881a7f4e lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60001080010020 nid: 172.20.17.9@o2ib500 remote: 0x79abd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654630757 lvb_type: 0
      Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 27 previous similar messages
      vestai:/g/g0/marc# 
      

      There were over 3000 messages printed to the logs that day, not counting the suppressed messages:
      vestai:/g/g0/marc# zgrep -c "returned -5 from glimpse AST" /var/log/lustre.log-20150915.gz
      3232
      vestai:/g/g0/marc#

      These messages only show up with NIDs related to BlueGene/Q clients, and I don't see them on file systems that do not have BlueGene/Q clients. I don't see anything in the console or syslog logs on the clients.
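
      One way to check that correlation directly is to tally the client NIDs embedded in the error lines themselves. This is only a sketch; the log path is the same one used above and the command has not been run here:

      # count occurrences of each client NID in the glimpse AST errors
      zgrep -oh "client (nid [^)]*)" /var/log/lustre.log* | sort | uniq -c | sort -rn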

      On the BG/Q clients we are running:
      vulcanio14-ib0@root: cat /proc/fs/lustre/version
      lustre: 2.5.4
      kernel: patchless_client
      build: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.8.2.bgq.3blueos.V1R2M3.bl2.2_1.ppc64

      We saw spurious messages years ago, but after doing a Splunk query of our logs it looks like the rate has been ramping up since late 2014. Number of messages per month (not counting the "x similar messages" suppressions); a rough shell equivalent is sketched after the list:
      10/14: 172
      11/14: 8,043
      12/14: 15,002
      1/15: 17,191
      2/15: 36,258
      3/15: 53,145
      4/15: 69,215
      5/15: 96,518
      6/15: 54,504
      7/15: 91,304
      8/15: 144,691
      9/15: 236,491
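
      For reference, roughly the same per-month picture can be pulled without Splunk by counting matches in each rotated archive. This is a sketch only; the wildcard and path are assumptions based on the zgrep above:

      for f in /var/log/lustre.log-*.gz; do
          printf "%s %s\n" "$f" "$(zgrep -c 'returned -5 from glimpse AST' "$f")"
      done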

      Attachments

        1. debug_log.vesta29.LU-7321.gz
          2.45 MB
        2. debug_log.vesta-mds1.LU-7321.gz
          177 kB
        3. debug_log.vulcanio17
          557 kB
        4. dump-client.log
          186 kB
        5. dump-oss.log.gz
          12.38 MB

        Issue Links

          Activity

            [LU-7321] LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST
            pjones Peter Jones added a comment -

            Hongchao

            Could you please assist with this issue?

            Peter


            utopiabound Nathaniel Clark added a comment -

            We have extended PPC testing in autotest (back in Feb 2016), but there have been no failures since then.
            green Oleg Drokin added a comment -

            These levels should be enough from a server that has a history of hitting the error message in question, since guessing which client would hit it next is hard, if possible at all.

            I assume you might be wary of the added performance impact of these increased debug levels, hence only one server should be enough, even though I personally think rpctrace and ha are fairly low-key levels with not too much chatter.


            utopiabound Nathaniel Clark added a comment -

            Can you try to get some debug logs with the following levels enabled: rpctrace ha

            This will help nail down whether the error is coming back from the client or is due to a connectivity error.
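
            For reference, enabling the requested levels on one affected OSS and capturing a dump would look roughly like the following. These are illustrative commands; the buffer size and output path are placeholders, not values prescribed in this ticket:

            # add rpctrace and ha to the existing debug mask on the OSS
            lctl set_param debug=+rpctrace
            lctl set_param debug=+ha
            # optionally enlarge the in-kernel debug buffer so the window of interest is kept
            lctl set_param debug_mb=512
            # after the error recurs, dump the kernel debug buffer to a file
            lctl dk /tmp/debug_log.$(hostname).LU-7321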

            gerrit Gerrit Updater added a comment -

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/17672
            Subject: LU-7321 test: FOR TEST ONLY
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5d8700e425773b3738844b477b2fe74d6de85b08

            utopiabound Nathaniel Clark added a comment -

            Let me see if I can pull in someone more familiar with this code who might have an idea where the issue would be.

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            I was looking at the debug log files. You need to pay attention to the timestamps. We do system maintenance on Thursdays on Vulcan (the BG/Q cluster), so you should expect network errors on Dec 03 and Dec 10 due to the I/O nodes being rebooted. The logs get dumped when syslog sees the error message come by, so these dumps may have days or weeks of information in them. The messages are happening quite frequently, and I don't believe we are tearing down network connections on each of these instances.

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            And just to add, if it helps, the I/O node has been up for 6 days:
            [root@vulcansn1:~]# mrsh vulcanio17 uptime
            15:20:58 up 6 days, 6:02, 1 user, load average: 3.39, 5.00, 5.64
            [root@vulcansn1:~]# mrsh vulcanio17 cat /proc/sys/lnet/debug
            ioctl neterror warning error emerg config console
            [root@vulcansn1:~]#

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            I'm not sure if the logs have cycled around, but here is the log from vulcanio17, which I just dumped. I'm not sure which field is the timestamp, but it seems like there are no messages since it last mounted.
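
            If it helps with reading the dump: in the ASCII form of a Lustre debug log, the fourth colon-separated field should be the timestamp as seconds.microseconds since the epoch. The snippet below is only a sketch, using the attachment name from this ticket:

            # print the timestamp of the last record, then convert it to a readable date
            awk -F: 'END { print int($4) }' debug_log.vulcanio17
            date -d @"$(awk -F: 'END { print int($4) }' debug_log.vulcanio17)"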

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            I'm not seeing any errors on the IB fabric that would account for this. Could there be some swabbing issues in the IB LND code? We only see these errors on the file systems connected to PPC clients (the BlueGene BG/Q I/O nodes and login nodes). 172.20.17.17@o2ib500 is vulcanio17 (an I/O node on BG/Q).

            utopiabound Nathaniel Clark added a comment -

            It looks like there are a lot of IB errors in the attached debug logs (these messages appear in both):

            16471:0:(o2iblnd_cb.c:1011:kiblnd_tx_complete()) Tx -> 172.20.17.17@o2ib500 cookie 0xbca6a70ba sending 1 waiting 0: failed 12
            16471:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 172.20.17.17@o2ib500: error -5(waiting)

            Could these be the source of the errors?

            The error that is printed on the client is coming up from the network stack.
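
            One way to test that hypothesis would be to interleave the o2iblnd errors with the failed glimpse ASTs from the same dump and compare timestamps. This is a sketch; it assumes the attached dump-oss.log.gz has been uncompressed and is already in ASCII form:

            gunzip -k dump-oss.log.gz
            # merge both message types and sort by the timestamp field (field 4)
            grep -E "kiblnd_tx_complete|kiblnd_close_conn_locked|ldlm_handle_ast_error" dump-oss.log \
                | sort -t: -k4,4n | less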

            People

              Assignee: green Oleg Drokin
              Reporter: marc@llnl.gov D. Marc Stearman (Inactive)
              Votes: 0
              Watchers: 12
