[LU-7321] LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST Created: 20/Oct/15  Updated: 21/Mar/18  Resolved: 21/Mar/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: D. Marc Stearman (Inactive) Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: llnl, ppc
Environment:

LLNL lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64
ZFS Backend
BG/Q clients


Attachments: File debug_log.vesta-mds1.LU-7321.gz     File debug_log.vesta29.LU-7321.gz     File debug_log.vulcanio17     Text File dump-client.log     File dump-oss.log.gz    
Issue Links:
Related
is related to LU-6387 Add Power8 support to Lustre Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On the file systems that have BlueGene/Q clients, I am seeing these log messages:

vestai:/g/g0/marc# tail /var/log/lustre.log
Sep 16 09:25:32 vesta56 kernel: LustreError: 137188:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 4 previous similar messages
Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.11@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff8807adc3eb40/0xef9b0828fb0cdca lrc: 5/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.11@o2ib500 remote: 0x4ceffd8a2ed5c33a expref: 829 pid: 87731 timeout: 9654300922 lvb_type: 0
Sep 16 09:25:49 vesta56 kernel: LustreError: 137212:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 3 previous similar messages
Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.12@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0037_UUID lock: ffff88064a4bc880/0xef9b0828fb0dd59 lrc: 6/0,0 mode: PW/PW res: [0x4b17112:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.12@o2ib500 remote: 0xdec93cb7eb4eca85 expref: 790 pid: 137212 timeout: 9654349575 lvb_type: 0
Sep 16 09:26:37 vesta56 kernel: LustreError: 87731:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 8 previous similar messages
Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) ### not waiting on destroyed lock (bug 5653) ns: filter-fsv-OST0001_UUID lock: ffff880e14fefd40/0xca78e45d33428446 lrc: 1/0,0 mode: --/PW res: [0x4d27886:0x0:0x0].0 rrc: 6 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x74801080010020 nid: 172.20.17.2@o2ib500 remote: 0xa0f4c9e1f12378c4 expref: 1893 pid: 10293 timeout: 0 lvb_type: 0
Sep 16 09:29:19 vesta2 kernel: LustreError: 82827:0:(ldlm_lockd.c:463:ldlm_add_waiting_lock()) Skipped 2 previous similar messages
Sep 16 09:30:59 vesta20 kernel: LustreError: 61820:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff88047e3127c0/0x1475582a881a7576 lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 172.20.17.9@o2ib500 remote: 0xa29fd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654614287 lvb_type: 0
Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.9@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST0013_UUID lock: ffff8806b70d6380/0x1475582a881a7f4e lrc: 5/0,0 mode: PW/PW res: [0x4a8bd2d:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60001080010020 nid: 172.20.17.9@o2ib500 remote: 0x79abd336c2734cc9 expref: 1031 pid: 61818 timeout: 9654630757 lvb_type: 0
Sep 16 09:31:15 vesta20 kernel: LustreError: 61814:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) Skipped 27 previous similar messages
vestai:/g/g0/marc# 

There were over 3000 messages printed to the logs that day, not counting the suppressed messages:
vestai:/g/g0/marc# zgrep -c "returned -5 from glimpse AST" /var/log/lustre.log-20150915.gz
3232
vestai:/g/g0/marc#

These messages only show up with NIDs related to BlueGene/Q clients, and I don't see them on file systems that do not have BlueGene/Q clients. I don't see anything in the console or syslog logs on the clients.
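
For reference, a quick way to confirm which client NIDs are generating these (log path illustrative; the NID is taken from the "client (nid ...)" field):

grep "returned -5 from glimpse AST" /var/log/lustre.log \
  | sed -n 's/.*client (nid \([^)]*\)).*/\1/p' \
  | sort | uniq -c | sort -rn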

On the BG/Q clients we are running:
vulcanio14-ib0@root: cat /proc/fs/lustre/version
lustre: 2.5.4
kernel: patchless_client
build: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.8.2.bgq.3blueos.V1R2M3.bl2.2_1.ppc64

We saw spurious messages years ago, but after doing a Splunk query of our logs it looks like the rate has been ramping up since late 2014.
Number of messages per month, not counting the "Skipped N previous similar messages" lines (a rough shell approximation of this count is sketched after the list):
10/14: 172
11/14: 8,043
12/14: 15,002
1/15: 17,191
2/15: 36,258
3/15: 53,145
4/15: 69,215
5/15: 96,518
6/15: 54,504
7/15: 91,304
8/15: 144,691
9/15: 236,491
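
For reference, a rough shell equivalent of that Splunk query, run over the rotated syslog files (file names and path are illustrative):

for f in /var/log/lustre.log-*.gz; do
    printf '%s: ' "$f"
    zgrep -c "returned -5 from glimpse AST" "$f"
done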



 Comments   
Comment by D. Marc Stearman (Inactive) [ 20/Oct/15 ]

Some more information:

In Oct 2014, we were running 2.4.2 on the BlueGene clusters and the file system servers. In Dec 2014, we updated to 2.5.3 on the servers and 2.5.4 on the BlueGene clients.

Comment by Joseph Gmitter (Inactive) [ 20/Oct/15 ]

Hi Nathaniel,
Can you have a look at this issue?
Thanks.
Joe

Comment by Andreas Dilger [ 20/Oct/15 ]

Marc, could you please check a couple of the OST objects that are reported in these error messages to see if they still exist, e.g. [0x4b17112:0x0:0x0] and [0x4a8bd2d:0x0:0x0] (presumably with zdb to avoid unmounting the OSTs). I'm trying to figure out if these errors are appearing because MDT layouts are referencing missing OST objects, or if the objects exist but there is some error accessing or locking them from the PPC clients. The OST FIDs (object IDs) themselves look reasonable, 78737682 and 78167341 respectively, so it doesn't appear to be a FID swabbing issue at least.
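
For reference, the object ID is the first field of the lock's res FID, and it converts to the decimal object name used on disk with a quick shell one-liner:

printf '%d\n' 0x4b17112    # -> 78737682
printf '%d\n' 0x4a8bd2d    # -> 78167341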

Comment by D. Marc Stearman (Inactive) [ 20/Oct/15 ]

Just to make sure I did the right thing:

Using this current message:

Oct 20 10:53:33 vesta46 kernel: LustreError: 40317:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.14@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff8801c0a3bbc0/0x44146a450a733fa0 lrc: 5/0,0 mode: PW/PW res: [0x4f831ef:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x50801000000000 nid: 172.20.17.14@o2ib500 remote: 0x7266945e442341a2 expref: 5 pid: 40321 timeout: 0 lvb_type: 0

I want to look for object "0x4f831ef:0x0:0x0" on vesta46. I created a snapshot of the OST and mounted it. "0x4f831ef" translates to 83374575 in decimal.

In the snapshot, I did this:

[root@vesta46:snap]# ls
CONFIGS       oi.10   oi.105  oi.110  oi.116  oi.121  oi.127  oi.18  oi.23  oi.29  oi.34  oi.4   oi.45  oi.50  oi.56  oi.61  oi.67  oi.72  oi.78  oi.83  oi.89  oi.94  quota_slave
O             oi.100  oi.106  oi.111  oi.117  oi.122  oi.13   oi.19  oi.24  oi.3   oi.35  oi.40  oi.46  oi.51  oi.57  oi.62  oi.68  oi.73  oi.79  oi.84  oi.9   oi.95  seq-200000003-lastid
health_check  oi.101  oi.107  oi.112  oi.118  oi.123  oi.14   oi.2   oi.25  oi.30  oi.36  oi.41  oi.47  oi.52  oi.58  oi.63  oi.69  oi.74  oi.8   oi.85  oi.90  oi.96  seq_srv
last_rcvd     oi.102  oi.108  oi.113  oi.119  oi.124  oi.15   oi.20  oi.26  oi.31  oi.37  oi.42  oi.48  oi.53  oi.59  oi.64  oi.7   oi.75  oi.80  oi.86  oi.91  oi.97
oi.0          oi.103  oi.109  oi.114  oi.12   oi.125  oi.16   oi.21  oi.27  oi.32  oi.38  oi.43  oi.49  oi.54  oi.6   oi.65  oi.70  oi.76  oi.81  oi.87  oi.92  oi.98
oi.1          oi.104  oi.11   oi.115  oi.120  oi.126  oi.17   oi.22  oi.28  oi.33  oi.39  oi.44  oi.5   oi.55  oi.60  oi.66  oi.71  oi.77  oi.82  oi.88  oi.93  oi.99
[root@vesta46:snap]# ls O/0
d0  d1  d10  d11  d12  d13  d14  d15  d16  d17  d18  d19  d2  d20  d21  d22  d23  d24  d25  d26  d27  d28  d29  d3  d30  d31  d4  d5  d6  d7  d8  d9
[root@vesta46:snap]# ls O/0/*/83374575
ls: cannot access O/0/*/83374575: No such file or directory
[root@vesta46:snap]# 

This tells me the object does not exist. The next object ID does exist, so it seems to be within a valid range of objects:

[root@vesta46:snap]# ls O/0/*/83374576
O/0/d16/83374576
[root@vesta46:snap]# 

If I did not look in the right place, let me know and I can check again.

Comment by D. Marc Stearman (Inactive) [ 20/Oct/15 ]

Just for kicks I picked a couple of other messages on that OST, and found that the files DO exist:

Oct 20 07:52:13 vesta46 kernel: LustreError: 40313:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.185@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff880430e4bb00/0x44146a450a4fb185 lrc: 5/0,0 mode: PW/PW res: [0x4f814c7:0x0:0x0].0 rrc: 17 type: EXT [191496192->18446744073709551615] (req 191496192->191561727) flags: 0x40000000000000 nid: 172.20.17.185@o2ib500 remote: 0x82ceff91b720bbf5 expref: 18 pid: 40313 timeout: 0 lvb_type: 0
Oct 20 07:52:41 vesta46 kernel: LustreError: 40316:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.159@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff880aee7881c0/0x44146a450a4fb903 lrc: 5/0,0 mode: PW/PW res: [0x4f81442:0x0:0x0].0 rrc: 224 type: EXT [1861222400->18446744073709551615] (req 1861222400->18446744073709551615) flags: 0x40000000000000 nid: 172.20.17.159@o2ib500 remote: 0x7363d7a3a2c7e459 expref: 134 pid: 40316 timeout: 0 lvb_type: 0
Oct 20 08:00:33 vesta46 kernel: LustreError: 42292:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.177@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff88068ffb36c0/0x44146a450a50a9cd lrc: 4/0,0 mode: PW/PW res: [0x4f81468:0x0:0x0].0 rrc: 558 type: EXT [1877999616->18446744073709551615] (req 1877999616->18446744073709551615) flags: 0x40000000000000 nid: 172.20.17.177@o2ib500 remote: 0xd4e4f78838543fa8 expref: 291 pid: 42292 timeout: 0 lvb_type: 0
Oct 20 10:53:33 vesta46 kernel: LustreError: 40317:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.14@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST002d_UUID lock: ffff8801c0a3bbc0/0x44146a450a733fa0 lrc: 5/0,0 mode: PW/PW res: [0x4f831ef:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x50801000000000 nid: 172.20.17.14@o2ib500 remote: 0x7266945e442341a2 expref: 5 pid: 40321 timeout: 0 lvb_type: 0

[root@vesta46:snap]# ls O/0/*/83367016
O/0/d8/83367016
[root@vesta46:snap]# ls O/0/*/83366978
O/0/d2/83366978
[root@vesta46:snap]# ls O/0/*/83374575
ls: cannot access O/0/*/83374575: No such file or directory
[root@vesta46:snap]# 
[root@vesta46:snap]# ls O/0/*/83367111
O/0/d7/83367111
[root@vesta46:snap]# 

So, out of the four messages on that OST this morning, three of the files exist, but the one at 10:53:33 does not exist.
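
For completeness, the manual checks above could be scripted against the mounted snapshot along these lines (a rough sketch; it assumes the current directory is the snapshot mount point and pulls the object ID out of the res field of each message):

grep "returned -5 from glimpse AST" /var/log/lustre.log \
  | sed -n 's/.*res: \[\(0x[0-9a-f]*\):.*/\1/p' \
  | while read -r oid; do
        dec=$(printf '%d' "$oid")
        ls O/0/*/"$dec" 2>/dev/null || echo "$oid ($dec): not found"
    done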

Comment by Nathaniel Clark [ 17/Nov/15 ]

Can you provide debug logs from the clients, OSS and MDS when this occurs?

Comment by John Fuchs-Chesney (Inactive) [ 03/Dec/15 ]

Hello Marc,

Are you able to provide the information that Nathaniel has requested?

Any other action you would like us to take on this one?
Thanks,
~ jfc.

Comment by D. Marc Stearman (Inactive) [ 03/Dec/15 ]

I can probably trigger some dumps on the OSTs and MDS node. Let me tweak some scripts and wait for it to happen.

Comment by D. Marc Stearman (Inactive) [ 16/Dec/15 ]

Here are the logs from vesta29, and also from the MDS, at the time this log message appeared:

Dec 15 14:56:21 vesta29 kernel: LustreError: 9813:0:(ldlm_lockd.c:692:ldlm_handle_ast_error()) ### client (nid 172.20.17.17@o2ib500) returned -5 from glimpse AST ns: filter-fsv-OST001c_UUID lock: ffff8803ad803980/0x9749e8c363b75b7f lrc: 5/0,0 mode: PW/PW res: [0x542af58:0x0:0x0].0 rrc: 31 type: EXT [719323136->18446744073709551615] (req 719323136->720371711) flags: 0x60000000000020 nid: 172.20.17.17@o2ib500 remote: 0x1fa09e9911a44c7c expref: 349 pid: 15752 timeout: 7159851604 lvb_type: 0
Comment by Christopher Morrone [ 16/Dec/15 ]

Nathaniel, could you please take a step back for a moment and explain the meaning of the error message? I'll start the conversation by giving my best guess at the situation:

Some server (OST) needs to satisfy a request to look up a file's size, but a client currently holds a range lock on the file, which means the server doesn't know the size (because there may be outstanding dirty data on the client). The server issues a glimpse AST to the client to get that information. The client returns EIO.

So if I'm in the ballpark, the first step would be to look at the client code and figure out under which situations the response will be "EIO". I mean, heck, maybe the glimpse AST code just hasn't been correctly swabbed and the client can't make sense of the RPC.

I could be completely off base. If so, please enlighten us.

I'm trying to figure out if these errors are appearing because MDT layouts are referencing missing OST objects,

I am not sure that I understand that rationale. Why would an OST ask a client about an object of which it itself had no knowledge? That would be a bug right there, would it not? The fact that the OST looked up the locks on the object and decided that a client had a lock on the end of the object pretty much tells us that the object existed on the OST at the time it issued the glimpse AST to the client, does it not?

Comment by Nathaniel Clark [ 16/Dec/15 ]

It looks like there are a lot of IB errors in the attached debug logs (these messages appear in both of the attached logs).

16471:0:(o2iblnd_cb.c:1011:kiblnd_tx_complete()) Tx -> 172.20.17.17@o2ib500 cookie 0xbca6a70ba sending 1 waiting 0: failed 12
16471:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 172.20.17.17@o2ib500: error -5(waiting)

Could these be the source of the errors?

The error that is printed on the client is coming up from the network stack.

Comment by D. Marc Stearman (Inactive) [ 16/Dec/15 ]

I'm not seeing any errors on the IB fabric that would account for this. Could there be some swabbing issues in the IB LND code? We only see these errors on the file systems connected to PPC clients (the BlueGene BG/Q I/O nodes and login nodes). 172.20.17.17@o2ib500 is vulcanio17 (an I/O node on BG/Q).
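
For reference, a fabric-level check of this kind can be done with the standard infiniband-diags tools (invocations illustrative):

ibqueryerrors    # scan the fabric for ports with non-zero error counters
perfquery        # dump the port counters of the local HCA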

Comment by D. Marc Stearman (Inactive) [ 16/Dec/15 ]

I'm not sure if the logs have cycled around, but here is the log from vulcanio17, which I just dumped. I'm not sure which field is the timestamp, but it seems like there are no messages since it last mounted.

Comment by D. Marc Stearman (Inactive) [ 16/Dec/15 ]

And just to add if it helps, the IO node has been up for 6 days:
[root@vulcansn1:~]# mrsh vulcanio17 uptime
15:20:58 up 6 days, 6:02, 1 user, load average: 3.39, 5.00, 5.64
[root@vulcansn1:~]# mrsh vulcanio17 cat /proc/sys/lnet/debug
ioctl neterror warning error emerg config console
[root@vulcansn1:~]#

Comment by D. Marc Stearman (Inactive) [ 16/Dec/15 ]

I was looking at the debug log files. You need to pay attention to the timestamps. We do system maintenance on Thursdays on Vulcan (the BG/Q cluster), so you should expect network errors on Dec 03 and Dec 10 due to the I/O nodes being rebooted. The logs are getting dumped when syslog sees the error message come by, so these dumps may have days or weeks of information in them. The messages are happening quite frequently, and I don't believe we are tearing down network connections on each of these instances.

Comment by Nathaniel Clark [ 17/Dec/15 ]

Let me see if I can pull in someone more familiar with this code who might have an idea where the issue would be.

Comment by Gerrit Updater [ 18/Dec/15 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/17672
Subject: LU-7321 test: FOR TEST ONLY
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5d8700e425773b3738844b477b2fe74d6de85b08

Comment by Nathaniel Clark [ 15/Jan/16 ]

Can you try to get some debug logs with the following levels enabled: rpctrace ha

This will help nail down whether the error is coming back from the client or is due to a connectivity error.
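
A minimal way to do this on a single OSS and capture a dump when the message appears might look like the following (buffer size and file path are illustrative):

lctl set_param debug=+rpctrace
lctl set_param debug=+ha
lctl set_param debug_mb=512                  # enlarge the trace buffer
# ...wait for a "returned -5 from glimpse AST" message, then dump:
lctl dk /tmp/lustre-debug.$(date +%s)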

Comment by Oleg Drokin [ 16/Jan/16 ]

Collecting these levels from a server that has a history of hitting the error message in question should be enough, since guessing which client would hit it next is hard, if possible at all.

I assume you might be wary of the added performance impact of these increased debug levels, hence only one server should be enough, even though I personally think rpctrace and ha are fairly low-key levels without too much chatter.

Comment by Nathaniel Clark [ 19/May/16 ]

We extended PPC testing in autotest back in Feb 2016, but there have been no failures since then.

Comment by Peter Jones [ 16/Jun/16 ]

Hongchao

Could you please assist with this issue?

Peter

Comment by Peter Jones [ 16/Jun/16 ]

Sorry - wrong ticket

Comment by James A Simmons [ 26/Aug/16 ]

The most recent version of Lustre (2.8.56) also displays this problem on Power8. I can easily reproduce it with an IOR single-shared-file (SSF) run on a single node.

Do you need debugging info from both the clients and the server?
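
For reproduction purposes, a run of that general shape (single shared file, i.e. no -F flag) would be something like the following; rank count, sizes, and path are illustrative:

mpirun -np 16 ior -w -r -b 1g -t 1m -o /path/to/lustre/ior_ssf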

Comment by James A Simmons [ 29/Aug/16 ]

I fixed up my LNet DLC YAML configuration file and applied LU-8507 so it configures properly with my pre-2.9 version, and the problem went away. The issue is a symptom of a less-than-ideal LNet configuration. Are you still on the Lustre 2.5 version? Is this with the Mellanox stack? Also, are you connecting to x86 server nodes?
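
For reference, the DLC workflow here is roughly: export the running LNet configuration to YAML, adjust it, and re-import it (a sketch; exact behavior depends on the lnetctl version):

lnetctl export > /tmp/lnet.yaml     # dump the running LNet config as YAML
# edit the YAML (nets, credits, etc.), then re-apply it on an idle node:
lnetctl import /tmp/lnet.yaml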

Comment by Christopher Morrone [ 29/Aug/16 ]

Yes, it is Lustre 2.5.5 + many patches. I believe it is Mellanox. We don't use DLC or YAML, and we don't change the peer credits.

Comment by James A Simmons [ 30/Aug/16 ]

Okay, I was wrong. I do still get this error. It's just that before I saw it on both the clients and the servers; now I only see it on the server side. Do you have any jobs fail due to this?

Comment by James A Simmons [ 30/Aug/16 ]

Here is an lctl dump from both my client and an OSS server.

Comment by Nathaniel Clark [ 31/Aug/16 ]

From OSS dump:

00010000:00010000:19.0:1472585788.794238:0:19880:0:(ldlm_lock.c:662:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-sultan-OST001b_UUID lock: ffff8808c87bb300/0xca29cea60124a3ed lrc: 2/0,0 mode: PW/PW res: [0x1931e3:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 2738880512->2739929087) flags: 0x40000000000000 nid: 10.37.202.11@o2ib1 remote: 0x48e7a088193305c expref: 6 pid: 11322 timeout: 0 lvb_type: 0
Comment by James A Simmons [ 01/Sep/16 ]

So this is causing my applications to fail. I have no idea how to fix that. Any ideas?

Comment by James A Simmons [ 06/Mar/18 ]

Do people still see this? I haven't done large-scale testing yet, but I was wondering if it is still observed.
