[LU-2232] LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG Created: 25/Oct/12  Updated: 24/Aug/15  Resolved: 24/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Joe Mervini Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

Dell R710 servers running TOSS-2.0-2 and DDN 10k storage.


Attachments: File LBUG    
Issue Links:
Duplicate
is duplicated by LU-4844 ost_prolong_lock_one()) ASSERTION( lo... Resolved
Related
is related to LU-5116 Race between resend and reply processing Resolved
Severity: 4
Rank (Obsolete): 5290

 Description   

Last night we had two OSSs panic at virtually the same time with an LBUG error being thrown. We updated our servers and clients from the 2.1.2-3chaos to the 2.1.2-4chaos release within the past two days and had not experienced this issue with the previous release. Below is a sample of the console log from one of the servers. I have also captured all the console messages up until the system panicked and am attaching them.

LustreError: 9044:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
Pid: 9120, comm: ll_ost_io_341

Call Trace:
LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
[<ffffffffa0440895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Pid: 9083, comm: ll_ost_io_304



 Comments   
Comment by Peter Jones [ 25/Oct/12 ]

Oleg is looking into this one

Comment by Oleg Drokin [ 25/Oct/12 ]

Do you have a crash dump? Can you print the lock content by any chance?

Comment by Joe Mervini [ 25/Oct/12 ]

Unfortunately, no. Our servers are diskless, so when a node dumps core it is lost on reboot. I have linked /var/crash to an NFS-mounted device, so hopefully if the LBUG occurs again it will write to that device before freezing.

As far as printing the lock content, I'm not sure what you mean. If you can tell me how to gather that information I will be happy to if I can.

Comment by Oleg Drokin [ 26/Oct/12 ]

Disklessness does not matter.
You just need to configure kdump to dump the kernel core to an NFS share (no need to mount it at runtime; the debug kernel will do it separately).
The lock content can be extracted from the crash dump once you have it.
How repeatable is this?
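
(For reference, a minimal sketch of the kdump configuration being described, assuming a RHEL/TOSS-style /etc/kdump.conf; the NFS server and export path below are placeholders, and directive names can vary between distro releases:)

# /etc/kdump.conf: send the vmcore to an NFS export instead of local disk
nfs nfsserver.example.com:/export/crash    # placeholder NFS export to receive the dump
path /var/crash                            # subdirectory on that export for vmcore files
core_collector makedumpfile -c -d 31       # compress and filter the dump to keep it small

After editing, the kdump service has to be restarted so the new target is picked up by the crash kernel.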

Comment by Joe Mervini [ 08/Nov/12 ]

We should probably close this issue. We decided to roll back to the previous version during a system outage last week to avoid running into the issue again in the short term. We have another system that is preproduction that is running the current version so we will monitor that and report back if we encounter the problem again.

Comment by Peter Jones [ 08/Nov/12 ]

ok thanks Joe!

Comment by Alexey Lyashkov [ 10/Oct/13 ]

Xyratex hit this bug in its own branch.

Comment by Patrick Farrell (Inactive) [ 03/Feb/14 ]

Cray just hit this in our 2.5 with SLES11SP3 clients on CentOS 6.4 servers, twice.

Both times, it was immediately preceded by a 'double eviction', that is, two callback timers expiring within the same second for the same client:

2014-02-01 13:44:35 LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 54@gni1 ns: filter-esfprod-OST0004_UUID lock: ffff8803fa547980/0xd460be1680c1b564 lrc: 3/0,0 mode: PW/PW res: [0x424b478:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 54@gni1 remote: 0xf4283d6d3305e0e5 expref: 2679 pid: 22955 timeout: 4994587934 lvb_type: 0
2014-02-01 13:44:35 LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 54@gni1 ns: filter-esfprod-OST0004_UUID lock: ffff8802c669d100/0xd460be1680c1b580 lrc: 3/0,0 mode: PW/PW res: [0x424b479:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 54@gni1 remote: 0xf4283d6d3305e155 expref: 2679 pid: 22955 timeout: 4994587934 lvb_type: 0

And the other instance of it (on a different system, also running Cray's 2.5):
LustreError: 3489:0:(ldlm_lockd.c:433:ldlm_add_waiting_lock()) Skipped 1 previous similar message
LustreError: 53:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 113s: evicting client at 62@gni ns: filter-scratch-OST0001_UUID lock: ffff880445551180/0xa9b58315ab2d4d0a lrc: 3/0,0 mode: PW/PW res: [0xb5da7e1:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 62@gni remote: 0x330a8b1d9fd47938 expref: 1024 pid: 16057 timeout: 4781830324 lvb_type: 0

LustreError: 53:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 113s: evicting client at 62@gni ns: filter-scratch-OST0001_UUID lock: ffff880387977bc0/0xa9b58315ab2d4d6c lrc: 3/0,0 mode: PW/PW res: [0xb5da7e6:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 62@gni remote: 0x330a8b1d9fd47bfb expref: 1030 pid: 16082 timeout: 4781830324 lvb_type: 0

I'm just speculating, but I suspect this is key.

For some reason, the nid_stats struct pointer in the obd_export is zero, so I wasn't able to confirm if this export was actually the double-evicted client. (Perhaps this is always the case?)

[Edited to clarify server and client versions. Also, we've seen this several times since. Not sure why we're suddenly seeing it now, as indications are the underlying bug has been in the code for some time.]

Comment by Patrick Farrell (Inactive) [ 08/Apr/14 ]

Lai, Jodi,

I'm happy to see this bug re-opened, but I'm curious why. Was it seen again at Intel or a customer site, or was it re-opened because of the Cray report?

Comment by Lai Siyao [ 08/Apr/14 ]

Patrick, this was seen again in LU-4844, so it's better to re-open this one and mark LU-4844 as a duplicate.

Comment by Patrick Farrell (Inactive) [ 08/Apr/14 ]

Thank you, Lai.

Cray can provide a dump of an OST with this crash, if you think that would be helpful.

We're currently trying to replicate it with D_RPCTRACE enabled, so if we get a dump with that, we can send that over as well. (We're working with Xyratex on this bug.)

Comment by Patrick Farrell (Inactive) [ 08/Apr/14 ]

We replicated this on an OSS with +rpctrace enabled. I've provided the dump to Xyratex, and it will be uploaded here in about 5 minutes:
ftp.whamcloud.com
uploads/LELUS-2232/LELUS-234_LU-2232.1404081304.tar.gz

Comment by Lai Siyao [ 09/Apr/14 ]

The backtrace shows the LBUG happens in the first ost_prolong_lock_one() call in ost_prolong_locks(). IMO what happened is this:
1. The client did IO with the lock handle in the request.
2. The IO bulk failed on the server, so no reply was sent to the client to let it resend.
3. Lock cancellation timed out on the server, and the client was evicted.
4. The client reconnected and resent the previous IO request; however, the lock handle was obsolete, so the LASSERT was triggered. (This lock should have been replayed, but the request was simply resent, and there is no way to update the lock handle in a resent request.)

I'll provide a patch that checks lock->l_export against opd->opd_exp rather than asserting in the first ost_prolong_lock_one().
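
(To illustrate the proposed change, here is a minimal sketch of the shape such a check could take in ost_prolong_lock_one(). This is not the actual patch; the real changes are in the Gerrit links in the next comment. The opd_timeout field and the LDLM_ERROR/ldlm_refresh_waiting_lock calls are assumptions based on the assertion message and the debug output quoted later in this ticket:)

static void ost_prolong_lock_one(struct ost_prolong_data *opd,
                                 struct ldlm_lock *lock)
{
        /* Previously: LASSERT(lock->l_export == opd->opd_exp), which LBUGs
         * the OSS when a resent request carries a lock handle belonging to
         * a stale (pre-eviction) export. */
        if (lock->l_export != opd->opd_exp) {
                /* Log the mismatch and skip prolonging this lock instead of
                 * asserting; the handle is treated as obsolete. */
                LDLM_ERROR(lock, "LU-2232: lock export %p != req export %p, "
                           "this lock is obsolete!",
                           lock->l_export, opd->opd_exp);
                return;
        }

        /* opd_timeout is an assumed field name; the real code refreshes the
         * lock's callback timer for the duration of the IO. */
        ldlm_refresh_waiting_lock(lock, opd->opd_timeout);
}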

Comment by Lai Siyao [ 10/Apr/14 ]

Patches are ready:
2.4: http://review.whamcloud.com/#/c/9925/
2.5: http://review.whamcloud.com/#/c/9926/
master: http://review.whamcloud.com/#/c/9927/

Comment by Oleg Drokin [ 16/Apr/14 ]

Question: if the client was evicted, the resend of the request should have failed because the server would reject it.
So I think there is something else happening that your explanation does not quite cover.

Comment by Lai Siyao [ 18/Apr/14 ]

I suspect some requests were not aborted on the client upon eviction, so they were resent through the new connection. I'll run some tests to find more proof.

Comment by Lai Siyao [ 04/May/14 ]

I ran some tests but couldn't reproduce this failure, so I updated the 2.4/2.5 patch into a debug patch which prints the lock and request export addresses and then crashes as before, so that we can dump these two exports to help analyse.

Could you apply the patch and try to reproduce it? If it crashes, please upload the crash dump.

Comment by Oleg Drokin [ 02/Jun/14 ]

I think LU-5116 might be related here; it could explain a resend across an eviction.

Comment by Peter Jones [ 02/Jun/14 ]

It would be a sensible approach to try both the fix from LU-5116 and the diagnostic patch from LU-2232. That way the issue may be fixed by LU-5116, but if there is still a residual problem we will have fuller information to go forward with.

Comment by Patrick Farrell (Inactive) [ 02/Jun/14 ]

Lai - Sorry we didn't update this earlier. Cray tried to reproduce this bug with your debug patch and was unable to do so. After that, we pulled your patch set 1 into our Lustre version and haven't seen this bug since.

Comment by Peter Jones [ 10/Jun/14 ]

Patrick

To be clear, Cray could reliably reproduce this issue, then applied the diagnostic patch and could not? What code line was this on?

Thanks

Peter

Comment by Patrick Farrell (Inactive) [ 10/Jun/14 ]

Peter,

That's correct. It was on b2_5.

- Patrick

Comment by Peter Jones [ 10/Jun/14 ]

thanks Patrick. Wow. That is odd.

Comment by Patrick Farrell (Inactive) [ 25/Aug/14 ]

Lai - Given that the code in this area was changed a fair bit in 2.6, do you think this bug is likely to still be present?

Comment by Lai Siyao [ 26/Aug/14 ]

Per Oleg's comment, this may be a duplicate of LU-5116. I've tried to verify that by adding some code and tests, but haven't had enough time to finish. I will keep an eye on it, and for the moment we can mark this resolved.

Comment by Patrick Farrell (Inactive) [ 26/Aug/14 ]

Lai - Thanks for the answer. We'll go with that and I'll let you know if Cray sees any issues. (We're just moving into internal deployment of 2.6.)

Comment by Christopher Morrone [ 12/May/15 ]

The debug message hit on an MDS. Here it is:

2015-04-24 09:00:42 Lustre: lsd-OST0000: haven't heard from client 6048f9e4-4557-a727-5b7d-dd9f7cb9c8b1 (at 192.168.116.115@o2ib5) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8805cbcd5000, cur 1429891242 expire 1429891092 last 1429891015
2015-04-24 09:14:05 LustreError: 0:0:(ldlm_lockd.c:355:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 172.19.1.85@o2ib100  ns: filter-lsd-OST0000_UUID lock: ffff880bf976a880/0x7e1b5086f6ebf35e lrc: 3/0,0 mode: PR/PR res: [0x2678a59:0x0:0x0].0 rrc: 266 type: EXT [0->18446744073709551615] (req 468705280->468717567) flags: 0x60000000000020 nid: 172.19.1.85@o2ib100 remote: 0xa9a53ce40d684d49 expref: 54562 pid: 6158 timeout: 5072055798 lvb_type: 0
2015-04-24 09:14:05 LustreError: 22433:0:(ldlm_lockd.c:2342:ldlm_cancel_handler()) ldlm_cancel from 172.19.1.85@o2ib100 arrived at 1429892045 with bad export cookie 9086945213692644542
2015-04-24 09:14:05 LustreError: 6684:0:(ldlm_lock.c:2448:ldlm_lock_dump_handle()) ### ### ns: filter-lsd-OST0000_UUID lock: ffff880d7abe3680/0x7e1b5086f3c622c3 lrc: 3/0,0 mode: PR/PR res: [0x24bf940:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000000000 nid: 172.19.1.85@o2ib100 remote: 0xa9a53ce40617fec7 expref: 54547 pid: 6183 timeout: 0 lvb_type: 1
2015-04-24 09:14:05 LustreError: 22433:0:(ldlm_lockd.c:2342:ldlm_cancel_handler()) Skipped 1 previous similar message
2015-04-24 09:18:04 LustreError: 6784:0:(ost_handler.c:1956:ost_prolong_locks()) ### LU-2232: lock export ffff880e2d590400 != req export ffff880d8a2eb400, this lock is obsolete!
2015-04-24 09:18:04  ns: filter-lsd-OST0000_UUID lock: ffff880bf976a880/0x7e1b5086f6ebf35e lrc: 4/0,0 mode: PR/PR res: [0x2678a59:0x0:0x0].0 rrc: 135 type: EXT [0->18446744073709551615] (req 468705280->468717567) flags: 0x60000000000020 nid: 172.19.1.85@o2ib100 remote: 0xa9a53ce40d684d49 expref: 35636 pid: 6158 timeout: 5072055798 lvb_type: 0

No assertion, though. Lustre 2.5.3-5chaos.

Comment by Lai Siyao [ 25/May/15 ]

The log shows that after client eviction (and quite possibly reconnection), a stale lock handle is packed in the client's read/write request.

I added some code to try to simulate this, but failed. According to the code, after eviction the client's in-flight RPCs should be aborted and its locks cleaned up. If a full debug log covering this recovery (at least on the client) can be obtained when this error message is seen, it would help move this forward.
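
(For reference, a minimal sketch of capturing such a full client-side debug log around the eviction/resend window, assuming standard lctl tooling; the mask, buffer size and output path below are just examples:)

lctl set_param debug=-1                  # enable all debug masks on the client
lctl set_param debug_mb=1024             # enlarge the debug buffer so the window of interest is not overwritten
# ... reproduce the eviction and resend ...
lctl dk /tmp/lustre-client-debug.log     # dump the kernel debug buffer to a file for upload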

Comment by Gerrit Updater [ 27/May/15 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/14950
Subject: LU-2232 debug: print debug for prolonged lock
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 1d21596de85b81bb3f98277f8f0b425368ccb187

Comment by Lai Siyao [ 27/May/15 ]

The updated debug patch will print both lock and request details, which can tell us whether the request is new, or a resent/replayed request.

Comment by Christopher Morrone [ 20/Jun/15 ]

I have pulled change 14950, Patch Set 1, into LLNL's local tree. It is in the queue to go into the next TOSS release and eventually roll out into production.

Comment by D. Marc Stearman (Inactive) [ 24/Aug/15 ]

We have not seen any more occurrences of this error since we rolled out our 2.5.4-4chaos version into production.

Comment by Peter Jones [ 24/Aug/15 ]

Thanks Marc!
