[LU-2232] LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG Created: 25/Oct/12 Updated: 24/Aug/15 Resolved: 24/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Joe Mervini | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
Dell R710 servers running TOSS-2.0-2 and DDN 10k storage. |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 4 |
| Rank (Obsolete): | 5290 |
| Description |
|
Last night we had two OSSs panic at virtually the same time with an LBUG error being thrown. We just updated our servers and clients to 2.1.2-4chaos from the 2.1.2-3chaos release within the past 2 days and had not experienced this issue with the previous release. Below is a sample of the console log from one of the servers. I have also captured all the console messages up until the system panicked and am attaching it.

LustreError: 9044:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
Call Trace: |
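For readers unfamiliar with this path: the failing check sits at the top of ost_prolong_lock_one(), which the OST uses to refresh the callback timers of locks covering an in-flight bulk read/write so the client is not evicted mid-I/O. A minimal sketch of that shape, reconstructed from the assertion text rather than the verbatim 2.1.2 source (field names other than opd_exp are illustrative):

```c
/* Sketch only, not the verbatim ost_handler.c code; opd_timeout is an
 * illustrative field name. */
static void ost_prolong_lock_one(struct ost_prolong_data *opd,
                                 struct ldlm_lock *lock)
{
        /* The LBUG fires here: every lock passed in is expected to
         * belong to the same export (client connection) as the
         * request currently being serviced. */
        LASSERT(lock->l_export == opd->opd_exp);

        /* Push the lock's callback timer forward so the client is not
         * evicted while its bulk I/O is still in flight. */
        ldlm_refresh_waiting_lock(lock, opd->opd_timeout);
}
```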
| Comments |
| Comment by Peter Jones [ 25/Oct/12 ] |
|
Oleg is looking into this one |
| Comment by Oleg Drokin [ 25/Oct/12 ] |
|
Do you have a crash dump? Can you print the lock content by any chance? |
| Comment by Joe Mervini [ 25/Oct/12 ] |
|
Unfortunately, no. Our servers are diskless, so when it dumps core it is lost on reboot. I have linked /var/crash to an NFS-mounted device, so hopefully if the LBUG occurs again it will write to that device before freezing. As far as printing the lock content, I'm not sure what you mean. If you can tell me how to gather that information, I will be happy to do so if I can. |
| Comment by Oleg Drokin [ 26/Oct/12 ] |
|
Disklessness does not matter. |
| Comment by Joe Mervini [ 08/Nov/12 ] |
|
We should probably close this issue. We decided to roll back to the previous version during a system outage last week to avoid running into the issue again in the short term. We have another system that is preproduction that is running the current version so we will monitor that and report back if we encounter the problem again. |
| Comment by Peter Jones [ 08/Nov/12 ] |
|
ok thanks Joe! |
| Comment by Alexey Lyashkov [ 10/Oct/13 ] |
|
Xyratex hit this bug on our own branch. |
| Comment by Patrick Farrell (Inactive) [ 03/Feb/14 ] |
|
Cray just hit this in our 2.5 with SLES11SP3 clients on CentOS 6.4 servers, twice. Both times, it was immediately preceded by a 'double eviction' of a particular client, two callback timers expiring within the same second for the same client:

2014-02-01 13:44:35 LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 54@gni1 ns: filter-esfprod-OST0004_UUID lock: ffff8803fa547980/0xd460be1680c1b564 lrc: 3/0,0 mode: PW/PW res: [0x424b478:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 54@gni1 remote: 0xf4283d6d3305e0e5 expref: 2679 pid: 22955 timeout: 4994587934 lvb_type: 0

And the other instance of it (on a different system, also running Cray's 2.5):

LustreError: 53:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 113s: evicting client at 62@gni

I'm just speculating, but I suspect this is key. For some reason, the nid_stats struct pointer in the obd_export is zero, so I wasn't able to confirm if this export was actually the double-evicted client. (Perhaps this is always the case?)

[Edited to clarify server and client versions. Also, we've seen this several times since. Not sure why we're suddenly seeing it now, as indications are the underlying bug has been in the code for some time.] |
| Comment by Patrick Farrell (Inactive) [ 08/Apr/14 ] |
|
Lai, Jodi, I'm happy to see this bug re-opened, but I'm curious why? Was it seen again at Intel or a customer site? Was it re-opened because of the Cray report? |
| Comment by Lai Siyao [ 08/Apr/14 ] |
|
Patrick, this is seen again in |
| Comment by Patrick Farrell (Inactive) [ 08/Apr/14 ] |
|
Thank you, Lai. Cray can provide a dump of an OST with this crash, if you think that would be helpful. We're currently trying to replicate it with D_RPCTRACE enabled, so if we get a dump with that, we can send that over as well. (We're working with Xyratex on this bug.) |
| Comment by Patrick Farrell (Inactive) [ 08/Apr/14 ] |
|
We replicated this on an OSS with +rpctrace enabled. I've provided the dump to Xyratex, and it will be uploaded here in about 5 minutes: |
| Comment by Lai Siyao [ 09/Apr/14 ] |
|
The backtrace shows it LBUGs on the first ost_prolong_lock_one() call in ost_prolong_locks(); IMO what happened is like this: I'll provide a patch that checks lock->l_export against opd->opd_exp rather than asserting for the first ost_prolong_lock_one(). |
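Judging from the console message Christopher later reports below, the change would look roughly like this (a sketch, not the exact patch):

```c
/* Sketch of the proposed change: rather than LASSERT()ing, treat a
 * lock whose export differs from the request's export as obsolete,
 * dump it to the console, and skip prolonging it. */
if (lock->l_export != opd->opd_exp) {
        LDLM_ERROR(lock, "LU-2232: lock export %p != req export %p, "
                   "this lock is obsolete!",
                   lock->l_export, opd->opd_exp);
        return;
}
```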
| Comment by Lai Siyao [ 10/Apr/14 ] |
|
Patches are ready: |
| Comment by Oleg Drokin [ 16/Apr/14 ] |
|
Question: if the client was evicted, then resending the request should have failed, because the server would reject it. |
| Comment by Lai Siyao [ 18/Apr/14 ] |
|
I'm afraid some requests were not aborted upon eviction on the client, so they were resent through the new connection. I'll run some tests to find more proof. |
| Comment by Lai Siyao [ 04/May/14 ] |
|
I did some testing but couldn't reproduce this failure, so I turned the 2.4/2.5 patch into a debug patch which prints the lock and request export addresses and then crashes as before, so that we can dump both exports to help analyse the problem. Could you apply the patch and try to reproduce it, and upload the crash dump if it crashes? |
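In other words, something along these lines (an assumed structure, not the actual debug patch): keep the crash so a dump is still produced, but report both export pointers first so they can be located in the dump.

```c
/* Debug-patch sketch (assumed structure): print both export pointers,
 * dump the lock, then still LBUG so a crash dump is generated. */
if (lock->l_export != opd->opd_exp) {
        CERROR("lock export %p != req export %p\n",
               lock->l_export, opd->opd_exp);
        LDLM_ERROR(lock, "export mismatch");
        LBUG();
}
```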
| Comment by Oleg Drokin [ 02/Jun/14 ] |
|
I think |
| Comment by Peter Jones [ 02/Jun/14 ] |
|
It would be a sensible approach to try both the fix from |
| Comment by Patrick Farrell (Inactive) [ 02/Jun/14 ] |
|
Lai - Sorry we didn't update this earlier. Cray tried to reproduce this bug with your debug patch and was unable to do so. After that, we pulled your patch set 1 into our Lustre version and haven't seen this bug since. |
| Comment by Peter Jones [ 10/Jun/14 ] |
|
Patrick, to be clear: Cray could reliably reproduce this issue, then applied the diagnostic patch and could no longer reproduce it? What code line was this on? Thanks, Peter |
| Comment by Patrick Farrell (Inactive) [ 10/Jun/14 ] |
|
Peter, that's correct. It was on b2_5. |
| Comment by Peter Jones [ 10/Jun/14 ] |
|
thanks Patrick. Wow. That is odd. |
| Comment by Patrick Farrell (Inactive) [ 25/Aug/14 ] |
|
Lai - Given that the code in this area was changed a fair bit in 2.6, do you think this bug is likely to still be present? |
| Comment by Lai Siyao [ 26/Aug/14 ] |
|
Per Oleg's comment, this may be a duplicate of |
| Comment by Patrick Farrell (Inactive) [ 26/Aug/14 ] |
|
Lai - Thanks for the answer. We'll go with that and I'll let you know if Cray sees any issues. (We're just moving into internal deployment of 2.6.) |
| Comment by Christopher Morrone [ 12/May/15 ] |
|
The debug message hit on an MDS. Here it is:

2015-04-24 09:00:42 Lustre: lsd-OST0000: haven't heard from client 6048f9e4-4557-a727-5b7d-dd9f7cb9c8b1 (at 192.168.116.115@o2ib5) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8805cbcd5000, cur 1429891242 expire 1429891092 last 1429891015
2015-04-24 09:14:05 LustreError: 0:0:(ldlm_lockd.c:355:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 172.19.1.85@o2ib100 ns: filter-lsd-OST0000_UUID lock: ffff880bf976a880/0x7e1b5086f6ebf35e lrc: 3/0,0 mode: PR/PR res: [0x2678a59:0x0:0x0].0 rrc: 266 type: EXT [0->18446744073709551615] (req 468705280->468717567) flags: 0x60000000000020 nid: 172.19.1.85@o2ib100 remote: 0xa9a53ce40d684d49 expref: 54562 pid: 6158 timeout: 5072055798 lvb_type: 0
2015-04-24 09:14:05 LustreError: 22433:0:(ldlm_lockd.c:2342:ldlm_cancel_handler()) ldlm_cancel from 172.19.1.85@o2ib100 arrived at 1429892045 with bad export cookie 9086945213692644542
2015-04-24 09:14:05 LustreError: 6684:0:(ldlm_lock.c:2448:ldlm_lock_dump_handle()) ### ### ns: filter-lsd-OST0000_UUID lock: ffff880d7abe3680/0x7e1b5086f3c622c3 lrc: 3/0,0 mode: PR/PR res: [0x24bf940:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000000000 nid: 172.19.1.85@o2ib100 remote: 0xa9a53ce40617fec7 expref: 54547 pid: 6183 timeout: 0 lvb_type: 1
2015-04-24 09:14:05 LustreError: 22433:0:(ldlm_lockd.c:2342:ldlm_cancel_handler()) Skipped 1 previous similar message
2015-04-24 09:18:04 LustreError: 6784:0:(ost_handler.c:1956:ost_prolong_locks()) ### LU-2232: lock export ffff880e2d590400 != req export ffff880d8a2eb400, this lock is obsolete!
2015-04-24 09:18:04 ns: filter-lsd-OST0000_UUID lock: ffff880bf976a880/0x7e1b5086f6ebf35e lrc: 4/0,0 mode: PR/PR res: [0x2678a59:0x0:0x0].0 rrc: 135 type: EXT [0->18446744073709551615] (req 468705280->468717567) flags: 0x60000000000020 nid: 172.19.1.85@o2ib100 remote: 0xa9a53ce40d684d49 expref: 35636 pid: 6158 timeout: 5072055798 lvb_type: 0

No assertion, though. Lustre 2.5.3-5chaos. |
| Comment by Lai Siyao [ 25/May/15 ] |
|
The log shows that after client eviction (and quite possibly a reconnect), a stale lock handle is packed in a client read/write request. I added some code to try to simulate this, but failed. According to the code, after eviction the client's in-flight RPCs are aborted and its locks are cleaned up. If a full debug log covering this recovery (at least on the client) can be obtained when this error message is seen, it would help move this forward. |
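For reference, the client-side behaviour described here corresponds to the import invalidation path; a hedged sketch of the expected ordering follows (client_handle_eviction() is a hypothetical wrapper for illustration only):

```c
/* Hypothetical wrapper illustrating the expected client-side eviction
 * handling: in-flight RPCs on the import are aborted and the locks in
 * its namespace are discarded locally, so no stale lock handle should
 * survive into a resent request. */
static void client_handle_eviction(struct obd_import *imp,
                                   struct ldlm_namespace *ns)
{
        /* abort everything still queued or in flight on this import */
        ptlrpc_invalidate_import(imp);

        /* drop all locks in this namespace without contacting the
         * (unreachable) server */
        ldlm_namespace_cleanup(ns, LDLM_FL_LOCAL_ONLY);
}
```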
| Comment by Gerrit Updater [ 27/May/15 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/14950 |
| Comment by Lai Siyao [ 27/May/15 ] |
|
The updated debug patch will print both lock and request details, which can tell us whether the request is new or a resent/replayed request. |
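On the server side, whether a request is new or a resend/replay is visible from the ptlrpc message flags; a minimal sketch of that classification, assuming req is the incoming struct ptlrpc_request:

```c
/* Sketch: classify the incoming request from its message flags;
 * "req" is assumed to be the incoming struct ptlrpc_request. */
__u32 flags = lustre_msg_get_flags(req->rq_reqmsg);

if (flags & MSG_REPLAY)
        CDEBUG(D_RPCTRACE, "replayed request\n");
else if (flags & MSG_RESENT)
        CDEBUG(D_RPCTRACE, "resent request\n");
else
        CDEBUG(D_RPCTRACE, "new request\n");
```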
| Comment by Christopher Morrone [ 20/Jun/15 ] |
|
I have pulled change 14950, Patch Set 1, into LLNL's local tree. It is in the queue to go into the next TOSS release and eventually roll out into production. |
| Comment by D. Marc Stearman (Inactive) [ 24/Aug/15 ] |
|
We have not seen any more occurrences of this error since we rolled out our 2.5.4-4chaos version into production. |
| Comment by Peter Jones [ 24/Aug/15 ] |
|
Thanks Marc! |