
[LU-2232] LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.2
    • Environment: Dell R710 servers running TOSS-2.0-2 and DDN 10k storage
    • 4
    • 5290

    Description

      Last night we had two OSSs panic at virtually the same time with an LBUG error being thrown. We updated our servers and clients from 2.1.2-3chaos to 2.1.2-4chaos within the past two days and had not experienced this issue with the previous release. Below is a sample of the console log from one of the servers. I have also captured all the console messages up until the system panicked and am attaching them.

      LustreError: 9044:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
      Pid: 9120, comm: ll_ost_io_341

      Call Trace:
      LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
      [<ffffffffa0440895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Pid: 9083, comm: ll_ost_io_304

          Activity

            laisiyao Lai Siyao added a comment -

            I did some tests but couldn't reproduce this failure, so I updated the 2.4/2.5 patches to debug patches that print the lock and request export addresses and crash as before, so that we can dump these two exports to help analyse.

            Could you apply the patch and try to reproduce it? If it crashes, please upload the crash dump.
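
            For reference, a minimal sketch of the kind of debug check described above: print the two export pointers before triggering the LBUG so the corresponding exports can be examined in the crash dump. Apart from the identifiers quoted in this ticket (lock->l_export, opd->opd_exp) and the standard libcfs CERROR/LBUG macros, the details here are illustrative assumptions, not the actual patch:

            if (lock->l_export != opd->opd_exp) {
                    /* Illustrative only: dump both export addresses so the
                     * struct obd_export objects can be located in the crash
                     * dump, then crash as before. */
                    CERROR("lock export %p != request export %p\n",
                           lock->l_export, opd->opd_exp);
                    LBUG();
            }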

            laisiyao Lai Siyao added a comment -

            I suspect some requests were not aborted upon eviction on the client, so they were resent through the new connection. I'll run some tests to find more proof.

            green Oleg Drokin added a comment -

            Question: if the client was evicted, the request resend should have failed because the server would reject it.
            So I think there's something else happening that your explanation does not quite cover.

            laisiyao Lai Siyao added a comment -

            Patches are ready:
            2.4: http://review.whamcloud.com/#/c/9925/
            2.5: http://review.whamcloud.com/#/c/9926/
            master: http://review.whamcloud.com/#/c/9927/
            laisiyao Lai Siyao added a comment -

            The backtrace shows the LBUG was hit on the first ost_prolong_lock_one() call in ost_prolong_locks(). IMO what happened is this:
            1. The client did IO with the lock handle in the request.
            2. The IO bulk transfer failed on the server, so no reply was sent to the client to let it resend.
            3. Lock cancellation timed out on the server, and the client was evicted.
            4. The client reconnected and resent the previous IO request; however, the lock handle was obsolete, so the LASSERT was triggered. (This lock should have been replayed, but the request was simply resent, and there is no way to update the lock handle in a resent request.)

            I'll provide a patch that checks lock->l_export against opd->opd_exp rather than asserting, for the first ost_prolong_lock_one().
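
            For illustration, a minimal sketch of the direction described above: replace the assertion with a check so that a resent request carrying an obsolete lock handle is ignored rather than crashing the OSS. The function and structure shapes here are assumptions for readability, not the actual change from the gerrit patches:

            static void ost_prolong_lock_one(struct ost_prolong_data *opd,
                                             struct ldlm_lock *lock)
            {
                    if (lock->l_export != opd->opd_exp) {
                            /* The request was resent over a new connection after
                             * eviction, so its lock handle refers to a lock owned
                             * by the stale export; skip it instead of LBUG. */
                            return;
                    }
                    /* ... existing logic: extend the lock's waiting-list timeout
                     * so it is not cancelled while the IO is in progress ... */
            }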


            paf Patrick Farrell (Inactive) added a comment -

            We replicated this on an OSS with +rpctrace enabled. I've provided the dump to Xyratex, and it will be uploaded here in about 5 minutes:
            ftp.whamcloud.com
            uploads/LELUS-2232/LELUS-234_LU-2232.1404081304.tar.gz


            paf Patrick Farrell (Inactive) added a comment -

            Thank you, Lai.

            Cray can provide a dump of an OST with this crash, if you think that would be helpful.

            We're currently trying to replicate it with D_RPCTRACE enabled, so if we get a dump with that, we can send that over as well. (We're working with Xyratex on this bug.)

            laisiyao Lai Siyao added a comment -

            Patrick, this was seen again in LU-4844; it's better to re-open this one and mark LU-4844 as a duplicate.


            paf Patrick Farrell (Inactive) added a comment -

            Lai, Jodi,

            I'm happy to see this bug re-opened, but I'm curious why? Was it seen again at Intel or a customer site? Was it re-opened because of the Cray report?


            paf Patrick Farrell (Inactive) added a comment - edited

            Cray just hit this twice in our 2.5, with SLES11SP3 clients on CentOS 6.4 servers.

            Both times, it was immediately preceded by a 'double eviction' of a particular client, two callback timers expiring within the same second for the same client:

            2014-02-01 13:44:35 LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 54@gni1 ns: filter-esfprod-OST0004_UUID lock: ffff8803fa547980/0xd460be1680c1b564 lrc: 3/0,0 mode: PW/PW res: [0x424b478:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 54@gni1 remote: 0xf4283d6d3305e0e5 expref: 2679 pid: 22955 timeout: 4994587934 lvb_type: 0
            2014-02-01 13:44:35 LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 54@gni1 ns: filter-esfprod-OST0004_UUID lock: ffff8802c669d100/0xd460be1680c1b580 lrc: 3/0,0 mode: PW/PW res: [0x424b479:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 54@gni1 remote: 0xf4283d6d3305e155 expref: 2679 pid: 22955 timeout: 4994587934 lvb_type: 0

            And the other instance of it (on a different system, also running Cray's 2.5):
            LustreError: 3489:0:(ldlm_lockd.c:433:ldlm_add_waiting_lock()) Skipped 1 previous similar message
            LustreError: 53:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 113s: evicting client at 62@gni
            ns: filter-scratch-OST0001_UUID lock: ffff880445551180/0xa9b58315ab2d4d0a lrc: 3/0,0 mode: PW/PW res: [0xb5da7e1:0x0:0x0].0 rrc:
            2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 62@gni remote: 0x330a8b1d9fd47938
            expref: 1024 pid: 16057 timeout: 4781830324 lvb_type: 0

            LustreError: 53:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 113s: evicting client at 62@gni
            ns: filter-scratch-OST0001_UUID lock: ffff880387977bc0/0xa9b58315ab2d4d6c lrc: 3/0,0 mode: PW/PW res: [0xb5da7e6:0x0:0x0].0 rrc:
            2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 62@gni remote: 0x330a8b1d9fd47bfb
            expref: 1030 pid: 16082 timeout: 4781830324 lvb_type: 0

            I'm just speculating, but I suspect this is key.

            For some reason, the nid_stats struct pointer in the obd_export is zero, so I wasn't able to confirm if this export was actually the double-evicted client. (Perhaps this is always the case?)

            [Edited to clarify server and client versions. Also, we've seen this several times since. Not sure why we're suddenly seeing it now, as indications are the underlying bug has been in the code for some time.]


            shadow Alexey Lyashkov added a comment -

            Xyratex hit this bug on its own branch.


            People

              Assignee: laisiyao Lai Siyao
              Reporter: jamervi Joe Mervini
              Votes: 0
              Watchers: 18
