[LU-5599] Lustre Error: Impossible state: 4 Created: 09/Sep/14  Updated: 24/Jul/18  Resolved: 24/Jul/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Brian Bernel (Inactive) Assignee: Bob Glossman (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Fedora 19 x86_64 on Washington Pass nodes, 1GbE & FDR IB


Issue Links:
Related
is related to LU-5492 osc_lock.c:1165:osc_lock_enqueue()) A... Resolved
is related to LU-5910 LustreError: osc_lock_enqueue() Impo... Resolved
Epic/Theme: Kernel, Lustre-2.4, Panic
Severity: 3
Project: Fast Forward
Rank (Obsolete): 15656

 Description   

Lustre Version: 2.4.52
Kernel: patchless_client
Build: v2_4_52 0-gfdd4844-CHANGED-3.9.9-302.fc19.x86_64

2nd Instance:
-----------------
From: Cledat, Romain E
Sent: Monday, September 08, 2014 4:34 PM
To: Bernel, BrianX D
Subject: Error

Message from syslogd@bar1 at Sep 8 16:08:51 ...
kernel:[1235195.162972] LustreError: 85031:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4
Message from syslogd@bar1 at Sep 8 16:08:51 ...
kernel:[1235195.193211] LustreError: 85031:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG

1st Instance:
-----------------
From: Nickerson, Brian R
Sent: Thursday, August 14, 2014 3:44 PM
To: Bernel, BrianX D; Cledat, Romain E
Subject: Kernel crash details

Message from syslogd@bar4 at Aug 14 15:34:57 ...
kernel:[1216856.270451] LustreError: 42598:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4

Message from syslogd@bar4 at Aug 14 15:34:57 ...
kernel:[1216856.271008] LustreError: 42598:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG

Message from syslogd@bar4 at Aug 14 15:34:57 ...
kernel:[1216856.271830] Kernel panic - not syncing: LBUG
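
For context on what is failing here: the message corresponds to an LASSERTF() at lustre/osc/osc_lock.c:1129 inside osc_lock_enqueue(). A minimal sketch of the check, assuming the 2.4-era osc_lock_state enum from osc_cl_internal.h (paraphrased for illustration, not copied from this exact build; the enum ordering and function signature should be verified against the tree that produced v2_4_52):

    /* Sketch, not verbatim source; enum layout assumed from Lustre 2.4. */
    enum osc_lock_state {
            OLS_NEW,             /* 0: freshly created, not yet enqueued */
            OLS_ENQUEUED,        /* 1: enqueue RPC sent to the OST       */
            OLS_UPCALL_RECEIVED, /* 2: completion upcall received        */
            OLS_GRANTED,         /* 3: DLM lock granted                  */
            OLS_RELEASED,        /* 4: lock released -- the "Impossible
                                  *    state: 4" above, if this layout
                                  *    matches the build                 */
            OLS_BLOCKED,         /* 5: blocking AST received             */
            OLS_CANCELLED        /* 6: lock cancelled                    */
    };

    static int osc_lock_enqueue(const struct lu_env *env,
                                const struct cl_lock_slice *slice,
                                struct cl_io *unused, __u32 enqflags)
    {
            struct osc_lock *ols = cl2osc_lock(slice);

            /* Enqueue is only legal on a brand-new lock; any other state
             * means the cl_lock state machine was violated, so the client
             * deliberately LBUGs rather than risk cache corruption. */
            LASSERTF(ols->ols_state == OLS_NEW,
                     "Impossible state: %d\n", ols->ols_state);
            /* ... enqueue proceeds, moving the lock to OLS_ENQUEUED ... */
            return 0;
    }

Which name state 4 actually carries depends on the enum in the build in question, which is one more reason the exact source origin asked about in the comments below matters.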



 Comments   
Comment by Cliff White (Inactive) [ 09/Sep/14 ]

There should be a stack dump to go along with the ASSERTION - can you please acquire it and attach it to the ticket? In addition, it would be useful to have logs from some time before the actual assertion - are there any LustreErrors prior to this one?

Comment by Peter Jones [ 10/Sep/14 ]

Bob

Could you please help with this issue?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 10/Sep/14 ]

The reported Lustre version is "Lustre Version: 2.4.52". This suggests it was built from source or derived from a review build in between our standard release versions, which have names like 2.4.2 or 2.4.3. Could we get details about how this Lustre was generated or obtained? Knowing the exact origin is very important to help us understand the problem.

Comment by Bob Glossman (Inactive) [ 10/Sep/14 ]

This may not be known, but I wonder whether any particular application or load was running near the time of the two reported panic instances.

Comment by Brian Bernel (Inactive) [ 10/Sep/14 ]

Hi Bob,

Thanks so much for your help on this. The Lustre kernel for X-Stack was put together by Gabrielle Paciucci. …As for what might have been going on at the time of the problem, the only thing I have to go on at this point is that git was involved. No kdump available, I’m afraid. A cursory look at the logs doesn’t show anything glaringly amiss, but it does corroborate that Gerrit/git was in play in the minute leading up to the Lustre error.

(Thanks Josh, for directing me to reply by comment vs email)

Regards,

Brian

Comment by Bob Glossman (Inactive) [ 10/Sep/14 ]

Brian,
The stack dumps Cliff asked about a few comments ago are still needed. Any chance of getting those?

Comment by Keith Mannthey (Inactive) [ 02/Oct/14 ]

We do not need a kernel dump to get this information.

Can you upload /var/log/messages or "dmesg" right after this issue is hit?

Comment by Brian Bernel (Inactive) [ 10/Nov/14 ]

Recent Kernel Dump information from bar1 on X-Stack:

[ 3.911516] hpgmg-fv[96927]: segfault at d ip 00007fd22c315d3e sp 00007fd20796bd80 error 6 in libocr.so[7fd22c300000+2e000]
[ 3.252141] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
[ 3.828093] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
[ 4.189668] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
[ 4.197074] LustreError: Skipped 1 previous similar message
[ 0.136898] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
[ 3.395164] hpgmg-fv[5886]: segfault at 7f91ecb51d19 ip 00007f91ecb48d3e sp 00007f91450c0d10 error 7 in libocr.so[7f91ecb33000+2e000]
[ 1.541619] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
[ 1.549642] LustreError: Skipped 1 previous similar message
[ 0.870999] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
[ 1.648428] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
[ 0.486506] hpgmg-fv[7547]: segfault at 7f5a27153d19 ip 00007f5a2714ad3e sp 00007f5978fb3d10 error 7 in libocr.so[7f5a27135000+2e000]
[ 0.824354] hpgmg-fv[11206]: segfault at d ip 00007f4f43557d3e sp 00007f4e9bacfd10 error 6 in libocr.so[7f4f43542000+2e000]
[ 3.258284] hpgmg-fv[11745]: segfault at d ip 00007fb77412ad3e sp 00007fb6cd6a4d10 error 6 in libocr.so[7fb774115000+2e000]
[ 2.546239] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118057617399716 mdc close failed: rc = -13
[ 2.559492] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144117957591658911 mdc close failed: rc = -13
[ 1.467034] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052861374 mdc close failed: rc = -13
[ 1.476398] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 40 previous similar messages
[ 2.468531] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052858204 mdc close failed: rc = -13
[ 2.478511] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 246 previous similar messages
[ 0.176572] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052850460 mdc close failed: rc = -13
[ 0.186522] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 473 previous similar messages
[ 0.030234] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118024767594400 mdc close failed: rc = -13
[ 0.039727] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 39 previous similar messages
[ 2.904305] hpgmg-fv invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[ 2.904309] hpgmg-fv cpuset=/ mems_allowed=0-1
[ 2.904311] Pid: 11573, comm: hpgmg-fv Tainted: GF O 3.9.9-302.fc19.x86_64 #1
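
An aside for anyone decoding the excerpt above: the rc values in these console lines are negated Linux errno codes, so "ldlm_enqueue failed with -11" is -EAGAIN (resource temporarily unavailable, normally a retryable enqueue) and "mdc close failed: rc = -13" is -EACCES (permission denied). A tiny standalone decoder using standard errno numbering (nothing Lustre-specific is assumed):

    #include <stdio.h>
    #include <string.h>

    /* Print the errno text for the negated rc values seen in the log. */
    int main(void)
    {
            const int rcs[] = { -11 /* ldlm_enqueue */, -13 /* mdc close */ };
            size_t i;

            for (i = 0; i < sizeof(rcs) / sizeof(rcs[0]); i++)
                    printf("rc = %d -> %s\n", rcs[i], strerror(-rcs[i]));
            return 0;
    }

Neither code is fatal on its own; these lines are context for the load on the client, not the LBUG itself.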

Comment by Brian Bernel (Inactive) [ 10/Nov/14 ]

Related correspondence and screen errors:

Yep, Vincent confirms that he was doing a checkout of the repo at the time… Definitely GIT + Lustre. And now they have a full kernel dump.

Romain

From: Romain Cledat <romain.e.cledat@intel.com>
Date: Saturday, November 8, 2014 at 12:19 PM
To: "Bernel, BrianX D" <brianx.d.bernel@intel.com>
Subject: Your first kdump

Hello,

It seems bar1 rebooted by itself some 7h ago. After some investigation, I think it crashed and was rebooted due to kdump (yeah).

Nov 8 04:51:12 bar1 systemd-logind[748]: New session 17884 of user vincentc.
Nov 7 22:01:56 bar1 rsyslogd: [origin software="rsyslogd" swVersion="7.2.6" x-pid="737" x-info="http://www.rsyslog.com"] start

I think the Nov 7 date is because the clock gets reset to a bad value when rebooting. At the end of the reboot, you have:

Nov 7 22:03:11 bar1 systemd[1]: Startup finished in 2.269s (kernel) + 3.678s (initrd) + 1min 19.515s (userspace) = 1min 25.463s.
Nov 7 22:03:11 bar1 abrt-server[2784]: No actions are found for event 'notify'
Nov 8 05:03:23 bar1 chronyd[717]: Selected source 149.20.68.17
Nov 8 05:03:23 bar1 chronyd[717]: System clock wrong by 25198.383400 seconds, adjustment started
Nov 8 05:03:23 bar1 chronyd[717]: System clock was stepped by 25198.383 seconds
Nov 8 05:03:23 bar1 systemd[1]: Time has been changed

So the machine seems to have been down between 4h51 and 5h03. And sure enough there is something in /var/crash. Relevant lines:
[ 2.431054] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4
[ 2.441581] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG
[ 2.452016] Pid: 29800, comm: git
[ 2.452017] Call Trace:

What do you know, it happens to be Lustre again.

Romain

PS: I am asking Vincent to confirm it was him and it was git. 5 in the morning though…
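
For reference on why the node ends up with a dump in /var/crash at all: a failed LASSERTF() calls LBUG(), which in libcfs logs the "LBUG" console line, dumps a stack trace, and then either panics the machine (when the libcfs panic_on_lbug parameter is set) or parks the offending thread forever. A rough sketch of that flow, paraphrased from 2.4-era libcfs (not verbatim; see libcfs/libcfs/linux/linux-debug.c in the tree in use):

    /* Paraphrased sketch of lbug_with_loc(); not verbatim source. */
    void lbug_with_loc(struct libcfs_debug_msg_data *msgdata)
    {
            libcfs_catastrophe = 1;
            libcfs_debug_msg(msgdata, "LBUG\n"); /* the "LBUG" line above  */

            if (in_interrupt())
                    panic("LBUG in interrupt.\n");

            dump_stack();                        /* the trace Cliff asked for */
            if (!libcfs_panic_on_lbug)
                    libcfs_debug_dumplog();      /* flush Lustre debug log */

            if (libcfs_panic_on_lbug)
                    panic("LBUG");               /* "Kernel panic - not
                                                  * syncing: LBUG"; kdump
                                                  * then writes /var/crash */

            while (1)                            /* otherwise hang forever */
                    schedule();
    }

This is the sequence visible in the logs above: the ASSERTION line, the LBUG line, and (where panic_on_lbug is set) the panic that kdump turned into the crash dump Romain found.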

Comment by Jay Lan (Inactive) [ 10/Nov/14 ]

NASA Ames hit this bug this morning on its 2.4.3-7nasC client running the SLES11 SP3 kernel.

Comment by Zhenyu Xu [ 11/Nov/14 ]

This looks like LU-3433; the patch for 2.4 is here: http://review.whamcloud.com/#/c/9194/. We had also better land the LU-4558 patch for 2.4: http://review.whamcloud.com/#/c/9876/

2.5 already has them.

Comment by Jay Lan (Inactive) [ 11/Nov/14 ]

We had both of them in our 2.4.3-7nasC.

Comment by Zhenyu Xu [ 11/Nov/14 ]

Can you try applying these patches?

http://review.whamcloud.com/#/c/8530/
http://review.whamcloud.com/#/c/9783/
http://review.whamcloud.com/#/c/9194/
http://review.whamcloud.com/#/c/9876/

Comment by Peter Jones [ 11/Nov/14 ]

Jay

I think that it makes sense to open a separate ticket for the NASA issue, because the environments are really quite different and the root cause could be different.

Peter

Comment by Jay Lan (Inactive) [ 11/Nov/14 ]

Peter,

Will do.

Jay

Comment by Brian Bernel (Inactive) [ 18/May/16 ]

Is there a set of instructions somewhere on how to apply these patches?

Comment by Peter Jones [ 24/Jul/18 ]

Out of date ticket
