Details
Bug
Resolution: Cannot Reproduce
Major
None
Lustre 2.4.3
None
Fedora 19 x86_64 on Washington Pass nodes, 1GbE & FDR IB
3
Fast Forward
15656
Description
Lustre Version: 2.4.52
Kernel: patchless_client
Build: v2_4_52 0-gfdd4844-CHANGED-3.9.9-302.fc19.x86_64
2nd Instance:
-----------------
From: Cledat, Romain E
Sent: Monday, September 08, 2014 4:34 PM
To: Bernel, BrianX D
Subject: Error
Message from syslogd@bar1 at Sep 8 16:08:51 ...
kernel:[1235195.162972] LustreError: 85031:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4 r
Message from syslogd@bar1 at Sep 8 16:08:51 ...
kernel:[1235195.193211] LustreError: 85031:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG
1st Instance:
-----------------
From: Nickerson, Brian R
Sent: Thursday, August 14, 2014 3:44 PM
To: Bernel, BrianX D; Cledat, Romain E
Subject: Kernel crash details
Message from syslogd@bar4 at Aug 14 15:34:57 ...
kernel:[1216856.270451] LustreError: 42598:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4
Message from syslogd@bar4 at Aug 14 15:34:57 ...
kernel:[1216856.271008] LustreError: 42598:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG
Message from syslogd@bar4 at Aug 14 15:34:57 ...
kernel:[1216856.271830] Kernel panic - not syncing: LBUG
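For comparing the instances, the console messages can be split into fields. The following is a minimal Python sketch, not part of the original report: the pattern is inferred only from the lines quoted here, the names `LUSTRE_ERROR` and `parse_lustre_error` are hypothetical, and the second number after the PID is skipped because its meaning is not stated in this ticket.

```python
import re

# Pattern inferred from the messages quoted in this ticket (an assumption):
#   LustreError: <pid>:<n>:(<file>:<line>:<function>()) <message>
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<message>.*)"
)


def parse_lustre_error(text):
    """Return the fields of one LustreError line, or None if it does not match."""
    match = LUSTRE_ERROR.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


# Assertion line from the 1st instance (bar4, Aug 14).
sample = (
    "kernel:[1216856.270451] LustreError: 42598:0:"
    "(osc_lock.c:1129:osc_lock_enqueue()) "
    "ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4"
)
print(parse_lustre_error(sample))
```

Run against both instances, this should show the same file, line, and function (`osc_lock.c`, 1129, `osc_lock_enqueue`) with different PIDs.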
Related correspondence and screen errors:
Yep, Vincent confirms that he was doing a checkout of the repo at the time… Definitely GIT + Lustre. And now they have a full kernel dump.
Romain
From: Romain Cledat <romain.e.cledat@intel.com>
Date: Saturday, November 8, 2014 at 12:19 PM
To: "Bernel, BrianX D" <brianx.d.bernel@intel.com>
Subject: Your first kdump
Hello,
It seems bar1 rebooted by itself some 7h ago. After some investigation, I think it crashed and was rebooted due to kdump (yeah).
Nov 8 04:51:12 bar1 systemd-logind[748]: New session 17884 of user vincentc.
Nov 7 22:01:56 bar1 rsyslogd: [origin software="rsyslogd" swVersion="7.2.6" x-pid="737" x-info="http://www.rsyslog.com"] start
I think the Nov 7 date is because the clock gets reset to a bad value when rebooting. At the end of the reboot, you have:
Nov 7 22:03:11 bar1 systemd[1]: Startup finished in 2.269s (kernel) + 3.678s (initrd) + 1min 19.515s (userspace) = 1min 25.463s.
Nov 7 22:03:11 bar1 abrt-server[2784]: No actions are found for event 'notify'
Nov 8 05:03:23 bar1 chronyd[717]: Selected source 149.20.68.17
Nov 8 05:03:23 bar1 chronyd[717]: System clock wrong by 25198.383400 seconds, adjustment started
Nov 8 05:03:23 bar1 chronyd[717]: System clock was stepped by 25198.383 seconds
Nov 8 05:03:23 bar1 systemd[1]: Time has been changed
So the machine seems to have been down between 4h51 and 5h03. And sure enough there is something in /var/crash. Relevant lines:
[ 2.431054] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4
[ 2.441581] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG
[ 2.452016] Pid: 29800, comm: git
[ 2.452017] \x0aCall Trace:
What do you know, it happens to be Lustre again.
Romain
PS: I am asking Vincent to confirm it was him and it was git. 5 in the morning though…
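The downtime estimate in the mail above can be checked by applying the chronyd step (25198.383 seconds, just under 7 hours) to the "Nov 7" timestamps logged before the clock was corrected. The following is a rough Python sketch, not part of the original correspondence: the year 2014 is taken from the mail's Date header, the helper name `corrected` is hypothetical, and parsing the month abbreviation assumes an English locale.

```python
from datetime import datetime, timedelta

# Step reported by chronyd in the syslog excerpt.
CLOCK_STEP = timedelta(seconds=25198.383)


def corrected(stamp, year=2014):
    """Shift a pre-step syslog timestamp such as 'Nov 7 22:01:56' by the chronyd step.

    Syslog lines carry no year; 2014 is assumed from the mail's Date header.
    """
    parsed = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return parsed + CLOCK_STEP


print("clock step:       ", CLOCK_STEP)
print("rsyslogd start:   ", corrected("Nov 7 22:01:56"))
print("startup finished: ", corrected("Nov 7 22:03:11"))
```

With the step applied, the rsyslogd start lands at about 05:01:54 on Nov 8, roughly ten minutes after the last pre-crash entry at 04:51:12. That is consistent with the 4h51 to 5h03 window estimated in the mail.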