Details
-
Bug
-
Resolution: Duplicate
-
Blocker
-
None
-
Lustre 2.4.0
-
Sequoia client, lustre 2.3.54-2chaos, github.com/chaos/lustre. Servers were running lustre 2.3.54-6chaos.
-
3
-
5563
Description
While running a 98,304-task ior job, one of the Lustre clients (a Sequoia I/O Node) hit this:
2012-11-14 15:26:41.248076 {DefaultControlEventListener} [mmcs]{692}.1.2: Lustre: LOCK UP! the lock c00000039550af80 was acquired by <ptlrpcd_49:3330:brw_interpret:1998> 502 time, I'm ptlrpcd_7:3288
2012-11-14 15:26:41.287858 {DefaultControlEventListener} [mmcs]{692}.1.2: Lustre: LOCK UP! the lock c00000039550af80 was acquired by <ptlrpcd_49:3330:brw_interpret:1998> 502 time, I'm sysiod:3752
I believe tasks were then stuck in read(). sysiod is the process on the I/O Node that belongs to the I/O forwarding system; it performs I/O on behalf of an ior process running on a Sequoia compute node.
The attached file "seqio685_console.txt" shows more of the console output when the problem hit. "seqio685_lustre_log.txt" contains the "lctl dk" output. "seqio685_backtraces.txt" contains the output of sysrq "l" and sysrq "t".
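For anyone triaging a console log like this one, a small script can group the "LOCK UP!" messages by lock address to show which thread holds each lock and which threads are spinning on it. This is only a minimal sketch: it assumes every message follows the format of the two lines quoted above, and the regex and the function name summarize_lockups are my own, not part of Lustre or its tools.

```python
import re
from collections import defaultdict

# Pattern derived from the two console lines quoted in this ticket (an assumption,
# not an official format). "ti ?me" tolerates the console wrapping the word "time".
LOCKUP_RE = re.compile(
    r"LOCK UP! the lock (?P<lock>[0-9a-f]+) was acquired by "
    r"<(?P<holder>[^:>]+):(?P<pid>\d+):(?P<func>[^:>]+):(?P<line>\d+)> "
    r"(?P<count>\d+) ti ?me, I'm (?P<waiter>[^:\s]+):(?P<waiter_pid>\d+)"
)


def summarize_lockups(text):
    """Group LOCK UP messages by lock address (hypothetical helper)."""
    summary = defaultdict(lambda: {"holders": set(), "waiters": set()})
    for m in LOCKUP_RE.finditer(text):
        entry = summary[m["lock"]]
        # Holder: thread name, pid, and the function/line that took the lock.
        entry["holders"].add(
            (m["holder"], int(m["pid"]), m["func"], int(m["line"]))
        )
        # Waiter: the thread that printed the message while spinning.
        entry["waiters"].add((m["waiter"], int(m["waiter_pid"])))
    return dict(summary)


if __name__ == "__main__":
    import sys

    # Usage: python lockups.py seqio685_console.txt
    with open(sys.argv[1], errors="replace") as f:
        for lock, info in summarize_lockups(f.read()).items():
            print(lock, "held by", sorted(info["holders"]))
            print("  waiters:", sorted(info["waiters"]))
```

On the two lines quoted in the description, this reports a single lock, c00000039550af80, held by ptlrpcd_49 (taken in brw_interpret) with ptlrpcd_7 and sysiod waiting on it.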