[LU-1702] LustreError: 3218:0: (mdt_open.c:1035:mdt_reconstruct_open()) LBUG Created: 02/Aug/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Michael Di Domenico Assignee: Bob Glossman (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Red Hat 6.2 x86_64 with InfiniBand


Severity: 3
Rank (Obsolete): 4058

 Description   

Message from syslogd@metal at Aug 1 09:46:26 ...
kernel: LustreError: 3218:0:(mdt_open.c:1035:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:

Message from syslogd@metal at Aug 1 09:46:26 ...
kernel: LustreError: 3218:0:(mdt_open.c:1035:mdt_reconstruct_open()) LBUG

Message from syslogd@metal at Aug 1 09:46:26 ...
kernel: Kernel panic - not syncing: LBUG@o2ib:0/0 lens 192/0 e 0 to 0 dl 1343828771 ref 1 fl Interpret:H/0/ffffffff rc 0/

Aug 1 09:46:26 metal kernel: LustreError: 3218:0:(mdt_open.c:1035:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:

Message from syslogd@metal at Aug 1 09:46:26 ...
kernel: LustreError: 3218:0:(mdt_open.c:1035:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:

Message from syslogd@metal at Aug 1 09:46:26 ...
kernel: LustreError: 3218:0:(mdt_open.c:1035:mdt_reconstruct_open()) LBUG

Aug 1 09:46:26 metal kernel: LustreError: 3218:0:(mdt_open.c:1035:mdt_reconstruct_open()) LBUG
Aug 1 09:46:26 metal kernel: Pid: 3218, comm: mdt_24
Aug 1 09:46:26 metal kernel:
Aug 1 09:46:26 metal kernel: Call Trace:
Aug 1 09:46:26 metal kernel: [<ffffffffa0422835>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 1 09:46:26 metal kernel: [<ffffffffa0422d67>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d6e60b>] mdt_reconstruct_open+0x63b/0x8c0 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d620fb>] mdt_reconstruct+0x4b/0xb0 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d511f9>] mdt_reint_internal+0x609/0x7b0 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d515d5>] mdt_intent_reint+0x185/0x4a0 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d50361>] mdt_intent_policy+0x2d1/0x600 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0668e39>] ldlm_lock_enqueue+0x2f9/0x830 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffffa0689ef0>] ldlm_handle_enqueue0+0x420/0xd90 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d506d6>] mdt_enqueue+0x46/0x130 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d47b9d>] mdt_handle_common+0x74d/0x1400 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d48925>] mdt_regular_handle+0x15/0x20 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa06b1011>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffffa04233ee>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Aug 1 09:46:26 metal kernel: [<ffffffffa042de19>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
Aug 1 09:46:26 metal kernel: [<ffffffffa06ab0e2>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
Aug 1 09:46:26 metal kernel: [<ffffffffa06b201f>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffffa06b1900>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Aug 1 09:46:26 metal kernel: [<ffffffffa06b1900>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffffa06b1900>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Aug 1 09:46:26 metal kernel:
Aug 1 09:46:26 metal kernel: Kernel panic - not syncing: LBUG

Message from syslogd@metal at Aug 1 09:46:26 ...
kernel: Kernel panic - not syncing: LBUG
Aug 1 09:46:26 metal kernel: Pid: 3218, comm: mdt_24 Not tainted 2.6.32-220.4.2.el6.x86_64 #1
Aug 1 09:46:26 metal kernel: Call Trace:
Aug 1 09:46:26 metal kernel: [<ffffffff814ec61a>] ? panic+0x78/0x143
Aug 1 09:46:26 metal kernel: [<ffffffffa0422dbb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d6e60b>] ? mdt_reconstruct_open+0x63b/0x8c0 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d620fb>] ? mdt_reconstruct+0x4b/0xb0 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d511f9>] ? mdt_reint_internal+0x609/0x7b0 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d515d5>] ? mdt_intent_reint+0x185/0x4a0 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d50361>] ? mdt_intent_policy+0x2d1/0x600 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0668e39>] ? ldlm_lock_enqueue+0x2f9/0x830 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffffa0689ef0>] ? ldlm_handle_enqueue0+0x420/0xd90 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d506d6>] ? mdt_enqueue+0x46/0x130 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d47b9d>] ? mdt_handle_common+0x74d/0x1400 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa0d48925>] ? mdt_regular_handle+0x15/0x20 [mdt]
Aug 1 09:46:26 metal kernel: [<ffffffffa06b1011>] ? ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
Aug 1 09:46:26 metal kernel: [<ffffffffa04233ee>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Aug 1 09:46:26 metal kernel: [<ffffffffa042de19>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
Aug 1 09:46:26 metal kernel: [<ffffffffa06ab0e2>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
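
For reference, the failed check is readable from the console output itself: mdt_reconstruct_open() asserts that, while rebuilding the reply for a resent open request, a negative result code is only ever paired with a zero transno in the reply message. A minimal paraphrase of the assertion (reconstructed from the log above, not quoted from the Lustre 2.2 source):

  /* mdt_open.c, mdt_reconstruct_open(): if the reconstructed open
   * failed (rc < 0), the reply transno must still be zero. */
  LASSERT(!(rc < 0) || lustre_msg_get_transno(req->rq_repmsg) == 0);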



 Comments   
Comment by Peter Jones [ 07/Aug/12 ]

Bob will look into this one

Comment by Bob Glossman (Inactive) [ 07/Aug/12 ]

Michael,
I'd really like to know a little more about this problem. Do you have a reproducer? Was it a one-off, or do you keep seeing it? Are there syslogs, Lustre debug logs, or dmesg output from around the time of the LBUG that you could provide?

Comment by Michael Di Domenico [ 07/Aug/12 ]

The bug seems to be related to load on the machine, which in our case is caused by heavy scanning of the filesystem. I don't have an exact reproducer, but the machine has crashed several times during heavy IO periods.

I'm not able to pull syslog/dmesg entries en masse from the system. If there's something specific you're looking for, I can search around. I do have Lustre logs in /tmp, but because they're binary, I am not able to remove them from the system. If you have commands that will let me extract the data you need from the logs and print it, I can scan it back in and send it over.

Comment by Bob Glossman (Inactive) [ 07/Aug/12 ]
  • I do have Lustre logs in /tmp, but because they're binary, I am not able to remove them from the system.

If the only thing stopping you from sending us Lustre logs is that they are binary, you can convert them to human-readable text with 'lctl df'. Would that permit you to send them?
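
For example (the dump file names below are hypothetical), 'lctl df' decodes a binary Lustre debug dump into plain text:

  lctl df /tmp/lustre-log.bin /tmp/lustre-log.txt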

Comment by Michael Di Domenico [ 08/Aug/12 ]

Yes, converting them to ASCII definitely helps. However, the excerpt of lines from around the kernel panic is 155k, so I'll need to pare that down before I can pull the data from the system. Can you tell me if there are recurring lines that I can grep out? I'll ask for the whole file, but I suspect it'll get declined, since the result is currently 37MB.
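
As a possible starting point for trimming the excerpt (file names hypothetical, and assuming the dump has already been decoded with 'lctl df' as above), keeping only the context around the assertion would cut the size considerably:

  grep -C 100 'mdt_reconstruct_open' /tmp/lustre-log.txt > /tmp/lbug-excerpt.txt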

Comment by Colin Faber [X] (Inactive) [ 29/Aug/12 ]

Hi,

We've experienced this as well and believe it to be related more to recovery issues than to general load.

I'll try to get some more data together and post it here so this ticket is more complete.

-cf

Comment by James Beal [ 26/Sep/14 ]

I have seen this on 2.2.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
