[LU-3996]  LustreError: 8136:0:(llog_osd.c:241:llog_osd_read_header()) MGS-osd: error reading log header from [0xa:0xa:0x0]: rc = -14 Created: 23/Sep/13  Updated: 16/Oct/13  Resolved: 16/Oct/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Mikhail Pershin
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File agb5.crash.23.sept.2013     Text File agb5.crash2.26Sept2013.txt     File config.log.tar.gz     File config.log.tar.gz     Text File console.txt     File dk.log.agb5.26Sept.2013.txt.gz     File dumplog.115459.gz     File dumplog.115529.gz     File dumplog.115600.gz     Text File full.dumplogs.09302013.tar.gz    
Issue Links:
Duplicate
is duplicated by LU-3871 e2fsck reports inode reference count ... Resolved
Severity: 3
Rank (Obsolete): 10693

 Description   

Running parallel-scale on Hyperion, during initialization for the IOR test. Filesystem formatted with ZFS.
MDS crashes during test startup.

LustreError: 8136:0:(llog_osd.c:241:llog_osd_read_header()) MGS-osd: error reading log header from [0xa:0xa:0x0]: rc = -14
2013-09-23 11:32:12 LustreError: 8136:0:(mgs_llog.c:1386:record_start_log()) MGS: can't start log lustre-params: rc = -14
2013-09-23 11:32:12 BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8

Console log attached.
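
For readers less used to kernel-style return codes, the rc = -14 in the messages above is a negated errno value. The following is a minimal sketch, assuming log lines shaped like the ones quoted here; the regex and the helper name decode_rc are illustrative and not part of any Lustre tooling.

```python
import errno
import os
import re

# Matches the trailing "rc = -14" seen in the LustreError lines above.
RC_PATTERN = re.compile(r"rc = (-?\d+)\s*$")


def decode_rc(line):
    """Return (rc, errno_name, description), or None if the line has no rc."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)
    # errno.errorcode maps numbers to symbolic names for the local platform.
    return rc, errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


line = ("LustreError: 8136:0:(mgs_llog.c:1386:record_start_log()) "
        "MGS: can't start log lustre-params: rc = -14")
print(decode_rc(line))
```

On Linux, errno 14 is EFAULT, so the name reported for this line is EFAULT.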



 Comments   
Comment by Oleg Drokin [ 24/Sep/13 ]

There's not enough information to see why llog init fails with EFAULT, but the crash reason is obvious: we fail to check the llog open status and, as a result, try to close an llog that was never opened.

Patch for the crash is in http://review.whamcloud.com/7742
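
The failure mode described in this comment (closing a handle that was never opened) can be illustrated with a small sketch. This is not the code from the patch; FakeLlog, open_llog and start_log are hypothetical names used only to show the guard pattern of checking the open result before any cleanup.

```python
import errno


class FakeLlog:
    """Hypothetical stand-in for an llog handle; not a Lustre API."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


def open_llog(name, header_ok):
    """Return (rc, handle); on failure rc is a negative errno and handle is None."""
    if not header_ok:
        return -errno.EFAULT, None
    return 0, FakeLlog(name)


def start_log(name, header_ok):
    rc, handle = open_llog(name, header_ok)
    if rc != 0:
        # Nothing was opened, so there is nothing to close.
        # Calling handle.close() here would dereference None.
        return rc
    try:
        # Records would be written to the log here.
        return 0
    finally:
        handle.close()


print(start_log("lustre-params", header_ok=False))  # -14
print(start_log("lustre-params", header_ok=True))   # 0
```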

Comment by Jodi Levi (Inactive) [ 25/Sep/13 ]

Cliff to provide the additional debug logs.
Will reassign this ticket once that info has been provided.

Comment by Cliff White (Inactive) [ 26/Sep/13 ]

Reproduced crash on 2.4.93 with panic_on_oops=0. Console log and lctl dk attached.

Comment by Oleg Drokin [ 27/Sep/13 ]

Hm, the lctl dk output is too late after the crash: it starts at 11:11am and ends at 11:14am on the 26th, and the oops was at 11:10.
lctl dk needs to be run right after the crash (or you need to increase the debug log buffer to a bigger value so it does not wrap so quickly).
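
The timing problem can be checked mechanically before attaching a dump. A minimal sketch, assuming the start and end times of the dump and the time of the oops are already known; the helper name dump_covers_event is illustrative, and the year is taken from the ticket dates.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def dump_covers_event(dump_start, dump_end, event):
    """True if the event time falls inside the window covered by the dump."""
    return dump_start <= event <= dump_end


# Times quoted in the comment above.
dump_start = datetime.strptime("2013-09-26 11:11", FMT)
dump_end = datetime.strptime("2013-09-26 11:14", FMT)
oops = datetime.strptime("2013-09-26 11:10", FMT)

print(dump_covers_event(dump_start, dump_end, oops))  # False: the dump starts after the oops
```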

Comment by Cliff White (Inactive) [ 27/Sep/13 ]

dumplog.115529 contains the error. I have included the dumps from 30 seconds before and 30 seconds after.

Comment by Oleg Drokin [ 27/Sep/13 ]

Thanks!

Ok, so the issue is that we are trying to do an 8k read and can only read a smaller number of bytes (not enough logging to see how many) from the referenced llog file.

Right now it sounds like the llog on the underlying MGS filesystem is damaged. We should mount it directly and check what's up with the llog file for [0xa:0xa:0x0]; it's probably way too short?
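
The short-read condition described here can be sketched as follows. The 8192-byte header size is taken from the "8k read" above, and mapping a short read to -EFAULT mirrors the rc = -14 in the logs; both are assumptions for illustration, and read_header is a hypothetical helper, not the Lustre function.

```python
import errno
import io

HEADER_SIZE = 8192  # assumed from the "8k read" mentioned above


def read_header(stream):
    """Return (rc, bytes_read); a short read is reported as -EFAULT."""
    data = stream.read(HEADER_SIZE)
    if len(data) != HEADER_SIZE:
        # Also report how many bytes were actually read, which is the
        # detail the existing logging does not show.
        return -errno.EFAULT, len(data)
    return 0, len(data)


print(read_header(io.BytesIO(b"\0" * 100)))          # (-14, 100)
print(read_header(io.BytesIO(b"\0" * HEADER_SIZE)))  # (0, 8192)
```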

Comment by Cliff White (Inactive) [ 30/Sep/13 ]

Each time this has failed, it has been on a freshly formatted filesystem. I can replicate and look at the log if that is necessary.

Comment by Oleg Drokin [ 30/Sep/13 ]

There's a fair chance the llog is created badly at format time; if that is really the case, there's no log for this process.
So taking a look at the llog (making sure it is short after the failure), and then checking it right after reformat, even before we mount Lustre, to confirm would be useful. Then we also need to see exactly how malformed it is: just a short size, what's inside, and so on.
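
The size check suggested here could be scripted along these lines. This is a sketch under assumptions: the MGS target's CONFIGS directory is reachable at some local path, a healthy llog file is at least 8192 bytes (the "8k" header read mentioned earlier in this ticket), and find_short_logs is a hypothetical helper. The demo runs against a temporary directory with made-up file contents.

```python
import os
import tempfile

MIN_SIZE = 8192  # assumed minimum: one 8k llog header


def find_short_logs(configs_dir, min_size=MIN_SIZE):
    """Return (name, size) for every regular file smaller than min_size."""
    short = []
    for name in sorted(os.listdir(configs_dir)):
        path = os.path.join(configs_dir, name)
        if os.path.isfile(path):
            size = os.path.getsize(path)
            if size < min_size:
                short.append((name, size))
    return short


# Demo on a temporary directory standing in for CONFIGS.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "lustre-params"), "wb") as f:
        f.write(b"\0" * 100)       # deliberately too short
    with open(os.path.join(demo_dir, "lustre-client"), "wb") as f:
        f.write(b"\0" * MIN_SIZE)  # exactly one header
    print(find_short_logs(demo_dir))  # [('lustre-params', 100)]
```

Running the same check right after reformat and again after the failure would show whether the file was short from the start.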

Comment by Cliff White (Inactive) [ 30/Sep/13 ]

All logs from the CONFIGS directory on the MGS after the crash, dumped to text with llog_reader. Same error as before:

2013-09-30 14:34:04 LustreError: 8465:0:(llog_osd.c:241:llog_osd_read_header()) MGS-osd: error reading log header from [0xa:0xa:0x0]: rc = -14
2013-09-30 14:34:04 LustreError: 8465:0:(mgs_llog.c:1386:record_start_log()) MGS: can't start log lustre-params: rc = -14
2013-09-30 14:34:04 BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
2013-09-30 14:34:04 IP: [<ffffffffa07ffe99>] llog_handle_put+0x9/0x70 [obdclass]

Comment by Cliff White (Inactive) [ 30/Sep/13 ]

debug_mb=1024, log dumped every 30 seconds for the duration of the test.

Comment by Cliff White (Inactive) [ 02/Oct/13 ]

Found an easy way to reproduce this:
With clients mounted, on the MGS:

hyperion-agb5 /root > lctl conf_param lustre.sys.jobid_var=procname_uid

crashes immediately.
And, if I set
export JOBID_VAR="existing"
in my config, the test runs.

Comment by Mikhail Pershin [ 04/Oct/13 ]

Cliff, can you reproduce that issue with commit http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=a217228ce3e1c93fdfeb1d1aa6ff48b3f82abf83 ?

Comment by Cliff White (Inactive) [ 04/Oct/13 ]

No, I cannot - ran the one-line test and it does not crash. Will run IOR shortly.

Comment by Cliff White (Inactive) [ 07/Oct/13 ]

Ran IOR without any crashes. Latest build fixes the issue.

Comment by Jodi Levi (Inactive) [ 16/Oct/13 ]

Removed fixversion as this is a duplicate.
