
LustreError: 8136:0:(llog_osd.c:241:llog_osd_read_header()) MGS-osd: error reading log header from [0xa:0xa:0x0]: rc = -14

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • Lustre 2.5.0
    • None
    • 3
    • 10693

    Description

      Running parallel-scale on Hyperion, initialization for IOR test. Formatted with ZFS.
      MDS crashes during test startup.

      LustreError: 8136:0:(llog_osd.c:241:llog_osd_read_header()) MGS-osd: error reading log header from [0xa:0xa:0x0]: rc = -14
      2013-09-23 11:32:12 LustreError: 8136:0:(mgs_llog.c:1386:record_start_log()) MGS: can't start log lustre-params: rc = -14
      2013-09-23 11:32:12 BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
      
      

      Console log attached.
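
The `rc = -14` in the messages above can be decoded mechanically. The following is a minimal sketch, assuming (as is the usual Linux kernel convention) that the Lustre `rc` value is a negated errno; the helper name `decode_rc` and the regular expression are illustrative and not part of any Lustre tool.

```python
import errno
import os
import re

# Matches the trailing "rc = <number>" that ends the error messages above.
RC_PATTERN = re.compile(r"rc = (-?\d+)\s*$")


def decode_rc(line):
    """Return (rc, errno_name, description) for a console line, or None.

    Assumption: rc is a negated Linux errno value.
    """
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return rc, name, os.strerror(code)


line = ("LustreError: 8136:0:(llog_osd.c:241:llog_osd_read_header()) "
        "MGS-osd: error reading log header from [0xa:0xa:0x0]: rc = -14")
print(decode_rc(line))
```

Under that assumption, errno 14 is `EFAULT`. Lines without a trailing `rc = ...`, such as the `BUG:` line, return `None`.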

      Attachments

        1. full.dumplogs.09302013.tar.gz
          0.2 kB
        2. dumplog.115600.gz
          623 kB
        3. dumplog.115529.gz
          733 kB
        4. dumplog.115459.gz
          740 kB
        5. dk.log.agb5.26Sept.2013.txt.gz
          4.10 MB
        6. console.txt
          5 kB
        7. config.log.tar.gz
          2 kB
        8. config.log.tar.gz
          2 kB
        9. agb5.crash2.26Sept2013.txt
          5 kB
        10. agb5.crash.23.sept.2013
          5 kB

          Activity

            [LU-3996] LustreError: 8136:0:(llog_osd.c:241:llog_osd_read_header()) MGS-osd: error reading log header from [0xa:0xa:0x0]: rc = -14

            jlevi Jodi Levi (Inactive) added a comment - Removed fixversion as this is a duplicate.

            cliffw Cliff White (Inactive) added a comment - Ran IOR without any crashes. Latest build fixes

            cliffw Cliff White (Inactive) added a comment - No, I cannot - ran the one-line test and it does not crash. Will run IOR shortly
            tappro Mikhail Pershin added a comment - Cliff, can you reproduce that issue with commit http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=a217228ce3e1c93fdfeb1d1aa6ff48b3f82abf83 ?

            cliffw Cliff White (Inactive) added a comment -

            Found an easy way to reproduce this.
            With clients mounted, on the MGS:

              hyperion-agb5 /root > lctl conf_param lustre.sys.jobid_var=procname_uid

            This crashes immediately. And if I set

              export JOBID_VAR="existing"

            in my config, the test runs.

            cliffw Cliff White (Inactive) added a comment - debug_mb=1024, log dumped every 30 seconds for duration of test.

            cliffw Cliff White (Inactive) added a comment -

            All logs from the CONFIGS directory on the MGS after the crash, dumped to text with llog_reader. Same error as before:

            2013-09-30 14:34:04 LustreError: 8465:0:(llog_osd.c:241:llog_osd_read_header()) MGS-osd: error reading log header from [0xa:0xa:0x0]: rc = -14
            2013-09-30 14:34:04 LustreError: 8465:0:(mgs_llog.c:1386:record_start_log()) MGS: can't start log lustre-params: rc = -14
            2013-09-30 14:34:04 BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
            2013-09-30 14:34:04 IP: [<ffffffffa07ffe99>] llog_handle_put+0x9/0x70 [obdclass]
            green Oleg Drokin added a comment -

            There's a fair chance the llog is created incorrectly at format time; if that is really the case, there's no log for this process.
            So taking a look at the llog (making sure it is short after the failure), and then checking it right after a reformat, even before we mount Lustre, would be useful to confirm. We then also need to see exactly how malformed it is: just a short size, what's inside, and so on.

            cliffw Cliff White (Inactive) added a comment - Each time this has failed, it has been on a freshly formatted filesystem. I can replicate and look at the log if that is necessary
            green Oleg Drokin added a comment -

            Thanks!

            Ok, so the issue is that we are trying to do an 8k read and can read only a smaller number of bytes (there is not enough logging to see how many) from the referenced llog file.

            Right now it sounds like the llog on the underlying MGS filesystem is damaged. We should mount it directly and check what's up with the llog file for [0xa:0xa:0x0]; it's probably way too short?
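
The check suggested above, whether the llog file is too short to hold its header, could be sketched roughly as follows once the MGS backing filesystem is mounted directly. This is only an illustration: the 8192-byte header size is taken from the "8k read" mentioned in the comment, and the function name `check_llog_header` and any path passed to it are hypothetical.

```python
import os

# Assumption: the header read is 8 KiB, per the "8k read" in the comment above.
LLOG_HEADER_SIZE = 8192


def check_llog_header(path, header_size=LLOG_HEADER_SIZE):
    """Report how many bytes of the header can actually be read from path."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        data = f.read(header_size)
    return {
        "file_size": file_size,
        "bytes_read": len(data),
        # True when the file cannot supply a full header.
        "short": len(data) < header_size,
    }
```

Calling `check_llog_header` on the llog file of interest right after a reformat, and again after the failure, would show whether the file was short from the start, which is the comparison proposed earlier in the thread.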

            People

              tappro Mikhail Pershin
              cliffw Cliff White (Inactive)