Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0
    • Environment: SLES11SP3 clients, master (v2_5_58_0-105-gb7f1952), Cray stress testing.
    • Severity: 3
    • 13981

    Description

      Cray has recently begun stress testing master clients in preparation for 2.6.0.

      We hit this bug on several nodes during a stress run last night:

      2014-05-13T23:00:06.485598-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) page@ffff880221781c00[4 ffff880222618600 3 0 1           (null) ffff88022328d740 0x0]
      2014-05-13T23:00:06.485630-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) vvp-page@ffff880221781ca0(0:0:0) vm@ffffea000e1502a0 200000000000801 4:0 ffff880221781c00 0 lru
      2014-05-13T23:00:06.485637-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) lov-page@ffff880221781cf8, raid0
      2014-05-13T23:00:06.485672-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) osc-page@ffff880221781d60 0: 1< 0x845fed 1 0 + + > 2< 0 0 4096 0x7 0x8 |           (null) ffff88044ad844f8 ffff88044d7a59e8 > 3< + ffff88044e055080 1 16 0 > 4< 0 0 8 35586048 - | - - - + > 5< - - - + | 0 - | 0 - ->
      2014-05-13T23:00:06.485681-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) end page@ffff880221781c00
      2014-05-13T23:00:06.485689-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) dump uncover page!
      2014-05-13T23:00:06.485700-05:00 c1-0c0s2n1 Pid: 14298, comm: nsystst
      2014-05-13T23:00:06.485710-05:00 c1-0c0s2n1 Call Trace:
      2014-05-13T23:00:06.485724-05:00 c1-0c0s2n1 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
      2014-05-13T23:00:06.485746-05:00 c1-0c0s2n1 [<ffffffff81004919>] dump_trace+0x89/0x450
      2014-05-13T23:00:06.485757-05:00 c1-0c0s2n1 [<ffffffffa023b8d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2014-05-13T23:00:06.485784-05:00 c1-0c0s2n1 [<ffffffffa092f348>] osc_req_attr_set+0x6b8/0x740 [osc]
      2014-05-13T23:00:06.485796-05:00 c1-0c0s2n1 [<ffffffffa038dbf1>] cl_req_attr_set+0xd1/0x220 [obdclass]
      2014-05-13T23:00:06.485807-05:00 c1-0c0s2n1 [<ffffffffa091a6da>] osc_build_rpc+0x4fa/0x15d0 [osc]
      2014-05-13T23:00:06.485817-05:00 c1-0c0s2n1 [<ffffffffa093c199>] osc_io_unplug0+0x1419/0x2020 [osc]
      2014-05-13T23:00:06.485837-05:00 c1-0c0s2n1 [<ffffffffa093ddb1>] osc_io_unplug+0x11/0x20 [osc]
      2014-05-13T23:00:06.485844-05:00 c1-0c0s2n1 [<ffffffffa093dfa1>] osc_queue_sync_pages+0x1e1/0x380 [osc]
      2014-05-13T23:00:06.485852-05:00 c1-0c0s2n1 [<ffffffffa092f90d>] osc_io_submit+0x2dd/0x4e0 [osc]
      2014-05-13T23:00:06.485871-05:00 c1-0c0s2n1 [<ffffffffa038d40e>] cl_io_submit_rw+0x6e/0x170 [obdclass]
      2014-05-13T23:00:06.485881-05:00 c1-0c0s2n1 [<ffffffffa076123e>] lov_io_submit+0x2ee/0x4d0 [lov]
      2014-05-13T23:00:06.485894-05:00 c1-0c0s2n1 [<ffffffffa038d40e>] cl_io_submit_rw+0x6e/0x170 [obdclass]
      2014-05-13T23:00:06.485900-05:00 c1-0c0s2n1 [<ffffffffa038f3fa>] cl_io_read_page+0x19a/0x1b0 [obdclass]
      2014-05-13T23:00:06.485907-05:00 c1-0c0s2n1 [<ffffffffa081bf7c>] ll_readpage+0x16c/0x1b0 [lustre]
      2014-05-13T23:00:06.485920-05:00 c1-0c0s2n1 [<ffffffff810fdc68>] generic_file_aio_read+0x268/0x740
      2014-05-13T23:00:06.485932-05:00 c1-0c0s2n1 [<ffffffffa084e0b0>] vvp_io_read_start+0x2a0/0x430 [lustre]
      2014-05-13T23:00:06.485938-05:00 c1-0c0s2n1 [<ffffffffa038d6b2>] cl_io_start+0x72/0x140 [obdclass]
      2014-05-13T23:00:06.485947-05:00 c1-0c0s2n1 [<ffffffffa0391234>] cl_io_loop+0xb4/0x1b0 [obdclass]
      2014-05-13T23:00:06.485959-05:00 c1-0c0s2n1 [<ffffffffa07e9ccc>] ll_file_io_generic+0x58c/0x8a0 [lustre]
      2014-05-13T23:00:06.485967-05:00 c1-0c0s2n1 [<ffffffffa07ea49c>] ll_file_aio_read+0x22c/0x290 [lustre]
      2014-05-13T23:00:06.485994-05:00 c1-0c0s2n1 [<ffffffff8115517b>] do_sync_readv_writev+0xdb/0x120
      2014-05-13T23:00:06.486000-05:00 c1-0c0s2n1 [<ffffffff81156084>] do_readv_writev+0xd4/0x1e0
      2014-05-13T23:00:06.486010-05:00 c1-0c0s2n1 [<ffffffff811563c5>] vfs_readv+0x45/0x60
      2014-05-13T23:00:06.486016-05:00 c1-0c0s2n1 [<ffffffff81156505>] sys_readv+0x55/0xc0
      2014-05-13T23:00:06.486023-05:00 c1-0c0s2n1 [<ffffffff8142286b>] system_call_fastpath+0x16/0x1b
      2014-05-13T23:00:06.486031-05:00 c1-0c0s2n1 [<000000002006d41b>] 0x2006d41b
      2014-05-13T23:00:06.510785-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:826:osc_req_attr_set()) LBUG
      

      I'll make the dump+logs available shortly.

      I note this bug was seen by Oleg during patch review for the readahead changes in LU-3321; see his comments here:

      http://review.whamcloud.com/#/c/8523/

      I also note a previous instance of this assertion was fixed in LU-1650:
      https://jira.hpdd.intel.com/browse/LU-1650

      Attachments

        1. log
          693 kB
        2. log.sort
          573 kB
        3. lustre-ext.so
          132 kB

        Issue Links

          Activity

            [LU-5062] LBUG: osc_req_attr_set

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12139/
            Subject: LU-5062 llite: Solve a race to access lli_has_smd in read case
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 078dee1448c2995257923dc712943eab2c78e48e
            pjones Peter Jones added a comment -

            Landed for 2.6


            jay Jinshan Xiong (Inactive) added a comment -

            Just out of curiosity: I'm still a little puzzled. If I'm reading cl_use_try correctly, as long as there's at least one slice with a valid clo_use, rc shouldn't be ENOSYS, so no assertion. Is the error case here a cl_lock with only a single slice, then?

            The cl_lock of an empty file will have two slices: ccc and lov. However, neither of them has ->clo_use defined; this is why cl_use_try() will see the ENOSYS error.

            I don't need the data from patch set 1, thanks.
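            [Editor's note: the explanation above can be illustrated with a small standalone C sketch. This is a hypothetical, heavily simplified model of the slice walk — the names use_try/clo_use/ccc/lov mirror the Lustre cl_lock code discussed here, but none of the data structures below are the real implementation. The point it shows: when no layer supplies a clo_use callback, the initial -ENOSYS result is never overwritten, and -ENOSYS (-38 on Linux) is exactly the value the LU-4900 ASSERTION( result != -38 ) trips on.]

            ```c
            #include <assert.h>
            #include <errno.h>
            #include <stddef.h>
            #include <stdio.h>

            /* Hypothetical, simplified model of cl_lock slices.  Each layer
             * (ccc, lov, ...) may or may not implement a clo_use callback;
             * only the control flow is mirrored here, not the real Lustre
             * data structures. */
            struct lock_slice {
                    int (*clo_use)(struct lock_slice *slice); /* may be NULL */
            };

            /* Walk the slices; if no slice defines clo_use, the initial
             * -ENOSYS is never overwritten and becomes the return value --
             * the case that cl_use_try()'s assertion LBUGs on. */
            static int use_try(struct lock_slice *slices[], int n)
            {
                    int result = -ENOSYS;
                    int i;

                    for (i = 0; i < n; i++) {
                            if (slices[i]->clo_use != NULL) {
                                    result = slices[i]->clo_use(slices[i]);
                                    if (result != 0)
                                            break;
                            }
                    }
                    return result;
            }

            /* The empty-file case described above: two slices (ccc and lov),
             * neither of which defines ->clo_use. */
            static int empty_file_lock_use_rc(void)
            {
                    struct lock_slice ccc = { .clo_use = NULL };
                    struct lock_slice lov = { .clo_use = NULL };
                    struct lock_slice *slices[] = { &ccc, &lov };

                    return use_try(slices, 2);
            }

            int main(void)
            {
                    int rc = empty_file_lock_use_rc();

                    printf("rc = %d (ENOSYS = %d)\n", rc, ENOSYS);
                    assert(rc == -ENOSYS);
                    return 0;
            }
            ```

            A lock in which at least one slice provides clo_use would return that callback's result instead, which is why the assertion normally never fires.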

            paf Patrick Farrell (Inactive) added a comment -

            Yes, I did, because I had it built and the system set up before I saw your response in Gerrit, and then I didn't realize the other code affected more than that sanity testing. Oops.

            Having looked at LU-4900, I understand now why you did what you did in patch set 2. I see now that it's called in the macro in cl_use_try.

            So it looks like your change to vvp_io_read_lock is getting this kind of cl_lock into the cache, and then into cl_use_try, where it asserts. And since cl_use_try is only called for cached locks, we don't hit that assertion all the time...

            I'll do another run later with patch set 2. Is the data from running only patch set 1 helpful in proving the vvp_io_read_lock part of the patch, or should I just ignore it?

            Sorry about that.

            Just out of curiosity: I'm still a little puzzled. If I'm reading cl_use_try correctly, as long as there's at least one slice with a valid clo_use, rc shouldn't be ENOSYS, so no assertion. Is the error case here a cl_lock with only a single slice, then?

            jay Jinshan Xiong (Inactive) added a comment -

            You used patch set 1, is that right?

            paf Patrick Farrell (Inactive) added a comment -

            Jinshan - While testing the patch, I hit the assertion from LU-4900, with the same stack trace as there:

            2014-06-19T22:02:47.307062-05:00 c0-0c2s6n0 LustreError: 13331:0:(cl_lock.c:1122:cl_use_try()) ASSERTION( result != -38 ) failed:
            2014-06-19T22:02:47.307096-05:00 c0-0c2s6n0 LustreError: 13331:0:(cl_lock.c:1122:cl_use_try()) LBUG

            It's probably just random, since that bug is known to exist, but I wanted to let you know in case there's a possible relationship. It's worth noting that I don't think we've ever hit LU-4900 before. (It's still likely to be just chance.)

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Patrick,

            Can you try this patch: http://review.whamcloud.com/10760?

            jay Jinshan Xiong (Inactive) added a comment -

            I'll start to work on this.

            paf Patrick Farrell (Inactive) added a comment - edited

            Cray hit this again today. [Update: Looking internally, we've hit it several times recently in testing.]

            Also: it looks like LU-3729 may be a duplicate of this, though there have been several osc_req_attr_set cases fixed since that bug was opened.

            jlevi Jodi Levi (Inactive) added a comment -

            Jinshan,
            Are you planning to make a patch for this?

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: