Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0
    • Environment: SLES11SP3 clients, master (v2_5_58_0-105-gb7f1952), Cray stress testing.
    • Severity: 3
    • 13981

    Description

      Cray has recently begun stress testing master clients in preparation for 2.6.0.

      We hit this bug on several nodes during a stress run last night:

      2014-05-13T23:00:06.485598-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) page@ffff880221781c00[4 ffff880222618600 3 0 1           (null) ffff88022328d740 0x0]
      2014-05-13T23:00:06.485630-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) vvp-page@ffff880221781ca0(0:0:0) vm@ffffea000e1502a0 200000000000801 4:0 ffff880221781c00 0 lru
      2014-05-13T23:00:06.485637-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) lov-page@ffff880221781cf8, raid0
      2014-05-13T23:00:06.485672-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) osc-page@ffff880221781d60 0: 1< 0x845fed 1 0 + + > 2< 0 0 4096 0x7 0x8 |           (null) ffff88044ad844f8 ffff88044d7a59e8 > 3< + ffff88044e055080 1 16 0 > 4< 0 0 8 35586048 - | - - - + > 5< - - - + | 0 - | 0 - ->
      2014-05-13T23:00:06.485681-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) end page@ffff880221781c00
      2014-05-13T23:00:06.485689-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) dump uncover page!
      2014-05-13T23:00:06.485700-05:00 c1-0c0s2n1 Pid: 14298, comm: nsystst
      2014-05-13T23:00:06.485710-05:00 c1-0c0s2n1 Call Trace:
      2014-05-13T23:00:06.485724-05:00 c1-0c0s2n1 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
      2014-05-13T23:00:06.485746-05:00 c1-0c0s2n1 [<ffffffff81004919>] dump_trace+0x89/0x450
      2014-05-13T23:00:06.485757-05:00 c1-0c0s2n1 [<ffffffffa023b8d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2014-05-13T23:00:06.485784-05:00 c1-0c0s2n1 [<ffffffffa092f348>] osc_req_attr_set+0x6b8/0x740 [osc]
      2014-05-13T23:00:06.485796-05:00 c1-0c0s2n1 [<ffffffffa038dbf1>] cl_req_attr_set+0xd1/0x220 [obdclass]
      2014-05-13T23:00:06.485807-05:00 c1-0c0s2n1 [<ffffffffa091a6da>] osc_build_rpc+0x4fa/0x15d0 [osc]
      2014-05-13T23:00:06.485817-05:00 c1-0c0s2n1 [<ffffffffa093c199>] osc_io_unplug0+0x1419/0x2020 [osc]
      2014-05-13T23:00:06.485837-05:00 c1-0c0s2n1 [<ffffffffa093ddb1>] osc_io_unplug+0x11/0x20 [osc]
      2014-05-13T23:00:06.485844-05:00 c1-0c0s2n1 [<ffffffffa093dfa1>] osc_queue_sync_pages+0x1e1/0x380 [osc]
      2014-05-13T23:00:06.485852-05:00 c1-0c0s2n1 [<ffffffffa092f90d>] osc_io_submit+0x2dd/0x4e0 [osc]
      2014-05-13T23:00:06.485871-05:00 c1-0c0s2n1 [<ffffffffa038d40e>] cl_io_submit_rw+0x6e/0x170 [obdclass]
      2014-05-13T23:00:06.485881-05:00 c1-0c0s2n1 [<ffffffffa076123e>] lov_io_submit+0x2ee/0x4d0 [lov]
      2014-05-13T23:00:06.485894-05:00 c1-0c0s2n1 [<ffffffffa038d40e>] cl_io_submit_rw+0x6e/0x170 [obdclass]
      2014-05-13T23:00:06.485900-05:00 c1-0c0s2n1 [<ffffffffa038f3fa>] cl_io_read_page+0x19a/0x1b0 [obdclass]
      2014-05-13T23:00:06.485907-05:00 c1-0c0s2n1 [<ffffffffa081bf7c>] ll_readpage+0x16c/0x1b0 [lustre]
      2014-05-13T23:00:06.485920-05:00 c1-0c0s2n1 [<ffffffff810fdc68>] generic_file_aio_read+0x268/0x740
      2014-05-13T23:00:06.485932-05:00 c1-0c0s2n1 [<ffffffffa084e0b0>] vvp_io_read_start+0x2a0/0x430 [lustre]
      2014-05-13T23:00:06.485938-05:00 c1-0c0s2n1 [<ffffffffa038d6b2>] cl_io_start+0x72/0x140 [obdclass]
      2014-05-13T23:00:06.485947-05:00 c1-0c0s2n1 [<ffffffffa0391234>] cl_io_loop+0xb4/0x1b0 [obdclass]
      2014-05-13T23:00:06.485959-05:00 c1-0c0s2n1 [<ffffffffa07e9ccc>] ll_file_io_generic+0x58c/0x8a0 [lustre]
      2014-05-13T23:00:06.485967-05:00 c1-0c0s2n1 [<ffffffffa07ea49c>] ll_file_aio_read+0x22c/0x290 [lustre]
      2014-05-13T23:00:06.485994-05:00 c1-0c0s2n1 [<ffffffff8115517b>] do_sync_readv_writev+0xdb/0x120
      2014-05-13T23:00:06.486000-05:00 c1-0c0s2n1 [<ffffffff81156084>] do_readv_writev+0xd4/0x1e0
      2014-05-13T23:00:06.486010-05:00 c1-0c0s2n1 [<ffffffff811563c5>] vfs_readv+0x45/0x60
      2014-05-13T23:00:06.486016-05:00 c1-0c0s2n1 [<ffffffff81156505>] sys_readv+0x55/0xc0
      2014-05-13T23:00:06.486023-05:00 c1-0c0s2n1 [<ffffffff8142286b>] system_call_fastpath+0x16/0x1b
      2014-05-13T23:00:06.486031-05:00 c1-0c0s2n1 [<000000002006d41b>] 0x2006d41b
      2014-05-13T23:00:06.510785-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:826:osc_req_attr_set()) LBUG
      

      I'll make the dump+logs available shortly.

      I note this bug was seen by Oleg during patch review for the readahead changes in LU-3321; see his comments here:

      http://review.whamcloud.com/#/c/8523/

      I also note a previous instance of this assertion was fixed in LU-1650:
      https://jira.hpdd.intel.com/browse/LU-1650

      Attachments

        1. log
          693 kB
        2. log.sort
          573 kB
        3. lustre-ext.so
          132 kB

        Issue Links

          Activity

            [LU-5062] LBUG: osc_req_attr_set

            Jinshan Xiong (Inactive) added a comment:

            Hi Patrick,

            Can you try this patch: http://review.whamcloud.com/10760?

            Jinshan Xiong (Inactive) added a comment:

            I'll start to work on this.

            Patrick Farrell (Inactive) added a comment (edited):

            Cray hit this again today. [Update: Looking internally, we've hit it several times recently in testing.]

            Also: it looks like LU-3729 may be a duplicate of this, though I believe several osc_req_attr_set cases have been fixed since that bug was opened.

            Jodi Levi (Inactive) added a comment:

            Jinshan,
            Are you planning to make a patch for this?

            Jinshan Xiong (Inactive) added a comment:

            It turned out that this is a race that occurs when a file with an empty layout is turned into a raid0 layout.

            Patrick Farrell (Inactive) added a comment:

            log is the raw binary log, log.sort is the processed log.
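
            (For reference, a raw binary Lustre debug log like the "log" attachment is normally decoded to text with lctl; a minimal sketch, where the file names simply mirror the attachment names and log.sort is presumably the same output sorted by timestamp:)

            # decode the raw binary debug log into readable text
            lctl debug_file log log.txt
            # quick sanity check that records decoded correctly
            head log.txt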

            Patrick Farrell (Inactive) added a comment:

            Jinshan,

            Hmm. The problem might be SLES11SP3... Xyratex modified lustre.so so it could get logs out; I'm not familiar with what they changed. (We also had to update our version of Crash to something more recent than we were using.) I'll attach that file as well.

            In any case, I was able to extract the logs. I'll note there did appear to be a few corrupt messages (I got some error messages when using lctl to process the binary logs), so I'll include the binary logs as well as the processed ones. Attaching in a moment.

            Jinshan Xiong (Inactive) added a comment:

            Hi Patrick,

            I still can't extract the Lustre log from the crash dump file. This is what I got when I tried to extract it:

            crash> extend /home/jay/crashdump/lustre.so
            /home/jay/crashdump/lustre.so: shared object loaded
            crash> lustre -l x
            lustre_walk_cpus(0, 5, 1)
            cmd:	p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages
            	p (*cfs_trace_data[0])[0].tcd.tcd_pages.next
            lustre: gdb request failed: "p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages"
            crash> p cfs_trace_data
            cfs_trace_data = $7 =
             {0x1400000004, 0x554e4700000003, 0x5b9a0443e17b2baa, 0xcd0dd360156499dc, 0x5abd76e8, 0x0, 0xffffffffa022c9e0,
             0xffffffffa024f201}

            cfs_trace_data should be an array of pointers. However, the memory cfs_trace_data points to seems to be mangled. Can you run crash on your side to see if you can get the log? Otherwise, please check that the dump-gathering script is correct.
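
            (If the problem is stale or mismatched module symbols rather than a corrupt dump, one thing that sometimes helps is loading the matching module objects inside crash before retrying the extension. A rough sketch, assuming the .ko files from the node's exact running build are available locally; the paths are illustrative:)

            crash> mod -s libcfs /path/to/matching/build/libcfs.ko
            crash> extend /home/jay/crashdump/lustre.so
            crash> lustre -l x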

            Patrick Farrell (Inactive) added a comment (edited):

            Ugh, sorry for the incorrect KO files... There's a small issue in our automated dump gathering.

            I did this one by hand and checked that the KOs are right. This is the dump for the report I just made a few hours ago:

            ftp.whamcloud.com
            uploads/LU-5062/LU_5062_140515.tar.gz

            The timer is saying 9 minutes to complete the upload.
            Upload complete.

            Thank you for taking a look.

            Jinshan Xiong (Inactive) added a comment:

            Hi Patrick, can you please upload one more crash dump for me? It looks like the previous upload has a problem and I can't extract the Lustre log from it. Please make sure the ko files match the running code.
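
            (A quick way to confirm the .ko files match the running code is to compare module source checksums; a sketch assuming the modules were built with srcversion support and that a node running the same software image is still up, with illustrative paths and the osc module as an example:)

            # checksum of the module object being uploaded alongside the dump
            modinfo -F srcversion /path/to/upload/osc.ko
            # checksum of the module actually loaded on a node running the same image
            cat /sys/module/osc/srcversion
            # the two values should match for every Lustre module included (osc, lov, obdclass, libcfs, ...)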

            Jinshan Xiong (Inactive) added a comment:

            I will look at this issue.

            People

              Assignee: Jinshan Xiong (Inactive)
              Reporter: Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 7
