sanity test_54c: can't find an ext2 filesystem on dev loop3

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.2
    • Affects Version/s: Lustre 2.6.0
    • Labels: None
    • Environment: lustre-master build #1791 ldiskfs; client is running SLES11 SP3
    • Severity: 3
    • Rank: 11919

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/0ac2cba4-5de4-11e3-8f3c-52540035b04c.

      The sub-test test_54c failed with the following error:

      error mounting /mnt/lustre/loop54c on /mnt/lustre/d54c

      client dmesg shows:

      [ 2564.211169] LustreError: 9888:0:(rw.c:128:ll_cl_init()) lustre: [0x200001b70:0xeff1:0x0] no active IO, please file a ticket.
      [ 2564.211172] Pid: 9888, comm: loop3
      [ 2564.211173] 
      [ 2564.211174] Call Trace:
      [ 2564.211186]  [<ffffffff81004935>] dump_trace+0x75/0x310
      [ 2564.211210]  [<ffffffffa05c282a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
      [ 2564.211250]  [<ffffffffa0becb8e>] ll_cl_init+0x22e/0x330 [lustre]
      [ 2564.211280]  [<ffffffffa0c0901a>] ll_write_begin+0x8a/0x5d0 [lustre]
      [ 2564.211305]  [<ffffffffa02cb50d>] do_lo_send_aops+0xad/0x1b0 [loop]
      [ 2564.211310]  [<ffffffffa02cb7a0>] do_bio_filebacked+0x190/0x280 [loop]
      [ 2564.211314]  [<ffffffffa02cb952>] loop_thread+0xc2/0x250 [loop]
      [ 2564.211319]  [<ffffffff81082306>] kthread+0x96/0xa0
      [ 2564.211325]  [<ffffffff81467864>] kernel_thread_helper+0x4/0x10
      [ 2564.211328] 
      [ 2564.211335] Buffer I/O error on device loop3, logical block 0
      [ 2564.211336] lost page write due to I/O error on loop3
      [ 2564.211345] Pid: 9888, comm: loop3
      [ 2564.211346] 
      [ 2564.211346] Call Trace:
      [ 2564.211350]  [<ffffffff81004935>] dump_trace+0x75/0x310
      [ 2564.211360]  [<ffffffffa05c282a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
      [ 2564.211377]  [<ffffffffa0becb8e>] ll_cl_init+0x22e/0x330 [lustre]
      [ 2564.211403]  [<ffffffffa0c0901a>] ll_write_begin+0x8a/0x5d0 [lustre]
      [ 2564.211421]  [<ffffffffa02cb50d>] do_lo_send_aops+0xad/0x1b0 [loop]
      [ 2564.211426]  [<ffffffffa02cb7a0>] do_bio_filebacked+0x190/0x280 [loop]
      [ 2564.211430]  [<ffffffffa02cb952>] loop_thread+0xc2/0x250 [loop]
      [ 2564.211434]  [<ffffffff81082306>] kthread+0x96/0xa0
      [ 2564.211445]  [<ffffffff81467864>] kernel_thread_helper+0x4/0x10
      
      [ 2564.237136] EXT2-fs (loop3): error: can't find an ext2 filesystem on dev loop3.
      [ 2568.722796] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_54c: @@@@@@ FAIL: error mounting \/mnt\/lustre\/loop54c on \/mnt\/lustre\/d54c 
      

          Activity

            [LU-4351] sanity test_54c: can't find an ext2 filesystem on dev loop3

            bogl Bob Glossman (Inactive) added a comment - Fixed in sles11sp3 as well, in the most recent upstream kernel update from SuSE, to version 3.0.101-0.35.
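A minimal sketch for checking whether a SLES11 SP3 client kernel is at or past the release named in this thread (3.0.101-0.35). It assumes the kernel release string looks like `3.0.101-0.35-default`; the helper names `release_tuple` and `has_loop_fix` are hypothetical, and the comparison only makes sense for SLES11 SP3 3.0.x kernels, not for other distributions.

```python
import re

# Release named in this ticket as carrying the fix (SLES11 SP3 only).
FIXED_IN = (3, 0, 101, 0, 35)


def release_tuple(release):
    """Parse the leading 'A.B.C-D.E' part of a kernel release string.

    Assumption: the string starts with three dotted numbers, a dash,
    and two more dotted numbers, e.g. '3.0.101-0.35-default'.
    Anything after that prefix is ignored.
    """
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in m.groups())


def has_loop_fix(release):
    """True if the release is at or after the one reported as fixed."""
    return release_tuple(release) >= FIXED_IN
```

On a client, the input would typically be the output of `uname -r` or Python's `platform.release()`.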

            jlevi Jodi Levi (Inactive) added a comment - Patches have landed.

            paf Patrick Farrell (Inactive) added a comment - Thanks, Bob - It doesn't seem to be, but Cray has support contracts with SUSE, so someone around here should be able to get me in. Much appreciated.

            bogl Bob Glossman (Inactive) added a comment - btw, the fix isn't in the latest update to 3.0.101-0.29. I expect it in the next one.

            bogl Bob Glossman (Inactive) added a comment - Patrick, It's https://bugzilla.novell.com/show_bug.cgi?id=878123 . Don't know if that's generally accessible.

            paf Patrick Farrell (Inactive) added a comment - Bob: Are you able to provide a link or other info about that bug report? Thanks

            bogl Bob Glossman (Inactive) added a comment - Recent bug report submitted upstream to SuSE and marked fixed suggests this bug will very likely disappear after the next sles11sp3 kernel update.

            bogl Bob Glossman (Inactive) added a comment - I have been researching the use of address space ops in the kernel loop device code a bit. As far as I can tell, it was part of the mainline upstream Linux kernel for a very long time, until the following commit removed it in 2011:

            author	Christoph Hellwig <hch@infradead.org>	2011-10-17 10:57:20 (GMT)
            committer	Jens Axboe <axboe@kernel.dk>	2011-10-17 10:57:20 (GMT)
            commit	456be1484ffc72a24bdb4200b5847c4fa90139d9 (patch)
            tree	570f0818bd6cfa245ab23d0121853b7b1e5a649b /drivers/block/loop.c
            parent	8bc03e8f3a334e09e89a7dffb486ee97a5ce84ae (diff)
            loop: remove the incorrect write_begin/write_end shortcut
            Currently the loop device tries to call directly into write_begin/write_end
            instead of going through ->write if it can.  This is a fairly nasty shortcut
            as write_begin and write_end are only callbacks for the generic write code
            and expect to be called with filesystem specific locks held.
            
            This code currently causes various issues for clustered filesystems as it
            doesn't take the required cluster locks, and it also causes issues for XFS
            as it doesn't properly lock against the swapext ioctl as called by the
            defragmentation tools.  This in case causes data corruption if
            defragmentation hits a busy loop device in the wrong time window, as
            reported by RH QA.
            
            The reason why we have this shortcut is that it saves a data copy when
            doing a transformation on the loop device, which is the technical term
            for using cryptoloop (or an XOR transformation).  Given that cryptoloop
            has been deprecated in favour of dm-crypt my opinion is that we should
            simply drop this shortcut instead of finding complicated ways to to
            introduce a formal interface for this shortcut.
            
            Signed-off-by: Christoph Hellwig <hch@lst.de>
            Signed-off-by: Jens Axboe <axboe@kernel.dk>

            I suspect the reason we don't see it in el6 is that yanking it out was one of the RH customizations of their distro kernel.
            I suspect the reason it's in SLES11 is that the branch their kernel is derived from is too old to include the upstream kernel commit.
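The commit message above argues that write_begin/write_end are only safe when reached through the full ->write path, which sets up the state they expect. The toy model below illustrates that idea only; it is not kernel or Lustre code. The class `ToyFile` and both `loop_write_*` helpers are hypothetical, and the "active IO" context is a stand-in for whatever locks or I/O state a real filesystem establishes before calling these callbacks.

```python
class ToyFile:
    """Toy stand-in for a file whose write callbacks need per-write context."""

    def __init__(self):
        self.active_io = None      # set only while write() is in progress
        self.data = bytearray()

    def write(self, offset, payload):
        # Full write path: establish the context, then use the callbacks.
        self.active_io = {"offset": offset}
        try:
            self.write_begin(offset, len(payload))
            self.write_end(offset, payload)
        finally:
            self.active_io = None
        return len(payload)

    def write_begin(self, offset, length):
        # Callback: refuses to run without the context set up by write().
        if self.active_io is None:
            raise RuntimeError("no active IO")
        end = offset + length
        if len(self.data) < end:
            self.data.extend(b"\0" * (end - len(self.data)))

    def write_end(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload


def loop_write_shortcut(f, offset, payload):
    """Models the removed shortcut: calls the callbacks directly."""
    f.write_begin(offset, len(payload))
    f.write_end(offset, payload)


def loop_write_generic(f, offset, payload):
    """Models the path after the commit: goes through write()."""
    return f.write(offset, payload)
```

In this model the shortcut fails in `write_begin` because nothing set up the context, which is the same shape as the "no active IO" message in the client dmesg, while the generic path succeeds.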

            bogl Bob Glossman (Inactive) added a comment - I see your call stack has do_lo_send_aops in it. That says this is likely the same bug that's in the SLES11 kernel. Do any of your Phi-specific patches apply in drivers/block/loop.c? If not, then this bug probably exists in the 2.6.38 upstream kernel source you are using. I had previously seen this internal routine only in SLES 11 (3.0.x) kernels, not in el6 (2.6.32-xxx) kernels or in fc20, rhel7, or sles12.

            As far as I know there is no workaround.
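The check described in this comment (looking for do_lo_send_aops in the call trace) can be sketched as a small log scan. This is an illustration only: the function names are hypothetical, and the regular expression assumes trace frames formatted like the dmesg excerpt in the description, `[<address>] symbol+0xoff/0xlen`.

```python
import re

# Matches one call-trace frame, e.g.
# "[ 2564.211305]  [<ffffffffa02cb50d>] do_lo_send_aops+0xad/0x1b0 [loop]"
TRACE_FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x")


def trace_functions(dmesg_text):
    """Return the symbol names of all call-trace frames, in order."""
    return [m.group(1) for m in TRACE_FRAME.finditer(dmesg_text)]


def looks_like_loop_aops_bug(dmesg_text):
    """Heuristic: the loop write shortcut appears in a trace and the
    client also logged the 'no active IO' error."""
    return ("do_lo_send_aops" in trace_functions(dmesg_text)
            and "no active IO" in dmesg_text)
```

A match is only a hint that the kernel's loop driver still contains the shortcut; it does not confirm the root cause on its own.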

            dmiter Dmitry Eremin (Inactive) added a comment - No, it's upstream kernel 2.6.38 with patches for Xeon Phi.

            People

              bogl Bob Glossman (Inactive)
              maloo Maloo
              Votes: 0
              Watchers: 11
