Lustre / LU-2910

racer test_1, sanity test_54c: ASSERTION( cio->cui_tot_nrsegs >= cio->cui_nrsegs ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/3fa2a912-85b2-11e2-9f8d-52540035b04c.

      The sub-test test_1 failed with the following error:

      test failed to respond and timed out

      client console shows:

      18:47:35:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-22-ib,client-23-ib.lab.whamcloud.com DURATION=900 == 18:47:28 (1362451648)
      18:47:35:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 
      18:47:35:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
      18:51:19:LustreError: 8649:0:(file.c:2601:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000bd0:0x9095:0x0] error: rc = -116
      18:56:24:LustreError: 28058:0:(lcommon_cl.c:821:ccc_io_advance()) ASSERTION( cio->cui_tot_nrsegs >= cio->cui_nrsegs ) failed: 
      18:56:24:LustreError: 28058:0:(lcommon_cl.c:821:ccc_io_advance()) LBUG
      18:56:24:Pid: 28058, comm: cat
      18:56:24:
      18:56:24:Call Trace:
      18:56:24: [<ffffffffa03fe895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      18:56:24: [<ffffffffa03fee97>] lbug_with_loc+0x47/0xb0 [libcfs]
      18:56:24: [<ffffffffa1264c0e>] ccc_io_advance+0x11e/0x190 [lustre]
      18:56:24: [<ffffffffa05a96d6>] cl_io_rw_advance+0x66/0x150 [obdclass]
      18:56:24: [<ffffffffa05ae602>] cl_io_loop+0xe2/0x1b0 [obdclass]
      18:56:24: [<ffffffffa1215323>] ll_file_io_generic+0x223/0x570 [lustre]
      18:56:24: [<ffffffffa12157af>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      18:56:25: [<ffffffffa121635c>] ll_file_read+0x16c/0x2a0 [lustre]
      18:56:25: [<ffffffff81176cb5>] vfs_read+0xb5/0x1a0
      18:56:25: [<ffffffff81176df1>] sys_read+0x51/0x90
      18:56:25: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      18:56:25:
      18:56:25:Kernel panic - not syncing: LBUG
      18:56:25:Pid: 28058, comm: cat Not tainted 2.6.32-279.19.1.el6.x86_64 #1
      18:56:25:Call Trace:
      18:56:25: [<ffffffff814e9541>] ? panic+0xa0/0x168
      18:56:25: [<ffffffffa03feeeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      18:56:25: [<ffffffffa1264c0e>] ? ccc_io_advance+0x11e/0x190 [lustre]
      18:56:25: [<ffffffffa05a96d6>] ? cl_io_rw_advance+0x66/0x150 [obdclass]
      18:56:25: [<ffffffffa05ae602>] ? cl_io_loop+0xe2/0x1b0 [obdclass]
      18:56:25: [<ffffffffa1215323>] ? ll_file_io_generic+0x223/0x570 [lustre]
      18:56:25: [<ffffffffa12157af>] ? ll_file_aio_read+0x13f/0x2c0 [lustre]
      18:56:25: [<ffffffffa121635c>] ? ll_file_read+0x16c/0x2a0 [lustre]
      18:56:25: [<ffffffff81176cb5>] ? vfs_read+0xb5/0x1a0
      18:56:25: [<ffffffff81176df1>] ? sys_read+0x51/0x90
      18:56:26: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      18:56:26:Initializing cgroup subsys cpuset
      18:56:26:Initializing cgroup subsys cpu
      


        Activity

          pjones Peter Jones added a comment -

          Landed for 2.4


          jay Jinshan Xiong (Inactive) added a comment -

          Is there something that should be fixed in the lloop code? There has been more interest in this recently from Robert and Richard for VM provisioning and other cloud applications.

          Did I say Lustre loop device? I meant to say the Linux kernel loop device, sorry.

          First of all, we should figure out what kind of IO the lloop device has been receiving; then we can work out a corresponding fix. In the current implementation of lloop, we use direct IO to serve bio requests, so an obvious way to improve performance is to do cached writes - especially if we find that there are plenty of small IOs.

          adilger Andreas Dilger added a comment -

          Jinshan replied to my comment on Patch Set 1:

          "Failed tests always show "tot_nrsegs: 0" in the LASSERT() so I can imagine this will fix the problem."

          The root cause of this problem is lloop write.

          Is there something that should be fixed in the lloop code? There has been more interest in this recently from Robert and Richard for VM provisioning and other cloud applications.
          mdiep Minh Diep added a comment -

          Hit this: https://maloo.whamcloud.com/test_sets/7e10a3f6-8e87-11e2-a173-52540035b04c

          adilger Andreas Dilger added a comment -

          Increasing priority, since this is causing almost every test to fail.

          niu Niu Yawei (Inactive) added a comment -

          Skip iov update when tot_nrsegs is zero: http://review.whamcloud.com/5747

          niu Niu Yawei (Inactive) added a comment -

          What I saw is another LASSERT, newly added by the patch:

          LustreError: 31446:0:(lcommon_cl.c:794:ccc_io_update_iov()) ASSERTION( cio->cui_tot_nrsegs >= cio->cui_nrsegs ) failed: tot_nrsegs: 0, nrsegs: 1
          LustreError: 31446:0:(lcommon_cl.c:794:ccc_io_update_iov()) LBUG
          Kernel panic - not syncing: LBUG
          Pid: 31446, comm: loop3 Tainted: P           ---------------    2.6.32 #1
          Call Trace:
           [<ffffffff814fe08e>] ? panic+0xa0/0x168
           [<ffffffffa10b3f5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
           [<ffffffffa0f6e569>] ? ccc_io_update_iov+0xe9/0xf0 [lustre]
           [<ffffffffa0f77d8c>] ? vvp_io_rw_lock+0x8c/0x690 [lustre]
           [<ffffffffa10c4301>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
           [<ffffffffa0f783c5>] ? vvp_io_write_lock+0x35/0x40 [lustre]
           [<ffffffffa06e03d3>] ? cl_io_lock+0x63/0x560 [obdclass]
           [<ffffffffa10bedf8>] ? libcfs_log_return+0x28/0x40 [libcfs]
           [<ffffffffa0f4a2bf>] ? ll_cl_init+0x38f/0x570 [lustre]
           [<ffffffffa0f4a6e3>] ? ll_prepare_write+0x53/0x1a0 [lustre]
           [<ffffffffa0f62c0e>] ? ll_write_begin+0x7e/0x1a0 [lustre]
           [<ffffffff8111346c>] ? pagecache_write_begin+0x1c/0x20
           [<ffffffff8135ef05>] ? do_lo_send_aops+0x155/0x1a0
           [<ffffffff8135f15b>] ? do_bio_filebacked+0x20b/0x2d0
           [<ffffffff8135edb0>] ? do_lo_send_aops+0x0/0x1a0
           [<ffffffff8135f2f1>] ? loop_thread+0xd1/0x270
           [<ffffffff81092180>] ? autoremove_wake_function+0x0/0x40
           [<ffffffff8135f220>] ? loop_thread+0x0/0x270
           [<ffffffff81091e16>] ? kthread+0x96/0xa0
           [<ffffffff8100c14a>] ? child_rip+0xa/0x20
           [<ffffffff81091d80>] ? kthread+0x0/0xa0
           [<ffffffff8100c140>] ? child_rip+0x0/0x20
          

          Actually, sometimes tot_nrsegs can be zero (see ll_cl_init()), so the newly added assertion isn't correct in that case.

          jhammond John Hammond added a comment -

          I'm hitting this reliably using 2.3.62-47-g3a0bb68 (which includes the patch here) on my home setup. The easiest way for me to reproduce it is to run sanity 54c in a loop. Maybe this should be reopened.


          niu Niu Yawei (Inactive) added a comment -

          Patch landed for 2.4.

          jay Jinshan Xiong (Inactive) added a comment -

          Niu, you're right about this. Fake layout changes can occur now that layout swapping has been introduced, and a rename can cause this because it revokes the full ibits lock.

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 10
