Lustre / LU-2910

racer test_1, sanity test_54c: ASSERTION( cio->cui_tot_nrsegs >= cio->cui_nrsegs ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.4.0

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/3fa2a912-85b2-11e2-9f8d-52540035b04c.

      The sub-test test_1 failed with the following error:

      test failed to respond and timed out

      client console shows:

      18:47:35:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-22-ib,client-23-ib.lab.whamcloud.com DURATION=900 == 18:47:28 (1362451648)
      18:47:35:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 
      18:47:35:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
      18:51:19:LustreError: 8649:0:(file.c:2601:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000bd0:0x9095:0x0] error: rc = -116
      18:56:24:LustreError: 28058:0:(lcommon_cl.c:821:ccc_io_advance()) ASSERTION( cio->cui_tot_nrsegs >= cio->cui_nrsegs ) failed: 
      18:56:24:LustreError: 28058:0:(lcommon_cl.c:821:ccc_io_advance()) LBUG
      18:56:24:Pid: 28058, comm: cat
      18:56:24:
      18:56:24:Call Trace:
      18:56:24: [<ffffffffa03fe895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      18:56:24: [<ffffffffa03fee97>] lbug_with_loc+0x47/0xb0 [libcfs]
      18:56:24: [<ffffffffa1264c0e>] ccc_io_advance+0x11e/0x190 [lustre]
      18:56:24: [<ffffffffa05a96d6>] cl_io_rw_advance+0x66/0x150 [obdclass]
      18:56:24: [<ffffffffa05ae602>] cl_io_loop+0xe2/0x1b0 [obdclass]
      18:56:24: [<ffffffffa1215323>] ll_file_io_generic+0x223/0x570 [lustre]
      18:56:24: [<ffffffffa12157af>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      18:56:25: [<ffffffffa121635c>] ll_file_read+0x16c/0x2a0 [lustre]
      18:56:25: [<ffffffff81176cb5>] vfs_read+0xb5/0x1a0
      18:56:25: [<ffffffff81176df1>] sys_read+0x51/0x90
      18:56:25: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      18:56:25:
      18:56:25:Kernel panic - not syncing: LBUG
      18:56:25:Pid: 28058, comm: cat Not tainted 2.6.32-279.19.1.el6.x86_64 #1
      18:56:25:Call Trace:
      18:56:25: [<ffffffff814e9541>] ? panic+0xa0/0x168
      18:56:25: [<ffffffffa03feeeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      18:56:25: [<ffffffffa1264c0e>] ? ccc_io_advance+0x11e/0x190 [lustre]
      18:56:25: [<ffffffffa05a96d6>] ? cl_io_rw_advance+0x66/0x150 [obdclass]
      18:56:25: [<ffffffffa05ae602>] ? cl_io_loop+0xe2/0x1b0 [obdclass]
      18:56:25: [<ffffffffa1215323>] ? ll_file_io_generic+0x223/0x570 [lustre]
      18:56:25: [<ffffffffa12157af>] ? ll_file_aio_read+0x13f/0x2c0 [lustre]
      18:56:25: [<ffffffffa121635c>] ? ll_file_read+0x16c/0x2a0 [lustre]
      18:56:25: [<ffffffff81176cb5>] ? vfs_read+0xb5/0x1a0
      18:56:25: [<ffffffff81176df1>] ? sys_read+0x51/0x90
      18:56:26: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      18:56:26:Initializing cgroup subsys cpuset
      18:56:26:Initializing cgroup subsys cpu
      

      Activity
          mdiep Minh Diep added a comment - Hit this: https://maloo.whamcloud.com/test_sets/7e10a3f6-8e87-11e2-a173-52540035b04c

          adilger Andreas Dilger added a comment - Increasing priority, since this is causing almost every test to fail.

          niu Niu Yawei (Inactive) added a comment - Skip iov update when tot_nrsegs is zero: http://review.whamcloud.com/5747
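          As a minimal user-space sketch of what the proposed guard does (a hypothetical simplification of the ccc_io_advance() accounting, not the actual Lustre code): when tot_nrsegs is zero, the iov bookkeeping is skipped entirely, so the `tot_nrsegs >= nrsegs` assertion can no longer trip on that path.

          ```c
          #include <assert.h>
          #include <stddef.h>

          /* Hypothetical, simplified model of the iov segment accounting in
           * ccc_io_advance(); field names mirror the assertion in the report,
           * but this is NOT the real Lustre structure. */
          struct cio_model {
              size_t cui_tot_nrsegs; /* segments remaining in the whole IO  */
              size_t cui_nrsegs;     /* segments consumed by this iteration */
          };

          /* Advance the iov bookkeeping after one IO iteration.  The guard
           * models http://review.whamcloud.com/5747: when cui_tot_nrsegs is
           * zero (e.g. an internal IO set up without a user iov), skip the
           * update instead of asserting. */
          static void io_advance(struct cio_model *cio)
          {
              if (cio->cui_tot_nrsegs == 0)
                  return;                      /* nothing to account against */

              assert(cio->cui_tot_nrsegs >= cio->cui_nrsegs);
              cio->cui_tot_nrsegs -= cio->cui_nrsegs;
              cio->cui_nrsegs = 0;
          }

          int main(void)
          {
              /* Normal case: 4 total segments, 1 consumed this round. */
              struct cio_model a = { .cui_tot_nrsegs = 4, .cui_nrsegs = 1 };
              io_advance(&a);
              assert(a.cui_tot_nrsegs == 3);

              /* The crashing case from the log: tot_nrsegs 0, nrsegs 1.
               * With the guard this is a no-op instead of an LBUG. */
              struct cio_model b = { .cui_tot_nrsegs = 0, .cui_nrsegs = 1 };
              io_advance(&b);
              assert(b.cui_tot_nrsegs == 0);
              return 0;
          }
          ```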

          niu Niu Yawei (Inactive) added a comment -

          What I saw is another LASSERT, newly added by the patch:

          LustreError: 31446:0:(lcommon_cl.c:794:ccc_io_update_iov()) ASSERTION( cio->cui_tot_nrsegs >= cio->cui_nrsegs ) failed: tot_nrsegs: 0, nrsegs: 1
          LustreError: 31446:0:(lcommon_cl.c:794:ccc_io_update_iov()) LBUG
          Kernel panic - not syncing: LBUG
          Pid: 31446, comm: loop3 Tainted: P           ---------------    2.6.32 #1
          Call Trace:
           [<ffffffff814fe08e>] ? panic+0xa0/0x168
           [<ffffffffa10b3f5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
           [<ffffffffa0f6e569>] ? ccc_io_update_iov+0xe9/0xf0 [lustre]
           [<ffffffffa0f77d8c>] ? vvp_io_rw_lock+0x8c/0x690 [lustre]
           [<ffffffffa10c4301>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
           [<ffffffffa0f783c5>] ? vvp_io_write_lock+0x35/0x40 [lustre]
           [<ffffffffa06e03d3>] ? cl_io_lock+0x63/0x560 [obdclass]
           [<ffffffffa10bedf8>] ? libcfs_log_return+0x28/0x40 [libcfs]
           [<ffffffffa0f4a2bf>] ? ll_cl_init+0x38f/0x570 [lustre]
           [<ffffffffa0f4a6e3>] ? ll_prepare_write+0x53/0x1a0 [lustre]
           [<ffffffffa0f62c0e>] ? ll_write_begin+0x7e/0x1a0 [lustre]
           [<ffffffff8111346c>] ? pagecache_write_begin+0x1c/0x20
           [<ffffffff8135ef05>] ? do_lo_send_aops+0x155/0x1a0
           [<ffffffff8135f15b>] ? do_bio_filebacked+0x20b/0x2d0
           [<ffffffff8135edb0>] ? do_lo_send_aops+0x0/0x1a0
           [<ffffffff8135f2f1>] ? loop_thread+0xd1/0x270
           [<ffffffff81092180>] ? autoremove_wake_function+0x0/0x40
           [<ffffffff8135f220>] ? loop_thread+0x0/0x270
           [<ffffffff81091e16>] ? kthread+0x96/0xa0
           [<ffffffff8100c14a>] ? child_rip+0xa/0x20
           [<ffffffff81091d80>] ? kthread+0x0/0xa0
           [<ffffffff8100c140>] ? child_rip+0x0/0x20

          Actually, tot_nrsegs can sometimes be zero (see ll_cl_init()), so the newly added assert is not correct in that case.
          jhammond John Hammond added a comment - I'm hitting this reliably using 2.3.62-47-g3a0bb68 (which includes the patch here) on my home setup. The easiest way for me to reproduce it is to run sanity 54c in a loop. Maybe this should be reopened.

          niu Niu Yawei (Inactive) added a comment - Patch landed for 2.4.

          jay Jinshan Xiong (Inactive) added a comment - Niu, you're right about this. Fake layout changes can occur now that layout swapping has been introduced, and rename can cause this because it revokes the full ibits lock.

          niu Niu Yawei (Inactive) added a comment - Though it's hard to reproduce, from the code we can see that if the IO restarts, iov_len can be messed up and this LASSERT will eventually be triggered.

          Xiong, if you think a layout change in the racer test is unexpected, then something could be seriously wrong in the layout code, because I see lots of IO restarts in my local testing (without file_swap; most IO restarts come from vvp_io_fini(), some from can_populate_pages()). Could they be caused by rename?
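          To illustrate how a restart can break the invariant (a purely hypothetical model of the cl_io_loop() restart path, not the real Lustre code): if a layout change aborts an iteration after its segments were already subtracted from the running total, and the restart re-stages a segment, the total can end up below the per-iteration count.

          ```c
          #include <assert.h>
          #include <stddef.h>

          /* Hypothetical model: tot models cui_tot_nrsegs, cur models
           * cui_nrsegs.  Names and logic are illustrative only. */
          struct io_model {
              size_t tot;  /* segments remaining for the whole IO    */
              size_t cur;  /* segments staged for this iteration     */
          };

          /* One loop iteration: account the staged segments, then, on a
           * simulated layout change, restart with a segment re-staged even
           * though the total was already decremented for the aborted pass. */
          static int iterate(struct io_model *io, int layout_changed)
          {
              io->tot -= io->cur;          /* advance: account this pass */
              if (layout_changed) {
                  io->cur = 1;             /* restart re-stages one segment */
                  return 1;                /* caller must loop again        */
              }
              io->cur = 0;
              return 0;
          }

          int main(void)
          {
              struct io_model io = { .tot = 1, .cur = 1 };

              /* A restart after the only segment was accounted leaves
               * tot = 0 while cur = 1: the same "tot_nrsegs: 0, nrsegs: 1"
               * state printed by the failed LASSERT above. */
              int again = iterate(&io, /*layout_changed=*/1);
              assert(again == 1);
              assert(io.tot == 0 && io.cur == 1);
              assert(!(io.tot >= io.cur)); /* invariant broken */
              return 0;
          }
          ```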
          sarah Sarah Liu added a comment - Didn't hit it in the latest tag testing; I think this can be dropped to Major.

          jlevi Jodi Levi (Inactive) added a comment - Can this be reduced from a blocker, since it doesn't sound like it can be reproduced?

          jay Jinshan Xiong (Inactive) added a comment - Ah no, I didn't.

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 10

            Dates

              Created:
              Updated:
              Resolved: