lov_io.c:222:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 6194

    Description

      Hit this crash while running sanityn test 12:

      [45366.990436] Lustre: DEBUG MARKER: == sanityn test 12: test lock ordering (link, stat, unlink) ============= 10:01:13 (1358607673)
      [45367.146837] Lustre: DEBUG MARKER: start dir: /mnt/lustre/lockdir=144115205255725085 file: /mnt/lustre/lockdir/lockfile=144115205255725083
      [45445.727378] LustreError: 20512:0:(lov_io.c:222:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed: 
      [45445.728276] LustreError: 20512:0:(lov_io.c:222:lov_sub_get()) LBUG
      [45445.728751] Pid: 20512, comm: statmany
      [45445.729089] 
      [45445.729090] Call Trace:
      [45445.729696]  [<ffffffffa0a7a915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [45445.730184]  [<ffffffffa0a7af17>] lbug_with_loc+0x47/0xb0 [libcfs]
      [45445.730673]  [<ffffffffa14041ba>] lov_sub_get+0x4aa/0x690 [lov]
      [45445.731129]  [<ffffffffa1400132>] lov_sublock_env_get+0xd2/0x140 [lov]
      [45445.731612]  [<ffffffffa1401841>] lov_sublock_alloc+0xf1/0x450 [lov]
      [45445.732086]  [<ffffffffa1402a2c>] lov_lock_init_raid0+0x3ec/0xe50 [lov]
      [45445.732632]  [<ffffffffa13fa2ae>] lov_lock_init+0x1e/0x60 [lov]
      [45445.733135]  [<ffffffffa0bec52c>] cl_lock_hold_mutex+0x34c/0x660 [obdclass]
      [45445.733652]  [<ffffffffa0bec9a2>] cl_lock_request+0x62/0x270 [obdclass]
      [45445.734143]  [<ffffffffa090b1f9>] cl_glimpse_lock+0x179/0x480 [lustre]
      [45445.734635]  [<ffffffffa090ba65>] cl_glimpse_size0+0x1a5/0x1d0 [lustre]
      [45445.735142]  [<ffffffffa08c5298>] ll_inode_revalidate_it+0x198/0x1c0 [lustre]
      [45445.735748]  [<ffffffffa08c5309>] ll_getattr_it+0x49/0x170 [lustre]
      [45445.736230]  [<ffffffffa08c5467>] ll_getattr+0x37/0x40 [lustre]
      [45445.736719]  [<ffffffff81214fa3>] ? security_inode_getattr+0x23/0x30
      [45445.737183]  [<ffffffff81180891>] vfs_getattr+0x51/0x80
      [45445.737613]  [<ffffffff810385d8>] ? pvclock_clocksource_read+0x58/0xd0
      [45445.738070]  [<ffffffff81180920>] vfs_fstatat+0x60/0x80
      [45445.738508]  [<ffffffff81180a6b>] vfs_stat+0x1b/0x20
      [45445.738919]  [<ffffffff81180a94>] sys_newstat+0x24/0x50
      [45445.739332]  [<ffffffff8109b05a>] ? do_gettimeofday+0x1a/0x50
      [45445.740967]  [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
      [45445.741415]  [<ffffffff814fb72e>] ? do_device_not_available+0xe/0x10
      [45445.741875]  [<ffffffff8100befb>] ? device_not_available+0x1b/0x20
      [45445.742317]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      [45445.742763] 
      [45445.744348] Kernel panic - not syncing: LBUG
      

      Crashdump is in /exports/crashdumps/192.168.10.217-2013-01-19-10\:02\:36

          Activity

            [LU-2652] lov_io.c:222:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            The TGCC site hits the same LBUG while running an MPI-IO application on 2.1.3 (LU-3524). I mean the LBUG from this ticket's title, "(lov_io.c:212:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed". Given that, can we really be sure that this particular problem is layout related?

            jay Jinshan Xiong (Inactive) added a comment -

            Duplicate of LU-2766: the same issue with a different symptom.

            jhammond John Hammond added a comment -

            Unfortunately this is still present in 2.4. Oleg's description matches up very well with what I'm seeing on 2.4 and 2.5, so reopening seems best. Based on stack traces and the shared reproducer, I assume this is really the same issue as LU-2766.

            adilger Andreas Dilger added a comment -

            Do you think this is the same bug as was previously being hit here, or a different bug with the same symptom (this wasn't hit with racer, and was apparently fixed by landing the patch)? Reopening this bug (which is marked a blocker) that was fixed and landed for 2.4.0 complicates the tracking process.

            It might be better to close this one again and open a separate bug related to racer + layout swap. While the symptom (LASSERT()) is serious, the stress of swapping the layout of two files in a loop is very unlikely to be seen in real life, so I won't rank this as a blocker.

            jhammond John Hammond added a comment -

            Saw this running racer today on 2.4.51-3-g9f5eea8. Reproduces much more readily if you do:

            # llmount.sh
            # cd /mnt/lustre
            # touch 0 1
            # while true; do
                lfs swap_layouts $((RANDOM % 2)) $((RANDOM % 2))
            done &
            # while true; do
                lfs swap_layouts $((RANDOM % 2)) $((RANDOM % 2))
            done &
            

            The same is true of LU-2766.
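
The reproducer above runs two unbounded loops by hand. A minimal sketch of a time-bounded variant follows, for anyone who wants to script it. It assumes a test filesystem that is already mounted (for example by llmount.sh at /mnt/lustre, as in the comment). The names SWAP_CMD, DURATION, swap_loop and run_reproducer are hypothetical helpers introduced here and are not part of Lustre or of the original comment.

```shell
#!/usr/bin/env bash
# Time-bounded sketch of the swap_layouts race from the comment above.
# SWAP_CMD and DURATION are hypothetical knobs, not Lustre settings.

SWAP_CMD=${SWAP_CMD:-"lfs swap_layouts"}   # command taken from the ticket
DURATION=${DURATION:-60}                   # seconds to keep racing

# Swap the layouts of two randomly chosen files (named 0 and 1) until the
# deadline (seconds since the epoch) passes. Failures are ignored, as in
# the original loops.
swap_loop() {
    while [ "$(date +%s)" -lt "$1" ]; do
        $SWAP_CMD "$((RANDOM % 2))" "$((RANDOM % 2))" || true
    done
}

# Create the two files in the given directory and run two swap loops
# concurrently. The body runs in a subshell so the caller's working
# directory is left unchanged.
run_reproducer() (
    cd "$1" || exit 1
    touch 0 1
    deadline=$(( $(date +%s) + DURATION ))
    swap_loop "$deadline" &
    swap_loop "$deadline" &
    wait
)
```

After sourcing the file, something like `DURATION=300 run_reproducer /mnt/lustre` would race the two loops for about five minutes. Whether that is long enough to hit the assertion is not something the ticket states.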

            green Oleg Drokin added a comment -

            Patches landed, issue no longer visible

            jay Jinshan Xiong (Inactive) added a comment - - edited

            Patches are at http://review.whamcloud.com/5157, http://review.whamcloud.com/5158, and http://review.whamcloud.com/5159. Oleg is verifying them.

            People

              jay Jinshan Xiong (Inactive)
              green Oleg Drokin