[LU-3524] Lustre 2.1.3: lov_io.c:212:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.1.3
    • None
    • 3
    • 8869

    Description

      At the TGCC site, which is currently running Lustre 2.1.3, the customer gets crashes from time to time with the following assertion:

      LustreError: 23580:0:(lov_io.c:212:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed:
      LustreError: 23580:0:(lov_io.c:212:lov_sub_get()) LBUG
      Pid: 23580, comm: IMB-IO
      
      Call Trace:
       [<ffffffffa034d7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa034de07>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa0917a8f>] lov_sub_get+0x47f/0x6f0 [lov]
       [<ffffffffa0913ca2>] lov_sublock_env_get+0xd2/0x140 [lov]
       [<ffffffffa0914e61>] lov_sublock_alloc+0xf1/0x470 [lov]
       [<ffffffffa09162fc>] lov_lock_init_raid0+0x3dc/0xe30 [lov]
       [<ffffffffa090eab4>] lov_lock_init+0x54/0xe0 [lov]
       [<ffffffffa049215c>] cl_lock_hold_mutex+0x37c/0x6b0 [obdclass]
       [<ffffffffa04925ee>] cl_lock_request+0x5e/0x1c0 [obdclass]
       [<ffffffffa09ee9bf>] cl_glimpse_lock+0x16f/0x410 [lustre]
       [<ffffffffa09f2f0a>] ccc_prep_size+0x10a/0x290 [lustre]
       [<ffffffffa09f8425>] vvp_io_read_start+0xb5/0x3e0 [lustre]
       [<ffffffffa04938da>] cl_io_start+0x6a/0x140 [obdclass]
       [<ffffffffa0497bbc>] cl_io_loop+0xcc/0x190 [obdclass]
       [<ffffffffa09a7f07>] ll_file_io_generic+0x3a7/0x560 [lustre]
       [<ffffffffa09a81f9>] ll_file_aio_read+0x139/0x2c0 [lustre]
       [<ffffffffa09a86b9>] ll_file_read+0x169/0x2a0 [lustre]
       [<ffffffff81163a15>] vfs_read+0xb5/0x1a0
       [<ffffffff81163b51>] sys_read+0x51/0x90
       [<ffffffff81487d7e>] ? do_device_not_available+0xe/0x10
       [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      

      After some investigation, this appears to be LU-2652, so we tried backporting http://review.whamcloud.com/5157, http://review.whamcloud.com/5158 and http://review.whamcloud.com/5159.
      However, the corresponding files have changed a lot since Lustre 2.1 (layout lock), and 33 of the 45 hunks fail to apply.
      Moreover, these 3 patches appear to fix deadlocks introduced by LU-1876 (Layout Lock Server Patch Landings to Master).
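
For readers unfamiliar with the failing check, the invariant named in the assertion message can be sketched as below. This is a simplified model, not the Lustre source: `struct model_lov_io`, `stripe_index_ok()` and `model_sub_get()` are hypothetical names, and only the `lis_stripe_count` field and the comparison `stripe < lio->lis_stripe_count` are taken from the log above. The integer type of the stripe index is an assumption.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-IO state. Only the field named in the
 * assertion message is modeled; this is NOT the real lov structure. */
struct model_lov_io {
    int lis_stripe_count;   /* number of stripes the IO was set up for */
};

/* Non-fatal form of the check: 1 if the stripe index is in range. */
static int stripe_index_ok(const struct model_lov_io *lio, int stripe)
{
    return stripe < lio->lis_stripe_count;
}

/* Fatal form, loosely mirroring what the log shows: the caller asks for a
 * sub-IO by stripe index and the invariant is asserted first. In this
 * sketch, assert() plays the role of the kernel assertion that LBUGs. */
static int model_sub_get(const struct model_lov_io *lio, int stripe)
{
    assert(stripe < lio->lis_stripe_count);
    return stripe;          /* placeholder for "the sub-IO of this stripe" */
}
```

In this model, a lock request that computes a stripe index from one view of the file layout while the IO was initialized with a smaller stripe count would trip the check, which is consistent with the trace reaching `lov_sub_get()` through `lov_lock_init_raid0()`. Whether that is the actual cause here was not established in this ticket.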

        Activity

          pjones Peter Jones added a comment -

          ok thanks Sebastien


          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          As we are unable to provide the requested information, this ticket can be closed.

          Thank you,
          Sebastien.

          bfaccini Bruno Faccini (Inactive) added a comment -

          To help me dig deeper into this issue, would it be possible to get the full stacks from the crash dump? And possibly more, such as the relevant data structures, if I ask for them later?

          bfaccini Bruno Faccini (Inactive) added a comment -

          In the meantime, on my side, I am investigating the patches from LU-2652/LU-2766 to see whether they are really related.

          louveta Alexandre Louvet (Inactive) added a comment -

          I guess it is a standard IMB-IO, but with a Lustre-aware MPI-IO library. I have asked the end user to provide the details and will keep you updated.

          Alex.

          lustre-bull Lustre Bull added a comment -

          Hi Bruno,

          I don't have any more information about this LBUG. I am forwarding your questions to the Bull support team to get more details.


          bfaccini Bruno Faccini (Inactive) added a comment -

          Patrick,
          Do you know whether the different crashes occurred while running the application/workload?
          Do we have any details on how the "IMB-IO" process/application works, and in particular whether it uses any specific striping settings?
          Also, do you know whether this crash can be reproduced on demand?

          pjones Peter Jones added a comment -

          Bruno is looking into this one


          People

            bfaccini Bruno Faccini (Inactive)
            patrick.valentin Patrick Valentin (Inactive)
            Votes: 0
            Watchers: 6
