lod_load_striping()) ASSERTION( lo->ldo_stripenr == 0 ) failed

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.4.0
    • Lustre 2.4.0, Lustre 2.5.0
    • 3
    • 6755

    Description

      This could probably be reproduced by racer given enough runs, but I can reproduce it as follows:

      # llmount.sh
      # mount n@tcp:/lustre /mnt/lustre2 -t lustre
      # (cd /mnt/lustre; while true; do lfs setstripe -c 1 f0; done) &
      # (cd /mnt/lustre2; while true; do mv f0 f1; done) &
      
      Message from syslogd@n at Feb  8 15:36:51 ...
       kernel:LustreError: 3186:0:(lod_lov.c:782:lod_load_striping()) ASSERTION( lo->ldo_stripe[i] ) failed: stripe 0 is NULL
      
      Message from syslogd@n at Feb  8 15:36:51 ...
       kernel:LustreError: 3186:0:(lod_lov.c:782:lod_load_striping()) LBUG
      
      Message from syslogd@n at Feb  8 15:36:51 ...
       kernel:Kernel panic - not syncing: LBUG
      

      Here is the crash dump for the rename handler:

      crash> bt -l
      PID: 13628  TASK: ffff8800a98a1540  CPU: 1   COMMAND: "mdt00_001"
       #0 [ffff8800a98a3828] machine_kexec at ffffffff81031f7b
          /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
       #1 [ffff8800a98a3888] crash_kexec at ffffffff810b8c22
          /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/kernel/kexec.c: 1106
       #2 [ffff8800a98a3958] panic at ffffffff814e9818
          /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/kernel/panic.c: 103
       #3 [ffff8800a98a39d8] lbug_with_loc at ffffffffa0595eeb [libcfs]
          /root/lustre-release/libcfs/libcfs/linux/linux-debug.c: 188
       #4 [ffff8800a98a39f8] lod_load_striping at ffffffffa0e199f3 [lod]
          /root/lustre-release/lustre/lod/lod_internal.h: 255
       #5 [ffff8800a98a3a38] lod_declare_attr_set at ffffffffa0e25fbb [lod]
          /root/lustre-release/lustre/lod/lod_object.c: 300
       #6 [ffff8800a98a3a88] mdd_rename at ffffffffa0beb6d8 [mdd]
          /root/lustre-release/lustre/mdd/mdd_dir.c: 2087
       #7 [ffff8800a98a3ba8] mdt_reint_rename at ffffffffa0d54617 [mdt]
          /root/lustre-release/lustre/mdt/mdt_reint.c: 1270
       #8 [ffff8800a98a3cc8] mdt_reint_rec at ffffffffa0d506b1 [mdt]
          /root/lustre-release/libcfs/include/libcfs/libcfs_debug.h: 211
       #9 [ffff8800a98a3ce8] mdt_reint_internal at ffffffffa0d49d13 [mdt]
          /root/lustre-release/libcfs/include/libcfs/libcfs_debug.h: 211
      #10 [ffff8800a98a3d28] mdt_reint at ffffffffa0d4a044 [mdt]
          /root/lustre-release/lustre/mdt/mdt_handler.c: 1818
      #11 [ffff8800a98a3d48] mdt_handle_common at ffffffffa0d3afb8 [mdt]
          /root/lustre-release/lustre/mdt/mdt_handler.c: 2981
      #12 [ffff8800a98a3d98] mds_regular_handle at ffffffffa0d725f5 [mdt]
          /root/lustre-release/lustre/mdt/mdt_mds.c: 354
      #13 [ffff8800a98a3da8] ptlrpc_server_handle_request at ffffffffa08e9c7c [ptlrpc]
          /root/lustre-release/lustre/include/lustre_net.h: 2771
      #14 [ffff8800a98a3ea8] ptlrpc_main at ffffffffa08eb1c6 [ptlrpc]
          /root/lustre-release/lustre/ptlrpc/service.c: 2487
      #15 [ffff8800a98a3f48] kernel_thread at ffffffff8100c0ca
          /usr/src/debug///////kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/arch/x86/kernel/entry_64.S: 1213
      

      lfs setstripe is in ioctl(), and its mdt_reint_open() handler is in:

      mdt_reint_open()
       ...
        mdt_create_data()
         ...
          lod_declare_xattr_set()
           ...
            osp_precreate_reserve()
      

          Activity

            [LU-2789] lod_load_striping()) ASSERTION( lo->ldo_stripenr == 0 ) failed
            haasken Ryan Haasken added a comment -

            I think that http://review.whamcloud.com/#/c/7919 resolves the race condition in the unlink and rename paths, but isn't http://review.whamcloud.com/#/c/7223/3 still necessary for the setattr case? After 7223 is landed, will this bug be ready to close?

            7223 was rejected on the basis that the mdt layer should not look into lod internals, but doesn't 7919 do the same thing? Can we land 7223 as a short-term solution for the setattr case?

            spitzcor Cory Spitz added a comment -

            The fix version for this bug should probably be 2.6 (and 2.5.1 at least).

            haasken Ryan Haasken added a comment -

            The patch for LU-4083 (7919) looks related to this issue, but LU-4083 is in the rename and unlink paths. This bug is in the setattr path. Since http://review.whamcloud.com/#/c/5839/ was abandoned and obsoleted by 7919, does that imply that 7919 fixes this bug? It doesn't look like it covers the setattr path.

            I think Patrick's patch at http://review.whamcloud.com/#/c/7223/ attempts to fix the setattr path. The complaint on that patch is that it looks into LOD internals. However, http://review.whamcloud.com/#/c/7919/4 does the same thing, and that has landed. What is preventing Patrick's change from landing?

            In Patrick's patch (7223), would it be preferable to lock mot_lov_mutex in mdt_reint_setattr around the call to mdt_attr_set? That would match the way patch 7919 locks the mutex in mdt_reint_unlink around the call to mdo_unlink. I'm not sure if that would be correct though.

            Please share your thoughts on the relationship between patches 5839, 7919, and 7223. Thanks!
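The locking pattern raised in this comment can be modeled outside the kernel. The following is only a userspace sketch under stated assumptions: `demo_obj`, `demo_declare`, `demo_execute`, and `demo_setattr_locked` are hypothetical names, a pthread mutex stands in for mot_lov_mutex, and none of this is the code from patches 7223 or 7919. It shows one mutex held across both the declare and the execute step, so striping cannot change between them.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for an MDT object; not a Lustre structure. */
struct demo_obj {
    pthread_mutex_t lov_mutex; /* plays the role of mot_lov_mutex */
    int stripenr;              /* 0 means "no striping yet" */
};

#define DEMO_OBJ_INIT { PTHREAD_MUTEX_INITIALIZER, 0 }

/* Declare phase: record the striping state that execute will rely on. */
static int demo_declare(const struct demo_obj *o)
{
    return o->stripenr;
}

/* Execute phase: fails if striping changed since the declare phase. */
static int demo_execute(struct demo_obj *o, int declared, int new_stripenr)
{
    assert(new_stripenr >= 0);
    if (o->stripenr != declared)
        return -1; /* where a kernel assertion would fire */
    if (new_stripenr > 0 && o->stripenr == 0)
        o->stripenr = new_stripenr;
    return 0;
}

/* Hold the mutex around the whole declare/execute chain. */
static int demo_setattr_locked(struct demo_obj *o, int new_stripenr)
{
    int declared, rc;

    pthread_mutex_lock(&o->lov_mutex);
    declared = demo_declare(o);
    rc = demo_execute(o, declared, new_stripenr);
    pthread_mutex_unlock(&o->lov_mutex);
    return rc;
}
```

Calling `demo_setattr_locked(&obj, 1)` on an object initialized with `DEMO_OBJ_INIT` sets one stripe; a later call with 0 leaves the striping unchanged. Whether the MDT layer should hold the mutex this widely is the question the comment leaves open.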

            haasken Ryan Haasken added a comment -

            We've run into a failed assertion added in patch 5839 on a system running Lustre 2.5 with patches 5802, 5839, and 7223 applied. This is the failed assertion:

            LustreError: 3407:0:(lod_object.c:993:lod_striping_create()) ASSERTION( lo->ldo_stripe != ((void *)0) && lo->ldo_stripenr > 0 ) failed:
            
            green Oleg Drokin added a comment -

            There's LU-4083 with a patch that is probably related.

            bzzz Alex Zhuravlev added a comment -

            mot_lov_mutex should solve the race given it protects a chain of declare/execution.

            shadow Alexey Lyashkov added a comment -

            None of these patches solves the whole issue.
            We still have a race when an object create lands between the declare and execute operations.
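The window described here can be replayed deterministically in a few lines. This is a single-threaded sketch, not Lustre code: `race_obj`, `race_txn`, and the `race_*` functions are hypothetical names, and the "concurrent" create is simply called between the two phases to force the interleaving.

```c
#include <assert.h>

/* Hypothetical object: only the stripe count matters for this sketch. */
struct race_obj {
    int stripenr; /* 0 means "no striping yet" */
};

/* What the declare phase remembered about the object. */
struct race_txn {
    int declared_stripenr;
};

/* Declare phase: snapshot the striping state. */
static void race_declare(const struct race_obj *o, struct race_txn *t)
{
    t->declared_stripenr = o->stripenr;
}

/* Models another thread creating the striping (e.g. a setstripe). */
static void race_create_striping(struct race_obj *o, int n)
{
    assert(n > 0);
    if (o->stripenr == 0)
        o->stripenr = n;
}

/* Execute phase: returns -1 if the snapshot from declare is stale. */
static int race_execute(const struct race_obj *o, const struct race_txn *t)
{
    return o->stripenr == t->declared_stripenr ? 0 : -1;
}

/* Replay declare, an optional create in between, then execute. */
static int race_replay(int create_in_between)
{
    struct race_obj o = { 0 };
    struct race_txn t;

    race_declare(&o, &t);
    if (create_in_between)
        race_create_striping(&o, 1);
    return race_execute(&o, &t);
}
```

`race_replay(0)` returns 0, while `race_replay(1)` returns -1 because the execute phase sees a stripe count that the declare phase never observed. In the kernel the equivalent mismatch would trip an assertion rather than return an error.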

            paf Patrick Farrell (Inactive) added a comment - edited

            I tested and confirmed John's assertion that taking lov_mutex resolves the race condition.

            Patch for the last race condition is here. Verified to resolve the issue, and I don't think the condition for taking the lock can be any more specific to chown:
            http://review.whamcloud.com/7223

            paf Patrick Farrell (Inactive) added a comment -

            John,

            We've recently run into the osp_sync_add_rec assertion on a system with 5802 and 5839. Sounds like you expected that, but I just wanted to let you know.
            spitzcor Cory Spitz added a comment -

            John, not exactly production, but a small test system. Upon initial look, it seemed that #5839 could solve/fix our problem.

            People

              jhammond John Hammond
              jhammond John Hammond