
osp_object_assign_fid()) ASSERTION( fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)) ) failed:

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.4.0
    • 3
    • 6797

    Description

      Racy reproducer:

      # llmount.sh
      # cd /mnt/lustre
      # while true; do lfs setstripe -c 1 f0; rm f0; done &
      [1] 3814
      # while true; do truncate --size=1 f0; done
      cannot truncate `f0' to length 21705: No such file or directory
      cannot truncate `f0' to length 17399: No such file or directory
      cannot truncate `f0' to length 18024: No such file or directory
      cannot truncate `f0' to length 25593: No such file or directory
      cannot truncate `f0' to length 19126: No such file or directory
      cannot truncate `f0' to length 29680: No such file or directory
      cannot truncate `f0' to length 14928: No such file or directory
      cannot truncate `f0' to length 23877: No such file or directory
      cannot truncate `f0' to length 6911: No such file or directory
      cannot truncate `f0' to length 868: No such file or directory
      cannot truncate `f0' to length 791: No such file or directory
      cannot truncate `f0' to length 28593: No such file or directory
      cannot truncate `f0' to length 7330: No such file or directory
      cannot truncate `f0' to length 9708: No such file or directory
      
      Message from syslogd@m at Feb 13 10:29:39 ...
       kernel:LustreError: 3088:0:(osp_object.c:56:osp_object_assign_fid()) ASSERTION( fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)) ) failed:
      
      Message from syslogd@m at Feb 13 10:29:39 ...
       kernel:LustreError: 3088:0:(osp_object.c:56:osp_object_assign_fid()) LBUG
      
      Message from syslogd@m at Feb 13 10:29:39 ...
       kernel:Kernel panic - not syncing: LBUG
      
      crash> bt
      PID: 31940  TASK: ffff8801663b5500  CPU: 0   COMMAND: "mdt00_001"
       #0 [ffff8801663b78e8] machine_kexec at ffffffff81031f7b
       #1 [ffff8801663b7948] crash_kexec at ffffffff810b8c22
       #2 [ffff8801663b7a18] panic at ffffffff814eae18
       #3 [ffff8801663b7a98] lbug_with_loc at ffffffffa0ef3eeb [libcfs]
       #4 [ffff8801663b7ab8] osp_object_assign_fid at ffffffffa0b98942 [osp]
       #5 [ffff8801663b7ae8] osp_declare_attr_set at ffffffffa0b98b11 [osp]
       #6 [ffff8801663b7b38] lod_declare_attr_set at ffffffffa0b68083 [lod]
       #7 [ffff8801663b7b88] mdd_attr_set at ffffffffa051f5e9 [mdd]
       #8 [ffff8801663b7c08] mdt_attr_set at ffffffffa0a4deb8 [mdt]
       #9 [ffff8801663b7c58] mdt_reint_setattr at ffffffffa0a4e7ad [mdt]
      #10 [ffff8801663b7cc8] mdt_reint_rec at ffffffffa0a486b1 [mdt]
      #11 [ffff8801663b7ce8] mdt_reint_internal at ffffffffa0a41d13 [mdt]
      #12 [ffff8801663b7d28] mdt_reint at ffffffffa0a42044 [mdt]
      #13 [ffff8801663b7d48] mdt_handle_common at ffffffffa0a32fb8 [mdt]
      #14 [ffff8801663b7d98] mds_regular_handle at ffffffffa0a6a5c5 [mdt]
      #15 [ffff8801663b7da8] ptlrpc_server_handle_request at ffffffffa062c00c [ptlrpc]
      #16 [ffff8801663b7ea8] ptlrpc_main at ffffffffa062d556 [ptlrpc]
      #17 [ffff8801663b7f48] kernel_thread at ffffffff8100c0ca
      

      A deterministic reproducer is attached. It does the following:

      fd1 = open("f0", O_RDWR|O_CREAT|O_LOV_DELAY_CREATE, 0666);
      fd2 = open("f0", O_RDWR|O_CREAT, 0666);
      ftruncate(fd2, 1);
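
For readers without the attachment, the three calls above can be wrapped into a small function along these lines. This is only a sketch, not the attached reproducer: the function name `lu2808_reproduce` is hypothetical, and the fallback value for `O_LOV_DELAY_CREATE` is an assumption based on the Lustre user headers of that era (prefer the definition from the installed Lustre headers).

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Assumption: fallback value as defined in Lustre's user headers around 2.4;
 * use the definition from the installed Lustre headers when available. */
#ifndef O_LOV_DELAY_CREATE
#define O_LOV_DELAY_CREATE 0100000000
#endif

/* Hypothetical helper: performs the open/open/ftruncate sequence on path.
 * Returns 0 if all calls succeed, otherwise -errno of the first failure. */
int lu2808_reproduce(const char *path)
{
	int fd1, fd2, rc = 0;

	/* First open delays creation of the OST objects (no striping yet). */
	fd1 = open(path, O_RDWR | O_CREAT | O_LOV_DELAY_CREATE, 0666);
	if (fd1 < 0)
		return -errno;

	/* Second open is a plain open, which instantiates the striping. */
	fd2 = open(path, O_RDWR | O_CREAT, 0666);
	if (fd2 < 0) {
		rc = -errno;
		close(fd1);
		return rc;
	}

	/* A non-zero truncate through fd2 is the step reported to hit the LBUG. */
	if (ftruncate(fd2, 1) < 0)
		rc = -errno;

	close(fd2);
	close(fd1);
	return rc;
}
```

Calling `lu2808_reproduce("f0")` from a `main()` with the working directory on a Lustre mount mirrors the sequence described here. On a non-Lustre filesystem the extra flag should simply be ignored and the calls succeed.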
      

      This seems to be independent of LU-2523. I tried this with and without the patch from LU-2523 and got the same result.

      Note that there is no LBUG if the call to ftruncate() uses fd1 instead of fd2, or if it uses a zero length. There is also no LBUG if truncate("f0", 1) is used.

      Interestingly, if MOUNT_2=yes is used and the first open is of "/mnt/lustre/f0" and the second of "/mnt/lustre2/f0", then there is no LBUG. However, although the setstripe ioctl() appears to succeed, the default striping is in fact applied to the file.

          Activity

            [LU-2808] osp_object_assign_fid()) ASSERTION( fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)) ) failed:

            adilger Andreas Dilger added a comment - John, are you going to rework your patch based on Alex's comment, or does Alex need to work on that?

            bzzz Alex Zhuravlev added a comment - I think a few paths (including open/w and setattr) should check whether the striping has already been created and, if not, create it while holding an exclusive lock. Later we'll wrap all of this in the appropriate layout lock.
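
The check-then-create pattern suggested in this comment can be illustrated with a userspace toy model. Nothing here is Lustre code: `struct toy_file`, `toy_ensure_striping()` and the pthread mutex are hypothetical stand-ins for the MDS object and whatever exclusive lock the real code would take.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy stand-in for a file object on the MDS (hypothetical, not a Lustre type). */
struct toy_file {
	pthread_mutex_t lock;      /* models the exclusive lock */
	bool striped;              /* has the striping been created? */
	int stripe_creations;      /* how many times striping was created */
};

/* Every path that needs striping (open for write, setattr with a size, ...)
 * would call this first. The check and the creation happen under the same
 * lock, so two racing callers cannot both create the striping. */
void toy_ensure_striping(struct toy_file *f)
{
	pthread_mutex_lock(&f->lock);
	if (!f->striped) {
		f->stripe_creations++;   /* stands in for allocating OST objects */
		f->striped = true;
	}
	pthread_mutex_unlock(&f->lock);
}
```

With this shape, a second caller that arrives after the striping exists takes the lock, sees `striped` set, and does nothing.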

            bzzz Alex Zhuravlev added a comment - frankly, this doesn't look like a solution.
            jhammond John Hammond added a comment - I admit that reproducer seems pretty unlikely. To me the upshot of fixing this is that truncate()/ftruncate() could probably be added to racer.sh. It won't hurt my feelings if this gets dropped from the blocker list.

            adilger Andreas Dilger added a comment - I agree that fixing the LASSERT() is the first priority. I don't think this is such a critical problem that is going to be hit by normal usage. However, if the truncate is sent to the MDS, but is then ignored, the client cannot send the RPC to the OSTs. Either the MDS needs to proxy the RPC for the client, or return some error code to the client (-EAGAIN?) so that it will get the layout and send the RPC on to the OSTs.
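
The second option floated here (the MDS returns something like -EAGAIN and the client then fetches the layout and goes to the OSTs) could look roughly like the toy model below. All names are hypothetical, the choice of -EAGAIN is only the tentative suggestion from the comment, and no real RPCs are involved.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy client state (hypothetical, not a Lustre structure). */
struct toy_client {
	bool has_layout;       /* does the client know the file's layout? */
	int ost_truncates;     /* truncates sent directly to the OSTs */
};

/* Toy MDS: refuses to set the size of an already striped file and asks
 * the client to retry via the OSTs instead of hitting an assertion. */
static int toy_mds_setattr_size(bool file_is_striped)
{
	return file_is_striped ? -EAGAIN : 0;
}

/* Toy client truncate: without a layout it asks the MDS first; on -EAGAIN
 * it refreshes the layout and sends the truncate to the OSTs. */
int toy_client_truncate(struct toy_client *c, bool file_is_striped)
{
	if (!c->has_layout) {
		int rc = toy_mds_setattr_size(file_is_striped);

		if (rc != -EAGAIN)
			return rc;          /* MDS handled it, or a real error */
		c->has_layout = true;       /* stands in for fetching the layout */
	}
	c->ost_truncates++;                 /* stands in for the OST RPC */
	return 0;
}
```

The point of the sketch is only the control flow: the MDS never silently ignores the size change, and the client always ends up either done or talking to the OSTs.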
            jhammond John Hammond added a comment - Please see http://review.whamcloud.com/5473 .
            jhammond John Hammond added a comment -

            In this case the striping has already been created (by the second open()) when ftruncate() is called. The problem seems to be that either the lsm is not returned to the client after the second open(), or the client ignores it. Then in ftruncate() the client does not have an lsm for the file, so it sends a setattr with the new size to the MDS. So you have the MDS setting the size of an already striped file, which triggers the LASSERT() in osp_declare_attr_set()/osp_object_assign_fid(). It seems like the easiest fix here would be to check for an already assigned fid:

            @@ -53,7 +53,9 @@ static void osp_object_assign_fid(const struct lu_env *env,
             {
                    struct osp_thread_info *osi = osp_env_info(env);
             
            -       LASSERT(fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)));
            +       if (!fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)))
            +               return;
            +
                    LASSERT(o->opo_reserved);
                    o->opo_reserved = 0;
            
            

            Please correct me if I'm wrong.
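
The behavioural change in the diff above can be tried out in a userspace toy model. This is a sketch under stated assumptions: `struct toy_fid`, `struct toy_osp_object` and `toy_assign_fid()` are hypothetical simplifications, not the real Lustre types, and `assert()` stands in for LASSERT().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the FID and the OSP object (hypothetical types). */
struct toy_fid {
	uint64_t seq;
	uint32_t oid;
};

struct toy_osp_object {
	struct toy_fid fid;
	bool reserved;
};

static bool toy_fid_is_zero(const struct toy_fid *f)
{
	return f->seq == 0 && f->oid == 0;
}

/* Mirrors the patched logic: an object that already has a FID is left
 * alone instead of tripping an assertion. Returns true only when a new
 * FID was actually assigned. */
bool toy_assign_fid(struct toy_osp_object *o, struct toy_fid next)
{
	if (!toy_fid_is_zero(&o->fid))
		return false;        /* already assigned: nothing to do */

	assert(o->reserved);         /* models the remaining LASSERT() */
	o->reserved = false;
	o->fid = next;
	return true;
}
```

In this model a second call on the same object is a no-op, which matches the intent of the early return. As the comment thread notes, this only avoids the assertion and does not by itself address where the truncate should be sent.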

            adilger Andreas Dilger added a comment - Alex, I agree. Not a critical issue, but it would be nice to get this fixed for the final 2.4 release.

            bzzz Alex Zhuravlev added a comment - We have discussed a few times that it would be better for truncate to create the striping if it does not exist yet. This would reduce the number of cases to handle (for LU-2794, I think) and fix one outstanding issue.

            People

              jhammond John Hammond
              jhammond John Hammond
              Votes: 0
              Watchers: 4
