Lustre / LU-4363

(llite_lib.c:1683:ll_update_inode()) ASSERTION( lu_fid_eq(&lli->lli_fid, &body->fid1) ) failed

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.6
    • 3
    • 11946

    Description

      Hi,

      At the IFERC customer site, 7 compute nodes crashed with the following message in the console:

      2013-11-21 00:57:45 LustreError: 92325:0:(llite_lib.c:1683:ll_update_inode()) ASSERTION( lu_fid_eq(&lli->lli_fid, &body->fid1) ) failed: Trying to change FID [0x217294ce4:0x107f0:0x0] to the [0x217294ce4:0x107f1:0x0], inode 150634522759727089/35072332(ffff8807dcbf85f8)
      2013-11-21 00:57:45 LustreError: 92325:0:(llite_lib.c:1683:ll_update_inode()) LBUG
      2013-11-21 00:57:45 Pid: 92325, comm: writer_v131
      2013-11-21 00:57:45
      2013-11-21 00:57:45 Call Trace:
      2013-11-21 00:57:45  [<ffffffffa046f7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-11-21 00:57:45  [<ffffffffa046fe07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-11-21 00:57:45  [<ffffffffa0a91ca0>] ll_update_inode+0x4a0/0xf60 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0a928ea>] ll_prep_inode+0x18a/0xae0 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0a7c8c3>] ll_intent_file_open+0x563/0xb80 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0aa6a90>] ? ll_md_blocking_ast+0x0/0x700 [lustre]
      2013-11-21 00:57:45  [<ffffffff8108163e>] ? down+0x2e/0x50
      2013-11-21 00:57:45  [<ffffffffa0a7cf67>] ll_lov_setstripe_ea_info+0x87/0x2b0 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0a831a5>] ll_lov_setstripe+0x85/0x5a0 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0aa3e8b>] ? ll_stats_ops_tally+0x6b/0xd0 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0a84ac6>] ll_file_ioctl+0x826/0xe00 [lustre]
      2013-11-21 00:57:45  [<ffffffff81179ff2>] vfs_ioctl+0x22/0xa0
      2013-11-21 00:57:45  [<ffffffff8117a4ba>] do_vfs_ioctl+0x3aa/0x580
      2013-11-21 00:57:45  [<ffffffff8117a711>] sys_ioctl+0x81/0xa0
      2013-11-21 00:57:45  [<ffffffff8149970e>] ? do_device_not_available+0xe/0x10
      2013-11-21 00:57:45  [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      2013-11-21 00:57:45
      2013-11-21 00:57:45 Kernel panic - not syncing: LBUG
      2013-11-21 00:57:45 Pid: 92325, comm: writer_v131 Tainted: G        W  ---------------    2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
      2013-11-21 00:57:45 Call Trace:
      2013-11-21 00:57:45  [<ffffffff81495fe3>] ? panic+0xa0/0x168
      2013-11-21 00:57:45  [<ffffffffa046fe5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2013-11-21 00:57:45  [<ffffffffa0a91ca0>] ? ll_update_inode+0x4a0/0xf60 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0a928ea>] ? ll_prep_inode+0x18a/0xae0 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0a7c8c3>] ? ll_intent_file_open+0x563/0xb80 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0aa6a90>] ? ll_md_blocking_ast+0x0/0x700 [lustre]
      2013-11-21 00:57:45  [<ffffffff8108163e>] ? down+0x2e/0x50
      2013-11-21 00:57:45  [<ffffffffa0a7cf67>] ? ll_lov_setstripe_ea_info+0x87/0x2b0 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0a831a5>] ? ll_lov_setstripe+0x85/0x5a0 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0aa3e8b>] ? ll_stats_ops_tally+0x6b/0xd0 [lustre]
      2013-11-21 00:57:45  [<ffffffffa0a84ac6>] ? ll_file_ioctl+0x826/0xe00 [lustre]
      2013-11-21 00:57:45  [<ffffffff81179ff2>] ? vfs_ioctl+0x22/0xa0
      2013-11-21 00:57:45  [<ffffffff8117a4ba>] ? do_vfs_ioctl+0x3aa/0x580
      2013-11-21 00:57:45  [<ffffffff8117a711>] ? sys_ioctl+0x81/0xa0
      2013-11-21 00:57:45  [<ffffffff8149970e>] ? do_device_not_available+0xe/0x10
      2013-11-21 00:57:45  [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
      

      This issue looks like LU-2523 and LU-3311, but the patch for b2_1 has not made any progress since July.

      I have tested with the following reproducer, given in LU-2523:

      llmount.sh
      cd /mnt/lustre
      touch file1
      
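      In a single process do:
        struct lov_user_md_v3 *lum;
        int fd2;
        /* Initialize lum */
        /* Create file2 but delay allocation of its OST objects */
        fd2 = open("file2", O_RDWR|O_CREAT|O_LOV_DELAY_CREATE, 0666);
        /* Replace file2 with file1: the original file2 is now unlinked but still open */
        rename("file1", "file2");
        /* setstripe on the still-open descriptor is what triggers the assertion */
        ioctl(fd2, LL_IOC_LOV_SETSTRIPE, lum);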
      

      With a stock 2.1.6 I can easily reproduce the issue. Unfortunately, with the patch at http://review.whamcloud.com/6775 applied I am still able to hit the bug.

      Thanks,
      Sebastien.

        Activity

          laisiyao Lai Siyao added a comment -
          ioctl(fd2, LL_IOC_LOV_SETSTRIPE, lum);
          

          As you can see, setstripe is done via ioctl on an already opened file handle, but in the code setstripe is implemented as an open (so it is actually a re-open). It looks like this should succeed, but the current MDS code does not allow re-opening an unlinked file or creating OST objects for it. However, since there is no POSIX standard for the setstripe call, this can be regarded as normal behavior, but it should be documented somewhere IMO.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Hi,

          Thank you very much for the explanations! Now with http://review.whamcloud.com/6775 + http://review.whamcloud.com/8529 I am not able to hit the assertion anymore.
          So we definitely need both fixes in b2_1.

          One more question: could you re-explain the drawback you identified with this solution? (It was related to setstripe returning -ENOENT, but I did not get your point.)

          Thanks,
          Sebastien.

          laisiyao Lai Siyao added a comment -

          Sorry Sebastien, I posted the wrong patch, it should be http://review.whamcloud.com/6775 + http://review.whamcloud.com/#/c/8529/.

          You hit that assertion because the MDS_OPEN_BY_FID flag is not in the 2.1 code, so the open tends to be done by name on the MDS. Therefore, when a rename happens, the new file with a different FID is opened, which triggers the assertion on FID change on the client.

          2.4 has this flag, and patch http://review.whamcloud.com/#/c/8529/ backports this flag to 2.1, so the assertion will not be hit any more.
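
The difference between the two re-open strategies can be sketched with a toy model in plain C. This is not Lustre code: every `toy_*` name, the four-slot directory, and the FID values used with it are hypothetical and exist only for illustration. The model keeps a small name-to-FID table, lets a rename unlink its target, and compares re-opening by name with re-opening by FID. The -ENOENT result for an unlinked FID mirrors the strict MDS check mentioned elsewhere in this thread; how the real MDS implements it is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_SLOTS 4

/* Hypothetical stand-ins for a FID and a directory; not Lustre structures. */
struct toy_fid { unsigned long long seq; unsigned int oid; };
struct toy_dirent { char name[16]; struct toy_fid fid; int used; };
struct toy_dir { struct toy_dirent ent[TOY_SLOTS]; };

static int toy_fid_eq(const struct toy_fid *a, const struct toy_fid *b)
{
    assert(a != NULL && b != NULL);
    return a->seq == b->seq && a->oid == b->oid;
}

static struct toy_dirent *toy_find(struct toy_dir *d, const char *name)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        if (d->ent[i].used && strcmp(d->ent[i].name, name) == 0)
            return &d->ent[i];
    return NULL;
}

static void toy_set_name(struct toy_dirent *e, const char *name)
{
    strncpy(e->name, name, sizeof(e->name) - 1);
    e->name[sizeof(e->name) - 1] = '\0';
}

static int toy_create(struct toy_dir *d, const char *name, struct toy_fid fid)
{
    if (toy_find(d, name) != NULL)
        return -EEXIST;
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        if (!d->ent[i].used) {
            toy_set_name(&d->ent[i], name);
            d->ent[i].fid = fid;
            d->ent[i].used = 1;
            return 0;
        }
    }
    return -ENOSPC;
}

/* Rename "from" to "to"; an existing "to" entry is unlinked first. */
static int toy_rename(struct toy_dir *d, const char *from, const char *to)
{
    struct toy_dirent *src = toy_find(d, from);
    struct toy_dirent *dst = toy_find(d, to);

    if (src == NULL)
        return -ENOENT;
    if (dst != NULL && dst != src)
        dst->used = 0;
    toy_set_name(src, to);
    return 0;
}

/* Re-open by name: returns whatever FID the name is bound to right now. */
static int toy_open_by_name(struct toy_dir *d, const char *name,
                            struct toy_fid *out)
{
    struct toy_dirent *e = toy_find(d, name);

    if (e == NULL)
        return -ENOENT;
    *out = e->fid;
    return 0;
}

/* Re-open by FID: never returns a different object; an unlinked FID fails. */
static int toy_open_by_fid(struct toy_dir *d, const struct toy_fid *want,
                           struct toy_fid *out)
{
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        if (d->ent[i].used && toy_fid_eq(&d->ent[i].fid, want)) {
            *out = d->ent[i].fid;
            return 0;
        }
    }
    return -ENOENT;
}
```

In this model, after creating "file1" and "file2" and then renaming "file1" over "file2", `toy_open_by_name()` on "file2" hands back the FID that used to belong to "file1". That is the kind of FID change the client-side assertion rejects. `toy_open_by_fid()` with the original FID of "file2" returns -ENOENT instead of a different object.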

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Hi,

          Patch http://review.whamcloud.com/8529 can be applied on Lustre 2.1.6, but http://review.whamcloud.com/7476 cannot because it is a master version (more than 20 hunks failed when trying on 2.1).
          That being said, I do not clearly understand the relationship between the assertion we are suffering from and the open-by-fid feature. I mean, all we need is something to prevent Lustre 2.1 from crashing when users do something like the reproducer detailed in the description of this ticket.

          Sebastien.

          laisiyao Lai Siyao added a comment -

          Hi Sebastien, I just committed a patch http://review.whamcloud.com/#/c/8529/; you can apply it plus http://review.whamcloud.com/#/c/7476/ to make the test pass. However, as John noted in LU-2523, setstripe will return -ENOENT in your test. This is because the MDS has a strict check that forbids opening an unlinked file again or creating objects on the OST for it (even though it is currently open). Do you think this is acceptable?

          laisiyao Lai Siyao added a comment -

          Yes, Sebastien, I'm looking for a simpler way to handle this open-by-fid case only. I'm still testing and will commit the patch tomorrow.

          pjones Peter Jones added a comment -

          ok Sebastien. We are looking into options that would work for b2_1

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Peter,

          The problem is that the upgrade to 2.4 at IFERC is planned for Q4 2014.

          pjones Peter Jones added a comment -

          Lai

          Would this be easier to port to b2_4?

          Sebastien

          If the answer to the above is yes, would you consider deploying a 2.4.x release at IFERC?

          Peter

          laisiyao Lai Siyao added a comment - - edited

          http://review.whamcloud.com/#/c/7476/ should be able to fix this, but this patch is for master code, and it depends on patches that are not in 2.1.

          People

            laisiyao Lai Siyao
            sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 4
