Lustre / LU-3318

mdc_set_lock_data() ASSERTION( old_inode->i_state & I_FREEING )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • Fix Version/s: Lustre 2.4.0
    • Labels: None
    • Environment: 2.3.64 ppc64 client
      2.1.4 x86_64 server
    • Severity: 3
    • Rank: 8220

    Description

      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) header@c000000c6a188980[0x0, 28, [0x100f00000:0x131a05a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) ....lovsub@c000000c6a188a18[0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) ....osc@c000000d6758cad0id: 0x0:20029530 idx: 240 gen: 0 kms_valid: 1 kms 1618944 rc: 0 force_sync: 0 min_xid: 0 size: 1618944 mtime: 1368036151 atime: 1368036151 ctime: 1368036150 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) } header@c000000c6a188980
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) stripe 0 is already owned.
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000f4fd1c8f8[0x0, 45, [0x5ae58ebb30:0x27:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....vvp@c000000f4fd1c990(+ 0 0) inode: c000000c6a513178 6549798167695589415/1524993723 100666 1 0 c000000f4fd1c990 [0x5ae58ebb30:0x27:0x0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lov@c000000c562bdfd8stripes: 2, valid, lsm{c000000c56b65b00 0x0BD10BD0 1 2 0}: 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000c6a188980[0x0, 28, [0x100f00000:0x131a05a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lovsub@c000000c6a188a18[0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....osc@c000000d6758cad0id: 0x0:20029530 idx: 240 gen: 0 kms_valid: 1 kms 1618944 rc: 0 force_sync: 0 min_xid: 0 size: 1618944 mtime: 1368036151 atime: 1368036151 ctime: 1368036150 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000c6a188980
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000c6a188a70[0x0, 18, [0x1006b0000:0x130771a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lovsub@c000000c6a188b08[1]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....osc@c000000d6758cbf0id: 0x0:19953434 idx: 107 gen: 0 kms_valid: 1 kms 1048576 rc: 0 force_sync: 0 min_xid: 0 size: 1048576 mtime: 1368036151 atime: 1368036151 ctime: 1368036151 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000c6a188a70
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000f4fd1c8f8
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) owned.
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:186:lov_init_sub()) header@c000000f4fd1c7f0[0x0, 1, [0x5ae58ebb30:0x27:0x0]]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:186:lov_init_sub()) try to own.
      2013-05-08 11:02:31 LustreError: 7256:0:(lcommon_cl.c:1201:cl_file_inode_init()) Failure to initialize cl object [0x5ae58ebb30:0x27:0x0]: -5
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) header@c000000c6a188980[0x0, 28, [0x100f00000:0x131a05a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) ....lovsub@c000000c6a188a18[0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) ....osc@c000000d6758cad0id: 0x0:20029530 idx: 240 gen: 0 kms_valid: 1 kms 1618944 rc: 0 force_sync: 0 min_xid: 0 size: 1618944 mtime: 1368036151 atime: 1368036151 ctime: 1368036150 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) } header@c000000c6a188980
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) stripe 0 is already owned.
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000f4fd1c8f8[0x0, 45, [0x5ae58ebb30:0x27:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....vvp@c000000f4fd1c990(+ 0 0) inode: c000000c6a513178 6549798167695589415/1524993723 100666 1 0 c000000f4fd1c990 [0x5ae58ebb30:0x27:0x0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lov@c000000c562bdfd8stripes: 2, valid, lsm{c000000c56b65b00 0x0BD10BD0 1 2 0}: 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000c6a188980[0x0, 28, [0x100f00000:0x131a05a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lovsub@c000000c6a188a18[0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....osc@c000000d6758cad0id: 0x0:20029530 idx: 240 gen: 0 kms_valid: 1 kms 1618944 rc: 0 force_sync: 0 min_xid: 0 size: 1618944 mtime: 1368036151 atime: 1368036151 ctime: 1368036150 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000c6a188980
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000c6a188a70[0x0, 18, [0x1006b0000:0x130771a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lovsub@c000000c6a188b08[1]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....osc@c000000d6758cbf0id: 0x0:19953434 idx: 107 gen: 0 kms_valid: 1 kms 1048576 rc: 0 force_sync: 0 min_xid: 0 size: 1048576 mtime: 1368036151 atime: 1368036151 ctime: 1368036151 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000c6a188a70
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000f4fd1c8f8
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) owned.
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:186:lov_init_sub()) header@c000000f4fd1c6e8[0x0, 1, [0x5ae58ebb30:0x27:0x0]]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:186:lov_init_sub()) try to own.
      2013-05-08 11:02:31 LustreError: 7256:0:(llite_lib.c:2197:ll_prep_inode()) new_inode -fatal: rc -5
      2013-05-08 11:08:51 LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) ASSERTION( old_inode->i_state & I_FREEING ) failed: Found existing inode c000000d336808f8/6549372160627028526/1524894535 state 0 in lock: setting data to c000000c6abfdb78/252641838/1524894535
      2013-05-08 11:08:51 LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) LBUG
      2013-05-08 11:08:51 Call Trace:
      2013-05-08 11:08:51 [c000000ddf04f1d0] [c000000000012f04] .show_stack+0x74/0x1c0 (unreliable)
      2013-05-08 11:08:51 [c000000ddf04f280] [d00000000a9e0cb8] .libcfs_debug_dumpstack+0xd8/0x150 [libcfs]
      2013-05-08 11:08:51 [c000000ddf04f330] [d00000000a9e1480] .lbug_with_loc+0x50/0xc0 [libcfs]
      2013-05-08 11:08:51 [c000000ddf04f3c0] [d00000000c24fd2c] .mdc_set_lock_data+0x33c/0x340 [mdc]
      2013-05-08 11:08:51 [c000000ddf04f480] [d00000000c4a8674] .ll_lookup_it_finish+0xb34/0x1700 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f5d0] [d00000000c4a96d8] .ll_lookup_it+0x498/0xfb0 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f750] [d00000000c4aa4b4] .ll_lookup_nd+0x2c4/0x580 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f810] [c0000000001d0614] .do_lookup+0x254/0x2d0
      2013-05-08 11:08:51 [c000000ddf04f8e0] [c0000000001d3348] .__link_path_walk+0x1f8/0x15a0
      2013-05-08 11:08:51 [c000000ddf04fa10] [c0000000001d4aa8] .path_walk+0x98/0x180
      2013-05-08 11:08:51 [c000000ddf04fab0] [c0000000001d4d9c] .do_path_lookup+0x7c/0xf0
      2013-05-08 11:08:51 [c000000ddf04fb40] [c0000000001d5b80] .user_path_at+0x60/0xb0
      2013-05-08 11:08:51 [c000000ddf04fc90] [c0000000001c9034] .vfs_fstatat+0x44/0xb0
      2013-05-08 11:08:51 [c000000ddf04fd30] [c0000000001c9274] .SyS_stat64+0x24/0x60
      2013-05-08 11:08:51 [c000000ddf04fe30] [c000000000008564] syscall_exit+0x0/0x40
      2013-05-08 11:08:51 Kernel panic - not syncing: LBUG
      2013-05-08 11:08:51 Call Trace:
      2013-05-08 11:08:51 [c000000ddf04f1f0] [c000000000012f04] .show_stack+0x74/0x1c0 (unreliable)
      2013-05-08 11:08:51 [c000000ddf04f2a0] [c0000000005c4f34] .panic+0xc4/0x1f8
      2013-05-08 11:08:51 [c000000ddf04f330] [d00000000a9e14e0] .lbug_with_loc+0xb0/0xc0 [libcfs]
      2013-05-08 11:08:51 [c000000ddf04f3c0] [d00000000c24fd2c] .mdc_set_lock_data+0x33c/0x340 [mdc]
      2013-05-08 11:08:51 [c000000ddf04f480] [d00000000c4a8674] .ll_lookup_it_finish+0xb34/0x1700 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f5d0] [d00000000c4a96d8] .ll_lookup_it+0x498/0xfb0 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f750] [d00000000c4aa4b4] .ll_lookup_nd+0x2c4/0x580 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f810] [c0000000001d0614] .do_lookup+0x254/0x2d0
      2013-05-08 11:08:51 [c000000ddf04f8e0] [c0000000001d3348] .__link_path_walk+0x1f8/0x15a0
      2013-05-08 11:08:51 [c000000ddf04fa10] [c0000000001d4aa8] .path_walk+0x98/0x180
      2013-05-08 11:08:51 [c000000ddf04fab0] [c0000000001d4d9c] .do_path_lookup+0x7c/0xf0
      2013-05-08 11:08:51 [c000000ddf04fb40] [c0000000001d5b80] .user_path_at+0x60/0xb0
      2013-05-08 11:08:51 [c000000ddf04fc90] [c0000000001c9034] .vfs_fstatat+0x44/0xb0
      2013-05-08 11:08:51 [c000000ddf04fd30] [c0000000001c9274] .SyS_stat64+0x24/0x60
      2013-05-08 11:08:51 [c000000ddf04fe30] [c000000000008564] syscall_exit+0x0/0x40
      2013-05-08 11:08:51 May  8 11:08:51 rzuseqlac2 kernel: LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) ASSERTION( old_inode->i_state & I_FREEING ) failed: Found existing inode c000000d336808f8/6549372160627028526/1524894535 state 0 in lock: setting data to c000000c6abfdb78/252641838/1524894535
      2013-05-08 11:08:51 May  8 11:08:51 rzuseqlac2 kernel: LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) LBUG
      2013-05-08 11:08:51 May  8 11:08:51 rzuseqlac2 kernel: Kernel panic - not syncing: LBUG
      

      We have two crash dumps available for analysis.

      LLNL-bug-id: BG-165

          Activity


            jay Jinshan Xiong (Inactive) added a comment -

            > The problem is that some 32-bit user (like NFS?) may use the ino directly; if we truncate the 64-bit inode ino to 32 bits and return it to the user, the user might not be able to get the inode back with that 32-bit ino.

            I thought we pack the FID into the NFS file handle. I can't think of a use case for using the ino directly - but apparently I'm not an expert on this.

            thermeon Scott Sampson added a comment -

            I can confirm that LU-3327 involves a 32-bit application on a 64-bit client.

            adilger Andreas Dilger added a comment -

            This was discussed in the past, and the assumption was made that no applications were going to access both 32-bit and 64-bit inode numbers at the same time. The inode number should not be stored or used internally; rather, the FID should always be used for internal access, and the inode number should always be regenerated in the context of the user thread that is doing the stat() or whatever needs it.
            di.wang Di Wang added a comment - edited

            The problem is that some 32-bit user (like NFS?) may use the ino directly; if we truncate the 64-bit inode ino to 32 bits and return it to the user, the user might not be able to get the inode back with that 32-bit ino.

            jay Jinshan Xiong (Inactive) added a comment -

            We should always use `unsigned long' in kernel space. The only concern for 32-bit compatibility is when we're filling in the user buffer. So in this case, for ll_iget(), we should call cl_fid_build_ino() with api32 = 0.

            nedbass Ned Bass (Inactive) added a comment -

            > Is it possible that 32-bit applications are running on these ppc 64-bit clients?

            We confirmed that it is possible to build and run 32-bit apps on the ppc64 client that hit this bug.
            di.wang Di Wang added a comment -

            Just tried to figure out the FID from the two inode numbers in the error message. The FID should be [0x5ae40b47b0:0x1b22e:0x0].

            So fid_flatten([0x5ae40b47b0:0x1b22e:0x0]) = 6549372160627028526 and fid_flatten32([0x5ae40b47b0:0x1b22e:0x0]) = 252641838,

            which match the error message and confirm this guess:

            2013-05-08 11:08:51 LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) ASSERTION( old_inode->i_state & I_FREEING ) failed: Found existing inode c000000d336808f8/6549372160627028526/1524894535 state 0 in lock: setting data to c000000c6abfdb78/252641838/1524894535
            2013-05-08 11:08:51 LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) LBUG
            

            But not sure how to fix this.

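            Di Wang's arithmetic can be reproduced with a short script. This is a sketch modeled on the bit layout of the client's fid_flatten() for normal (non-IGIF) FIDs; fid_flatten32() packs the FID into 32 bits with a different layout, so only the 64-bit value from the assertion message is recomputed here.

```python
# Sketch of the 64-bit FID flattening used for inode numbers on 64-bit
# clients, modeled on Lustre's fid_flatten() for normal (non-IGIF) FIDs:
#   ino = (seq << 24) + ((seq >> 24) & 0xffffff0000) + oid
def fid_flatten(seq, oid):
    ino = ((seq << 24) + ((seq >> 24) & 0xffffff0000) + oid) & (2**64 - 1)
    return ino if ino else oid

# FID reconstructed in the comment above: [0x5ae40b47b0:0x1b22e:0x0]
seq, oid = 0x5ae40b47b0, 0x1b22e
print(fid_flatten(seq, oid))  # 6549372160627028526
```

            The result matches the first inode number in the LBUG message, while 252641838 is what the 32-bit mapping yields for the same FID: the two in-core inodes represent the same file under the 64-bit and 32-bit inode-number APIs.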
            di.wang Di Wang added a comment -

            Is it possible that 32-bit applications are running on these ppc 64-bit clients? Also, is it possible to unmount the clients, erase the config log with tunefs, and remount them, so that the lmv layer is created on the client side? That will slow down the rate at which sequences are consumed.

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            We didn't upgrade all of our clients to 2.1.4 because we paused the rollout due to LU-3029. Also, this file system (lscratchrza) has ppc 32-bit clients mounting it right now (rzdawndev, a BG/P system), and there would be 32-bit applications writing to the file system. These clients are not the same clients that generated this ASSERTION; those are ppc 64-bit clients.

            Ah, apparently our 2.1 x86 clients don't have the LU-1632 patch:

            $ touch a b c d
            $ for x in a b c d ; do lfs path2fid $x ; done
            [0x5b05195040:0x1:0x0]
            [0x5b05197750:0x1:0x0]
            [0x5b05199e60:0x1:0x0]
            [0x5b0519c188:0x1:0x0]
            $ pwd
            /p/lscratchrza/bass6
            $ rpm -q lustre
            lustre-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
            $ arch
            x86_64
            
            nedbass Ned Bass (Inactive) added a comment
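            The transcript shows the symptom directly: every freshly created file has OID 0x1, i.e. each create landed in a brand-new sequence instead of reusing one. A quick parse of the FIDs printed above (a sketch; the FID strings are copied verbatim from the transcript):

```python
# FIDs printed by `lfs path2fid` above; the format is [seq:oid:ver].
fids = [
    "[0x5b05195040:0x1:0x0]",
    "[0x5b05197750:0x1:0x0]",
    "[0x5b05199e60:0x1:0x0]",
    "[0x5b0519c188:0x1:0x0]",
]

parsed = [tuple(int(x, 16) for x in f.strip("[]").split(":")) for f in fids]

# Every OID is 1 and every SEQ is distinct: one whole sequence per file.
assert all(oid == 1 for _, oid, _ in parsed)
assert len({seq for seq, _, _ in parsed}) == len(parsed)
print([hex(seq) for seq, _, _ in parsed])
```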
            di.wang Di Wang added a comment - edited

            Another interesting problem, according to the comment from Ned (14/May/13 7:51 AM):

            # rzuseqlac2 /root > touch /p/lscratchrza/bass6/file1
            # rzuseqlac2 /root > lfs path2fid !$
            lfs path2fid /p/lscratchrza/bass6/file1
            [0x5ae60e2460:0x55c7:0x0]
            

            And when the ticket was created:

             header@c000000f4fd1c6e8[0x0, 1, [0x5ae58ebb30:0x27:0x0]]
            

            the sequence was around 0x5ae58ebb30. Ned, I am not sure if you can still find this file (lfs fid2path 0x5ae58ebb30:0x27:0x0) and find out its creation time. If that file was created around that time, then about (0x5ae60e2460 - 0x5ae58ebb30) = 8M sequences were consumed in 4 days, which still seems like too much to me, unless I am missing something here.

            Ned, how many clients does this cluster have? And currently all clients are running 2.3.58, right?

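            The consumption estimate above can be checked exactly (a quick sketch; both sequence numbers are taken from the comment):

```python
# Sequence seen around the time the ticket was filed vs. the sequence of
# Ned's test file about four days later (both values quoted above).
seq_at_crash = 0x5ae58ebb30         # from header@... [0x5ae58ebb30:0x27:0x0]
seq_four_days_later = 0x5ae60e2460  # from `lfs path2fid` on file1

consumed = seq_four_days_later - seq_at_crash
print(consumed)  # 8350000, i.e. roughly 8M sequences in about 4 days
```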

            People

              Assignee: di.wang Di Wang
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: