LU-3318

mdc_set_lock_data() ASSERTION( old_inode->i_state & I_FREEING )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • Fix Version/s: Lustre 2.4.0
    • Labels: None
    • Environment: 2.3.64 ppc64 client
      2.1.4 x86_64 server
    • Severity: 3
    • 8220

    Description

      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) header@c000000c6a188980[0x0, 28, [0x100f00000:0x131a05a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) ....lovsub@c000000c6a188a18[0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) ....osc@c000000d6758cad0id: 0x0:20029530 idx: 240 gen: 0 kms_valid: 1 kms 1618944 rc: 0 force_sync: 0 min_xid: 0 size: 1618944 mtime: 1368036151 atime: 1368036151 ctime: 1368036150 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) } header@c000000c6a188980
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) stripe 0 is already owned.
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000f4fd1c8f8[0x0, 45, [0x5ae58ebb30:0x27:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....vvp@c000000f4fd1c990(+ 0 0) inode: c000000c6a513178 6549798167695589415/1524993723 100666 1 0 c000000f4fd1c990 [0x5ae58ebb30:0x27:0x0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lov@c000000c562bdfd8stripes: 2, valid, lsm{c000000c56b65b00 0x0BD10BD0 1 2 0}: 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000c6a188980[0x0, 28, [0x100f00000:0x131a05a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lovsub@c000000c6a188a18[0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....osc@c000000d6758cad0id: 0x0:20029530 idx: 240 gen: 0 kms_valid: 1 kms 1618944 rc: 0 force_sync: 0 min_xid: 0 size: 1618944 mtime: 1368036151 atime: 1368036151 ctime: 1368036150 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000c6a188980
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000c6a188a70[0x0, 18, [0x1006b0000:0x130771a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lovsub@c000000c6a188b08[1]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....osc@c000000d6758cbf0id: 0x0:19953434 idx: 107 gen: 0 kms_valid: 1 kms 1048576 rc: 0 force_sync: 0 min_xid: 0 size: 1048576 mtime: 1368036151 atime: 1368036151 ctime: 1368036151 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000c6a188a70
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000f4fd1c8f8
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) owned.
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:186:lov_init_sub()) header@c000000f4fd1c7f0[0x0, 1, [0x5ae58ebb30:0x27:0x0]]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:186:lov_init_sub()) try to own.
      2013-05-08 11:02:31 LustreError: 7256:0:(lcommon_cl.c:1201:cl_file_inode_init()) Failure to initialize cl object [0x5ae58ebb30:0x27:0x0]: -5
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) header@c000000c6a188980[0x0, 28, [0x100f00000:0x131a05a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) ....lovsub@c000000c6a188a18[0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) ....osc@c000000d6758cad0id: 0x0:20029530 idx: 240 gen: 0 kms_valid: 1 kms 1618944 rc: 0 force_sync: 0 min_xid: 0 size: 1618944 mtime: 1368036151 atime: 1368036151 ctime: 1368036150 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) } header@c000000c6a188980
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:184:lov_init_sub()) stripe 0 is already owned.
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000f4fd1c8f8[0x0, 45, [0x5ae58ebb30:0x27:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....vvp@c000000f4fd1c990(+ 0 0) inode: c000000c6a513178 6549798167695589415/1524993723 100666 1 0 c000000f4fd1c990 [0x5ae58ebb30:0x27:0x0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lov@c000000c562bdfd8stripes: 2, valid, lsm{c000000c56b65b00 0x0BD10BD0 1 2 0}: 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000c6a188980[0x0, 28, [0x100f00000:0x131a05a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lovsub@c000000c6a188a18[0]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....osc@c000000d6758cad0id: 0x0:20029530 idx: 240 gen: 0 kms_valid: 1 kms 1618944 rc: 0 force_sync: 0 min_xid: 0 size: 1618944 mtime: 1368036151 atime: 1368036151 ctime: 1368036150 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000c6a188980
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) header@c000000c6a188a70[0x0, 18, [0x1006b0000:0x130771a:0x0] hash]{ 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....lovsub@c000000c6a188b08[1]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) ....osc@c000000d6758cbf0id: 0x0:19953434 idx: 107 gen: 0 kms_valid: 1 kms 1048576 rc: 0 force_sync: 0 min_xid: 0 size: 1048576 mtime: 1368036151 atime: 1368036151 ctime: 1368036151 blocks: 2048
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000c6a188a70
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) 
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) } header@c000000f4fd1c8f8
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:185:lov_init_sub()) owned.
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:186:lov_init_sub()) header@c000000f4fd1c6e8[0x0, 1, [0x5ae58ebb30:0x27:0x0]]
      2013-05-08 11:02:31 LustreError: 7256:0:(lov_object.c:186:lov_init_sub()) try to own.
      2013-05-08 11:02:31 LustreError: 7256:0:(llite_lib.c:2197:ll_prep_inode()) new_inode -fatal: rc -5
      2013-05-08 11:08:51 LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) ASSERTION( old_inode->i_state & I_FREEING ) failed: Found existing inode c000000d336808f8/6549372160627028526/1524894535 state 0 in lock: setting data to c000000c6abfdb78/252641838/1524894535
      2013-05-08 11:08:51 LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) LBUG
      2013-05-08 11:08:51 Call Trace:
      2013-05-08 11:08:51 [c000000ddf04f1d0] [c000000000012f04] .show_stack+0x74/0x1c0 (unreliable)
      2013-05-08 11:08:51 [c000000ddf04f280] [d00000000a9e0cb8] .libcfs_debug_dumpstack+0xd8/0x150 [libcfs]
      2013-05-08 11:08:51 [c000000ddf04f330] [d00000000a9e1480] .lbug_with_loc+0x50/0xc0 [libcfs]
      2013-05-08 11:08:51 [c000000ddf04f3c0] [d00000000c24fd2c] .mdc_set_lock_data+0x33c/0x340 [mdc]
      2013-05-08 11:08:51 [c000000ddf04f480] [d00000000c4a8674] .ll_lookup_it_finish+0xb34/0x1700 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f5d0] [d00000000c4a96d8] .ll_lookup_it+0x498/0xfb0 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f750] [d00000000c4aa4b4] .ll_lookup_nd+0x2c4/0x580 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f810] [c0000000001d0614] .do_lookup+0x254/0x2d0
      2013-05-08 11:08:51 [c000000ddf04f8e0] [c0000000001d3348] .__link_path_walk+0x1f8/0x15a0
      2013-05-08 11:08:51 [c000000ddf04fa10] [c0000000001d4aa8] .path_walk+0x98/0x180
      2013-05-08 11:08:51 [c000000ddf04fab0] [c0000000001d4d9c] .do_path_lookup+0x7c/0xf0
      2013-05-08 11:08:51 [c000000ddf04fb40] [c0000000001d5b80] .user_path_at+0x60/0xb0
      2013-05-08 11:08:51 [c000000ddf04fc90] [c0000000001c9034] .vfs_fstatat+0x44/0xb0
      2013-05-08 11:08:51 [c000000ddf04fd30] [c0000000001c9274] .SyS_stat64+0x24/0x60
      2013-05-08 11:08:51 [c000000ddf04fe30] [c000000000008564] syscall_exit+0x0/0x40
      2013-05-08 11:08:51 Kernel panic - not syncing: LBUG
      2013-05-08 11:08:51 Call Trace:
      2013-05-08 11:08:51 [c000000ddf04f1f0] [c000000000012f04] .show_stack+0x74/0x1c0 (unreliable)
      2013-05-08 11:08:51 [c000000ddf04f2a0] [c0000000005c4f34] .panic+0xc4/0x1f8
      2013-05-08 11:08:51 [c000000ddf04f330] [d00000000a9e14e0] .lbug_with_loc+0xb0/0xc0 [libcfs]
      2013-05-08 11:08:51 [c000000ddf04f3c0] [d00000000c24fd2c] .mdc_set_lock_data+0x33c/0x340 [mdc]
      2013-05-08 11:08:51 [c000000ddf04f480] [d00000000c4a8674] .ll_lookup_it_finish+0xb34/0x1700 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f5d0] [d00000000c4a96d8] .ll_lookup_it+0x498/0xfb0 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f750] [d00000000c4aa4b4] .ll_lookup_nd+0x2c4/0x580 [lustre]
      2013-05-08 11:08:51 [c000000ddf04f810] [c0000000001d0614] .do_lookup+0x254/0x2d0
      2013-05-08 11:08:51 [c000000ddf04f8e0] [c0000000001d3348] .__link_path_walk+0x1f8/0x15a0
      2013-05-08 11:08:51 [c000000ddf04fa10] [c0000000001d4aa8] .path_walk+0x98/0x180
      2013-05-08 11:08:51 [c000000ddf04fab0] [c0000000001d4d9c] .do_path_lookup+0x7c/0xf0
      2013-05-08 11:08:51 [c000000ddf04fb40] [c0000000001d5b80] .user_path_at+0x60/0xb0
      2013-05-08 11:08:51 [c000000ddf04fc90] [c0000000001c9034] .vfs_fstatat+0x44/0xb0
      2013-05-08 11:08:51 [c000000ddf04fd30] [c0000000001c9274] .SyS_stat64+0x24/0x60
      2013-05-08 11:08:51 [c000000ddf04fe30] [c000000000008564] syscall_exit+0x0/0x40
      2013-05-08 11:08:51 May  8 11:08:51 rzuseqlac2 kernel: LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) ASSERTION( old_inode->i_state & I_FREEING ) failed: Found existing inode c000000d336808f8/6549372160627028526/1524894535 state 0 in lock: setting data to c000000c6abfdb78/252641838/1524894535
      2013-05-08 11:08:51 May  8 11:08:51 rzuseqlac2 kernel: LustreError: 45085:0:(mdc_locks.c:143:mdc_set_lock_data()) LBUG
      2013-05-08 11:08:51 May  8 11:08:51 rzuseqlac2 kernel: Kernel panic - not syncing: LBUG
      

      We have two crash dumps available for analysis.

      LLNL-bug-id: BG-165

    Attachments

    Issue Links

    Activity

            [LU-3318] mdc_set_lock_data() ASSERTION( old_inode->i_state & I_FREEING )
            pjones Peter Jones added a comment -

            Thanks Ned!


            nedbass Ned Bass (Inactive) added a comment -

            Yes, this can be closed.

            jlevi Jodi Levi (Inactive) added a comment -

            Now that Change 6371 has landed to master, can this ticket be closed? Or are there additional patches to land?
            di.wang Di Wang added a comment -

            Hmm, since there is not enough debug log here, here is what I think happened:

            1. Some 32-bit application (thread) accessed the file, which prepared the inode with a 32-bit ino (because of http://review.whamcloud.com/#change,5711), and also got back a LOOKUP|UPDATE|LAYOUT lock.
            2. Then the LOOKUP/UPDATE lock was revoked by the server, but the client still had this file open, so the LAYOUT lock was left on the client side (with the 32-bit inode attached to it).
            3. Some other normal thread (without the 32-bit flag) looked up the file, enqueued the lock, got back the FID, and built a 64-bit ino; when it located the same lock by FID in the cache, the lock was already attached to the inode with the 32-bit ino, which caused the LBUG.

            Perhaps the .cpp file is being processed by some 32-bit pre-processing tool before or during the build? Just a wild guess. Anyway, the 32-bit ino must have been built because the file was accessed by a thread with the TIF_32BIT flag. I will update the patch http://review.whamcloud.com/#change,6371 according to Andreas's comment.
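The failing check in step 3 can be pictured with a small standalone sketch. This is a minimal simplification, not the real code in mdc_locks.c: `fake_inode`, `fake_lock`, and `set_lock_data` are illustrative stand-ins, and only the I_FREEING flag value matches the kernel's.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical simplification: a DLM lock caches a pointer to the inode
 * it was granted for.  The real check lives in mdc_set_lock_data()
 * (mdc_locks.c); the names here are illustrative. */
struct fake_inode { uint64_t i_ino; unsigned int i_state; };
#define I_FREEING 0x20  /* same value as the Linux kernel inode flag */

struct fake_lock { struct fake_inode *l_ast_data; };

/* Returns 0 on success, -1 where the real code would LBUG: the lock is
 * already attached to a *live* inode different from the one being set,
 * i.e. ASSERTION(old_inode->i_state & I_FREEING) fails. */
int set_lock_data(struct fake_lock *lock, struct fake_inode *inode)
{
    if (lock->l_ast_data != NULL && lock->l_ast_data != inode) {
        struct fake_inode *old = lock->l_ast_data;
        if (!(old->i_state & I_FREEING))
            return -1;
    }
    lock->l_ast_data = inode;
    return 0;
}
```

In the crash above, the "old" inode (the one built with the 32-bit ino) is still live (state 0), so the assertion fires.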


            Also, I looked more closely at the original crash dump to see what the application was doing. The user seems to be building a C++ application using the Boost.Jam build tool and the IBM xlC compilers. Here is the process hierarchy for the process that crashed:

            crash> ps -p 25993                                                              
            PID: 0      TASK: c000000000e903f0  CPU: 0   COMMAND: "swapper"                 
             PID: 1      TASK: c000000f59aa87f0  CPU: 45  COMMAND: "init"                   
              PID: 3900   TASK: c000000f43edbb30  CPU: 32  COMMAND: "ksshd"                 
               PID: 48487  TASK: c000000dc5d10900  CPU: 33  COMMAND: "ksshd"                
                PID: 48807  TASK: c000000d4eed2a70  CPU: 38  COMMAND: "ksshd"               
                 PID: 48810  TASK: c000000385466480  CPU: 11  COMMAND: "tcsh"               
                  PID: 5628   TASK: c0000004c12fe520  CPU: 12  COMMAND: "xjam1"             
                   PID: 5925   TASK: c000000f3667a810  CPU: 36  COMMAND: "bjam"             
                    PID: 22403  TASK: c0000004c0594e60  CPU: 17  COMMAND: "bgxlC"           
                     PID: 25989  TASK: c0000003853634c0  CPU: 36  COMMAND: "ipa"            
                      PID: 25993  TASK: c000000f36458ff0  CPU: 12  COMMAND: "xlCcode"  
            

            The file with the duplicate FID was a .cpp source file. I'm not sure which part of the tool chain is 32-bit, but this isn't exactly an obscure use case.


            I managed to reproduce the bug by running this repeatedly, where stat32 is a program which I compiled with -m32 that just calls stat() on its argument.

            echo 3 > /proc/sys/vm/drop_caches                                                  
            cat $FILE > /dev/null                                                                
            sleep 1                                                                            
            ./stat32 $FILE
            

            I eventually hit the ASSERTION running cat. The behavior is non-deterministic. Sometimes stat32 gets -EIO back from stat() with this console error:

            LustreError: 59173:0:(llite_lib.c:2197:ll_prep_inode()) new_inode -fatal: rc -5

            Other times there are no failures.
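The stat32 source is not attached to this ticket; a minimal equivalent (my reconstruction, not the original program) might look like the following, wrapped in a main() that calls stat_path(argv[1]) and built with gcc -m32:

```c
#include <stdio.h>
#include <sys/stat.h>

/* Returns 0 if stat() succeeds, -1 otherwise.  Prints the inode number
 * the kernel returned, so the 32-bit ino (or a stat() failure such as
 * the -EIO seen above) is visible on the console. */
int stat_path(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror("stat");
        return -1;
    }
    printf("%s: ino=%llu\n", path, (unsigned long long)st.st_ino);
    return 0;
}
```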


            thermeon Scott Sampson added a comment -

            Ned, fwiw, my related case LU-3327 is definitely just a 32-bit app on a 64-bit client.

            nedbass Ned Bass (Inactive) added a comment -

            I'm having trouble understanding what exactly we think applications are doing to trigger this bug. For example, nasf said:

            we do not know how the 32-bit ino# will be used, and cannot prevent it to be passed down back to llite for using.

            How would an application pass an inode number back to llite? I'm not aware of any library or system calls that allow files to be accessed by inode number.

            Do we think it has to be the same process mixing 32-bit and 64-bit system calls, or is it problematic just to mix 32-bit and 64-bit processes that access the same files on the same client?
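One way to picture why mixing 32-bit and 64-bit processes matters: the same FID yields two different inode numbers depending on whether the calling task has the 32-bit flag, so the client can end up with two distinct in-core inodes for one file — and one DLM lock cannot point at both. The sketch below is a toy illustration only; `build_ino` and its hash are made up and are not Lustre's real FID-to-ino mapping:

```c
#include <stdint.h>

/* Toy mapping from a FID (sequence, object id) to an inode number.
 * For a task with the 32-bit flag, the 64-bit value is squashed into
 * 32 bits so stat() does not fail with EOVERFLOW.  The two results
 * differ, which is what produces two in-core inodes for one FID. */
uint64_t build_ino(uint64_t seq, uint32_t oid, int api32)
{
    uint64_t ino = (seq << 24) + oid;            /* toy 64-bit mapping */

    if (api32)
        ino = (ino ^ (ino >> 32)) & 0xffffffffULL; /* squash to 32 bits */
    return ino ? ino : 1;                        /* avoid ino 0 */
}
```

With the FID [0x5ae58ebb30:0x27:0x0] from the logs, the 64-bit and squashed results disagree, matching the two ino values (6549372160627028526 vs 252641838) printed by the assertion message.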

            nedbass Ned Bass (Inactive) added a comment -

            Can we detect this case and return an error rather than asserting on it? Allowing a misbehaved application to crash the OS seems unacceptable.

            adilger Andreas Dilger added a comment -

            Most applications do not use or care about the inode number. However, 32-bit applications on a 64-bit client will definitely get an error from stat() if the curproc check is removed. That is MUCH more likely to cause a problem for users than a very obscure case of someone mixing 32-bit and 64-bit applications that use the inode number on the same client and expecting them to work well together.

            This has been in all of the 2.x releases without any complaints, so I'd rather leave this alone.

            People

              Assignee: di.wang Di Wang
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 12
