  Lustre / LU-18104

racer test_2: lustre: failed to initialize cl_object

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version: Lustre 2.17.0
    • Affects Versions: Lustre 2.16.0, Lustre 2.15.6
    • Severity: 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/9223dd61-e0db-4931-8235-7b2d488093bd

      test_2 failed with the following error:

      Timeout occurred after 968 minutes, last suite running was racer
      

      Dmesg on client:

      LustreError: 23088:0:(lcommon_cl.c:195:cl_file_inode_init()) lustre: failed to initialize cl_object [0x280007934:0x15ec:0x0]: rc = -5
      LustreError: 23088:0:(lcommon_cl.c:195:cl_file_inode_init()) Skipped 20 previous similar messages
      LustreError: 23088:0:(llite_lib.c:3739:ll_prep_inode()) lustre: new_inode - fatal error: rc = -5
      LustreError: 23088:0:(llite_lib.c:3739:ll_prep_inode()) Skipped 20 previous similar messages
      Lustre: dir [0x200008108:0x5f8:0x0] stripe 0 readdir failed: -2, directory is partially accessed!
      Lustre: Skipped 53 previous similar messages
      Autotest: Test running for 880 minutes (lustre-reviews_full-dne-part-2_106470.1005)
      Lustre: dir [0x280007935:0x4c27:0x0] stripe 0 readdir failed: -2, directory is partially accessed!
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/106470 - 6.4.0-150600.23.14-default
      servers: https://build.whamcloud.com/job/lustre-master/4559 - 5.14.0-427.24.1_lustre.el9.x86_64


      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      racer test_2 - Timeout occurred after 968 minutes, last suite running was racer

          Activity

            yujian Jian Yu added a comment - Lustre 2.15.6 RC1: https://testing.whamcloud.com/test_sets/30ff1121-747f-4c9c-ada2-1e5b7a988005
            yujian Jian Yu added a comment - Lustre 2.16.0 RC5: https://testing.whamcloud.com/test_sets/10a8332f-ada4-4739-a10f-ca943d43e96b
            yujian Jian Yu added a comment -

            Lustre 2.16.0 RC2 test session details:
            clients: https://build.whamcloud.com/job/lustre-master/4584 - 5.14.0-362.24.1.el9_3.aarch64
            servers: https://build.whamcloud.com/job/lustre-master/4584 - 4.18.0-477.27.1.el8_lustre.x86_64
            https://testing.whamcloud.com/test_sets/7f051be2-f9a6-4314-8450-a69c8b0b2dda
             


            bzzz Alex Zhuravlev added a comment -

            There is a whole bunch of issues (like LU-17536) where we hit a deadlock in rename because of dentry duplication.

            adilger Andreas Dilger added a comment -

            This looks like it is a client-side deadlock, and doesn't have anything to do with the credits on the MDS. While we do allocate a large number of credits for some migration operations, those MDS stack traces are mostly informational, and on a production system with a larger journal those messages would not be printed. The journal transaction would complete even though the reserved credits are large, and the JBD2 code would BUG if the credits actually ran out. The racer test_1 subtest where those journal credit stacks were printed was able to complete and clean up successfully.

            It is in racer test_2 (rename stress test) where the problem is hit on client1, which looks like some kind of deadlock in the client VFS (there are no messages or interesting stacks on the MDS). The client1 stack trace log shows:
            https://testing.whamcloud.com/test_sets/0e0f1acf-653f-47a1-9ddf-caa2dff16021
            https://testing.whamcloud.com/test_logs/04db027f-18c3-4e26-a34f-d367b233bdf4/show_text

            __schedule,schedule,schedule_preempt_disabled,rwsem_down_read_slowpath,down_read,walk_component,link_path_walk,path_openat,do_filp_open,do_sys_openat2,__x64_sys_openat,do_syscall_64
            	PIDs(2): :618172 :779337 
            __schedule,schedule,schedule_preempt_disabled,rwsem_down_write_slowpath,down_write,vfs_rename,do_renameat2,__x64_sys_rename,do_syscall_64
            	PIDs(1): :779045 
            __schedule,schedule,schedule_preempt_disabled,__mutex_lock,lock_rename,do_renameat2,__x64_sys_rename,do_syscall_64
            	PIDs(18): :779243 :779273 :779287 :779303 :779318 :779351 :779385 :779462 :779506 :779581 :779824 :780140 :780299 :780381 :780966 :781177 :781905 :782020 
            __schedule,schedule,schedule_preempt_disabled,rwsem_down_read_slowpath,down_read,walk_component,path_lookupat,filename_lookup,vfs_statx,vfs_fstatat,__do_sys_newfstatat,do_syscall_64
            	PIDs(3): :779317 :779390 :780236 
            __schedule,schedule,schedule_preempt_disabled,rwsem_down_write_slowpath,down_write,lock_rename,do_renameat2,__x64_sys_rename,do_syscall_64
            	PIDs(5): :779343 :779346 :779350 :779399 :779478 
            __schedule,schedule,schedule_preempt_disabled,rwsem_down_write_slowpath,down_write,open_last_lookups,path_openat,do_filp_open,do_sys_openat2,__x64_sys_openat,do_syscall_64
            	PIDs(3): :779349 :779403 :779476 
            __schedule,schedule,schedule_preempt_disabled,rwsem_down_write_slowpath,down_write,__filename_create,do_mkdirat,__x64_sys_mkdir,do_syscall_64
            	PIDs(2): :779352 :779405 
            __schedule,schedule,schedule_preempt_disabled,rwsem_down_read_slowpath,down_read,open_last_lookups,path_openat,do_filp_open,do_sys_openat2,__x64_sys_openat,do_syscall_64
            	PIDs(1): :779384 
            __schedule,schedule,schedule_preempt_disabled,rwsem_down_read_slowpath,down_read,walk_component,path_lookupat,filename_lookup,vfs_statx,do_statx,__x64_sys_statx,do_syscall_64
            	PIDs(1): :779491 
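
            A minimal sketch (not the Maloo/autotest tooling, just one possible way to gather such a summary) that groups blocked tasks by their kernel stack on a client and prints output roughly in the format quoted above; it reads /proc/<pid>/stack, so it needs root:

            #!/usr/bin/env python3
            # Sketch: group tasks by identical kernel stacks, similar to the summary above.
            # Assumes Linux and root privileges to read /proc/<pid>/stack.
            import collections
            import glob
            import re

            groups = collections.defaultdict(list)
            for path in glob.glob("/proc/[0-9]*/stack"):
                pid = path.split("/")[2]
                try:
                    with open(path) as f:
                        lines = f.readlines()
                except OSError:
                    continue  # task exited or stack not readable
                # Each line looks like "[<0>] down_read+0x43/0x90"; keep only the function name.
                frames = []
                for line in lines:
                    fields = line.split()
                    if len(fields) >= 2:
                        frames.append(re.sub(r"\+0x.*", "", fields[1]))
                if frames:
                    groups[",".join(frames)].append(pid)

            # One line per unique stack, followed by the PIDs sharing it.
            for stack, pids in sorted(groups.items(), key=lambda kv: -len(kv[1])):
                print(stack)
                print("\tPIDs(%d): %s" % (len(pids), " ".join(":" + p for p in pids)))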
            
            bobijam Zhenyu Xu added a comment -

            I searched some of the racer timed-out cases in the autotest report; it's the MDS failing to finish migrate operations for lack of transaction credits that causes the clients to hang there. I list two cases here.

            MDS log from https://testing.whamcloud.com/test_sets/0e0f1acf-653f-47a1-9ddf-caa2dff16021

            [ 9871.024046] Lustre: 72630:0:(osd_handler.c:2056:osd_trans_start()) lustre-MDT0002: credits 2697 > trans_max 2464
            [ 9871.024052] Lustre: 72630:0:(osd_handler.c:1948:osd_trans_dump_creds()) create: 3/12/0, destroy: 1/4/0
            [ 9871.024055] Lustre: 72630:0:(osd_handler.c:1955:osd_trans_dump_creds()) attr_set: 0/0/0, xattr_set: 23/2112/0
            [ 9871.024057] Lustre: 72630:0:(osd_handler.c:1962:osd_trans_dump_creds()) write: 5/55/0, punch: 0/0/0, quota 6/102/0
            [ 9871.024059] Lustre: 72630:0:(osd_handler.c:1972:osd_trans_dump_creds()) insert: 17/395/0, delete: 2/5/0
            [ 9871.024061] Lustre: 72630:0:(osd_handler.c:1979:osd_trans_dump_creds()) ref_add: 10/10/0, ref_del: 1/1/0
            [ 9871.024073] CPU: 1 PID: 72630 Comm: mdt_io00_007 Kdump: loaded Tainted: G OE ------- — 5.14.0-427.24.1_lustre.el9.x86_64 #1
            [ 9871.024076] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [ 9871.024081] Call Trace:
            [ 9871.024090] <TASK>
            [ 9871.024093] dump_stack_lvl+0x34/0x48
            [ 9871.024106] osd_trans_start+0x676/0x6d0 [osd_ldiskfs]
            [ 9871.024145] top_trans_start+0x199/0x4d0 [ptlrpc]
            [ 9871.024300] mdd_migrate_object+0x9d1/0x1130 [mdd]
            [ 9871.024331] ? lprocfs_counter_add+0x109/0x170 [obdclass]
            [ 9871.024418] ? mdd_migrate+0x14/0x20 [mdd]
            [ 9871.024437] mdd_migrate+0x14/0x20 [mdd]
            [ 9871.024456] mdt_reint_migrate+0x1224/0x1bb0 [mdt]
            [ 9871.024509] mdt_reint_rec+0x11c/0x270 [mdt]
            [ 9871.024545] mdt_reint_internal+0x4e4/0x850 [mdt]
            [ 9871.024577] mdt_reint+0x59/0x110 [mdt]
            [ 9871.024611] tgt_handle_request0+0x14a/0x770 [ptlrpc]
            [ 9871.024719] tgt_request_handle+0x1eb/0xb80 [ptlrpc]
            [ 9871.024829] ptlrpc_server_handle_request.isra.0+0x40a/0xc80 [ptlrpc]
            [ 9871.024927] ptlrpc_main+0xaa0/0xfe0 [ptlrpc]
            [ 9871.025025] ? __pfx_ptlrpc_main+0x10/0x10 [ptlrpc]
            [ 9871.025120] kthread+0xe0/0x100
            [ 9871.025124] ? __pfx_kthread+0x10/0x10
            [ 9871.025126] ret_from_fork+0x2c/0x50

            MDS log from https://testing.whamcloud.com/test_sets/9223dd61-e0db-4931-8235-7b2d488093bd

            [51509.738587] Lustre: 1856484:0:(osd_handler.c:2056:osd_trans_start()) lustre-MDT0000: credits 3559 > trans_max 2464
            [51509.738607] Lustre: 1856484:0:(osd_handler.c:1948:osd_trans_dump_creds()) create: 3/12/0, destroy: 1/4/0
            [51509.738610] Lustre: 1856484:0:(osd_handler.c:1955:osd_trans_dump_creds()) attr_set: 2/2/0, xattr_set: 35/3030/0
            [51509.738613] Lustre: 1856484:0:(osd_handler.c:1962:osd_trans_dump_creds()) write: 5/55/0, punch: 0/0/0, quota 6/102/0
            [51509.738615] Lustre: 1856484:0:(osd_handler.c:1972:osd_trans_dump_creds()) insert: 17/335/0, delete: 3/6/0
            [51509.738617] Lustre: 1856484:0:(osd_handler.c:1979:osd_trans_dump_creds()) ref_add: 10/10/0, ref_del: 3/3/0
            [51509.738625] CPU: 0 PID: 1856484 Comm: mdt_io00_010 Kdump: loaded Tainted: G W OE ------- — 5.14.0-427.24.1_lustre.el9.x86_64 #1
            [51509.738632] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [51509.738638] Call Trace:
            [51509.738653] <TASK>
            [51509.738658] dump_stack_lvl+0x34/0x48
            [51509.738718] osd_trans_start+0x676/0x6d0 [osd_ldiskfs]
            [51509.738871] top_trans_start+0x199/0x4d0 [ptlrpc]
            [51509.739560] mdd_migrate_object+0x9d1/0x1130 [mdd]
            [51509.739679] ? lprocfs_counter_add+0x109/0x170 [obdclass]
            [51509.740044] ? mdd_migrate+0x14/0x20 [mdd]
            [51509.740063] mdd_migrate+0x14/0x20 [mdd]
            [51509.740083] mdt_reint_migrate+0x1224/0x1bb0 [mdt]
            [51509.740303] mdt_reint_rec+0x11c/0x270 [mdt]
            [51509.740343] mdt_reint_internal+0x4e4/0x850 [mdt]
            [51509.740375] mdt_reint+0x59/0x110 [mdt]
            [51509.740408] tgt_handle_request0+0x14a/0x770 [ptlrpc]
            [51509.740533] tgt_request_handle+0x1eb/0xb80 [ptlrpc]
            [51509.740645] ptlrpc_server_handle_request.isra.0+0x40a/0xc80 [ptlrpc]
            [51509.740746] ptlrpc_main+0xaa0/0xfe0 [ptlrpc]
            [51509.740842] ? __pfx_ptlrpc_main+0x10/0x10 [ptlrpc]
            [51509.740936] kthread+0xe0/0x100
            [51509.740969] ? __pfx_kthread+0x10/0x10
            [51509.740972] ret_from_fork+0x2c/0x50

            Does migration declare too many credits for the xattr transfer?
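
            As a rough cross-check (assuming the middle value of each "op: ops/credits/..." triple printed by osd_trans_dump_creds is the number of journal credits declared for that operation type), the per-operation credits in the second dump add up exactly to the reported total, and xattr_set alone accounts for roughly 85% of the reservation:

            # Tally of the second osd_trans_dump_creds dump above (assumed field meaning:
            # middle value = credits declared per operation type).
            declared = {
                "create": 12, "destroy": 4, "attr_set": 2, "xattr_set": 3030,
                "write": 55, "punch": 0, "quota": 102, "insert": 335,
                "delete": 6, "ref_add": 10, "ref_del": 3,
            }
            total = sum(declared.values())
            print(total)                                    # 3559, matching "credits 3559 > trans_max 2464"
            print(round(declared["xattr_set"] / total, 2))  # 0.85 -- most of the reservation is xattr_set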

            lixi_wc Li Xi added a comment -

            bobijam Would you please take a look?


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 8
