LU-6831

The ticket for tracking all DNE2 bugs

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0, Lustre 2.9.0

Description

This ticket is for tracking all DNE2 bugs.

Activity

simmonsja James A Simmons added a comment -

Updated my software stack and I'm seeing a lot of these on the OSS servers:

[94725.339746] Lustre: sultan-OST0004: already connected client sultan-MDT0000-mdtlov_UUID (at 10.37.248.155@o2ib1) with handle 0xb4b2e32f66f3ee41. Rejecting client with the same UUID trying to reconnect with handle 0x157ffaac64917bbd

It seems that only MDS1 has this. On that MDS the error message is:

[95881.016995] LustreError: 137-5: sultan-MDT0001_UUID: not available for connect from 10.37.248.130@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
simmonsja James A Simmons added a comment - edited

Here is the full log from the node that was crashing this morning.

Just to let you know, the IOC_LMV_SETSTRIPE is no longer an issue.
di.wang Di Wang added a comment -

Could you please get the debug log (-1 level) on MDT0? I assume jsimmons is on MDT0? Thanks.

simmonsja James A Simmons added a comment -

Due to the loss of some of my MDS servers I attempted to create new striped directories today, but instead I get this error every time:

lfs setdirstripe -c 4 /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
error on LL_IOC_LMV_SETSTRIPE '/lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test' (3): Invalid argument
error: setdirstripe: create stripe dir '/lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test' failed

This happens even when I'm root.
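One plausible explanation, given the lost MDS servers mentioned above, is that the requested stripe count exceeds the number of MDTs the client can actually reach. A minimal diagnostic sketch (not confirmed as the cause here; the mount point and directory path are taken from the comment above, and the check is a general one):

```shell
# Hedged sketch: verify how many MDTs this client can reach before
# requesting a 4-stripe directory. Asking for more stripes than there
# are available MDTs is one common way to get EINVAL back from
# LL_IOC_LMV_SETSTRIPE. Adjust MNT for your system.
MNT=/lustre/sultan

# List the MDTs configured for this filesystem as the client sees them
lfs mdts "$MNT"

# Count the metadata-client (MDC) devices that are UP on this node
lctl dl | grep ' mdc ' | grep -c ' UP '

# Only once at least 4 MDTs are up, retry the striped-directory create
lfs setdirstripe -c 4 "$MNT/stf008/scratch/jsimmons/dne2_4_mds_md_test"
```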

simmonsja James A Simmons added a comment -

I attached my client logs to LU-6984.

gerrit Gerrit Updater added a comment -

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15720/
Subject: LU-6831 lmv: revalidate the dentry for striped dir
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a17909a92da74cb26fb9bf2824f968b2adf0897e

simmonsja James A Simmons added a comment -

Testing to see if the problem exists on a directory striped across 8 MDS servers. Waiting for the results. I will push some log data soon for you.
di.wang Di Wang added a comment -

James: any news on this -2 problem? Thanks.
di.wang Di Wang added a comment -

James: no, I did not see these errors. Could you please collect a -1 debug log on the client side when you remove one of these files? Thanks.
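The full-mask (-1) client debug log requested here can be captured roughly as follows. This is a generic sketch using standard `lctl` debug controls; the buffer size and output path are arbitrary choices, and the reproduction step is whatever removal triggers the error:

```shell
# Hedged sketch: capture a full-mask Lustre debug log on the client
# while reproducing the failed remove.
lctl set_param debug=-1       # enable all debug subsystems and masks
lctl set_param debug_mb=512   # enlarge the in-kernel debug buffer (MB)
lctl clear                    # drop any stale debug messages

# ... reproduce the failure here, e.g. rm one of the affected files ...

lctl dk > /tmp/client-debug.log   # dump the debug buffer to a file
```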

simmonsja James A Simmons added a comment -

An update on my latest testing. I'm still seeing problems when creating 1 million+ files per directory. Clearing out the debug logs, I see the problem is only on the client side. When running an application I see:

command line used: /lustre/sultan/stf008/scratch/jsimmons/mdtest -I 100000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test/shared_1000k_10
Path: /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
FS: 21.8 TiB Used FS: 0.2% Inodes: 58.7 Mi Used Inodes: 4.6%

10 tasks, 1000000 files/directories
aprun: Apid 3172: Caught signal Window changed, sending to application
08/03/2015 10:34:45: Process 0(nid00028): FAILED in create_remove_directory_tree, Unable to remove directory: No such file or directory
Rank 0 [Mon Aug 3 10:34:45 2015] [c0-0c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
_pmiu_daemon(SIGCHLD): [NID 00028] [c0-0c0s1n2] [Mon Aug 3 10:34:45 2015] PE RANK 0 exit signal Aborted
aprun: Apid 3172: Caught signal Interrupt, sending to application
_pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s6n0] [Mon Aug 3 10:50:50 2015] PE RANK 7 exit signal Interrupt
_pmiu_daemon(SIGCHLD): [NID 00018] [c0-0c0s6n2] [Mon Aug 3 10:50:50 2015] PE RANK 9 exit signal Interrupt
_pmiu_daemon(SIGCHLD): [NID 00013] [c0-0c0s6n1] [Mon Aug 3 10:50:50 2015] PE RANK 8 exit signal Interrupt

After the test failed, any attempt to remove the files created by these tests fails. When I attempt to remove the files I see the following errors in dmesg:

LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) Skipped 7 previous similar messages
LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2

Di Wang, have you seen these errors during your testing?
di.wang Di Wang added a comment -

Sorry, it might be a mistake; the patch on this ticket has not even landed.

People

    Assignee: di.wang Di Wang
    Reporter: di.wang Di Wang
    Votes: 0
    Watchers: 14