The ticket for tracking all DNE2 bugs

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.8.0, Lustre 2.9.0
    • 3
    • 9223372036854775807

    Description

      This ticket is for tracking all DNE2 bugs.

          Activity

            [LU-6831] The ticket for tracking all DNE2 bugs

            simmonsja James A Simmons added a comment -

            Yep. I'm testing it right now.

            di.wang Di Wang added a comment -

            James: I just updated the patch http://review.whamcloud.com/#/c/16969/ please retry, thanks.


            simmonsja James A Simmons added a comment -

            A soft lockup is happening on a spin lock:

            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.147894] BUG: soft lockup - CPU#0 stuck for 67s! [osp_up7-0:20904]
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.152894] BUG: soft lockup - CPU#1 stuck for 67s! [osp_up4-0:20901]

            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.152993] Pid: 20901, comm: osp_up4-0 Tainted: P --------------- 2.6.32-504.30.3.el6.head.x86_64 #1 Supermicro X8DT6/X8DT6
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.152996] RIP: 0010:[<ffffffff8152da2e>] [<ffffffff8152da2e>] _spin_lock+0x1e/0x30
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153004] RSP: 0018:ffff8817d793dda0 EFLAGS: 00000202
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153005] RAX: 0000000000000002 RBX: ffff8817d793dda0 RCX: 0000000000000000
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153007] RDX: 0000000000000003 RSI: ffff8817d793dea8 RDI: ffff880bcd9e8d50
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153009] RBP: ffffffff8100bc0e R08: ffff8817d793c000 R09: 00000000ffffffff
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153010] R10: 000000a7ad891c7a R11: 0000000000000001 R12: 0000000000000000
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153012] R13: 0000000000000000 R14: 0000000000015900 R15: 0000000000000000
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153014] FS: 0000000000000000(0000) GS:ffff880028220000(0000) knlGS:0000000000000000
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153016] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153017] CR2: 0000003d64205380 CR3: 0000000001a85000 CR4: 00000000000007e0
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153021] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153023] Process osp_up4-0 (pid: 20901, threadinfo ffff8817d793c000, task ffff881828342ab0)
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153024] Stack:
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153025] ffff8817d793ddf0 ffffffffa13ab4f9 0000000000000246 ffff8817d793dea8
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153027] <d> ffff8817d793ddf0 ffff881828342ab0 ffff880bcd9e8d40 ffff880bcd8d0800
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153030] <d> ffff8817d793de40 ffff8817d793dea8 ffff8817d793dee0 ffffffffa13b0074
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153032] Call Trace:
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153055] [<ffffffffa13ab4f9>] ? osp_get_next_request+0x29/0x1a0 [osp]
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153066] [<ffffffffa13b0074>] ? osp_send_update_thread+0x2f4/0x5b0 [osp]
            Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.153071] [<ffffffff81064d00>] ? default_wake_function+0x0/0x20
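
When a log contains many reports like the ones in this comment, it can help to list which threads are stuck and on which CPU. The following is a minimal sketch, not part of the ticket: the function name `find_soft_lockups` and the regular expression are assumptions, written only against the "BUG: soft lockup" line format shown here.

```python
import re

# Matches kernel lines of the form seen in this ticket, e.g.
# "... BUG: soft lockup - CPU#0 stuck for 67s! [osp_up7-0:20904]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]:]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup report found in the given log lines."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "comm": match.group("comm"),   # thread name, e.g. osp_up4-0
                "pid": int(match.group("pid")),
            })
    return hits


if __name__ == "__main__":
    import sys

    # Read a syslog/dmesg capture from stdin and print each stuck thread.
    for hit in find_soft_lockups(sys.stdin):
        print(hit)
```

Fed the log from this comment on standard input, the script would report the two `osp_up*` threads; the register dump and call trace lines are ignored.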

            simmonsja James A Simmons added a comment -

            I updated my software stack and I'm seeing a lot of these on the OSS servers:

            [94725.339746] Lustre: sultan-OST0004: already connected client sultan-MDT0000-mdtlov_UUID (at 10.37.248.155@o2ib1) with handle 0xb4b2e32f66f3ee41. Rejecting client with the same UUID trying to reconnect with handle 0x157ffaac64917bbd

            It seems only MDS1 is having this problem. On that MDS the error message is:

            [95881.016995] LustreError: 137-5: sultan-MDT0001_UUID: not available for connect from 10.37.248.130@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.

            simmonsja James A Simmons added a comment - - edited

            Here is the full log from the node that was crashing this morning.

            Just to let you know, the IOC_LMV_SETSTRIPE error is no longer an issue.

            di.wang Di Wang added a comment -

            Could you please get the debug log (-1 level) on MDT0? I assume jsimmons is on MDT0? Thanks.
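
For reference, a "-1 level" debug log is usually captured with `lctl` on the server in question: enable all debug flags, clear the kernel debug buffer, reproduce the problem, then dump the buffer to a file. The sketch below only builds those command lines; the helper names and the output path are hypothetical, and it assumes it would be run as root on the MDS that hosts MDT0.

```python
import subprocess


def start_capture_commands():
    """Commands to run before reproducing the problem."""
    return [
        ["lctl", "set_param", "debug=-1"],  # enable every debug flag
        ["lctl", "clear"],                  # empty the kernel debug buffer
    ]


def dump_capture_commands(outfile):
    """Commands to run after reproducing the problem."""
    return [
        ["lctl", "dk", outfile],            # dump the debug buffer to a file
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical output path; adjust for the local system.
    run(start_capture_commands())
    # ... reproduce the failure here, e.g. the failing lfs command ...
    run(dump_capture_commands("/tmp/lustre-debug-mdt0.log"))
```

The dry-run default only prints the commands, so the sequence can be reviewed before anything touches a live server.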

            simmonsja James A Simmons added a comment -

            Due to the loss of some of my MDS servers, I attempted to create new striped directories today, but I get this error every time:

            lfs setdirstripe -c 4 /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
            error on LL_IOC_LMV_SETSTRIPE '/lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test' (3): Invalid argument
            error: setdirstripe: create stripe dir '/lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test' failed

            This happens even when I'm root.

            simmonsja James A Simmons added a comment -

            I attached my client logs to LU-6984.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15720/
            Subject: LU-6831 lmv: revalidate the dentry for striped dir
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a17909a92da74cb26fb9bf2824f968b2adf0897e

            simmonsja James A Simmons added a comment -

            Testing to see if the problem exists on a directory striped across 8 MDS servers. Waiting for the results. I will push some log data for you soon.

            People

              di.wang Di Wang
              di.wang Di Wang
              Votes: 0
              Watchers: 14
