LU-14379

ls hangs on client and console log message LNet: Service thread pid 25241 was inactive performing mdt_reint_rename

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • None
    • Environment: zfs-0.7
      kernel-3.10.0-1160.4.1.1chaos.ch6.x86_64
      lustre-2.12.5_10.llnl-3.ch6.x86_64
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      At some time on the morning of Jan 27th, we got reports of directory listings ("ls") hanging; the affected directories were on MDT5 and MDT7.

      The console logs of MDT5 and MDT7 both reported repeated watchdog dumps, all with very similar stacks. The first one on MDT7 appeared at Thu Jan 21 12:01:33 2021:

      # From zinc8 dmesg log
      LNet: Service thread pid 25241 was inactive for 200.13s. The thread might be hung, or it might only be slow and will resume later. D
      Pid: 25241, comm: mdt01_049 3.10.0-1160.4.1.1chaos.ch6.x86_64 #1 SMP Fri Oct 9 17:56:20 PDT 2020
      Call Trace:
       [<ffffffffc141a460>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
       [<ffffffffc141be2f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
       [<ffffffffc141ef3e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
       [<ffffffffc1982342>] osp_md_object_lock+0x162/0x2d0 [osp]
       [<ffffffffc1895194>] lod_object_lock+0xf4/0x780 [lod]
       [<ffffffffc1916ace>] mdd_object_lock+0x3e/0xe0 [mdd]
       [<ffffffffc17ae681>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
       [<ffffffffc17aec1a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
       [<ffffffffc17c407e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
       [<ffffffffc17c6400>] mdt_reint_rename+0x2c0/0x2900 [mdt]
       [<ffffffffc17cf113>] mdt_reint_rec+0x83/0x210 [mdt]
       [<ffffffffc17ab303>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
       [<ffffffffc17b6b37>] mdt_reint+0x67/0x140 [mdt] 
       [<ffffffffc14b8b1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
       [<ffffffffc145d80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
       [<ffffffffc1461bfd>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
       [<ffffffffadecafc1>] kthread+0xd1/0xe0
       [<ffffffffae5c1ff7>] ret_from_fork_nospec_end+0x0/0x39
       [<ffffffffffffffff>] 0xffffffffffffffff
      LustreError: dumping log to /tmp/lustre-log.1611259694.25241 

      The remaining dumps took a different path within ldlm_cli_enqueue:

      ptlrpc_set_wait+0x4d8/0x800 [ptlrpc]
      ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc]
      osp_md_object_lock+0x162/0x2d0 [osp]
      lod_object_lock+0xf4/0x780 [lod]
      mdd_object_lock+0x3e/0xe0 [mdd]
      mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      mdt_remote_object_lock+0x2a/0x30 [mdt]
      mdt_rename_lock+0xbe/0x4b0 [mdt]
      mdt_reint_rename+0x2c0/0x2900 [mdt]
      mdt_reint_rec+0x83/0x210 [mdt]
      mdt_reint_internal+0x6e3/0xaf0 [mdt]
      mdt_reint+0x67/0x140 [mdt]
      tgt_request_handle+0xada/0x1570 [ptlrpc]
      ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      ptlrpc_main+0xc4d/0x2280 [ptlrpc]
      kthread+0xd1/0xe0
      ret_from_fork_nospec_end+0x0/0x39
      0xffffffffffffffff
      

      I can provide MDT7 debug logs and an ldlm namespace dump; I have a core dump from MDT5, and dmesg logs for both.
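
      For reference, a minimal sketch of how the binary dump referenced in the console message can be converted to text, and how the current in-memory debug log can be captured (standard lctl commands; the output file names here are arbitrary):

      # Convert the binary debug dump named in the LustreError message to text.
      lctl debug_file /tmp/lustre-log.1611259694.25241 /tmp/lustre-log.1611259694.25241.txt
      # Dump (and clear) the in-memory Lustre debug log on the MDS.
      lctl dk /tmp/lustre-debug.txt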

      Attachments

        Issue Links

          Activity


            ofaaland Olaf Faaland added a comment -

            Closing (left "Fixed", forgot to change to "Duplicate").
            laisiyao Lai Siyao added a comment -

            Mmm, it might be because commit "1857880 LU-13437 llite: pass name in getattr by FID" is missing.

            ofaaland Olaf Faaland added a comment - - edited

            Ah.  Thank you.

            We are preparing to upgrade to Lustre 2.12.6; we just don't have it on the machines yet.

            A little later in the day, users started reporting bad dirents, e.g.

            [root@oslic20:test]# ll
            ls: cannot access scratch-2: No such file or directory
            total 1798843
            -rw-rw---- 1 varley2 varley2          0 Jan 27 12:14 CHG
            -rw-rw---- 1 varley2 varley2          0 Jan 27 12:14 CHGCAR
            -rw-rw---- 1 varley2 varley2      27295 Jan 27 17:53 CONTCAR
            -rw-rw---- 1 varley2 varley2    5621531 Jan 27 17:54 DOSCAR
            -rw-rw---- 1 varley2 varley2      83839 Jan 27 17:54 EIGENVAL
            -rw-rw---- 1 varley2 varley2        207 Jan 27 12:14 IBZKPT
            lrwxrwxrwx 1 varley2 varley2          8 Jan 27 11:05 INCAR -> ../INCAR
            lrwxrwxrwx 1 varley2 varley2         10 Jan 27 11:05 KPOINTS -> ../KPOINTS
            -rw-rw---- 1 varley2 varley2  149105972 Jan 27 17:54 LOCPOT
            -rw-rw---- 1 varley2 varley2       2121 Jan 27 17:53 OSZICAR
            -rw-rw---- 1 varley2 varley2     612859 Jan 27 17:54 OUTCAR
            -rw-rw---- 1 varley2 varley2        234 Jan 27 12:14 PCDAT
            -rw------- 1 varley2 varley2      27295 Jan 27 11:04 POSCAR
            lrwxrwxrwx 1 varley2 varley2          9 Jan 27 11:05 POTCAR -> ../POTCAR
            -rw-rw---- 1 varley2 varley2   21995032 Jan 27 17:54 PROCAR
            -rw-rw---- 1 varley2 varley2          0 Jan 27 12:14 REPORT
            -rw------- 1 varley2 varley2 1820073216 Jan 27 17:53 WAVECAR
            -rw-rw---- 1 varley2 varley2      25475 Jan 27 17:53 XDATCAR
            lrwxrwxrwx 1 varley2 varley2          9 Jan 27 11:05 msub.vasp -> psub.vasp
            -rw-rw---- 1 varley2 varley2       9560 Jan 27 17:53 psub.out
            -rw-rw---- 1 varley2 varley2        709 Jan 27 11:05 psub.vasp
            -????????? ? ?       ?                ?            ? scratch-2
            -rw-rw---- 1 varley2 varley2   34802010 Jan 27 17:54 vasprun.xml 

            and

            [root@oslic20:2MP]# ls -lR
            .:
            total 25
            drwx------ 2 varley2 varley2 25600 Jan 27 16:31 2ndshell-broken
            
            
            ./2ndshell-broken:
            ls: cannot access ./2ndshell-broken/scratch-2: No such file or directory
            total 0
            -????????? ? ? ? ?            ? scratch-2
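
            One way to gather more detail on such an entry is sketched below (illustrative only; whether these commands succeed against the broken dirent is unknown):

            # Is the parent a remote or striped directory (i.e. spread across MDTs)?
            lfs getdirstripe ./2ndshell-broken
            # Can the broken entry's FID still be resolved?
            lfs path2fid ./2ndshell-broken/scratch-2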

            Is this a likely side effect of LU-13437? We have some of those patches, but not all of them, in our 2.12.5 branch - I don't know whether that's better or worse than having none:

            $ git lg -n 200 --grep LU-13437 2.12.6_3.llnl | sort -k3
            * 3314727 LU-13437 llite: pack parent FID in getattr
            * 1857880 LU-13437 llite: pass name in getattr by FID
            * f1712b3 LU-13437 lmv: check stripe FID sanity
            * 23c05e8 LU-13437 mdc: remote object support getattr from cache
            * ae9fc81 LU-13437 mdt: don't fetch LOOKUP lock for remote object
            * 23fa920 LU-13437 mdt: rename misses remote LOOKUP lock revoke
            * daa9148 LU-13437 uapi: add OBD_CONNECT2_GETATTR_PFID
            
            $ git lg -n 200 --grep LU-13437  2.12.5_10.llnl | sort -k3
            * d06cc90 LU-13437 llite: pack parent FID in getattr
            * 338d34e LU-13437 lmv: check stripe FID sanity
            * 8d212e8 LU-13437 mdt: don't fetch LOOKUP lock for remote object
            * 6dc8f51 LU-13437 mdt: rename misses remote LOOKUP lock revoke
            * ddec375 LU-13437 uapi: add OBD_CONNECT2_GETATTR_PFID
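
            A sketch of one way to list the LU-13437 commits that are upstream but missing from the local branch (this assumes an upstream b2_12 branch is fetched, and uses plain git log in place of the local "git lg" alias):

            comm -13 \
                <(git log --oneline --grep='LU-13437' 2.12.5_10.llnl | sed 's/^[^ ]* //' | sort) \
                <(git log --oneline --grep='LU-13437' b2_12 | sed 's/^[^ ]* //' | sort)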
            
            laisiyao Lai Siyao added a comment -

            This looks to be a duplicate of LU-13437; the fix landed on the latest 2.12 branch, so you can cherry-pick it from there.

            ofaaland Olaf Faaland added a comment -

            PID 25241, given in the watchdog warning above, appears in console log messages regarding changelog handling:

            # [Sat Jan 23 12:21:13 2021] Lustre: 25161:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            >> [Sat Jan 23 12:21:53 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            # [Sat Jan 23 12:24:08 2021] Lustre: 25188:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22 
            

            There are many like that - some of the others are:

            [Sat Jan 23 13:38:06 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            [Sat Jan 23 13:38:06 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) Skipped 12 previous similar messages
            [Sat Jan 23 15:25:52 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            [Sat Jan 23 15:25:52 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) Skipped 9 previous similar messages
            [Sat Jan 23 19:39:04 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            

            The changelog clear messages come from Starfish, a policy manager similar to Robinhood, which frequently attempts to clear changelog entries that have already been cleared.
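
            For reference, a sketch of how the registered changelog users and their last-cleared record indexes can be inspected with standard tools (device and user taken from the messages above; user 6 is presumably registered as cl6, and <endrec> is a placeholder):

            # Show registered changelog users and the record index each has cleared to.
            lctl get_param mdd.*.changelog_users
            # Starfish's clears correspond to something like this; per the above, the -22
            # (EINVAL) results from clearing records that were already cleared.
            lfs changelog_clear lsh-MDT0007 cl6 <endrec>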

            ofaaland Olaf Faaland added a comment -

            SysRq  show-backtrace-all-active-cpus(l) dumped no stacks other than the core processing the SysRq.  All other cores were idling:

            # From [Wed Jan 27 13:13:19 2021] 
            SysRq : Show backtrace of all active CPUs
            sending NMI to all CPUs:
            NMI backtrace for cpu 0 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 1 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 2 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 3 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 4 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 5 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 6 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 7 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 8
            CPU: 8 PID: 0 Comm: swapper/8 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1160.4.1.1chaos.ch6.x86_64 #1
            Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
            task: ffff98d5ca74e300 ti: ffff98d5cab50000 task.ti: ffff98d5cab50000
            RIP: 0010:[<ffffffffade6102b>]  [<ffffffffade6102b>] default_send_IPI_mask_sequence_phys+0x7b/0x100
            RSP: 0018:ffff9914bf203ce8  EFLAGS: 00000046
            RAX: ffff9914bf240000 RBX: 000000000000e02e RCX: 0000000000000020
            RDX: 0000000000000009 RSI: 0000000000000012 RDI: 0000000000000000
            RBP: ffff9914bf203d20 R08: ffffffffaeb5b5a0 R09: 0000000000008032
            R10: 0000000000000000 R11: ffff9914bf203a2e R12: ffffffffaeb5b5a0
            R13: 0000000000000400 R14: 0000000000000086 R15: 0000000000000002
            FS:  0000000000000000(0000) GS:ffff9914bf200000(0000) knlGS:0000000000000000
            CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            CR2: 00007ffff7ff8000 CR3: 0000007bf1ed8000 CR4: 00000000001607e0
            Call Trace:
             <IRQ> [<ffffffffade66d4e>] physflat_send_IPI_mask+0xe/0x10
             [<ffffffffade6153a>] arch_trigger_all_cpu_backtrace+0x2ea/0x2f0
             [<ffffffffae2932f3>] sysrq_handle_showallcpus+0x13/0x20
             [<ffffffffae2939ed>] __handle_sysrq+0x11d/0x180
             [<ffffffffae293a76>] handle_sysrq+0x26/0x30
             ...
             [<ffffffffae5c6b2d>] do_IRQ+0x4d/0xf0
             [<ffffffffae5b836a>] common_interrupt+0x16a/0x16a
             <EOI> [<ffffffffae3eab67>] ? cpuidle_enter_state+0x57/0xd0
             [<ffffffffae3eacbe>] cpuidle_idle_call+0xde/0x270
             [<ffffffffade3919e>] arch_cpu_idle+0xe/0xc0
             [<ffffffffadf0835a>] cpu_startup_entry+0x14a/0x1e0
             [<ffffffffade5cb87>] start_secondary+0x207/0x280
             [<ffffffffade000d5>] start_cpu+0x5/0x14
            Code: c2 01 4c 89 e7 48 63 d2 e8 93 8f 35 00 89 c2 48 63 05 ae e6 cf 00 48 39 c2 73 53 48 8b 04 d5 a0 15 b5 ae 41 83 ff 02 0f b7 34 03 <74> 5a 8b 04 25 00 93 5a ff f6 c4 10 74 15 0f 1f 80 00 00 00 00 
            NMI backtrace for cpu 9 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 10 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 11 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 12 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 13 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 14 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 15 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 16 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 17 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 18 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 19 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 20 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 21 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 22 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 23 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 24 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 25 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 26 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 27 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 28 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 29 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 30 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 31 skipped: idling at pc 0xffffffffae5b6aa1
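
            For reference, this kind of all-CPU backtrace is typically requested as follows (a sketch; this is presumably how the dump above was produced):

            # Enable sysrq if needed, then ask for backtraces of all active CPUs.
            echo 1 > /proc/sys/kernel/sysrq
            echo l > /proc/sysrq-trigger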
            ofaaland Olaf Faaland added a comment -

            Code listings for the point where the stacks are different:

            (gdb) l *(ldlm_cli_enqueue+0x3d2)
            0x25f32 is in ldlm_cli_enqueue (/usr/src/debug/lustre-2.12.5_10.llnl/lustre/ldlm/ldlm_request.c:1036).
            1031	
            1032		LDLM_DEBUG(lock, "sending request");
            1033	
            1034		rc = ptlrpc_queue_wait(req);
            1035	
            1036		err = ldlm_cli_enqueue_fini(exp, req, einfo->ei_type, policy ? 1 : 0,
            1037					    einfo->ei_mode, flags, lvb, lvb_len,
            1038					    lockh, rc);
            1039	
            1040		/* If ldlm_cli_enqueue_fini did not find the lock, we need to free
            
            
            (gdb) l *(ldlm_cli_enqueue+0x40e)
            0x25f6e is in ldlm_cli_enqueue (/usr/src/debug/lustre-2.12.5_10.llnl/lustre/ldlm/ldlm_request.c:1042).
            1037					    einfo->ei_mode, flags, lvb, lvb_len,
            1038					    lockh, rc);
            1039	
            1040		/* If ldlm_cli_enqueue_fini did not find the lock, we need to free
            1041		 * one reference that we took */
            1042		if (err == -ENOLCK)
            1043			LDLM_LOCK_RELEASE(lock);
            1044		else
            1045			rc = err;
            1046	 
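
            A sketch of how such listings can be reproduced non-interactively, assuming the matching lustre debuginfo package is installed (the module path below is an assumption; adjust it to the actual debuginfo layout):

            gdb -batch /usr/lib/debug/lib/modules/$(uname -r)/extra/ptlrpc.ko.debug \
                -ex 'list *(ldlm_cli_enqueue+0x3d2)' \
                -ex 'list *(ldlm_cli_enqueue+0x40e)'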
            ofaaland Olaf Faaland added a comment -

            Possibly related to LU-14378

            ofaaland Olaf Faaland added a comment -

            For my records, my local ticket is TOSS5037


            People

              laisiyao Lai Siyao
              ofaaland Olaf Faaland
              Votes: 0
              Watchers: 3
