LU-14379

ls hangs on client and console log message LNet: Service thread pid 25241 was inactive performing mdt_reint_rename

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • None
    • Environment: zfs-0.7
      kernel-3.10.0-1160.4.1.1chaos.ch6.x86_64
      lustre-2.12.5_10.llnl-3.ch6.x86_64
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      At some time on the morning of Jan 27th, we got reports of directory listings ("ls") hanging; the affected directories were on MDT5 and MDT7.

      The console logs of MDT5 and MDT7 both reported repeated watchdog dumps, all with very similar stacks. The first one on MDT7 appeared at Thu Jan 21 12:01:33 2021:

      # From zinc8 dmesg log
      LNet: Service thread pid 25241 was inactive for 200.13s. The thread might be hung, or it might only be slow and will resume later. D
      Pid: 25241, comm: mdt01_049 3.10.0-1160.4.1.1chaos.ch6.x86_64 #1 SMP Fri Oct 9 17:56:20 PDT 2020
      Call Trace:
       [<ffffffffc141a460>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
       [<ffffffffc141be2f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
       [<ffffffffc141ef3e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
       [<ffffffffc1982342>] osp_md_object_lock+0x162/0x2d0 [osp]
       [<ffffffffc1895194>] lod_object_lock+0xf4/0x780 [lod]
       [<ffffffffc1916ace>] mdd_object_lock+0x3e/0xe0 [mdd]
       [<ffffffffc17ae681>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
       [<ffffffffc17aec1a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
       [<ffffffffc17c407e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
       [<ffffffffc17c6400>] mdt_reint_rename+0x2c0/0x2900 [mdt]
       [<ffffffffc17cf113>] mdt_reint_rec+0x83/0x210 [mdt]
       [<ffffffffc17ab303>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
       [<ffffffffc17b6b37>] mdt_reint+0x67/0x140 [mdt] 
       [<ffffffffc14b8b1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
       [<ffffffffc145d80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
       [<ffffffffc1461bfd>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
       [<ffffffffadecafc1>] kthread+0xd1/0xe0
       [<ffffffffae5c1ff7>] ret_from_fork_nospec_end+0x0/0x39
       [<ffffffffffffffff>] 0xffffffffffffffff
      LustreError: dumping log to /tmp/lustre-log.1611259694.25241 

      The remaining dumps took a different path within ldlm_cli_enqueue:

      ptlrpc_set_wait+0x4d8/0x800 [ptlrpc]
      ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc]
      osp_md_object_lock+0x162/0x2d0 [osp]
      lod_object_lock+0xf4/0x780 [lod]
      mdd_object_lock+0x3e/0xe0 [mdd]
      mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      mdt_remote_object_lock+0x2a/0x30 [mdt]
      mdt_rename_lock+0xbe/0x4b0 [mdt]
      mdt_reint_rename+0x2c0/0x2900 [mdt]
      mdt_reint_rec+0x83/0x210 [mdt]
      mdt_reint_internal+0x6e3/0xaf0 [mdt]
      mdt_reint+0x67/0x140 [mdt]
      tgt_request_handle+0xada/0x1570 [ptlrpc]
      ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      ptlrpc_main+0xc4d/0x2280 [ptlrpc]
      kthread+0xd1/0xe0
      ret_from_fork_nospec_end+0x0/0x39
      0xffffffffffffffff
      

      I can provide MDT7 debug logs and an ldlm namespace dump; I have a core dump from MDT5, and dmesg logs for both.
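
      For reference, a minimal sketch of how the binary dump referenced in the console message can be converted to text, and how the current in-memory debug log can be captured (standard lctl commands; the output file names here are arbitrary):

      # Convert the binary debug dump named in the LustreError message to text.
      lctl debug_file /tmp/lustre-log.1611259694.25241 /tmp/lustre-log.1611259694.25241.txt
      # Dump (and clear) the in-memory Lustre debug log on the MDS.
      lctl dk /tmp/lustre-debug.txt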

      Attachments

        Issue Links

          Activity


            ofaaland Olaf Faaland added a comment -

            Closing (left "Fixed", forgot to change to "Duplicate").
            laisiyao Lai Siyao added a comment -

            Mmm, it might be because commit "1857880 LU-13437 llite: pass name in getattr by FID" is missing.

            ofaaland Olaf Faaland added a comment - - edited

            Ah.  Thank you.

            We are preparing to upgrade to Lustre 2.12.6; we just don't have it on the machines yet.

            A little later in the day, users started reporting bad dirents, e.g.

            [root@oslic20:test]# ll
            ls: cannot access scratch-2: No such file or directory
            total 1798843
            -rw-rw---- 1 varley2 varley2          0 Jan 27 12:14 CHG
            -rw-rw---- 1 varley2 varley2          0 Jan 27 12:14 CHGCAR
            -rw-rw---- 1 varley2 varley2      27295 Jan 27 17:53 CONTCAR
            -rw-rw---- 1 varley2 varley2    5621531 Jan 27 17:54 DOSCAR
            -rw-rw---- 1 varley2 varley2      83839 Jan 27 17:54 EIGENVAL
            -rw-rw---- 1 varley2 varley2        207 Jan 27 12:14 IBZKPT
            lrwxrwxrwx 1 varley2 varley2          8 Jan 27 11:05 INCAR -> ../INCAR
            lrwxrwxrwx 1 varley2 varley2         10 Jan 27 11:05 KPOINTS -> ../KPOINTS
            -rw-rw---- 1 varley2 varley2  149105972 Jan 27 17:54 LOCPOT
            -rw-rw---- 1 varley2 varley2       2121 Jan 27 17:53 OSZICAR
            -rw-rw---- 1 varley2 varley2     612859 Jan 27 17:54 OUTCAR
            -rw-rw---- 1 varley2 varley2        234 Jan 27 12:14 PCDAT
            -rw------- 1 varley2 varley2      27295 Jan 27 11:04 POSCAR
            lrwxrwxrwx 1 varley2 varley2          9 Jan 27 11:05 POTCAR -> ../POTCAR
            -rw-rw---- 1 varley2 varley2   21995032 Jan 27 17:54 PROCAR
            -rw-rw---- 1 varley2 varley2          0 Jan 27 12:14 REPORT
            -rw------- 1 varley2 varley2 1820073216 Jan 27 17:53 WAVECAR
            -rw-rw---- 1 varley2 varley2      25475 Jan 27 17:53 XDATCAR
            lrwxrwxrwx 1 varley2 varley2          9 Jan 27 11:05 msub.vasp -> psub.vasp
            -rw-rw---- 1 varley2 varley2       9560 Jan 27 17:53 psub.out
            -rw-rw---- 1 varley2 varley2        709 Jan 27 11:05 psub.vasp
            -????????? ? ?       ?                ?            ? scratch-2
            -rw-rw---- 1 varley2 varley2   34802010 Jan 27 17:54 vasprun.xml 

            and

            [root@oslic20:2MP]# ls -lR
            .:
            total 25
            drwx------ 2 varley2 varley2 25600 Jan 27 16:31 2ndshell-broken
            
            
            ./2ndshell-broken:
            ls: cannot access ./2ndshell-broken/scratch-2: No such file or directory
            total 0
            -????????? ? ? ? ?            ? scratch-2
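
            One way to gather more detail on such an entry is sketched below (illustrative only; whether these commands succeed against the broken dirent is unknown):

            # Is the parent a remote or striped directory (i.e. spread across MDTs)?
            lfs getdirstripe ./2ndshell-broken
            # Can the broken entry's FID still be resolved?
            lfs path2fid ./2ndshell-broken/scratch-2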

            Is this a likely side effect of LU-13437? We have some of those patches, but not all of them, in our 2.12.5 branch - I don't know whether that's better or worse than having none:

            $ git lg -n 200 --grep LU-13437 2.12.6_3.llnl | sort -k3
            * 3314727 LU-13437 llite: pack parent FID in getattr
            * 1857880 LU-13437 llite: pass name in getattr by FID
            * f1712b3 LU-13437 lmv: check stripe FID sanity
            * 23c05e8 LU-13437 mdc: remote object support getattr from cache
            * ae9fc81 LU-13437 mdt: don't fetch LOOKUP lock for remote object
            * 23fa920 LU-13437 mdt: rename misses remote LOOKUP lock revoke
            * daa9148 LU-13437 uapi: add OBD_CONNECT2_GETATTR_PFID
            
            $ git lg -n 200 --grep LU-13437  2.12.5_10.llnl | sort -k3
            * d06cc90 LU-13437 llite: pack parent FID in getattr
            * 338d34e LU-13437 lmv: check stripe FID sanity
            * 8d212e8 LU-13437 mdt: don't fetch LOOKUP lock for remote object
            * 6dc8f51 LU-13437 mdt: rename misses remote LOOKUP lock revoke
            * ddec375 LU-13437 uapi: add OBD_CONNECT2_GETATTR_PFID
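
            A sketch of one way to list the LU-13437 commits that are upstream but missing from the local branch (this assumes an upstream b2_12 branch is fetched, and uses plain git log in place of the local "git lg" alias):

            comm -13 \
                <(git log --oneline --grep='LU-13437' 2.12.5_10.llnl | sed 's/^[^ ]* //' | sort) \
                <(git log --oneline --grep='LU-13437' b2_12 | sed 's/^[^ ]* //' | sort)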
            
            laisiyao Lai Siyao added a comment -

            This looks to be a duplicate of LU-13437; the fix landed on the latest 2.12 branch, so you can cherry-pick it from there.

            ofaaland Olaf Faaland added a comment -

            PID 25241, given in the watchdog warning above, appears in console log messages regarding changelog handling:

            # [Sat Jan 23 12:21:13 2021] Lustre: 25161:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            >> [Sat Jan 23 12:21:53 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            # [Sat Jan 23 12:24:08 2021] Lustre: 25188:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22 
            

            There are many like that - some of the others are:

            [Sat Jan 23 13:38:06 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            [Sat Jan 23 13:38:06 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) Skipped 12 previous similar messages
            [Sat Jan 23 15:25:52 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            [Sat Jan 23 15:25:52 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) Skipped 9 previous similar messages
            [Sat Jan 23 19:39:04 2021] Lustre: 25241:0:(mdd_device.c:1811:mdd_changelog_clear()) lsh-MDD0007: Failure to clear the changelog for user 6: -22
            

            The changelog clear messages come from Starfish, a policy manager similar to Robinhood, which frequently attempts to clear changelog entries that have already been cleared.
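
            For reference, a sketch of how the registered changelog users and their last-cleared record indexes can be inspected with standard tools (device and user taken from the messages above; user 6 is presumably registered as cl6, and <endrec> is a placeholder):

            # Show registered changelog users and the record index each has cleared to.
            lctl get_param mdd.*.changelog_users
            # Starfish's clears correspond to something like this; per the above, the -22
            # (EINVAL) results from clearing records that were already cleared.
            lfs changelog_clear lsh-MDT0007 cl6 <endrec>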

            ofaaland Olaf Faaland added a comment -

            SysRq  show-backtrace-all-active-cpus(l) dumped no stacks other than the core processing the SysRq.  All other cores were idling:

            # From [Wed Jan 27 13:13:19 2021] 
            SysRq : Show backtrace of all active CPUs
            sending NMI to all CPUs:
            NMI backtrace for cpu 0 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 1 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 2 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 3 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 4 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 5 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 6 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 7 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 8
            CPU: 8 PID: 0 Comm: swapper/8 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1160.4.1.1chaos.ch6.x86_64 #1
            Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
            task: ffff98d5ca74e300 ti: ffff98d5cab50000 task.ti: ffff98d5cab50000
            RIP: 0010:[<ffffffffade6102b>]  [<ffffffffade6102b>] default_send_IPI_mask_sequence_phys+0x7b/0x100
            RSP: 0018:ffff9914bf203ce8  EFLAGS: 00000046
            RAX: ffff9914bf240000 RBX: 000000000000e02e RCX: 0000000000000020
            RDX: 0000000000000009 RSI: 0000000000000012 RDI: 0000000000000000
            RBP: ffff9914bf203d20 R08: ffffffffaeb5b5a0 R09: 0000000000008032
            R10: 0000000000000000 R11: ffff9914bf203a2e R12: ffffffffaeb5b5a0
            R13: 0000000000000400 R14: 0000000000000086 R15: 0000000000000002
            FS:  0000000000000000(0000) GS:ffff9914bf200000(0000) knlGS:0000000000000000
            CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            CR2: 00007ffff7ff8000 CR3: 0000007bf1ed8000 CR4: 00000000001607e0
            Call Trace:
             <IRQ> [<ffffffffade66d4e>] physflat_send_IPI_mask+0xe/0x10
             [<ffffffffade6153a>] arch_trigger_all_cpu_backtrace+0x2ea/0x2f0
             [<ffffffffae2932f3>] sysrq_handle_showallcpus+0x13/0x20
             [<ffffffffae2939ed>] __handle_sysrq+0x11d/0x180
             [<ffffffffae293a76>] handle_sysrq+0x26/0x30
             ...
             [<ffffffffae5c6b2d>] do_IRQ+0x4d/0xf0
             [<ffffffffae5b836a>] common_interrupt+0x16a/0x16a
             <EOI> [<ffffffffae3eab67>] ? cpuidle_enter_state+0x57/0xd0
             [<ffffffffae3eacbe>] cpuidle_idle_call+0xde/0x270
             [<ffffffffade3919e>] arch_cpu_idle+0xe/0xc0
             [<ffffffffadf0835a>] cpu_startup_entry+0x14a/0x1e0
             [<ffffffffade5cb87>] start_secondary+0x207/0x280
             [<ffffffffade000d5>] start_cpu+0x5/0x14
            Code: c2 01 4c 89 e7 48 63 d2 e8 93 8f 35 00 89 c2 48 63 05 ae e6 cf 00 48 39 c2 73 53 48 8b 04 d5 a0 15 b5 ae 41 83 ff 02 0f b7 34 03 <74> 5a 8b 04 25 00 93 5a ff f6 c4 10 74 15 0f 1f 80 00 00 00 00 
            NMI backtrace for cpu 9 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 10 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 11 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 12 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 13 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 14 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 15 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 16 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 17 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 18 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 19 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 20 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 21 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 22 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 23 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 24 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 25 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 26 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 27 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 28 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 29 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 30 skipped: idling at pc 0xffffffffae5b6aa1
            NMI backtrace for cpu 31 skipped: idling at pc 0xffffffffae5b6aa1
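
            For reference, this kind of all-CPU backtrace is typically requested as follows (a sketch; this is presumably how the dump above was produced):

            # Enable sysrq if needed, then ask for backtraces of all active CPUs.
            echo 1 > /proc/sys/kernel/sysrq
            echo l > /proc/sysrq-trigger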
            ofaaland Olaf Faaland added a comment -

            Code listings for the point where the stacks are different:

            (gdb) l *(ldlm_cli_enqueue+0x3d2)
            0x25f32 is in ldlm_cli_enqueue (/usr/src/debug/lustre-2.12.5_10.llnl/lustre/ldlm/ldlm_request.c:1036).
            1031	
            1032		LDLM_DEBUG(lock, "sending request");
            1033	
            1034		rc = ptlrpc_queue_wait(req);
            1035	
            1036		err = ldlm_cli_enqueue_fini(exp, req, einfo->ei_type, policy ? 1 : 0,
            1037					    einfo->ei_mode, flags, lvb, lvb_len,
            1038					    lockh, rc);
            1039	
            1040		/* If ldlm_cli_enqueue_fini did not find the lock, we need to free
            
            
            (gdb) l *(ldlm_cli_enqueue+0x40e)
            0x25f6e is in ldlm_cli_enqueue (/usr/src/debug/lustre-2.12.5_10.llnl/lustre/ldlm/ldlm_request.c:1042).
            1037					    einfo->ei_mode, flags, lvb, lvb_len,
            1038					    lockh, rc);
            1039	
            1040		/* If ldlm_cli_enqueue_fini did not find the lock, we need to free
            1041		 * one reference that we took */
            1042		if (err == -ENOLCK)
            1043			LDLM_LOCK_RELEASE(lock);
            1044		else
            1045			rc = err;
            1046	 
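
            A sketch of how such listings can be reproduced non-interactively, assuming the matching lustre debuginfo package is installed (the module path below is an assumption; adjust it to the actual debuginfo layout):

            gdb -batch /usr/lib/debug/lib/modules/$(uname -r)/extra/ptlrpc.ko.debug \
                -ex 'list *(ldlm_cli_enqueue+0x3d2)' \
                -ex 'list *(ldlm_cli_enqueue+0x40e)'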
            ofaaland Olaf Faaland added a comment -

            Possibly related to LU-14378

            ofaaland Olaf Faaland added a comment -

            For my records, my local ticket is TOSS5037


            People

              laisiyao Lai Siyao
              ofaaland Olaf Faaland
              Votes: 0
              Watchers: 3
