[LU-15397] LustreError: 4585:0:(llite_mmap.c:61:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed Created: 23/Dec/21  Updated: 21/Apr/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Yang Sheng
Resolution: Unresolved Votes: 0
Labels: None
Environment:

SLES 12 SP5 (kernel 4.12.14-122.103.1)
Lustre client 2.12.6


Issue Links:
Related
is related to LU-14713 Process hung with waiting for mmap_sem Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Clients keep hitting LBUG.
We have hit this with both 2.12.6-ddn55 and whamcloud 2.12.6 clients. Our running kernel has the patch listed in LU-12508.

[557612.110077] LustreError: 4585:0:(llite_mmap.c:61:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed:
[557612.121131] LustreError: 4585:0:(llite_mmap.c:61:our_vma()) LBUG
[557612.127620] Pid: 4585, comm: fftpot4omp 4.12.14-122.103.1.20211202-nasa #1 SMP Tue Nov 23 14:22:07 UTC 2021 (d263070)
[557612.127621] Call Trace:
[557612.127644] [<0>] libcfs_call_trace+0x7e/0xd0 [libcfs]
[557612.127650] [<0>] lbug_with_loc+0x41/0x90 [libcfs]
[557612.127671] [<0>] our_vma+0x141/0x150 [lustre]
[557612.127683] [<0>] vvp_io_rw_lock+0x260/0x710 [lustre]
[557612.127707] [<0>] cl_io_lock+0x5f/0x3c0 [obdclass]
[557612.127723] [<0>] cl_io_loop+0x81/0x1d0 [obdclass]
[557612.127732] [<0>] ll_file_io_generic+0x6b3/0xbe0 [lustre]
[557612.127742] [<0>] ll_file_write_iter+0xbc/0x520 [lustre]
[557612.127745] [<0>] __vfs_write+0xdc/0x150
[557612.127746] [<0>] __kernel_write+0x4b/0xe0
[557612.127748] [<0>] dump_emit+0x79/0xa0
[557612.127750] [<0>] elf_core_dump+0x3fc/0xa40
[557612.127751] [<0>] do_coredump+0x7e5/0x1080
[557612.127752] [<0>] get_signal+0x161/0x7d0
[557612.127754] [<0>] do_signal+0x23/0x650
[557612.127756] [<0>] exit_to_usermode_loop+0x57/0x9c
[557612.127758] [<0>] prepare_exit_to_usermode+0x3d/0x50
[557612.127760] [<0>] retint_user+0x8/0x8
[557612.127761] [<0>] 0xffffffffffffffff
[557612.127762] Kernel panic - not syncing: LBUG
[557612.132493] CPU: 7 PID: 4585 Comm: fftpot4omp Tainted: P           OE      4.12.14-122.103.1.20211202-nasa #1 SLE12-SP5 (unreleased)
[557612.132494] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 01/23/2021
[557612.132494] Call Trace:
[557612.132498]  dump_stack+0x64/0x85
[557612.132501]  panic+0xdb/0x23e
[557612.132508]  lbug_with_loc+0x8b/0x90 [libcfs]
[557612.168839]  our_vma+0x141/0x150 [lustre]
[557612.168852]  vvp_io_rw_lock+0x260/0x710 [lustre]
[557612.178415]  cl_io_lock+0x5f/0x3c0 [obdclass]
[557612.178430]  cl_io_loop+0x81/0x1d0 [obdclass]
[557612.178444]  ll_file_io_generic+0x6b3/0xbe0 [lustre]
[557612.193504]  ll_file_write_iter+0xbc/0x520 [lustre]
[557612.193508]  __vfs_write+0xdc/0x150
[557612.202789]  __kernel_write+0x4b/0xe0
[557612.202792]  dump_emit+0x79/0xa0
[557612.210588]  elf_core_dump+0x3fc/0xa40
[557612.210590]  do_coredump+0x7e5/0x1080
[557612.210594]  get_signal+0x161/0x7d0
[557612.222856]  do_signal+0x23/0x650
[557612.222860]  ? force_sig_info_fault+0x89/0xd0
[557612.231441]  ? mm_fault_error+0xa3/0x13f
[557612.231442]  ? __do_page_fault+0x42f/0x4c0
[557612.231443]  exit_to_usermode_loop+0x57/0x9c
[557612.231447]  ? page_fault+0x2f/0x50
[557612.231449]  prepare_exit_to_usermode+0x3d/0x50
[557612.231451]  retint_user+0x8/0x8
[557612.231452] RIP: 0033:0x402b9d
[557612.231453] RSP: 002b:00002aaadf1f5d90 EFLAGS: 00010206
[557612.231454] RAX: 00002aaaae469a60 RBX: 0000000000000158 RCX: 0000000000000000
[557612.231454] RDX: 000000000208ca50 RSI: 00000000ffffffff RDI: 0000000000000000
[557612.231454] RBP: 00002aaadf1f5e50 R08: 0000000000010600 R09: 00002aab000908e0
[557612.231455] R10: 00002aab00000070 R11: 0000000000000005 R12: 0000000000000024
[557612.231455] R13: 00007fffffffa150 R14: 0000000000617c90 R15: 0000000000619510


 Comments   
Comment by Peter Jones [ 23/Dec/21 ]

Yang Sheng

I believe you have been looking into a similar issue.

Peter

Comment by Yang Sheng [ 10/Jan/22 ]

This issue appears to be a duplicate of LU-14713. The patch https://review.whamcloud.com/44716/ should fix it. I think SLES 12 SP5 does not include commit 6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11. Could you please verify that? I don't have the SLES 12 source code at hand.

Comment by Mahmoud Hanafi [ 21/Apr/22 ]

I can't see patch 44716 and this is not listed in LU-14713.

Comment by Peter Jones [ 21/Apr/22 ]

Mahmoud

I think that we should take a step back and look at what you are trying to do here, and why. It looks like there is a meeting set up for next week, so let's discuss it there.

Peter

Generated at Sat Feb 10 03:17:58 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.