[LU-16441] BUG: Bad page state in process socknal_* functions Created: 03/Jan/23  Updated: 09/Jan/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0, Lustre 2.12.9, Lustre 2.15.1
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: James A Simmons
Assignee: James A Simmons
Resolution: Unresolved
Votes: 0
Labels: ORNL
Environment:

Vanilla Lustre 2.12.9 servers running on RHEL 7.9. The Ethernet hardware is 200 Gb/s cards.


Issue Links:
Related
Severity: 3

Description

The following crash hit the OSS servers of a production system.

[3845523.531585] Call Trace:
[3845523.535134]  [<ffffffffb7d865c9>] dump_stack+0x19/0x1b
[3845523.541350]  [<ffffffffb7d81905>] bad_page.part.75+0xdc/0xf9
[3845523.548082]  [<ffffffffb77c6fd2>] free_pages_prepare+0x1f2/0x210
[3845523.555139]  [<ffffffffb77c7309>] __free_pages_ok+0x19/0xc0
[3845523.561750]  [<ffffffffb77c73cb>] free_compound_page+0x1b/0x20
[3845523.568605]  [<ffffffffb7d8236f>] __put_compound_page+0x25/0x28
[3845523.575535]  [<ffffffffb7d824d8>] put_compound_page+0x166/0x174
[3845523.582456]  [<ffffffffb77cce06>] put_page+0x56/0x60
[3845523.588404]  [<ffffffffb7c4266f>] skb_release_data+0x8f/0x150
[3845523.595114]  [<ffffffffb7c42754>] skb_release_all+0x24/0x30
[3845523.601633]  [<ffffffffb7c42772>] __kfree_skb+0x12/0x20
[3845523.607795]  [<ffffffffb7cbecad>] tcp_ack+0x60d/0x12f0
[3845523.613856]  [<ffffffffb7cbff66>] tcp_rcv_established+0x1d6/0x7a0
[3845523.620857]  [<ffffffffb7cc83e3>] ? tcp_v4_md5_lookup+0x13/0x20
[3845523.627689]  [<ffffffffb7ccb04a>] tcp_v4_do_rcv+0x10a/0x350
[3845523.634166]  [<ffffffffb7c3e986>] release_sock+0xa6/0x180
[3845523.640454]  [<ffffffffb7cb609d>] tcp_sendpage+0xdd/0x5c0
[3845523.646746]  [<ffffffffc0b47bab>] ksocknal_lib_send_kiov+0xdb/0x2e0 [ksocklnd]
[3845523.654853]  [<ffffffffc0b48662>] ? ksocknal_lib_send_iov+0xd2/0x140 [ksocklnd]
[3845523.663040]  [<ffffffffc0b4112e>] ksocknal_process_transmit+0x39e/0xc10 [ksocklnd]
[3845523.671478]  [<ffffffffc0b45b80>] ksocknal_scheduler+0x320/0xd50 [ksocklnd]
[3845523.679304]  [<ffffffffb76c7080>] ? wake_up_atomic_t+0x30/0x30
[3845523.685989]  [<ffffffffc0b45860>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[3845523.693527]  [<ffffffffb76c5f91>] kthread+0xd1/0xe0
[3845523.699238]  [<ffffffffb76c5ec0>] ? insert_kthread_work+0x40/0x40
[3845523.706156]  [<ffffffffb7d99ddd>] ret_from_fork_nospec_begin+0x7/0x21
[3845523.713415]  [<ffffffffb76c5ec0>] ? insert_kthread_work+0x40/0x40
[3845523.720321] BUG: Bad page state in process socknal_sd05_02  pfn:282c3e3
[3845523.727747] page:ffffeb1aa0b0f8c0 count:0 mapcount:-1 mapping:          (null) index:0x0
[3845523.736648] page flags: 0x6fffff00008000(tail)
[3845523.741962] page dumped because: nonzero mapcount



Comments
Comment by James A Simmons [ 03/Jan/23 ]

Both NVMe and Ceph ran into the same issue. Basically, memory in the skbuff is being used after it has been freed. I have a working prototype patch that I'm testing right now.

Comment by Gerrit Updater [ 03/Jan/23 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49545
Subject: LU-16441 ksocklnd: ensure a page is valid with sendpage_ok()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 444ba4c1625268fce3b7a031f932360554f26983
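
For readers unfamiliar with the check named in the patch subject, here is a minimal userspace sketch of the condition that the kernel's sendpage_ok() tests: the page must not be a slab page and must hold at least one reference. Everything prefixed with model_ is hypothetical and exists only for illustration; struct model_page stands in for the kernel's struct page, and the copying fallback is an assumption about how a caller might react to a failed check. The actual change in the Gerrit patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct page. */
struct model_page {
	int  refcount;  /* models page_count(page) */
	bool is_slab;   /* models PageSlab(page) */
};

/* Hypothetical labels for the two ways a caller could send the data. */
enum model_send_path {
	MODEL_SEND_ZERO_COPY, /* models the tcp_sendpage() path in the trace */
	MODEL_SEND_COPY,      /* models a copying fallback (an assumption) */
};

/* Mirrors the condition tested by the kernel's sendpage_ok(). */
static bool model_sendpage_ok(const struct model_page *page)
{
	return !page->is_slab && page->refcount >= 1;
}

/* Only hand the page to the zero-copy path when the check passes. */
static enum model_send_path
model_pick_send_path(const struct model_page *page)
{
	assert(page != NULL);
	return model_sendpage_ok(page) ? MODEL_SEND_ZERO_COPY
				       : MODEL_SEND_COPY;
}
```

In this model, a slab-backed page or a page with no references is routed to the copying path, so the network stack never takes or drops a reference on a page it should not own.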
