[LU-14306] sanity-sec test_52: BUG: Bad rss-counter state Created: 07/Jan/21  Updated: 25/Jan/21  Resolved: 25/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Sebastien Buisson
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14045 Fix O_DIRECT and encrypted files Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c31eb44e-b299-487e-99da-794b128f0cc5

Even if it does not fail, test_52 shows the following error in the console:

[  694.000072] BUG: Bad rss-counter state mm:0000000049598c66 idx:1 val:2
[  694.001372] BUG: Bad rss-counter state mm:0000000049598c66 idx:3 val:-2
[  694.009960] BUG: Bad rss-counter state mm:0000000049598c66 idx:1 val:2
[  694.011094] BUG: Bad rss-counter state mm:0000000049598c66 idx:3 val:-2
[  694.092732] BUG: Bad rss-counter state mm:00000000e37c077b idx:1 val:2
[  694.094024] BUG: Bad rss-counter state mm:00000000e37c077b idx:3 val:-2
[  694.124577] BUG: Bad rss-counter state mm:0000000049598c66 idx:1 val:2
[  694.125755] BUG: Bad rss-counter state mm:0000000049598c66 idx:3 val:-2
[  694.367283] BUG: Bad rss-counter state mm:0000000049598c66 idx:1 val:3
[  694.368441] BUG: Bad rss-counter state mm:0000000049598c66 idx:3 val:-3
[  694.376602] BUG: Bad rss-counter state mm:0000000084701c1b idx:1 val:3
[  694.377761] BUG: Bad rss-counter state mm:0000000084701c1b idx:3 val:-3

Also, from time to time, some pages are dumped, because of a nonzero mapcount:

[  700.614727] BUG: Bad page state in process lfs  pfn:3ca01
[  700.615774] page:ffffd87ac0f28040 refcount:0 mapcount:-1 mapping:0000000000000000 index:0x0
[  700.617194] flags: 0xfffffc0000000()
[  700.617850] raw: 000fffffc0000000 0000000000000000 ffffffffc0f20903 0000000000000000
[  700.622001] raw: 0000000000000000 0000000000000000 00000000fffffffe 0000000000000000
[  700.623329] page dumped because: nonzero mapcount
[  700.624155] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp rpcrdma ib_iser ib_ipoib libiscsi scsi_transport_iscsi rdma_ucm ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_uverbs ib_core sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 virtio_balloon joydev pcspkr ip_tables ext4 mbcache jbd2 mlx4_en ata_generic ata_piix 8139too mlx4_core libata crc32c_intel serio_raw 8139cp virtio_blk mii
[  700.633949] CPU: 1 PID: 45813 Comm: lfs Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[  700.635821] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  700.636800] Call Trace:
[  700.637382]  dump_stack+0x5c/0x80
[  700.638015]  bad_page.cold.115+0xba/0xbf
[  700.638726]  __free_pages_ok+0x2c4/0x2d0
[  700.639436]  release_pages+0x305/0x430
[  700.640117]  ? cpumask_any_but+0x20/0x40
[  700.640819]  tlb_flush_mmu_free+0x3d/0x60
[  700.641539]  arch_tlb_finish_mmu+0x89/0x130
[  700.642277]  tlb_finish_mmu+0x1f/0x30
[  700.642931]  unmap_region+0xdd/0x110
[  700.643585]  ? __vma_rb_erase+0x127/0x250
[  700.644294]  do_munmap+0x256/0x440
[  700.644913]  vm_munmap+0x5f/0xa0
[  700.645499]  __x64_sys_munmap+0x22/0x30
[  700.646208]  do_syscall_64+0x5b/0x1a0
[  700.646887]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  700.647778] RIP: 0033:0x7f4035528a0b
[  700.648418] Code: ff ff 0f 1f 44 00 00 48 8b 15 79 54 2c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff e9 6f ff ff ff f3 0f 1e fa b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4d 54 2c 00 f7 d8 64 89 01 48
[  700.651441] RSP: 002b:00007ffc550ab5f8 EFLAGS: 00000202 ORIG_RAX: 000000000000000b
[  700.652712] RAX: ffffffffffffffda RBX: 00007ffc550ab7e0 RCX: 00007f4035528a0b
[  700.653907] RDX: 0000000000000000 RSI: 0000000000402000 RDI: 00007f4034bfe000
[  700.655109] RBP: 0000000000000002 R08: 0000000000401010 R09: 00007ffc550ab4c0
[  700.656330] R10: 000000000000000a R11: 0000000000000202 R12: 0000000000000002
[  700.657530] R13: 00007ffc550ab6d0 R14: 00007f4034bff000 R15: 00007ffc550ab7e0
[  700.658764] Disabling lock debugging due to kernel taint

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-sec test_52 -



 Comments   
Comment by Gerrit Updater [ 07/Jan/21 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/41159
Subject: LU-14306 tests: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 91b117e0916969466137e43c0d1bf57fdca6d681

Comment by Andreas Dilger [ 07/Jan/21 ]

It seems possible that this problem is caused by the use of "fake" pages for handling O_DIRECT writes from lfs mirror resync? Is it possible that the pages are not being reset properly after use, or they are not being initialized correctly before use?

Comment by Sebastien Buisson [ 07/Jan/21 ]

Indeed, that is most likely because we twist the Lustre pages used for RPCs so that they have a proper mapping and index,
suitable for encryption/decryption. This is done in order to be able to support O_DIRECT for encrypted files, which is the IO mode used when syncing mirrors.

Comment by John Hammond [ 07/Jan/21 ]

These messages start in test_44 which appears to be the first subtest to invoke direct IO.

Comment by Sebastien Buisson [ 07/Jan/21 ]

You are right, I would better try to get traces from sanity-sec test_44 instead of test_52, because test_44 explicitly does O_DIRECT IOs whereas test_52 indirectly triggers that via calls to lfs mirror.

Comment by Gerrit Updater [ 07/Jan/21 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/41164
Subject: LU-14306 tests: attempt 1
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c10dae7f3c5909ff4a3b3e47d3f0561cf68ea161

Comment by Gerrit Updater [ 07/Jan/21 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/41165
Subject: LU-14306 tests: attempt 2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 509d599f967095ed7969b037a93f747c7b975c38

Comment by Gerrit Updater [ 11/Jan/21 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/41199
Subject: LU-14306 sec: get rid of bad rss-counter state messages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9d94fafbadd6fa2e1f0ce554d911113fb91cb3c9

Comment by Gerrit Updater [ 12/Jan/21 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/41205
Subject: LU-14306 sec: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0b249bdf4981e915083c9f116649f392582cdc30

Comment by Gerrit Updater [ 12/Jan/21 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/41208
Subject: LU-14306 sec: debug patch 2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cb8a25627b68d5cfc2d962e61f9aef815ab8d97b

Comment by Gerrit Updater [ 25/Jan/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41199/
Subject: LU-14306 sec: get rid of bad rss-counter state messages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a71e0dd7f7aa445dc06a99c0a1ef34383cafa2e2

Comment by Peter Jones [ 25/Jan/21 ]

Landed for 2.14

Generated at Sat Feb 10 03:08:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.