[LU-5279] kernel BUG at arch/x86/mm/physaddr.c:20! Created: 01/Jul/14  Updated: 04/Jun/15  Resolved: 07/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug
Priority: Blocker
Reporter: John Hammond
Assignee: Andreas Dilger
Resolution: Fixed
Votes: 0
Labels: libcfs

Issue Links:
Related
is related to LU-5053 soft lockup in ptlrpcd on client Resolved
Severity: 3
Rank (Obsolete): 14732

 Description   

On v2_5_60_0-84-gbdc5bb5 I see a kernel BUG when loading libcfs.ko.

[   54.655271] LNet: HW CPU cores: 4, npartitions: 2
[   54.679747] alg: No test for adler32 (adler32-zlib)
[   54.682437] alg: No test for crc32 (crc32-table)
[   54.692688] ------------[ cut here ]------------
[   54.693610] kernel BUG at arch/x86/mm/physaddr.c:20!
[   54.693610] invalid opcode: 0000 [#1] SMP 
[   54.693610] last sysfs file: /sys/devices/system/cpu/online
[   54.693610] CPU 3 
[   54.693610] Modules linked in: libcfs(+)(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
[   54.693610] 
[   54.693610] Pid: 4382, comm: insmod Not tainted 2.6.32-431.5.1.el6.lustre.x86_64 #1 Bochs Bochs
[   54.693610] RIP: 0010:[<ffffffff81050447>]  [<ffffffff81050447>] __phys_addr+0x67/0x80
[   54.693610] RSP: 0018:ffff88021621bd88  EFLAGS: 00010206
[   54.693610] RAX: 000041000283c000 RBX: ffff88021621bdd8 RCX: 0000000000000028
[   54.693610] RDX: 0000000000000041 RSI: 0000000000000000 RDI: ffffc9000283c000
[   54.693610] RBP: ffff88021621bd88 R08: 0000000000000000 R09: ffff88021621bdd8
[   54.693610] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[   54.693610] R13: 0000000000100000 R14: ffff88021621bdd8 R15: ffffc9000283c000
[   54.693610] FS:  00007f402faf0700(0000) GS:ffff88002fe00000(0000) knlGS:0000000000000000
[   54.693610] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   54.693610] CR2: 0000003d046e99b0 CR3: 0000000216f71000 CR4: 00000000000006e0
[   54.693610] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   54.693610] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   54.693610] Process insmod (pid: 4382, threadinfo ffff88021621a000, task ffff8802191b4080)
[   54.693610] Stack:
[   54.693610]  ffff88021621bdb8 ffffffff812b0fe6 ffffc9000283c000 0000000000100000
[   54.693610] <d> ffffc9000283c000 ffff88021621bdf8 ffff88021621be38 ffffffffa02c85c4
[   54.693610] <d> ffff88021621be28 ffff88021621be78 0000000000000002 0000000000000000
[   54.693610] Call Trace:
[   54.693610]  [<ffffffff812b0fe6>] sg_init_one+0x36/0x80
[   54.693610]  [<ffffffffa02c85c4>] cfs_crypto_hash_digest+0x94/0xf0 [libcfs]
[   54.693610]  [<ffffffffa02cc6d0>] ? init_libcfs_module+0x0/0x3c0 [libcfs]
[   54.693610]  [<ffffffffa02c8728>] cfs_crypto_register+0x108/0x330 [libcfs]
[   54.693610]  [<ffffffffa02da606>] ? cfs_wi_sched_create+0x396/0x630 [libcfs]
[   54.693610]  [<ffffffffa02cc6d0>] ? init_libcfs_module+0x0/0x3c0 [libcfs]
[   54.693610]  [<ffffffffa02cc6d0>] ? init_libcfs_module+0x0/0x3c0 [libcfs]
[   54.693610]  [<ffffffffa02cc925>] init_libcfs_module+0x255/0x3c0 [libcfs]
[   54.693610]  [<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0
[   54.693610]  [<ffffffff810caa03>] sys_init_module+0xe3/0x260
[   54.693610]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[   54.693610] Code: ff ff ff 87 ff ff 48 39 c7 76 24 48 b8 00 00 00 00 00 78 00 00 0f b6 0d 11 0b bc 00 48 8d 04 07 48 89 c2 48 d3 ea 48 85 d2 74 c9 <0f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 90 90 90 90 90 90 90 90 90 
[   54.693610] RIP  [<ffffffff81050447>] __phys_addr+0x67/0x80
[   54.693610]  RSP <ffff88021621bd88>


 Comments   
Comment by Bob Glossman (Inactive) [ 01/Jul/14 ]

I don't know that it's related, but I note that "LU-5053 libcfs: clean up cfs_crypto_hash code" landed to master very recently.

Comment by John Hammond [ 01/Jul/14 ]

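Yes, this is due to the switch from kmalloc() to vmalloc() in 0723b8be34503b31dbfc6c877d7a855091c625b4, combined with having CONFIG_DEBUG_VIRTUAL=y. The scatterlist interface cannot take vmalloc()ed buffers: sg_init_one() converts the buffer address with virt_to_page(), which is only valid for directly mapped (e.g. kmalloc()) addresses, and with CONFIG_DEBUG_VIRTUAL the check in __phys_addr() BUGs on the vmalloc address.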
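The register dump in the description is consistent with this explanation, and the arithmetic can be replayed in user space. The sketch below is a simplified model of the direct-map branch of the __phys_addr() check, not the kernel source: the function names are made up for illustration, PAGE_OFFSET is assumed to be the x86_64 direct-map base used by this 2.6.32 kernel, and the kernel-text branch is omitted. RCX (0x28) is read here as the CPU's 40 physical address bits, which is an inference from the oops rather than something the report states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed start of the x86_64 direct mapping on this 2.6.32 kernel. */
#define PAGE_OFFSET 0xffff880000000000ULL

/* Offset of a virtual address from the start of the direct map.
 * For a kmalloc() address this offset is its physical address;
 * for a vmalloc() address it is a meaningless number. */
static inline uint64_t direct_map_offset(uint64_t vaddr)
{
    return vaddr - PAGE_OFFSET;
}

/* A physical address is plausible only if no bits are set
 * at or above the CPU's physical address width. */
static inline bool phys_addr_valid(uint64_t paddr, unsigned phys_bits)
{
    return (paddr >> phys_bits) == 0;
}

/* True when the simplified debug check would trip for vaddr
 * (hypothetical helper, direct-map branch only). */
static inline bool debug_virtual_would_bug(uint64_t vaddr, unsigned phys_bits)
{
    if (vaddr < PAGE_OFFSET)
        return true;
    return !phys_addr_valid(direct_map_offset(vaddr), phys_bits);
}
```

Feeding in RDI from the oops (0xffffc9000283c000, the buffer passed to sg_init_one()) gives the offset 0x000041000283c000 seen in RAX, and shifting it right by 40 leaves 0x41 as in RDX, so the check fails. A direct-map address such as the stack pointer value 0xffff88021621bdd8 passes the same check.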

Comment by Andreas Dilger [ 04/Jul/14 ]

http://review.whamcloud.com/10982

Comment by John Hammond [ 04/Jul/14 ]

Here is a possibly related OOM from an insanity run:

https://testing.hpdd.intel.com/test_logs/375f4b94-0314-11e4-b192-5254006e85c2

Comment by Andreas Dilger [ 07/Jul/14 ]

Patch landed to master for 2.6.0.
