[LU-1491] 2.2.0<->tag2.2.54 conf-sanity 62 BUG: unable to handle kernel NULL pointer dereference at 00000000000003d0 Created: 06/Jun/12  Updated: 22/Sep/12  Resolved: 22/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0

Type: Bug
Priority: Blocker
Reporter: Sarah Liu
Assignee: Jian Yu
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Lustre client: lustre-master-tag2.2.54 RHEL6
Lustre server: 2.2.0 RHEL6


Severity: 3
Rank (Obsolete): 4427

 Description   

Lustre: DEBUG MARKER: == conf-sanity test 62: start with disabled journal == 13:37:57 (1339015077)
LDISKFS-fs (sdc1): mounted filesystem without journal. Opts:
LDISKFS-fs (sdc1): mounted filesystem without journal. Opts:
Lustre: MGC10.10.4.132@tcp: Reactivating import
BUG: unable to handle kernel NULL pointer dereference at 00000000000003d0
IP: [<ffffffffa0d4d1b1>] osd_trans_start+0x151/0x4e0 [osd_ldiskfs]
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/host2/target2:0:0/2:0:0:0/block/sdc/queue/max_sectors_kb
CPU 8
Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa igb dca mlx4_ib ib_mad ib_core mlx4_en mlx4_core microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core sg shpchp ext3 jbd mbcache sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 4588, comm: llog_process_th Not tainted 2.6.32-220.4.2.el6_lustre.g0aea052.x86_64 #1 Supermicro H8DGT/H8DGT
RIP: 0010:[<ffffffffa0d4d1b1>] [<ffffffffa0d4d1b1>] osd_trans_start+0x151/0x4e0 [osd_ldiskfs]
RSP: 0018:ffff8802ebd07a40 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880419c743c0 RCX: 0000000000000000
RDX: ffff88033ec11e48 RSI: 0000000000000010 RDI: ffff8802e6a03000
RBP: ffff8802ebd07a70 R08: 00000000fffffffb R09: 00000000fffffffe
R10: 0000000000000001 R11: 000000000000000f R12: ffff88033ec11e00
R13: 0000000000000000 R14: ffff88033ec16000 R15: ffff88033ec11e00
FS: 00007faaa5392700(0000) GS:ffff880323c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000003d0 CR3: 0000000001a85000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process llog_process_th (pid: 4588, threadinfo ffff8802ebd06000, task ffff880319b54a80)
Stack:
ffff8802ebd07a70 0000000000000000 ffff8802ebd07c20 ffff8803e69b1780
<0> ffff880419c744c0 ffff880419c743c0 ffff8802ebd07a80 ffffffffa08c08cc
<0> ffff8802ebd07ae0 ffffffffa08c0ade ffff8802ebd07ae0 ffffffffa08be9b5
Call Trace:
[<ffffffffa08c08cc>] fld_trans_start+0x2c/0x70 [fld]
[<ffffffffa08c0ade>] fld_index_init+0x1ce/0x3c0 [fld]
[<ffffffffa08be9b5>] ? fld_cache_init+0x185/0x350 [fld]
[<ffffffffa08bb6ae>] fld_server_init+0x22e/0x350 [fld]
[<ffffffffa0cb3d40>] mdt_device_alloc+0xab0/0x1a30 [mdt]
[<ffffffffa057ad9f>] ? keys_fill+0x6f/0x1a0 [obdclass]
[<ffffffffa05658af>] obd_setup+0x19f/0x280 [obdclass]
[<ffffffffa0565ba2>] class_setup+0x212/0x730 [obdclass]
[<ffffffffa056af6c>] class_process_config+0x97c/0x1670 [obdclass]
[<ffffffffa044ad08>] ? libcfs_log_return+0x28/0x40 [libcfs]
[<ffffffffa0567c39>] ? lustre_cfg_new+0x359/0x640 [obdclass]
[<ffffffffa056c95b>] class_config_llog_handler+0x79b/0x1160 [obdclass]
[<ffffffffa0542158>] llog_process_thread+0x6e8/0xab0 [obdclass]
[<ffffffffa0541a70>] ? llog_process_thread+0x0/0xab0 [obdclass]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa0541a70>] ? llog_process_thread+0x0/0xab0 [obdclass]
[<ffffffffa0541a70>] ? llog_process_thread+0x0/0xab0 [obdclass]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 3a ff ff ff 66 0f 1f 44 00 00 49 8b 47 58 0f b7 b3 88 00 00 00 48 8b 40 10 48 8b 78 28 48 8b 87 90 02 00 00 48 8b 80 e0 01 00 00 <3b> b0 d0 03 00 00 0f 8f b3 00 00 00 e8 ee f8 6a ff 48 3d 00 f0
RIP [<ffffffffa0d4d1b1>] osd_trans_start+0x151/0x4e0 [osd_ldiskfs]
RSP <ffff8802ebd07a40>
CR2: 00000000000003d0



 Comments   
Comment by Sarah Liu [ 11/Sep/12 ]

Hit this problem again in interop testing
server: 2.2.0 RHEL6
client: b2_3 build #16
https://maloo.whamcloud.com/test_sets/0fd869fc-fa97-11e1-887d-52540035b04c

13:46:25:Lustre: DEBUG MARKER: == conf-sanity test 62: start with disabled journal == 13:46:25 (1347137185)
13:46:26:Lustre: DEBUG MARKER: tune2fs -O ^has_journal /dev/lvm-MDS/P1
13:46:27:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
13:46:27:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/P1
13:46:27:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl /dev/lvm-MDS/P1 /mnt/mds1
13:46:27:LDISKFS-fs (dm-0): mounted filesystem without journal. Opts: 
13:46:27:LDISKFS-fs (dm-0): mounted filesystem without journal. Opts: 
13:46:27:Lustre: MGC10.10.4.160@tcp: Reactivating import
13:46:27:Lustre: Skipped 3 previous similar messages
13:46:27:BUG: unable to handle kernel NULL pointer dereference at 00000000000003d0
13:46:27:IP: [<ffffffffa090f1b1>] osd_trans_start+0x151/0x4e0 [osd_ldiskfs]
13:46:27:PGD 640c7067 PUD 376ac067 PMD 0 
13:46:27:Oops: 0000 [#1] SMP 
13:46:27:last sysfs file: /sys/devices/pci0000:00/0000:00:04.0/virtio0/block/vda/queue/max_sectors_kb
13:46:27:CPU 0 
13:46:27:Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
13:46:27:
13:46:27:Pid: 25021, comm: llog_process_th Not tainted 2.6.32-220.4.2.el6_lustre.x86_64 #1 Red Hat KVM
13:46:27:RIP: 0010:[<ffffffffa090f1b1>]  [<ffffffffa090f1b1>] osd_trans_start+0x151/0x4e0 [osd_ldiskfs]
13:46:27:RSP: 0018:ffff88005e3b97f0  EFLAGS: 00010246
13:46:27:RAX: 0000000000000000 RBX: ffff88007bf6dd80 RCX: 0000000000000000
13:46:27:RDX: ffff88007bf6dd80 RSI: 0000000000000057 RDI: ffff88006c2fb800
13:46:27:RBP: ffff88005e3b9820 R08: 20737365636f7250 R09: 0a64657265746e65
13:46:27:R10: 20737365636f7250 R11: 0a64657265746e65 R12: ffff88006c21a600
13:46:27:R13: 0000000000000000 R14: ffff88006ab95000 R15: ffff88006c21a600
13:46:27:FS:  00007f42da509700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
13:46:27:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
13:46:27:CR2: 00000000000003d0 CR3: 000000005a32c000 CR4: 00000000000006f0
13:46:27:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
13:46:27:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
13:46:27:Process llog_process_th (pid: 25021, threadinfo ffff88005e3b8000, task ffff88007cbb0b40)
13:46:27:Stack:
13:46:27: ffff88005e3b9820 0000000000000000 ffff88007bf477c0 ffff88005e3b9c20
13:46:27:<0> ffff88006c21a200 ffff88007af12000 ffff88005e3b9830 ffffffffa09b7104
13:46:27:<0> ffff88005e3b9970 ffffffffa09ad9c2 ffff88006d7a0c38 ffff88006d7a0c78
13:46:27:Call Trace:
13:46:27: [<ffffffffa09b7104>] mdd_trans_start+0x14/0x20 [mdd]
13:46:27: [<ffffffffa09ad9c2>] mdd_create+0x12f2/0x1c50 [mdd]
13:46:27: [<ffffffffa0b16799>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
13:46:27: [<ffffffffa03f074d>] ? dt_path_parser+0x3d/0x90 [obdclass]
13:46:27: [<ffffffffa0407d0f>] llo_store_create_index+0x16f/0x220 [obdclass]
13:46:27: [<ffffffffa0916d92>] osd_oi_init+0x2f2/0x4e0 [osd_ldiskfs]
13:46:27: [<ffffffffa0910ab8>] osd_prepare+0x88/0x2c0 [osd_ldiskfs]
13:46:27: [<ffffffffa09b3c5a>] mdd_prepare+0xca/0x5a0 [mdd]
13:46:27: [<ffffffffa0b134f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
13:46:27: [<ffffffffa0a12d58>] ? mdt_process_config+0xa8/0xa80 [mdt]
13:46:27: [<ffffffffa0aa9446>] cmm_prepare+0x46/0xa0 [cmm]
13:46:27: [<ffffffffa0a14bd1>] mdt_device_alloc+0x941/0x1a30 [mdt]
13:46:27: [<ffffffffa03ebdef>] ? keys_fill+0x6f/0x1a0 [obdclass]
13:46:27: [<ffffffffa03d68ff>] obd_setup+0x19f/0x280 [obdclass]
13:46:27: [<ffffffffa03d6bf2>] class_setup+0x212/0x730 [obdclass]
13:46:27: [<ffffffffa03dbfbc>] class_process_config+0x97c/0x1670 [obdclass]
13:46:27: [<ffffffffa0b0dd08>] ? libcfs_log_return+0x28/0x40 [libcfs]
13:46:27: [<ffffffffa03d8c89>] ? lustre_cfg_new+0x359/0x640 [obdclass]
13:46:27: [<ffffffffa03dd9ab>] class_config_llog_handler+0x79b/0x1160 [obdclass]
13:46:27: [<ffffffffa03b3158>] llog_process_thread+0x6e8/0xab0 [obdclass]
13:46:27: [<ffffffff814f3e0c>] ? kprobe_flush_task+0xbc/0xe0
13:46:27: [<ffffffffa03b2a70>] ? llog_process_thread+0x0/0xab0 [obdclass]
13:46:27: [<ffffffff8100c14a>] child_rip+0xa/0x20
13:46:27: [<ffffffffa03b2a70>] ? llog_process_thread+0x0/0xab0 [obdclass]
13:46:27: [<ffffffffa03b2a70>] ? llog_process_thread+0x0/0xab0 [obdclass]
13:46:27: [<ffffffff8100c140>] ? child_rip+0x0/0x20
13:46:27:Code: 3a ff ff ff 66 0f 1f 44 00 00 49 8b 47 58 0f b7 b3 88 00 00 00 48 8b 40 10 48 8b 78 28 48 8b 87 90 02 00 00 48 8b 80 e0 01 00 00 <3b> b0 d0 03 00 00 0f 8f b3 00 00 00 e8 ee b8 d8 ff 48 3d 00 f0 
13:46:27:RIP  [<ffffffffa090f1b1>] osd_trans_start+0x151/0x4e0 [osd_ldiskfs]
13:46:27: RSP <ffff88005e3b97f0>
13:46:27:CR2: 00000000000003d0
13:46:27:---[ end trace fa7541ee0e91aea2 ]---
13:46:27:Kernel panic - not syncing: Fatal exception
13:46:27:Pid: 25021, comm: llog_process_th Tainted: G      D    ----------------   2.6.32-220.4.2.el6_lustre.x86_64 #1
13:46:27:Call Trace:
13:46:27: [<ffffffff814ec61a>] ? panic+0x78/0x143
13:46:27: [<ffffffff814f07a4>] ? oops_end+0xe4/0x100
13:46:27: [<ffffffff8104234b>] ? no_context+0xfb/0x260
13:46:27: [<ffffffff810425d5>] ? __bad_area_nosemaphore+0x125/0x1e0
13:46:27: [<ffffffff810426a3>] ? bad_area_nosemaphore+0x13/0x20
13:46:27: [<ffffffff81042d5d>] ? __do_page_fault+0x31d/0x480
13:46:27: [<ffffffffa0b12e31>] ? libcfs_debug_vmsg2+0x4e1/0xb60 [libcfs]
13:46:27: [<ffffffff814f275e>] ? do_page_fault+0x3e/0xa0
13:46:27: [<ffffffff814efb15>] ? page_fault+0x25/0x30
13:46:27: [<ffffffffa090f1b1>] ? osd_trans_start+0x151/0x4e0 [osd_ldiskfs]
13:46:27: [<ffffffffa09b7104>] ? mdd_trans_start+0x14/0x20 [mdd]
13:46:27: [<ffffffffa09ad9c2>] ? mdd_create+0x12f2/0x1c50 [mdd]
13:46:27: [<ffffffffa0b16799>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
13:46:27: [<ffffffffa03f074d>] ? dt_path_parser+0x3d/0x90 [obdclass]
13:46:27: [<ffffffffa0407d0f>] ? llo_store_create_index+0x16f/0x220 [obdclass]
13:46:27: [<ffffffffa0916d92>] ? osd_oi_init+0x2f2/0x4e0 [osd_ldiskfs]
13:46:27: [<ffffffffa0910ab8>] ? osd_prepare+0x88/0x2c0 [osd_ldiskfs]
13:46:27: [<ffffffffa09b3c5a>] ? mdd_prepare+0xca/0x5a0 [mdd]
13:46:27: [<ffffffffa0b134f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
13:46:27: [<ffffffffa0a12d58>] ? mdt_process_config+0xa8/0xa80 [mdt]
13:46:27: [<ffffffffa0aa9446>] ? cmm_prepare+0x46/0xa0 [cmm]
13:46:27: [<ffffffffa0a14bd1>] ? mdt_device_alloc+0x941/0x1a30 [mdt]
13:46:27: [<ffffffffa03ebdef>] ? keys_fill+0x6f/0x1a0 [obdclass]
13:46:27: [<ffffffffa03d68ff>] ? obd_setup+0x19f/0x280 [obdclass]
13:46:27: [<ffffffffa03d6bf2>] ? class_setup+0x212/0x730 [obdclass]
13:46:27: [<ffffffffa03dbfbc>] ? class_process_config+0x97c/0x1670 [obdclass]
13:46:27: [<ffffffffa0b0dd08>] ? libcfs_log_return+0x28/0x40 [libcfs]
13:46:27: [<ffffffffa03d8c89>] ? lustre_cfg_new+0x359/0x640 [obdclass]
13:46:27: [<ffffffffa03dd9ab>] ? class_config_llog_handler+0x79b/0x1160 [obdclass]
13:46:27: [<ffffffffa03b3158>] ? llog_process_thread+0x6e8/0xab0 [obdclass]
13:46:27: [<ffffffff814f3e0c>] ? kprobe_flush_task+0xbc/0xe0
13:46:27: [<ffffffffa03b2a70>] ? llog_process_thread+0x0/0xab0 [obdclass]
13:46:27: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
13:46:27: [<ffffffffa03b2a70>] ? llog_process_thread+0x0/0xab0 [obdclass]
13:46:27: [<ffffffffa03b2a70>] ? llog_process_thread+0x0/0xab0 [obdclass]
13:46:27: [<ffffffff8100c140>] ? child_rip+0x0/0x20
14:46:49:********** Timeout by autotest system **********
14:47:53:<ConMan> Console [client-27vm7] disconnected from <client-27:6006> at 09-08 14:47.
Comment by Jian Yu [ 17/Sep/12 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b2_3/19
Lustre server build: http://build.whamcloud.com/job/lustre-b2_2/17
Distro/Arch: RHEL6.3/x86_64
https://maloo.whamcloud.com/test_sets/efba2fb0-0069-11e2-9f3c-52540035b04c

We need to add a Lustre version check to the Lustre b2_3 and master test suites so that this test is skipped.
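
A minimal sketch of what such a version check could look like, written as plain POSIX shell. The helper names `version_code` and `server_too_old` and the `MIN_VERSION` threshold are illustrative assumptions only; they are not taken from the patch for this ticket, and the real cutoff version is not stated here.

```shell
# Sketch only: helper names and the threshold are hypothetical, not from the real patch.

# Convert a dotted version such as "2.2.54" into a comparable integer
# (major << 16 | minor << 8 | patch). Missing fields default to 0.
version_code() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the version string on dots
    IFS=$old_ifs
    echo $(( (${1:-0} << 16) | (${2:-0} << 8) | ${3:-0} ))
}

# Succeed (return 0) when the server version ($1) is older than the
# minimum version ($2), i.e. when the test should be skipped.
server_too_old() {
    [ "$(version_code "$1")" -lt "$(version_code "$2")" ]
}

# Example: the 2.2.0 server from this ticket against a placeholder minimum.
MIN_VERSION="2.3.0"    # placeholder, NOT the cutoff used by the actual fix
if server_too_old "2.2.0" "$MIN_VERSION"; then
    echo "SKIP: conf-sanity test 62 needs server >= $MIN_VERSION"
fi
```

In a real test suite the server version would be queried from the MDS rather than hard-coded, and the skip would go through the suite's own skip mechanism instead of a bare `echo`.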

Comment by Peter Jones [ 17/Sep/12 ]

Yujian

Could you please take care of this one?

Thanks

Peter

Comment by Jian Yu [ 17/Sep/12 ]

Patch for master branch: http://review.whamcloud.com/4018
It also needs to be cherry-picked to the b2_3 branch.

Comment by Peter Jones [ 22/Sep/12 ]

Landed for 2.3 and 2.4

Generated at Sat Feb 10 06:06:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.