[LU-2310] Kernel panic when trying to mount MDS after system upgrade from 1.8.8-wc1 to master. Created: 09/Nov/12 Updated: 19/Apr/13 Resolved: 19/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sarah Liu | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | NFBlocker |
| Environment: | Before Upgrade: After Upgrade: |
| Severity: | 3 |
| Rank (Obsolete): | 5523 |
| Description |
|
After upgrading the system from 1.8.8-wc1 to master, we hit a kernel panic when trying to mount the MDS. Here is the console log from the MDS:

[root@fat-amd-1 ~]# mount -t lustre -o user_xattr,acl /dev/sdb1 /mnt/mds
LDISKFS-fs (sdb1): recovery complete
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=off. Opts:
Lustre: MGC10.10.4.132@tcp: Reactivating import
Lustre: MGS: Logs for fs lustre were removed by user request. All servers must be restarted in order to regenerate the logs.
Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: Setting parameter lustre-clilov.lov.stripesize in log lustre-client
LustreError: 2535:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
Lustre: lustre-MDT0000: used disk, loading
Lustre: Mounting lustre-MDT0000 at first time on 1.8 FS, remove all clients for interop needs
LustreError: 2609:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000006:0x10000:0x0]).
BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
IP: [<ffffffffa09cb2bd>] ldiskfs_quota_read+0x3d/0x140 [ldiskfs]
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/module/mdd/initstate
CPU 3
Modules linked in: mdt(U) lod(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) jbd2 mdd(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa igb dca mlx4_ib ib_mad ib_core mlx4_en mlx4_core microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core sg shpchp ext3 jbd mbcache sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 2609, comm: llog_process_th Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1 Supermicro H8DGT/H8DGT
RIP: 0010:[<ffffffffa09cb2bd>] [<ffffffffa09cb2bd>] ldiskfs_quota_read+0x3d/0x140 [ldiskfs]
RSP: 0018:ffff88041515b620 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88011c4943c0 RCX: 000000000000000c
RDX: ffff88010a285c00 RSI: 0000000000000000 RDI: ffff88041d1adc00
RBP: ffff88041515b680 R08: 0000000000000400 R09: ffff88041d1adc00
R10: 0000000000000004 R11: 0000000000000000 R12: ffff88010a285c00
R13: 0000000000000000 R14: ffff88010a285c00 R15: 0000000000000400
FS: 00007f30c588d700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000068 CR3: 0000000001a85000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process llog_process_th (pid: 2609, threadinfo ffff88041515a000, task ffff880415152ae0)
Stack:
 ffff88011fc21600 0000000000000400 0000000000000000 ffff88041d1adc00
<d> 0000000000000000 0000000000000400 0000000000000010 ffff88011c4943c0
<d> ffff88041515bbd0 0000000000000000 ffff88010a285c00 0000000000000400
Call Trace:
 [<ffffffffa0a424d7>] quota_read_blk+0x217/0x280 [osd_ldiskfs]
 [<ffffffffa0a425c2>] find_tree_dqentry+0x82/0x750 [osd_ldiskfs]
 [<ffffffffa0a173ec>] ? osd_trans_stop+0x12c/0x450 [osd_ldiskfs]
 [<ffffffffa0a22302>] ? osd_index_iam_insert+0x252/0x560 [osd_ldiskfs]
 [<ffffffffa0a2eb40>] ? iam_lfix_ipd_alloc+0x0/0x20 [osd_ldiskfs]
 [<ffffffffa0a3cb70>] osd_it_admin_get+0x50/0x230 [osd_ldiskfs]
 [<ffffffffa0a3eeac>] ? write_quota_rec+0x26c/0xb80 [osd_ldiskfs]
 [<ffffffffa0a3cd8b>] osd_it_admin_load+0x3b/0xd0 [osd_ldiskfs]
 [<ffffffffa0a40ca1>] convert_quota_file+0x7b1/0xd80 [osd_ldiskfs]
 [<ffffffffa0a41ceb>] osd_quota_migration+0xa7b/0x1050 [osd_ldiskfs]
 [<ffffffffa0a25c3e>] osd_index_try+0x28e/0x610 [osd_ldiskfs]
 [<ffffffffa0926ad4>] lquota_disk_glb_find_create+0x2c4/0x410 [lquota]
 [<ffffffffa0948bf1>] qmt_pool_prepare+0x1b1/0x710 [lquota]
 [<ffffffffa093b4e3>] qmt_device_prepare+0xd3/0x210 [lquota]
 [<ffffffffa0b7fbf5>] mdt_quota_init+0xdc5/0x1730 [mdt]
 [<ffffffffa0b83aa7>] mdt_device_alloc+0xb47/0xf60 [mdt]
 [<ffffffffa04cd007>] obd_setup+0x1d7/0x2f0 [obdclass]
 [<ffffffffa04cd328>] class_setup+0x208/0x890 [obdclass]
 [<ffffffffa04d48dc>] class_process_config+0xc0c/0x1ca0 [obdclass]
 [<ffffffffa036dbe0>] ? cfs_alloc+0x30/0x60 [libcfs]
 [<ffffffffa04ce533>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
 [<ffffffffa04d6a3b>] class_config_llog_handler+0x9db/0x1850 [obdclass]
 [<ffffffff814ff18e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0491b9b>] llog_process_thread+0x8fb/0xe10 [obdclass]
 [<ffffffffa04ee58f>] ? keys_fill+0x6f/0x1a0 [obdclass]
 [<ffffffffa04920b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
 [<ffffffffa04920b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
 [<ffffffffa04920fc>] llog_process_thread_daemonize+0x4c/0x80 [obdclass]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa04920b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
 [<ffffffffa04920b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 1f 44 00 00 31 c0 48 63 f6 48 89 7d b8 48 89 4d a8 48 8b b4 f7 a0 01 00 00 49 89 d4 48 89 75 b0 0f b6 4f 20 c7 45 cc 00 00 00 00 <48> 8b 56 68 4c 8b 77 18 49 39 d0 0f 8f c6 00 00 00 48 8b 5d a8
RIP [<ffffffffa09cb2bd>] ldiskfs_quota_read+0x3d/0x140 [ldiskfs]
RSP <ffff88041515b620>
CR2: 0000000000000068
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-279.5.1.el6_lustre.x86_64 (jenkins@builder-2-sdd1-el6-x8664.lab.whamcloud.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Mon Oct 29 00:58:20 PDT 2012
Command line: ro root=UUID=6e137aae-0326-4846-9256-0aed1e37b7d5 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=ttyS0,115200 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off memmap=exactmap memmap=538K@64K memmap=132562K@49690K elfcorehdr=182252K memmap=64K$0K memmap=38K$602K memmap=104K$920K memmap=8K$3668600K memmap=72K#3668608K memmap=184K#3668680K memmap=263296K$3668864K memmap=2048K$4192256K
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD |
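The faulting address 0x68 together with RSI=0 is consistent with ldiskfs_quota_read() reading a field of a NULL quota inode. Below is a minimal sketch of the entry logic, paraphrased from the upstream ext4_quota_read() (fs/ext4/super.c in 2.6.32 kernels) from which ldiskfs is generated; variable names and the exact field offset are assumptions:

```c
/* Sketch paraphrased from ext4_quota_read(); ldiskfs is a patched copy of
 * ext4, so ldiskfs_quota_read() is assumed to have the same shape. */
static ssize_t ldiskfs_quota_read(struct super_block *sb, int type,
				  char *data, size_t len, loff_t off)
{
	/* NULL when no quota file is installed for 'type', e.g. when a
	 * caller passes the wrong quota type. */
	struct inode *inode = sb_dqopt(sb)->files[type];

	/* i_size_read() dereferences inode; with inode == NULL this becomes
	 * a read near 0x68 (plausibly offsetof(struct inode, i_size)),
	 * matching "unable to handle kernel NULL pointer dereference at
	 * 0000000000000068" above. */
	if (off > i_size_read(inode))
		return 0;

	/* ... block-by-block copy of the quota file into 'data' elided ... */
	return len;
}
```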
| Comments |
| Comment by Niu Yawei (Inactive) [ 11/Nov/12 ] |
|
Hmm, there is a typo in find_tree_dqentry():

    ret = quota_read_blk(env, obj, 0, blk, buf);

We should pass 'type' to quota_read_blk(); not sure why I passed '0' here. |
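Presumably the fix (the patch linked in the next comment) boils down to a one-line change: the corrected call is exactly what the comment above describes, while the file path and hunk context are assumptions:

```diff
--- a/lustre/osd-ldiskfs/osd_quota_fmt.c    (path assumed)
+++ b/lustre/osd-ldiskfs/osd_quota_fmt.c
@@ find_tree_dqentry() @@
-	/* bug: quota type hard-coded to 0 (USRQUOTA) */
-	ret = quota_read_blk(env, obj, 0, blk, buf);
+	/* fix: pass the caller's quota type through */
+	ret = quota_read_blk(env, obj, type, blk, buf);
```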
| Comment by Niu Yawei (Inactive) [ 11/Nov/12 ] |
|
http://review.whamcloud.com/4509

Sarah, could you try the above patch? Thanks. |
| Comment by Sarah Liu [ 12/Nov/12 ] |
|
I will try it and let you know the status.

Update: the panic is gone, but mounting the MDS failed with an unknown param, quota_type=ug3:

LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=off. Opts:
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=off. Opts:
Lustre: MGC10.10.4.132@tcp: Reactivating import
Lustre: MGS: Logs for fs lustre were removed by user request. All servers must be restarted in order to regenerate the logs.
Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: Setting parameter lustre-clilov.lov.stripesize in log lustre-client
LustreError: 31273:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
Lustre: lustre-MDT0000: used disk, loading
Lustre: Mounting lustre-MDT0000 at first time on 1.8 FS, remove all clients for interop needs
LustreError: 31347:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000006:0x10000:0x0]).
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.grp) to new IAM quota index([0x200000006:0x1010000:0x0]).
Lustre: 31347:0:(mdt_handler.c:5192:mdt_process_config()) For 1.8 interoperability, skip this mdt.group_upcall. It is obsolete.
LustreError: 31347:0:(obd_config.c:1299:class_process_proc_param()) lustre-MDT0000: unknown param quota_type=ug3
LustreError: 31347:0:(obd_config.c:1546:class_config_llog_handler()) MGC10.10.4.132@tcp: cfg command failed: rc = -38
Lustre: cmd=cf00f 0:lustre-MDT0000 1:mdd.quota_type=ug3
LustreError: 15c-8: MGC10.10.4.132@tcp: The configuration from log 'lustre-MDT0000' failed (-38). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 31273:0:(obd_mount.c:1850:server_start_targets()) failed to start server lustre-MDT0000: -38
LustreError: 31273:0:(obd_mount.c:2399:server_fill_super()) Unable to start targets: -38
LustreError: 31273:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lustre
LustreError: 31273:0:(obd_mount.c:2112:server_put_super()) lustre-MDT0000: failed to disconnect osp-on-ost (rc=-2)!
Lustre: Failing over lustre-MDT0000
LustreError: 31273:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lustre-MDT0000-osp-MDT0000
LustreError: 31273:0:(obd_mount.c:2157:server_put_super()) lustre-MDT0000: Fail to stop osp-on-ost!
LustreError: 31273:0:(ldlm_request.c:1183:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 31273:0:(ldlm_request.c:1815:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: 31273:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1352760135/real 1352760135] req@ffff88011cba1c00 x1418471798734858/t0(0) o251->MGC10.10.4.132@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1352760141 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount lustre-MDT0000 complete
LustreError: 31273:0:(obd_mount.c:2987:lustre_fill_super()) Unable to mount (-38)
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted

A second mount attempt fails the same way:

LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=off. Opts:
Lustre: MGC10.10.4.132@tcp: Reactivating import
Lustre: MGS: Logs for fs lustre were removed by user request. All servers must be restarted in order to regenerate the logs.
Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: Skipped 4 previous similar messages
LustreError: 31689:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
Lustre: lustre-MDT0000: used disk, loading
LustreError: 31757:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Lustre: 31757:0:(mdt_handler.c:5192:mdt_process_config()) For 1.8 interoperability, skip this mdt.group_upcall. It is obsolete.
LustreError: 31757:0:(obd_config.c:1299:class_process_proc_param()) lustre-MDT0000: unknown param quota_type=ug3
LustreError: 31757:0:(obd_config.c:1546:class_config_llog_handler()) MGC10.10.4.132@tcp: cfg command failed: rc = -38
Lustre: cmd=cf00f 0:lustre-MDT0000 1:mdd.quota_type=ug3
LustreError: 15c-8: MGC10.10.4.132@tcp: The configuration from log 'lustre-MDT0000' failed (-38). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 31689:0:(obd_mount.c:1850:server_start_targets()) failed to start server lustre-MDT0000: -38
LustreError: 31689:0:(obd_mount.c:2399:server_fill_super()) Unable to start targets: -38
LustreError: 31689:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lustre
LustreError: 31689:0:(obd_mount.c:2112:server_put_super()) lustre-MDT0000: failed to disconnect osp-on-ost (rc=-2)!
Lustre: Failing over lustre-MDT0000
LustreError: 31689:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lustre-MDT0000-osp-MDT0000
LustreError: 31689:0:(obd_mount.c:2157:server_put_super()) lustre-MDT0000: Fail to stop osp-on-ost!
LustreError: 31689:0:(ldlm_request.c:1183:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 31689:0:(ldlm_request.c:1815:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: 31689:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1352760261/real 1352760261] req@ffff88030e13d000 x1418471798734868/t0(0) o251->MGC10.10.4.132@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1352760267 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount lustre-MDT0000 complete
LustreError: 31689:0:(obd_mount.c:2987:lustre_fill_super()) Unable to mount (-38)
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted |
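For reference, -38 is ENOSYS: class_process_proc_param() no longer recognizes the pre-2.4 quota_type setting still recorded in the upgraded configuration log. A plausible manual cleanup, based on the standard writeconf procedure and the Lustre 2.4 quota syntax (device and filesystem names match this ticket's setup; verify against the manual before running):

```sh
# Drop the stale 1.8-era mdd.quota_type=ug3 setting by regenerating the
# configuration logs (unmount everything first, then run on every target):
tunefs.lustre --writeconf /dev/sdb1

# After remounting (MDT first, then OSTs), re-enable quota with the 2.4
# per-target syntax (run on the MGS) instead of the obsolete quota_type:
lctl conf_param lustre.quota.mdt=ug
lctl conf_param lustre.quota.ost=ug
```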
| Comment by Niu Yawei (Inactive) [ 12/Nov/12 ] |
|
Thank you, Sarah. It seems the patch fixed the panic. The unknown parameter is another issue; I think we should open another ticket and track it there (you can assign it to me). Thanks. |
| Comment by Sarah Liu [ 20/Nov/12 ] |
|
This has been fixed; closing the ticket. |