[LU-2310] Kernel panic when trying to mount MDS after system upgrade from 1.8.8-wc1 to master. Created: 09/Nov/12  Updated: 19/Apr/13  Resolved: 19/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: NFBlocker
Environment:

Before Upgrade:
server: 1.8.8-wc1 RHEL5
client: 1.8.8-wc1 RHEL5/6

After Upgrade:
server: lustre master build #1035 RHEL6
client: lustre master build #1035 RHEL6


Severity: 3
Rank (Obsolete): 5523

 Description   

After upgrading the system from 1.8.8-wc1 to master, we hit a kernel panic when trying to mount the MDS. Here is the console log from the MDS:

[root@fat-amd-1 ~]# mount -t lustre -o user_xattr,acl  /dev/sdb1 /mnt/mds
LDISKFS-fs (sdb1): recovery complete
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=off. Opts: 
Lustre: MGC10.10.4.132@tcp: Reactivating import
Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs.
Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: Setting parameter lustre-clilov.lov.stripesize in log lustre-client
LustreError: 2535:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
Lustre: lustre-MDT0000: used disk, loading
Lustre: Mounting lustre-MDT0000 at first time on 1.8 FS, remove all clients for interop needs
LustreError: 2609:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000006:0x10000:0x0]).
BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
IP: [<ffffffffa09cb2bd>] ldiskfs_quota_read+0x3d/0x140 [ldiskfs]
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/module/mdd/initstate
CPU 3 
Modules linked in: mdt(U) lod(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) jbd2 mdd(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa igb dca mlx4_ib ib_mad ib_core mlx4_en mlx4_core microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core sg shpchp ext3 jbd mbcache sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 2609, comm: llog_process_th Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1 Supermicro H8DGT/H8DGT
RIP: 0010:[<ffffffffa09cb2bd>]  [<ffffffffa09cb2bd>] ldiskfs_quota_read+0x3d/0x140 [ldiskfs]
RSP: 0018:ffff88041515b620  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88011c4943c0 RCX: 000000000000000c
RDX: ffff88010a285c00 RSI: 0000000000000000 RDI: ffff88041d1adc00
RBP: ffff88041515b680 R08: 0000000000000400 R09: ffff88041d1adc00
R10: 0000000000000004 R11: 0000000000000000 R12: ffff88010a285c00
R13: 0000000000000000 R14: ffff88010a285c00 R15: 0000000000000400
FS:  00007f30c588d700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000068 CR3: 0000000001a85000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process llog_process_th (pid: 2609, threadinfo ffff88041515a000, task ffff880415152ae0)
Stack:
 ffff88011fc21600 0000000000000400 0000000000000000 ffff88041d1adc00
<d> 0000000000000000 0000000000000400 0000000000000010 ffff88011c4943c0
<d> ffff88041515bbd0 0000000000000000 ffff88010a285c00 0000000000000400
Call Trace:
 [<ffffffffa0a424d7>] quota_read_blk+0x217/0x280 [osd_ldiskfs]
 [<ffffffffa0a425c2>] find_tree_dqentry+0x82/0x750 [osd_ldiskfs]
 [<ffffffffa0a173ec>] ? osd_trans_stop+0x12c/0x450 [osd_ldiskfs]
 [<ffffffffa0a22302>] ? osd_index_iam_insert+0x252/0x560 [osd_ldiskfs]
 [<ffffffffa0a2eb40>] ? iam_lfix_ipd_alloc+0x0/0x20 [osd_ldiskfs]
 [<ffffffffa0a3cb70>] osd_it_admin_get+0x50/0x230 [osd_ldiskfs]
 [<ffffffffa0a3eeac>] ? write_quota_rec+0x26c/0xb80 [osd_ldiskfs]
 [<ffffffffa0a3cd8b>] osd_it_admin_load+0x3b/0xd0 [osd_ldiskfs]
 [<ffffffffa0a40ca1>] convert_quota_file+0x7b1/0xd80 [osd_ldiskfs]
 [<ffffffffa0a41ceb>] osd_quota_migration+0xa7b/0x1050 [osd_ldiskfs]
 [<ffffffffa0a25c3e>] osd_index_try+0x28e/0x610 [osd_ldiskfs]
 [<ffffffffa0926ad4>] lquota_disk_glb_find_create+0x2c4/0x410 [lquota]
 [<ffffffffa0948bf1>] qmt_pool_prepare+0x1b1/0x710 [lquota]
 [<ffffffffa093b4e3>] qmt_device_prepare+0xd3/0x210 [lquota]
 [<ffffffffa0b7fbf5>] mdt_quota_init+0xdc5/0x1730 [mdt]
 [<ffffffffa0b83aa7>] mdt_device_alloc+0xb47/0xf60 [mdt]
 [<ffffffffa04cd007>] obd_setup+0x1d7/0x2f0 [obdclass]
 [<ffffffffa04cd328>] class_setup+0x208/0x890 [obdclass]
 [<ffffffffa04d48dc>] class_process_config+0xc0c/0x1ca0 [obdclass]
 [<ffffffffa036dbe0>] ? cfs_alloc+0x30/0x60 [libcfs]
 [<ffffffffa04ce533>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
 [<ffffffffa04d6a3b>] class_config_llog_handler+0x9db/0x1850 [obdclass]
 [<ffffffff814ff18e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0491b9b>] llog_process_thread+0x8fb/0xe10 [obdclass]
 [<ffffffffa04ee58f>] ? keys_fill+0x6f/0x1a0 [obdclass]
 [<ffffffffa04920b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
 [<ffffffffa04920b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
 [<ffffffffa04920fc>] llog_process_thread_daemonize+0x4c/0x80 [obdclass]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa04920b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
 [<ffffffffa04920b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 1f 44 00 00 31 c0 48 63 f6 48 89 7d b8 48 89 4d a8 48 8b b4 f7 a0 01 00 00 49 89 d4 48 89 75 b0 0f b6 4f 20 c7 45 cc 00 00 00 00 <48> 8b 56 68 4c 8b 77 18 49 39 d0 0f 8f c6 00 00 00 48 8b 5d a8 
RIP  [<ffffffffa09cb2bd>] ldiskfs_quota_read+0x3d/0x140 [ldiskfs]
 RSP <ffff88041515b620>
CR2: 0000000000000068
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-279.5.1.el6_lustre.x86_64 (jenkins@builder-2-sdd1-el6-x8664.lab.whamcloud.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Mon Oct 29 00:58:20 PDT 2012
Command line: ro root=UUID=6e137aae-0326-4846-9256-0aed1e37b7d5 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=ttyS0,115200 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off  memmap=exactmap memmap=538K@64K memmap=132562K@49690K elfcorehdr=182252K memmap=64K$0K memmap=38K$602K memmap=104K$920K memmap=8K$3668600K memmap=72K#3668608K memmap=184K#3668680K memmap=263296K$3668864K memmap=2048K$4192256K
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD


 Comments   
Comment by Niu Yawei (Inactive) [ 11/Nov/12 ]

hmm, there is a typo in find_tree_dqentry():

ret = quota_read_blk(env, obj, 0, blk, buf);

we should pass 'type' to quota_read_blk(); I am not sure why I passed '0' here.
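
A minimal sketch of the intended fix, assuming 'type' is the quota-type parameter of the enclosing find_tree_dqentry() (the variable name is an assumption based on the comment above, not confirmed against the patch):

/*
 * Pass the caller's quota type through to quota_read_blk() instead of the
 * hard-coded 0 (presumably USRQUOTA), so the lookup reads the admin quota
 * file matching the requested type; the trace above suggests the mismatched
 * file can be NULL here, leading to the NULL dereference seen in
 * ldiskfs_quota_read().
 */
ret = quota_read_blk(env, obj, type, blk, buf);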

Comment by Niu Yawei (Inactive) [ 11/Nov/12 ]

http://review.whamcloud.com/4509

Sarah, could you try the above patch? Thanks.

Comment by Sarah Liu [ 12/Nov/12 ]

I will try it and let you know the status.

Mounting the MDS failed with "unknown param quota_type=ug3":

LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=off. Opts: 
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=off. Opts: 
Lustre: MGC10.10.4.132@tcp: Reactivating import
Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs.
Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: Setting parameter lustre-clilov.lov.stripesize in log lustre-client
LustreError: 31273:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
Lustre: lustre-MDT0000: used disk, loading
Lustre: Mounting lustre-MDT0000 at first time on 1.8 FS, remove all clients for interop needs
LustreError: 31347:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000006:0x10000:0x0]).
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.grp) to new IAM quota index([0x200000006:0x1010000:0x0]).
Lustre: 31347:0:(mdt_handler.c:5192:mdt_process_config()) For 1.8 interoperability, skip this mdt.group_upcall. It is obsolete.
LustreError: 31347:0:(obd_config.c:1299:class_process_proc_param()) lustre-MDT0000: unknown param quota_type=ug3
LustreError: 31347:0:(obd_config.c:1546:class_config_llog_handler()) MGC10.10.4.132@tcp: cfg command failed: rc = -38
Lustre:    cmd=cf00f 0:lustre-MDT0000  1:mdd.quota_type=ug3  
LustreError: 15c-8: MGC10.10.4.132@tcp: The configuration from log 'lustre-MDT0000' failed (-38). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 31273:0:(obd_mount.c:1850:server_start_targets()) failed to start server lustre-MDT0000: -38
LustreError: 31273:0:(obd_mount.c:2399:server_fill_super()) Unable to start targets: -38
LustreError: 31273:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lustre
LustreError: 31273:0:(obd_mount.c:2112:server_put_super()) lustre-MDT0000: failed to disconnect osp-on-ost (rc=-2)!
Lustre: Failing over lustre-MDT0000
LustreError: 31273:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lustre-MDT0000-osp-MDT0000
LustreError: 31273:0:(obd_mount.c:2157:server_put_super()) lustre-MDT0000: Fail to stop osp-on-ost!
LustreError: 31273:0:(ldlm_request.c:1183:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 31273:0:(ldlm_request.c:1815:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: 31273:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1352760135/real 1352760135]  req@ffff88011cba1c00 x1418471798734858/t0(0) o251->MGC10.10.4.132@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1352760141 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount lustre-MDT0000 complete
LustreError: 31273:0:(obd_mount.c:2987:lustre_fill_super()) Unable to mount  (-38)
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=off. Opts: 
Lustre: MGC10.10.4.132@tcp: Reactivating import
Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs.
Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: Skipped 4 previous similar messages
LustreError: 31689:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
Lustre: lustre-MDT0000: used disk, loading
LustreError: 31757:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Lustre: 31757:0:(mdt_handler.c:5192:mdt_process_config()) For 1.8 interoperability, skip this mdt.group_upcall. It is obsolete.
LustreError: 31757:0:(obd_config.c:1299:class_process_proc_param()) lustre-MDT0000: unknown param quota_type=ug3
LustreError: 31757:0:(obd_config.c:1546:class_config_llog_handler()) MGC10.10.4.132@tcp: cfg command failed: rc = -38
Lustre:    cmd=cf00f 0:lustre-MDT0000  1:mdd.quota_type=ug3  
LustreError: 15c-8: MGC10.10.4.132@tcp: The configuration from log 'lustre-MDT0000' failed (-38). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 31689:0:(obd_mount.c:1850:server_start_targets()) failed to start server lustre-MDT0000: -38
LustreError: 31689:0:(obd_mount.c:2399:server_fill_super()) Unable to start targets: -38
LustreError: 31689:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log lustre
LustreError: 31689:0:(obd_mount.c:2112:server_put_super()) lustre-MDT0000: failed to disconnect osp-on-ost (rc=-2)!
Lustre: Failing over lustre-MDT0000
LustreError: 31689:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost lustre-MDT0000-osp-MDT0000
LustreError: 31689:0:(obd_mount.c:2157:server_put_super()) lustre-MDT0000: Fail to stop osp-on-ost!
LustreError: 31689:0:(ldlm_request.c:1183:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 31689:0:(ldlm_request.c:1815:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: 31689:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1352760261/real 1352760261]  req@ffff88030e13d000 x1418471798734868/t0(0) o251->MGC10.10.4.132@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1352760267 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount lustre-MDT0000 complete
LustreError: 31689:0:(obd_mount.c:2987:lustre_fill_super()) Unable to mount  (-38)
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
Comment by Niu Yawei (Inactive) [ 12/Nov/12 ]

Thank you, Sarah. It seems the patch fixed the panic.

The unknown parameter is another issue; I think we should open another ticket and track it there (you can assign it to me). Thanks.

Comment by Sarah Liu [ 20/Nov/12 ]

This has been fixed; closing the ticket.
