[LU-4390] cfs_trace_get_tage+0x1c6/0x300 [libcfs] Created: 18/Dec/13  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: yueyuling Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

1 Lustre Client Server
2 MDS Server, config failover
2 OSS Server,config failover


Issue Links:
Related
is related to LU-4391 vsnprintf+0x218/0x5e0 Resolved
Severity: 3
Rank (Obsolete): 12054

 Description   

While deleting a large number of small files (close to 800,000), we simulated a crash of one MDS. While the other MDS was taking over for the crashed one, the surviving MDS printed the call trace below and crashed.
After a while, the client server crashed too.
Call trace info as follows:
BUG: unable to handle kernel paging request at 0000000000001008
IP: [<ffffffffa08394b2>] cfs_tage_to_tail+0x32/0xa0 [libcfs]
PGD 26ca0b067 PUD 26e2b2067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/class/nt_neg/info
CPU 1
Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) device(F)(U) tcmfc(F)(U) libpmcfc(U) lundev(F)(U) iscsi_target_mod(F)(U) target_core_mod(F)(U) replication(F)(U) snapshot(F)(U) hotcache(F)(U) cache(F)(U) mmgt(F)(U) copy(F)(U) raid5(F)(U) raid10(F)(U) raid1(F)(U) raid0(F)(U) data_record(F)(U) raid_common(F)(U) ACpoweroff(F)(U) pm80xx(F)(U) sys_disk(U) disk_noop(F)(U) disk_deadline(F)(U) disk_elevator(F)(U) lib_sas(F)(U) scsi_transport_sas(F)(U) lib_ata(F)(U) odsp_sas(F)(U) disk_adapter(F)(U) disk_vault(F)(U) disk_error(F)(U) odsp_scsi(F)(U) disk_err_stub(U) disk_arch(F)(U) cell(F)(U) ld_worker(F)(U) comm_duplock(F)(U) iodir(F)(U) pthread(F)(U) kmempool(F)(U) common(U) comm(F)(U) dma(F)(U) bugon(F)(U) driver_adapter(F)(U) event(F)(U) debug(U) nt_memcpy(U) drv_dma(F)(U) plxdma(F)(U) async_tx(U) xor(U) nt_neg_resume(F)(U) Plx8000(U) msgpio(U) bonding ipv6 sunrpc i2c_ismt(U) i2c_dev 8021q garp stp llc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_mod e1000e sg ixgbe hwmon dca mdio iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ext4 mbcache jbd2 sd_mod crc_t10dif ahci [last unloaded: scsi_wait_scan]

Pid: 23499, comm: mdt00_003 Tainted: GF W --------------- 2.6.32-358.6.2.l2.08 #2 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffffa08394b2>] [<ffffffffa08394b2>] cfs_tage_to_tail+0x32/0xa0 [libcfs]
RSP: 0000:ffff880288c119b0 EFLAGS: 00010282
RAX: ffff880290eecd20 RBX: ffff880290c79720 RCX: 0000000000001000
RDX: ffff880290c79728 RSI: ffff880290eecd20 RDI: ffff880290c79720
RBP: ffff880288c119c0 R08: 000000000000004c R09: 0000000000000000
R10: 0000000000000001 R11: ffff88026d710d60 R12: ffff880290eecd20
R13: 0000000000001010 R14: ffff880288c11ae0 R15: 00000000000000a1
FS: 00007fe20c76a700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000001008 CR3: 000000026d0bd000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mdt00_003 (pid: 23499, threadinfo ffff880288c10000, task ffff880293bc0aa0)
Stack:
ffff880290eecd00 0000000000000400 ffff880288c11a30 ffffffffa083aab6
<d> 0000000052b01764 ffff880290c79720 0000000000000000 ffff880288c11ab0
<d> ffff880288c11a30 ffffffffa082927b 0000000052b01764 0000000000009880
Call Trace:
[<ffffffffa083aab6>] cfs_trace_get_tage+0x1c6/0x300 [libcfs]
[<ffffffffa082927b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
[<ffffffffa083ada2>] libcfs_debug_vmsg2+0x1b2/0x890 [libcfs]
[<ffffffffa083b4c1>] libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa093fea7>] ? class_handle2object+0x97/0x170 [obdclass]
[<ffffffffa1029b01>] mdt_save_lock+0x201/0x300 [mdt]
[<ffffffffa1029c5c>] mdt_object_unlock+0x5c/0x160 [mdt]
[<ffffffffa104a52b>] mdt_reint_unlink+0xa0b/0x14e0 [mdt]
[<ffffffffa10457e1>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa1027063>] mdt_reint_internal+0x4d3/0x7a0 [mdt]
[<ffffffffa1027374>] mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa102ca08>] mdt_handle_common+0x648/0x1660 [mdt]
[<ffffffffa106a7e5>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa0abfa59>] ptlrpc_server_handle_request+0x3c9/0xc70 [ptlrpc]
[<ffffffffa082a64e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa083c0ff>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa0abc94f>] ? ptlrpc_wait_event+0xaf/0x2d0 [ptlrpc]
[<ffffffff8103fec3>] ? __wake_up+0x53/0x70
[<ffffffffa0ac341e>] ptlrpc_main+0xafe/0x1940 [ptlrpc]
[<ffffffffa0ac2920>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
[<ffffffff81003f1a>] child_rip+0xa/0x20
[<ffffffffa0ac2920>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
[<ffffffffa0ac2920>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
[<ffffffff81003f10>] ? child_rip+0x0/0x20
Code: 10 48 89 1c 24 4c 89 64 24 08 0f 1f 44 00 00 48 85 ff 48 89 fb 49 89 f4 74 38 4d 85 e4 74 52 48 8b 43 10 48 8b 4b 08 48 8d 53 08 <48> 89 41 08 48 89 08 49 8b 44 24 08 49 89 54 24 08 4c 89 63 08
RIP [<ffffffffa08394b2>] cfs_tage_to_tail+0x32/0xa0 [libcfs]
RSP <ffff880288c119b0>
CR2: 0000000000001008
--[ end trace 026f77cb820e9e09 ]--
Kernel panic - not syncing: Fatal exception
Pid: 23499, comm: mdt00_003 Tainted: GF D W --------------- 2.6.32-358.6.2.l2.08 #2
Call Trace:
[<ffffffff81476fa7>] ? panic+0xa1/0x163
[<ffffffff8147b2fc>] ? oops_end+0xdc/0xf0
[<ffffffff81031b3b>] ? no_context+0xfb/0x260
[<ffffffff81075ec7>] ? bit_waitqueue+0x17/0xd0
[<ffffffff81031db5>] ? __bad_area_nosemaphore+0x115/0x1d0
[<ffffffffa00248e2>] ? do_get_write_access+0x3d2/0x550 [jbd2]
[<ffffffff81031e83>] ? bad_area_nosemaphore+0x13/0x20
[<ffffffff8147cdd1>] ? do_page_fault+0x351/0x540
[<ffffffffa0ece15b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
[<ffffffffa0edee18>] ? ldiskfs_bread+0x18/0x90 [ldiskfs]
[<ffffffff8147a68f>] ? page_fault+0x1f/0x30
[<ffffffffa08394b2>] ? cfs_tage_to_tail+0x32/0xa0 [libcfs]
[<ffffffffa083aab6>] ? cfs_trace_get_tage+0x1c6/0x300 [libcfs]
[<ffffffffa082927b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
[<ffffffffa083ada2>] ? libcfs_debug_vmsg2+0x1b2/0x890 [libcfs]
[<ffffffffa083b4c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa093fea7>] ? class_handle2object+0x97/0x170 [obdclass]
[<ffffffffa1029b01>] ? mdt_save_lock+0x201/0x300 [mdt]
[<ffffffffa1029c5c>] ? mdt_object_unlock+0x5c/0x160 [mdt]
[<ffffffffa104a52b>] ? mdt_reint_unlink+0xa0b/0x14e0 [mdt]
[<ffffffffa10457e1>] ? mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa1027063>] ? mdt_reint_internal+0x4d3/0x7a0 [mdt]
[<ffffffffa1027374>] ? mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa102ca08>] ? mdt_handle_common+0x648/0x1660 [mdt]
[<ffffffffa106a7e5>] ? mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa0abfa59>] ? ptlrpc_server_handle_request+0x3c9/0xc70 [ptlrpc]
[<ffffffffa082a64e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa083c0ff>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa0abc94f>] ? ptlrpc_wait_event+0xaf/0x2d0 [ptlrpc]
[<ffffffff8103fec3>] ? __wake_up+0x53/0x70
[<ffffffffa0ac341e>] ? ptlrpc_main+0xafe/0x1940 [ptlrpc]
[<ffffffffa0ac2920>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
[<ffffffff81003f1a>] ? child_rip+0xa/0x20
[<ffffffffa0ac2920>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
[<ffffffffa0ac2920>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
[<ffffffff81003f10>] ? child_rip+0x0/0x20



 Comments   
Comment by Andreas Dilger [ 18/Dec/13 ]

Has this problem happened more than once?

It looks to be a bug in the debug message handling (maybe too long a debug message, or some race between determining the message length and writing it to the debug buffer?), and not really related to DNE or recovery at all.
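For context on the suspected failure mode: `cfs_tage_to_tail()` requeues a trace page by splicing its `linkage` list head to the tail of a queue. The faulting write at CR2=0x1008 with RCX=0x1000 is consistent with the page's `next` pointer holding a bogus value near 0x1000 and the unlink step storing to `next->prev` (offset 8), i.e. corrupted or racing link pointers rather than a problem in the MDT code on the stack. Below is a minimal, standalone sketch of that list manipulation; the names (`tage`, `linkage`, `tage_to_tail`) mirror the Lustre source, but this is an illustrative model, not the actual libcfs code.

```c
/* Hypothetical minimal model of the libcfs trace-page queue handling.
 * A standalone sketch for illustration, not the real kernel code. */
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Append entry just before head, i.e. at the tail of the queue. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Unlink entry and re-append it at the tail. */
static void list_move_tail(struct list_head *entry, struct list_head *head)
{
    /* Unlink: this is where the oops would fire if entry->next is
     * garbage.  With entry->next == 0x1000, the store to next->prev
     * (offset 8) faults at 0x1008, matching CR2 in the report. */
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;

    list_add_tail(entry, head);
}

struct cfs_trace_page {            /* stand-in for the real struct */
    struct list_head linkage;      /* position in the trace queue   */
    int id;
};

/* Modeled after cfs_tage_to_tail(): requeue a trace page at the tail.
 * If two threads race here without the queue lock, the intermediate
 * link states can leave another page pointing at freed or half-updated
 * memory -- the kind of corruption the traceback suggests. */
static void tage_to_tail(struct cfs_trace_page *tage, struct list_head *queue)
{
    list_move_tail(&tage->linkage, queue);
}
```

On healthy link pointers this is a straightforward doubly-linked-list splice; the crash only makes sense if some earlier writer left a stale or partially updated pointer in the queue, which fits the "race between determining the message length and writing it" hypothesis above.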

Comment by yueyuling [ 19/Dec/13 ]

I retried the same operations 3 times after this problem appeared, and it did not happen again.
In addition, could you explain what DNE means?
This problem caused the MDS to crash; could you offer some help to avoid the crash?

Comment by Andreas Dilger [ 09/Jan/20 ]

Close old bug

Generated at Sat Feb 10 01:42:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.