[LU-11519] sanity-hsm: test_90 MDS crash in mdt_cdt_waiting_cb() Created: 15/Oct/18  Updated: 19/Dec/18  Resolved: 27/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-11630 sanity-hsm test_13: Crash in mdt_cdt_... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run review-dne-zfs-part-2:
https://testing.whamcloud.com/test_sets/864ae762-d020-11e8-82f2-52540065bddc

BUG: unable to handle kernel paging request at 00000000240cd204
 IP: mdt_cdt_waiting_cb.isra.25+0x357/0xc10 [mdt]
 CPU: 0 PID: 25301 Comm: hsm_cdtr Kdump: loaded Tainted: P
 RIP: 0010:[<ffffffffc1230617>]  [<ffffffffc1230617>] mdt_cdt_waiting_cb.isra.25+0x357/0xc10 [mdt]
Call Trace:
mdt_coordinator_cb+0x162/0x290 [mdt]
llog_process_thread+0x852/0x1550 [obdclass]
llog_process_or_fork+0xbc/0x450 [obdclass]
llog_cat_process_cb+0x239/0x250 [obdclass]
llog_process_thread+0x852/0x1550 [obdclass]
llog_process_or_fork+0xbc/0x450 [obdclass]
llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
llog_cat_process+0x2e/0x30 [obdclass]
cdt_llog_process+0xc6/0x3a0 [mdt]
mdt_coordinator+0x541/0x19f0 [mdt]
kthread+0xd1/0xe0
ret_from_fork_nospec_begin+0x21/0x21


 Comments   
Comment by Andreas Dilger [ 19/Oct/18 ]

+1 in review-dne-part-2 test:
https://testing.whamcloud.com/test_sets/32eb66fe-d3a5-11e8-ad90-52540065bddc

Comment by James Nunez (Inactive) [ 19/Oct/18 ]

We are seeing sanity-hsm test 13 crash at about a 5% failure rate with this call trace. All are for DNE with ZFS configurations.

Some example are at
https://testing.whamcloud.com/test_sets/b61b466e-d34d-11e8-9238-52540065bddc
https://testing.whamcloud.com/test_sets/d47f7fc4-d313-11e8-82f2-52540065bddc

Comment by Andreas Dilger [ 23/Oct/18 ]

+1 on master: https://testing.whamcloud.com/test_sets/23d994e6-d644-11e8-ad90-52540065bddc

Comment by Yang Sheng [ 25/Oct/18 ]

The crash dump cannot be loaded to analysis.

Comment by Andreas Dilger [ 29/Oct/18 ]

+1 on master https://testing.whamcloud.com/test_sets/30e877d2-d6d1-11e8-9238-52540065bddc

There are actually many failures like this.

Comment by Andreas Dilger [ 01/Nov/18 ]

John, could you please take a look at this, it looks like a coordinator bug and has been crashing the MDS fairly commonly lately.

Comment by John Hammond [ 02/Nov/18 ]

I grabbed the most recent failure from maloo https://testing.whamcloud.com/test_sets/27781c7e-dde4-11e8-975a-52540065bddc which uses build 59629:

[ 5043.117404] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[ 5043.119825] IP: [<ffffffffc1327747>] mdt_cdt_waiting_cb.isra.25+0x357/0xc10 [mdt]
[ 5043.122276] PGD 0

$ mkdir /tmp/LU-11519
$ cd /tmp/LU-11519
$ wget 'https://build.whamcloud.com/job/lustre-reviews/59629/arch=x86_64,build_type=server,distro=el7,ib_stack=inkernel/artifact/artifacts/RPMS/x86_64/lustre-debuginfo-2.11.56_56_g098be7d-1.el7.x86_64.rpm'
...
$ rpm2cpio lustre-debuginfo-2.11.56_56_g098be7d-1.el7.x86_64.rpm | cpio -id
$ nm ./usr/lib/debug/lib/modules/3.10.0-862.14.4.el7_lustre.x86_64/extra/lustre/fs/mdt.ko.debug | awk '$3 == "mdt_cdt_waiting_cb.isra.25" { print $1; }'
000000000005f3f0
$ printf '%#lx\n' $(( 0x357 + 0x5f3f0 ))
0x5f747
$ addr2line -e ./usr/lib/debug/lib/modules/3.10.0-862.14.4.el7_lustre.x86_64/extra/lustre/fs/mdt.ko.debug --functions --inlines 0x5f747
mdt_cdt_waiting_cb
/usr/src/debug/lustre-2.11.56_56_g098be7d/lustre/mdt/mdt_coordinator.c:228
$ cat -n ./usr/src/debug/lustre-2.11.56_56_g098be7d/lustre/mdt/mdt_coordinator.c | sed -n '220,230p'
   220				} while (request->hal_used_sz + hai_size >
   221					 LDLM_MAXREQSIZE);
   222			} else if (hsd->hsd_housekeeping) {
   223				struct hsm_scan_request *tmp;
   224	
   225				/* Discard the (whole) last hal */
   226				hsd->hsd_request_count--;
   227				tmp = &hsd->hsd_request[hsd->hsd_request_count];
   228				hsd->hsd_action_count -= tmp->hal->hal_count;
   229				OBD_FREE(tmp->hal, tmp->hal_sz);
   230			} else {

So it looks like tmp->hal is NULL.

Comment by Andreas Dilger [ 02/Nov/18 ]
$ nm ...
$ printf ..
$ addr2line -e ...
$ cat -n ...

I typically use the following, which is considerably easier:

$ gdb .../mdt.ko.debug -ex "list *(mdt_cdt_waiting_cb+0x357)" -ex quit
(gdb) list *(mdt_cdt_waiting_cb+0x357)
0x5f777 is in mdt_cdt_waiting_cb (/usr/src/lustre-head/lustre/mdt/mdt_coordinator.c:228).
223                             struct hsm_scan_request *tmp;
224     
225                             /* Discard the (whole) last hal */
226                             hsd->hsd_request_count--;
227                             tmp = &hsd->hsd_request[hsd->hsd_request_count];
228                             hsd->hsd_action_count -= tmp->hal->hal_count;
229                             OBD_FREE(tmp->hal, tmp->hal_sz);
230                     } else {
231                             /* Bailing out, this code path is too hot */
232                             RETURN(LLOG_PROC_BREAK);

Note the "list *()" argument is typically the same format as how addresses are printed from stack dumps/oops messages, but in this case the ".isra.25" part needs to be dropped.

Comment by Jian Yu [ 03/Nov/18 ]

sanity-hsm test 13 also hit the same failure in review-dne-zfs-part-2 test session on master branch:
https://testing.whamcloud.com/test_sets/ee81ca36-defc-11e8-a251-52540065bddc

Comment by Jian Yu [ 04/Nov/18 ]

sanity-hsm test 13 hit this failure again:
https://testing.whamcloud.com/test_sets/b6a66416-dfe5-11e8-89f8-52540065bddc

Comment by John Hammond [ 05/Nov/18 ]

> I typically use the following, which is considerably easier:

(I like my way because it makes inlining much easier to understand. Which is not an issue here. Maybe gcc has a better way to do this in those cases.)

Comment by Gerrit Updater [ 05/Nov/18 ]

John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33580
Subject: LU-11519 hsm: handle hsd_request_count == 0 properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5021ced380ad820a5f92ac1b41363f5919d79e16

Comment by Gerrit Updater [ 06/Nov/18 ]

Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: https://review.whamcloud.com/33590
Subject: LU-11519 hsm: improve the testing of hsm.max_requests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f56c3963b1314e641044464b50e8a549762c751e

Comment by Gerrit Updater [ 21/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33580/
Subject: LU-11519 hsm: handle hsd_request_count == 0 properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: af05afdfb5781806ab3bc059c86c289b01713ade

Comment by Gerrit Updater [ 27/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33590/
Subject: LU-11519 hsm: improve the testing of hsm.max_requests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c1df28b34acdb56acff444400fc9a05d8adf493a

Comment by Peter Jones [ 27/Nov/18 ]

Landed for 2.12

Generated at Sat Feb 10 02:44:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.