-
The distro is probably the least important part. What matters is having access to the kernel-dev package for your local kernel, and (very preferably) a kernel that works with all ZFS versions involved. If it weren't out of support I might suggest CentOS 7. In particular I would make sure no ZFS software whatsoever is installed before beginning, just to make sure there's no conflict at all. Don't forget to delete any leftover ZFS kernel modules. Check out the git repo, go into it with a shell, and start the bisect:
Versions can be tags or commit hashes. Git will check out a middle version for you; build it, run your test, and mark the result good or bad. When it's done, Git will name the commit that causes your regression, and you can reset the repo afterward. If a version fails to build for some reason and you think the breakage is specific to that particular commit, you can tell bisect to skip it.
-
-
Thanks both for your replies
If I can build and test 0.8.6 and 2.0.0 I'm pretty sure that's covered.
Absolutely true with Debian.
I'll be sure to do that.
I've been using the commit IDs. I didn't know that releases could also work.
I may try that if I cannot find a solution to the issue I mentioned.
I'll stick with that unless someone can provide a strong argument for something else. I wouldn't object to CentOS 7 if I could find install media. I don't care that it is out of support, as this host is a throwaway and its only Internet exposure is pulling OpenZFS from GitHub. (@DeHackEd's CentOS 7 suggestion seemed a little lukewarm.)
Ah... 0.8.6 is actually more recent than 2.0.0, so bisect could pull in older releases not compatible with 5.9. I think I still have 4.19 installed, and if not, a nuke & pave is 15 minutes away.
There were issues where the build could not be completed, as above, but I think your suggestion to use 4.19 may be the key there. Edit.0: corruption with 2.0.0 and no corruption with 0.8.6 has been confirmed. The first bisect build is processing.
-
I just realized that @IvanVolosyuk suggested the 4.18 kernel and I've been blithely testing with the 4.19 kernel. Looking at tagged releases, I see that no kernel compatibility is listed for 0.8.0, and 0.8.0-RC1 is listed as compatible through 4.18. However, 0.8.0-RC5 claims compatibility through 5.1. 2.0.0-RC1 claims compatibility through 5.8. 2.0.0 on 4.19 produced corruption in a bit more than two hours. Testing with 0.8.6 on 4.19 started about ten minutes ago and will run until I'm confident no corruption will be produced.
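One way to catch this kind of kernel mismatch before burning hours on a test: OpenZFS records the supported kernel range for each commit in the top-level META file (the Linux-Minimum / Linux-Maximum fields). A small sketch below uses a stand-in META file; the field values are illustrative, not taken from any particular commit — in a real bisect you would read the META at the top of the checkout for the commit being tested.

```shell
# Stand-in META file; a real bisect step reads the checkout's own META.
cat > META <<'EOF'
Meta:          1
Name:          zfs
Linux-Minimum: 3.10
Linux-Maximum: 5.9
EOF
# Highest kernel series this commit claims to support:
kmax=$(awk '/^Linux-Maximum:/ {print $2}' META)
echo "supported up to kernel $kmax"
```

Comparing that value against `uname -r` before each build would flag commits whose claimed support range excludes the running kernel.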
-
We should know soon.
I was thinking of just posting
But that would be mean. ;) Currently 3 1/2 hours into the (hopefully) last test with no corruption. Usually they're revealed before then.
-
Thank you for checking my work.
…On Tue, Apr 8, 2025 at 2:26 PM TheJulianJES ***@***.***> wrote:
Hmm, after running git bisect between zfs-0.8.6 and zfs-2.0.0 myself with
the test results from Hank, I get different results on what to test next.
b1b4ac2 was tested bad (in the second test run), so d3230d7 should be next. The full Git CLI output is at the bottom of the comment.
@HankB <https://github.com/HankB> Maybe the bisect for b1b4ac2 was incorrectly marked as "good" and not "bad"?
Or am I missing something...? (Thanks for all the work on this issue btw!)
Git bisect output:
$ git bisect start
status: waiting for both good and bad commits
$ git bisect bad zfs-2.0.0
status: waiting for good commit(s), bad commit known
$ git bisect good zfs-0.8.6
Bisecting: a merge base must be tested
[78fac8d] Fix kstat state update during pool transition
$ git bisect good
Bisecting: 629 revisions left to test after this (roughly 9 steps)
[327000c] Remove zfs_getattr and convoff dead code
$ git bisect bad
Bisecting: 314 revisions left to test after this (roughly 8 steps)
[eedb3a6] Make `zil_async_to_sync` visible to platform code
$ git bisect bad
Bisecting: 157 revisions left to test after this (roughly 7 steps)
[1e620c9] Revert "Develop tests for issues #5866 and #8858"
$ git bisect bad
Bisecting: 78 revisions left to test after this (roughly 6 steps)
[a64f827] Update vdev_ops_t from illumos
$ git bisect bad
Bisecting: 38 revisions left to test after this (roughly 5 steps)
[8e91c5b] hkdf_test binary should only have one icp instance
$ git bisect good
Bisecting: 19 revisions left to test after this (roughly 4 steps)
[d9cd66e] Target ARC size can get reduced to arc_c_min
$ git bisect good  # this failed in second test run of the same commit according to Hank's website
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[b1b4ac2] Python config cleanup
$ git bisect bad  # this should be tested next, but the bisect on Hank's site is a different "known bad" one already
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[d3230d7] looping in metaslab_block_picker impacts performance on fragmented pools
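(Aside: the "roughly N steps" estimates in a log like this are just floor(log2(remaining revisions)), since each test halves the candidate range. A quick check of the numbers:)

```shell
# Each bisect step halves the range, so the step estimate for n
# remaining revisions is floor(log2(n)). The 1e-9 guards against
# floating-point rounding on exact powers of two.
for n in 629 314 157 78 38 19 9 4; do
    awk -v n="$n" 'BEGIN {
        printf "%d revisions -> roughly %d steps\n", n, int(log(n)/log(2) + 1e-9)
    }'
done
```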
Screenshot of Git history
Some commit in between these should be tested next (e.g. d3230d7):
Git.ZFS.bisect.png (view on web)
<https://github.com/user-attachments/assets/263ceafb-5844-4ccb-8fa6-4f7319cfee2a>
According to that screenshot, what's being tested now (c1b5801) should fail..? It was committed after the last known bad commit.
--
Beautiful Sunny Winfield
-
I'm confused. @TheJulianJES had a comment questioning whether I had gotten something wrong. I take these very seriously, as I am very concerned that a mistake on my part could lead a kernel dev down the wrong path, and that is absolutely the last thing I want to do. The "last" bisect is still running without any corruption detected, and I'm going to turn my attention to my notes to see if I can find something wrong. I'll report back; I must complete that review before reporting any further results.
-
Where are we now? If someone has results that don't match mine, I need to figure out why. BTW, I do really appreciate folks taking a close look at my work and trying to find errors in it. My efforts are wasted if the work is not done right. And I'll repeat it if necessary. Thanks!
-
No worries - I need a little excitement now and then, particularly where everyone can see it. (I was a big fan of code walk-throughs, and the only thing I didn't like was when everyone said "it all looks good.") I've reviewed the steps and I believe that I've marked good/bad correctly for the various tests. I have saved all of my log files so I can reconcile them with what I concluded, to make sure I didn't misinterpret something. (later) I'll repeat the tests suggested by @AndrewJDR but can only do one at a time. I'll post up the "in progress" test in a few minutes. best,
-
The bisect is complete. I will shortly be repeating the test for the previous commit as suggested. When that is complete I will add a comment about this in the original issue. Thanks all for your help and support with this!
-
Edit: Unfortunately, at completion of the test I realized that I had checked out the commit for bisect #9. Repeating now for the correct commit. The repeat of bisect
I can repeat the test with the three options listed above disabled. Is there anything else I should change for this test? thanks,
-
10th bisect repeat is complete with corruption at about 3 hours. Is there anything else to do before reporting to #12014? I will repeat this test with the other settings tomorrow. best,
-
testing with
has just completed with corruption detected. I'll be posting the results in a few minutes.
-
Good morning,
I have developed some scripts and a methodology to provoke the encryption-related corruption that has been present since 2.0.0 (#12014). This work is at https://github.com/HankB/provoke_ZFS_corruption.
Early on it was suggested that I use these tests to git bisect between 0.8.6 and 2.0.0, and I began that work. I settled on the 5.9.16 kernel because the 5.9 series was the most recent supported by both ZFS versions in question. On the first step in the bisect (after confirming that 0.8.6 and 2.0.0 built without issue) I encountered build errors and could not progress past the ./configure process. After not finding an answer to the error, I set this aside and there has been no further progress. My notes on this are at https://github.com/HankB/provoke_ZFS_corruption/blob/main/docs/tests/2025-03-03_Linux_Buster_5.9.16_bisect_0.8.6_2.0.0/Setup.md
I can continue with this effort but will need some help getting the various modules built. I have to believe that each commit is buildable under the correct conditions, but I do not know what conditions are required. Questions I have are:
Whether strace, dtrace, or printk() could be used to capture information that would help determine the sequence of operations that results in corruption. (Here I am throwing out the names of things I have heard of and have no experience with.)
This leaves me with two asks:
Thanks!
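On the build question: the sequence below is a sketch of what each bisect step likely needs, not verified against every commit in the range. The header paths are an assumption about a typical Debian setup, the scripts/zfs.sh module-loading helper assumes a reasonably modern checkout, and very old commits may also want older autotools versions.

```shell
# Sketch of building the kernel modules for one bisected commit.
# Paths are assumptions; adjust for where your kernel source/headers live.
cd ~/zfs                      # the OpenZFS checkout being bisected
git clean -xfd                # stale generated files can break ./configure
sh autogen.sh                 # regenerate configure for this commit
./configure \
    --with-linux=/usr/src/linux-headers-"$(uname -r)" \
    --with-linux-obj=/usr/src/linux-headers-"$(uname -r)"
make -s -j"$(nproc)"
sudo ./scripts/zfs.sh         # load the just-built modules without installing
```

Running `git clean -xfd` between steps matters during a bisect: configure artifacts generated for one commit can poison the build of the next one.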