[BUG] Memory issues when using the solver for multiple registration attempts #30
Oh, thanks for reporting! Hmm... that looks off, but occasionally I have also seen that error. Could you run the code like this?
In the case of Python, at which iteration does the error occur?
Hmmm... bro, I tested the Python path by running KISS-Matcher multiple times, but there are no memory issues. Could you check your data carefully? Did you acquire very noisy data? Or, if possible, could you share your Python script and data with me?
My bad, apparently it's also happening with the "pure" C++ code, so at least there is one less layer (the pybinding) to worry about. Just in case, here is a video proving I'm not going mad: https://drive.google.com/file/d/1V_3Un0wdeE0gZWq2ECEtALP-G2xTMjde/view?usp=sharing
Due to the ICCV deadline, I currently have no bandwidth; could you double-check which function causes the memory issue (using GDB, or just putting ...)?
No problem, thanks for the code and help! It is crashing due to something in this parallel for loop in FasterPFH.cpp, according to my debugging:

```cpp
tbb::parallel_for(tbb::blocked_range<size_t>(0, N), [&](const tbb::blocked_range<size_t> &r) {
  // tbb::parallel_for(0, N, [&](const int& j) {
  for (size_t j = r.begin(); j != r.end(); ++j) {
    const int p_idx = fpfh_indices_[j];
    std::vector<uint32_t> nn_indices;
    std::vector<double> nn_dists;
    nn_indices.reserve(corrs_fpfh_[p_idx].neighboring_indices.size());
    nn_dists.reserve(corrs_fpfh_[p_idx].neighboring_dists.size());
    // nn_cardinalities.resize(p_voxel.valid_neighboring_voxels.size());
    const auto &indices = corrs_fpfh_[p_idx].neighboring_indices;
    const auto &dists = corrs_fpfh_[p_idx].neighboring_dists;
    for (size_t i = 0; i < indices.size(); ++i) {
      if (is_valid_[indices[i]]) {
        nn_indices.emplace_back(spfh_hist_lookup_[indices[i]]);
        nn_dists.emplace_back(dists[i]);
      }
    }
    points[j] = points_[p_idx];
    WeightPointSPFHSignature(hist_f1_, hist_f2_, hist_f3_, nn_indices, nn_dists, descriptors[j]);
  }
});
```

I think there is some unlucky race condition that randomly comes up if you have a lot of CPU cores: I am having issues on a workstation with a 24-core i9-13900K, but could not reproduce the issue on my laptop with a 10-core i7-1355U. After removing the parallel_for and using a simple for loop, the memory issues look fixed. I'm not really an expert on fixing race conditions, and removing the parallelism, even if not ideal, is a good enough solution for me, so I don't think I will be investigating this issue further.
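For reference, a minimal sketch of that workaround, assuming the same member variables as in the snippet above: the tbb::parallel_for wrapper is dropped and the same body runs sequentially. This sidesteps the crash but does not address the underlying race.

```cpp
// Sequential rewrite of the loop above (workaround, not a fix for the race):
// the body is unchanged, only the tbb::parallel_for wrapper is removed.
for (size_t j = 0; j < N; ++j) {
  const int p_idx = fpfh_indices_[j];
  std::vector<uint32_t> nn_indices;
  std::vector<double> nn_dists;
  nn_indices.reserve(corrs_fpfh_[p_idx].neighboring_indices.size());
  nn_dists.reserve(corrs_fpfh_[p_idx].neighboring_dists.size());
  const auto &indices = corrs_fpfh_[p_idx].neighboring_indices;
  const auto &dists = corrs_fpfh_[p_idx].neighboring_dists;
  for (size_t i = 0; i < indices.size(); ++i) {
    if (is_valid_[indices[i]]) {
      nn_indices.emplace_back(spfh_hist_lookup_[indices[i]]);
      nn_dists.emplace_back(dists[i]);
    }
  }
  points[j] = points_[p_idx];
  WeightPointSPFHSignature(hist_f1_, hist_f2_, hist_f3_, nn_indices, nn_dists, descriptors[j]);
}
```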
Thanks for your help! My desktop is also a 24-core, 48-thread AMD Ryzen™ Threadripper™ 7960X, but the error has not occurred. Let me test it in the future (please just wait for one week!).
@fdila Maybe you could give my PR a try if you like. Assuming you have a ROS2 installation, you could try some of what is below.
Hi, I'll try it tomorrow, thanks!
The execution with Valgrind is taking quite a while; in the meantime I'm attaching the output of running with AddressSanitizer, which to me seems to confirm it could be a concurrency issue. It seems like a really niche edge case: I was only able to reproduce it with resolution 0.3 and no added yaw and roll angles.
Can you show the exact command you used? |
Never mind, I just reproduced it (AddressSanitizer failure) on iterations 38, 159, and 162 with ...
ThreadSanitizer shows plenty of problems. Here is the first:
Haven't dug into this at all, but I have seen similar failures when a dependency is compiled statically with one version of Eigen, and the runtime code is using another version of Eigen. |
OK, I fixed up the build: it now uses its own Eigen and a newer TBB, and the build uses shared libraries. It now requires CMake 3.24+, which I hope is OK. I think the actual problem is related to growth and rehashing of the robin_map spfh_hist_lookup_. Reserving space before populating it seems to help.
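A minimal sketch of that mitigation, assuming spfh_hist_lookup_ is a tsl::robin_map keyed by point index (the key/value types and the population step here are assumptions for illustration, not the exact code in FasterPFH.cpp):

```cpp
#include <cstdint>
#include <vector>
#include <tsl/robin_map.h>

// Reserve capacity before filling the lookup map so no rehash happens
// during (or after) population; types and the loop are illustrative only.
void BuildLookup(const std::vector<uint32_t> &keys,
                 tsl::robin_map<uint32_t, uint32_t> &spfh_hist_lookup) {
  spfh_hist_lookup.clear();
  spfh_hist_lookup.reserve(keys.size());  // enough buckets for all keys
  uint32_t hist_idx = 0;
  for (const auto key : keys) {
    spfh_hist_lookup[key] = hist_idx++;
  }
}
```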
It has reached stress iteration 2000 and is still going.
Thanks for the update and the improvements! Upgrading to CMake 3.24+ should be fine. It's good to hear that reserving space before populating spfh_hist_lookup_ helps mitigate the issue. I'll take a closer look at the changes and test them further. Let me know if you notice any unexpected behavior as the iterations increase. Appreciate your effort! Your PR #33 addresses this issue, right? I'll take a look within two business days.
Yeah, PR #33 is the one. I was just giving it a thrash: building in release mode with the sanitizers off, and using the same point cloud for src and tgt.
It seems to run slower, and it was killed with signal 9 after 129 iterations.
Have you tried building without ROS2? If I build the Python bindings directly via ...
Looking into the ASan output above, it looks like there is a copy of ...
Hi, as you can see here (in https://github.com/MIT-SPARK/KISS-Matcher/blob/main/cpp/kiss_matcher/3rdparty/find_dependencies.cmake):
Now, the local ones are used. Originally, I wanted to use the tsl::robin_map from https://github.com/Tessil/robin-map; however, as far as I remember, it led to an installation issue (I forget the details, though).
It is using the robin_map included with the code. It appears to be the same as the latest release apart from code formatting, some backward compatibility for pre-C++17 compilers, and the following change:
The operator[] on spfh_hist_lookup_ does a try_emplace, which may cause a rehash that invalidates existing iterators and references. Changing it to .at(), as below, makes it throw immediately. I do not know the intention of spfh_hist_lookup_; should it be getting modified in the loop?
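A sketch of that change, applied to the line from the earlier snippet; .at() is the checked accessor of tsl::robin_map, so a missing key throws std::out_of_range instead of being inserted:

```cpp
// Before: operator[] try_emplaces a missing key, which can trigger a rehash
// of spfh_hist_lookup_ while other threads are reading it.
nn_indices.emplace_back(spfh_hist_lookup_[indices[i]]);

// After: .at() never inserts; a missing key throws std::out_of_range,
// which is what surfaces the problem immediately.
nn_indices.emplace_back(spfh_hist_lookup_.at(indices[i]));
```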
I can confirm my example fails if I use the input Vel16 clouds with a resolution of 0.3 m. Apologies for being late to the party here.
I have a patch that fixes it; it just needs a tidy-up, maybe tonight (UTC+11 here). Essentially, use find to see if indices[i] is in spfh_hist_lookup_ before using it; in other words, skip it if it is not there. I am also working on reserving the correct amount of space for spfh_hist_lookup_: essentially data_size / max_load_factor, rounded up to the next power of two.
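A self-contained sketch of those two ideas (the helper name and the exact sizing call are assumptions for illustration, not the patch itself): reserve enough buckets up front so the map never rehashes, and use find() so a missing key is skipped rather than inserted.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>
#include <tsl/robin_map.h>

// Illustrative helper (not from the patch): round n up to the next power of two.
static std::size_t NextPowerOfTwo(std::size_t n) {
  std::size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

int main() {
  tsl::robin_map<std::uint32_t, std::uint32_t> spfh_hist_lookup;

  // Size the table as data_size / max_load_factor, rounded up to the next
  // power of two, so later insertions never trigger a rehash.
  const std::size_t data_size = 10000;
  spfh_hist_lookup.rehash(NextPowerOfTwo(static_cast<std::size_t>(
      std::ceil(data_size / spfh_hist_lookup.max_load_factor()))));

  // ... populate spfh_hist_lookup with up to data_size entries ...

  // Read with find() and skip missing keys, instead of operator[] which
  // would insert (and possibly rehash) on a miss.
  std::vector<std::uint32_t> nn_indices;
  const std::uint32_t query = 42;
  if (const auto it = spfh_hist_lookup.find(query); it != spfh_hist_lookup.end()) {
    nn_indices.emplace_back(it->second);
  }
  return 0;
}
```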
Hi!
First of all: I don't know if this is a bug or if I'm not doing something properly.
I need to use KISS-Matcher to align multiple point clouds, like in a loop, but after a couple hundred alignments I get the error
This is a snippet to reproduce my error: I took the examples/run_kiss_matcher.py script and, instead of solving the registration one time, I do it in a loop a thousand times. After a non-deterministic amount of iterations I get the error copied above. Snippet:
As shown above, I'm clearing and resetting the matcher object, but maybe that is not enough to completely clear the internal state?
I tried to do the same thing in C++ and apparently there is no memory bug there, so I guess it has something to do with the pybinding. I tried searching the pybinding repo's issues and there are other people complaining about similar problems, but I haven't been able to pinpoint what exactly is making the Python script crash.