fixAuthors and merging #128

Open
ggrittz opened this issue Feb 12, 2025 · 0 comments

Comments

ggrittz commented Feb 12, 2025

For large vectors of scientific names (500k+), plantR::fixAuthors fails at its merge step: orig.name is repeated in both objects being merged, so the merged result becomes too large to allocate.

The issue occurs here:

res0 <- data.frame(orig.name = taxa, tax.name = NA, tax.author = NA,
                   ids = 1:length(taxa))
res <- res[, -which(names(res) %in% "fix.author")]
res1 <- merge(res0, res, by = "orig.name", all = TRUE, suffixes = c(".x", ""))

**Error in merge.data.frame(res0, res, by = "orig.name", all = TRUE, suffixes = c(".x",  : 
  negative length vectors are not allowed**

Changing merge to dplyr::left_join gives us a bit more information about the problem:

res0 <- data.frame(orig.name = taxa, tax.name = NA, tax.author = NA,
                   ids = 1:length(taxa))
res <- res[, -which(names(res) %in% "fix.author")]
res1 <- dplyr::left_join(res0, res, by = "orig.name", suffix = c(".x", ""))

**Error in `dplyr::left_join()`:
! This join would result in more rows than dplyr can handle.
ℹ 9758327359 rows would be returned. 2147483647 rows is the maximum number allowed.
ℹ Double check your join keys. This error commonly occurs due to a missing join key, or an improperly specified join condition.**
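
For context, here is a minimal sketch of why the row count explodes. It uses toy data with made-up names, not anything taken from plantR. When a key is repeated on both sides, the join returns every pairing, so a name that occurs n times in res0 and m times in res contributes n * m rows. The 9758327359 rows reported above exceed 2147483647 (R's .Machine$integer.max), which is probably also what is behind the "negative length vectors" error from merge, though I have not confirmed that.

```r
# Toy data (hypothetical names); only the duplication pattern matters.
taxa <- c("Aa bb L.", "Aa bb L.", "Aa bb L.", "Cc dd Mart.")

res0 <- data.frame(orig.name = taxa, ids = seq_along(taxa),
                   stringsAsFactors = FALSE)

# Lookup table in which one key appears twice.
res <- data.frame(
  orig.name  = c("Aa bb L.", "Aa bb L.", "Cc dd Mart."),
  tax.author = c("L.", "L.", "Mart."),
  stringsAsFactors = FALSE
)

joined <- merge(res0, res, by = "orig.name", all = TRUE)

# "Aa bb L." gives 3 x 2 rows, "Cc dd Mart." gives 1 x 1 row.
nrow(joined)

# Size of the matched part of the join, computed without performing it.
# as.numeric() avoids integer overflow on large inputs.
n_left  <- table(res0$orig.name)
n_right <- table(res$orig.name)
shared  <- intersect(names(n_left), names(n_right))
expected_rows <- sum(as.numeric(n_left[shared]) * as.numeric(n_right[shared]))
expected_rows
```

Running the expected_rows calculation on the real res0 and res before merging should show whether duplicated keys account for the reported row count.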

However, there is a fix: remove duplicated orig.name rows from res right before merging:

##### ADD THIS STEP #####
res <- res[!duplicated(res$orig.name), ]

res0 <- data.frame(orig.name = taxa, tax.name = NA, tax.author = NA,
                   ids = 1:length(taxa))
res <- res[, -which(names(res) %in% "fix.author")]

##### WHICH ALSO ALLOWS LEFT_JOIN (MUCH FASTER) TO BE USED #####
res1 <- dplyr::left_join(res0, res, by = "orig.name", suffix = c(".x", ""))

##### Maybe the steps of removing duplicates and ordering can be removed too? I didn't test this, though #####
res1 <- res1[!duplicated(res1$ids), ]
res1 <- res1[order(res1$ids), ]
res1$tax.name[!rep_ids0] <- taxa[!rep_ids0]
res2 <- res1[, c("orig.name", "tax.name", "tax.author")]
return(res2)
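
A small check of the idea, again on toy data with made-up names, so treat it as a sketch rather than a test of fixAuthors itself. Once res has one row per orig.name, the join returns exactly one row per element of taxa. With base merge the rows come back sorted by key, so ordering by ids is still needed there. I use all.x = TRUE here as the base R stand-in for a left join.

```r
# Toy input with repeated names and one name missing from the lookup.
taxa <- c("Cc dd Mart.", "Aa bb L.", "Cc dd Mart.", "Ee ff")

res0 <- data.frame(orig.name = taxa, ids = seq_along(taxa),
                   stringsAsFactors = FALSE)

res <- data.frame(
  orig.name  = c("Aa bb L.", "Aa bb L.", "Cc dd Mart."),
  tax.author = c("L.", "L.", "Mart."),
  stringsAsFactors = FALSE
)

# The proposed step: keep one row per key before joining.
res <- res[!duplicated(res$orig.name), ]

res1 <- merge(res0, res, by = "orig.name", all.x = TRUE)

# One output row per input name, so no row explosion.
same_size <- nrow(res1) == length(taxa)

# merge() sorts by the key, so restore the input order using ids.
res1 <- res1[order(res1$ids), ]
same_order <- identical(res1$orig.name, taxa)

same_size
same_order
```

dplyr documents that left_join keeps the row order of its first argument, so with unique keys in res the later !duplicated(res1$ids) and order(res1$ids) steps are likely redundant in the left_join version, but that still needs to be verified on real data.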
This was referenced Feb 21, 2025