Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Install subworkflows with modules from different remotes #3083

Open
wants to merge 90 commits into
base: dev
Choose a base branch
from

Conversation

jvfe
Copy link
Member

@jvfe jvfe commented Jul 29, 2024

PR checklist

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md is updated
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Addresses #1927 and #2497

Issue information

Currently, get_components_to_install is the function used to download a subworkflow's modules. It relies on simple text parsing to define where to get a subworkflow/module from and install it.

As outlined in the above issues (and in slack conversations), it is currently not possible to install subworkflows which depend on modules that originate from different remotes. The solution is to copy over required modules to a same remote in order to use them.

Implementation

At this moment the implementation is mostly a work-around/hack. I aimed to make the least possible modifications to the source code in order to make the intended behavior functional. Here's how it works:

  • It still uses the same text parsing for all modules and subworkflows that don't have a meta.yml file defined.

  • If a subworkflow defines a meta.yml, they can optionally define the git_remote and org_path keys for a component. This is based on a suggestion given on the original issue. E.g.:

components:
  - name: gatk4/mutect2
  - name: gatk4/genomicsdbimport
    org_path: custom_remote
    git_remote: https://github.com/custom_remote/modules.git

This info is used to grab the respective modules from the specified remote.
I've setup a dummy repository with a subworkflow that showcases this. If no remote is specified, it assumes the module is from the same remote as the subworkflow (as it is currently).

Trying to run nf-core subworkflows --git-remote https://github.com/jvfe/test-subworkflow-remote install get_genome_annotation with this branch of nf-core/tools should work. There's a basic test I've setup to test for this.

Known issues

All in all, I'm eager to hear what everyone thinks. I'm tagging @muffato who has been supervising and helping me in this implementation.

@jvfe jvfe changed the base branch from master to dev July 29, 2024 11:33
jvfe added 3 commits August 5, 2024 10:13
fix: Avoid duplicate downloads in nested subworkflows
* dev: (299 commits)
  Update nf_core/pipelines/create/utils.py
  [automated] Update CHANGELOG.md
  allow numbers in custom pipeline name
  [automated] Update CHANGELOG.md
  Update pre-commit hook pre-commit/mirrors-mypy to v1.11.1
  Fixed linting
  Updated changelog
  Added process_high_memory to create/lint list
  update chagelog
  use pathlib instead of os.path
  Apply suggestions from code review
  add option to skip code linters
  Update gitpod/workspace-base Docker digest to f189a41
  Update pre-commit hook pre-commit/mirrors-mypy to v1.11.0
  [automated] Update CHANGELOG.md
  Update python:3.12-slim Docker digest to 740d94a
  Update .pre-commit-config.yaml
  Update nf_core/__main__.py
  Add default version
  Automatically add a default version in pipelines_create
  ...
@nf-core nf-core deleted a comment from github-actions bot Aug 5, 2024
@awgymer
Copy link
Contributor

awgymer commented Aug 7, 2024

I'm very wary of increasing the idiosyncrasies in the code and I feel that making it such that the code returns variably either strings or dicts and thus necessitates isinstance checks is liable to be difficult to follow/trace through the code in future.

I understand the desire to simply get something working but I think that if we are going to implement this and support it we need to do it right.

I think this also indicates support only for cross-organisational modules and not cross-organisational subworkflows?

@muffato
Copy link
Member

muffato commented Aug 8, 2024

I think this also indicates support only for cross-organisational modules and not cross-organisational subworkflows?

Currently, yes

I'm very wary of increasing the idiosyncrasies in the code and I feel that making it such that the code returns variably either strings or dicts and thus necessitates isinstance checks is liable to be difficult to follow/trace through the code in future.

Once get_components_to_install is updated to return dictionaries for sub-workflows too, all the isinstance(..., dict) will evaluate to true, and all their if statements will be simplified / collapse.

I understand the desire to simply get something working but I think that if we are going to implement this and support it we need to do it right.

It's a question of time management (@jvfe is working part time on this project). We're not sure about about the handling of the default org vs the module org in nf_core/components/install.py (cf the creation of ModulesRepo instances). We didn't want to invest deeper without clearing that first. Making the change in get_components_to_install is easy, but finding the other files that need an update and making sure that all the tests still pass takes a bit longer.

Copy link
Member

@mirpedrol mirpedrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @jvfe!
I think this is a good start 😄 I agree with @awgymer that allowing component to be both a string or a dict is confusing and it will cause problems. But I like the idea of using the information in the meta.yml to obtain the remote form which the component should be installed.
My suggestion would be to convert component to a dictionary in all cases since the beginning, with the default being the current remote.

jvfe added 3 commits August 10, 2024 14:35
* dev: (35 commits)
  don't remove creating a pipeline to test it 😶
  update textual snapshots
  update textual snapshots
  fix indentation
  Update nf_core/pipelines/create/create.py
  [automated] Update CHANGELOG.md
  [automated] Update CHANGELOG.md
  add option to exclude changelog from custom pipeline template
  add option to exclude codespaces from pipeline template
  remove multiqc images from the template docs
  and the same for subworkflows
  run tests also after pytest-migrate
  update textual snapshots
  add multiqc potion to test data
  update textual snapshots
  fix tests
  [automated] Update CHANGELOG.md
  add option to exclude multiqc from pipeline template
  rename regenerae-snapshots to update-textual-snapshots
  Apply suggestions from code review
  ...
@mirpedrol mirpedrol mentioned this pull request Feb 19, 2025
@mirpedrol
Copy link
Member

I had to open a new PR after updating the branch because I wasn't able to push to this branch, you can see it here #3456
There is still one last thing missing. When removing a mixed subworkflow, only the components installed from that same repo are removed, the components from cross-repos are not removed.

@mirpedrol
Copy link
Member

There is still one last thing missing. When removing a mixed subworkflow, only the components installed from that same repo are removed, the components from cross-repos are not removed.

should be fixed with 04b26bb

@jvfe
Copy link
Member Author

jvfe commented Feb 20, 2025

I had to open a new PR after updating the branch because I wasn't able to push to this branch, you can see it here #3456 There is still one last thing missing. When removing a mixed subworkflow, only the components installed from that same repo are removed, the components from cross-repos are not removed.

Hi! Just saw your comment, sorry - though you fixed it way quicker than I would have 😄 . Also, just merged your changes to this branch. We can also give you access to this fork if necessary (@muffato I think you'd have to do that, since I don't have the permission).

* dev:
  only add entry to files
  remove remaining print
  [automated] Update CHANGELOG.md
  fix edam BAM link on modules template
  add ontologies to meta.yml if not present
  [automated] Update CHANGELOG.md
  [automated] Update CHANGELOG.md
  docs: fix contributing link in the main README
  fix: linting with comments after the input directive
@muffato
Copy link
Member

muffato commented Feb 20, 2025

I've given you, @mirpedrol , write access to repo. I'd prefer we continue the review here rather than #3456 so that all the review history is in one place

@mashehu mashehu modified the milestones: 3.4.0, 3.3.0 Feb 24, 2025
Copy link
Contributor

@mashehu mashehu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Does it also work with nf-core subworkflows patch?

Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>
@mirpedrol
Copy link
Member

Does it also work with nf-core subworkflows patch?

I have tested that it works 👍 patch is independent of the included components so it should not be affected by these changes

@awgymer
Copy link
Contributor

awgymer commented Mar 7, 2025

As I can see this PR is looking good and probably heading towards a merge I just want to gently raise my previous point

I'm unclear on what level of commitment to maintain and work around issues this introduces there is from an nf-core standpoint.

This PR adds functionality to make cross-repo subworkflows work but we are still lacking a serious CI testing solution and I'm worried about other footguns the functionality potentially opens up. I'm conscious that greater private repo support is something many (including myself!) want but I also want to ensure we achieve it with stability and long-term viability.

@muffato
Copy link
Member

muffato commented Mar 7, 2025

@awgymer What CI testing is missing ? This includes does include tests for the new cross-repo functionality. That's in fact how it was developed: @jvfe first wrote (failing) test cases and then implemented the code so that they pass.

@awgymer
Copy link
Contributor

awgymer commented Mar 7, 2025

@awgymer What CI testing is missing ?

It's a fundamental limitation of this style of subworkflows that you cannot test them easily in their remotes on CI right now and this tools functionality does nothing to address that. Once again my issue is not so much with the actual code of this PR but rather the wider implications this PR has.

@muffato
Copy link
Member

muffato commented Mar 7, 2025

Ah right. Well, it sounds like it is a good idea for another internship/mentorship ;) a nf-core command to install missing cross-org modules so that tests can then run.

@mashehu
Copy link
Contributor

mashehu commented Mar 10, 2025

how about a CI test, where we create a test pipeline and build a mixed subworkflow there?

@muffato
Copy link
Member

muffato commented Mar 10, 2025

It's there already: https://github.com/nf-core/tools/pull/3083/files#diff-60cea8a0cf3a284ef26219e6dc9b9b2b55ebb25214cb520b0ae38ef6179436d7R87

It uses modules/subworkflows across nf-core/modules and https://github.com/nf-core-test/modules

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
7 participants