
operation: archive: Directory structure not preserved while creating a zip/tar archive of a directory #1198

Closed
programmer290399 opened this issue Aug 20, 2021 · 1 comment · Fixed by #1199
Labels
bug Something isn't working

Comments

@programmer290399
Contributor

programmer290399 commented Aug 20, 2021

Describe the bug

While creating an archive the directory structure is not preserved.
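
A quick way to confirm the symptom, independent of dffml, is to list the member names recorded in the archive: when the structure is preserved, nested files appear with their relative paths (e.g. child_dir_1/file1) rather than bare file names. A minimal stdlib sketch (the helper name here is ours, not part of dffml):

import tarfile
import zipfile
from typing import List


def archive_member_names(archive_path: str) -> List[str]:
    """Return the member names stored in a zip or tar archive."""
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            return archive.namelist()
    with tarfile.open(archive_path) as archive:
        return archive.getnames()


# A structure-preserving archive of top_level_dir should report names like
# "child_dir_1/file1"; a flattened archive reports just "file1".
print(archive_member_names("test.zip"))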

Steps To Reproduce

Try archiving any directory, or use the following example.

Example by @pdxjohnny

import os
import sys
import pathlib
import tempfile
import itertools
from typing import Union, Callable

from dffml.operation.archive import (
    make_zip_archive,
    extract_zip_archive,
    make_tar_archive,
    extract_tar_archive,
)
from dffml.operation.compression import (
    gz_compress,
    gz_decompress,
    bz2_compress,
    bz2_decompress,
    xz_compress,
    xz_decompress,
)

from dffml.df.types import DataFlow, Input, Operation
from dffml.df.base import OperationImplementation, op
from dffml.noasync import run

# Must include output operation for results (from run()) to contain data!
from dffml.operation.output import GetSingle


def make_dataflow(
    *args: Union[Operation, OperationImplementation, Callable], **kwargs
) -> DataFlow:
    return DataFlow(
        # Add an output operation. Results will be empty without one.
        GetSingle,
        # Add the operations to the dataflow.
        *args,
        seed=[
            # Ensure the output operation returns a single value for each
            # operation executed that has an output. If operations will be run
            # multiple times and produce multiple outputs which you want to
            # capture, you'll want to add GetMulti rather than GetSingle.
            Input(
                value=list(
                    itertools.chain(
                        *[
                            [
                                definition.name
                                for definition in operation.op.outputs.values()
                            ]
                            for operation in args
                        ]
                    )
                ),
                definition=GetSingle.op.inputs["spec"],
            )
        ]
        + [
            # Add each input from kwargs.
            Input(
                value=value,
                # Look up the definition which should be used for this value. The
                # name of the keyword argument will be used to look up the
                # definition. This is just a quick and dirty approach to this,
                # obviously if there are two inputs with the same operation
                # local name, only one of their definitions will be chosen.
                definition=dict(
                    itertools.chain(
                        *[
                            [
                                (input_name, definition)
                                for input_name, definition in operation.op.inputs.items()
                            ]
                            for operation in args
                        ]
                    )
                )[key],
            )
            for key, value in kwargs.items()
        ],
    )


@op
def debug_print_tempdir(input_directory_path: str):
    """
    Debug function to print the contents of tempdir (requires the external `tree` command)
    """
    os.system(f"tree {input_directory_path}")


def main():
    # Create a temporary directory which we'll make an archive of
    with tempfile.TemporaryDirectory() as tempdir:
        # Create a file in the directory to make an archive of
        pathlib.Path(tempdir, "hello").write_text("world")
        # Make a dataflow to run each operation (just for example purposes, we
        # will connect operations in another example).
        for dataflow in (
            [
                # Make the tar archive
                make_dataflow(
                    make_tar_archive,
                    input_directory_path=tempdir,
                    output_file_path=str(pathlib.Path(tempdir, "test.tar")),
                )
            ]
            + list(
                itertools.chain(
                    *[
                        [
                            # Compress the tar archive
                            make_dataflow(
                                # Grab operation from the globals of this file by name
                                getattr(
                                    sys.modules[__name__],
                                    f"{compression_algorithm}_compress",
                                ),
                                input_file_path=str(
                                    pathlib.Path(tempdir, "test.tar")
                                ),
                                output_file_path=str(
                                    pathlib.Path(
                                        tempdir,
                                        f"test.tar.{compression_algorithm}",
                                    )
                                ),
                            ),
                            # Decompress the tar archive
                            make_dataflow(
                                # Grab operation from the globals of this file by name
                                getattr(
                                    sys.modules[__name__],
                                    f"{compression_algorithm}_decompress",
                                ),
                                input_file_path=str(
                                    pathlib.Path(
                                        tempdir,
                                        f"test.tar.{compression_algorithm}",
                                    )
                                ),
                                output_file_path=str(
                                    pathlib.Path(
                                        tempdir,
                                        f"test.decompressed_{compression_algorithm}.tar",
                                    )
                                ),
                            ),
                            # Extract the tar archive. Ensure that compression and
                            # decompression didn't mess up the archive
                            make_dataflow(
                                extract_tar_archive,
                                input_file_path=str(
                                    pathlib.Path(
                                        tempdir,
                                        f"test.decompressed_{compression_algorithm}.tar",
                                    )
                                ),
                                output_directory_path=str(
                                    pathlib.Path(
                                        tempdir,
                                        f"tar_inflated_{compression_algorithm}",
                                    )
                                ),
                            ),
                        ]
                        for compression_algorithm in ["gz", "bz2", "xz"]
                    ]
                )
            )
            + [
                # Print the contents of the tempdir for debugging
                make_dataflow(
                    debug_print_tempdir, input_directory_path=tempdir,
                ),
                # Make the zip archive (which will contain the tar archive)
                make_dataflow(
                    make_zip_archive,
                    input_directory_path=tempdir,
                    output_file_path=str(pathlib.Path(tempdir, "test.zip")),
                ),
                # Extract the zip archive
                make_dataflow(
                    extract_zip_archive,
                    input_file_path=str(pathlib.Path(tempdir, "test.zip")),
                    output_directory_path=str(
                        pathlib.Path(tempdir, "zip_inflated")
                    ),
                ),
            ]
        ):
            # Run the dataflow. Keep in mind the dataflows in the above list
            # will be run in order. Each dataflow is just one processing
            # operation and one output operation (GetSingle).
            print(list(dataflow.operations.keys())[1:])
            print(dataflow.seed[1:])
            for ctx, results in run(dataflow):
                print("results", results, "\n")


if __name__ == "__main__":
    main()

Expected behavior

For example, if we try to archive a directory with the following structure:

top_level_dir
├── child_dir_1 
│   └── file1
├── child_dir_2
│   ├── file2
│   └── file3
└── child_dir_3
    └── child_child_dir1
        └── file4    

We expect the archive to contain the same structure, but currently we get the following flattened structure:

top_level_dir
├── child_dir_1 
├── child_dir_2
├── child_dir_3
├── child_child_dir1
├── file1
├── file2
├── file3
└── file4    
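
This flattening is what you get when each file is written to the archive under its base name only; preserving the tree means storing every member under its path relative to the directory being archived. A minimal stdlib sketch of the structure-preserving approach (an illustration of the idea, not dffml's actual implementation):

import pathlib
import tarfile
import zipfile


def make_zip_preserving_structure(input_directory_path: str, output_file_path: str):
    """Zip a directory, storing each member relative to the directory root."""
    input_dir = pathlib.Path(input_directory_path)
    with zipfile.ZipFile(output_file_path, "w") as archive:
        for path in sorted(input_dir.rglob("*")):
            # arcname=path.name here would reproduce the flattened layout
            # above; the relative path keeps child_dir_1/file1 etc. intact.
            archive.write(path, arcname=str(path.relative_to(input_dir)))


def make_tar_preserving_structure(input_directory_path: str, output_file_path: str):
    """Tar a directory; TarFile.add recurses and keeps relative paths."""
    with tarfile.open(output_file_path, "w") as archive:
        archive.add(input_directory_path, arcname=".")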

Related:

@programmer290399 programmer290399 added the bug Something isn't working label Aug 20, 2021
@programmer290399
Contributor Author

Working on a fix, will make a PR ASAP.

johnandersen777 pushed a commit that referenced this issue Aug 22, 2021
Fixes: #1198
Signed-off-by: John Andersen <johnandersenpdx@gmail.com>
johnandersen777 pushed a commit to johnandersen777/dffml that referenced this issue Mar 11, 2022
Fixes: intel#1198
Signed-off-by: John Andersen <johnandersenpdx@gmail.com>
johnandersen777 pushed a commit that referenced this issue Mar 12, 2022
Fixes: #1198
Signed-off-by: John Andersen <johnandersenpdx@gmail.com>