Include Offsets & Fringe Case Fix for outerSize > size && lda = {1, 1, ...} #33

Open
wants to merge 3 commits into
base: master

@njh80 njh80 commented Feb 26, 2025

Headline
FEAT: Adds offset inputs for tensors, giving more flexibility when handling sub-tensors. Calls to hptt::create_plan without offsets remain supported, so this is not a breaking change.

Performance
Passes all testFramework.cpp tests.
Benchmark Output: hptt_benchmark.txt

Detail:
Makefiles (Makefile, benchmark/Makefile, testframework/Makefile)
FIX: If libomp is not found via LD_LIBRARY_PATH (seen on macOS with an M2), the user can specify its path for the build.

benchmark/benchmark.cpp
FEAT: transpose_ref is for internal use only, so backwards compatibility is not preserved for it; the call here is amended to pass the new nullptr arguments.

benchmark/maxFromFiles.py
FIX: Added parentheses to the error print statement.

benchmark/reference.cpp
FEAT: Firstly, the function receives new outerSize (A/B) and offset (A/B) arrays, which are initialised to mimic the size array when nullptrs are supplied. Next, the stepping through B is amended so that the full outerSize is traversed whenever a row of size is exceeded. Offsets are also inserted into the traversal. The behaviour can be verified via DEBUG.

Pseudo-Code is:
for each dimension not the innermost loop of B:
    divide the current position by the size of the next innermost loop of B that we want to traverse
    move across the offset distance as many times as we have exceeded it plus the initial offset
    further move over any space that remains after the end of the block required by size as many times as we exceed it
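
The traversal in the pseudo-code can be read as mapping a flat position inside the size-shaped block to a position in the enclosing outer tensor. The following is a minimal sketch of that idea, not the code in reference.cpp: the function name `mapToOuter` is hypothetical, and column-major storage (dimension 0 innermost) is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of HPTT. Given a flat position inside a
// block of extents `size`, return the linear position of that element in
// the enclosing tensor of extents `outerSize`, where the block starts at
// `offset`. Assumes column-major storage (dimension 0 is innermost).
std::size_t mapToOuter(std::size_t flat,
                       const std::vector<std::size_t>& size,
                       const std::vector<std::size_t>& offset,
                       const std::vector<std::size_t>& outerSize)
{
    assert(size.size() == offset.size() && size.size() == outerSize.size());
    std::size_t pos = 0;
    std::size_t stride = 1;  // stride of the current dimension, in elements
    for (std::size_t d = 0; d < size.size(); ++d) {
        const std::size_t idx = flat % size[d];  // index along dimension d
        flat /= size[d];                         // carry into the next dimension
        pos += (offset[d] + idx) * stride;       // skip the offset, then step
        stride *= outerSize[d];                  // a full row spans the outer extent
    }
    return pos;
}
```

With zero offsets and `outerSize` equal to `size`, the function returns the flat position unchanged, which matches the behaviour expected when no sub-tensor is in play.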

benchmark/reference.h
FEAT: Amended template to reflect new inputs of transpose_ref(), namely offsetA, offsetB, outerSizeA and outerSizeB.

include/compute_node.h
FEAT: Added three new members to ComputeNode without exceeding the 64-byte cache-line size (programmers be warned: growing past this leads to unaligned memory in caches!).

First, the offset difference (A - B), which reduces the number of calculations needed to adjust for the offset during execution of hptt. The plan is created with start and end positions that include the offset of B, and the difference is added to obtain the start and end values for A.

FIX: Secondly, the booleans indexA and indexB are true when the leading dimension of A/B is 1 and the index is 0. The original code faltered when the innermost dimensions of A or B were 1, causing the transpose_int functions to identify incorrect innermost indices, which is especially problematic with non-zero outerSizes.
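
As a rough illustration of the size constraint and of how the offset difference is used, here is a sketch under stated assumptions. The struct name `ComputeNodeSketch`, the pre-existing fields, and the type chosen for `offDiffAB` are placeholders for illustration and may not match include/compute_node.h; only the names offDiffAB, indexA and indexB come from this description.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout, not the real ComputeNode. alignas(64) pads the
// struct to one 64-byte block; the static_assert fails to compile if a
// new member pushes it past that.
struct alignas(64) ComputeNodeSketch {
    std::size_t start;           // start position, including the offset of B
    std::size_t end;             // end position, including the offset of B
    std::size_t inc;
    std::size_t lda;
    std::size_t ldb;
    ComputeNodeSketch* next;
    std::ptrdiff_t offDiffAB;    // offset of A minus offset of B (may be negative)
    bool indexA;                 // true when lda == 1 and the index is 0
    bool indexB;                 // true when ldb == 1 and the index is 0
};
static_assert(sizeof(ComputeNodeSketch) == 64,
              "ComputeNodeSketch must fit in 64 bytes");

// Positions in the plan are stored relative to B; adding the precomputed
// difference gives the matching position in A without recomputing offsets.
inline std::size_t startInA(const ComputeNodeSketch& node)
{
    assert(static_cast<std::ptrdiff_t>(node.start) + node.offDiffAB >= 0);
    return static_cast<std::size_t>(
        static_cast<std::ptrdiff_t>(node.start) + node.offDiffAB);
}
```

Storing a single signed difference rather than two separate offsets keeps the per-node cost to one addition at execution time.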

include/hptt.h
FEAT: Added new template functions that accept offsets for the various floatType contexts.

include/transpose.h
FEAT: Amended skipIndices and verifyParameter to take offset inputs, as these functions are affected by their inclusion. Also added the offsets as properties of the transpose class.

include/utils.h
FEAT: Amended the template of accountForRowMajor, as it needs to reorder the offsets in the same way as the other parameters.

src/hptt.cpp
FEAT: Implemented new offset templates and amended original templates to point to plan() with nullptrs or offsets where appropriate.

src/transpose.cpp
FEAT: Amended the plan assignment section to include assignments for the new computeNode members.
FEAT: Included offsets in the fuseIndices, skipIndices and verifyParameters functions, where the amendments affect offsets too; verification checks that offset + size <= outerSize for all dimensions.
FEAT: The axpy functions require offset differences as well, so these are calculated and the integer/array is passed to the respective functions for proper calculation. The axpy functions themselves are amended accordingly.
FEAT: In the transpose_ functions, offDiffAB is always added to i to get the correct start/end. Wherever lda/ldb == 1 is checked, plan->indexA/B is also asserted to ensure the correct blocking is passed. As a result of the increased robustness, blockingA/B can always be passed confidently, and loops can be included for cases where the scalar path is reached and lda/ldb is not 1.
FEAT: Included a plethora of DEBUG statements (coding this was very fun).
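
The bounds condition mentioned for verification, offset + size <= outerSize in every dimension, can be sketched as follows. This is a minimal stand-alone check, not the verifyParameter code from the PR; the name `offsetsFit` and the use of `std::vector<int>` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check, not part of HPTT: returns true when the sub-tensor
// described by `size` and `offset` lies entirely inside `outerSize`.
bool offsetsFit(const std::vector<int>& size,
                const std::vector<int>& offset,
                const std::vector<int>& outerSize)
{
    if (size.size() != offset.size() || size.size() != outerSize.size())
        return false;  // mismatched dimensionality
    for (std::size_t d = 0; d < size.size(); ++d) {
        if (size[d] <= 0 || offset[d] < 0)
            return false;  // extents must be positive, offsets non-negative
        if (offset[d] + size[d] > outerSize[d])
            return false;  // block would run past the outer extent
    }
    return true;
}
```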

src/utils.cpp
FEAT: Implemented accountForRowMajor changes for offsets mirroring the behaviour for outerSizes.
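
A sketch of what mirroring the outerSize behaviour could look like for offsets is given below. It is an assumption-laden illustration: `offsetsForColumnMajor` is a hypothetical name, the real accountForRowMajor takes more parameters, and treating a nullptr as all-zero offsets is an assumption rather than something stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the offset handling in accountForRowMajor.
// For row-major input the dimension order is reversed so that the rest of
// the library can keep working in column-major order.
std::vector<int> offsetsForColumnMajor(const int* offset, int dim, bool useRowMajor)
{
    assert(dim >= 0);
    std::vector<int> out(static_cast<std::size_t>(dim), 0);
    if (offset == nullptr)
        return out;  // assumption: no offsets supplied means all zeros
    for (int i = 0; i < dim; ++i) {
        const int src = useRowMajor ? dim - 1 - i : i;  // reverse for row-major
        out[static_cast<std::size_t>(i)] = offset[src];
    }
    return out;
}
```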

testframework/testframework.cpp
FEAT: Improved testing to include triggerable outerSize != size and offsets with strings printed for DEBUG cases.
FEAT: Error messages modified for clarity.

njh80 added 2 commits March 28, 2025 15:35
Sub-tensors often omit their innermost dimension, meaning that they access their source data with an inner stride other than one. This commit adds basic support for this, in a similar way to the support for offsets. Inner strides are optional arguments and are supplied as integers.

*benchmark/benchmark.cpp*
Amends the call to `transpose_ref` to pass nullptrs for the inner strides.

*benchmark/reference.cpp*
`transpose_ref` can now receive non-unit inner strides, which is used for evaluating tests.

*benchmark/reference.h*
Amends template function.

*include/hptt.h*
Creates overloads including innerStrides (size_t) for create_plan calls.

*include/transpose.h*
Amends functions to receive innerStrides as inputs.

*src/hptt.cpp*
Includes the new overloads and amends the existing ones to pass nullptr where inner strides are not supplied.

*src/transpose.cpp*
Amends behaviour of execution to include innerStrides. As `transpose_int` functions are not part of the `Transpose` class, the innerStrides must be passed as new arguments, unchanged throughout, to the `macro_kernel_scalar` and `micro_kernel`. These then use the new strides.

Support for Arch ARM and Arch AVX has been attempted, but its execution is unchecked as the author does not have access to these platforms. Further, no support has been included for the B buffer case in the Macro-Kernel, which in theory could be added.

Further, and this applies to the offset version as well, no changes have been made to the plan generation stages, but their effectiveness will likely be altered by these commits.
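
To make the effect of inner strides on a scalar path concrete, here is a minimal 2D sketch. It is not the `macro_kernel_scalar` or `micro_kernel` code from this commit: the function name `transpose2dStrided` and its parameter list are hypothetical, and column-major storage is assumed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference, not part of HPTT. Writes B = A^T where
// element (i, j) of A lives at A[i * innerStrideA + j * lda] and element
// (j, i) of B lives at B[j * innerStrideB + i * ldb].
void transpose2dStrided(const float* A, std::size_t rows, std::size_t cols,
                        std::size_t innerStrideA, std::size_t lda,
                        float* B, std::size_t innerStrideB, std::size_t ldb)
{
    assert(innerStrideA >= 1 && innerStrideB >= 1);
    for (std::size_t j = 0; j < cols; ++j) {
        for (std::size_t i = 0; i < rows; ++i) {
            // With both inner strides equal to 1 this reduces to the
            // usual leading-dimension indexing.
            B[j * innerStrideB + i * ldb] = A[i * innerStrideA + j * lda];
        }
    }
}
```

For example, a 2 x 3 view over a 12-element buffer with `innerStrideA = 2` and `lda = 4` reads every second element of each column, which is the access pattern a sub-tensor with an omitted innermost dimension would produce.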

*testframework/testframework.cpp*
Tests have been added for innerStrides of 1 or 2 to test behaviour (in theory larger strides are fine but exceed the memory capabilities of my device).
…en one and the number of dimensions (a small random number) and number of dimensions any value between 1 and MAX_DIM again.