
Text Augmentation additions #9

Merged
merged 6 commits into from
Nov 9, 2024
Conversation

@efhosci (Contributor) commented Oct 21, 2024

Add two new modules for modifying text in captions. This PR covers the modules in mgds with the actual scripts; a separate pull request will handle the integration in OneTrainer:

CapitalizeTags: Randomly capitalizes tags in the caption with a set probability. A comma-separated list of capitalization modes can be specified: 'capslock' (ALL CAPS), 'title' (First Letter Of Every Word), 'first' (First word only), or 'random' (rAnDOm lETteRS). The module iterates through each tag in the caption and applies a random mode from the list to each. It also has an option to set the entire caption to lowercase first, and it uses the same delimiter as defined for the existing "shuffle tags" option.
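As a rough sketch of the behavior described above (illustrative only, not the actual module code; the function signature and parameter names are assumptions based on this description):

```python
import random

def capitalize_tags(caption: str, probability: float, modes: list[str],
                    delimiter: str = ",", lowercase_first: bool = False) -> str:
    """Randomly capitalize tags in a delimiter-separated caption."""
    tags = [t.strip() for t in caption.split(delimiter)]
    out = []
    for tag in tags:
        if lowercase_first:
            tag = tag.lower()
        if random.random() < probability:
            mode = random.choice(modes)  # pick one mode per tag
            if mode == "capslock":
                tag = tag.upper()
            elif mode == "title":
                tag = tag.title()
            elif mode == "first":
                tag = tag.capitalize()
            elif mode == "random":
                tag = "".join(c.upper() if random.random() < 0.5 else c.lower()
                              for c in tag)
        out.append(tag)
    return delimiter.join(out)
```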

DropTags: Randomly removes tags from the caption with a specified probability. It currently reuses the existing "keep tags" and "delimiter" options, and has three different modes of dropping tags:

  • FULL - all-or-nothing, will drop everything after the 'keep tags' amount
  • RANDOM - will iterate through the caption and randomly drop individual tags
  • RANDOM WEIGHTED - same as random, but the probability to drop tags is reduced at the start of the caption and only reaches the full value at the end, making it more likely to preserve tags at the start
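The three modes above can be sketched roughly as follows (a minimal illustration under assumptions; the function name and the linear weighting ramp are guesses from the description, not the PR's actual code):

```python
import random

def drop_tags(caption: str, probability: float, mode: str,
              keep_tags: int = 1, delimiter: str = ",") -> str:
    """Randomly drop tags past the first `keep_tags` entries."""
    tags = [t.strip() for t in caption.split(delimiter)]
    kept = tags[:keep_tags]   # tags inside the "keep tags" amount are never dropped
    rest = tags[keep_tags:]

    if mode == "FULL":
        # All-or-nothing: a single roll decides whether everything
        # past the keep_tags amount is dropped.
        if random.random() < probability:
            rest = []
    elif mode == "RANDOM":
        # Independent roll per tag.
        rest = [t for t in rest if random.random() >= probability]
    elif mode == "RANDOM WEIGHTED":
        # Drop chance ramps up linearly, reaching the full probability
        # only at the end of the caption, so early tags survive more often.
        n = len(rest)
        rest = [t for i, t in enumerate(rest)
                if random.random() >= probability * (i + 1) / n]
    return delimiter.join(kept + rest)
```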

Also supports a "special tags" list, which can be either a delimiter-separated string in the input field or a file path to a txt or csv file with entries separated by new lines. It can act as a "whitelist" (the special tags will never be dropped, but all others past the "keep tags" amount can be) or a "blacklist" (the special tags may be dropped, but all others will always be kept). There's also an option to interpret this list with regex matching, so that syntax like ".*s" will match any tags ending with the letter s. It includes an exception for the "\(" and "\)" syntax used in many booru/e6 tags, but there may be some other regex special characters "[]\.^$*+?{}|()" which could be interpreted in unintended ways.
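The regex matching could look something like this (a sketch based on the description; the function name and fallback behavior are assumptions, not the PR's code). Note that booru-style escaped parentheses such as "\(" are already valid regex for a literal "(", which is why patterns containing them match as written:

```python
import re

def matches_special(tag: str, special_tags: list[str], use_regex: bool) -> bool:
    """Check whether a tag matches any entry in the special tags list."""
    if not use_regex:
        return tag in special_tags
    for pattern in special_tags:
        try:
            # fullmatch so ".*s" matches whole tags ending in "s",
            # not tags that merely contain such a substring.
            if re.fullmatch(pattern, tag):
                return True
        except re.error:
            # Fall back to a literal comparison for an invalid pattern.
            if tag == pattern:
                return True
    return False
```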

@efhosci efhosci marked this pull request as ready for review October 21, 2024 01:50
@efhosci (Contributor, Author) commented Oct 21, 2024

Related issue: Nerogar/OneTrainer#289


# convert special_tags to list depending on whether it's a newline-separated csv/txt file or a delimiter-separated string
if (special_tags.endswith(".txt") and os.path.isfile(special_tags)):
    with open(special_tags) as special_tags_file:
@Nerogar (Owner) commented:
Might be a good idea to add some kind of in-memory cache to this. Reading and parsing a txt/csv file on every training step will add a bit too much overhead.

@efhosci (Contributor, Author) replied:
Thanks for the suggestion; I'll have to test this with some larger datasets and see how much of a slowdown it causes. I mostly included that part to make it easier to reuse large and complex white/blacklists between multiple concepts, but if you want to exclude it for now and use only the text field, I can look into it further and submit it in the future.

@efhosci (Contributor, Author) commented:

Did some research and ran some time comparisons between loading special tags from a file vs. giving it a string directly, looping the main block of the script through a list of ~1200 captions. Loading special tags from a file took around 0.12 s to complete, compared to 0.02 s for a string. Changing the file-loading section into a function with @functools.lru_cache closed the difference between them; however, I'm not sure whether the cache will persist between calls to the real function, and I don't know how to do that without writing to disk. I can add it to the PR anyway, since it didn't seem to hurt anything at least.

Enabling regex also made it slower, and there may not be a way to improve that; caching the regex patterns didn't help. In any case it seems like a difference of tenths of a millisecond for a single iteration, so I don't think it's going to affect training speed. I also tried running some short trainings comparing string vs. txt file, as well as regex on/off, and the completion time was not noticeably different. My system is not the strongest, so it may be slightly more noticeable with faster GPUs.

@Nerogar (Owner) replied:

I'm not even sure an LRU cache is needed. I would have just added a dict that maps a filename to the parsed content. Unless someone adds millions of different files, that won't create a memory issue, and it also ensures that every file only causes a single cache miss.

@efhosci (Contributor, Author) replied:

Alright, maybe I misunderstood what you were asking for. There would only ever be one "special tag" file per concept right now, so memory issues would be unlikely. I really don't have much experience with Python beyond simple single-purpose scripts so I don't know what best practices would be for caching a handful of text files between multiple calls to the same script.

I think the main thing that makes this "slow" is the fact that it's only doing one caption->one variation each time it runs, and loading/interpreting the special list from scratch each time it loads a new caption or generates a new variation. That's the way the "shuffle" module works and it's what I used as a reference for the structure of both of these. It wouldn't be difficult to change it to return a list of n variations of a single caption, but it seems like that would require significant changes to the way all modules are handled in the rest of the project.

Change section which loads txt/csv files with special tags into a separate function with lru_cache, may help reduce time taken to read files
@efhosci efhosci marked this pull request as draft October 23, 2024 13:43
Reorganized several sections of the DropTags code into functions for better readability and possibly performance, a few other changes to reduce the number of unnecessary loops in some sections
@efhosci efhosci marked this pull request as ready for review October 31, 2024 03:17
Fix some missing "self" arguments and remove lru cache from regex function
@Nerogar (Owner) left a review:
I finally had some time to really read through the code. Apart from the other comments, there are a few general things that should be changed:

  1. Some of the lines are way too long; they should be limited to 120 characters.
  2. If statements don't need parentheses.
  3. Type hints are missing for some functions.

Review comments (since resolved) on: .vscode/launch.json, src/mgds/pipelineModules/CapitalizeTags.py, and src/mgds/pipelineModules/DropTags.py (5 threads).
efhosci and others added 2 commits November 4, 2024 19:08
Fixed some long lines, incorrect type hints, and unnecessary parentheses; removed an extra file
@Nerogar Nerogar merged commit f9edb99 into Nerogar:master Nov 9, 2024
@efhosci (Contributor, Author) commented Nov 10, 2024

Thanks for your help with this; I'll work on adding information about the new features to the wiki.
