
Text Augmentation additions #9

Merged
merged 6 commits into from
Nov 9, 2024
Conversation

@efhosci (Contributor) commented Oct 21, 2024

Add two new modules for modifying text in captions. This PR covers the modules in mgds with the actual scripts; a separate pull request will handle the integration in OneTrainer:

CapitalizeTags: Randomly capitalizes tags in the caption with a set probability. A comma-separated list of capitalization modes can be specified: 'capslock' (ALL CAPS), 'title' (First Letter Of Every Word), 'first' (First word only), or 'random' (rAnDOm lETteRS). The module iterates through each tag in the caption and applies a random mode from the list to each. It also has an option to set the entire caption to lowercase first, and it uses the same delimiter as defined for the existing "shuffle tags" option.
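As a rough sketch of the behavior described above (illustrative only, not the actual module code; the function signature and parameter names are assumptions based on this description):

```python
import random

def capitalize_tags(caption: str, probability: float, modes: list[str],
                    delimiter: str = ",", lowercase_first: bool = False) -> str:
    """Randomly capitalize tags in a delimiter-separated caption."""
    tags = [t.strip() for t in caption.split(delimiter)]
    out = []
    for tag in tags:
        if lowercase_first:
            tag = tag.lower()
        if random.random() < probability:
            mode = random.choice(modes)  # pick one mode per tag
            if mode == "capslock":
                tag = tag.upper()
            elif mode == "title":
                tag = tag.title()
            elif mode == "first":
                tag = tag.capitalize()
            elif mode == "random":
                tag = "".join(c.upper() if random.random() < 0.5 else c.lower()
                              for c in tag)
        out.append(tag)
    return delimiter.join(out)
```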

DropTags: Randomly removes tags from the caption with a specified probability. It currently reuses the existing "keep tags" and "delimiter" options, and has three different modes of dropping tags:

  • FULL - all-or-nothing, will drop everything after the 'keep tags' amount
  • RANDOM - will iterate through the caption and randomly drop individual tags
  • RANDOM WEIGHTED - same as random, but the probability to drop tags is reduced at the start of the caption and only reaches the full value at the end, making it more likely to preserve tags at the start
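The three modes above can be sketched roughly as follows (a minimal illustration under assumptions; the function name and the linear weighting ramp are guesses from the description, not the PR's actual code):

```python
import random

def drop_tags(caption: str, probability: float, mode: str,
              keep_tags: int = 1, delimiter: str = ",") -> str:
    """Randomly drop tags past the first `keep_tags` entries."""
    tags = [t.strip() for t in caption.split(delimiter)]
    kept = tags[:keep_tags]   # tags inside the "keep tags" amount are never dropped
    rest = tags[keep_tags:]

    if mode == "FULL":
        # All-or-nothing: a single roll decides whether everything
        # past the keep_tags amount is dropped.
        if random.random() < probability:
            rest = []
    elif mode == "RANDOM":
        # Independent roll per tag.
        rest = [t for t in rest if random.random() >= probability]
    elif mode == "RANDOM WEIGHTED":
        # Drop chance ramps up linearly, reaching the full probability
        # only at the end of the caption, so early tags survive more often.
        n = len(rest)
        rest = [t for i, t in enumerate(rest)
                if random.random() >= probability * (i + 1) / n]
    return delimiter.join(kept + rest)
```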

Also supports a "special tags" list, which can be either a delimiter-separated string in the input field or a file path to a txt or csv file with entries separated by new lines. It can act as a "whitelist" (the special tags will never be dropped, but all others past the "keep tags" amount can be) or a "blacklist" (the special tags may be dropped, but all others will always be kept). There's also an option to interpret this list with regex matching, so that syntax like ".*s" will match any tags ending with the letter s. It includes an exception for the "\(" and "\)" syntax used in many booru/e6 tags, but there may be some other regex special characters "[]\.^$*+?{}|()" which could be interpreted in unintended ways.
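The regex matching could look something like this (a sketch based on the description; the function name and fallback behavior are assumptions, not the PR's code). Note that booru-style escaped parentheses such as "\(" are already valid regex for a literal "(", which is why patterns containing them match as written:

```python
import re

def matches_special(tag: str, special_tags: list[str], use_regex: bool) -> bool:
    """Check whether a tag matches any entry in the special tags list."""
    if not use_regex:
        return tag in special_tags
    for pattern in special_tags:
        try:
            # fullmatch so ".*s" matches whole tags ending in "s",
            # not tags that merely contain such a substring.
            if re.fullmatch(pattern, tag):
                return True
        except re.error:
            # Fall back to a literal comparison for an invalid pattern.
            if tag == pattern:
                return True
    return False
```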

@efhosci efhosci marked this pull request as ready for review October 21, 2024 01:50
@efhosci (Contributor, Author) commented Oct 21, 2024

Related issue: Nerogar/OneTrainer#289


# convert special_tags to list depending on whether it's a newline-separated csv/txt file or a delimiter-separated string
if (special_tags.endswith(".txt") and os.path.isfile(special_tags)):
    with open(special_tags) as special_tags_file:
@Nerogar (Owner) commented:
Might be a good idea to add some kind of in-memory cache to this. Reading and parsing a txt/csv file on every training step will add a bit too much overhead.

@efhosci (Contributor, Author) replied:
Thanks for the suggestion; I'll have to test this with some larger datasets and see how much of a slowdown it causes. I mostly included that part to make it easier to reuse large and complex white/blacklists between multiple concepts, but if you want to exclude it for now and use only the text field, I can look into it further and submit it in the future.

@efhosci (Contributor, Author) commented:

Did some research and ran some time comparisons between loading special tags from a file vs. giving it a string directly, looping the main block of the script through a list of ~1200 captions. Loading special tags from a file took around 0.12 s to complete, compared to 0.02 s for a string. Changing the file-loading section into a function with @functools.lru_cache closed the difference between them; however, I'm not sure whether the cache will persist between calls to the real function, and I don't know how to do that without writing to disk. I can add it to the PR anyway, since it didn't seem to hurt anything at least.

Enabling regex also made it slower, and there may not be a way to improve that; caching the regex patterns didn't help. In any case it seems like a difference of tenths of a millisecond for a single iteration, so I don't think it's going to affect training speed. I also tried running some short trainings comparing string vs. txt file, as well as regex on/off, and the completion time was not noticeably different. My system is not the strongest, so it may be slightly more noticeable with faster GPUs.

@Nerogar (Owner) replied:

I'm not even sure an LRU cache is needed. I would have just added a dict that maps a filename to the parsed content. Unless someone adds millions of different files, that won't create a memory issue, and it also ensures that every file only causes a single cache miss.

@efhosci (Contributor, Author) replied:

Alright, maybe I misunderstood what you were asking for. There would only ever be one "special tag" file per concept right now, so memory issues would be unlikely. I really don't have much experience with Python beyond simple single-purpose scripts so I don't know what best practices would be for caching a handful of text files between multiple calls to the same script.

I think the main thing that makes this "slow" is the fact that it's only doing one caption->one variation each time it runs, and loading/interpreting the special list from scratch each time it loads a new caption or generates a new variation. That's the way the "shuffle" module works and it's what I used as a reference for the structure of both of these. It wouldn't be difficult to change it to return a list of n variations of a single caption, but it seems like that would require significant changes to the way all modules are handled in the rest of the project.

Change section which loads txt/csv files with special tags into a separate function with lru_cache, may help reduce time taken to read files
@efhosci efhosci marked this pull request as draft October 23, 2024 13:43
Reorganized several sections of the DropTags code into functions for better readability and possibly performance, a few other changes to reduce the number of unnecessary loops in some sections
@efhosci efhosci marked this pull request as ready for review October 31, 2024 03:17
Fix some missing "self" arguments and remove lru cache from regex function
@Nerogar (Owner) left a review:
I finally had some time to really read through the code. Apart from the other comments, there are a few general things that should be changed:

  1. Some of the lines are way too long; they should be limited to 120 characters.
  2. If statements don't need parentheses.
  3. Type hints are missing for some functions.

Review comments (since resolved) on: .vscode/launch.json, src/mgds/pipelineModules/CapitalizeTags.py, and src/mgds/pipelineModules/DropTags.py (5 threads).
efhosci and others added 2 commits November 4, 2024 19:08
Fixed some long lines, incorrect type hints, and unnecessary parentheses; removed an extra file
@Nerogar Nerogar merged commit f9edb99 into Nerogar:master Nov 9, 2024
@efhosci (Contributor, Author) commented Nov 10, 2024

Thanks for your help with this; I'll work on adding information about the new features to the wiki.
