Text Augmentation additions #9
Conversation
Add two new modules for modifying text in captions. This covers the modules in mgds with the actual scripts; a separate pull request will be done for the integration in OneTrainer.

CapitalizeTags: Will randomly capitalize tags in the caption with a set probability. Can specify a comma-separated list of capitalization modes - 'capslock' (ALL CAPS), 'title' (First Letter Of Every Word), 'first' (First word only), or 'random' (rAnDOm lETteRS) - and it will iterate through each tag in the caption and apply a random mode to each. Also has an option to set the entire caption to lowercase first. Uses the same delimiter as defined for the existing "shuffle tags" option.

DropTags: Will randomly remove tags from the caption with a specified probability. Currently reuses the existing "keep tags" and "delimiter" options. Has three different modes of dropping tags:
- FULL - all-or-nothing, will drop everything after the 'keep tags' amount
- RANDOM - will iterate through the caption and randomly drop individual tags
- RANDOM WEIGHTED - same as RANDOM, but the probability to drop tags is reduced at the start of the caption and only reaches the full value at the end, making it more likely to preserve tags at the start

Also supports a "special tags" list, which can be either a delimiter-separated string in the input field or a file path to a txt or csv file with entries separated by new lines. It can act as a "whitelist" (the special tags will never be dropped, but all others past the "keep tags" amount can be) or a "blacklist" (the special tags may be dropped but all others will always be kept). There's also an option to interpret this list with regex matching, so that syntax like ".*s" will match any tags ending with the letter s. Includes an exception for the "\(" and "\)" syntax used in many booru/e6 tags, but some other regex special characters "[]\.^$*+?{}|()" could still be interpreted in unintentional ways.
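As a rough illustration of the CapitalizeTags behavior described above, here is a minimal standalone sketch. The function name, signature, and defaults are assumptions for illustration, not the module's actual API:

```python
import random

# Hypothetical sketch of the described CapitalizeTags behavior; the real
# pipeline module's structure and names may differ.
def capitalize_tags(caption: str, modes: list[str], probability: float,
                    delimiter: str = ", ", lowercase_first: bool = False) -> str:
    if lowercase_first:
        caption = caption.lower()
    out = []
    for tag in caption.split(delimiter):
        if random.random() < probability:
            # pick a random mode from the user-supplied list for each tag
            mode = random.choice(modes)
            if mode == "capslock":
                tag = tag.upper()
            elif mode == "title":
                tag = tag.title()
            elif mode == "first":
                tag = tag[:1].upper() + tag[1:]
            elif mode == "random":
                tag = "".join(c.upper() if random.random() < 0.5 else c.lower()
                              for c in tag)
        out.append(tag)
    return delimiter.join(out)
```

With probability 1.0 and a single mode, e.g. `["capslock"]`, every tag is transformed; with probability 0.0 the caption passes through unchanged.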
Relevant issue related to this PR: Nerogar/OneTrainer#289
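The RANDOM WEIGHTED drop mode can be sketched roughly as follows. The linear ramp and the helper name are assumptions about the described behavior ("reduced at the start, full value at the end"), not the actual implementation:

```python
import random

# Hypothetical sketch of the RANDOM WEIGHTED mode: the drop probability ramps
# linearly up to the full value at the end of the caption, so early tags are
# more likely to survive.
def drop_tags_weighted(tags: list[str], keep_tags: int, probability: float) -> list[str]:
    kept = list(tags[:keep_tags])   # the first keep_tags entries are always preserved
    droppable = tags[keep_tags:]
    n = len(droppable)
    for i, tag in enumerate(droppable):
        ramp = (i + 1) / n          # 1/n for the first droppable tag, 1.0 for the last
        if random.random() >= probability * ramp:
            kept.append(tag)
    return kept
```

At probability 0.0 nothing is dropped; at probability 1.0 the last droppable tag is always dropped while earlier ones still have a chance to survive.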
src/mgds/pipelineModules/DropTags.py (Outdated)

```python
#convert special_tags to list depending on whether it's a newline-separated csv/txt file or a delimiter-separated string
if (special_tags.endswith(".txt") and os.path.isfile(special_tags)):
    with open(special_tags) as special_tags_file:
```
Might be a good idea to add some kind of in-memory cache to this. Reading and parsing a txt/csv file on every training step will add a bit too much overhead.
Thanks for the suggestion, I'll have to test this with some larger datasets and see how much of a slowdown it causes. I mostly included that part to make it easier to reuse large and complex white/blacklists between multiple concepts, but if you want to exclude that for now and just use the text field only, I can look into it more and submit it in the future.
Did some research and some time comparisons between loading special tags from a file vs giving it a string directly, looping the main block of the script through a list of ~1200 captions. Loading special tags from a file took around 0.12 s to complete, compared to 0.02 s for a string. Changing the file-loading section into a function with @functools.lru_cache closed the difference between them, however I'm not sure whether the cache persists between calls to the real function, and I don't know how to do that without writing to disk. I can add it to the PR anyway since it didn't seem to hurt anything at least.
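For reference, a cached loader along these lines might look like the sketch below (the function name is an assumption). Regarding persistence: the lru_cache lives for the lifetime of the Python process, so the file is read only once per path within a single training run, but the cache does not survive across runs without writing to disk:

```python
import functools

# Hypothetical cached loader; lru_cache keys on the path string, so each file
# is read and parsed only on the first call per path within this process.
@functools.lru_cache(maxsize=None)
def load_special_tags(path: str) -> tuple[str, ...]:
    with open(path) as f:
        # one entry per line, skipping blank lines; returned as an immutable
        # tuple so the cached value can't be mutated by callers
        return tuple(line.strip() for line in f if line.strip())
```

Subsequent calls with the same path are cache hits and never touch the filesystem.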
Enabling regex also made it slower but there may not be a way to improve that, caching the regex patterns didn't help. In any case it seems like a difference of tenths of a millisecond for a single iteration so I don't think it's going to affect training speed. I did also try running some short trainings comparing string/txt file, as well as regex on/off, and the completion time was not noticeably different. My system is not the strongest so it may be slightly more noticeable with faster GPUs.
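On the regex side, one common pattern is to compile each special-tag pattern once and anchor it so that something like ".*s" only matches whole tags rather than substrings. This is a hypothetical sketch of that idea, not the PR's code, and the names are made up:

```python
import re
import functools

# Hypothetical helper: compile each user-supplied pattern once per process.
# Appending \Z anchors the end of the match; re.match already anchors the start,
# so ".*s" matches complete tags ending in "s".
@functools.lru_cache(maxsize=None)
def _compile(pattern: str) -> re.Pattern:
    return re.compile(pattern + r"\Z")

def matches_special(tag: str, patterns: tuple[str, ...]) -> bool:
    return any(_compile(p).match(tag) for p in patterns)
```

Patterns are passed as a tuple so the compiled objects can be cached per pattern string.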
I'm not even sure if an LRU cache is needed. I would have just added a dict to map a filename to the parsed content. Unless someone adds millions of different files, that won't create a memory issue, and it also ensures that every file only causes a single cache miss.
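The dict-based suggestion could look roughly like this minimal sketch (class and method names are invented for illustration):

```python
# Hypothetical sketch of the reviewer's suggestion: a plain dict mapping
# file path -> parsed tag list, so each file is read exactly once.
class SpecialTagCache:
    def __init__(self) -> None:
        self._cache: dict[str, list[str]] = {}

    def get(self, path: str) -> list[str]:
        if path not in self._cache:
            with open(path) as f:
                # one tag per line, skipping blank lines
                self._cache[path] = [line.strip() for line in f if line.strip()]
        return self._cache[path]
```

Holding an instance of this on the pipeline module would keep the parsed lists alive for the whole training run with a single cache miss per file.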
Alright, maybe I misunderstood what you were asking for. There would only ever be one "special tag" file per concept right now, so memory issues would be unlikely. I really don't have much experience with Python beyond simple single-purpose scripts so I don't know what best practices would be for caching a handful of text files between multiple calls to the same script.
I think the main thing that makes this "slow" is the fact that it's only doing one caption->one variation each time it runs, and loading/interpreting the special list from scratch each time it loads a new caption or generates a new variation. That's the way the "shuffle" module works and it's what I used as a reference for the structure of both of these. It wouldn't be difficult to change it to return a list of n variations of a single caption, but it seems like that would require significant changes to the way all modules are handled in the rest of the project.
Change section which loads txt/csv files with special tags into a separate function with lru_cache, may help reduce time taken to read files
Reorganized several sections of the DropTags code into functions for better readability and possibly performance, a few other changes to reduce the number of unnecessary loops in some sections
Fix some missing "self" arguments and remove lru cache from regex function
I finally had some time to really read through the code. Apart from the other comments, there are a few general things that should be changed
- Some of the lines are way too long. They should be limited to 120 characters
- If statements don't need parentheses
- type hints are missing for some functions
Fixed some long lines, incorrect type hints, and unnecessary parentheses. Removed extra file
Thanks for your help with this, will work on adding information about the new features to the wiki