German Regexp + Wordcount #33

Closed
wants to merge 2 commits into from

NiklasHoltmeyer

I added a regexp for the German language and trained German word counts based on https://dumps.wikimedia.org/dewiki/latest/ [Wikipedia dump (07.05.2021)]
https://github.com/NiklasHoltmeyer/autocorrect/releases/tag/German

@NiklasHoltmeyer NiklasHoltmeyer changed the title from "German Regexp" to "German Regexp + Wordcount" May 11, 2021
@filyp
Owner

filyp commented May 12, 2021

Great, can you upload the de.tar.gz file somewhere? Also, please add some unit tests, similar to the existing ones in test_all.py.

@NiklasHoltmeyer
Author

I added tests.

@filyp
Owner

filyp commented May 15, 2021

I see that de.tar.gz is 150 MB, which is a lot. Did you do the "find_threshold" step from https://github.com/fsondej/autocorrect#adding-new-languages to reduce the dictionary size? Or is it just that German has many long compound words, so the dictionary has to contain a lot of words?
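
For readers unfamiliar with the idea, thresholding a word-count dictionary can be sketched roughly as follows. This is only an illustration of the general technique, not the "find_threshold" procedure from the README: the helper names `prune_word_counts` and `smallest_threshold_for`, the `max_words` budget, and the sample counts are all made up for this example.

```python
def prune_word_counts(counts, min_count):
    """Keep only the words that were seen at least min_count times."""
    return {word: n for word, n in counts.items() if n >= min_count}


def smallest_threshold_for(counts, max_words):
    """Return the smallest min_count that leaves at most max_words entries.

    Hypothetical helper: a linear search is fine for a sketch, but a real
    multi-million-word dictionary would want something faster.
    """
    threshold = 1
    while len(prune_word_counts(counts, threshold)) > max_words:
        threshold += 1
    return threshold


# Made-up sample data; rare entries are often typos or very rare compounds.
counts = {"der": 900, "die": 850, "Haus": 40, "Hauss": 1, "Donaudampfschiff": 2}

threshold = smallest_threshold_for(counts, max_words=3)
pruned = prune_word_counts(counts, threshold)
print(threshold, sorted(pruned))
```

Dropping low-frequency entries shrinks the file, at the cost of losing rare but valid words, which matters more for a language with many long compounds.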

@NiklasHoltmeyer
Author

Sorry, I forgot to upload the clean version.

https://github.com/NiklasHoltmeyer/autocorrect/releases/tag/DE-Threshhold

Here is the thresholded version.

@filyp
Owner

filyp commented May 16, 2021

Hmm, in the release assets I only see the source code, and no de.tar.gz like before.

@NiklasHoltmeyer
Author

Hm, strange, I couldn't see the files either, but they are there if I edit the release. I just re-uploaded them and now I can see them!

https://github.com/NiklasHoltmeyer/autocorrect/releases/tag/DE-Threshhold

@filyp filyp mentioned this pull request May 26, 2021
@filyp
Owner

filyp commented May 26, 2021

Great, I made a new PR #34 with this dictionary URL added and fixed the black errors, so I'll close this one.

@filyp filyp closed this May 26, 2021
@filyp
Owner

filyp commented May 26, 2021

Also note that this de.tar.gz file has a bad directory structure inside, so it fails. I changed it and uploaded it to Dropbox.
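
A quick way to check what is inside such an archive is to list its member paths with the standard `tarfile` module. The sketch below builds two tiny in-memory archives for illustration; the member name `word_count.json` and the nested path are assumptions made for the example, and the layout the library actually expects is not stated in this thread.

```python
import io
import tarfile


def list_members(fileobj):
    """Return the member paths stored in a gzip-compressed tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return tar.getnames()


def make_archive(member_name, payload):
    """Build a one-file .tar.gz in memory (illustration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


# Hypothetical layouts: the same file packed with and without parent folders.
nested = list_members(make_archive("home/user/de/word_count.json", b"{}"))
flat = list_members(make_archive("word_count.json", b"{}"))
print(nested, flat)
```

If the names carry unexpected parent directories, code that looks for the file at a fixed path after extraction will not find it.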

@filyp
Owner

filyp commented May 26, 2021

Hmm, for some reason the tests still fail with:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 337958 column 1 (char 7633222)
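
To see what is on the offending line, one option is to catch the exception and use its `lineno` and `colno` attributes to print the surrounding text. This is a minimal sketch: `describe_json_error` is a hypothetical helper, and the broken sample (a single-quoted key) is just one input that triggers this message, not necessarily what is wrong with the German dictionary.

```python
import json


def describe_json_error(text, context=1):
    """Return (lineno, colno, nearby lines) for invalid JSON, else None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        lines = text.splitlines()
        # err.lineno is 1-based; take `context` lines on either side of it.
        start = max(err.lineno - 1 - context, 0)
        end = min(err.lineno + context, len(lines))
        return err.lineno, err.colno, lines[start:end]
    return None


# Made-up sample: the second key uses single quotes, which JSON forbids.
broken = '{\n"der": 900,\n\'die\': 850\n}'
print(describe_json_error(broken))
print(describe_json_error('{"der": 900}'))
```

For the real file, the same function could be fed the contents of the extracted JSON to show the lines around line 337958.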
