Data preprocessing for Korean Wordle

Description

This project extracts the word datasets used in Korean Wordle.
The datasets must have the following characteristics.

Only nouns are included in the datasets.
Each word is output in decomposed form, with every syllable block split into its individual jamo (letters).
- 하늘 → ㅎㅏㄴㅡㄹ
- 너울 → ㄴㅓㅇㅜㄹ
In decomposed form, a double consonant takes up two characters, and, as the second example shows, so does a compound vowel.
- 꼬마 → ㄱㄱㅗㅁㅏ
- 과거 → ㄱㅗㅏㄱㅓ
Every output entry must consist of exactly five characters.
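
The decomposition rules above can be sketched in Python as follows. This is only an illustration, not the code in `preprocess/`: the function name `decompose` and the table names are hypothetical. Splitting double consonants and the compound vowel ㅘ follows the examples above; splitting the other compound vowels and the consonant clusters in final position is an assumption.

```python
# Jamo tables in the order used by the Unicode Hangul syllable block (U+AC00..U+D7A3).
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")      # 19 initial consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")  # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

# Jamo that are written as two characters in decomposed form.
SPLIT = {
    # Double consonants (stated in the description above).
    "ㄲ": "ㄱㄱ", "ㄸ": "ㄷㄷ", "ㅃ": "ㅂㅂ", "ㅆ": "ㅅㅅ", "ㅉ": "ㅈㅈ",
    # Compound vowels (ㅘ is from the 과거 example; the rest are assumed).
    "ㅘ": "ㅗㅏ", "ㅙ": "ㅗㅐ", "ㅚ": "ㅗㅣ", "ㅝ": "ㅜㅓ",
    "ㅞ": "ㅜㅔ", "ㅟ": "ㅜㅣ", "ㅢ": "ㅡㅣ",
    # Consonant clusters in final position (assumed).
    "ㄳ": "ㄱㅅ", "ㄵ": "ㄴㅈ", "ㄶ": "ㄴㅎ", "ㄺ": "ㄹㄱ", "ㄻ": "ㄹㅁ", "ㄼ": "ㄹㅂ",
    "ㄽ": "ㄹㅅ", "ㄾ": "ㄹㅌ", "ㄿ": "ㄹㅍ", "ㅀ": "ㄹㅎ", "ㅄ": "ㅂㅅ",
}


def decompose(word: str) -> str:
    """Split every Hangul syllable in `word` into its individual jamo."""
    result = []
    for ch in word:
        offset = ord(ch) - 0xAC00
        if not 0 <= offset <= 0xD7A3 - 0xAC00:
            # Not a precomposed Hangul syllable: keep the character unchanged.
            result.append(ch)
            continue
        initial = CHOSEONG[offset // 588]        # 588 = 21 vowels * 28 finals
        medial = JUNGSEONG[(offset % 588) // 28]
        final = JONGSEONG[offset % 28]
        for jamo in (initial, medial, final):
            result.append(SPLIT.get(jamo, jamo))
    return "".join(result)


def is_valid_entry(word: str) -> bool:
    """True if the decomposed form has exactly five characters."""
    return len(decompose(word)) == 5
```

For instance, `decompose("꼬마")` returns `ㄱㄱㅗㅁㅏ`, and `is_valid_entry("하늘")` is `True` because `ㅎㅏㄴㅡㄹ` has five characters.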

Project Structure

root/  
│  
├── datasets/  
│   ├── dataset1/                   # 국립국어연구원 Korean-learning dataset files  
│   ├── dataset2/                   # 국립국어원 Basic Korean Dictionary (한국어 기초사전) dataset files  
│   └── dataset3/                   # 우리말샘 (Urimalsaem) dataset files  
│  
├── output/  
│  
├── preprocess/  
│   ├── __init__.py  
│   ├── common_preprocessing.py     # Common preprocessing functions  
│   ├── preprocess_easy_dataset.py  # Script for preprocessing dataset1
│   ├── preprocess_imdt_dataset.py  # Script for preprocessing dataset2
│   ├── preprocess_hard_dataset.py  # Script for preprocessing dataset3
│   └── preprocess_all_dataset.py   # Script for preprocessing dataset3 (dictionary)
│  
├── main.py  
└── config.py

Output

Each preprocessing Python script generates a JSON file, which is saved in the output/ directory.
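
As a rough sketch of this output step, the snippet below filters already-decomposed words to five characters and writes them to a JSON file. The function name `save_words`, the file name in the usage note, and the flat-list JSON layout are all assumptions for illustration; the actual scripts may structure their output differently.

```python
import json
from pathlib import Path


def save_words(decomposed_words, output_path):
    """Write the five-character entries to `output_path` as a JSON list.

    The flat, sorted list layout is an assumption, not the project's
    confirmed output format.
    """
    # Keep only five-character entries, drop duplicates, and sort for stable output.
    kept = sorted({word for word in decomposed_words if len(word) == 5})

    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create output/ if it is missing

    # ensure_ascii=False keeps the jamo readable instead of \uXXXX escapes.
    path.write_text(
        json.dumps(kept, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return kept
```

A call such as `save_words(["ㅎㅏㄴㅡㄹ", "ㄴㅏㅁㅜ"], "output/sample.json")` would keep only the first entry, since the second has four characters.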

hwahyeon/wordle-kor-dataset
