Korpora: Korean Corpora Archives

With growing interest in natural language processing, governments, businesses, and individuals are releasing their data for free. However, even high-quality corpora often go unnoticed because datasets are scattered across different locations. Their file and storage formats also differ, which makes them harder to use. As a result, users have to painstakingly write download and preprocessing code for each corpus.

Korpora is an open-source Python package that aims to minimize this inconvenience. The name comes from corpora, the plural of corpus, and also stands for Korean Corpora. We hope that Korpora will serve as a starting point that encourages more Korean datasets to be released and takes Korean natural language processing to the next level.

List of corpora

Korpora provides the following corpora.

corpus_name	description	link

korean_chatbot_data	Question and answer pairs for training a chatbot	https://github.com/songys/Chatbot_data
kcbert	Comment data used for training KcBERT model	https://github.com/Beomi/KcBERT
korean_hate_speech	Korean hate speech dataset	https://github.com/kocohub/korean-hate-speech
korean_petitions	Petitions to Blue House	https://github.com/lovit/petitions_archive
kornli	Korean NLI	https://github.com/kakaobrain/KorNLUDatasets
korsts	Korean STS	https://github.com/kakaobrain/KorNLUDatasets
kowikitext	Korean Wikipedia text	https://github.com/lovit/kowikitext/
namuwikitext	Namuwiki text	https://github.com/lovit/namuwikitext
naver_changwon_ner	NAVER x Changwon National University NER dataset	https://github.com/naver/nlp-challenge/tree/master/missions/ner
nsmc	NAVER Sentiment Movie Corpus	https://github.com/e9t/nsmc
question_pair	Korean question pair dataset (Paired Question v.2)	https://github.com/songys/Question_pair
modu_news	Modu Corpus: Newspaper	https://corpus.korean.go.kr
modu_messenger	Modu Corpus: Messenger	https://corpus.korean.go.kr
modu_mp	Modu Corpus: Morphemes	https://corpus.korean.go.kr
modu_ne	Modu Corpus: Named Entity	https://corpus.korean.go.kr
modu_spoken	Modu Corpus: Spoken	https://corpus.korean.go.kr
modu_web	Modu Corpus: Web	https://corpus.korean.go.kr
modu_written	Modu Corpus: Written	https://corpus.korean.go.kr
aihub_translation	Korean-English translation corpus	https://aihub.or.kr/aidata/87
open_subtitles	Korean-English parallel corpus from movie subtitles	http://opus.nlpl.eu/OpenSubtitles-v2018.php
korean_parallel_koen_news	Korean-English parallel corpus	https://github.com/jungyeul/korean-parallel-corpora

Information page

Detailed information on Korpora is available at the link below. The information page is written in both Korean and English. We would like to thank Han Kyul Kim (@hank110) and Won Ik Cho (@warnikchow) (in alphabetical order) for the English translation.

https://ko-nlp.github.io/Korpora

If you want a quick tour of the core functions, see the Quick overview below. For execution notes and option details, see the information page linked above.

Quick overview

Installation

From source

git clone https://github.com/ko-nlp/Korpora
cd Korpora
python setup.py install

Using pip

pip install Korpora

Using in Python

Korpora is an open-source Python package, so by default it runs from a Python console. You can list the available corpora with the following Python code.

from Korpora import Korpora
Korpora.corpus_list()

{
    'kcbert': 'beomi@github 님이 만드신 KcBERT 학습데이터',
    'korean_chatbot_data': 'songys@github 님이 만드신 챗봇 문답 데이터',
    'korean_hate_speech': '{inmoonlight,warnikchow,beomi}@github 님이 만드신 혐오댓글데이터',
    'korean_petitions': 'lovit@github 님이 만드신 2017.08 ~ 2019.03 청와대 청원데이터',
    'kornli': 'KakaoBrain 에서 제공하는 Natural Language Inference (NLI) 데이터',
    'korsts': 'KakaoBrain 에서 제공하는 Semantic Textual Similarity (STS) 데이터',
    'kowikitext': "lovit@github 님이 만드신 wikitext 형식의 한국어 위키피디아 데이터",
    'namuwikitext': 'lovit@github 님이 만드신 wikitext 형식의 나무위키 데이터',
    'naver_changwon_ner': '네이버 + 창원대 NER shared task data',
    'nsmc': 'e9t@github 님이 만드신 Naver sentiment movie corpus v1.0',
    'question_pair': 'songys@github 님이 만드신 질문쌍(Paired Question v.2)',
    'modu_news': '국립국어원에서 만든 모두의 말뭉치: 뉴스 말뭉치',
    'modu_messenger': '국립국어원에서 만든 모두의 말뭉치: 메신저 말뭉치',
    'modu_mp': '국립국어원에서 만든 모두의 말뭉치: 형태 분석 말뭉치',
    'modu_ne': '국립국어원에서 만든 모두의 말뭉치: 개체명 분석 말뭉치',
    'modu_spoken': '국립국어원에서 만든 모두의 말뭉치: 구어 말뭉치',
    'modu_web': '국립국어원에서 만든 모두의 말뭉치: 웹 말뭉치',
    'modu_written': '국립국어원에서 만든 모두의 말뭉치: 문어 말뭉치',
    'aihub_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어 + 대화 + 뉴스 + 한국문화 + 조례 + 지자체웹사이트)",
    'aihub_spoken_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어)",
    'aihub_conversation_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (대화)",
    'aihub_news_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (뉴스)",
    'aihub_korean_culture_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (한국문화)",
    'aihub_decree_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (조례)",
    'aihub_government_website_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (지자체웹사이트)",
    'open_subtitles': 'Open parallel corpus (OPUS) 에서 제공하는 영화 자막 번역 병렬 말뭉치',
}
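Because the listing above is a plain name-to-description mapping, it can be filtered like any Python dict. The following is a minimal sketch, assuming the value returned by Korpora.corpus_list() behaves like the dict shown. The helper name names_with_prefix is hypothetical, and the descriptions are replaced by placeholders so the example runs without Korpora installed.

```python
# Excerpt of the name -> description mapping from the listing above.
# The descriptions are placeholders; the real values are the Korean strings.
corpus_list = {
    "kcbert": "...",
    "nsmc": "...",
    "modu_news": "...",
    "modu_web": "...",
    "aihub_translation": "...",
    "aihub_news_translation": "...",
}


def names_with_prefix(corpora, prefix):
    """Return the sorted corpus names that start with `prefix` (hypothetical helper)."""
    return sorted(name for name in corpora if name.startswith(prefix))


# Collect every Modu corpus name present in the excerpt.
modu_names = names_with_prefix(corpus_list, "modu_")
print(modu_names)
```

The same filter could be applied to the full mapping to select, for example, only the aihub_ translation corpora.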

From the Python console, you can download the KcBERT training data with the following Python code. The corpus is downloaded to the Korpora directory in the user's home directory (~/Korpora). To download a different dataset, replace the corpus name in the argument with one of the names in the list above.

from Korpora import Korpora
Korpora.fetch("kcbert")
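To see where ~/Korpora resolves on your machine and what has already been downloaded, a short standard-library sketch may help. It assumes each corpus is stored in its own subdirectory under ~/Korpora, which the text above does not state. Both function names are hypothetical and not part of Korpora.

```python
from pathlib import Path


def default_korpora_root():
    """Expand ~/Korpora to an absolute path in the current user's home directory."""
    return Path("~/Korpora").expanduser()


def downloaded_corpus_dirs(root):
    """List subdirectory names under `root`; return [] if `root` does not exist.

    Assumption: one subdirectory per downloaded corpus.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


print(default_korpora_root())
print(downloaded_corpus_dirs(default_korpora_root()))
```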

To download all corpora provided by Korpora, use the following Python code. All datasets are downloaded to ~/Korpora.

from Korpora import Korpora
Korpora.fetch('all')

With the following code, you can load the KcBERT training dataset from your Python console. If the corpus does not exist locally, it is first downloaded to ~/Korpora. The corpus data is then stored in the Python variable corpus. To load a different dataset, replace the corpus name in the argument with one of the names in the list above.

from Korpora import Korpora
corpus = Korpora.load("kcbert")
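Since the corpus name is passed as a plain string, a typo is only noticed once Korpora is called. One way to catch it earlier is to check the name against the known names first. This is a sketch under stated assumptions: load_corpus and KNOWN_CORPORA are hypothetical names, the set is only an excerpt of the list above, and the only Korpora call used is Korpora.load(name) as shown in the text.

```python
# Excerpt of the corpus names from the list above (not the full set).
KNOWN_CORPORA = {"kcbert", "korean_chatbot_data", "nsmc", "kornli", "korsts"}


def load_corpus(name, known=KNOWN_CORPORA):
    """Validate `name`, then load it with Korpora.load (hypothetical wrapper)."""
    if name not in known:
        raise ValueError(
            f"Unknown corpus name: {name!r}. Expected one of {sorted(known)}"
        )
    # Imported lazily so the name check works even where Korpora is not installed.
    from Korpora import Korpora
    return Korpora.load(name)


# Usage (downloads to ~/Korpora on first use, so it is left commented out):
# corpus = load_corpus("kcbert")
```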

Using in a terminal

You can also run Korpora from your terminal through its command line interface (CLI), without opening a Python console. The following command downloads the KcBERT training dataset to ~/Korpora.

korpora fetch --corpus kcbert

The following command downloads the KcBERT training dataset and the chatbot Q&A pair dataset together. You can list three or more datasets in the same way. Datasets are downloaded to ~/Korpora.

korpora fetch --corpus kcbert korean_chatbot_data
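If you drive the CLI from a script, the same command can be assembled from a list of corpus names. The sketch below only builds and prints the command shown above. The helper name build_fetch_command is hypothetical, and actually running the command (for example with subprocess.run) assumes the korpora executable is on your PATH.

```python
import shlex


def build_fetch_command(corpus_names):
    """Build the `korpora fetch --corpus ...` argument list for one or more corpora."""
    if not corpus_names:
        raise ValueError("At least one corpus name is required")
    return ["korpora", "fetch", "--corpus", *corpus_names]


cmd = build_fetch_command(["kcbert", "korean_chatbot_data"])
print(shlex.join(cmd))

# To execute it, one option is:
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems when it is passed to subprocess.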

You can download all corpora provided by Korpora from your terminal with the following command. Datasets are downloaded to ~/Korpora.

korpora fetch --corpus all

From your terminal, you can also create a dataset for training a language model. Here, that means extracting only the sentences from the corpora provided by Korpora and saving them in a text file. The sample command below processes all corpora and creates a single training dataset. Each corpus is downloaded to ~/Korpora if it does not exist locally, and its text is preprocessed as part of the same run. A single output file named all.train is created in output_dir.

korpora lmdata \
  --corpus all \
  --output_dir ~/works/lmdata
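Conceptually, the step described above reduces a collection of texts to a single plain-text file. The following sketch illustrates that idea with the standard library only. It is not Korpora's implementation: the function name write_lm_training_file, the one-text-per-line layout, and the whitespace handling are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def write_lm_training_file(texts, output_dir, filename="all.train"):
    """Write each non-empty text on its own line and return (path, line_count)."""
    output_dir = Path(output_dir).expanduser()
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / filename
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # Collapse internal whitespace so each text occupies exactly one line.
            line = " ".join(text.split())
            if line:
                f.write(line + "\n")
                count += 1
    return path, count


if __name__ == "__main__":
    # Demo with made-up sentences, written to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        path, count = write_lm_training_file(["first text", "second\ntext"], tmp)
        print(path.name, count)
```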

License

Korpora is licensed under the Creative Commons Attribution 4.0 license (CC BY 4.0). This license covers the Korpora package and all of its components.
Its users have the following rights.
- Share : They are free to reproduce, distribute, exhibit, perform, and broadcast the material, including in a changed format.
- Adapt : They can remix, transform, and build upon the material for any purpose, even commercially.
Its users have the following obligations. As long as these obligations are fulfilled, the user rights listed above are valid.
- Attribution : They must indicate that they have used Korpora.
- No additional restrictions : For all derivative works of Korpora, they cannot impose a stricter license than CC BY permits.
For example, if you have only downloaded and used Korpora, you need to fulfill only the 'attribution' obligation. However, if you are creating and distributing models, documents, or any other derivative works of Korpora, you must fulfill both the 'attribution' and 'no additional restrictions' obligations.
Each corpus adheres to its own license policy. Please check the license of each corpus before using it!
