Open · Comment added 2021/09/01 · 3 comments

Preprocessing tips for machine learning and deep learning

schnell, updated 2021/08/31

Splitting data

Modules used

pandas
sklearn.model_selection

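Sample code

Split JSON Lines data into train, test, and valid at a ratio of 8 : 1 : 1. The ground-truth labels are in the 'label' column.
The split is stratified so that the proportion of each label is the same in every output file.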

import pandas as pd
from sklearn.model_selection import train_test_split

filename = 'raw_data.json'
df = pd.read_json(filename, orient='records', lines=True)

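# First split: 80% train, 20% held out, stratified on 'label'
train, valid_test = train_test_split(df, test_size=0.2, shuffle=True, random_state=0, stratify=df['label'])
# Second split: divide the held-out 20% evenly into valid and test
valid, test = train_test_split(valid_test, test_size=0.5, shuffle=True, random_state=0, stratify=valid_test['label'])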

train.to_json('train.json', orient='records', lines=True, force_ascii=False)
test.to_json('test.json', orient='records', lines=True, force_ascii=False)
valid.to_json('valid.json', orient='records', lines=True, force_ascii=False)
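
To confirm that the stratified split kept the label proportions close across the output files, a quick check like the following may help. This is a minimal sketch using only the standard library. The helper name `label_ratios` and the in-memory sample records are assumptions for illustration and are not part of the original code.

```python
import json
from collections import Counter


def label_ratios(lines, label_key='label'):
    """Return {label: fraction} for an iterable of JSON Lines strings."""
    counts = Counter(
        json.loads(line)[label_key] for line in lines if line.strip()
    )
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical sample standing in for the contents of one output file.
sample_lines = [
    '{"sentence": "a", "label": 0}',
    '{"sentence": "b", "label": 0}',
    '{"sentence": "c", "label": 0}',
    '{"sentence": "d", "label": 1}',
]
ratios = label_ratios(sample_lines)
print(ratios)  # {0: 0.75, 1: 0.25}
```

With real files, an open file object can be passed directly, for example `label_ratios(open('train.json', encoding='utf-8'))`, and the results for train.json, valid.json, and test.json compared side by side.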

schnell, updated 2021/09/01

Converting data

TSV -> JSON Lines

import json
import csv

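# Column names for the TSV, which is assumed to have no header row
columns = ('sentence', 'label')
filename = 'data.tsv'
outfile = 'data.json'

# Assumes both files are UTF-8; adjust `encoding` if they are not
with open(filename, newline='', mode='r', encoding='utf-8') as f, open(outfile, newline='', mode='w', encoding='utf-8') as fw:
    for row in csv.DictReader(f, columns, delimiter='\t'):
        # One compact JSON object per line, keeping non-ASCII characters as-is
        text = json.dumps(row, ensure_ascii=False, separators=(',', ':'))
        fw.write(text + '\n')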
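
To see what the conversion produces without touching any files, the same loop can be run against in-memory text. This is a minimal sketch: the two-row TSV sample is hypothetical, and `io.StringIO` stands in for the real input and output files.

```python
import csv
import io
import json

columns = ('sentence', 'label')

# Hypothetical two-row TSV without a header line, standing in for data.tsv.
tsv_text = 'こんにちは\t1\nさようなら\t0\n'

out = io.StringIO()
for row in csv.DictReader(io.StringIO(tsv_text), columns, delimiter='\t'):
    out.write(json.dumps(row, ensure_ascii=False, separators=(',', ':')) + '\n')

jsonl = out.getvalue()
print(jsonl)
```

Note that the csv module reads every field as a string, so 'label' comes out as "1" rather than 1. If numeric labels are needed downstream, convert them (for example with `int(row['label'])`) before calling `json.dumps`.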

schnell 2021/09/01

Text processing

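Collapse consecutive spaces (half-width or full-width) into a single half-width space

import re
space = re.compile('[\u3000 ]{2,}')  # runs of 2+ half-width or full-width spaces
text = re.sub(space, ' ', text)  # `text` is the input string; each run becomes one space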
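
Reading the snippet above as "replace each run of two or more spaces with one space", its behavior can be sketched on a few inputs as follows. The function name `collapse_spaces` and the sample strings are assumptions for illustration.

```python
import re

# Runs of two or more spaces, half-width or full-width (U+3000).
space = re.compile('[\u3000 ]{2,}')


def collapse_spaces(text):
    """Replace each run of 2+ spaces with a single half-width space."""
    return space.sub(' ', text)


print(collapse_spaces('a  b'))             # a b
print(collapse_spaces('a\u3000\u3000 b'))  # a b
print(collapse_spaces('a b'))              # a b
```

Because the pattern requires at least two characters, a lone full-width space is left unchanged. To normalize those as well, the pattern `'[\u3000 ]+'` could be used instead.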