言語処理100本ノック (NLP 100 Exercise)

Chapter 1: Warm-up
Based on https://qiita.com/yamaru/items/6d66445dbd5e7cef8640
02 Interleave two strings character by character

"".join([i+j for i,j in zip(str1,str2)])

zip pairs up the elements of two iterables (here, strings) position by position
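A minimal sketch of the interleaving idiom above. The two sample strings are placeholders chosen for illustration, not the strings from the exercise.

```python
# Hypothetical sample inputs of equal length
str1 = "ace"
str2 = "bdf"

# zip yields ('a', 'b'), ('c', 'd'), ('e', 'f'); each pair is concatenated in order
interleaved = "".join(a + b for a, b in zip(str1, str2))
print(interleaved)  # abcdef
```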

Count the characters in each word of an English sentence, using a regular expression

re.sub(r"[,.]", "", s)  # re.sub(a, b, s) replaces every match of pattern a in s with b
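The re.sub call above can be combined with split and len to get the word lengths. This is a sketch only; the sample sentence is assumed here for illustration and does not appear in these notes.

```python
import re

# Sample sentence (an assumption for this sketch)
sentence = ("Now I need a drink, alcoholic of course, after the heavy "
            "lectures involving quantum mechanics.")

# Remove commas and periods, then measure each whitespace-separated word
cleaned = re.sub(r"[,.]", "", sentence)
lengths = [len(word) for word in cleaned.split()]
print(lengths)  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```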

Using a dictionary and enumerate

import re

text = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'
text = re.sub(r"[,.]", "", text)  # strip commas and periods
words = text.split()
one_ch = [1, 5, 6, 7, 8, 9, 15, 16, 19]  # 1-based positions that keep only the first letter
ans = {}
idx = 0
for i, word in enumerate(words):  # enumerate yields (index, element) pairs in order
    if idx < len(one_ch) and i+1 == one_ch[idx]:
        idx += 1
        ans[word[:1]] = i+1
    else:
        ans[word[:2]] = i+1
print(ans)

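n-gram
The point is to pass a sequence to the function (the string itself, or the list produced by split), so that words and characters are handled by the same function
A straightforward version: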

def ngram(n, seq):
    # Collect every run of n consecutive items of seq as a tuple
    res = []
    for i in range(len(seq)-n+1):
        tmp = []
        for j in range(n):
            tmp.append(seq[j+i])
        res.append(tuple(tmp))
    return res


text = 'I am an NLPer'
chars_bi_gram = ngram(2, text)
words_bi_gram = ngram(2, text.split())
print(chars_bi_gram)
print(words_bi_gram)

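One-liner solution from the Qiita article
Key points

zip stops at the shortest iterable when the lengths differ (itertools.zip_longest instead pads the missing elements with fillvalue, which defaults to None)
Reference: https://note.nkmk.me/python-zip-usage-for/
zip(*[...]) unpacks the nested iterables so that each child is passed to zip as a separate argument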

def ngram_(n, lst):
    # e.g. with lst = 'I am an NLPer' and n = 2:
    # [lst[i:] for i in range(2)] -> ['I am an NLPer', ' am an NLPer']
    # zip(*[lst[i:] for i in range(2)]) -> zip('I am an NLPer', ' am an NLPer')
    return list(zip(*[lst[i:] for i in range(n)]))
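The zip behaviors listed in the key points above can be checked in isolation. A small sketch with made-up inputs (short, long_, and shifted are hypothetical names):

```python
from itertools import zip_longest

short, long_ = "ab", "wxyz"

# zip stops at the shorter input
truncated = list(zip(short, long_))

# zip_longest pads the shorter input, with None unless fillvalue is given
padded = list(zip_longest(short, long_))
padded_dash = list(zip_longest(short, long_, fillvalue="-"))

# zip(*list_of_sequences) passes each inner sequence as its own argument
shifted = ["abc", "bc"]
pairs = list(zip(*shifted))

print(truncated)  # [('a', 'w'), ('b', 'x')]
print(padded)     # [('a', 'w'), ('b', 'x'), (None, 'y'), (None, 'z')]
print(pairs)      # [('a', 'b'), ('b', 'c')]
```

The last line is the bi-gram trick: the second sequence is the first one shifted by one position, and zip drops the unmatched tail.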


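Chapter 1: Warm-up, second half (6 to 9)
6 Sets
Key points

Set operations on set objects: | (union), & (intersection), - (difference)
Membership test via subset comparison: {(a, b)} <= x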

str1 = 'paraparaparadise'
str2 = 'paragraph'


def ngram(n, seq):
    # Collect every run of n consecutive items of seq as a tuple
    res = []
    for i in range(len(seq)-n+1):
        tmp = []
        for j in range(n):
            tmp.append(seq[j+i])
        res.append(tuple(tmp))
    return res


x = set(ngram(2, str1))
y = set(ngram(2, str2))
union = x | y
intersection = x & y
difference = x - y

print("x:", x)
print("y:", y)
print("Union:", union)
print("Intersection:", intersection)
print("Difference:", difference)
print("Is 'se' in x:", {("s", "e")} <= x)
print("Is 'se' in y:", {("s", "e")} <= y)
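As a cross-check of the subset test used above, the same membership question can be asked with the in operator. This sketch rebuilds the bi-gram sets with zip instead of the ngram function; the helper name bigrams is hypothetical.

```python
def bigrams(s):
    # Pair each character with its successor
    return set(zip(s, s[1:]))

x = bigrams("paraparaparadise")
y = bigrams("paragraph")

target = ("s", "e")
# For a single element, `target in x` and `{target} <= x` give the same answer
print(target in x, {target} <= x)  # True True
print(target in y, {target} <= y)  # False False
print(len(x | y), sorted(x & y))
```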

7 Sentence generation with a template
f-strings are handy here

def generate_template(a, b, c):
    print(f"At {a} o'clock, {b} is {c}")


generate_template(12, "temperature", 22.4)

8 Cipher
Key points

Lowercase check: str.islower()

Written naively:

def cipher(s):
    res = ""
    for c in s:
        if ord(c) < 123 and ord(c) > 96:
            res += chr(219-ord(c))
        else:
            res += c
    return res


message = 'the quick brown fox jumps over the lazy dog'
message = cipher(message)
print('Encrypted:', message)
message = cipher(message)
print('Decrypted:', message)

The function body as a one-liner:
"".join([chr(219 - ord(x)) if x.islower() else x for x in s])
(from the Qiita article)
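A runnable sketch of the one-liner wrapped in a function, assuming ASCII input. str.islower() is also true for non-ASCII lowercase letters such as 'é', for which 219 - ord(c) is negative and chr raises ValueError; the range check in the naive version does not have that problem.

```python
def cipher(s):
    # Mirror each lowercase letter (a <-> z, b <-> y, ...) via 219 - code point.
    # Assumes ASCII input; see the note above about islower().
    return "".join(chr(219 - ord(c)) if c.islower() else c for c in s)


plain = "the quick brown fox"
encrypted = cipher(plain)
decrypted = cipher(encrypted)  # applying the cipher twice restores the input
print(encrypted)  # gsv jfrxp yildm ulc
print(decrypted)  # the quick brown fox
```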

9 Typoglycemia
Key points

Shuffling a string: random.sample(s, k) returns a list of k elements drawn from s without replacement
Documentation: https://docs.python.org/ja/3/library/random.html

from random import sample


def word_shuffle(str_):
    word_list = str_.split()
    res = ""
    for i in word_list:
        if len(i) <= 4:
            res += i + " "
        else:
            res += i[0] + "".join(sample(i[1:-1], len(i[1:-1]))) + i[-1] + " "
    return res[:-1]


words = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
ans = word_shuffle(words)
print(ans)
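To see what random.sample does to a single word, here is a small sketch. The seed is fixed only to make the run repeatable, and the sample word is arbitrary.

```python
import random

random.seed(0)  # fixed seed, purely so repeated runs give the same shuffle

word = "believe"
middle = word[1:-1]

# Sampling len(middle) elements without replacement is a full shuffle;
# the result is a list of characters, so it has to be joined back
shuffled_middle = random.sample(middle, len(middle))
scrambled = word[0] + "".join(shuffled_middle) + word[-1]
print(scrambled)
```

The first and last letters stay in place and the middle keeps exactly the same letters, only reordered.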


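Chapter 2: UNIX commands
10 Count lines: wc -l file_name
11 Replace tabs with spaces: sed -e "s/\t/ /g" file_name
12 Save a specific column to a file (cut fields are 1-indexed): cut -f 1 file_name
13 Merge files side by side, line by line: paste file1 file2
pandas: pd.concat([df1, df2], axis=1)
14 Print the first N lines: head -n N file_name
pandas: df.head(N)
15 Print the last N lines: tail -n N file_name
pandas: df.tail(N)
16 Split a file into N parts: split -l 200 file_name splits into chunks of 200 lines each (this fixes the chunk size, not the number of parts; with GNU split, -n l/N splits into N parts without breaking lines)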

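tmp = df.reset_index(drop=False)  # gives a 0..len-1 positional index
df_cut = pd.qcut(tmp.index, N, labels=[i for i in range(N)])  # bin row positions into N near-equal groups
df_cut = pd.concat([df, pd.Series(df_cut, name="sep")], axis=1)  # attach the group label as column "sep"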
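For comparison, splitting into N parts can also be sketched with the standard library only. split_into_n is a hypothetical helper written for this note, and the rows are placeholder data.

```python
def split_into_n(lines, n):
    """Split lines into n consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(len(lines), n)
    chunks = []
    start = 0
    for i in range(n):
        # The first `extra` chunks each take one additional line
        size = base + (1 if i < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


lines = [f"row{i}" for i in range(10)]  # placeholder data
chunks = split_into_n(lines, 3)
print([len(c) for c in chunks])  # [4, 3, 3]
```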

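17 Distinct strings in the first column: cut -f 1 file_name | sort | uniq | wc -l
Note

uniq only collapses adjacent duplicate lines, so the input must already be sorted; always use it together with sort
pandas: len(df.drop_duplicates(subset=col_name)) (the column name goes in subset; a list of several columns is also accepted)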
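Why uniq needs sorted input can be illustrated with itertools.groupby, which likewise only groups adjacent equal items. This is an analogy in Python, not a description of how uniq is implemented; the names list is made up.

```python
from itertools import groupby

names = ["b", "a", "b", "a", "a"]

# Without sorting, only adjacent repeats collapse (like uniq on unsorted input)
unsorted_uniq = [key for key, _ in groupby(names)]

# After sorting, equal items are adjacent, so every duplicate collapses
sorted_uniq = [key for key, _ in groupby(sorted(names))]

print(unsorted_uniq)  # ['b', 'a', 'b', 'a']
print(sorted_uniq)    # ['a', 'b']
```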

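18 Sort rows by the value of a specific column: sort -rnk 3 file_name
Reference: sort options (r: reverse order, n: compare strings as numbers, k: key field and its sort type)
To sort on several keys, chain them, e.g. sort -k 3n -k 1

pandas: df.sort_values(by="number", ascending=False, inplace=True) (by names the column)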
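The same single-key and multi-key sorts can be sketched in plain Python with sorted and a key function. The rows below are invented sample data.

```python
# Invented sample rows: (name, sex, count)
rows = [("Ann", "F", 30), ("Bob", "M", 25), ("Cid", "M", 30)]

# Like `sort -rnk 3`: numeric sort on the third field, descending
by_count_desc = sorted(rows, key=lambda r: r[2], reverse=True)

# Like chaining keys: third field numerically, then first field
by_count_then_name = sorted(rows, key=lambda r: (r[2], r[0]))

print(by_count_desc)
print(by_count_then_name)
```

sorted is stable, so rows with equal counts keep their original relative order in the first result.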
19 Count duplicate lines: sort | uniq -c | sort -rn (uniq -c collapses duplicate lines and prefixes each with its number of occurrences)

pandas: df[col_name].value_counts()
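A standard-library counterpart of the pipeline above is collections.Counter. A short sketch on made-up values:

```python
from collections import Counter

names = ["a", "b", "a", "c", "a", "b"]  # made-up column values

# most_common() returns (value, count) pairs, highest count first
counts = Counter(names).most_common()
print(counts)  # [('a', 3), ('b', 2), ('c', 1)]
```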


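Chapter 3: Regular expressions
Data: the Wikipedia country articles, downloaded from https://nlp100.github.io/data/jawiki-country.json.gz ; the tasks below operate on this file
Each line holds the information for one country article as a JSON object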

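20 Reading JSON data
Key points

Naming the script json.py raised AttributeError: partially initialized module 'json' has no attribute 'loads' (most likely due to a circular import): the script shadowed the standard-library module and ended up importing itself. Take care that file names do not clash with module names.
- json.loads(s) deserializes a string (str, bytes, or bytearray) holding a JSON document and returns a Python object (a dict for a JSON object)
- json.load(fp) does the same for a stream (an open file object)
- json.dumps(obj) serializes a Python object such as a dict into a JSON string
  Documentation URL: https://community.nanoporetech.com/#search&tab-overlay-right=search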
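The three calls above, sketched on an in-memory stand-in for a file with one JSON object per line. The "title" and "text" keys and the sample values are assumptions made for this example.

```python
import io
import json

# In-memory stand-in for a file with one JSON object per line (made-up content)
sample = io.StringIO(
    '{"title": "A", "text": "alpha"}\n'
    '{"title": "B", "text": "beta"}\n'
)

# json.loads parses one string at a time, which suits line-delimited JSON
articles = [json.loads(line) for line in sample]
by_title = {a["title"]: a["text"] for a in articles}

# json.load reads a whole stream that contains a single JSON document
single = json.load(io.StringIO('{"title": "C"}'))

# json.dumps goes the other way: Python object to JSON string
round_trip = json.dumps(single)

print(by_title["B"])  # beta
print(round_trip)     # {"title": "C"}
```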

21 Regular expressions

import re

pattern = r'^(.*\[\[Category:.*\]\].*)$'
result = '\n'.join(re.findall(pattern, text_uk, re.MULTILINE))
print(result)

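Key points

re.findall() returns every match in the text
Passing re.MULTILINE, as in re.findall(pattern, text, re.MULTILINE), makes ^ and $ match at the start and end of every line, so the pattern is applied line by line
Tips
Use raw strings (r'...') so that each backslash only has to be written once
Special characters available in Python regular expressions
- ^ : start of the string
- $ : end of the string
- . : any single character except a newline
- + : one or more repetitions
- * : zero or more repetitions
  https://note.nkmk.me/python-re-match-search-findall-etc/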
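The effect of re.MULTILINE on ^ and $ can be seen by running the same pattern with and without the flag. The text below is a made-up stand-in for an article body.

```python
import re

# Made-up stand-in for an article body
text = "intro line\n[[Category:Islands]]\n[[Category:Monarchies|*]]\nend"
pattern = r'^(.*\[\[Category:.*\]\].*)$'

# Without the flag, ^ and $ anchor to the whole string, and . cannot cross
# a newline, so no single match can span from start to end
without_flag = re.findall(pattern, text)

# With re.MULTILINE, ^ and $ anchor to each line
with_flag = re.findall(pattern, text, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['[[Category:Islands]]', '[[Category:Monarchies|*]]']
```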

22 Extracting category names

pattern = r'^.*\[\[Category:(.*?)(?:\|.*)?\]\].*$'
result = '\n'.join(re.findall(pattern, text_uk, re.MULTILINE))
print(result)

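Key points

To return the match without an unwanted part (here the optional |... suffix), capture with the non-greedy (.*?), which takes the shortest match rather than the longest, and wrap the suffix in a non-capturing group (?:...)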
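A side-by-side sketch of the non-greedy capture against a greedy variant, on made-up category lines. The greedy pattern is included only for contrast.

```python
import re

# Made-up category lines
text = "[[Category:Islands]]\n[[Category:Monarchies|*]]"

lazy_pattern = r'^.*\[\[Category:(.*?)(?:\|.*)?\]\].*$'
greedy_pattern = r'^.*\[\[Category:(.*)(?:\|.*)?\]\].*$'

# (.*?) stops as early as possible, leaving "|*" to the non-capturing group
lazy = re.findall(lazy_pattern, text, re.MULTILINE)

# (.*) grabs as much as possible, so the "|*" suffix ends up in the capture
greedy = re.findall(greedy_pattern, text, re.MULTILINE)

print(lazy)    # ['Islands', 'Monarchies']
print(greedy)  # ['Islands', 'Monarchies|*']
```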