<h1 id="%E3%83%A2%E3%83%81%E3%83%99%E3%83%BC%E3%82%B7%E3%83%A7%E3%83%B3">
<a class="header-anchor-link" href="#%E3%83%A2%E3%83%81%E3%83%99%E3%83%BC%E3%82%B7%E3%83%A7%E3%83%B3" aria-hidden="true"></a> Motivation</h1>
I keep forgetting which "size" the PySpark size function actually returns, so I am recording a concrete sample here to jog my memory quickly.
<iframe id="zenn-embedded__4aa09944bf0fe" src="https://embed.zenn.studio/card#zenn-embedded__4aa09944bf0fe" data-content="https%3A%2F%2Fspark.apache.org%2Fdocs%2Flatest%2Fapi%2Fpython%2Freference%2Fapi%2Fpyspark.sql.functions.size.html" frameborder="0" scrolling="no" loading="lazy"></iframe><a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.size.html" style="display:none" target="_blank" rel="nofollow noopener noreferrer">https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.size.html</a>
<h1 id="%E7%B5%90%E6%9E%9C">
<a class="header-anchor-link" href="#%E7%B5%90%E6%9E%9C" aria-hidden="true"></a> Result</h1>
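<div class="code-block-container"><pre class="language-python"><code class="language-python">from pyspark.sql import SparkSession
from pyspark.sql.functions import size

# Reuse the active session (e.g. in a notebook) or create one.
spark = SparkSession.builder.getOrCreate()

# 'name' is an array column, so size returns its element count.
data = [(['Yamada', 'Taro'], 13), (['Ito', 'kenta'], 25)]
df = spark.createDataFrame(data, ['name', 'age'])
df = df.withColumn('sizecolumn', size('name'))
df.show()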

+--------------+---+----------+
|          name|age|sizecolumn|
+--------------+---+----------+
|[Yamada, Taro]| 13|         2|
|  [Ito, kenta]| 25|         2|
+--------------+---+----------+
</code></pre></div>Note that if the argument passed to size is neither an array nor a map column, the following exception is raised:
AnalysisException: cannot resolve 'size(name)' due to data type mismatch: argument 1 requires (array or map) type
