🎉

PySpark読み込み時のinferSchema optionについて

2022/06/02に公開

はじめに

PySparkのread時に指定するinferSchema optionの挙動について確認する。
inferSchemaのdefaultはfalseである。trueを設定した場合には、入力データより自動でスキーマを推論する。
https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option

結果

  • inferSchema optionがfalseの場合、すべてのカラムはstringとなる。
  • date型はサポートしていない。
  • 日付もtimestampとなる。3系以降はtimestamp型を検知する場合には、timestampFormat optionも必要である。[1]
  • 数値のなかに1行でも文字があるとstringになる。
読み込みデータ
"id","c_datetime","c_date","c_time","price"
r1,2016-03-06 13:09:42,2016-03-26,10:00:00,97200
r2,2016-07-16 23:39:55,2016-07-20,11:30:00,20600
r3,2016-09-24 10:03:17,2016-10-19,09:00:00,33600
inferSchema指定なし
df_sample = spark.read.option("header", True).csv('../../../data/sampledata.csv')

df_sample.show()
+---+-------------------+----------+--------+-----+
| id|         c_datetime|    c_date|  c_time|price|
+---+-------------------+----------+--------+-----+
| r1|2016-03-06 13:09:42|2016-03-26|10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20|11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19|09:00:00|33600|
+---+-------------------+----------+--------+-----+

df_sample.printSchema()
root
 |-- id: string (nullable = true)
 |-- c_datetime: string (nullable = true)
 |-- c_date: string (nullable = true)
 |-- c_time: string (nullable = true)
 |-- price: string (nullable = true)
inferSchema指定あり、timestampFormat指定なし
df_sample = spark.read.option("header", True).option("inferSchema",True).csv('../../../data/sampledata.csv')

df_sample.show()
+---+-------------------+----------+--------+-----+
| id|         c_datetime|    c_date|  c_time|price|
+---+-------------------+----------+--------+-----+
| r1|2016-03-06 13:09:42|2016-03-26|10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20|11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19|09:00:00|33600|
+---+-------------------+----------+--------+-----+

df_sample.printSchema()
root
 |-- id: string (nullable = true)
 |-- c_datetime: string (nullable = true)
 |-- c_date: string (nullable = true)
 |-- c_time: string (nullable = true)
 |-- price: integer (nullable = true)
inferSchema、timestampFormat指定あり
df_sample = spark.read.option("header", True).option("inferSchema",True).option("timestampFormat","yyyy-MM-dd HH:mm:ss").csv('../../../data/sampledata.csv')

df_sample.show()
+---+-------------------+----------+--------+-----+
| id|         c_datetime|    c_date|  c_time|price|
+---+-------------------+----------+--------+-----+
| r1|2016-03-06 13:09:42|2016-03-26|10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20|11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19|09:00:00|33600|
+---+-------------------+----------+--------+-----+

df_sample.printSchema()
root
 |-- id: string (nullable = true)
 |-- c_datetime: timestamp (nullable = true)
 |-- c_date: string (nullable = true)
 |-- c_time: string (nullable = true)
 |-- price: integer (nullable = true)
inferSchema、timestampFormat指定あり
df_sample = spark.read.option("header", True).option("inferSchema",True).option("timestampFormat","yyyy-MM-dd").csv('../../../data/sampledata.csv')

df_sample.show()
+---+-------------------+-------------------+--------+-----+
| id|         c_datetime|             c_date|  c_time|price|
+---+-------------------+-------------------+--------+-----+
| r1|2016-03-06 13:09:42|2016-03-26 00:00:00|10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20 00:00:00|11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19 00:00:00|09:00:00|33600|
+---+-------------------+-------------------+--------+-----+

df_sample.printSchema()
root
 |-- id: string (nullable = true)
 |-- c_datetime: string (nullable = true)
 |-- c_date: timestamp (nullable = true)
 |-- c_time: string (nullable = true)
 |-- price: integer (nullable = true)
inferSchema、timestampFormat指定あり
df_sample = spark.read.option("header", True).option("inferSchema",True).option("timestampFormat","HH:mm:ss").csv('../../../data/sampledata.csv')

df_sample.show()
+---+-------------------+----------+-------------------+-----+
| id|         c_datetime|    c_date|             c_time|price|
+---+-------------------+----------+-------------------+-----+
| r1|2016-03-06 13:09:42|2016-03-26|1970-01-01 10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20|1970-01-01 11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19|1970-01-01 09:00:00|33600|
+---+-------------------+----------+-------------------+-----+

df_sample.printSchema()
root
 |-- id: string (nullable = true)
 |-- c_datetime: string (nullable = true)
 |-- c_date: string (nullable = true)
 |-- c_time: timestamp (nullable = true)
 |-- price: integer (nullable = true)
脚注
  1. https://stackoverflow.com/questions/70448035/issue-with-date-and-inferschema-option-in-spark-3-1 ↩︎

Discussion