🎉
PySpark読み込み時のinferSchema optionについて
はじめに
PySparkのread時に指定するinferSchema optionの挙動について確認する。
inferSchemaのdefaultはfalseである。trueを設定した場合には、入力データより自動でスキーマを推論する。
結果
- inferSchema optionがfalseの場合、すべてのカラムはstringとなる。
- date型はサポートしていない。
- 日付もtimestampとなる。3系以降はtimestamp型を検知する場合には、timestampFormat optionも必要である。[1]
- 数値のなかに1行でも文字があるとstringになる。
読み込みデータ
"id","c_datetime","c_date","c_time","price"
r1,2016-03-06 13:09:42,2016-03-26,10:00:00,97200
r2,2016-07-16 23:39:55,2016-07-20,11:30:00,20600
r3,2016-09-24 10:03:17,2016-10-19,09:00:00,33600
inferSchema指定なし
df_sample = spark.read.option("header", True).csv('../../../data/sampledata.csv')
df_sample.show()
+---+-------------------+----------+--------+-----+
| id| c_datetime| c_date| c_time|price|
+---+-------------------+----------+--------+-----+
| r1|2016-03-06 13:09:42|2016-03-26|10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20|11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19|09:00:00|33600|
+---+-------------------+----------+--------+-----+
df_sample.printSchema()
root
|-- id: string (nullable = true)
|-- c_datetime: string (nullable = true)
|-- c_date: string (nullable = true)
|-- c_time: string (nullable = true)
|-- price: string (nullable = true)
inferSchema指定あり、timestampFormat指定なし
df_sample = spark.read.option("header", True).option("inferSchema",True).csv('../../../data/sampledata.csv')
df_sample.show()
+---+-------------------+----------+--------+-----+
| id| c_datetime| c_date| c_time|price|
+---+-------------------+----------+--------+-----+
| r1|2016-03-06 13:09:42|2016-03-26|10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20|11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19|09:00:00|33600|
+---+-------------------+----------+--------+-----+
df_sample.printSchema()
root
|-- id: string (nullable = true)
|-- c_datetime: string (nullable = true)
|-- c_date: string (nullable = true)
|-- c_time: string (nullable = true)
|-- price: integer (nullable = true)
inferSchema、timestampFormat指定あり
df_sample = spark.read.option("header", True).option("inferSchema",True).option("timestampFormat","yyyy-MM-dd HH:mm:ss").csv('../../../data/sampledata.csv')
df_sample.show()
+---+-------------------+----------+--------+-----+
| id| c_datetime| c_date| c_time|price|
+---+-------------------+----------+--------+-----+
| r1|2016-03-06 13:09:42|2016-03-26|10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20|11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19|09:00:00|33600|
+---+-------------------+----------+--------+-----+
df_sample.printSchema()
root
|-- id: string (nullable = true)
|-- c_datetime: timestamp (nullable = true)
|-- c_date: string (nullable = true)
|-- c_time: string (nullable = true)
|-- price: integer (nullable = true)
inferSchema、timestampFormat指定あり
df_sample = spark.read.option("header", True).option("inferSchema",True).option("timestampFormat","yyyy-MM-dd").csv('../../../data/sampledata.csv')
df_sample.show()
+---+-------------------+-------------------+--------+-----+
| id| c_datetime| c_date| c_time|price|
+---+-------------------+-------------------+--------+-----+
| r1|2016-03-06 13:09:42|2016-03-26 00:00:00|10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20 00:00:00|11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19 00:00:00|09:00:00|33600|
+---+-------------------+-------------------+--------+-----+
df_sample.printSchema()
root
|-- id: string (nullable = true)
|-- c_datetime: string (nullable = true)
|-- c_date: timestamp (nullable = true)
|-- c_time: string (nullable = true)
|-- price: integer (nullable = true)
inferSchema、timestampFormat指定あり
df_sample = spark.read.option("header", True).option("inferSchema",True).option("timestampFormat","HH:mm:ss").csv('../../../data/sampledata.csv')
df_sample.show()
+---+-------------------+----------+-------------------+-----+
| id| c_datetime| c_date| c_time|price|
+---+-------------------+----------+-------------------+-----+
| r1|2016-03-06 13:09:42|2016-03-26|1970-01-01 10:00:00|97200|
| r2|2016-07-16 23:39:55|2016-07-20|1970-01-01 11:30:00|20600|
| r3|2016-09-24 10:03:17|2016-10-19|1970-01-01 09:00:00|33600|
+---+-------------------+----------+-------------------+-----+
df_sample.printSchema()
root
|-- id: string (nullable = true)
|-- c_datetime: string (nullable = true)
|-- c_date: string (nullable = true)
|-- c_time: timestamp (nullable = true)
|-- price: integer (nullable = true)
Discussion