Data Analysis with Ruby
Purpose of this article
• Dataset used: https://www.kaggle.com/datasets/sauravmishraa/waterneeded
• Analyze it with Ruby
about the dataset
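Difference between Soil Moisture and Soil Humidity
• Soil Moisture
is the amount of water contained in the soil, including both liquid water and vapor. Expressed as a percentage.
• Soil Humidity
indicates how damp the soil is. Expressed as a percentage. It is used in a more qualitative or descriptive sense, to say whether the soil is wet, damp, dry, or somewhere in between.
Reference: https://www.kaggle.com/datasets/sauravmishraa/waterneeded/discussion/433172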
Asking ChatGPT and working through the errors
require 'csv'
require 'daru'
# Path of the CSV file to load
file_path = 'plants.csv'
# Create a data frame with Daru
# (convert the CSV file into a Daru::DataFrame)
df = Daru::DataFrame.from_csv(file_path)
# Show basic statistics
puts "基本統計:"
puts df.describe
# Compute the average amount of water per plant type
puts "\n植物タイプごとの平均水量:"
average_water = df.group_by('Plant Type') do |group|
  group['Water Needed (Liters)'].to_a.map(&:to_f).sum / group.size
end
puts average_water
# Print the relation between sunlight hours and water needed
puts "\n日光時間と必要水量の関係:"
df['Sunlight (Hours)'].to_a.each_with_index do |sunlight, index|
  water = df['Water Needed (Liters)'][index]
  puts "日光: #{sunlight}, 水量: #{water}"
end
I got this:
No such file or directory @ rb_sysopen - plants.csv (Errno::ENOENT)
My file is named TARP.csv, not plants.csv, so I changed file_path to match.
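A small guard can make this kind of mistake easier to spot. The sketch below uses only Ruby's built-in File and Dir classes. The resolve_csv helper name is made up for illustration; when the requested file is missing, it lists the CSV files that do exist in the same directory.

```ruby
# Hypothetical helper (not part of Daru): return the path if the file exists,
# otherwise raise an error that lists the CSV files found in the same directory.
def resolve_csv(path)
  return path if File.exist?(path)

  dir = File.dirname(path)
  candidates = Dir.glob(File.join(dir, '*.csv')).map { |f| File.basename(f) }
  raise ArgumentError, "#{path} not found. CSV files here: #{candidates.inspect}"
end

begin
  file_path = resolve_csv('plants.csv')
  puts "Loading #{file_path}"
rescue ArgumentError => e
  puts e.message
end
```

If TARP.csv sits next to the script, the message names it, which points straight at the file name to use.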
I ran it again and got this:
`group_by': Vector(s) missing: Plant Type (ArgumentError)
This means that the column Plant Type does not exist.
Since there is a column called Soil Moisture, I replaced Plant Type with Soil Moisture in the code above and ran it again.
And then, I got this:
`rescue in access_vector_single_index': Specified vector Sunlight (Hours) does not exist (IndexError)
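Both errors come from asking for columns the data does not have. One way to catch that before running the analysis is to compare the names a script expects with the names that are available. This is a plain-Ruby sketch: missing_columns is a made-up helper, and the available list is a shortened stand-in for what df.vectors.to_a would return.

```ruby
# Hypothetical helper: compare the columns a script expects with the ones
# that are actually available, and return the missing names.
def missing_columns(available, required)
  required - available
end

# Column names the first script assumed (taken from the generated code).
required = ['Plant Type', 'Water Needed (Liters)', 'Sunlight (Hours)']
# A few names for illustration; with Daru these would come from df.vectors.to_a.
available = ['Soil Moisture', 'Temperature', 'ph']

missing = missing_columns(available, required)
if missing.empty?
  puts 'All required columns are present.'
else
  puts "Missing columns: #{missing.join(', ')}"
end
```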
The code ChatGPT provided was not written for the columns in my dataset, so I asked it to rewrite the code for the columns I actually have:
In this dataset, there are columns like these:
Soil Moisture: It is having details regarding different soil moisture.
Temperature: Temperature of the soil.
Soil Humidity: It is having the details of soil humidity.
Air Temperature: It is having the details of the air temp.
Wind Speed: It is having the details of speed of wind.
Air Humidity: It is having the details of air humidity.
Wind Gust: It is having the details of wind gust.
Pressure: It is having the details of air pressure (in KPa).
ph: It is having details of the ph of soil.
How can I rewrite the codes based on these columns?
And then, I got this:
require 'csv'
require 'daru'
# Path of the CSV file to load
file_path = 'TARP.csv'
# Create a data frame with Daru
df = Daru::DataFrame.from_csv(file_path)
# Show basic statistics
puts "基本統計:"
puts df.describe
# Compute the mean of Soil Moisture
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture
# Relation between Air Temperature and Air Humidity
puts "\n気温(Air Temperature)と湿度(Air Humidity)の関係:"
relation = df['Air Temperature'].zip(df['Air Humidity'])
relation.each do |temp, humidity|
  puts "気温: #{temp}, 湿度: #{humidity}"
end
# Compute statistics for Pressure
puts "\n気圧(Pressure)の基本統計:"
pressure_values = df['Pressure'].to_a.map(&:to_f)
pressure_stats = {
  mean: pressure_values.sum / pressure_values.size,
  min: pressure_values.min,
  max: pressure_values.max
}
puts pressure_stats
# Show the distribution of soil pH (ph)
puts "\n土壌pH(ph)の分布:"
ph_values = df['ph'].to_a.map(&:to_f)
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
ph_distribution.each do |ph, count|
  puts "pH: #{ph}, Count: #{count}"
end
When I ran it, I got this:
`numeric_pos': Specified index "Air Temperature" does not exist (IndexError)
along with this output:
基本統計:
#<Daru::DataFrame(5x14)>
土壌湿度(Soil Moisture)の平均値:
45.48407
気温(Air Temperature)と湿度(Air Humidity)の関係:
I'm gonna do code reading.
code reading
require 'csv'
require 'daru'
# Path of the CSV file to load
file_path = 'TARP.csv'
# Create a data frame with Daru
df = Daru::DataFrame.from_csv(file_path)
# Show basic statistics
puts "基本統計:"
puts df.describe
# Compute the mean of Soil Moisture
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture
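• require 'csv': loads the CSV library that ships with Ruby, which makes it possible to read and write CSV files.
• require 'daru': Daru stands for Data Analysis in Ruby. It provides the kind of data-analysis features that Pandas provides in Python, and require 'daru' loads it.
• The 基本統計 (basic statistics) part displays descriptive statistics with df.describe.
• Descriptive statistics give a quick overview of the data using measures such as the mean, median, and mode.
• The output is: 基本統計: #<Daru::DataFrame(5x14)>. Only this one-line summary appears, probably because puts prints the object's to_s form. Printing df.describe.inspect should show the table itself.
• The Soil Moisture part calculates the average soil moisture.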
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture
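- `.to_a` converts the column to an array
- `.map(&:to_f)` converts each value to a float (a number that can have a decimal part)
- `.sum / df.size`: the total divided by the number of rows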
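The same three steps can be tried on a plain Ruby array, without Daru. The values below are made up, and the sketch assumes that an empty CSV cell arrives as nil. Since nil.to_f is 0.0, an empty cell is counted as a zero unless it is removed first.

```ruby
# Sample values standing in for one CSV column (made up for illustration).
# nil represents an empty cell.
values = ['40.5', '50.0', nil, '45.5']

# Same steps as above: convert everything to Float, sum, divide by the count.
# nil.to_f is 0.0, so the empty cell silently pulls the mean down.
naive_mean = values.map(&:to_f).sum / values.size

# Alternative: drop empty cells first, then average what is left.
present = values.compact.map(&:to_f)
clean_mean = present.sum / present.size

puts "naive mean: #{naive_mean}"
puts "clean mean: #{clean_mean}"
```

Here the naive mean divides 136.0 by 4, while the cleaned mean divides it by 3. Which one is right depends on whether a missing reading should count as zero.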
Next, I'm gonna read through this script.
# Relation between Air Temperature and Air Humidity
puts "\n気温(Air Temperature)と湿度(Air Humidity)の関係:"
# Extracting and cleaning the data
air_temperature = df['Air temperature (C)'].to_a.map(&:to_s).reject(&:empty?)
air_humidity = df['Air humidity (%)'].to_a.map(&:to_s).reject(&:empty?)
# Ensure the arrays are the same size
min_size = [air_temperature.size, air_humidity.size].min
air_temperature = air_temperature.first(min_size)
air_humidity = air_humidity.first(min_size)
# Combine and print the relation
relation = air_temperature.zip(air_humidity)
relation.each do |temp, humidity|
  puts "気温: #{temp}, 湿度: #{humidity}"
end
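- When I ran this, I got: `<main>': undefined local variable or method `み' for main:Object (NameError)
  - Meaning: undefined local variable or method
    - Local variable: a variable defined inside a method, block, or script.
  - Apparently み is not defined... (even though nothing like that is written in the code)
  - Possible causes
    - a copy-and-paste mistake
    - encoding (the way a string is converted to bytes)
  --> For some reason, when I ran it again some time later, the error was gone.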
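If the cause was a stray character pasted into the script (only a guess, since the error went away on its own), a quick scan can find it. The sketch below is a crude check, not a Ruby parser. stray_non_ascii is a made-up helper that ignores simple string literals and comments, then reports any remaining non-ASCII characters with their line numbers.

```ruby
# Hypothetical helper: report non-ASCII characters that sit outside
# string literals and comments (a crude check, not a real Ruby parser).
def stray_non_ascii(source)
  hits = []
  source.each_line.with_index(1) do |line, number|
    code = line.gsub(/"[^"]*"|'[^']*'/, '') # drop simple string literals
    code = code.sub(/#.*$/, '')             # drop a trailing comment
    chars = code.each_char.reject(&:ascii_only?)
    hits << [number, chars.join] unless chars.empty?
  end
  hits
end

# A tiny made-up script with one stray character on its third line.
sample = <<~RUBY
  # 気温と湿度の関係
  puts "気温: 20, 湿度: 60"
  み
  puts "done"
RUBY

stray_non_ascii(sample).each do |number, chars|
  puts "line #{number}: unexpected character(s) #{chars}"
end
```

To scan a real script, pass File.read('your_script.rb') instead of the sample string. The file name here is a placeholder.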
When I ran it, the output looked like this:
基本統計:
#<Daru::DataFrame(5x14)>
土壌湿度(Soil Moisture)の平均値:
45.48407
気温(Air Temperature)と湿度(Air Humidity)の関係:
気温: , 温度:
.
.
.
日光時間と必要水量の関係:
気温: , 温度:
.
.
.
気圧(Pressure)の基本統計:
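There are two kinds of output: one where 「気温(Air Temperature)と湿度(Air Humidity)の関係:」 is printed, and one where 「日光時間と必要水量の関係:」 (the relation between sunlight hours and water needed) is printed. Why aren't both printed in the same run...?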
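The blank values in the output suggest that some cells are empty. One thing to watch in the cleaning step of the script above: it filters each column separately, so a blank in only one column shifts every later value up by one row, and temperatures can end up paired with the wrong humidity. Here is a sketch of an alternative that pairs first and filters afterwards. The sample data and the blank? helper are made up for illustration.

```ruby
# Made-up sample columns; nil and '' stand for missing cells.
air_temperature = ['20.1', nil, '22.4', '23.0']
air_humidity    = ['60', '58', '', '55']

# Hypothetical helper: true when a cell is nil or only whitespace.
def blank?(cell)
  cell.to_s.strip.empty?
end

# Pair the columns row by row first, then drop rows with a blank cell,
# so each remaining temperature stays next to its own humidity.
pairs = air_temperature.zip(air_humidity)
                       .reject { |temp, humidity| blank?(temp) || blank?(humidity) }

pairs.each do |temp, humidity|
  puts "temp: #{temp}, humidity: #{humidity}"
end
```

With per-column filtering, the same data would pair 22.4 with 58, which belong to different rows.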
11/20
When I ran the code, I got this:
`rescue in access_vector_single_index': Specified vector Soil Moisture does not exist (IndexError)
`numeric_pos': Specified index "Soil Moisture" does not exist (IndexError)
• The first error is a missing-column error. It says there is no Soil Moisture column, even though the dataset does have one.
• So I ran this to check the available columns:
puts df.vectors.to_a
• I realized that I don't need to handle pressure, air temperature, air humidity, and so on yet. First I need to finish the code for the average soil moisture.
--> this worked well😎
code. First output completed
require 'csv'
require 'daru'
# Load the dataset
file_path = 'TARP.csv'
df = Daru::DataFrame.from_csv(file_path)
# Debug: List all column names
puts "Available column names: #{df.vectors.to_a}"
# Calculate average Soil Moisture
if df.vectors.include?('Soil Moisture_1')
  valid_values = df['Soil Moisture_1'].to_a.map(&:to_f).reject(&:nan?)
  average_soil_moisture = valid_values.sum / valid_values.size
  puts "Average Soil Moisture: #{average_soil_moisture}"
else
  puts "Column 'Soil Moisture_1' not found in the dataset."
end
result:
Available column names: ["Soil Moisture_1", "Temperature", " Soil Humidity", "Time", "Air temperature (C)", "Wind speed (Km/h)", "Air humidity (%)", "Wind gust (Km/h)", "Pressure (KPa)", "ph", "rainfall", "N", "P", "K", "Status", "_1", "_2", "Soil Moisture_2"]
Average Soil Moisture: 45.48407
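The printed names explain the earlier errors: the column is called Soil Moisture_1 rather than Soil Moisture, and " Soil Humidity" starts with a space. The _1 and _2 suffixes suggest that the CSV contains the same header twice, but that is a guess from this output. The sketch below works on a shortened copy of the list with plain Ruby. matching_columns is a made-up helper, not a Daru method.

```ruby
# Column names copied from the output above (shortened).
headers = ['Soil Moisture_1', 'Temperature', ' Soil Humidity', 'ph', '_1', 'Soil Moisture_2']

# Names that carry leading or trailing whitespace.
padded = headers.reject { |h| h == h.strip }
puts "Padded names: #{padded.inspect}"

# Hypothetical helper: find the real column names matching a wanted name,
# ignoring case, surrounding whitespace, and a trailing _1, _2, ... suffix.
def matching_columns(headers, wanted)
  key = wanted.strip.downcase
  headers.select { |h| h.strip.downcase.sub(/_\d+\z/, '') == key }
end

puts matching_columns(headers, 'Soil Moisture').inspect
puts matching_columns(headers, 'soil humidity').inspect
```

The name the helper returns is the exact string to pass to df[...], including any space or suffix.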
explanations:
require 'csv'
require 'daru'
• require 'csv': loads the CSV library that ships with Ruby, which makes it possible to read and write CSV files.
• require 'daru': Daru stands for Data Analysis in Ruby. It provides the kind of data-analysis features that Pandas provides in Python, and require 'daru' loads it.
file_path = 'TARP.csv'
df = Daru::DataFrame.from_csv(file_path)
• Daru::DataFrame.from_csv(file_path) converts the CSV file into a Daru::DataFrame object, which makes the data easier to analyze.
puts "Available column names: #{df.vectors.to_a}"
• This prints all the column names.
if df.vectors.include?('Soil Moisture_1')
• Checks whether df has a column named Soil Moisture_1.
Second output
When I ran this code, I got nothing. I wanna fix this.
# Show the distribution of soil pH (ph)
puts "\n土壌pH(ph)の分布:"
ph_values = df['ph'].to_a.map(&:to_f)
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
ph_distribution.each do |ph, count|
puts "pH: #{ph}, Count: #{count}"
Reason
• ph is not accessible in the Daru::DataFrame
•ph is like this
code reading
ph_values = df['ph'].to_a.map(&:to_f)
- df['ph']: accesses the ph column
- to_a: converts its values to an array
  - an array looks like this: [1,2,3]
- map(&:to_f): ensures the values are all numeric
  - converts strings like "3.5" to 3.5
  - a string is a sequence of characters enclosed in quotes, used to represent text in programming. For example, "Hello, world!" is a string.
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
- group_by { |ph| ph.round }: groups the ph values by their rounded value
  - values such as 5.8 and 6.3 both go to 6 (6.7 would go to 7). This makes the data simpler, and it's easier to see which values are common.
- transform_values(&:size): counts the number of occurrences in each group
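The grouping step can be tried on its own with a plain array. The readings below are made up for illustration. The second form, with tally, is an equivalent shortcut available in Ruby 2.7 and later. It is shown for comparison and is not what the original script uses.

```ruby
# Made-up pH readings for illustration.
ph_values = [6.3, 6.7, 5.8, 7.2, 6.4, 7.0]

# Group by the rounded value, then count the members of each group.
by_group = ph_values.group_by { |ph| ph.round }.transform_values(&:size)

# Equivalent shortcut (Ruby 2.7+): round first, then tally.
by_tally = ph_values.map(&:round).tally

by_group.sort.each do |ph, count|
  puts "pH: #{ph}, Count: #{count}"
end
```

Sorting before printing keeps the groups in ascending pH order. Without it, they appear in the order each rounded value was first seen.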
problems
- ph exists among the columns, yet this code doesn't work. Why?
possible reasons
- there is no column called ph.
The code got fixed after I asked Copilot! Comparing the broken code with the working code, I learned this:
- index is important
- include end at the end of a block or construct (such as if and else statements)
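A missing end can be detected without running the script. The sketch below compiles two versions of the same loop and reports whether each one parses. parses? is a made-up helper, and RubyVM::InstructionSequence is specific to the standard (MRI) Ruby interpreter, so this may not work on other implementations.

```ruby
# Two versions of the same loop; only the second one closes its block.
without_end = <<~'RUBY'
  [6, 7].each do |ph|
    puts "pH: #{ph}"
RUBY

with_end = <<~'RUBY'
  [6, 7].each do |ph|
    puts "pH: #{ph}"
  end
RUBY

# Hypothetical helper: true if the source parses, false on a syntax error.
# RubyVM::InstructionSequence is specific to the standard (MRI) Ruby interpreter.
def parses?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

puts "without end: #{parses?(without_end)}"
puts "with end:    #{parses?(with_end)}"
```

From the command line, ruby -c followed by the script's file name performs the same kind of syntax check on a whole file.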
Discussion