Data Analysis with Ruby
Purpose of this article
• Dataset used: https://www.kaggle.com/datasets/sauravmishraa/waterneeded
• Analyze it with Ruby
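About the dataset
Difference between Soil Moisture and Soil Humidity
• Soil Moisture is the amount of water contained in the soil, counting both liquid water and vapor. Expressed in %.
• Soil Humidity indicates how damp the soil is. Expressed in %. The term is used in a more qualitative or descriptive sense, to indicate whether the soil is wet, moist, dry, or somewhere in between.
Reference: https://www.kaggle.com/datasets/sauravmishraa/waterneeded/discussion/433172
Asking ChatGPT, and working through the errors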
require 'csv'
require 'daru'
# Path to the CSV file
file_path = 'plants.csv'
# Load the CSV file into a Daru::DataFrame
df = Daru::DataFrame.from_csv(file_path)
# Show basic statistics (基本統計)
puts "基本統計:"
puts df.describe
# Average amount of water needed per plant type
puts "\n植物タイプごとの平均水量:"
average_water = df.group_by('Plant Type') do |group|
  group['Water Needed (Liters)'].to_a.map(&:to_f).sum / group.size
end
puts average_water
# Print sunlight hours against water needed
puts "\n日光時間と必要水量の関係:"
df['Sunlight (Hours)'].to_a.each_with_index do |sunlight, index|
  water = df['Water Needed (Liters)'][index]
  puts "日光: #{sunlight}, 水量: #{water}"
end
I got this:
No such file or directory @ rb_sysopen - plants.csv (Errno::ENOENT)
My file is named TARP.csv, not plants.csv, so I changed the file name and ran it again, but got this:
`group_by': Vector(s) missing: Plant Type (ArgumentError)
This indicates that the column Plant Type does not exist.
Since there is a column called Soil Moisture, I replaced Plant Type with Soil Moisture in the code above and ran it again.
Then I got this:
`rescue in access_vector_single_index': Specified vector Sunlight (Hours) does not exist (IndexError)
The code ChatGPT provided was not based on the columns in my dataset, so I asked it to write code based on the columns I actually have.
In this dataset, there are columns like these:
Soil Moisture: It is having details regarding different soil moisture.
Temperature: Temperature of the soil.
Soil Humidity: It is having the details of soil humidity.
Air Temperature: It is having the details of the air temp.
Wind Speed: It is having the details of speed of wind.
Air Humidity: It is having the details of air humidity.
Wind Gust: It is having the details of wind gust.
Pressure: It is having the details of air pressure (in KPa).
ph: It is having details of the ph of soil.
How can I rewrite the codes based on these columns?
And then, I got this:
require 'csv'
require 'daru'
# Path to the CSV file
file_path = 'TARP.csv'
# Create a data frame with Daru
df = Daru::DataFrame.from_csv(file_path)
# Show basic statistics (基本統計)
puts "基本統計:"
puts df.describe
# Calculate the mean of Soil Moisture
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture
# Relationship between Air Temperature and Air Humidity
puts "\n気温(Air Temperature)と湿度(Air Humidity)の関係:"
relation = df['Air Temperature'].zip(df['Air Humidity'])
relation.each do |temp, humidity|
  puts "気温: #{temp}, 湿度: #{humidity}"
end
# Basic statistics for Pressure
puts "\n気圧(Pressure)の基本統計:"
pressure_values = df['Pressure'].to_a.map(&:to_f)
pressure_stats = {
  mean: pressure_values.sum / pressure_values.size,
  min: pressure_values.min,
  max: pressure_values.max
}
puts pressure_stats
# Show the distribution of soil pH (ph)
puts "\n土壌pH(ph)の分布:"
ph_values = df['ph'].to_a.map(&:to_f)
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
ph_distribution.each do |ph, count|
  puts "pH: #{ph}, Count: #{count}"
end
When I ran it, I got this:
`numeric_pos': Specified index "Air Temperature" does not exist (IndexError)
along with this output:
基本統計:
#<Daru::DataFrame(5x14)>
土壌湿度(Soil Moisture)の平均値:
45.48407
気温(Air Temperature)と湿度(Air Humidity)の関係:
Next, I'll read through the code.
Code reading
require 'csv'
require 'daru'
# Path to the CSV file
file_path = 'TARP.csv'
# Create a data frame with Daru
df = Daru::DataFrame.from_csv(file_path)
# Show basic statistics (基本統計)
puts "基本統計:"
puts df.describe
# Calculate the mean of Soil Moisture
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture
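• require 'csv': loads the CSV library that ships with Ruby's standard library, which makes it possible to read and write CSV files.
• require 'daru': Daru stands for Data Analysis in Ruby. It provides the features needed for data analysis, much like Pandas does in Python, and require 'daru' loads it.
• The 基本統計 (basic statistics) part displays descriptive statistics.
• Descriptive statistics give a quick overview of the data's characteristics using measures such as the mean, median, and mode.
• The output is: 基本統計: #<Daru::DataFrame(5x14)>. puts calls to_s, which for a Daru::DataFrame returns only this one-line summary of its shape; as far as I can tell, puts df.describe.inspect prints the table itself.
• The Soil Moisture part calculates the average soil moisture.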
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture
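- `.to_a` converts the column to an array
- `.map(&:to_f)` converts each value to a float (a number that can have a decimal part)
- `.sum / df.size`: the sum of the values divided by the number of rows (df.size is the row count, not the column count)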
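The three steps above can be tried on a plain array without daru. This is only a sketch: mean_of is a hypothetical helper (not part of daru), and the sample values are made up. It also shows one thing to watch for: nil.to_f and "".to_f both become 0.0, so blank cells quietly pull the average down unless they are dropped first.

```ruby
# Hypothetical helper (not a daru method): mean of raw column values.
def mean_of(raw_values)
  # Drop nil and blank cells BEFORE converting, because
  # nil.to_f and "".to_f both silently become 0.0.
  numbers = raw_values.reject { |v| v.nil? || v.to_s.strip.empty? }.map(&:to_f)
  return nil if numbers.empty?

  numbers.sum / numbers.size
end

# Made-up sample: two strings, one integer, two blank cells.
raw = ["40.5", 50, nil, "", "59.5"]

# Same pattern as the article: blanks are counted as 0.0.
puts raw.map(&:to_f).sum / raw.size  # 30.0

# Blanks skipped: only the three real values are averaged.
puts mean_of(raw)                    # 50.0
```

Whether from_csv actually produces nil or empty strings for blank cells in TARP.csv is an assumption here, so it is worth checking the raw values before trusting either number.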
Next, I'll read through this script.
# Relationship between Air Temperature and Air Humidity
puts "\n気温(Air Temperature)と湿度(Air Humidity)の関係:"
# Extracting and cleaning the data
air_temperature = df['Air temperature (C)'].to_a.map(&:to_s).reject(&:empty?)
air_humidity = df['Air humidity (%)'].to_a.map(&:to_s).reject(&:empty?)
# Ensure the arrays are the same size
min_size = [air_temperature.size, air_humidity.size].min
air_temperature = air_temperature.first(min_size)
air_humidity = air_humidity.first(min_size)
# Combine and print the relation
relation = air_temperature.zip(air_humidity)
relation.each do |temp, humidity|
  puts "気温: #{temp}, 湿度: #{humidity}"
end
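- When I ran this, I got: `<main>': undefined local variable or method `み' for main:Object (NameError)
  - In other words: undefined local variable or method.
    - Local variable: a variable defined inside a method, block, or script.
    - Apparently み is not defined... (even though the code contains no such name)
  - Possible causes
    - A copy-paste mistake
    - Encoding: how a string is converted into bytes
  --> For some reason, when I ran it again some time later, the error was gone.
When I ran it, the output looked like this: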
基本統計:
#<Daru::DataFrame(5x14)>
土壌湿度(Soil Moisture)の平均値:
45.48407
気温(Air Temperature)と湿度(Air Humidity)の関係:
気温: , 温度:
.
.
.
日光時間と必要水量の関係:
気温: , 温度:
.
.
.
気圧(Pressure)の基本統計:
There are two kinds of runs: one where "気温(Air Temperature)と湿度(Air Humidity)の関係:" is printed and one where "日光時間と必要水量の関係:" is printed. Why aren't both printed in a single run?
11/20
When I ran the code, I got this:
`rescue in access_vector_single_index': Specified vector Soil Moisture does not exist (IndexError)
`numeric_pos': Specified index "Soil Moisture" does not exist (IndexError)
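• The first error reports a missing column: it says there is no Soil Moisture column. However, the CSV does have one. (The column list printed further down shows it was loaded as Soil Moisture_1; daru appears to add a numeric suffix when a header name occurs more than once.)
• So I ran this to check the available columns: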
puts df.vectors.to_a
• I realized I don't need to handle pressure, air temperature, air humidity, and so on yet. First I need to finish the code for the average soil moisture.
--> This worked well 😎
Code: first output complete
require 'csv'
require 'daru'
# Load the dataset
file_path = 'TARP.csv'
df = Daru::DataFrame.from_csv(file_path)
# Debug: List all column names
puts "Available column names: #{df.vectors.to_a}"
# Calculate average Soil Moisture
if df.vectors.include?('Soil Moisture_1')
  valid_values = df['Soil Moisture_1'].to_a.map(&:to_f).reject(&:nan?)
  average_soil_moisture = valid_values.sum / valid_values.size
  puts "Average Soil Moisture: #{average_soil_moisture}"
else
  puts "Column 'Soil Moisture_1' not found in the dataset."
end
Result:
Available column names: ["Soil Moisture_1", "Temperature", " Soil Humidity", "Time", "Air temperature (C)", "Wind speed (Km/h)", "Air humidity (%)", "Wind gust (Km/h)", "Pressure (KPa)", "ph", "rainfall", "N", "P", "K", "Status", "_1", "_2", "Soil Moisture_2"]
Average Soil Moisture: 45.48407
Explanations:
require 'csv'
require 'daru'
• require 'csv': loads the CSV library that ships with Ruby's standard library, which makes it possible to read and write CSV files.
• require 'daru': Daru stands for Data Analysis in Ruby. It provides the features needed for data analysis, much like Pandas does in Python, and require 'daru' loads it.
file_path = 'TARP.csv'
df = Daru::DataFrame.from_csv(file_path)
• Daru::DataFrame.from_csv(file_path) converts the CSV file into a Daru::DataFrame object, which makes the data easier to analyze.
puts "Available column names: #{df.vectors.to_a}"
• This prints all the column names.
if df.vectors.include?('Soil Moisture_1')
• Checks whether df has a column named Soil Moisture_1.
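The same kind of check can be tried on a plain Ruby array. The sketch below reuses the column names from the "Available column names" output above and a hypothetical helper, columns_like (not part of daru), to find every column whose name starts with a given label, ignoring case and stray spaces such as the one in front of " Soil Humidity".

```ruby
# Hypothetical helper (not a daru method): every column name that starts
# with the given label, ignoring case and leading/trailing spaces.
def columns_like(columns, label)
  wanted = label.strip.downcase
  columns.select { |name| name.strip.downcase.start_with?(wanted) }
end

# Column names copied from the "Available column names" output.
columns = ["Soil Moisture_1", "Temperature", " Soil Humidity", "Time",
           "Air temperature (C)", "Wind speed (Km/h)", "Air humidity (%)",
           "Wind gust (Km/h)", "Pressure (KPa)", "ph", "rainfall",
           "N", "P", "K", "Status", "_1", "_2", "Soil Moisture_2"]

p columns_like(columns, 'Soil Moisture')    # the two Soil Moisture_* columns
p columns_like(columns, 'air temperature')  # matches "Air temperature (C)"
p columns_like(columns, 'Plant Type')       # empty: no such column
```

Running a lookup like this before df['...'] would have pointed to the real names (Soil Moisture_1, Air temperature (C)) instead of raising an IndexError.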
Second output
When I ran this code, nothing was printed. I want to fix this.
# Show the distribution of soil pH (ph)
puts "\n土壌pH(ph)の分布:"
ph_values = df['ph'].to_a.map(&:to_f)
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
ph_distribution.each do |ph, count|
puts "pH: #{ph}, Count: #{count}"
Reason
• ph is not accessible in the Daru::DataFrame
•ph is like this
Code reading
ph_values = df['ph'].to_a.map(&:to_f)
- df['ph']: accesses the ph column
- to_a: converts its values to an array
  - an array looks like this: [1,2,3]
- map(&:to_f): ensures they are all numeric
  - converts strings like "3.5" to 3.5
  - a string is a sequence of characters enclosed in quotes, used to represent text in programming. For example, "Hello, world!" is a string.
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
- group_by { |ph| ph.round }: groups the ph values by their rounded value
  - For example, 6.3 goes to 6 and 6.7 goes to 7. This makes the data simpler, and it's easier to see which values are common.
- transform_values(&:size): counts the number of occurrences in each group
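To see what group_by and transform_values do at each step, here is a small sketch on a made-up array of pH readings. rounded_distribution is a hypothetical name used only for this illustration.

```ruby
# Hypothetical helper: count how many values round to each integer.
def rounded_distribution(values)
  values.group_by { |v| v.round }.transform_values(&:size)
end

# Made-up pH readings.
sample = [6.3, 6.7, 7.2, 6.4]

# Step 1: group_by builds a hash of rounded value => original values.
# Group 6 holds 6.3 and 6.4; group 7 holds 6.7 and 7.2.
p sample.group_by { |v| v.round }

# Step 2: transform_values(&:size) replaces each list with its length,
# so both group 6 and group 7 end up with a count of 2.
p rounded_distribution(sample)
```

On Ruby 2.7 and later, values.map(&:round).tally gives the same counts in one step.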
Problems
- ph exists as a column, yet this code doesn't work. Why?
Possible reasons
- There is no column called ph.
The code is fixed now, because I asked Copilot! Comparing the broken code with the working code, I learned this:
- index is important
- include end at the end of a block or construct (such as if and else statements)
Reference: https://www.diffchecker.com/text-compare/
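A missing end can also be caught without running the script. The sketch below is one possible way, assuming CRuby (MRI): RubyVM::InstructionSequence.compile parses a string of Ruby source and raises SyntaxError when it does not parse. syntax_ok? is a hypothetical helper name, and the two snippets are made up for the illustration.

```ruby
# Hypothetical helper: true when the source parses, false on a syntax error.
# RubyVM::InstructionSequence is specific to CRuby (MRI).
def syntax_ok?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

# A loop that is missing its closing `end`.
missing_end = "[6.3, 6.7].each do |ph|\n  puts ph\n"

# The same loop with `end` added.
with_end = missing_end + "end\n"

puts syntax_ok?(missing_end)  # false
puts syntax_ok?(with_end)     # true
```

From a terminal, ruby -c followed by the script's file name performs the same syntax-only check on a whole file.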
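Get the max, min, and median
max_soil_moisture = valid_values.max
min_soil_moisture = valid_values.min
# Median: the middle value, or the mean of the two middle values for an even count
sorted_values = valid_values.sort
mid = sorted_values.size / 2
median_soil_moisture = sorted_values.size.odd? ? sorted_values[mid] : (sorted_values[mid - 1] + sorted_values[mid]) / 2.0
puts "Max Soil Moisture: #{max_soil_moisture}"
puts "Min Soil Moisture: #{min_soil_moisture}"
puts "Median Soil Moisture: #{median_soil_moisture}"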
Correlation coefficient between Air humidity and Rainfall
if df.vectors.include?('Air humidity (%)') && df.vectors.include?('rainfall')
  # Convert data to arrays of floats
  air_humidity = df['Air humidity (%)'].to_a.map(&:to_f)
  rainfall = df['rainfall'].to_a.map(&:to_f)
  # Calculate correlation coefficient
  n = air_humidity.size
  sum_x = air_humidity.sum
  sum_y = rainfall.sum
  sum_xy = air_humidity.zip(rainfall).map { |x, y| x * y }.sum
  sum_x2 = air_humidity.map { |x| x**2 }.sum
  sum_y2 = rainfall.map { |y| y**2 }.sum
  numerator = n * sum_xy - sum_x * sum_y
  denominator = Math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
  correlation = numerator / denominator
  puts "Correlation coefficient between Air humidity and Rainfall: #{correlation}"
else
  puts "Required columns ('Air humidity' and 'Rainfall') are not found in the dataset."
end
Explanations
sum_xy = air_humidity.zip(rainfall).map { |x, y| x * y }.sum
- zip takes two arrays and combines them into one array of pairs.
  - If air_humidity is [a, b, c] and rainfall is [x, y, z], the result of zip is [[a, x], [b, y], [c, z]].
- .map { |x, y| x * y }: [a * x, b * y, c * z]
- .sum: for [a * x, b * y, c * z], the sum is a * x + b * y + c * z.
- sum_x2 = air_humidity.map { |x| x**2 }.sum: the sum of x^2
- sum_y2 = rainfall.map { |y| y**2 }.sum: the sum of y^2
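The formula is easier to follow with tiny inputs where the answer is known in advance. The sketch below wraps the same sums in a hypothetical method, pearson (a name chosen for this illustration), and adds one guard the code above lacks: when every value in a column is identical, the denominator is zero, so it returns nil instead of dividing. The sample arrays are made up.

```ruby
# Hypothetical helper: Pearson correlation using the same sums as above.
def pearson(xs, ys)
  raise ArgumentError, 'arrays must have the same size' unless xs.size == ys.size

  n = xs.size
  sum_x = xs.sum.to_f
  sum_y = ys.sum.to_f
  sum_xy = xs.zip(ys).sum { |x, y| x * y }  # a*x + b*y + c*z
  sum_x2 = xs.sum { |x| x**2 }
  sum_y2 = ys.sum { |y| y**2 }

  numerator = n * sum_xy - sum_x * sum_y
  denominator = Math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
  # A constant column has no spread, so the correlation is undefined.
  return nil if denominator.zero?

  numerator / denominator
end

puts pearson([1, 2, 3], [2, 4, 6])  # 1.0  (y rises exactly with x)
puts pearson([1, 2, 3], [3, 2, 1])  # -1.0 (y falls exactly as x rises)
p pearson([1, 2, 3], [5, 5, 5])     # nil  (y never changes)
```

A result near 1 means the two columns rise together, near -1 means one falls as the other rises, and near 0 means no linear relationship.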
Improvement
I couldn't understand the correlation coefficient code above, so I rewrote it as follows.
if df.vectors.include?('Soil Moisture_1')
  valid_values = df['Soil Moisture_1'].to_a.map(&:to_f).reject(&:nan?)
  average_soil_moisture = valid_values.sum / valid_values.size
  max_soil_moisture = valid_values.max
  min_soil_moisture = valid_values.min
  # Median: the middle value, or the mean of the two middle values for an even count
  sorted_values = valid_values.sort
  mid = sorted_values.size / 2
  median_soil_moisture = sorted_values.size.odd? ? sorted_values[mid] : (sorted_values[mid - 1] + sorted_values[mid]) / 2.0
  puts "Average Soil Moisture: #{average_soil_moisture}"
  puts "Max Soil Moisture: #{max_soil_moisture}"
  puts "Min Soil Moisture: #{min_soil_moisture}"
  puts "Median Soil Moisture: #{median_soil_moisture}"
else
  puts "Column 'Soil Moisture_1' not found in the dataset."
end
Discussion