
Data Analysis with Ruby

Published on 2024/11/20

Purpose of this article

•Dataset used: https://www.kaggle.com/datasets/sauravmishraa/waterneeded
•Analyze it with Ruby

about the dataset

difference between Soil Moisture and Soil Humidity

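Soil Moisture is the amount of water contained in the soil. It includes both liquid water and vapor. Expressed in %.
Soil Humidity indicates how damp the soil is. Expressed in %. It is used in a more qualitative or descriptive sense, indicating whether the soil is wet, damp, dry, or somewhere in between.
Reference: https://www.kaggle.com/datasets/sauravmishraa/waterneeded/discussion/433172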

Asked ChatGPT & the process of fixing errors

require 'csv'
require 'daru'

# Load the CSV file
file_path = 'plants.csv'

# Create a data frame with Daru
# Convert the CSV file into a Daru::DataFrame
df = Daru::DataFrame.from_csv(file_path)

# Show basic statistics
puts "基本統計:"
puts df.describe

# Calculate the average water amount per plant type
puts "\n植物タイプごとの平均水量:"
average_water = df.group_by('Plant Type') do |group|
  group['Water Needed (Liters)'].to_a.map(&:to_f).sum / group.size
end
puts average_water

# Print the relationship between sunlight hours and water needed
puts "\n日光時間と必要水量の関係:"
df['Sunlight (Hours)'].to_a.each_with_index do |sunlight, index|
  water = df['Water Needed (Liters)'][index]
  puts "日光: #{sunlight}, 水量: #{water}"
end

I got this:

No such file or directory @ rb_sysopen - plants.csv (Errno::ENOENT)

My file is named TARP.csv, not plants.csv, so I changed the file name.

I ran the code again, but got this:

`group_by': Vector(s) missing: Plant Type (ArgumentError)

This indicates that the column Plant Type does not exist.

Since there is a column called Soil Moisture, I reran the code above with Plant Type replaced by Soil Moisture.

Then I got this:

`rescue in access_vector_single_index': Specified vector Sunlight (Hours) does not exist (IndexError)

The code ChatGPT provided was not based on the columns I actually have, so I asked it to write code based on my columns:

In this dataset, there are columns like these:
Soil Moisture: It is having details regarding different soil moisture.

Temperature: Temperature of the soil.

Soil Humidity: It is having the details of soil humidity.

Air Temperature: It is having the details of the air temp.

Wind Speed: It is having the details of speed of wind.

Air Humidity: It is having the details of air humidity.

Wind Gust: It is having the details of wind gust.

Pressure: It is having the details of air pressure (in KPa).

ph: It is having details of the ph of soil.

How can I rewrite the codes based on these columns?

And then, I got this:

require 'csv'
require 'daru'

# Load the CSV file
file_path = 'TARP.csv'

# Create a data frame with Daru
df = Daru::DataFrame.from_csv(file_path)

# Show basic statistics
puts "基本統計:"
puts df.describe

# Calculate the average of Soil Moisture
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture

# Relationship between Air Temperature and Air Humidity
puts "\n気温(Air Temperature)と湿度(Air Humidity)の関係:"
relation = df['Air Temperature'].zip(df['Air Humidity'])
relation.each do |temp, humidity|
  puts "気温: #{temp}, 湿度: #{humidity}"
end

# Calculate statistics for Pressure
puts "\n気圧(Pressure)の基本統計:"
pressure_values = df['Pressure'].to_a.map(&:to_f)
pressure_stats = {
  mean: pressure_values.sum / pressure_values.size,
  min: pressure_values.min,
  max: pressure_values.max
}
puts pressure_stats

# Show the distribution of soil pH (ph)
puts "\n土壌pH(ph)の分布:"
ph_values = df['ph'].to_a.map(&:to_f)
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
ph_distribution.each do |ph, count|
  puts "pH: #{ph}, Count: #{count}"
end

When I ran it, I got this:

`numeric_pos': Specified index "Air Temperature" does not exist (IndexError)

plus this output:

基本統計:
#<Daru::DataFrame(5x14)>

土壌湿度(Soil Moisture)の平均値:
45.48407

気温(Air Temperature)と湿度(Air Humidity)の関係:

Next, I'll do a code reading.

code reading

require 'csv'
require 'daru'

# Load the CSV file
file_path = 'TARP.csv'

# Create a data frame with Daru
df = Daru::DataFrame.from_csv(file_path)

# Show basic statistics
puts "基本統計:"
puts df.describe

# Calculate the average of Soil Moisture
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture

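•require 'csv': loads the CSV library that ships with Ruby, which makes it possible to read and write CSV files
•require 'daru': Daru stands for Data Analysis in RUby. It provides data analysis features similar to Pandas in Python, and require 'daru' loads it
The 基本統計 (basic statistics) step displays descriptive statistics.
•Descriptive statistics provide a quick overview of the data's characteristics using measures such as the mean, median, and mode.
•The output is: 基本統計: #<Daru::DataFrame(5x14)>. Only the shape of the resulting data frame is printed here, not the statistics themselves.
The Soil Moisture step calculates the average soil moisture.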

puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture
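- `.to_a`: converts the column to an array
- `.map(&:to_f)`: converts each value to a float (a number that can have a decimal part)
- `.sum / df.size`: the total of the values divided by the number of rows (in Daru, `df.size` is the number of rows, not the number of columns)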
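As a minimal sketch of the three steps above, the same chain can be tried on a plain Ruby array without Daru. The sample values and the method name `mean_of` are made up for illustration and are not taken from TARP.csv.

```ruby
# Hypothetical helper: averages raw values the same way the script does
# (to_a -> to_f -> sum / count).
def mean_of(raw_values)
  floats = raw_values.to_a.map(&:to_f) # "40.5" -> 40.5, 30 -> 30.0
  floats.sum / floats.size             # total divided by the number of values
end

sample = ["40.5", 30, "49.5"] # made-up values
puts mean_of(sample)
```

Here the divisor is the size of the array itself, which matches `df.size` as long as the whole column is used.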

Next, I'll do a code reading of this part of the script.

# Relationship between Air Temperature and Air Humidity
puts "\n気温(Air Temperature)と湿度(Air Humidity)の関係:"

# Extracting and cleaning the data
air_temperature = df['Air temperature (C)'].to_a.map(&:to_s).reject(&:empty?) 
air_humidity = df['Air humidity (%)'].to_a.map(&:to_s).reject(&:empty?)

# Ensure the arrays are the same size
min_size = [air_temperature.size, air_humidity.size].min
air_temperature = air_temperature.first(min_size)
air_humidity = air_humidity.first(min_size)

# Combine and print the relation
relation = air_temperature.zip(air_humidity)
relation.each do |temp, humidity|
  puts "気温: #{temp}, 湿度: #{humidity}"
end

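  • When I ran this, I got this: `<main>': undefined local variable or method `み' for main:Object (NameError)
    • It means an undefined local variable or method
      • Local variable: a variable defined inside a method, a block, or the script.
    • Apparently み is not defined... (even though nothing like that is written in the code)
      • Possible causes
      • Copy-paste mistake
      • Encoding: the way strings are converted to bytes
        --> For some reason, when I ran it again some time later, the error was gone.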
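One plausible way to get this error, sketched here as an assumption rather than a confirmed diagnosis: Ruby accepts non-ASCII characters in identifiers, so a stray み that ends up outside a string or comment is parsed as a variable or method name. The helper name `run_snippet` is hypothetical.

```ruby
# Hypothetical helper: evaluates a code string and reports a NameError
# instead of crashing.
def run_snippet(source)
  eval(source)
rescue NameError => e
  "NameError: #{e.name}"
end

puts run_snippet('"み"') # inside quotes it is just a string
puts run_snippet('み')   # outside quotes it is treated as an identifier
```

This would be consistent with the copy-paste explanation in the list above: the character may have been in the file during one run and gone in the next.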

When I ran it, the output looked like this:

基本統計:
#<Daru::DataFrame(5x14)>

土壌湿度(Soil Moisture)の平均値:
45.48407

気温(Air Temperature)と湿度(Air Humidity)の関係:
気温: , 温度:
.
.
.

日光時間と必要水量の関係:
気温: , 温度:
.
.
.
気圧(Pressure)の基本統計:

There are two kinds of output: sometimes "気温(Air Temperature)と湿度(Air Humidity)の関係:" is printed, and sometimes "日光時間と必要水量の関係:" is printed. Why aren't both printed in the same run?

11/20

When I ran the code, I got this:

`rescue in access_vector_single_index': Specified vector Soil Moisture does not exist (IndexError)

 `numeric_pos': Specified index "Soil Moisture" does not exist (IndexError)

•The first error says that a column is missing: it claims there is no Soil Moisture column, even though the CSV does have one.
•So I ran this to check the available columns:

puts df.vectors.to_a

•I realized that I don't need to handle pressure, air temperature, air humidity, and so on yet. First I need to finish the code for the average of soil moisture.

--> this worked well😎

code: first output completed

require 'csv'
require 'daru'

# Load the dataset
file_path = 'TARP.csv'
df = Daru::DataFrame.from_csv(file_path)

# Debug: List all column names
puts "Available column names: #{df.vectors.to_a}"

# Calculate average Soil Moisture
if df.vectors.include?('Soil Moisture_1')
  valid_values = df['Soil Moisture_1'].to_a.map(&:to_f).reject(&:nan?)
  average_soil_moisture = valid_values.sum / valid_values.size
  puts "Average Soil Moisture: #{average_soil_moisture}"
else
  puts "Column 'Soil Moisture' not found in the dataset."
end

result:

Available column names: ["Soil Moisture_1", "Temperature", " Soil Humidity", "Time", "Air temperature (C)", "Wind speed (Km/h)", "Air humidity (%)", "Wind gust (Km/h)", "Pressure (KPa)", "ph", "rainfall", "N", "P", "K", "Status", "_1", "_2", "Soil Moisture_2"]
Average Soil Moisture: 45.48407
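The column list above explains the earlier IndexError messages: the names in the file differ from the ones used in the first scripts (for example "Air temperature (C)" rather than "Air Temperature", and " Soil Humidity" with a leading space). Below is a minimal sketch of a lookup that tolerates case and surrounding whitespace. It uses Ruby's standard CSV library on a made-up sample instead of Daru, and the names `sample_csv` and `find_column` are hypothetical.

```ruby
require 'csv'

# Made-up sample that imitates the header quirks seen in the column list.
def sample_csv
  <<~DATA
    Soil Moisture, Soil Humidity,Air temperature (C)
    54,22,19.5
    12,20,19.5
  DATA
end

# Hypothetical helper: returns the real header that matches `wanted`,
# ignoring case and surrounding whitespace, or nil when nothing matches.
def find_column(headers, wanted)
  headers.find { |h| h.to_s.strip.casecmp?(wanted.strip) }
end

headers = CSV.parse(sample_csv, headers: true).headers
puts headers.inspect
puts find_column(headers, 'soil humidity').inspect   # found despite the space
puts find_column(headers, 'Air Temperature').inspect # nil: the real name has " (C)"
```

The returned header can then be used as the key when reading the column, so the script does not depend on typing the exact spelling.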

explanations:

require 'csv'
require 'daru'

•require 'csv': loads the CSV library that ships with Ruby, which makes it possible to read and write CSV files
•require 'daru': Daru stands for Data Analysis in RUby. It provides data analysis features similar to Pandas in Python, and require 'daru' loads it

file_path = 'TARP.csv'
df = Daru::DataFrame.from_csv(file_path)

Daru::DataFrame.from_csv(file_path) converts the CSV file into a Daru::DataFrame object, which makes the data easier to analyze.

puts "Available column names: #{df.vectors.to_a}"

•This prints all of the column names

if df.vectors.include?('Soil Moisture_1')

•Checks whether df has a column named Soil Moisture_1
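One detail of the valid_values line in the script is worth checking. In Ruby, nil.to_f returns 0.0 rather than NaN, so reject(&:nan?) does not drop missing values and they are averaged in as zeros. The sketch below shows this on a made-up array, assuming empty cells arrive as nil; whether TARP.csv actually contains such cells is not something the output above shows. The method names are hypothetical.

```ruby
# Hypothetical helper: the approach used in the script.
def mean_with_to_f(values)
  floats = values.map(&:to_f).reject(&:nan?) # nil becomes 0.0 and is kept
  floats.sum / floats.size
end

# Hypothetical helper: drops nil before converting.
def mean_skipping_nil(values)
  floats = values.compact.map(&:to_f)
  floats.sum / floats.size
end

sample = [40, nil, 50] # made-up values with one missing entry
puts mean_with_to_f(sample)    # → 30.0
puts mean_skipping_nil(sample) # → 45.0
```

If the column has no missing entries, both versions give the same result.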

Second output

When I ran this code, I got no output. I want to fix this.

# Show the distribution of soil pH (ph)
puts "\n土壌pH(ph)の分布:"
ph_values = df['ph'].to_a.map(&:to_f)
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
ph_distribution.each do |ph, count|
  puts "pH: #{ph}, Count: #{count}"

Reason
ph is not accessible in the Daru::DataFrame

•ph is like this

code reading

ph_values = df['ph'].to_a.map(&:to_f)
  • df['ph']: accesses the ph column
  • to_a: converts its values to an array
    • an array looks like this: [1,2,3]
  • map(&:to_f): ensures the values are all numeric
    • converts strings like "3.5" to 3.5
      • a string is a sequence of characters enclosed in quotes, used to represent text in programming. For example, "Hello, world!" is a string.
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
  • group_by { |ph| ph.round }: groups the ph values by their rounded value
    • For example, 6.3 goes into group 6 and 6.7 goes into group 7. This makes the data simpler and makes it easier to see which values are common.
  • transform_values(&:size): counts the number of occurrences in each group
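The grouping steps above can be tried on a plain array. This is a minimal sketch with made-up pH readings, and the method name `ph_distribution` is hypothetical.

```ruby
# Hypothetical helper: counts how many values fall on each rounded pH.
def ph_distribution(ph_values)
  ph_values.group_by { |ph| ph.round }.transform_values(&:size)
end

sample = [6.3, 6.7, 7.2, 5.8] # made-up pH readings
ph_distribution(sample).sort.each do |ph, count|
  puts "pH: #{ph}, Count: #{count}"
end
```

On Ruby 2.7 or later, `ph_values.map(&:round).tally` is a shorter way to get the same counts.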

problems

  • ph exists as a column, yet this code doesn't work. Why?

possible reasons

  1. there is no column called ph.

The code is fixed now because I asked Copilot! Comparing the broken code with the working code, I learned the following:

  • index is important
  • put end at the end of a block or construct (such as an if/else statement)


reference:https://www.diffchecker.com/text-compare/

get the max, min, median

    max_soil_moisture = valid_values.max
    min_soil_moisture = valid_values.min
    median_soil_moisture = valid_values.sort[valid_values.size / 2]
    puts "Max Soil Moisture: #{max_soil_moisture}"
    puts "Min Soil Moisture: #{min_soil_moisture}"
    puts "Median Soil Moisture: #{median_soil_moisture}"
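A note on the median line: `valid_values.sort[valid_values.size / 2]` picks the middle element, which for an even number of values is the upper of the two middle values. Below is a minimal sketch of a variant that averages the two middle values in that case. The method name `median` is hypothetical and it assumes a non-empty array of numbers.

```ruby
# Hypothetical helper: median that averages the two middle values when the
# number of values is even. Assumes `values` is not empty.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

puts median([3.0, 1.0, 2.0])      # odd count: the middle value
puts median([4.0, 1.0, 3.0, 2.0]) # even count: average of the two middle values
```

For a large dataset the difference between the two definitions is usually small, so the simpler line in the script may be good enough.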

Correlation coefficient between Air humidity and Rainfall

if df.vectors.include?('Air humidity (%)') && df.vectors.include?('rainfall')
  # Convert data to arrays of floats
  air_humidity = df['Air humidity (%)'].to_a.map(&:to_f)
  rainfall = df['rainfall'].to_a.map(&:to_f)

  # Calculate correlation coefficient
  n = air_humidity.size
  sum_x = air_humidity.sum
  sum_y = rainfall.sum
  sum_xy = air_humidity.zip(rainfall).map { |x, y| x * y }.sum
  sum_x2 = air_humidity.map { |x| x**2 }.sum
  sum_y2 = rainfall.map { |y| y**2 }.sum

  numerator = n * sum_xy - sum_x * sum_y
  denominator = Math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
  correlation = numerator / denominator

  puts "Correlation coefficient between Air humidity and Rainfall: #{correlation}"
else
  puts "Required columns ('Air humidity' and 'Rainfall') are not found in the dataset."
end

explanations

  • sum_xy = air_humidity.zip(rainfall).map { |x, y| x * y }.sum
  • zip takes two arrays and combines them into one array of pairs.
    • If the air_humidity array is [a, b, c] and the rainfall array is [x, y, z], the result of zip is [[a, x], [b, y], [c, z]]
    • .map { |x, y| x * y }: [a * x, b * y, c * z]
    • .sum: for [a * x, b * y, c * z], the result is a * x + b * y + c * z.
  • sum_x2 = air_humidity.map { |x| x**2 }.sum: the sum of x^2
  • sum_y2 = rainfall.map { |y| y**2 }.sum: the sum of y^2
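To see what these sums add up to, the calculation can be run on small made-up arrays where the answer is known in advance. The sketch below wraps the same formula, r = (n * sum_xy - sum_x * sum_y) / sqrt((n * sum_x2 - sum_x^2) * (n * sum_y2 - sum_y^2)), in a hypothetical method named `pearson`. It assumes both arrays have the same length and that neither is constant.

```ruby
# Hypothetical helper: Pearson correlation coefficient of two equal-length
# arrays of floats, using the same sums as the script.
def pearson(xs, ys)
  n = xs.size
  sum_x = xs.sum
  sum_y = ys.sum
  sum_xy = xs.zip(ys).map { |x, y| x * y }.sum
  sum_x2 = xs.map { |x| x**2 }.sum
  sum_y2 = ys.map { |y| y**2 }.sum

  numerator = n * sum_xy - sum_x * sum_y
  denominator = Math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
  numerator / denominator
end

puts pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) # y rises with x
puts pearson([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]) # y falls as x rises
```

A result close to 1 means the two values rise together, a result close to -1 means one falls as the other rises, and a result near 0 means there is no clear linear relationship.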

Improvement

I couldn't understand the correlation coefficient code above, so I rewrote the script as follows.

if df.vectors.include?('Soil Moisture_1')
    valid_values = df['Soil Moisture_1'].to_a.map(&:to_f).reject(&:nan?)
    average_soil_moisture = valid_values.sum / valid_values.size
    max_soil_moisture = valid_values.max
    min_soil_moisture = valid_values.min
    median_soil_moisture = valid_values.sort[valid_values.size / 2]
    puts "Average Soil Moisture: #{average_soil_moisture}"
    puts "Max Soil Moisture: #{max_soil_moisture}"
    puts "Min Soil Moisture: #{min_soil_moisture}"
    puts "Median Soil Moisture: #{median_soil_moisture}"
else
    puts "Column 'Soil Moisture' not found in the dataset."
end
