📑

Data Analysis with Ruby

Published on 2024/11/20

Purpose of this article

•Dataset used: https://www.kaggle.com/datasets/sauravmishraa/waterneeded
•Analyze it with Ruby

about the dataset

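Difference between Soil Moisture and Soil Humidity

Soil Moisture is the amount of water contained in the soil, including both liquid water and vapor. Expressed in %.
Soil Humidity indicates how damp the soil is. Expressed in %. It is used in a more qualitative or descriptive sense, to show whether the soil is wet, damp, dry, or somewhere in between.
Reference: https://www.kaggle.com/datasets/sauravmishraa/waterneeded/discussion/433172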

Asking ChatGPT & working through the errors

require 'csv'
require 'daru'

# Load the CSV file
file_path = 'plants.csv'

# Create a data frame with Daru
# Convert the CSV file into a Daru::DataFrame
df = Daru::DataFrame.from_csv(file_path)

# Show basic statistics
puts "基本統計:"
puts df.describe

# Compute the average amount of water per plant type
puts "\n植物タイプごとの平均水量:"
average_water = df.group_by('Plant Type') do |group|
  group['Water Needed (Liters)'].to_a.map(&:to_f).sum / group.size
end
puts average_water

# Print the relationship between sunlight hours and water needed
puts "\n日光時間と必要水量の関係:"
df['Sunlight (Hours)'].to_a.each_with_index do |sunlight, index|
  water = df['Water Needed (Liters)'][index]
  puts "日光: #{sunlight}, 水量: #{water}"
end

I got this:

No such file or directory @ rb_sysopen - plants.csv (Errno::ENOENT)

My file is not plants.csv but TARP.csv, so I changed the file name.
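
The Errno::ENOENT above means Ruby could not find a file at the given path. A minimal sketch of a check that could run before loading, using only the standard library; the helper name resolve_csv is hypothetical and not part of Daru:

```ruby
# Hypothetical helper: return the path if the file exists, otherwise raise
# an error that lists the CSV files actually present in that directory.
def resolve_csv(path)
  return path if File.exist?(path)

  dir = File.dirname(path)
  candidates = Dir.glob(File.join(dir, '*.csv')).map { |f| File.basename(f) }.sort
  raise ArgumentError, "#{path} not found. CSV files in #{dir}: #{candidates.inspect}"
end

# Usage (assumes TARP.csv sits in the directory the script is run from):
# file_path = resolve_csv('TARP.csv')
```

With this, a wrong name such as plants.csv fails with a message that already shows which CSV files are available.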

I ran it again, but got this:

`group_by': Vector(s) missing: Plant Type (ArgumentError)

This indicates that the column name Plant Type does not exist.

Since there is a column called Soil Moisture, I reran the code above with Plant Type rewritten to Soil Moisture.

And then, I got this:

`rescue in access_vector_single_index': Specified vector Sunlight (Hours) does not exist (IndexError)

The code provided by ChatGPT was not based on the columns that I have, so I asked it to write code based on my columns:

In this dataset, there are columns like these:
Soil Moisture: It is having details regarding different soil moisture.

Temperature: Temperature of the soil.

Soil Humidity: It is having the details of soil humidity.

Air Temperature: It is having the details of the air temp.

Wind Speed: It is having the details of speed of wind.

Air Humidity: It is having the details of air humidity.

Wind Gust: It is having the details of wind gust.

Pressure: It is having the details of air pressure (in KPa).

ph: It is having details of the ph of soil.

How can I rewrite the codes based on these columns?

And then, I got this:

require 'csv'
require 'daru'

# Load the CSV file
file_path = 'TARP.csv'

# Create a data frame with Daru
df = Daru::DataFrame.from_csv(file_path)

# Show basic statistics
puts "基本統計:"
puts df.describe

# Compute the average of Soil Moisture
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture

# Relationship between air temperature (Air Temperature) and air humidity (Air Humidity)
puts "\n気温(Air Temperature)と湿度(Air Humidity)の関係:"
relation = df['Air Temperature'].zip(df['Air Humidity'])
relation.each do |temp, humidity|
  puts "気温: #{temp}, 湿度: #{humidity}"
end

# Compute statistics for air pressure (Pressure)
puts "\n気圧(Pressure)の基本統計:"
pressure_values = df['Pressure'].to_a.map(&:to_f)
pressure_stats = {
  mean: pressure_values.sum / pressure_values.size,
  min: pressure_values.min,
  max: pressure_values.max
}
puts pressure_stats

# Show the distribution of soil pH (ph)
puts "\n土壌pH(ph)の分布:"
ph_values = df['ph'].to_a.map(&:to_f)
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
ph_distribution.each do |ph, count|
  puts "pH: #{ph}, Count: #{count}"
end

When I ran it, I got this:

`numeric_pos': Specified index "Air Temperature" does not exist (IndexError)

plus this output:

基本統計:
#<Daru::DataFrame(5x14)>

土壌湿度(Soil Moisture)の平均値:
45.48407

気温(Air Temperature)と湿度(Air Humidity)の関係:

I'm gonna do code reading.

code reading

require 'csv'
require 'daru'

# Load the CSV file
file_path = 'TARP.csv'

# Create a data frame with Daru
df = Daru::DataFrame.from_csv(file_path)

# Show basic statistics
puts "基本統計:"
puts df.describe

# Compute the average of Soil Moisture
puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture

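•require 'csv': loads the CSV library, which ships with Ruby as a standard library, so that CSV files can be read and written.
•require 'daru': Daru stands for Data Analysis in Ruby. It provides data analysis features similar to pandas in Python, and require 'daru' loads it.
The 基本統計 (basic statistics) part displays descriptive statistics.
•Descriptive statistics give a quick overview of the data's characteristics through values such as the mean, median, and mode.
•The output is this: 基本統計:#<Daru::DataFrame(5x14)>. puts shows only a one-line summary of the resulting 5x14 DataFrame, not the statistics themselves.
The Soil Moisture part calculates the average soil moisture.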

puts "\n土壌湿度(Soil Moisture)の平均値:"
average_soil_moisture = df['Soil Moisture'].to_a.map(&:to_f).sum / df.size
puts average_soil_moisture
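- `.to_a` converts the column to an array
- `.map(&:to_f)` converts each value to a Float (a number that can have a decimal part)
- `.sum / df.size`: the sum of the values divided by the number of rows (df.size is the row count, not the column count)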
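
A small sketch of that calculation on a plain Ruby array, without Daru. The sample values are made up and the helper names naive_mean and present_mean are hypothetical. It also shows one caveat: nil.to_f and "".to_f both return 0.0, so missing cells quietly pull the average down unless they are dropped first.

```ruby
# Same steps as in the script: convert everything to Float, then divide
# the sum by the number of entries (missing cells count as 0.0).
def naive_mean(values)
  values.map(&:to_f).sum / values.size
end

# Variant: drop nil and blank cells before converting.
def present_mean(values)
  present = values.reject { |v| v.to_s.strip.empty? }
  return nil if present.empty?

  present.map(&:to_f).sum / present.size
end

raw = ['40', '50.5', nil, '60', ''] # made-up column with two missing cells
puts naive_mean(raw)   # divides by 5, missing cells counted as 0.0
puts present_mean(raw) # divides by 3, missing cells ignored
```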

Next, I'm gonna read through this part of the script.

# Relationship between air temperature (Air Temperature) and air humidity (Air Humidity)
puts "\n気温(Air Temperature)と湿度(Air Humidity)の関係:"

# Extracting and cleaning the data
air_temperature = df['Air temperature (C)'].to_a.map(&:to_s).reject(&:empty?) 
air_humidity = df['Air humidity (%)'].to_a.map(&:to_s).reject(&:empty?)

# Ensure the arrays are the same size
min_size = [air_temperature.size, air_humidity.size].min
air_temperature = air_temperature.first(min_size)
air_humidity = air_humidity.first(min_size)

# Combine and print the relation
relation = air_temperature.zip(air_humidity)
relation.each do |temp, humidity|
  puts "気温: #{temp}, 湿度: #{humidity}"
end
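
One caveat with the cleaning step above: empty cells are rejected from each column separately, so after a missing value the remaining temperatures and humidities can end up paired with the wrong row. A sketch of an alternative that pairs the values row by row first and only then drops incomplete pairs. The sample data is made up and the helper name complete_pairs is hypothetical:

```ruby
# Hypothetical helper: pair two columns row by row, then keep only the rows
# where both cells are present (not nil and not blank).
def complete_pairs(col_a, col_b)
  col_a.zip(col_b).reject do |a, b|
    a.to_s.strip.empty? || b.to_s.strip.empty?
  end
end

air_temperature = [21.5, nil, 23.0, 24.1] # made-up values
air_humidity    = [60, 55, nil, 58]

complete_pairs(air_temperature, air_humidity).each do |temp, humidity|
  puts "temp: #{temp}, humidity: #{humidity}"
end
```

Rejecting per column would instead pair 23.0 with 55, two values that come from different rows.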

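  • When I ran this, I got this: `<main>': undefined local variable or method `み' for main:Object (NameError)
    • Meaning: an undefined local variable or method.
      • Local variable: a variable defined inside a method, a block, or the script itself.
    • Apparently み is not defined... (even though nothing like that is written in the code)
      • Possible causes
      • Copy-paste mistake
      • Encoding: the way a string is converted to bytes
        --> For some reason, when I ran it again after a while, the error was gone.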

When I ran it, the output looked like this:

基本統計:
#<Daru::DataFrame(5x14)>

土壌湿度(Soil Moisture)の平均値:
45.48407

気温(Air Temperature)と湿度(Air Humidity)の関係:
気温: , 温度:
.
.
.

日光時間と必要水量の関係:
気温: , 温度:
.
.
.
気圧(Pressure)の基本統計:

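There are two kinds of runs: ones where "気温(Air Temperature)と湿度(Air Humidity)の関係:" (air temperature vs. air humidity) is printed, and ones where "日光時間と必要水量の関係:" (sunlight hours vs. water needed) is printed. Why aren't both printed in the same run...?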

11/20

When I ran the code, I got this:

`rescue in access_vector_single_index': Specified vector Soil Moisture does not exist (IndexError)

 `numeric_pos': Specified index "Soil Moisture" does not exist (IndexError)

•The first error is about a missing column: it says there is no Soil Moisture column, even though the dataset has one.
•So I ran this to check the available columns:

puts df.vectors.to_a

•I realized that I don't have to write the code for pressure, air temperature, air humidity and so on yet. First I need to finish the code for the average soil moisture.

--> this worked well😎

code: first output completed

require 'csv'
require 'daru'

# Load the dataset
file_path = 'TARP.csv'
df = Daru::DataFrame.from_csv(file_path)

# Debug: List all column names
puts "Available column names: #{df.vectors.to_a}"

# Calculate average Soil Moisture
if df.vectors.include?('Soil Moisture_1')
  valid_values = df['Soil Moisture_1'].to_a.map(&:to_f).reject(&:nan?)
  average_soil_moisture = valid_values.sum / valid_values.size
  puts "Average Soil Moisture: #{average_soil_moisture}"
else
  puts "Column 'Soil Moisture_1' not found in the dataset."
end

result:

Available column names: ["Soil Moisture_1", "Temperature", " Soil Humidity", "Time", "Air temperature (C)", "Wind speed (Km/h)", "Air humidity (%)", "Wind gust (Km/h)", "Pressure (KPa)", "ph", "rainfall", "N", "P", "K", "Status", "_1", "_2", "Soil Moisture_2"]
Average Soil Moisture: 45.48407
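
The listing above accounts for the earlier IndexErrors: no column is named exactly "Soil Moisture" or "Air Temperature". The names carry a _1/_2 suffix (apparently because the header row repeats a name), a unit such as (C), or a leading space as in " Soil Humidity". A minimal sketch of a lookup that tolerates these differences; the helper name matching_columns is hypothetical and not part of Daru:

```ruby
# Hypothetical helper: return every header that starts with the wanted name
# once case and surrounding whitespace are ignored.
def matching_columns(headers, wanted)
  key = wanted.strip.downcase
  headers.select { |h| h.to_s.strip.downcase.start_with?(key) }
end

# A few of the names printed in the result above.
headers = ['Soil Moisture_1', 'Temperature', ' Soil Humidity',
           'Air temperature (C)', 'ph', 'Soil Moisture_2']

p matching_columns(headers, 'Soil Moisture')
p matching_columns(headers, 'Air Temperature')
p matching_columns(headers, 'Soil Humidity')
```

In the script, the headers array would come from df.vectors.to_a, and the returned name (for example 'Soil Moisture_1') is what goes inside df[...].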

explanations:

require 'csv'
require 'daru'

•require 'csv': loads the CSV library, which ships with Ruby as a standard library, so that CSV files can be read and written.
•require 'daru': Daru stands for Data Analysis in Ruby. It provides data analysis features similar to pandas in Python, and require 'daru' loads it.

file_path = 'TARP.csv'
df = Daru::DataFrame.from_csv(file_path)

Daru::DataFrame.from_csv(file_path) converts the CSV file into a Daru::DataFrame object, which makes the data easier to analyze.

puts "Available column names: #{df.vectors.to_a}"

•This prints the names of all columns.

if df.vectors.include?('Soil Moisture_1')

•Checks whether df has a column named Soil Moisture_1.

Second output

When I ran this code, I got no output. I wanna fix this.

# Show the distribution of soil pH (ph)
puts "\n土壌pH(ph)の分布:"
ph_values = df['ph'].to_a.map(&:to_f)
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
ph_distribution.each do |ph, count|
  puts "pH: #{ph}, Count: #{count}"

Reason
ph is not accessible in the Daru::DataFrame

•ph is like this

code reading

ph_values = df['ph'].to_a.map(&:to_f)
  • df['ph']:access the ph column
  • to_a:converts its value to an array
    • array is like this: [1,2,3]
  • map(&:to_f):ensure they are all numeric.
    • converts strings like "3.5" to 3.5
      • a string is a sequence of characters enclosed in quotes, used to represent text in programming. For example, "Hello, world!" is a string.
ph_distribution = ph_values.group_by { |ph| ph.round }.transform_values(&:size)
  • groups the ph values by their rounded value
    • round goes to the nearest integer, so 6.3 goes to 6 and 6.7 goes to 7. By doing this, the data gets simpler, and it's easier to identify which values are common.
  • transform_values(&:size):counts the number of occurrences in each group
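
A small sketch of the grouping step on made-up pH values, without Daru. The helper names count_by_round and count_by_floor are hypothetical. It shows the difference between round (nearest integer) and floor (always down), in case the goal is to put 6.3 and 6.7 into the same bucket:

```ruby
# Count how many values fall on each integer after rounding to nearest.
def count_by_round(values)
  values.group_by { |v| v.round }.transform_values(&:size)
end

# Count how many values fall on each integer after rounding down.
def count_by_floor(values)
  values.group_by { |v| v.floor }.transform_values(&:size)
end

ph_values = [6.3, 6.7, 7.2, 5.8, 6.4] # made-up sample values

p count_by_round(ph_values) # 6.3, 5.8, 6.4 count as 6; 6.7, 7.2 count as 7
p count_by_floor(ph_values) # 6.3, 6.7, 6.4 count as 6; 7.2 as 7; 5.8 as 5

# On Ruby 2.7 or later, tally gives the same counts as count_by_round.
p ph_values.map(&:round).tally
```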

problems

  • ph exists as a column, yet this code doesn't work. Why?

possible reasons

  1. there is no column called ph.

The code is fixed now, because I asked Copilot! Comparing the wrong code with the right code, I learned this:

  • index is important
  • close every block or construct (such as do ... end loops and if/else) with end
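
A missing end is a mistake Ruby can detect without running the script: ruby -c followed by the file name checks the syntax only. The same check can be sketched from inside Ruby. RubyVM::InstructionSequence is specific to the standard CRuby interpreter, and the helper name valid_ruby? is hypothetical:

```ruby
# Hypothetical helper: true if the source parses, false on a syntax error.
# RubyVM::InstructionSequence.compile only compiles; it does not run the code.
def valid_ruby?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

# A loop like the pH one above, with its closing end missing.
broken = <<~'RUBY'
  [6, 7].each do |ph|
    puts "pH: #{ph}"
RUBY

fixed = broken + "end\n"

puts valid_ruby?(broken) # false
puts valid_ruby?(fixed)  # true
```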


reference:https://www.diffchecker.com/text-compare/
