
Performance Comparison of Python Regression Analysis Methods


Hello, I'm Dang, an AI and machine learning engineer at Knowledgelabo, Inc. We provide a service called "Manageboard," which helps aggregate, analyze, and manage corporate data scattered throughout an organization. We plan to strengthen Manageboard's AI capabilities, and in my articles I will share the challenges we encounter during our research and development.

Regression Analysis Methods to Compare

Python offers various libraries for performing regression analysis, with methods provided by sklearn, SciPy, and NumPy being particularly popular. In this article, we apply these methods to 1-dimensional (single-variable) linear regression and compare their execution times across different data sizes.

  1. scipy.stats.linregress
    This is a SciPy function for performing 1-dimensional linear regression analysis. It calculates the regression line using the least squares method based on two input arrays.
  2. numpy.polyfit
    A NumPy function used for polynomial fitting. It can also be used for 1-dimensional linear regression by specifying a degree of 1 for the polynomial.
  3. sklearn.linear_model.LinearRegression
    A class from the scikit-learn library that provides a more general linear regression model. It can handle 1-dimensional data as well, using the least squares method to estimate the parameters.
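
As a quick illustration of how the three APIs are called, the sketch below fits the same synthetic data with each one and prints the resulting slope and intercept. The data (a line with slope 2 and intercept 1 plus small uniform noise) and the fixed seed are assumptions made for this example and are not part of the benchmark.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data, assumed for illustration: y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
y = 2.0 * x + 1.0 + rng.uniform(-0.5, 0.5, size=x.size)

# 1. scipy.stats.linregress takes two 1-D arrays.
lr = stats.linregress(x, y)
slope_scipy, intercept_scipy = lr.slope, lr.intercept

# 2. numpy.polyfit with degree 1 returns [slope, intercept].
slope_numpy, intercept_numpy = np.polyfit(x, y, 1)

# 3. LinearRegression expects a 2-D feature matrix of shape (n_samples, 1).
model = LinearRegression().fit(x.reshape(-1, 1), y)
slope_sklearn, intercept_sklearn = model.coef_[0], model.intercept_

print(slope_scipy, slope_numpy, slope_sklearn)
print(intercept_scipy, intercept_numpy, intercept_sklearn)
```

All three solve the same least-squares problem, so the printed values should agree up to floating-point error. The practical differences are the input shape each one expects and the form of the return value.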

We will use these three methods to compare their performance (execution time) with different amounts of data.

Experiment Methodology

For the experiment, we will use an Amazon SageMaker ml.t3.medium instance and compare the average execution times of each method over 10 runs. The following Python code performs the regression analysis:

import numpy
from scipy import stats
from sklearn import linear_model
from random import randint
from datetime import datetime

num = 10  # Number of data points
result_list = [[], [], []]  # Elapsed seconds: linregress, polyfit, LinearRegression
regr = linear_model.LinearRegression()

# Run the experiment 10 times
for run in range(10):
    # Generate data: y = x plus integer noise in [-num/10, num/10]
    X = numpy.arange(num)
    noise_range = num // 10  # randint requires integer bounds
    Y = numpy.array([idx + randint(-noise_range, noise_range) for idx in range(num)])

    # linregress
    start = datetime.now()
    stats.linregress(X, Y)
    result_list[0].append((datetime.now() - start).total_seconds())

    # polyfit
    start = datetime.now()
    numpy.polyfit(X, Y, 1)
    result_list[1].append((datetime.now() - start).total_seconds())

    # LinearRegression
    X_2d = X.reshape((-1, 1))  # scikit-learn expects a 2D feature matrix
    start = datetime.now()
    regr.fit(X_2d, Y)
    result_list[2].append((datetime.now() - start).total_seconds())

# Average execution time in seconds for each method
print(numpy.mean(result_list, axis=1))

In this code, we start with num set to 10 and fit the same data with linregress, polyfit, and LinearRegression in each of 10 iterations. The execution time of each fit is recorded, and the average per method is computed at the end. Repeating this with larger values of num gives the measurements for the other data sizes.
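
To collect one measurement per data size without editing num by hand, the timing can be wrapped in a function and looped over sizes. The following is a minimal sketch under several assumptions: the helper names make_data and time_methods are hypothetical, time.perf_counter replaces datetime.now as the clock, and NumPy's random generator replaces the per-element randint loop, so absolute timings will not match the published numbers exactly.

```python
import time

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression


def make_data(num, rng):
    """Return x = 0..num-1 and y = x plus integer noise in [-num//10, num//10]."""
    x = np.arange(num)
    noise = rng.integers(-(num // 10), num // 10, endpoint=True, size=num)
    return x, x + noise


def time_methods(num, n_runs=10, seed=0):
    """Average wall-clock seconds per fit for each method at one data size."""
    rng = np.random.default_rng(seed)
    regr = LinearRegression()
    # Note: the reshape for LinearRegression sits inside the timed call here.
    # It returns a view rather than a copy, so its cost is small.
    methods = {
        "linregress": lambda x, y: stats.linregress(x, y),
        "polyfit": lambda x, y: np.polyfit(x, y, 1),
        "LinearRegression": lambda x, y: regr.fit(x.reshape(-1, 1), y),
    }
    totals = dict.fromkeys(methods, 0.0)
    for _ in range(n_runs):
        x, y = make_data(num, rng)
        for name, fit in methods.items():
            start = time.perf_counter()
            fit(x, y)
            totals[name] += time.perf_counter() - start
    return {name: total / n_runs for name, total in totals.items()}


if __name__ == "__main__":
    for num in (10, 100, 1_000, 10_000):
        print(num, time_methods(num))
```

time.perf_counter is a monotonic, high-resolution clock intended for measuring short durations, which matters here because several of the fits finish in well under a millisecond.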

Result

The following table shows the average execution time (in seconds) for each method with varying data sizes:

| Data Points | linregress (seconds) | polyfit (seconds) | LinearRegression (seconds) |
|---|---|---|---|
| 10 | 0.001425 | 0.000173 | 0.000877 |
| 100 | 0.001439 | 0.000325 | 0.001099 |
| 1,000 | 0.002475 | 0.000485 | 0.001729 |
| 10,000 | 0.002747 | 0.001062 | 0.001500 |
| 100,000 | 0.004208 | 0.007952 | 0.003567 |
| 1,000,000 | 0.032403 | 0.085891 | 0.039857 |
| 10,000,000 | 0.252130 | 0.908959 | 0.334037 |

  1. For Small Data Sizes (10 to 10,000 points)
    When the data size is small (up to 10,000 points), polyfit is the fastest method at every size tested. At 10 points it is about 8 times faster than linregress and about 5 times faster than LinearRegression. By 10,000 points the margin narrows to about 2.6 times and 1.4 times, respectively.
  2. For Large Data Sizes (100,000 points and above)
    At 100,000 points the ranking changes: LinearRegression (0.003567 s) and linregress (0.004208 s) are close, and polyfit (0.007952 s) is now the slowest. From 1,000,000 points upward, linregress is the fastest, about 1.2 to 1.3 times faster than LinearRegression and about 2.7 to 3.6 times faster than polyfit. The execution time of polyfit grows the most steeply of the three in this range.
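
The ratios between methods are easier to read than the raw seconds. This small sketch recomputes them from the measured values in the table; the variable names are our own, and the numbers are copied from the table without any new measurement.

```python
# Average execution times in seconds, copied from the results table.
# Column order: linregress, polyfit, LinearRegression.
methods = ("linregress", "polyfit", "LinearRegression")
timings = {
    10: (0.001425, 0.000173, 0.000877),
    100: (0.001439, 0.000325, 0.001099),
    1_000: (0.002475, 0.000485, 0.001729),
    10_000: (0.002747, 0.001062, 0.001500),
    100_000: (0.004208, 0.007952, 0.003567),
    1_000_000: (0.032403, 0.085891, 0.039857),
    10_000_000: (0.252130, 0.908959, 0.334037),
}

fastest = {}
for n, row in timings.items():
    best = min(row)
    fastest[n] = methods[row.index(best)]
    # Express every method as a multiple of the fastest time at this size.
    ratios = ", ".join(f"{m}: {t / best:.1f}x" for m, t in zip(methods, row))
    print(f"{n:>10,}  fastest={fastest[n]:<16}  {ratios}")
```

At 10,000,000 points, for example, polyfit takes about 3.6 times as long as linregress (0.908959 / 0.252130), while LinearRegression takes about 1.3 times as long.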

Conclusion

When performing regression analysis, it's important to choose the appropriate method based on the size of the data. LinearRegression is a more general-purpose method that handles both 1-dimensional and multi-dimensional data. For 1-dimensional data, however, it was slower than polyfit up to 10,000 points and slower than linregress from 1,000,000 points; it was marginally the fastest only at 100,000 points. For small amounts of data (up to 10,000 points in this test), polyfit was the fastest method for one-dimensional linear regression. For large amounts of data (1,000,000 points and above in this test), linregress was the fastest, making it a good candidate for large-scale datasets.
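
One way to act on this result is a small wrapper that picks the method from the input length. This is only a sketch: fit_line and SMALL_DATA_LIMIT are hypothetical names, and the 10,000-point cutoff is an assumption taken from this single benchmark, which has no measurements between 10,000 and 100,000 points. The crossover will likely differ on other hardware and library versions.

```python
import numpy as np
from scipy import stats

# Assumed cutoff: polyfit was the fastest up to 10,000 points in this test.
SMALL_DATA_LIMIT = 10_000


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size <= SMALL_DATA_LIMIT:
        # Small input: polyfit with degree 1 returns [slope, intercept].
        slope, intercept = np.polyfit(x, y, 1)
    else:
        # Large input: linregress returns a result object with named fields.
        result = stats.linregress(x, y)
        slope, intercept = result.slope, result.intercept
    return float(slope), float(intercept)


x = np.arange(20_000)
print(fit_line(x[:100], 3 * x[:100] + 5))  # small input, polyfit branch
print(fit_line(x, 3 * x + 5))              # large input, linregress branch
```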
