iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🦿

Variable and Function Selection for Nonlinear Multiple Regression Analysis using Genetic Algorithms

に公開

What is Regression Analysis?

Regression analysis is a method of modeling that aims to derive an objective variable from other parameters (explanatory variables). For instance, a person's height is thought to be correlated with their father's height due to genetics. In this case, we consider estimating the son's height as the "objective variable" based on the father's height.
First, we collect data by surveying the heights of many fathers (x(i)) and their sons (y(i)) ( (x(1),y(1)), (x(2),y(2)), (x(3),y(3)), ... ). If we plot this, it might look like the figure below.
axis+data.png

At this point, we can infer that there is a relationship similar to the straight line shown below. Expressed as a formula, it is y=ax+b.
axis+data+line.png

So, which a and b should we choose? The best values are those that minimize the error. Specifically, we want the sum of the squared differences ( residual sum of squares ) between the y values calculated using y=ax+b (theoretical values) and the actual y values to be minimal. Please refer to reference [1] for detailed calculation methods.

What is Multiple Regression Analysis?

In reality, a person's height is influenced by various factors, not just their father's height, but also their mother's height, what they ate during childhood, exercise habits, and so on. Thinking about a model that derives an objective variable (in this case, height) from such a large number of variables is called multiple regression analysis.
The way of thinking remains the same even if the number of variables increases; we are still looking for y=ax1+bx2+cx3+.... that minimizes the residual sum of squares. Fortunately, Java has libraries available, so if you just want to perform the analysis, you can achieve it by simply inputting data into them.

OLSMultipleLinearRegression regression = new OLSMultipleLinearRegression();
//Weight
double[] y = new double[] { 50, 60, 65, 65, 70, 75, 80, 85, 90, 95 };
//Height, Waist
double[][] x = new double[10][];
x[0] = new double[] { 165, 65 };
x[1] = new double[] { 170, 68 };
x[2] = new double[] { 172, 70 };
x[3] = new double[] { 175, 65 };
x[4] = new double[] { 170, 80 };
x[5] = new double[] { 172, 85 };
x[6] = new double[] { 183, 78 };
x[7] = new double[] { 187, 79 };
x[8] = new double[] { 180, 95 };
x[9] = new double[] { 185, 97 };
regression.newSampleData(y, x);

Quoted from Multiple Regression Analysis in Java: Reading Think Stats

Problems with Regression Analysis

Cases of Non-linear Relationships

Until now, we have implicitly assumed that the relationship between variables is linear (y=ax+b). However, in reality, there might be cases where the relationship is y=ax^2. What should be done in such cases? One method is to create a new variable x^2=:z, which results in a linear relationship y=az, allowing for the use of previous regression analysis techniques. However, this method cannot be used unless the functional form can be predicted to some extent.

The Problem of Overfitting

Consider the initial thought of fitting a straight line to the following graph.
axis+data+line.png

However, it is also possible to fit a complex function as shown below.
axis+data+kyokusen.png

This graph overlaps completely with the data points, resulting in zero error. Is this a better model? No, while it certainly has a small error for existing data, its predictive performance for unknown data might decrease.
axis+data+line+newData.png
axis+data+kyokusen+newData.png

This is true not only for regression analysis but for machine learning in general: there is a problem called overfitting (over-learning), where generally, as a model becomes more complex, it fits existing data better but fails to fit unknown data.

In multiple regression analysis, as variables are added, the error with existing data decreases, but the predictive performance for new data decreases accordingly. Conversely, reducing the variables too much, which increases the error ( residual sum of squares ), is also a problem. We must consider which variables to use while balancing both. Fortunately, there is an index called AIC (Akaike Information Criterion), and the smaller it is, the better.

AIC=nlog(SS/n)+2(q+1)(n is the number of data, SS is the residual sum of squares, q is the number of variables)

Summary of Regression Analysis Problems

In the end, we can see that non-linear multiple regression analysis has two problems:

  1. Which variables should be used?
  2. Assuming they are used, what functional forms (x^2, log(x), exp(x)... etc.) should be used?
    In the following, we consider solving this using a genetic algorithm.

What is a Genetic Algorithm?

Genetic algorithms are a method of exploring solutions by representing the desired solution in the form of "genes" and simulating biological survival competition. For detailed schools of thought and explanations, please see Wikipedia. In the following, I will explain my program while providing concrete figures.

How to Represent Solutions

First, consider a way to represent the solution in the form of genes. For example, if you don't use a variable, express it as Unused; if you use it as it is, as Linear; if you use it in squared form, as Squared; if log, as Log. When the formula shape you want is y=ax1+cx3^2+dlog(x4) (x2 is unused), the solution will be represented by the "gene" "Linear, Unused, Squared, Log".

Population Generation and "Survival Competition"

First, randomly generate 500 gene sets (individuals). Calculate the AIC mentioned earlier for each, and order them from smallest to largest. The top 5 (1%) are allowed to survive unconditionally, and for the rest, the higher the rank, the higher the probability of survival. This determines the "survivors".

Crossover

Once the "survivors" are determined, randomly select two individuals from them and perform a "crossover". Specifically, select two points from the gene sequence and exchange the genes between them.
2点交叉.png
(Image reprinted from Wikipedia)
This process creates the "next generation", and we repeat this until the population size reaches 500 again.
And excluding the top 5 individuals, we cause a "mutation" (changing one of the genes randomly to another) with a 10% probability. A large random factor allows for exploration in a wide range with fewer loops; however, it risks "mutating" the "good solutions" we have worked hard to find, so we ensure the "top individuals" are not mutated.

Generation Change

Calculate the AIC again, sort them from smallest, and return to the first "survival competition". Repeat this until the best AIC value stops changing.

Scale Factor

Now, while omitted above to simplify the explanation, problems different from the usual occur with functions like exp(x). For example, when the objective variable's value changes significantly due to a variable's change, we can expect the function to be in the form of exp(x). However, here, if the value range of x is 0.001 to 0.004, exp(x) becomes nearly 1, which is no different from a constant function. Thus, in practice, if we put it in the form of exp(ax) and set a=1000, etc., the objective variable will fluctuate properly according to the change in x. We will call such an 'a' a scale factor hereafter.
The scale factor is determined by a genetic algorithm as follows:
0. First, find the median of the absolute values of variable x, and consider 'a' such that ax=1 when x is that value (i.e., the reciprocal of the median).

  1. Next, determine a value randomly within 1/10 to 10 times that value.
  2. In the case of "crossover", if the counterpart's gene is the same function type (Exp(a) and Exp(b)), create the next generation having the average ((a+b)/2) and sum (a+b) of those scale factors.

With the above method, complex functions can also be explored using genetic algorithms.

Source Code

The program that performs the above is uploaded to GitHub, so if you are interested, please try using it.
https://github.com/lamrongol/genetic_regression

Points to Note

Although I wrote "non-linear" in the title, it actually assumes a linear combination like y=ax_1+cx_3^2+dlog(x_4). It cannot estimate equations including multiplication, such as y=ax_1*bx_2/dlog(x_3*x_4). However, for example, if it contained division, as that increases, the value decreases, so it might be possible to approximate it by making the coefficient a large negative number.

脚注
  1. Hideki Toyoda, "Introduction to Regression Analysis" (Tokyo Shoseki) ↩︎

Discussion