Consider the case where y is determined once x is determined, as in "y=f(x)".

This is treated separately from "y=f(x,ε)", which includes a chance term.

The task is to inductively infer y at a given x when pairs of x and y are given as data.

As the simplest example, consider the case where no data is given.

Since there is no data to compare x against, being given x is no different from not being given it.

The result is exactly the same as in the case of "Perfect quality inductive inference. y=f(ε)".

That is, "unknown" = 100%.

Consider the case where y=f(x) and only one point of data is given.

Example: data = (x1,y1). Prediction target = (x0, y0).

If x0=x1, then y0=y1 since it matches the given data.

If x0≠x1, then y0="unknown" for now.

However, as in the case of "y=f(ε)", the unknown value can be said to be more probable the closer it is to y1.

Also, for an ordinary function (one that is continuous at x1), y0→y1 in the limit x0→x1.

Therefore, the closer x0 is to x1, the higher the probability that y0 is close to y1.

When x0a is closer to x1 than x0b, the probability Pa that f(x0a) is closer to f(x1) is greater than the probability Pb that f(x0b) is closer to f(x1).

|x0a-x1| < |x0b-x1| → Pa{ |f(x0a)-f(x1)| < |f(x0b)-f(x1)| } > Pb{ |f(x0a)-f(x1)| > |f(x0b)-f(x1)| }

I can only state which probability is larger; I cannot say what the percentages are.

Consider the case where y=f(x) and two data points are given.

Example: Data = (x1,y1),(x2,y2). Prediction target = (x0, y0).

There are two ways of thinking.

The first is to guess by drawing a straight line through the two points.

However, even here, only the magnitude relationship of the probabilities can be stated, not their values.

If x0=(x1+x2)/2, then y0=(y1+y2)/2 is the most likely value, but I cannot say with what percentage.

The second is to take the weighted average of y, weighted by the reciprocal of the distance in x.

The two methods yield the same results when interpolating, but different results when extrapolating.
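
The agreement on interpolation and the disagreement on extrapolation can be checked numerically. The following is a minimal sketch, not a definitive implementation; the function names and the sample points are assumptions made for illustration.

```python
def line_estimate(x0, p1, p2):
    """First method: read y0 off the straight line through the two data points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)  # cannot be computed when x1 == x2
    return y1 + slope * (x0 - x1)


def inverse_distance_estimate(x0, p1, p2):
    """Second method: average y with weights 1/|x - x0|."""
    (x1, y1), (x2, y2) = p1, p2
    d1, d2 = abs(x1 - x0), abs(x2 - x0)
    if d1 == 0:  # x0 coincides with a data point
        return y1
    if d2 == 0:
        return y2
    w1, w2 = 1 / d1, 1 / d2
    return (w1 * y1 + w2 * y2) / (w1 + w2)


# Hypothetical data: two interpolation targets and one extrapolation target.
p1, p2 = (1.0, 10.0), (3.0, 20.0)
for x0 in (1.5, 2.0, 5.0):
    print(x0, line_estimate(x0, p1, p2), inverse_distance_estimate(x0, p1, p2))
```

Between x1 and x2 the two columns agree up to rounding. At x0=5.0 the straight line keeps rising, while the weighted average stays between y1 and y2.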

The first method cannot be used if the slope cannot be calculated.

The second method can be applied to non-numeric scales, because it only mixes the y values according to how similar the x values are.

Because the second method is the more widely applicable, it can be used on its own.

With a suitable change of coordinates, the second method also yields the same estimates as the first when extrapolating.

Rotate the x,y coordinates to x',y' coordinates.

Choose the rotation so that, in x versus y', the line through the two data points has slope 0.

With slope 0, y' is 0 anywhere on the line, whether extrapolating or interpolating, so both cases are inferred in the same way.

y' indicates how far a point deviates from that straight line.

Since the two existing points share the property y'=0, we guess that the third point will also have y'=0.

Even if you change the way you take coordinates or change the unit, the inference result should be the same because the essence does not change.
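
The idea of inferring y' instead of y can be sketched as follows. This is an illustration under a stated simplification: instead of a true rotation, it subtracts the reference line from y (so x is left unchanged), which gives the same property y'=0 on the line. The function name and the sample numbers are assumptions.

```python
def detrended_estimate(x0, points, slope, intercept):
    """Inverse-distance average of y' = y - (slope*x + intercept), mapped back to y."""
    num = den = 0.0
    for x, y in points:
        d = abs(x - x0)
        if d == 0:  # x0 coincides with a data point
            return y
        deviation = y - (slope * x + intercept)  # y' for this data point
        num += deviation / d
        den += 1 / d
    # Estimated y' at x0, added back onto the reference line.
    return slope * x0 + intercept + num / den


# Hypothetical data lying on y = 5x + 5, so both points have y' = 0.
points = [(1.0, 10.0), (3.0, 20.0)]
print(detrended_estimate(2.0, points, slope=5.0, intercept=5.0))  # interpolation
print(detrended_estimate(5.0, points, slope=5.0, intercept=5.0))  # extrapolation
```

Because every data point has y'=0, the weighted average of y' is 0 at any x0, and the estimate falls on the line both inside and outside the data range.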

Consider the case where y=f(x) and three points of data are given.

Example: Data = (x1,y1),(x2,y2),(x3,y3). Prediction target = (x0, y0).

The inferred line must always pass through all three points.

The first thing that comes to mind is to split the range into x1-x2 and x2-x3 and draw a straight line in each of the two sections.

However, when the inferred result is drawn as a line, it is strange for the data points to be corners.

This is because, even if the true line did have a corner, an observation point would be unlikely to land exactly on it.

The slopes to the left and right of each data point must therefore be equal.

It follows that the data at x3 also influences the prediction between x1 and x2. The same is true in the other direction.

Furthermore, if there is an x4 outside x3, it too has an effect.

This is because, if x3 and x4 are very close but y3 and y4 differ greatly, it is strange to adopt only the closer one at 100%.

A slight difference between x3 and x4 would then cause a large change in the result, so the inference would be strongly affected by noise and not robust.

The closer x is, the closer y is inferred to be, so points whose x are about equally close should influence y about equally.

Even a distant point has an effect that is small but not 0.

Even when making predictions between x1 and x2, all points can have an effect.

For example, if y is averaged with weights equal to the reciprocal of the distance from x0 to each point, every point has an effect.

However, this alone is not good inference.

If the points are skewed to the left or right of x0, y0 will also be skewed.
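
The skew can be made visible with a small numeric sketch. The data below are hypothetical and lie exactly on y = 2x, so any deviation of the estimate from 2*x0 comes from the weighting alone; the function name is also an assumption.

```python
def idw_weights(x0, xs):
    """Normalized weights proportional to 1/|x - x0| (assumes no x equals x0)."""
    raw = [1 / abs(x - x0) for x in xs]
    total = sum(raw)
    return [r / total for r in raw]


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # most points lie to the right of x0
ys = [2 * x for x in xs]        # hypothetical data on the line y = 2x
x0 = 1.5

w = idw_weights(x0, xs)
x_bar = sum(wi * xi for wi, xi in zip(w, xs))  # weighted average of x
y_hat = sum(wi * yi for wi, yi in zip(w, ys))  # estimate of y0
print(x_bar, y_hat, 2 * x0)
```

The weighted average of x comes out to the right of x0, and the estimate of y0 is pulled upward by the same amount, even though the data contain no noise.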

Now, let us assign a weight to each point and take the weighted average of y with those weights.

At this time, if the weighted average of x is equal to x0, it can be said that there is no lateral bias.

To make the weighted average of x = x0, we have to adjust the weights of the points to the left and right of x0.

The conditions that the weights should satisfy are summarized below.

1. The closer a point is to x0, the greater its weight. However, the side larger than x0 and the side smaller than x0 are treated separately. [Must]

2. Points that are close to each other have close weights. However, the side larger than x0 and the side smaller than x0 are treated separately. [Must]

3. The weighted average of x is close to x0. [Minimize bias]

4. The sum of weights is large. [Minimize variance]

[Minimize bias] and [Minimize variance] are in a trade-off relationship.

The weighted average of x can be adjusted by dividing the points into two groups, those larger than x0 and those smaller, and scaling the weights of each group as a whole.

Weighted average of x = x0 is not always possible. For example, if there is data only on the side larger than x0, bias is unavoidable.

Likewise, when predicting the future, using older data reduces variance, but inevitably increases bias.
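
One possible way to carry out the group-wise adjustment is sketched below. It is an illustration only: the function name, the default 1/d decay, and the choice to rescale the right-hand group (rather than the left) are assumptions, and the sketch addresses only the bias condition, not the trade-off with variance.

```python
def balanced_weights(x0, xs, kernel=lambda d: 1 / d):
    """Distance-decaying weights, rescaled per side so the weighted mean of x is x0.

    Assumes no x equals x0. If all data lie on one side of x0, balancing is
    impossible and the plain normalized weights are returned (bias remains).
    """
    raw = [kernel(abs(x - x0)) for x in xs]
    # Pull of each side on the weighted mean: sum of weight * distance.
    left = sum(r * (x0 - x) for r, x in zip(raw, xs) if x < x0)
    right = sum(r * (x - x0) for r, x in zip(raw, xs) if x > x0)
    if left > 0 and right > 0:
        # Scale the whole right-hand group so both sides pull equally.
        scale = left / right
        raw = [r * scale if x > x0 else r for r, x in zip(raw, xs)]
    total = sum(raw)
    return [r / total for r in raw]


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x for x in xs]  # hypothetical data on the line y = 2x
x0 = 1.5
w = balanced_weights(x0, xs)
print(sum(wi * xi for wi, xi in zip(w, xs)))  # weighted average of x
print(sum(wi * yi for wi, yi in zip(w, ys)))  # estimate of y0
```

Scaling a whole group by one factor keeps the ordering of weights within each side, which is why conditions 1 and 2 distinguish the two sides. With data on both sides, the weighted average of x lands on x0 up to rounding; with data on one side only, the function falls back to unbalanced weights.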

How to balance variance and bias is discussed separately.

What remains is to consider how the weight is determined by the proximity of x.

In the case of two points, we used the distance d1=|x1-x0| and the weight w1=(1/d1) / Σ(1/d).

This implicitly assumed that "Euclidean distance" = "nearness of x".

However, you should be free to decide what you consider to be "close".

Any measure that satisfies the distance axioms is fine.

Consider an example of two-point inference where the weight decays not with d(=|x-x0|) but with the square of d.

If you adjust the weights so that the weighted average of x = x0, you get the same result as when calculating with d.

However, with 3 or more points, the degree of influence of distant points changes.
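
Both statements can be checked with a short sketch. With two points, the condition "weighted average of x = x0" forces w1*d1 = w2*d2, so the ratio of the weights is fixed no matter how the raw weights decay. The function name, the `power` parameter, and the sample data are assumptions made for this illustration.

```python
def balanced_estimate(x0, points, power):
    """Estimate y0 from raw weights 1/d**power, after rescaling the right-hand
    group so that the weighted average of x equals x0.

    Assumes data on both sides of x0 and no x equal to x0.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    raw = [1 / abs(x - x0) ** power for x in xs]
    left = sum(r * (x0 - x) for r, x in zip(raw, xs) if x < x0)
    right = sum(r * (x - x0) for r, x in zip(raw, xs) if x > x0)
    scale = left / right
    weights = [r * scale if x > x0 else r for r, x in zip(raw, xs)]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


x0 = 2.0
two = [(1.0, 10.0), (4.0, 20.0)]                 # hypothetical data
three = [(1.0, 10.0), (4.0, 20.0), (6.0, 50.0)]  # one more distant point

print(balanced_estimate(x0, two, 1), balanced_estimate(x0, two, 2))
print(balanced_estimate(x0, three, 1), balanced_estimate(x0, three, 2))
```

The first line prints two equal values up to rounding. On the second line the values differ, because the squared distance gives the distant point at x=6 a smaller share within the right-hand group.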

So far we have only discussed the most probable values, but we can also infer probability distributions.

First, let y0, the value to be guessed, start as "unknown".

Hypothesize a fuzzy set to which the inference target belongs.

Every data point participates in the fuzzy set with its own weight.

The distribution of the fuzzy set becomes the probability distribution of the inference result.

However, not only the data but also "unknown" participates in the fuzzy set, with a weight between 0 and 1.

If the data contains x1=x0, then y0=y1, and "unknown" should be 0%.

The weight of "unknown" represents the proportion for which there is no information about the inference target.

The weight of "unknown" depends on what information is given.

If you have no information of the form "the closer x is, the closer y is", you can give "unknown" a weight of 0 when x matches and a weight of 1 when it does not.

If you do not even have the information "if x matches, then y matches", you can always use a weight of 1.
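
As a rough illustration of how "unknown" could enter the distribution, the sketch below treats the weight of "unknown" as a share u and scales the data weights by (1-u). That mixing rule, the function names, and the sample values are assumptions of this sketch; the text above only states that "unknown" participates with a weight between 0 and 1.

```python
def inferred_distribution(weights_by_y, unknown_weight):
    """Mix the data memberships with an "unknown" member.

    weights_by_y: dict mapping each observed y to its positive weight.
    unknown_weight: share in [0, 1] given to "unknown" (assumed mixing rule).
    """
    total = sum(weights_by_y.values())
    dist = {y: (1 - unknown_weight) * w / total for y, w in weights_by_y.items()}
    dist["unknown"] = unknown_weight
    return dist


def match_only_unknown_weight(x0, xs):
    """With no information about closeness: 0 on an exact match of x, otherwise 1."""
    return 0.0 if x0 in xs else 1.0


# Hypothetical single data point (x1, y1) = (1.0, 10.0).
xs = [1.0]
weights_by_y = {10.0: 1.0}
print(inferred_distribution(weights_by_y, match_only_unknown_weight(1.0, xs)))
print(inferred_distribution(weights_by_y, match_only_unknown_weight(2.0, xs)))
```

When x0 matches the data point, all of the probability goes to y1 and "unknown" gets 0; otherwise "unknown" gets the full share. Intermediate values of u would correspond to having partial information about how closeness in x carries over to y.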