Regarding y=f(x, ε), the basic idea is the same as for y=f(x).

The inference result is a fuzzy set in which all data points and "unknown" participate with weights from 0 to 1.

y=f(ε): when the weight of all data is assumed to be 1

y=f(x): Given the information 'Δy=0 when Δx=0', the weight of 'unknown' is reduced

y=f(x,ε): Weight of "unknown"=1

For y=f(x, ε), whether "Δy=0 when Δx=0" holds is not given as information; it must be inferred inductively by oneself.

"unknown" has two origins.

The first is "unknown" = 100% for the prediction target.

The other is the bias ratio of each data point.

Good or bad inference can be considered as follows.

1. "unknown" = 100% is the worst guess.

2. A small "unknown" ratio is a better inference than a large one.

3. It is a good inference if the weights are concentrated on a small number of data points.

4. We define the inference with the smaller "(1 - unknown) * variance after inference + unknown * variance before inference" as the better one.
Other than the "unknown" ratio, it is possible to evaluate by methods such as average entropy.
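As a hedged sketch, criterion 4 can be written directly as a score function (the function name is mine, not from the text):

```python
def inference_score(unknown, var_after, var_before):
    """Criterion 4: '(1 - unknown) * variance after inference
    + unknown * variance before inference'; smaller is better."""
    return (1 - unknown) * var_after + unknown * var_before
```

At unknown = 1 the score reduces to the variance before inference, matching criterion 1 (the worst guess); at unknown = 0 only the post-inference variance remains.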

Entropy is a calculation that groups together items considered the "same".

For example, the entropy changes depending on whether or not "dogs" are classified by breed.

When values are treated as discrete, whether two items count as "same" or "different" is sensitive to slight errors.

Therefore, it is safer to treat all data points as distinct.
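To illustrate the "dogs by breed" point, here is a minimal sketch (the example labels are my own): the finer the grouping, the higher the entropy.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of labels, grouping equal labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

coarse = ["dog", "dog", "dog", "cat"]           # dogs not split by breed
fine = ["poodle", "beagle", "poodle", "cat"]    # dogs split by breed
```

Splitting the dogs by breed raises the entropy, so the value depends entirely on what is declared "same".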

1. Decide which explanatory variables to focus on. [Draft a hypothesis]

Multiple explanatory variables may be used.

2. Decide the weight with which each data point, including the prediction target, participates in the fuzzy set. [Initial value of weight]

The initial value of the weight may be determined appropriately by meta-inference or the like, with reference to other examples.

3. For every data point participating in the fuzzy set, calculate the probability (bias) that it is a valid participant.

[Continuity bias] [Symmetry bias] [Variance bias] [Chance bias] etc.

4. Calculate the inferred result considering all biases.

5. Return to step 2 and determine the weights that minimize "unknown" in the inference result. [Weight optimization]
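The five steps above might be sketched as a runnable skeleton like the following. The helper names and the crude distance-based "unknown" are my own stand-ins, not the method's actual bias formulas:

```python
def weight(x, target, width):
    # Step 2: participation weight, 1 at the prediction target,
    # falling linearly to 0 at distance `width`.
    return max(0.0, 1.0 - abs(x - target) / width)

def unknown_ratio(xs, target, width):
    # Steps 3-4 (stand-in): a crude "unknown" that grows when weight
    # is given to data far from the prediction target.
    ws = [weight(x, target, width) for x in xs]
    total = sum(ws)
    if total == 0:
        return 1.0  # nothing participates: "unknown" = 100%
    return sum(w * abs(x - target) / width for w, x in zip(ws, xs)) / total

def optimize_width(xs, target, widths):
    # Step 5: return to step 2 and pick the weights (here, a single
    # width parameter) that minimize "unknown".
    return min(widths, key=lambda w: unknown_ratio(xs, target, w))
```

A real implementation would replace `unknown_ratio` with the combination of continuity, symmetry, variance, and chance biases described below.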

As an example, predict whether a strawberry is sweet or sour based on its size.

Data: (20mm, sour), (30mm, sour), (31mm, sweet), (32mm, prediction target "unknown"), (40mm, sweet)

Here, let's divide the sizes into two groups: under 31mm and 31mm or over.

Less than 31mm: 0/2 are sweet

31mm or more: (2 to 3)/3 are sweet (2 of the 3 are known sweet; 3 if the prediction target is also sweet)

At this point, one may suspect that the boundary was set arbitrarily to exclude the 30mm sour strawberry.

This is called “continuity bias”.

It may be a coincidence, so it is difficult to judge whether it was arbitrary.

However, the 31mm and 32mm strawberries differ only slightly in size, so it is more natural for them to belong to the same group.

It is better to decide the weight for participating in the fuzzy set without clearly dividing the group.

If the explanatory variables are close to each other, the weights should be close to avoid bias.

This should hold even for points that are not near the boundary.

The weight is 1 at the prediction target and 0 at the boundary.

The weight of a point can be determined from the ratio of "distance to prediction target" and "distance to boundary".

It is not always necessary to determine the weight linearly.

For example, a scale corresponding to the variance may be chosen, and the weight determined from a normal distribution.
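For the strawberry example, the linear rule and the normal-distribution alternative might look like this (a sketch; the boundary value 30.5mm in the usage below is my own assumption, not from the text):

```python
import math

def linear_weight(x, target, boundary):
    """Weight 1 at the prediction target, 0 at the boundary, linear
    in between, from the ratio of the two distances."""
    span = abs(boundary - target)
    return max(0.0, 1.0 - abs(x - target) / span)

def gaussian_weight(x, target, sigma):
    """Non-linear alternative: a normal-distribution profile whose
    scale sigma may be chosen to match the variance."""
    return math.exp(-((x - target) ** 2) / (2 * sigma ** 2))
```

With target 32mm and an assumed boundary at 30.5mm, the 31mm strawberry gets weight 1/3 and the 30mm one weight 0, so data with close explanatory variables receive close weights and no sharp group boundary is drawn.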

Poorly chosen weights must be penalized with "unknown" as continuity bias.

Relative to the unbiased optimal weights, any excess weight is assigned "unknown".

Unless it makes the computation difficult, reducing the weights is preferable because it reduces the bias.

Consider the example of predicting the sweetness of strawberries.

Assume that the size of the strawberry to be predicted is 32 mm.

Suppose the relationship between size and sweetness was investigated by dividing the sizes into small (0-30 mm) and large (30-60 mm).

However, the strawberry to be predicted is a fairly small example even within the "large" group.

The difference between the mean of the set's explanatory variables and the explanatory variable of the prediction target is defined as "symmetry bias".

Simply making the upper and lower ranges the same width does not eliminate the bias.

For example, suppose we want to predict a phenomenon that will occur at a certain time.

Suppose we make an inference from the data from 1 hour before to 1 hour after that time.

Although the time window was chosen symmetrically around the target time, the data are biased because only past data exist.

As is the case with time, the data cannot necessarily be taken in such a way that "symmetry bias" does not occur.

"unknown" is given according to the difference between the mean of the set's explanatory variables and the explanatory variable of the prediction target.

First, split the set in two at the point half the "mean difference" away from the prediction target.

L = Wasserstein distance between the undivided set and the whole

L1 = Wasserstein distance between the nearer half of the partition and the whole

L2 = Wasserstein distance between the farther half of the partition and the whole

"average unknown" = |L2 - L1| / L

Allocate the "average unknown" to members in proportion to the distance of each member's explanatory variable from the prediction target.
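A runnable sketch of this calculation, under my reading that "the whole" is the universal set and that the split point lies half the mean difference away from the prediction target:

```python
import bisect

def wasserstein_1d(u, v):
    """W1 distance between two 1-D empirical samples:
    the integral of |F_u - F_v| over the real line."""
    su, sv = sorted(u), sorted(v)
    pts = sorted(set(su) | set(sv))
    dist = 0.0
    for x0, x1 in zip(pts, pts[1:]):
        fu = bisect.bisect_right(su, x0) / len(su)
        fv = bisect.bisect_right(sv, x0) / len(sv)
        dist += abs(fu - fv) * (x1 - x0)
    return dist

def average_unknown(xs, target, universe):
    """|L2 - L1| / L: split xs at half the mean difference from the
    target, then compare the near and far halves to the whole."""
    mean = sum(xs) / len(xs)
    half = abs(mean - target) / 2
    near = [x for x in xs if abs(x - target) <= half]
    far = [x for x in xs if abs(x - target) > half]
    if not near or not far:
        return 0.0  # degenerate split: no asymmetry measurable
    L = wasserstein_1d(xs, universe)
    L1 = wasserstein_1d(near, universe)
    L2 = wasserstein_1d(far, universe)
    return abs(L2 - L1) / L if L else 0.0
```

When the near and far halves sit equally far from the whole, L1 = L2 and the "average unknown" vanishes, which matches the intent that a symmetric set carries no symmetry bias.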

The calculation method is not limited to this method.

Instead of splitting into two, it may be finely split.

Even without any information on the objective variable, it can be inferred that the greater the difference in explanatory variables, the greater the bias.

Also, if a large "unknown" is given, it is better to re-form the hypothesis so that the set becomes smaller.

For an individual data point, the farther its explanatory variable is from the prediction target, the farther its objective variable is expected to be.

For the distribution as a whole, if the mean of the explanatory variables equals that of the prediction target, the deviations can be considered to cancel out.

However, this holds only when the relationship between the explanatory variable and the objective variable is linear.

That is, even if the average values of the explanatory variables are equal, the larger the variance×nonlinearity, the larger the deviation.

We call it the “Variance bias”.

"unknown" is given according to the variance of explanatory variables.

"deviation" and "mean difference" have the same units.

By assuming various nonlinear curves, an approximate relationship between the two can be estimated.

The bias due to "deviation" is about half of the bias due to "mean difference".

The set may be divided into two parts, one with a closer "deviation" and one with a farther one, and the same calculation as for the "mean difference" may be performed.

Also, if a large "unknown" is given, it is better to re-form the hypothesis so that the set becomes smaller.
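A small numeric illustration of "variance × nonlinearity" (the quadratic f and the sample sizes are my own choices): even with the mean held exactly at the prediction target, a wider spread produces a larger deviation.

```python
def f(x):
    # A stand-in nonlinear relationship between the explanatory
    # and objective variables.
    return x * x

def deviation(xs):
    """Gap between averaging the outputs and applying f to the mean.
    Zero for a linear f; grows with variance * curvature otherwise."""
    mean = sum(xs) / len(xs)
    return sum(f(x) for x in xs) / len(xs) - f(mean)

tight = [31, 32, 33]  # mean 32, small spread
wide = [27, 32, 37]   # mean 32, large spread
```

For f(x) = x² the deviation equals the population variance exactly (2/3 for the tight set, 50/3 for the wide one), so equal means do not cancel the deviation once f is nonlinear.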

Chance bias is an apparent relationship that arises from random variation in the objective variable.

Compute the Wasserstein distance between the objective variables of the fuzzy set and the universal set.

Find the probability that a random sample of the same size reaches a distance equal to or greater than the above by chance.

"unknown" is given in proportion to that chance probability.

However, this applies when testing for chance against a single random sample.

If you suspect that the rarest case out of, say, 100 random samples may have been selected, you must compare against that case.
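The chance-bias test above can be sketched as a permutation-style check (a runnable sketch; sweet/sour would be encoded as 1/0, and the trial count, seed, and helper names are my own assumptions):

```python
import bisect
import random

def wasserstein_1d(u, v):
    """W1 distance between two 1-D empirical samples."""
    su, sv = sorted(u), sorted(v)
    pts = sorted(set(su) | set(sv))
    dist = 0.0
    for x0, x1 in zip(pts, pts[1:]):
        fu = bisect.bisect_right(su, x0) / len(su)
        fv = bisect.bisect_right(sv, x0) / len(sv)
        dist += abs(fu - fv) * (x1 - x0)
    return dist

def chance_unknown(subset_y, universe_y, trials=2000, seed=0):
    """Probability that a random subset of the same size lands at
    least as far from the universal set as the observed fuzzy set."""
    rng = random.Random(seed)
    observed = wasserstein_1d(subset_y, universe_y)
    hits = sum(
        wasserstein_1d(rng.sample(universe_y, len(subset_y)), universe_y)
        >= observed
        for _ in range(trials)
    )
    return hits / trials
```

A small result means the observed relationship is unlikely to arise by chance, so little "unknown" is assigned; a large one means the apparent relationship could easily come from random sampling alone.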