For example, consider the case where the graph of "y (= x) vs. x" is converted to "y' (= y - x = 0) vs. x".

After the coordinate transformation, find the average value.

Restore the residuals to their original coordinates.

This is a valid inference, since the residuals are also reduced in the original coordinates.
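As a minimal sketch of the transform / average / restore steps above (in Python; the data points are hypothetical, chosen so that y is roughly x):

```python
# A minimal sketch of the transform / average / restore steps above
# (the data points are hypothetical, chosen so that y is roughly x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]

# Coordinate transformation: y' = y - x (residual coordinates).
residuals = [y - x for x, y in zip(xs, ys)]

# After the coordinate transformation, find the average value.
mean_residual = sum(residuals) / len(residuals)

# Restore to the original coordinates: the prediction at a new x.
def predict(x_new):
    return x_new + mean_residual

print(predict(10.0))  # close to 10, since the residuals are small
```

Because the transformation removes the y = x trend, the average residual is all that remains to be restored.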

There was no need to narrow down to samples with similar explanatory variables.

It is considered that the inference becomes more effective as the residual error decreases.

When narrowing down, it was not always the case that the smaller the residual, the better the inference.

In narrowing down, it is the samples that are narrowed down.

In coordinate transformation, on the other hand, it is the objective variable that is narrowed down.

The coordinate transformation itself only changes the representation and does not make an inference.

There is no "right or wrong" bias in the choice of coordinate transformation.

If the question is not "Which sample is it close to?" but simply the value of the objective variable, narrowing down the objective variable is effective.

It is necessary to consider the balance between "narrowing down the sample" and "narrowing down the objective variable".

Coordinate transformation does not change the sample weights.

Both can be achieved by performing coordinate transformation and then narrowing down the sample.

However, in the example of y' (= y - x = 0), the coordinate transformation results in a constant value, leaving no room for narrowing down.

Coordinate transformations are valid even if y = x holds only for part of the data, not the whole.

Therefore, the quality of the inference should be evaluated on the samples narrowed down after the coordinate transformation.

Consider whether it is possible to evaluate with the variance of samples narrowed down after coordinate transformation.

First, narrowing down is performed so that "unknown" is minimized separately for those that have undergone coordinate transformation and those that have not.

As a result, "unknown" and "variance" are obtained.

If there is an inference with zero variance, it should have the highest priority.

An "unknown" of 100% has the lowest priority, and 0% has the highest priority.

Therefore, it seems that the higher the "(1-unknown)/variance", the higher the priority.
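A minimal sketch of this priority (the function name and the zero-variance guard are my own additions):

```python
# A minimal sketch of the priority above: (1 - unknown) / variance.
# The function name and the zero-variance guard are my own additions.
def priority(unknown, variance):
    # A zero-variance inference should get the highest possible priority.
    if variance == 0.0:
        return float("inf")
    return (1.0 - unknown) / variance

# Fewer unknowns and smaller variance both raise the priority.
print(priority(0.0, 0.5))  # 2.0
print(priority(0.5, 0.5))  # 1.0
print(priority(0.2, 0.0))  # inf
```

Returning infinity for zero variance matches the rule that a zero-variance inference should have the highest priority.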

Think sample by sample.

Each sample has an "unknown" and a "residual".

However, it cannot be said that the larger the "(1-unknown)/residual" is, the higher the priority is.

In that case, samples near the mean value will be adopted unconditionally.

Each sample should be assigned the same "variance".

It is likely that the larger the "(1-unknown) / variance" of the sample, the higher the priority.

Maximizing the above formula is the same as making "(1-unknown) * variance + unknown * ∞" as small as possible.

Basically, the purpose is to reduce the variance, but the variance of the "unknown" part can be interpreted as ∞.

It is more accurate to use the variance of the entire set before coordinate transformation instead of ∞.

It seems that the aim is to minimize "(1-unknown) * variance after inference + unknown * variance before inference".

For each sample, four values are obtained for each inference: "unknown", "weight", "variance of the narrowed sample in the original coordinates", and "objective variable to be predicted in the original coordinates".

Without a coordinate transformation, the inference only claimed that the prediction target is the same as the sample.

With a coordinate transformation, it asserts "the objective variable to be predicted in the original coordinates".

The "variance of the narrowed sample in the original coordinates" is the same value for all samples.

The index is "(1-unknown) * variance of narrowed sample at original coordinates + unknown * variance of universal set before inference".

For multiple inferences, keep only the one with the smallest index.
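A minimal sketch of this index and selection step (the names and numbers are hypothetical):

```python
# A minimal sketch of the index above (names and numbers are my own):
#   index = (1 - unknown) * variance_after + unknown * variance_before
# where variance_before is the variance of the universal set before
# inference (used in place of infinity for the "unknown" part).
def index(unknown, variance_after, variance_before):
    return (1.0 - unknown) * variance_after + unknown * variance_before

# For multiple candidate inferences on the same sample, keep only the
# one with the smallest index.
variance_before = 1.0
candidates = [
    {"name": "no transform", "unknown": 0.2, "variance": 0.8},
    {"name": "y - x", "unknown": 0.2, "variance": 0.1},
]
best = min(candidates,
           key=lambda c: index(c["unknown"], c["variance"], variance_before))
print(best["name"])  # "y - x": same unknown, much smaller variance
```

Using the pre-inference variance instead of infinity keeps the index finite even when "unknown" is large.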

First, put only the prediction target, whose value is "unknown", into the fuzzy set.

Samples are then added to the fuzzy set in ascending order of index.

Stop adding when the "unknown" portion of the set's probability distribution no longer decreases.

The set's probability distribution over the "objective variable to be predicted in the original coordinates" is the inference result.
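A minimal sketch of this accumulation loop (the sample representation, numbers, and names are hypothetical):

```python
from collections import Counter

# A minimal sketch of the accumulation loop above (the sample
# representation, numbers, and names are hypothetical).
samples = [
    {"index": 0.1, "unknown": 0.0, "value": 1.0},
    {"index": 0.2, "unknown": 0.0, "value": 1.0},
    {"index": 0.5, "unknown": 1.0, "value": None},
]
samples.sort(key=lambda s: s["index"])  # ascending index

def unknown_of(members):
    # Fraction of "unknown" mass in the set's probability distribution.
    return sum(m["unknown"] for m in members) / len(members)

# Start with only the prediction target, which is 100% "unknown".
fuzzy_set = [{"unknown": 1.0, "value": None}]

for s in samples:
    # Stop once adding the next sample no longer decreases "unknown".
    if unknown_of(fuzzy_set + [s]) >= unknown_of(fuzzy_set):
        break
    fuzzy_set.append(s)

# The set's distribution over predicted values is the inference result.
known = [m["value"] for m in fuzzy_set if m["value"] is not None]
print(Counter(known), "unknown:", unknown_of(fuzzy_set))
```

In this sketch the third sample is itself fully "unknown", so adding it would raise the set's "unknown" fraction and the loop stops before it.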

Note that adopting only the best inference result is justified only if the results that are not adopted are worthless.

Therefore, if a non-adopted inference contains useful information, an inference that incorporates that information must also be adopted.

However, if inferences are made under all possible assumptions, there is no need to worry about this.

Example: predict a color so that the Euclidean distance between the (R, G, B) values of the predicted and true colors is minimized.

It is possible to predict each individually, but is there a better way?

The inference described so far can be calculated as long as the "distance" is known.

This "distance" is the distance from the best answer to be aimed at.

Although it is not always explicit, each problem specifies one "distance".

It is better to optimize the "distance" itself than to optimize each of its elements (R, G, B) individually.
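A small sketch of why the single "distance" and element-wise comparison can disagree (the candidate colors are hypothetical):

```python
import math

# The "distance" for the color example: Euclidean distance in (R, G, B)
# space between a predicted color and the target color. The candidate
# colors below are hypothetical.
def color_distance(c1, c2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

target = (100, 100, 100)
candidate_a = (110, 100, 100)  # all error in one channel
candidate_b = (106, 106, 106)  # smaller per-channel error, spread out

# Per-channel (element-wise) comparison favors candidate_b (max error
# 6 vs 10), but the single "distance" favors candidate_a.
print(color_distance(target, candidate_a))  # 10.0
print(color_distance(target, candidate_b))  # ≈ 10.39
```

Here the candidate with smaller per-channel errors has the larger overall distance, so optimizing channels individually picks the worse color.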

If you convert from "RGB color" to "HSV color" and think about it, you may be able to make a good prediction.

In such cases, element-by-element optimization will not work.

Regularities such as patterns at regular intervals are "repeating units".

Example: predict y=? when x=10

Data: 101010101?

Inference 1: Infer that y=1:5/10, y=0:4/10, y=unknown:1/10

Inference 2: Infer that y=0:4/5, y=unknown:1/5

Inference 3: Infer that y=0:4/5, y=unknown:1/5

Inference 2 infers that y=0 when x is even.

Inference 2 ignores data where x is odd.

Inference 3 infers that the sequence "10" is repeating.

Inference 3 does not strictly predict y when x=10.

Instead, it predicts that the sequence "10" will appear.

From the result, it is deduced that y=0.

By deduction, if the prediction target is the same, the inference can be integrated.

In Inference 2, the untouched odd-x data should be set to "unknown" = 100%.
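The sequence example above can be computed explicitly (the data string is from the text; the variable names are my own):

```python
from fractions import Fraction

# The sequence example above, computed explicitly (the data string is
# from the text; variable names are my own). Positions are x = 1..10,
# and y at x = 10 is the value to be predicted.
data = "101010101?"
n = len(data)

# Inference 1: the distribution over the whole sequence.
inf1 = {c: Fraction(data.count(c), n) for c in "10?"}
print(inf1)  # y=1: 5/10, y=0: 4/10, unknown: 1/10

# Inference 2: narrow down to even x only (x = 2, 4, 6, 8, 10).
even = [data[x - 1] for x in range(2, n + 1, 2)]
inf2 = {c: Fraction(even.count(c), len(even)) for c in "0?"}
print(inf2)  # y=0: 4/5, unknown: 1/5

# Inference 3: treat the sequence as repetitions of the unit "10".
# It predicts that the unit "10" continues; y=0 at x=10 then follows
# by deduction, giving the same distribution as Inference 2.
units = [data[i:i + 2] for i in range(0, n, 2)]
inf3_repeat = Fraction(units.count("10"), len(units))
print(inf3_repeat)  # 4/5
```

Exact fractions reproduce the distributions stated for the three inferences, including the "unknown" mass from the untouched data.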