I have already described, as [hypothesis planning], the method of optimizing the inference results after deciding which explanatory variables to focus on.

Multiple hypotheses lead to multiple inference results, so some form of comprehensive judgment is required.

There are two ways of thinking about integration of inference results.

1. Adopt the best inference result

2. Weighted average of inference results

The first method has a drawback.

Even when the judgment is close, the inference result of the losing side is completely ignored.

The second method also has a drawback.

Hypotheses that produce similar inference results are overemphasized.

Example with strawberries: "It's sour because it's small," "It's sweet because of its RGB color," "It's sweet because of its CMYK color," "It's sweet because of its HSV color."

In this example, essentially the same inference appears three times, so color is overemphasized.
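A tiny numeric sketch of this effect, with made-up sweetness scores (the hypothesis names and values are illustrative, not from the original):

```python
# Hypothetical sweetness scores (0 = sour, 1 = sweet) predicted for one
# strawberry by four hypotheses; the numbers are made up for illustration.
predictions = {
    "small size -> sour": 0.2,
    "RGB color -> sweet": 0.9,
    "CMYK color -> sweet": 0.9,
    "HSV color -> sweet": 0.9,
}

# A naive unweighted average counts the color hypothesis three times,
# since RGB, CMYK, and HSV all describe the same underlying feature.
naive = sum(predictions.values()) / len(predictions)
print(f"naive average: {naive:.3f}")  # 0.725, pulled toward "sweet"

# If the three color hypotheses were recognized as duplicates sharing one
# vote, size and color would be weighted evenly.
balanced = (0.2 + 0.9) / 2
print(f"balanced average: {balanced:.3f}")  # 0.550
```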

One solution to the second method's shortcoming is to formulate hypotheses evenly.

Another solution is to weight the average according to the degree of overlap between hypotheses.

However, it is difficult to decide what should count as "even" or "overlapping."

A solution to the first method's shortcoming is to make it safe to ignore the hypotheses that are not adopted.

The adopted hypothesis need only contain whatever useful information the rejected hypotheses carried.

In other words, all inferences should be integrated.

There are two possible ways to integrate inference.

1. Integrate all inference results

2. Propose a hypothesis that combines all hypotheses

With the first, the issue is whether results can be integrated by looking only at the results themselves.

With the second, the issue is whether to recalculate each time a new hypothesis appears.

Since there are an infinite number of explanatory variables, it is not possible to formulate a hypothesis focusing on all explanatory variables from the beginning.

Either way, the combined hypothesis must outperform the individual ones.

The inference result of a certain hypothesis is "information".

Any information is better than nothing.

A program that turns a blind eye to that "information" in pursuit of better-looking results is wrong.

As an example, consider the case of estimating the sweetness of strawberries.

It is likely that a better inference can be made by narrowing down to similar items such as "same breed," "same production area," "similar color," and "similar size."

However, if narrowing down to items close to the prediction target leaves zero items, the narrowing is too strict.

If it is narrowed down to one item, we can infer that the prediction target is the same as that item, but the variance is worrying.

The more you narrow down, the smaller the bias, but the larger the variance.

Therefore, consider combining the result of narrowing down to 1 with the result of narrowing down to 2.

If the data is narrowed down to 1: "unknown" = 1/(1+1) = 1/2, "data 1" = 1/(1+1) = 1/2.

If the data is narrowed down to 2: "unknown" = 1/(2+1) = 1/3, "data 1" = 1/3, "data 2" = 1/3.

If the data is narrowed down to m: "unknown" = 1/(m+1), and each "data k" = 1/(m+1).
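This narrowing scheme can be written down directly; a minimal sketch (the function name is my own):

```python
def narrowed_weights(m):
    """Uniform weights after narrowing down to m data points, with one
    extra slot reserved for "unknown" (the prediction target itself)."""
    share = 1 / (m + 1)
    weights = {f"data {k}": share for k in range(1, m + 1)}
    weights["unknown"] = share
    return weights

print(narrowed_weights(1))  # {'data 1': 0.5, 'unknown': 0.5}
print(narrowed_weights(2))  # every entry is 1/3
```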

Let us substitute the 2-data result into the "unknown" of the 1-data result.

However, it should not be substituted as is.

Substituting as is gives "data 1" = 1/2 + (1/2)(1/3) = 2/3, "data 2" = (1/2)(1/3) = 1/6, "unknown" = (1/2)(1/3) = 1/6.

This means that "data 1" carries four times the weight of "unknown" (2/3 is four times 1/6); that is, it counts as four data points.

Also, since "unknown" = 1/6 = 1/(n+1), the implied total number of data points is n = 5.

Constraint: when any m data points are selected from the n data points, the total weight of those m must be at most m/(m+1).

If "data 1" = 1/2, "data 2" = 1/6, and "unknown" = 1/3, this constraint is satisfied.

The difference between "unknown" in the 1-data case and in the 2-data case becomes "data 2": 1/2 - 1/3 = 1/6.

The difference between "unknown" for m-1 and m is "data m".

"data m" = 1/{(m-1)+1} - 1/(m+1) = 1/m - 1/(m+1) = 1/(m^2+m)
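The ranked scheme can be checked numerically with exact fractions; this sketch assumes the weights derived above (the function name is my own):

```python
from fractions import Fraction

def ranked_weights(m):
    """Weights for m data points ranked by quality (data 1 is best).
    "data k" receives 1/k - 1/(k+1) = 1/(k*(k+1)); the remaining
    1/(m+1) stays "unknown"."""
    weights = {f"data {k}": Fraction(1, k * (k + 1)) for k in range(1, m + 1)}
    weights["unknown"] = Fraction(1, m + 1)
    return weights

w = ranked_weights(2)
print(w)  # data 1 = 1/2, data 2 = 1/6, unknown = 1/3
assert sum(w.values()) == 1

# The constraint holds with equality: the top m data points together
# carry exactly m/(m+1) of the weight (a telescoping sum).
for m in (1, 2, 3, 10):
    top = sum(Fraction(1, k * (k + 1)) for k in range(1, m + 1))
    assert top == Fraction(m, m + 1)
```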

This is the case where data 1 is of better quality than data 2.

If the quality of data 1 and data 2 are the same, "data 1" = 1/3, "data 2" = 1/3, "unknown" = 1/3.

Narrowing down can be said to rank all the data, not to truncate them.

Groups are divided based on a certain feature value, and ranks are given among the groups.

Then, group by another feature and do the same.

With this approach, all of the data contributes to the inference result.
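The grouping-and-ranking idea might be sketched like this, reusing the 1/(k(k+1)) group weights from above; the data, the ranking rule, and the even split within each group are all assumptions of mine:

```python
# Sketch of ranking instead of truncating: group the data by one feature,
# rank the groups by closeness to the prediction target, and give the
# group at rank k the weight 1/(k*(k+1)). Hypothetical data and feature.
from fractions import Fraction

data = [
    {"id": 1, "breed": "A"}, {"id": 2, "breed": "A"},
    {"id": 3, "breed": "B"}, {"id": 4, "breed": "C"},
]
target_breed = "A"  # the prediction target's breed

# Group by the feature, then rank: the target's own group comes first.
groups = {}
for item in data:
    groups.setdefault(item["breed"], []).append(item)
ranked = sorted(groups, key=lambda breed: breed != target_breed)

# Every data point receives a share of its group's weight; none is cut.
item_weight = {}
for rank, breed in enumerate(ranked, start=1):
    group_w = Fraction(1, rank * (rank + 1))
    for item in groups[breed]:
        item_weight[item["id"]] = group_w / len(groups[breed])

print(item_weight)  # ids 1 and 2 split 1/2, so 1/4 each; 3 gets 1/6; 4 gets 1/12
```

The leftover 1/4 not assigned to any data point is the "unknown" slot, matching 1/(m+1) for m = 3 groups.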

No matter how much data there is, more data only makes the inference more certain.

However, the "unknown" discussed so far is the "unknown" due to the prediction target.

It is also necessary to consider "unknown" due to bias.

If a data point has a large "unknown" ratio due to bias, not adopting it reduces the overall "unknown".

Therefore, it is better to adopt the data with the smallest bias-induced "unknown" first.

When adopting more data would only increase "unknown", you can stop there.

The ratios should be determined so that the bias-induced "unknown" is as small as possible within the range that satisfies the constraint above.

However, data points with the same bias-induced "unknown" should be given the same ratio.

Each inference produces, for every sample, an "unknown" and a "weight".

The fuzzy set of these weights becomes the probability distribution of the inference result, from which the "unknown" of the set is obtained.

Now consider how to integrate these when there are multiple inferences.

If we simply take the average, duplicated explanatory variables will be overweighted, as in the strawberry-color example.

Integration must instead take the form of maximum or minimum values.

For each sample, "unknown" keeps the minimum value across inferences.

First, put only the "unknown" due to the prediction target into the fuzzy set.

Then add inference results to the fuzzy set in ascending order of "unknown" (smallest first).

Stop adding when the "unknown" of the set's probability distribution no longer decreases.
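The whole procedure can be sketched as follows; the per-sample representation, the min rule, and the use of the mean as the stopping test are my assumptions, not the author's specification:

```python
def integrate(samples, target_unknown, inferences):
    """Greedily integrate inference results (sketch).

    samples: list of sample ids.
    target_unknown: "unknown" caused by the prediction target itself.
    inferences: one dict per hypothesis, mapping sample id -> "unknown".
    """
    combined = {s: target_unknown for s in samples}

    def overall(u):
        # mean "unknown" across samples, used as the stopping criterion
        return sum(u.values()) / len(u)

    # adopt hypotheses with the smallest overall "unknown" first
    for inf in sorted(inferences, key=overall):
        # a sample missing from a hypothesis is treated as fully unknown
        candidate = {s: min(combined[s], inf.get(s, 1.0)) for s in samples}
        if overall(candidate) >= overall(combined):
            break  # adding more no longer reduces "unknown"
        combined = candidate
    return combined

result = integrate(
    samples=["s1", "s2"],
    target_unknown=0.5,
    inferences=[{"s1": 0.2, "s2": 0.4}, {"s1": 0.6, "s2": 0.9}],
)
print(result)  # {'s1': 0.2, 's2': 0.4}
```

Here the second hypothesis is never adopted: taking its per-sample minimum would leave the overall "unknown" unchanged, so the loop stops.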