Consider the balance between bias and variance.

Picture an archery target on which the arrows cluster around a point that is off center.

Three kinds of objective are conceivable here.

Objective example 1: 1 point for hitting the center of the target, 0 points otherwise.

Objective example 2: 1 point for hitting the target anywhere.

Objective example 3: The closer the arrow lands to the center of the target, the higher the score.

In objective example 1, if even a slight bias puts the arrows' average landing point outside the center, the probability of hitting the center falls as the variance shrinks.

A smaller bias is better, but a smaller variance is not necessarily better.
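This can be checked numerically. The sketch below is a one-dimensional simplification that is not part of the original argument: the landing position is assumed to be normally distributed with mean `bias` and standard deviation `sigma`, and "the center" is assumed to be the interval within `radius` of 0. The function names and all the numbers are illustrative.

```python
import math

def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def center_hit_probability(bias, sigma, radius=1.0):
    """P(|X| <= radius) for X ~ Normal(bias, sigma^2) (assumed 1-D model)."""
    return normal_cdf(radius, bias, sigma) - normal_cdf(-radius, bias, sigma)

# Assumed bias of 2.0: the arrows cluster outside the center region [-1, 1].
# Over this range of sigma, the hit probability falls as sigma shrinks.
for sigma in (2.0, 1.0, 0.5, 0.25):
    p = center_hit_probability(bias=2.0, sigma=sigma)
    print(f"bias=2.0 sigma={sigma:<4} P(center)={p:.4f}")

# With no bias, a smaller sigma raises the hit probability instead.
for sigma in (2.0, 1.0, 0.5, 0.25):
    p = center_hit_probability(bias=0.0, sigma=sigma)
    print(f"bias=0.0 sigma={sigma:<4} P(center)={p:.4f}")
```

Under these assumptions, tightening the group only helps once the group is centered where the points are.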

Objective examples 1 to 3 are the same in that the aim is a high score; only the scoring rule differs.

Put simply, a good strategy is to aim so as to maximize the expected value of the score.

But that works only if you know the true distribution of where the arrow lands: where it is centered and with what variance.

If the archer guesses that the shots are biased slightly to the left, the archer shifts the aim slightly to the right.

The archer then shoots believing that the bias has been removed.

In reality, however, some bias remains.

In other words, the archer's assumption that "the bias has disappeared" was wrong.

"Bias" can be corrected by shifting the aim if the direction is known.

In other words, the "bias" that remains uncorrected has no known direction.

Even so, although the direction of the deviation is unknown, the archer can recognize that some ways of shooting are prone to deviation and others are not.

You can use "selection entropy" as an indicator of how well the outcome matched what you were aiming for.

The selection entropy is computed from the outcome that actually occurred: the lower the probability that had been predicted for that outcome, the greater the selection entropy.

If the future can be predicted with 100% certainty, the selection entropy is 0.

Inference should aim to make the selection entropy as small as possible.

However, we cannot know the selection entropy until we observe the future outcome.

In general, when we just say "entropy", we mean "average entropy".

The average entropy is the expected value of the selection entropy, assuming that outcomes occur according to the predicted probability distribution.
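A minimal sketch of these two quantities follows. It assumes that "selection entropy" corresponds to what is usually called self-information or surprisal, log2(1/p), and that "average entropy" corresponds to Shannon entropy. No formulas are given above, so this mapping, the function names and the weather forecast are all assumptions made for illustration.

```python
import math

def selection_entropy(predicted, outcome):
    """Surprisal, in bits, of the outcome that actually occurred."""
    p = predicted.get(outcome, 0.0)
    if p == 0.0:
        return math.inf  # an outcome that was predicted to be impossible
    return math.log2(1.0 / p)

def average_entropy(predicted):
    """Expected selection entropy if outcomes follow `predicted` itself."""
    return sum(p * math.log2(1.0 / p) for p in predicted.values() if p > 0)

# Hypothetical forecast.
forecast = {"sunny": 0.5, "cloudy": 0.25, "rainy": 0.25}

print(selection_entropy(forecast, "sunny"))        # 1.0
print(selection_entropy(forecast, "rainy"))        # 2.0
print(average_entropy(forecast))                   # 1.5
print(selection_entropy({"sunny": 1.0}, "sunny"))  # 0.0
```

The selection entropy needs the observed outcome as an input, while the average entropy can be computed from the prediction alone. That is why only the latter is available before the future is known.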

Minimizing the average entropy is not, in itself, what inference should aim for.

It is a valid goal only when the actual probability distribution matches the predicted probability distribution.

A biased probability distribution should not be taken as true.
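The point can be illustrated by scoring two hypothetical predictions against an assumed true distribution. The sketch again assumes that selection entropy is log2(1/p). The "actual" distribution and both predictions are invented for illustration, and in practice the actual distribution is exactly what we do not know.

```python
import math

def average_entropy(predicted):
    """Expected selection entropy if outcomes follow `predicted` itself."""
    return sum(p * math.log2(1.0 / p) for p in predicted.values() if p > 0)

def expected_selection_entropy(actual, predicted):
    """Cross entropy: outcomes drawn from `actual`, scored with `predicted`."""
    return sum(a * math.log2(1.0 / predicted[k])
               for k, a in actual.items() if a > 0)

actual = {"sweet": 0.7, "sour": 0.3}            # assumed truth
overconfident = {"sweet": 0.99, "sour": 0.01}   # biased prediction
calibrated = {"sweet": 0.7, "sour": 0.3}        # matches the assumed truth

for name, pred in (("overconfident", overconfident), ("calibrated", calibrated)):
    print(name,
          "average entropy:", round(average_entropy(pred), 3),
          "expected selection entropy:",
          round(expected_selection_entropy(actual, pred), 3))
```

The overconfident prediction has the lower average entropy, yet it is the one that incurs the higher selection entropy on average once real outcomes arrive.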

Consider how to express the degree of bias.

The deviation caused by bias has no known direction.

At maximum bias, you could say "unknown" = 100%.

The magnitude of the bias can be expressed by the ratio of "unknown".

An inference made under a large bias should therefore carry a large proportion of "unknown".

"Unknown" is a value assigned to the prediction target, and its proportion shrinks as the amount of other data grows.

This is the "unknown" that relates to the amount of data, i.e. to "variance".

Separately, an "unknown" relating to "bias" must be added in some way.

The probability distribution of the inference result is expressed as the distribution over the set consisting of the data plus the "unknown" for the inference target.

It is therefore natural to think that each member of the set also carries an "unknown" that represents its bias.

In other words, the data points, and not only the "unknown" for the inference target, also contain "unknown".

As an example, consider inductively inferring that the sun will rise in the east tomorrow.

Suppose the "unknown" due to bias in a certain piece of data is 50%.

This can be interpreted as uncertain data: there is a 50% chance that something other than the sun was mistaken for it.
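One way to make this concrete is sketched below. The weighting scheme is an assumption on my part, since no formula is given here: every data point and the prediction target get equal weight, the target's weight is entirely "unknown", and a fraction `bias_unknown` of each data point's weight is moved to "unknown". The function name and the observation counts are illustrative.

```python
def predictive_distribution(observations, bias_unknown):
    """Distribution over the data plus one 'unknown' slot for the target.

    Assumed model: equal weights of 1 / (n + 1); each observation gives
    (1 - bias_unknown) of its weight to its label and the rest to "unknown".
    """
    total = len(observations) + 1
    dist = {"unknown": 1.0 / total}  # the prediction target itself
    for label in observations:
        dist[label] = dist.get(label, 0.0) + (1.0 - bias_unknown) / total
        dist["unknown"] += bias_unknown / total
    return dist

# Nine past sunrises observed in the east.
print(predictive_distribution(["east"] * 9, bias_unknown=0.0))
print(predictive_distribution(["east"] * 9, bias_unknown=0.5))

# More data shrinks the variance-related "unknown" but not the bias-related one.
print(predictive_distribution(["east"] * 99, bias_unknown=0.0))
print(predictive_distribution(["east"] * 99, bias_unknown=0.5))
```

Under this assumed scheme, with unbiased data the "unknown" share falls from 10% to 1% as the observations grow from 9 to 99, while with 50% biased data it stays above 50% however much data is added.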

Let us compare this with what I discussed in "Quality, quantity and certainty in inductive reasoning".

It turns out that the "unknown" due to bias is the same as the "unknown" due to quality.

Bias means the same thing as data quality: it expresses whether the data and the prediction target follow the same probability distribution.

Conversely, treating probability distributions as the same when they actually differ is a bias and a loss of quality.

To carry out an inductive inference, information about the "bias" of the data used is necessary.

When predicting the future, it is not possible to know how much "bias" actually exists until the "result" is actually observed.

"Results" cannot be predicted with certainty because they lie in the future.

Similarly, "bias" cannot be inferred with certainty.

Therefore, "bias" must also be calculated by inductive reasoning.

The "bias" of the data used to inductively infer that "bias" must itself be inductively inferred.

This calls for recursive inductive reasoning, but at some level no usable information remains and the answer becomes "unknown", so the recursion stops and no infinite calculation is needed.

For example, consider the case of inductively inferring whether the sun will rise in the east tomorrow.

Whether the "something like the sun" that I saw yesterday really rose in the east is another question.

The degree of "bias" can be restated as the degree of difference between probability distributions.

A smaller bias-induced "unknown" corresponds to better data quality.

In the strawberry example, the data can be narrowed down step by step, to "strawberries of the same variety" and "strawberries from the same production area".

The more you narrow down, the better the quality.

This is because, for a given feature, data sharing the same feature value is considered to be of better quality than data with a different value.

Compared with before the feature is taken into account, the quality of data with the same feature value rises and the quality of data with a different value falls.

However, this tells us only the ordering, that is, which is better, and not by how much.

Suppose that there is no strawberry that satisfies both "the same variety" and "the same production area".

We then have to decide which gives better quality: "same variety" or "same production area".

That has to be inferred from existing data.

For example, let's compare the sweetness of strawberries "from the same production area" and "from different production areas".

If there is no statistically significant difference in sweetness, the production area can be treated as unrelated to sweetness.

However, even if there is a statistically significant difference, it may be a coincidence.

On the other hand, as mentioned earlier, a non-statistical method can tell us only the ordering.

Moreover, whether a difference is statistically significant says nothing about whether the difference between the probability distributions is large.

If the probability distribution changes little, narrowing the data down has little effect.
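The gap between "statistically significant" and "large" can be shown with a little arithmetic. The sketch assumes hypothetical sweetness scores with a known standard deviation of 1.0 in both groups and uses a two-sample z statistic. The sample sizes and observed differences are invented.

```python
import math

def z_statistic(mean_diff, sigma, n_per_group):
    """Two-sample z statistic for a difference in means (known, equal sigma)."""
    return mean_diff / (sigma * math.sqrt(2.0 / n_per_group))

# A tiny difference measured on a huge sample.
tiny_shift_many = z_statistic(mean_diff=0.01, sigma=1.0, n_per_group=1_000_000)

# A large difference measured on a handful of strawberries.
big_shift_few = z_statistic(mean_diff=1.0, sigma=1.0, n_per_group=4)

print(f"shift 0.01, n=1,000,000: z = {tiny_shift_many:.2f}")
print(f"shift 1.00, n=4:         z = {big_shift_few:.2f}")
```

With the usual two-sided 5% threshold of about 1.96, the 0.01 difference is significant and the 1.0 difference is not, even though the second distribution has moved a hundred times further.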

The Wasserstein distance is used to measure how much the probability distributions differ.

It is known as the Wasserstein metric in optimal transport theory.

It pairs up the elements of two probability distributions so that the sum of the distances between paired elements is minimized.

This method can be used with any scale as long as the distance between elements can be determined.

On a nominal scale, one can assign the same distance between every pair of categories, or assign shorter distances between similar categories.

For example, a "white cat" and a "black cat" can be closer than a "white cat" and a "dog".

It is not necessary to place all the categories in a coordinate system; it is enough to know the distance between each pair.

Using distances means you do not have to rely on whether categories match exactly or not.

If you measure the difference between distributions by whether categories match, the result changes depending on how finely the categories are subdivided.

For that reason, measures such as the Kullback-Leibler divergence cannot be used.
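A small sketch of this sensitivity to subdivision follows. The two samples are hypothetical: one consists only of white cats and the other only of black cats. The helper function is written out here and is not taken from any library.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; infinite if q gives probability 0 where p does not."""
    total = 0.0
    for category, pk in p.items():
        if pk == 0:
            continue
        qk = q.get(category, 0.0)
        if qk == 0:
            return math.inf
        total += pk * math.log2(pk / qk)
    return total

# Coarse labels: both samples are simply "cat".
coarse_p = {"cat": 1.0}
coarse_q = {"cat": 1.0}

# Finer labels for exactly the same animals.
fine_p = {"white cat": 1.0}
fine_q = {"black cat": 1.0}

print(kl_divergence(coarse_p, coarse_q))  # 0.0
print(kl_divergence(fine_p, fine_q))      # inf
```

Nothing about the animals changed between the two calls, only the labeling, yet the result jumps from no difference at all to an infinite one.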

Also, "unknown" is at distance 0 from whatever it is paired with.

This is because we are trying to find the minimum distance.
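The pairing described above can be sketched for small samples by trying every one-to-one matching. This is a simplification under stated assumptions: both samples have the same size and every element has equal weight, in which case the best pairing gives the Wasserstein distance between the two empirical distributions. The pairwise distances are hypothetical, and brute force is only practical for a handful of elements.

```python
from itertools import permutations

# Hypothetical pairwise distances between categories (no coordinates needed).
PAIR_DISTANCE = {
    frozenset(["white cat", "black cat"]): 0.2,
    frozenset(["white cat", "dog"]): 1.0,
    frozenset(["black cat", "dog"]): 1.0,
}

def distance(a, b):
    """Distance between two labels; "unknown" is 0 away from anything."""
    if a == b or "unknown" in (a, b):
        return 0.0
    return PAIR_DISTANCE[frozenset([a, b])]

def matching_distance(xs, ys):
    """Average distance under the cheapest one-to-one pairing of xs with ys."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    best_total = min(
        sum(distance(x, y) for x, y in zip(xs, perm))
        for perm in permutations(ys)
    )
    return best_total / len(xs)

sample_a = ["white cat", "white cat", "dog"]
sample_b = ["black cat", "dog", "dog"]
sample_c = ["black cat", "dog", "unknown"]

print(matching_distance(sample_a, sample_b))  # white cat paired with black cat, dog with dog, white cat with dog
print(matching_distance(sample_a, sample_c))  # the leftover white cat is absorbed by "unknown"
```

Because the search looks for the cheapest pairing, "unknown" always ends up matched with whichever element would otherwise have been the most expensive to place.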

As an example, consider testing the effect of a drug.

Suppose a drug is judged, with 99% confidence, to be more effective than a placebo.

Setting aside the 1% chance of error, one would normally judge it to be effective.

But what if the drug is the best performer out of 100 different candidates that were tried?

If you test 100 different drugs that are all ineffective, chances are that at least one of them will appear effective at the 99% level.

In such cases, additional testing of that drug could reduce the error rate.
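The size of the effect is easy to compute. The sketch assumes the 100 tests are independent and that each ineffective drug has a 1% chance of passing by coincidence. The function name is illustrative.

```python
def prob_at_least_one_false_positive(n_tests, false_positive_rate):
    """Chance that at least one of n independent tests passes by coincidence."""
    return 1.0 - (1.0 - false_positive_rate) ** n_tests

p_any = prob_at_least_one_false_positive(n_tests=100, false_positive_rate=0.01)
print(f"{p_any:.3f}")  # 0.634

# An independent retest of the single selected drug is one test again,
# so an ineffective drug passes it with probability 0.01.
p_retest = prob_at_least_one_false_positive(n_tests=1, false_positive_rate=0.01)
print(f"{p_retest:.3f}")  # 0.010
```

Under these assumptions, a "99%" result picked from 100 ineffective candidates turns up roughly 63% of the time, which is why the figure cannot be read at face value without a retest.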

How should we interpret this result if no additional studies are available?

Even then, the fact that the 99% figure cannot be taken at face value does not make the result worthless as information.

We can only say that the drug has a higher probability of being effective than the other 99 drugs.

As another example, consider the case of predicting changes in stock prices.

Suppose we want to know which stocks will go up, so we search extensively for conditions under which stock prices rise.

Suppose, on the other hand, that we search for only a few conditions under which stock prices fall.

We would then conclude that most stocks meet more conditions for rising than conditions for falling.

This bias cannot be eliminated unless the number of trials spent searching for a regularity, and hence the chance of finding one by coincidence, is taken into account.
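A rough count shows how the asymmetry arises even when prices are pure noise. The numbers are hypothetical: each candidate condition is assumed to look valid by coincidence 5% of the time, and ten times as many "rise" conditions are examined as "fall" conditions.

```python
def expected_spurious_conditions(n_candidates, chance_rate):
    """Expected number of candidate conditions that pass by coincidence alone."""
    return n_candidates * chance_rate

CHANCE_RATE = 0.05          # assumed rate at which noise looks like a pattern
N_RISE_CANDIDATES = 200     # conditions examined for "price will rise"
N_FALL_CANDIDATES = 20      # conditions examined for "price will fall"

spurious_rise = expected_spurious_conditions(N_RISE_CANDIDATES, CHANCE_RATE)
spurious_fall = expected_spurious_conditions(N_FALL_CANDIDATES, CHANCE_RATE)

print(f"expected spurious rise conditions: {spurious_rise:.1f}")
print(f"expected spurious fall conditions: {spurious_fall:.1f}")
```

The ratio of rise conditions to fall conditions that survive simply mirrors the ratio of search effort, not anything about the stocks.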

Another example of a similar bias is an optimization problem in which we search only around the current candidate solution and end up with a local optimum.
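A minimal sketch of such a search follows, using simple hill climbing on integers. The objective function is hypothetical, with a low peak at x = 2 and a higher peak at x = 8. Hill climbing is used here only as one familiar example of searching around the current candidate.

```python
def score(x):
    """Hypothetical objective: a low peak at x = 2 and a higher peak at x = 8."""
    low_peak = max(3 - abs(x - 2), 0)
    high_peak = max(6 - 2 * abs(x - 8), 0)
    return low_peak + high_peak

def hill_climb(start, step=1):
    """Move to the better neighbor until neither neighbor improves the score."""
    x = start
    while True:
        best = max((x - step, x, x + step), key=score)
        if score(best) <= score(x):
            return x
        x = best

print(hill_climb(0))  # 2: stuck on the low peak
print(hill_climb(6))  # 8: reaches the high peak
```

Which answer comes out depends entirely on where the search started, which is the same kind of bias as deciding in advance which conditions to look for.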

An inference may give what is objectively the wrong answer and still be valid as an inference.

As an example, consider the case of predicting whether strawberries are sweet or sour.

Suppose we predict whether the strawberry at the bottom of basket A is sweet or not.

Suppose every strawberry except the prediction target has been tasted, and all those in basket A were sweet while all those in basket B were sour.

Naturally, you would expect the strawberry at the bottom of basket A to be sweet as well.

However, there are cases where this kind of prediction, however many times it is made, is wrong with high probability.

One such case is when a malicious third party has made only the bottommost strawberry in basket A sour.

If you do not know that, predicting "sweet" is still the correct inference.

It is a reasonable guess based on the information given.

However, examples like this also show that you can deceive yourself.

In my own head, I can group the strawberries in whatever way is convenient and then draw inferences.

Even if you deceive yourself without realizing it, that is a bias that should be avoided.

Within the information you are given, you should try to avoid bias as much as possible.