Induction

■ Quality, quantity and certainty in inductive inference

Consider a standard example of induction.
1: Both yesterday and the day before yesterday, the sun rose in the east [premise]
2: On all days, the sun rises in the east [generalization by induction]
3: Therefore, tomorrow the sun will rise in the east [deduction]
2 → 3: This step is deductive, because whenever 2 holds, 3 always holds.
1 → 2: This step is inductive, because it adds information: the premise says nothing about whether the sun will rise tomorrow, but the generalization does.
Induction is sometimes further divided into "enumerative induction" and "analogy".
"Enumerative induction" is the idea that the more times you have observed the sun rising, yesterday, the day before, and so on, the more certain you become.
Now, what if what we saw yesterday was only "something like the sun", and we weren't sure it was the sun?
The closer what we saw yesterday is to the sun, the greater the certainty is believed to be. That is "analogy".
Enumerative induction is inference in which the more data you have, the more certainty you get.
Analogy is inference in which the better the quality of the data, the more certainty you get.
"Quality" and "quantity" cannot be treated separately.
We have to think about which is more reliable: "poor quality but high quantity" or "good quality but low quantity".
The higher the quantity, the higher the certainty.
The higher the quality, the higher the certainty. Assume that quality is expressed as a value from 0 to 100%.
"Quality" = 0% means a sample completely unrelated to the inference target.
"Quality" = 100% means a sample drawn from the same population as the inference target.
If samples with "quality" = 100% are collected without limit, the certainty asymptotically approaches 100%.
If samples with "quality" = 50% are collected without limit, the certainty asymptotically approaches 50%.
This means that if we are not sure that what we observed was the sun, then no matter how many times we see it rise in the east, there is an upper limit to our certainty that the sun rises in the east.
From the quality and quantity of n samples, the certainty is obtained by the following formula.
Certainty = (sum of the n "quality" values) / (n + 1)
Ratio of "unknown" = 1 - certainty
The way to think about it is as follows. First, suppose there is one piece of data called "unknown", which is the inference target.
Then add data with "the sun rises in the east" = 100%, one at a time.
When only one is added, "the sun rises in the east" = 50% and "unknown" = 50%.
When two are added, "the sun rises in the east" = 2/3 and "unknown" = 1/3.
Adding without limit approaches "the sun rises in the east" = 100%.
For data with a quality of 50%, add data with "the sun rises in the east" = 50% and "unknown" = 50%, one at a time.
When only one is added, "the sun rises in the east" = 25% and "unknown" = 25% + 50% = 75%.
When two are added, "the sun rises in the east" = 0.5*2/3 = 1/3 and "unknown" = (0.5*2+1)/3 = 2/3.
Adding without limit approaches "the sun rises in the east" = 50%.
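The worked numbers above can be checked with a short Python sketch. The function names `certainty` and `unknown_ratio` are illustrative and do not come from the original text. Qualities are given as fractions between 0 and 1, and exact fractions are used so that the results match the hand calculation.

```python
from fractions import Fraction


def certainty(qualities):
    """Certainty = (sum of qualities) / (n + 1).

    The "+ 1" in the denominator is the single "unknown" item
    that stands for the inference target.
    """
    n = len(qualities)
    return sum(qualities, Fraction(0)) / (n + 1)


def unknown_ratio(qualities):
    """Ratio of "unknown" = 1 - certainty."""
    return 1 - certainty(qualities)


full = Fraction(1)     # quality = 100%
half = Fraction(1, 2)  # quality = 50%

print(certainty([full]))            # 1/2
print(certainty([full, full]))      # 2/3
print(certainty([half]))            # 1/4
print(unknown_ratio([half]))        # 3/4
print(certainty([half, half]))      # 1/3
print(unknown_ratio([half, half]))  # 2/3
print(certainty([half] * 1000))     # 500/1001, just below 1/2
```

With 1000 samples of quality 50%, the certainty is 500/1001, which illustrates the ceiling of 50% described above.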
This formula suggests that data of relatively poor quality should not be used.
If you have enough "sun" data, you should ignore "sun-like" data.
Add data in descending order of quality, and stop when the certainty no longer increases.
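The selection rule can be sketched as follows, assuming the formula Certainty = (sum of qualities)/(n+1). Under that formula, adding a sample of quality q raises the certainty only when q is greater than the current certainty, so the loop stops at the first sample that fails this test. The function name `select_samples` is hypothetical.

```python
def select_samples(qualities):
    """Greedily add samples in descending order of quality.

    Stops as soon as the next sample would not increase the
    certainty (sum of qualities) / (n + 1).
    Returns the chosen qualities and the resulting certainty.
    """
    chosen = []
    total = 0.0
    for q in sorted(qualities, reverse=True):
        current = total / (len(chosen) + 1)
        if q <= current:
            break  # this sample and all later ones cannot help
        chosen.append(q)
        total += q
    return chosen, total / (len(chosen) + 1)


# Three "sun" observations (quality 1.0) and two "sun-like" ones (quality 0.5).
chosen, result = select_samples([0.5, 1.0, 1.0, 0.5, 1.0])
print(chosen)  # [1.0, 1.0, 1.0]
print(result)  # 0.75
```

Using all five samples would give (1 + 1 + 1 + 0.5 + 0.5)/6 ≈ 0.67, which is lower than the 0.75 obtained from the three high-quality samples alone.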

■ Minimizing the range of the inference target

General "inductive inference" often aims at "generalization".
In the sun example, we aim to show that the sun rises in the east on all days.
"All days" includes a day 5 trillion years from now.
But if we only want to know whether the sun will rise in the east tomorrow, why do we need to show that it will rise in the east in 5 trillion years?
The more distant the target is in time, the more likely geographic change becomes, and the lower the quality of the data.
It would be difficult to show that the sun will rise in the east in 5 trillion years.
If possible, I would like to show only that the sun will rise in the east tomorrow.
Example 1: "Tomorrow the sun will rise in the east"
Example 2: "Tomorrow, the sun will rise in the east" AND "In 5 trillion years, the sun will rise in the east"
Example 2 asserts more than Example 1, so Example 1 is the easier one to establish.
"Generalization" can be interpreted as inferring the properties of the whole population from the properties of the selected samples.
Instead of aiming for generalization, consider a method for estimating the properties of the next sample to be drawn.
To do that, define the minimum necessary population instead of the whole population.
The minimal population should include only the samples used for the prediction and the unknown sample to be predicted.
In other words, the minimal population is the set of samples plus one "unknown".
In general, the probability distribution of a sample drawn at random matches the distribution of the population.
The case of the minimum population is also considered in the same way.
The minimum population distribution becomes the probability distribution to be predicted.
We want to find the distribution of the minimal population, but nothing special is needed to do so.
The distribution of the minimal population includes "unknown", which should be kept as it is, without converting it to another value.
If no sample is available, "unknown" = 100%, and that is the most accurate answer.
You should not arbitrarily assume a uniform distribution or a maximum-entropy distribution.
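A minimal sketch of this idea, assuming every sample counts as exactly one item: the distribution is simply the relative frequency of each value in the set of samples plus one "unknown" item. The function name `minimal_population_distribution` is illustrative.

```python
from collections import Counter
from fractions import Fraction

UNKNOWN = "unknown"


def minimal_population_distribution(samples):
    """Distribution of the observed samples plus one "unknown" item.

    "unknown" is kept as its own value and is not redistributed
    over the other values.
    """
    counts = Counter(samples)
    counts[UNKNOWN] += 1  # the one item that is to be predicted
    total = sum(counts.values())
    return {value: Fraction(count, total) for value, count in counts.items()}


# No samples at all: the only honest answer is "unknown" = 100%.
print(minimal_population_distribution([]))

# Nine observed sunrises in the east: "east" = 9/10, "unknown" = 1/10.
print(minimal_population_distribution(["east"] * 9))
```

Note that with no samples the function does not fall back to a uniform prior over possible values; it returns only "unknown".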
In this calculation method, the amount of data need not be an integer.
For example, the temperature for the next 1.2 days may be predicted from the temperature for the past 0.3 days.
The prediction target range may even be infinitesimal.
The amount of data must be counted in "repeating units".
A piece of data may also participate in the minimal population only partially, for example at 50%.
That is, each piece of data can be given a weight between 0 and 1.
So far in this section, we have not considered the quality of the data.
The quality of the data can be expressed as the weight of each piece of data.

■ Reduction of inductive problem to deductive problem

In general, inference that increases information is classified as "induction", and inference that does not is classified as "deduction".
Here, an operation that increases information is called "inductive inference", and an operation that neither increases nor decreases information is called "deductive inference".
In addition, a problem that asks about information we do not have is called an "inductive problem".
For example, guessing whether the sun will rise in the east tomorrow is an inductive problem.
At first glance, it seems that the answer to an "inductive problem" can only be found by "inductive inference", but that is not the case.
I take the position that the correct answer about something I don't know is "I don't know".
Even if you are asked about what will happen in the future, information does not increase as long as you assert nothing.
However, there is a difference between silently not answering a question and answering "I don't know".
If you don't know something, you have to say "I don't know".
For that purpose, we introduce the state "unknown".
With "unknown", you can answer an "inductive problem" without increasing information through "inductive inference".
However, merely preventing an increase in information does not make the inference "deductive"; a decrease in information must be prevented as well.
Answering "unknown" = 100% when we have been given information that allows a partial guess about the future means that information has been thrown away.
How to answer correctly in such a case can be fixed as a rule, the "axiom of induction".
Then all inductive problems can be reduced to deductive problems.
Computers have long been good at solving deductive problems by symbolic processing.

■ Repeating unit

In enumerative induction, the greater the number of observations, the greater the certainty.
How this "number" is counted is determined by the "repeating unit".
As an example, consider the prediction of a sequence.
The prediction target is written as "?".
Example 1: 111111111?
There are nine "1"s and one "?", so we infer "1" = 90% and "unknown" = 10%.
Example 2: 123451234?
The nine known items follow the rule that "12345" is repeated.
However, it is incorrect to infer "5" = 90% and "unknown" = 10%.
Since the repetition "12345" has appeared only once in the past, the correct approach is to split the sequence into "12345" and "1234?".
The inference is then "5" = 50% and "unknown" = 50%.
The part "1234" of "1234?" only partially satisfies "12345", so it contributes 0 to the count.
In Example 1, the sequence can be regarded as having been split into "1", "1", "1", "1", "1", "1", "1", "1", "1", and "?".
Looking only at the last five items of Example 2, they are "1", "2", "3", "4", and "?", each increasing by one.
Taking the transition as the repeating unit gives "1→2", "2→3", "3→4", and "4→?".
We can then infer "5" = 75% and "unknown" = 25%.
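The three sequence calculations can be reproduced with a small Python sketch. It rests on two assumptions that are not spelled out in the text: the sequence splits evenly into units of a fixed length, and complete units that do not match the known part of the final unit are simply ignored. The function names `predict_by_unit` and `predict_by_step` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

UNKNOWN = "unknown"


def with_unknown(counts):
    """Add the single "unknown" item and normalize counts to fractions."""
    total = sum(counts.values()) + 1
    dist = {value: Fraction(count, total) for value, count in counts.items()}
    dist[UNKNOWN] = Fraction(1, total)
    return dist


def predict_by_unit(sequence, unit_length):
    """Split the sequence into fixed-length repeating units and predict "?"."""
    if not sequence.endswith("?") or len(sequence) % unit_length != 0:
        raise ValueError("sequence must end in '?' and split evenly into units")
    units = [sequence[i:i + unit_length]
             for i in range(0, len(sequence), unit_length)]
    *complete, target = units
    known_part = target[:-1]
    # Only complete units count; the partial unit "1234?" contributes 0.
    counts = Counter(u[-1] for u in complete if u[:-1] == known_part)
    return with_unknown(counts)


def predict_by_step(numbers):
    """Use the transition between neighbours as the repeating unit."""
    steps = Counter(b - a for a, b in zip(numbers, numbers[1:]))
    dist = with_unknown(steps)
    last = numbers[-1]
    return {(k if k == UNKNOWN else last + k): v for k, v in dist.items()}


print(predict_by_unit("111111111?", 1))  # "1" = 9/10, "unknown" = 1/10
print(predict_by_unit("123451234?", 5))  # "5" = 1/2,  "unknown" = 1/2
print(predict_by_step([1, 2, 3, 4]))     # 5 = 3/4,    "unknown" = 1/4
```

The same data gives different certainties depending on the chosen repeating unit, which is why the unit has to be fixed before counting.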
As another example, predict whether a strawberry is sweet.
Example: "Red strawberry: sweet" "Red strawberry: sweet" "Red strawberry: sweet" "Large strawberry: sweet" "Large strawberry: ?"
Here we could form the following hypothesis:
Hypothesis: "Red" or "large" strawberries are sweet.
Four of the five items follow this hypothesis, so "sweet" = 80% and "unknown" = 20%.
This inference is wrong.
Since there is only one past observation of a large strawberry being sweet, "sweet" = 50% and "unknown" = 50% is correct.
Observing "red" data does not add any confirmation to the "large" part of the hypothesis.
This is similar to the "grue paradox" in philosophy, and even humans easily get it wrong.
As with sequences, data that only partially satisfy the hypothesis count as zero.
If the hypothesis contains "or", each alternative must be evaluated separately.
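The difference between the pooled count and the per-alternative count can be shown in a few lines of Python. This is a sketch under the assumption that each observation carries a single feature label ("red" or "large"); the function name `predict_taste` and its `match_feature` flag are illustrative only.

```python
from collections import Counter
from fractions import Fraction

observations = [
    ("red", "sweet"),
    ("red", "sweet"),
    ("red", "sweet"),
    ("large", "sweet"),
]


def predict_taste(observations, feature, match_feature=True):
    """Predict the taste of a strawberry with the given feature.

    With match_feature=True, only observations sharing the feature count.
    With match_feature=False, all observations are pooled, which is the
    mistaken reading of the "red or large" hypothesis.
    """
    relevant = [taste for f, taste in observations
                if f == feature or not match_feature]
    total = len(relevant) + 1  # + 1 for the "unknown" target
    dist = {taste: Fraction(n, total) for taste, n in Counter(relevant).items()}
    dist["unknown"] = Fraction(1, total)
    return dist


print(predict_taste(observations, "large", match_feature=False))  # sweet = 4/5
print(predict_taste(observations, "large"))                       # sweet = 1/2
```

Only the second call evaluates the "large" alternative on its own evidence, giving "sweet" = 50% and "unknown" = 50%.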

■ Formal general solution of the induction problem

1. As the problem, receive information about which values are to be inferred.
Values to be inferred are assigned the value "unknown".
2. Receive all the information that may or may not be useful for the problem.
3. Assume one fuzzy set.
The items from step 1 participate in the fuzzy set with a weight of 1, and the items from step 2 with a weight between 0 and 1.
4. All items must participate in the fuzzy set in the form of the same repeating unit.
The amount of data is counted in repeating units.
5. Adjust the weights.
6. The distribution of the fuzzy set is the probability distribution of the inference result.
~End
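Steps 1 to 6 can be sketched in Python under one possible reading: the "unknown" target joins the set with weight 1, each piece of data joins with a weight between 0 and 1, and the result is the weighted share of each value. This reading, in which a sample of weight w counts as w items, is an assumption, and the function name `fuzzy_set_distribution` is hypothetical. Step 5, choosing the weights, is not implemented here; the weights are taken as given.

```python
def fuzzy_set_distribution(weighted_samples):
    """Distribution of a fuzzy set made of weighted data plus "unknown".

    weighted_samples: iterable of (value, weight) pairs, 0 <= weight <= 1.
    The inference target participates as "unknown" with weight 1.
    """
    totals = {"unknown": 1.0}
    for value, weight in weighted_samples:
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weights must lie between 0 and 1")
        totals[value] = totals.get(value, 0.0) + weight
    grand_total = sum(totals.values())
    return {value: w / grand_total for value, w in totals.items()}


# Without explanatory variables every weight is 1.
print(fuzzy_set_distribution([("east", 1.0)] * 3))
# {'unknown': 0.25, 'east': 0.75}

# Two samples that participate only at 50% each.
print(fuzzy_set_distribution([("east", 0.5), ("east", 0.5)]))
# {'unknown': 0.5, 'east': 0.5}
```

Because the weights need not be integers, the same function also covers fractional amounts of data.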
Here, the weight adjustment in step 5 is where we decide which explanatory variables to consider and which data to adopt.
Assuming an optimal solution exists, the weights will take their optimal values there.
There are many ways to make an inference, so you will have to decide which one is appropriate.
For example, suppose there are results inferred by method A and results inferred by method B.
Here, it is not necessary to say which method is the best overall, but it must be possible to judge which is better, A or B.
The best solution is the one that outperforms every other inference.
Just because A is better than B doesn't mean A is the best solution.
You also have to construct an inference C that combines A and B, and decide which is better, A or C.
If there is a time limit for answering, the best result available at that time should be given.
To be able to decide which inference is better, two major things have to be determined.
・Determine the axiom of induction. An inference result that does not follow it is inappropriate, with no need for comparison.
・In accordance with the axiom of induction, determine the criterion for judging how good or bad the result of an inference is.
We will determine these later by considering the simplest possible example in which explanatory variables are present.
If there are no explanatory variables, every weight is taken to be 1, so the method above can be used for the calculation as it is.

■ Entropy, Bias and Variance

Consider whether entropy can be used to judge whether an inference result is good or bad.
Entropy decreases as the candidates for the correct answer are narrowed down from the possible answers.
If there is only one candidate, the entropy is 0.
Because we already know the correct answer, being told the correct answer gives us zero information.
On the other hand, "unknown" = 100% is the worst guess and has the highest entropy.
The ratio of "unknown" has a large effect on entropy, but the other parts affect it as well.
As an example, consider three people predicting the movement of a stock price.
Mr. A: "Up" = 100%
Mr. B: "Down" = 100%
Mr. C: "Up" = 50%, "Down" = 50%
Mr. C's answer is the same as the answer given when nothing can be inferred.
The results of Mr. A and Mr. B have lower entropy than Mr. C's, but can we say that they are better results?
Mr. A and Mr. B make strong claims with confidence, but their claims contradict each other.
At least one of them is biased and is making a claim that deviates from the true answer.
In other words, claiming 100% makes the variance small, but the bias is large.
In fact, either Mr. A or Mr. B may be correct.
Or stock prices may move completely randomly, in which case Mr. C is correct.
It is a mistake to judge an inference as good just because its percentage is high.
Similarly, entropy reflects only variance and does not reflect bias.
It is a mistake to judge whether an inference is good or bad only by the magnitude of its entropy.
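The stock example can be made concrete with standard Shannon entropy in bits. This sketch covers only the named outcomes "up" and "down"; how "unknown" should enter the entropy is not defined in the text and is not modeled here. The function name `entropy_bits` is illustrative.

```python
import math


def entropy_bits(distribution):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return sum(p * math.log2(1 / p) for p in distribution.values() if p > 0)


mr_a = {"up": 1.0}
mr_b = {"down": 1.0}
mr_c = {"up": 0.5, "down": 0.5}

print(entropy_bits(mr_a))  # 0.0
print(entropy_bits(mr_b))  # 0.0
print(entropy_bits(mr_c))  # 1.0
```

Mr. A and Mr. B receive the same entropy of 0 even though their claims contradict each other, so entropy alone cannot tell which of them, if either, is close to the true answer.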
If it is wrong to judge good or bad by probability, it is also wrong to judge by expected value.
The idea that an AI should be given a reward and left to learn to maximize the expected value of that reward is insufficient.
A mechanism that optimizes the balance between bias and variance is necessary.