Thanks to quarantine I have a lot of time on my hands. I remember that ever since I first used this mod, back in MC 1.7, the Geolyzer description has had this sentence:
"It is theoretically possible to eliminate this noise by scanning repeatedly and finding an average. (Unconfirmed, needs further testing.)"
Now, many years later, I know a fair bit about statistics and have plenty of time.
So, let's begin.
We begin our dive into statistics with a hypothesis: does a measured number belong to an ore or to stone? Put in hardness terms, is the true hardness equal to 3 (ores) or 1.5 (stone)? Now we establish our null hypothesis. The null hypothesis should be the case I least want to reject by mistake (it will make more sense in a bit). I would rather not lose ores, so my null hypothesis is that the true hardness behind the measured number is 3, unless there is enough evidence to the contrary (again, more sense later).
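In statistics we have two kinds of error, given in the table below. A type 1 error is rejecting the null hypothesis when it is true (here, calling an ore stone); a type 2 error is keeping it when it is false (calling stone an ore). We want to minimize type 1 even if type 2 grows, but not to the point of, for example, marking everything as ore, which gives 0% type 1 and 100% type 2. Nor do we want a coin flip, where both errors are 50%. So we need a small type 1 error without a gigantic type 2 error. Let's try. From now on I will omit "error" and just write type 1 and type 2.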
Let's first analyze what I believe is the most traditional way: checking whether a number is bigger than a threshold. I wrote some code to help with that (DataCollector): given a line of 32 blocks, it goes through each one, measures it 1000 times, and sorts the readings into intervals of 0.25. Each row of the result counts the readings that are smaller than the value in its first column.
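To make the binning step concrete, here is a minimal Python sketch of the same idea (it is not DataCollector itself, which runs in-game). The noise model is an assumption: uniform noise whose half-width is distance/16. The function names and the table range are also made up for illustration, and each row is read as "readings in the 0.25-wide interval ending at this value".

```python
import math
import random

def simulate_readings(hardness, distance, n, rng):
    # ASSUMPTION: uniform noise with half-width distance/16 (not the mod's real code).
    half_width = distance / 16.0
    return [hardness + rng.uniform(-half_width, half_width) for _ in range(n)]

def bin_readings(readings, step=0.25, lo=-2.0, hi=8.0):
    # Each row is (upper_edge, count of readings in [upper_edge - step, upper_edge)).
    n_bins = int(round((hi - lo) / step))
    counts = [0] * n_bins
    for r in readings:
        i = int(math.floor((r - lo) / step))
        i = min(max(i, 0), n_bins - 1)  # clamp anything outside the table
        counts[i] += 1
    return [(lo + (i + 1) * step, c) for i, c in enumerate(counts)]

rng = random.Random(0)
table = bin_readings(simulate_readings(3.0, 16, 1000, rng))
for upper, count in table:
    if count:
        print(f"< {upper:.2f}: {count}")
```

With an ore (hardness 3) at distance 16 every non-empty row lies between 2 and 4.25 under this model, which is the kind of spread the table is meant to show.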
It's quite hard to take in all at once, and comparing it with stone this way would be disastrous, but we can clearly see a pattern: every 4 blocks of distance, the readings spread into one more row. So I made the distance go from 0 to 4, to 8, ... up to 56. When possible I used an exact distance, for example 40=sqrt(24^2+32^2); when not possible, a very close one, 56≈sqrt(32^2+32^2+32^2). Comparing stone and ore side by side we get this:
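A quick way to check which block offsets give a given distance is to compute the Euclidean distance directly. The offsets below are illustrative picks, not necessarily the ones used in the experiment.

```python
import math

def block_distance(dx, dy, dz=0):
    # Straight-line distance from the geolyzer to a block at offset (dx, dy, dz).
    return math.sqrt(dx * dx + dy * dy + dz * dz)

print(block_distance(24, 32))                # 40.0, an exact distance
print(round(block_distance(32, 32, 32), 2))  # 55.43, the closest corner to 56
```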
Based on this we can see the strengths and weaknesses of a single threshold versus multiple thresholds. In my code I tried to keep the approach general, so detecting any other pair of blocks is possible.
In my code I use two variables. The first, "meet", is the distance at which the two ranges of readings meet, given by (Hhard-Lhard)*8. The second, Havg=(Hhard+Lhard)/2, is the value at which they meet. For ore and stone that is meet=12 and Havg=2.25. If the distance is smaller than "meet", a reading is simply above or below Havg, and that single check is enough to be sure.
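The two variables can be sketched in a few lines of Python. The formulas are the ones from the post; the function names and the "high"/"low" labels are made up for this example and are not taken from the repository.

```python
def pair_parameters(h_hard, l_hard):
    # meet: distance at which the two reading ranges start to touch
    # h_avg: the reading value at which they touch
    meet = (h_hard - l_hard) * 8
    h_avg = (h_hard + l_hard) / 2
    return meet, h_avg

def classify_single(reading, distance, h_hard=3.0, l_hard=1.5):
    # Returns (guess, certain). The guess is only certain below the meet distance.
    meet, h_avg = pair_parameters(h_hard, l_hard)
    guess = "high" if reading > h_avg else "low"
    return guess, distance < meet

print(pair_parameters(3.0, 1.5))   # (12.0, 2.25)
print(classify_single(2.6, 8))     # ('high', True)
print(classify_single(2.6, 20))    # ('high', False)
```

Passing another pair, for example `pair_parameters(1.5, 0.5)`, gives the same two numbers for a different pair of hardness values.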
If we use a single threshold of >2.25 we are fine up to a distance of 12. Beyond that, our errors are:
16: type1=10.9% type2=11.4% | 20: type1=21.7% type2=19.2% | 24: type1=24.7% type2=23.9% | 28: type1=29.8% type2=26.7% | 32: type1=29.6% type2=32.7%
As we can see, not good, not terrible
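As a sanity check on those figures, here is a rough Python sketch of what a single 2.25 threshold should do if the noise were exactly uniform with half-width distance/16. That model is my assumption, inferred from the "meet" formula, and the function name is hypothetical. The measured type 1 values are copied from the list above.

```python
def single_threshold_error(distance, h_hard=3.0, l_hard=1.5):
    # ASSUMPTION: readings are uniform on [hardness - d/16, hardness + d/16].
    half_width = distance / 16.0
    gap = (h_hard - l_hard) / 2.0      # distance from either hardness to Havg
    if half_width <= gap:
        return 0.0                     # ranges do not reach the threshold yet
    return (half_width - gap) / (2.0 * half_width)

measured_type1 = {16: 10.9, 20: 21.7, 24: 24.7, 28: 29.8, 32: 29.6}
for d, measured in measured_type1.items():
    model = 100 * single_threshold_error(d)
    print(f"distance {d}: model {model:.1f}%  measured {measured}%")
```

Under this model the error is the same for both types; the measured numbers stay within a few percentage points of it.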
If we want to be sure whether it is stone or ore, we must know the distance and find a new threshold for it. That could be done for every distance. I don't see one particular distribution for all of them, but at every multiple of 4 the readings are a good approximation of a uniform distribution (where every value is equally likely), and the distances in between don't change the chance of being wrong by much. With these thresholds, type 1 and type 2 (the chance that a single reading lands where stone and ore overlap) are equal:
16: 24% | 20: 42% | 24: 50.4% | 28: 58.1% | 32: 61.8% | 36: 65.8% | 40: 70.5% | 44: 73.9% | 48: 75.4% | 56: 78.5%
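These percentages are close to what a purely uniform model predicts for the share of readings that land in the overlap between the stone and ore ranges. The sketch below assumes uniform noise with half-width distance/16 (my assumption, not something read from the mod), which reduces to 1 - meet/distance; the function name is hypothetical.

```python
def overlap_fraction(distance, h_hard=3.0, l_hard=1.5):
    # ASSUMPTION: uniform readings of total width distance/8 around each hardness.
    # Overlap width is distance/8 - (h_hard - l_hard); divide by the total width.
    meet = (h_hard - l_hard) * 8
    if distance <= meet:
        return 0.0
    return 1.0 - meet / distance

for d in (16, 20, 24, 28, 32, 36, 40, 44, 48, 56):
    print(f"{d}: {100 * overlap_fraction(d):.1f}%")
```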
That looks bad, given that our objective is to minimize type 1. But we can use multiple measurements to decrease the error, which we were going to do anyway for the second kind of analysis. I chose 6 measurements, but you can pick another number; I tried to make it easy to change in the code. Keep in mind that the geolyzer is slow and every measurement costs time and energy: each extra measurement adds 55 seconds to a scan, so the default (6) takes about 5 min 30 s. The new errors after six measurements are (error)^6:
16: <0.1% | 20: 0.5% | 24: 1.6% | 28: 3.8% | 32: 5.6% | 36: 8.1% | 40: 12.2% | 44: 16% | 48: 18% | 56: 23.4%
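The (error)^6 step is just the chance that all six independent measurements fail in the same way. Here is a small Python sketch using the single-measurement percentages listed earlier in the post; the function and variable names are made up for the example.

```python
SECONDS_PER_MEASUREMENT = 55  # scan time per measurement, as quoted in the post

single = {16: 0.24, 20: 0.42, 24: 0.504, 28: 0.581, 32: 0.618,
          36: 0.658, 40: 0.705, 44: 0.739, 48: 0.754, 56: 0.785}

def after_repeats(p_single, n=6):
    # Probability that n independent measurements are all undecidable.
    return p_single ** n

n = 6
print(f"scan time: {n * SECONDS_PER_MEASUREMENT} s")
for d, p in single.items():
    print(f"{d}: {100 * after_repeats(p, n):.2f}%")
```

Changing `n` shows the trade-off directly: every extra measurement multiplies the error by the single-measurement value again and adds 55 seconds.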
This is already much better than the previous test, but it is still very high. Every scan command always covers a distance of 32 (the height), and with this test alone there is a high chance of missing something. The robot would have to move around or run more measurements, which adds up to a lot of time and energy for such a poor result.
Now we start using statistical theory. For that we need to know how the variance behaves with distance, which I measured with Variance.lua (changing x and y) and Variance1.lua (for the Z axis). This code gives the standard deviation, which represents how far a reading typically lies from the mean.
Plotting it on a graph, the standard deviation is approximately 0.035*distance. As this free class explains, https://www.khanacademy.org/math/statistics-probability/sampling-distributions-library#what-is-a-sampling-distribution , the distribution of sample means approaches a normal distribution as the number of samples grows, no matter what the original distribution is. To visualize it, raise nSamples in DataCollector.lua: the default is 1, which shows the original distribution; here I am using 6 samples.
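Both claims can be checked with a short simulation. The Python sketch below assumes the noise is uniform with total width distance/8 (an assumption of mine that follows from the "meet" formula, not something read from the mod). Under that model the standard deviation is (distance/8)/sqrt(12), about 0.036*distance, and the average of n readings has that value divided by sqrt(n).

```python
import math
import random
import statistics

def uniform_sd(distance):
    # SD of a uniform distribution of total width distance/8 (ASSUMED noise model).
    return (distance / 8.0) / math.sqrt(12.0)

def sd_of_means(distance, n_samples, trials, rng):
    # Simulate `trials` averages of n_samples noisy readings and measure their spread.
    half_width = distance / 16.0
    means = []
    for _ in range(trials):
        sample = [rng.uniform(-half_width, half_width) for _ in range(n_samples)]
        means.append(statistics.fmean(sample))
    return statistics.pstdev(means)

rng = random.Random(1)
simulated = sd_of_means(32, 6, 20000, rng)
expected = uniform_sd(32) / math.sqrt(6)
print(f"slope per block: {uniform_sd(1):.4f}")
print(f"SD of 6-sample means at distance 32: simulated {simulated:.3f}, expected {expected:.3f}")
```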
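The threshold for the average, given a Z value from the Z table below, is:
OreT = OreHardness(3) - Z*SD/sqrt(nSamples), with SD = 0.035*distance and nSamples = 6
Z sets my type 1. For type 2 I solve StoneHardness*sqrt(nSamples)/(Z2*SD) = OreT for a second, positive Z value (Z2):
Z2 = StoneHardness*sqrt(nSamples)/(SD*OreT)
This is all a mess, I know, so let's put in some numbers. The Z value for 5% is about 1.64 or 1.65; I will go with 1.64. At a distance of 56, SD = 0.035*56 = 1.96 and sqrt(6) ≈ 2.44, so OreT = 3 - 1.64*1.96/2.44 = 1.68. For type 2, Z2 = 1.5*2.44/(1.96*1.68) ≈ 1.11, which corresponds to about 13%. Note that the usual z-score for the stone side, (OreT - StoneHardness)*sqrt(nSamples)/SD, comes out at only about 0.23 here (roughly 41%), so the type 2 figures obtained this way are probably optimistic.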
type1 = 5% at every distance. type2: 16: 0.6% | 20: 1.9% | 24: 3.6% | 28: 5.3% | 32: 7.2% | 36: 8.8% | 40: 10% | 44: 11% | 48: 12% | 52: 13% | 56: 13.5%
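For anyone who wants to play with the numbers, here is a hedged Python sketch of the averaged-reading threshold. The threshold follows the formula OreT = 3 - Z*SD/sqrt(nSamples) with SD = 0.035*distance. The stone side uses the textbook standardisation (threshold minus stone hardness, divided by SD/sqrt(n)), which is not the expression written in the post, so its output will not match the type 2 figures listed above. All function names are made up.

```python
import math
from statistics import NormalDist

SD_PER_BLOCK = 0.035  # slope quoted in the post: SD is about 0.035 * distance

def ore_threshold(distance, z=1.64, n_samples=6, ore_hardness=3.0):
    # Averages below this value are treated as "not ore".
    sd = SD_PER_BLOCK * distance
    return ore_hardness - z * sd / math.sqrt(n_samples)

def stone_pass_rate(distance, z=1.64, n_samples=6, stone_hardness=1.5):
    # ASSUMPTION: textbook z-score for the stone side, same SD as for ore.
    # Chance that the average of n stone readings still lands above the threshold.
    sd = SD_PER_BLOCK * distance
    z2 = (ore_threshold(distance, z, n_samples) - stone_hardness) * math.sqrt(n_samples) / sd
    return 1.0 - NormalDist().cdf(z2)

for d in (16, 32, 40, 56):
    print(f"{d}: OreT={ore_threshold(d):.2f}  stone passing as ore={100 * stone_pass_rate(d):.1f}%")
```

At distance 32 the threshold lands almost exactly on 2.25, the midpoint between stone and ore, so both error types are about 5% there under this model.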
We can see that this is the best analysis so far overall, mainly at bigger distances. Even though it is hard to understand, it is super easy to implement in code. This analysis and the previous one can be treated as independent events, so if we do both we significantly lower the chance of any error: at a distance of 56 we get type1=1.1% and type2=3.5%, which is super low. For my code I chose to go only up to a distance of 40 (note that 32^2+32^2+16^2=48^2, so the far corner of a 32*32*16 box lies at 48, not 40). For me 1.1% is still huge, and 16*32*32 is enough volume, but you can tune it to whatever you like.
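Combining the two analyses is a plain product of probabilities, which only holds if the two tests really are independent (that is the post's assumption, not something checked here). A minimal Python sketch with a made-up function name, using the distance-56 figures quoted earlier:

```python
def both_wrong(p_first, p_second):
    # Chance that two INDEPENDENT tests are wrong at the same time.
    return p_first * p_second

# Distance 56: 23.4% from the six-measurement threshold test, 5% from the averaged test.
type1_56 = both_wrong(0.234, 0.05)
print(f"combined type 1 at distance 56: {100 * type1_56:.2f}%")
```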
I want to point out that the code for both analyses was written with any pair of blocks in mind, be it dirt and stone, stone and ore, or ore and diamond block, as long as their hardness values are far enough apart. It also picks up anything softer than stone or harder than ore, so you need a further check to filter out the ones you don't want.
https://github.com/gabiiel/GeolyzerOreFinder
PS: I am not a professional coder or anything like that; I just wanted to burn some quarantine time. If you have any questions, improvements or feature requests, I am willing to help. I hope everyone could follow along. I am also leaving it up to you how to get to the ores. This is a very low-end program that can run on any machine; the ores are stored in a string, and there are functions to help write to and read from it.
DataCollector.lua variance.lua variance1.lua