LHS Question
Apr. 19th, 2006 08:09 pm

Are there any LHS gurus on my friendslist? No, not Lexington High School, I know about you people. I'm wondering about Latin Hypercube Sampling.
See, Monte Carlo is this method of sampling from probability distribution functions. Basically, you pick random points from your pdf. The more points you pick, the more your distribution of results looks like the "true" distribution. The problem with Monte Carlo is that it is kind of inefficient, which doesn't matter if sampling is cheap and you can pick 10,000 points. But when a point takes 36 hours of processor time... well, efficiency counts.
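For concreteness, here's a minimal sketch of plain Monte Carlo in Python. (The exponential pdf and the sample count are just placeholders, not anything from my actual problem.)

```python
import math
import random

def mc_sample_exponential(n, rate=1.0):
    """Plain Monte Carlo: n independent draws from an exponential pdf,
    via the inverse CDF  F^-1(u) = -ln(1 - u) / rate."""
    return [-math.log(1.0 - random.random()) / rate for _ in range(n)]

samples = mc_sample_exponential(10_000)
mean = sum(samples) / len(samples)   # converges to 1/rate as n grows
```

Each draw is independent, which is exactly why you need so many of them.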
So, there is this method called Latin Hypercube Sampling. Basically, you divide your pdf into N equal-probability bins, where N is your number of samples. Then you pick a value from each bin _without_ replacement. This ensures that you are sampling from the entire variable space. E.g., if you take 100 samples, you are guaranteed to pick a point from the upper 1% of your distribution every time. That's not true in Monte Carlo. Various papers show that LHS is a "good" method - it doesn't have bias, and on average it will always be as good as or better than Monte Carlo for the same N.
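The 1D version of that binning looks like this sketch (same placeholder exponential pdf as above):

```python
import math
import random

def lhs_sample_exponential(n, rate=1.0):
    """1D Latin Hypercube Sampling: split [0, 1) into n equal-probability
    bins, draw one uniform point inside each bin (so every bin is used
    exactly once -- "without replacement"), then push each point through
    the inverse CDF."""
    samples = []
    for i in range(n):
        u = (i + random.random()) / n   # a uniform point inside bin i
        samples.append(-math.log(1.0 - u) / rate)
    random.shuffle(samples)             # discard the bin ordering
    return samples
```

With n=100, exactly one point lands in each percentile-width slice of probability, including the extreme tails.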
Now, if you have 2 variables, you just divide each variable up into bins. You pick from one variable, then the other variable. If the variables are independent, it is just that simple, for either MC or LHS. If you want to be fancy, you can apply various routines to your LHS sampling to ensure that samples are "representative," so you approach the "true" distribution even faster.
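In code, the two-independent-variables case is just two 1D stratifications plus a random shuffle to pair up the bins (unit-uniform marginals here, for simplicity):

```python
import random

def lhs_unit_square(n):
    """2D LHS with independent variables: stratify each axis into n
    equal-probability bins, then randomly permute one axis so the bin
    pairing is random. Every row and every column of the n x n grid
    gets exactly one sample (a Latin square)."""
    xs = [(i + random.random()) / n for i in range(n)]
    ys = [(i + random.random()) / n for i in range(n)]
    random.shuffle(ys)
    return list(zip(xs, ys))
```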
Now, the problem occurs when the 2 variables aren't independent. For example, say you have a population with random height and weight. If you sample from height and get the highest 1% bin, you don't want to just sample randomly from the weight bins, or you could get the lowest 1%, resulting in a physically impossible 7 foot tall 80 lb beanpole. In Monte Carlo, this isn't a problem: you sample once from variable 1 based on its pdf, and then take a sample from variable 2 based on the conditional probability function. But in LHS, you are supposed to pick from a bin "without replacement". Well - do you bin the conditional probability function, so that if you take the 27th percentile bin, you never pick a 27th percentile bin again?

The worry is that with the right 2D function, when you pick the 1% height bin and the 99% weight bin, you get a physically possible 7 foot 150 lber... and then your next point is the 99% height bin and the 1% weight bin, for a physically possible 5 foot 150 lber... and suddenly you've picked 100 points and they are _all_ 150 lbs. The whole point of LHS is to ensure that really skewed samples like this can't happen, unlike in Monte Carlo. Intuitively I feel that approaching LHS this way is still valid, but I worry that it might not have good statistical properties - e.g., your sample may no longer have the right mode or variability or skew or whatever.
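The scheme I'm asking about would look something like this sketch - a made-up bivariate normal "height/weight" pair with correlation rho, everything a placeholder, and whether this is statistically legal is exactly my question:

```python
import math
import random
import statistics

def conditional_lhs(n, rho=0.8):
    """Bin height with LHS, and also bin the *percentile of the
    conditional weight pdf* with LHS, pairing the two sets of bins via
    a random permutation. Each conditional-percentile bin is used
    exactly once, so you can't draw the same conditional percentile
    (the all-150-lb sample) n times in a row.
    Both marginals are standard normal; weight | height ~ N(rho*h, 1 - rho^2)."""
    nd = statistics.NormalDist()
    u_h = [(i + random.random()) / n for i in range(n)]   # height strata
    u_w = [(i + random.random()) / n for i in range(n)]   # conditional strata
    random.shuffle(u_w)
    points = []
    for uh, uw in zip(u_h, u_w):
        h = nd.inv_cdf(uh)
        w = rho * h + math.sqrt(1.0 - rho * rho) * nd.inv_cdf(uw)
        points.append((h, w))
    return points
```

This reproduces the right correlation on the margins; whether the stratification of the conditional percentiles preserves the higher moments of the joint distribution is the part I can't convince myself of.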
Our default approach has been to not bother with conditional pdfs and just stick correlation parameters into our sampling routine for the original marginals - so if you pick the 1% height bin, you are more likely to pick a large weight bin. But that is kind of a blunt-stick approach, and as N goes to infinity it does not actually converge on the true joint pdf (except for very simple 2D functions), unlike Monte Carlo, or LHS in the independent-variables case. And no one in our group is really the right kind of statistics expert. I do plan to ask around through standard academic channels, but I figured that lj was a perfectly fine starting point.
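The flavor of the correlation trick is something like this sketch - not our actual routine, just the idea of biasing which weight bin gets paired with which height bin:

```python
import random

def correlated_bin_pairing(n, noise=0.5):
    """Pair bin i of variable 1 with a bin of variable 2 whose rank is
    correlated with i: score each bin index with (index + noise), then
    use the ranks of the scores as the pairing. noise=0 gives perfect
    rank correlation; large noise approaches a pure random permutation.
    This nudges the marginals toward the right correlation but never
    reproduces the full joint pdf."""
    scores = [i + noise * n * random.gauss(0.0, 1.0) for i in range(n)]
    order = sorted(range(n), key=scores.__getitem__)
    ranks = [0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks   # ranks[i] is the variable-2 bin paired with bin i
```

You can see why it's blunt: the only knob is a single correlation strength, where a real joint pdf has a whole conditional distribution's worth of structure.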
Thank you for your patience. For people who use Monte Carlo, I hope maybe you learned something useful, and if you happen to be an expert in this area and know what is "legal" to do, please tell me!
no subject
Date: 2006-04-19 07:44 pm (UTC)

In general, if it's only two variables and you can characterize the pdf, you ought to be able to do something, at least computationally. Actually, even if it's some big mess, you might be able to create your bins by monte carlo sampling points from the pdf (don't actually compute the value at the points, just get a bunch of points on the control variables) and then empirically making bins from that.
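Something like this sketch, with a placeholder correlated cloud standing in for the real pdf: draw a big cheap cloud of points from the joint distribution (no expensive model runs), sort on the first variable, cut into n equal-count bins, and keep one cloud point per bin - the second variable comes along with the right conditional behavior automatically.

```python
import random

def lhs_from_cloud(cloud, n):
    """Empirical LHS bins: sort the Monte Carlo cloud on the first
    coordinate, cut it into n equal-count (equal-probability) bins,
    and pick one cloud point at random from each bin. The first
    variable is stratified exactly once per bin; its partner value is
    implicitly a draw from the conditional distribution."""
    pts = sorted(cloud)                      # sorts on first coordinate
    size = len(pts) // n
    return [random.choice(pts[i * size:(i + 1) * size]) for i in range(n)]

# A cheap placeholder cloud: correlated "height" and "weight"
cloud = []
for _ in range(10_000):
    h = random.gauss(0.0, 1.0)
    cloud.append((h, 0.8 * h + random.gauss(0.0, 0.6)))
design = lhs_from_cloud(cloud, 100)
```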
I'm going on vacation the next few days and won't be online, so I probably can't comment further.
no subject
Date: 2006-04-20 05:02 am (UTC)

Thanks!
no subject
Date: 2006-04-20 05:07 am (UTC)

It may take a little thought about how to actually implement it, but it doesn't seem like it should be difficult. And intuitively, this feels like it is the right way to do this problem. I'll suggest it to my group.
(the variables are non-discrete, but I don't think that makes a difference. If anything, I actually think it should make it easier)
no subject
Date: 2006-04-20 06:52 am (UTC)

no subject
Date: 2006-04-20 06:18 am (UTC)

no subject
Date: 2006-04-20 06:30 am (UTC)

no subject
Date: 2006-04-20 06:57 am (UTC)

Heck, even making your pdf for the outcome of rolling 3d6, you are taking 3 known input pdfs (uniform probability across 1d6) and running them through a function (addition) to come out with a new pdf.
My input pdfs and functions are just more complicated than yours. =)
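In code, the 3d6 toy version of the problem looks like:

```python
import random
from collections import Counter

def pdf_of_3d6(n_samples=100_000):
    """Push three uniform 1d6 pdfs through the "function" (addition)
    and estimate the pdf of the sum by plain Monte Carlo."""
    counts = Counter(sum(random.randint(1, 6) for _ in range(3))
                     for _ in range(n_samples))
    return {total: counts[total] / n_samples for total in range(3, 19)}

pdf = pdf_of_3d6()
```

Same shape as my real problem: known input pdfs, a function, and a new pdf out the other end - just with input pdfs and a function you could do by hand.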
no subject
Date: 2006-04-20 07:02 am (UTC)

no subject
Date: 2006-04-20 07:12 am (UTC)

We use somewhere between 100 and 1000 samples for our work, typically.