marcusmarcusrc: (Default)
[personal profile] marcusmarcusrc
Are there any LHS gurus on my friendslist? No, not Lexington High School, I know about you people. I'm wondering about Latin Hypercube Sampling.


See, Monte Carlo is this method of sampling from probability distribution functions. Basically, you pick points from your pdf. The more points you pick, the more your distribution of results looks like the "true" distribution. The problem with Monte Carlo is it is kind of inefficient, which doesn't matter if sampling is cheap and you can pick 10,000 points. But when a point takes 36 hours of processor time... well, efficiency counts.
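To make that concrete, here is a minimal Monte Carlo sketch in Python (the standard normal is just a stand-in for whatever your real input pdf is; all names here are mine):

```python
import random

# Plain Monte Carlo: draw N independent samples straight from the pdf.
# A standard normal stands in for an expensive model input here.
def monte_carlo_samples(n, seed=0):
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

few = monte_carlo_samples(10)      # cheap, but the empirical mean wanders
many = monte_carlo_samples(10000)  # costly, but settles near the true mean of 0
```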

So, there is this method called Latin Hypercube Sampling (LHS). Basically, you divide your pdf into N equal-probability bins, where N is your number of samples, and then you pick one value from each bin, sampling bins _without_ replacement. This ensures that you are sampling from the entire variable space. E.g., if you take 100 samples, you are picking a point from the upper 1% of your distribution every time. That's not true in Monte Carlo. Various papers show that LHS is a "good" method: it is unbiased, and on average it will always be as good as or better than Monte Carlo for the same N.
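In code, a 1-D LHS sketch might look like this, assuming you can evaluate the inverse CDF of your pdf (the uniform example at the end is purely illustrative, since its inverse CDF is the identity):

```python
import random

def lhs_1d(n, inv_cdf, seed=0):
    # One draw from each of n equal-probability bins, no bin repeated.
    rng = random.Random(seed)
    # bin i covers probabilities [i/n, (i+1)/n); take one point inside each
    probs = [(i + rng.random()) / n for i in range(n)]
    rng.shuffle(probs)  # random order, so later pairing with other variables is random
    return [inv_cdf(p) for p in probs]

# Uniform(0, 1) pdf, whose inverse CDF is the identity:
pts = lhs_1d(100, lambda p: p)
# By construction, one point comes from the top 1% and one from the bottom 1%.
```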

Now, if you have 2 variables, you just divide each variable up into bins. You pick from one variable, then the other variable. If the variables are independent, it is just that simple, for either MC or LHS. If you want to be fancy, you can apply various routines to your LHS sampling to ensure that samples are "representative" to approach "true" even faster.
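For the independent 2-variable case, the usual trick is to take a stratified sample for each marginal and pair them in random order. A sketch on the probability scale (the function name is mine, not a library call):

```python
import random

def lhs_2d(n, seed=0):
    # 2-D LHS for independent variables: stratify each marginal, pair at random.
    rng = random.Random(seed)
    x = [(i + rng.random()) / n for i in range(n)]  # one point per x bin
    y = [(i + rng.random()) / n for i in range(n)]  # one point per y bin
    rng.shuffle(y)  # random pairing keeps x and y uncorrelated
    return list(zip(x, y))

pairs = lhs_2d(100)
# 100 samples cover every 1%-wide bin of each marginal exactly once.
```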

Now, the problem occurs when the 2 variables aren't independent. For example, say you have a population with random height and weight. If you sample from height and get the top 1% bin, you don't want to just sample randomly from the weight bins, or you could get the lowest 1%, resulting in a physically impossible 7 foot tall, 80 lb beanpole. In Monte Carlo, this isn't a problem: you sample once from variable 1 based on its pdf, and then take a sample from variable 2 based on the conditional probability function. But in LHS, you are supposed to pick from a bin "without replacement". So do you bin the conditional probability function, and if you take the 27th percentile bin, never pick the 27th bin again? The problem is that, with the right 2D function, when you pick the 1% height bin and the 99% weight bin, you get a physically possible 7 foot, 150 lber... and then your next point is the 99% height bin and the 1% weight bin, for a physically possible 5 foot, 150 lber... and suddenly you've picked 100 points and they are _all_ 150 lbs. The whole point of LHS is to ensure that really skewed samples like this can't happen, unlike in Monte Carlo. Intuitively, I feel that approaching LHS this way is still valid, but I worry that it might not have good statistical properties: e.g., your sample may no longer have the right mode or variability or skew.
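For contrast, the conditional Monte Carlo version is straightforward. A sketch with made-up numbers (the means, slope, and spreads are purely illustrative, not real anthropometry):

```python
import random

def sample_person(rng):
    # marginal pdf for height, then weight from the conditional pdf given height
    height_in = rng.gauss(69.0, 3.0)            # ~5'9" +/- 3" (illustrative)
    mean_wt = 150.0 + 5.0 * (height_in - 69.0)  # conditional mean rises with height
    weight_lb = rng.gauss(mean_wt, 15.0)
    return height_in, weight_lb

rng = random.Random(1)
people = [sample_person(rng) for _ in range(5000)]
# Tall people come out heavier on average: no 7 foot, 80 lb beanpoles in bulk.
```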

Our default approach has been to not bother with conditional pdfs and just stick correlation parameters into our sampling routine for the original marginals, so if you pick the 1% height bin, you are more likely to pick a large weight bin. But that is kind of a blunt-stick approach, and as N goes to infinity it does not actually converge on the true pdf (except for very simple 2D functions), unlike Monte Carlo, or LHS in the independent-variable case. And no one in our group is really the right kind of statistics expert. I do plan to ask around through standard academic channels, but I figured that lj was a perfectly fine starting point.
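For what it's worth, that "correlation parameter" pairing can be sketched as a rank trick: sort both marginal samples, then jitter the rank pairing. This is a crude stand-in for our actual routine (all names and numbers are illustrative), and it has exactly the flaw described above:

```python
import random

def correlated_pairing(heights, weights, noise, seed=0):
    # Pair two marginal samples so their ranks roughly track each other.
    rng = random.Random(seed)
    heights = sorted(heights)
    weights = sorted(weights)
    n = len(heights)
    # jitter the target ranks; small noise => strong rank correlation
    order = sorted(range(n), key=lambda i: i + rng.gauss(0, noise * n))
    return [(heights[i], weights[order[i]]) for i in range(n)]

heights = [60 + i / 10 for i in range(100)]  # illustrative marginal samples
weights = [100 + i for i in range(100)]
pairs = correlated_pairing(heights, weights, noise=0.05)
```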

Thank you for your patience. For people who use Monte Carlo, I hope maybe you learned something useful, and if you happen to be an expert in this area and know what is "legal" to do, please tell me!

Date: 2006-04-19 07:44 pm (UTC)
From: [identity profile] rifmeister.livejournal.com
Do you really only have two variables, or do you actually have some horrible mess of variables? My intuition is that the key to the whole thing working is that you have to initially create N bins which contain equal amounts of the probability mass, and be able to take one sample uniformly at random from each bin. If you just bin the conditionals (imagining for a moment that your variables are discrete or something), you have to be careful to satisfy this.

In general, if it's only two variables and you can characterize the pdf, you ought to be able to do something, at least computationally. Actually, even if it's some big mess, you might be able to create your bins by Monte Carlo sampling points from the pdf (don't actually compute the value at the points, just get a bunch of points on the control variables) and then empirically making bins from that.
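In 1-D, that empirical-binning idea might look like the following sketch (names are mine; the Gaussian draw is just an example pdf):

```python
import random

def empirical_lhs(n, draw, cloud_size=10000, seed=0):
    # Cut a cheap Monte Carlo cloud into n equal-count (equal-probability)
    # bins, then take one point from each bin without replacement.
    rng = random.Random(seed)
    cloud = sorted(draw(rng) for _ in range(cloud_size))
    step = cloud_size // n
    return [cloud[i * step + rng.randrange(step)] for i in range(n)]

pts = empirical_lhs(100, lambda rng: rng.gauss(0, 1))
# One point from each empirical percentile bin, tails included.
```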

I'm going on vacation the next few days and won't be online, so I probably can't comment further.

Date: 2006-04-20 05:02 am (UTC)
From: [identity profile] marcusmarcusrc.livejournal.com
Oh. Hmm. A 2D (or whatever-D) bin. That could work. (It is a 3D pdf, actually, but yes, we have a nice 3D density function, so, hmm, why not?)

Thanks!

Date: 2006-04-20 05:07 am (UTC)
From: [identity profile] marcusmarcusrc.livejournal.com
You would think one of the 3 PhDs we had standing around the water cooler yesterday discussing this would have thought of this option.

It may take a little thought about how to actually implement it, but it doesn't seem like it should be difficult. And intuitively, this feels like it is the right way to do this problem. I'll suggest it to my group.

(The variables are non-discrete, but I don't think that makes a difference. If anything, I actually think it should make it easier.)


Date: 2006-04-20 06:52 am (UTC)
From: [identity profile] marcusmarcusrc.livejournal.com
Darn. On second thought, it has been pointed out to me that for Latin Hypercube Sampling in the 2D independent-pdf case, if you have 100 bins for each pdf, you are taking 100 samples across a 2D space that is composed of 100x100 bins. Which is very different from dividing your 2D space into 100 equal-probability bins.

Date: 2006-04-20 06:18 am (UTC)
From: [identity profile] firstfrost.livejournal.com
I'm confused. When I run Monte Carlo simulations (and okay, I'm not a math theoretician, I only do it for things like role-playing games), it's because it's a quick and dirty way to find out what the pdf looks like. LHS starts by assuming I know what the pdf looks like, sufficient to divide it into equal pieces. But if I know what the pdf looks like, why am I sampling it again?

Date: 2006-04-20 06:30 am (UTC)
From: [identity profile] marcusmarcusrc.livejournal.com
In this case, we have a pdf for, say, Climate Sensitivity. This pdf is an input to a function called "the planet". We want to know what our pdf looks like for temperature at the end. But we can't operate our function on a pdf, we can only operate it on a point. So we can sample randomly from climate sensitivity, run it through our planet function, and get temperatures and create a temperature pdf.

Date: 2006-04-20 06:57 am (UTC)
From: [identity profile] marcusmarcusrc.livejournal.com
I imagine that when you do a Monte Carlo for a roleplaying game, you are actually starting with a pdf you know (a 20 sided die, for example) and running it through some function to create a pdf you didn't previously know.

Heck, even making your pdf for the outcome of rolling 3d6, you are taking 3 known input pdfs (uniform probability across 1d6) and running them through a function (addition) to come out with a new pdf.
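The 3d6 case as plain Monte Carlo, for concreteness:

```python
import random

# Three known input pdfs (uniform 1d6 each) pushed through a function (addition).
rng = random.Random(0)
rolls = [sum(rng.randint(1, 6) for _ in range(3)) for _ in range(100000)]
# The empirical distribution of `rolls` approximates the 3d6 bell curve;
# its mean approaches the true mean of 10.5.
```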

My input pdfs and functions are just more complicated than yours. =)

Date: 2006-04-20 07:02 am (UTC)
From: [identity profile] firstfrost.livejournal.com
Okay, so if I were to do LHS on 3d6, and we assume that I understand the pdf of 1d6, then would I be dividing each 1d6 into 6 uniform bins and playing with those? Rather than dividing the 3d6 bell curve into some number of uniform bins?

Date: 2006-04-20 07:12 am (UTC)
From: [identity profile] marcusmarcusrc.livejournal.com
Yes. And since you are sampling without replacement, you would end up with only 6 points at the end. And using only 6 points sucks regardless of LHS or MC, but it would be slightly more likely to be well distributed in your final distribution using LHS. And you take the 6 points at the end and try to make a bell curve out of them.
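Concretely, LHS on 3d6 with N = 6 just means each die shows every face exactly once across the six samples, with random pairing between the dice (a sketch, not anyone's production code):

```python
import random

def lhs_3d6(seed=0):
    # Each die's pdf has 6 equal-probability bins (its faces); sampling
    # without replacement uses each face once, pairing the dice at random.
    rng = random.Random(seed)
    dice = [list(range(1, 7)) for _ in range(3)]
    for d in dice:
        rng.shuffle(d)
    return [a + b + c for a, b, c in zip(*dice)]

pts = lhs_3d6()
# Six totals that always sum to 3 * 21 = 63, so the sample mean is exactly 10.5.
```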

We use somewhere between 100 and 1000 samples for our work, typically.
