What interests me is seeking out and noticing patterns in culture, politics, the economy, religion, and so on, categorizing these things and abstracting from them into various layers of abstractions. I am not as interested in the every day practical notions that relate to these matters and that ask pointed questions of them. So, in many ways I am trying to take these notions and rip them out of their context so that I may analyze them in isolation and abstract new meta-notions from them.

However, to successfully get this across, there is nonetheless a need to anchor this discussion into a practical issue with particular phenomenon that are happening in the context of today. Since I work for an education company I find that I can most readily speak about issues surrounding education and since I am a relatively young adult, I find I can most readily speak about other young adults.

Education today is on an uneasy precipice. For the first time in our modern era (post World War II) taking the age old adage to heart about going to post-secondary education institution and graduating with skills necessary for quick employment in your chosen field is becoming less and less relevant. New graduates today are facing challenges unlike at any other period. Especially people graduating from the arts and humanities find themselves most vulnerable. To digress, should we feel bad for these people? After all, they must have heard the jokes about not being able to find employment with an English major. Nontheless, they have graduated doing what they love and what they are accustomed to and instead of, for the first time in their lives, having the freedom to explore the world doing the things that they have the skills for, find themselves with less freedom, less mobility, in their parents’ basement. Instead of a multitude of choices related to what workplace environemnt, what institution they should feel is in the greatest need of their contribution, they essentially have two choices. Go back to school and do a ‘real’ engineering major, accummulating more debt and frustration, or taking low paying jobs for the foreseeable future.

I do not care about any particular instance and what the best choice wil prove to be for a particulat individual, I care mostly about this scenario, mostly from the perspective of the swelling of the ranks of these people.

Simply, a system the takes in young, idealistic individuals, full of promise, only to output more highly skilled but unemployable bodies that are forced to join the ranks of the unemployed is simply not sustainable and is due to crash eventually. It is like pouring water into a balloon.

Such phenomena can be seen in various economic situations such as boom and bust cycles such as the housing bubble and the tech bubble of recent memory. Subsequently, I would posit that the situation described resembles those economic phenomena in terms of abstract patterns.

It turns out that these behaviors can indeed be abstracted since we have seen cases like this before in many different shapes. 

If the price for Coca Cola suddenly doubled overnight, it is fair to say that the number of purchases of Coca Cola would probably more than halve. We would expect this from an economically elastic product such as this. Simply, there are any alternatives out there and we do not rely on Coca Cola for more than it is used for (a beverage). On the other hand, if the price for gas suddenly doubled overnight, there would be  no noticeable difference, initially. After all, we rely on gas to get us to where we need to go. We would indeed still have to go to the gas station, face the daunting realization that we have to pay such outlandish prices and still have to pay it. No amount of anger and abuse hurled at the poor attendants would remedy the situation.

This is a classic abstraction with a context all its own that can nonetheless be applied to the patterns discovered in our education context. The young people going in to an education from institutions and graduating with a useless (to society) degree, thus swelling the ranks of the unemployed can be easily related to the economic principle of the hypothetical situation in which the price of gas is doubled overnight (even though the two scenarios might not readily be related). The swelling will continue indefinitely until is the very fabric of the context changes. This is the only way to stop the swelling. For the gas price scenario, people will switch (eventually) to other solutions such as alternative fuels, public transportation, carpooling, biking, etc. There are no shortages of ideas. But the important point is that the whole way of doin things will have to change. The context itself changes. What we see in education will invariably be the same. Ideas of how to alter education to its core will start to surface, if not already, and a paradigm shift will emerge. This context will also change.

The key point is that even though there is a context change, the pattern of the balloon swelling until it bursts causing a paradigm shift, this pattern can be readily abstracted to coincide with the key features found in various implementation of this pattern.

This is what in the end interests me.

Posted on by aduric | Leave a comment

Lecture 2 – Intro to Supervised Learning

The gist of supervised learning it to learn to predict a certain outcome from a set of training examples. We’ve already seen a form of that approach from a probabilistic perspective in the last lecture. In this lecture we focus on some of the basic techniques and algorithms.

Let’s say we have a dataset containing house prices according to square feet:

Living Area (sq feet) Price (1000$)

If we plot the data we get something like this:

So, we can already see that there’s a positive relationship between price and sq. feet. Therefore, it shouldn’t be too difficult to make predictions given this data. In fact, good ol’ regression will do the trick.

In the frame of machine learning, I’ll use the following notation:

x^{(i)} input variables (features)

y^{(i)} output / target variable

(x^{(i)}, y^{(i)}) – training example

{(x^{(i)}, y^{(i)}); i = 1 ... m} – training set of m examples

The task becomes to learn a function h : X \rightarrow Y such that h(x) is a good predictor.

First, a little introduction to regression.

There are many forms of regression, but we’ll mostly be concerned with the linear and polynomial varieties.


The most often-used procedure for fitting a line or any polynomial to the data is te Least Squares method (LS). This means that we are summing the squares of the little distances between the line and the data points and minimizing them.

We’ll start with linear fitting using LS but this procedure can be easily generalized to polynomial fitting.

For linear fitting, we’ll start with the linear function that most of you should remember:

f(a,b) = ax + b (1)

Vertical least squares fitting proceeds by finding the sum of the of n data points:

R^{2} = \sum [y_i = f(x_i, a_1, a_2, ..., a_n)]^{2} (2)

Substituting (1) into (2) we get:

R^{2}(a, b) = \sum_{i=1}^{n} [y_i - f(a + bx_i)]^{2} (3)

We know that R^{2} has a minimum where:

\frac{\partial R^{2}}{\partial a_i} = 0 where i = 1 … n

Now if you take the partial derivative of x and y w.r.t. to a and b, we get these Normal equations:

na + b\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i

a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^{2} = \sum_{i=1}^{n} x_i y_i

Then with a little algebra (which I won’t do here), you’ll get these terms for a and b:

b = (\sum_{i=1}^{n} x_i y_i) - nx_i y_i

a = (\sum_{i=1}^{n} x_i^{2}) - n \bar{x}^{2}

where \bar{x} = \sum_{i=1}^{n} x_i and \bar{y} = \sum_{i=1}^{n} y_i .

You can also get r^{2} in the same way but we’re not too concerned with that term.

Polynomial fitting is done the same way, so I’ll leave that as an exercise because I want you to learn something and I’m kinda tired of latex right now 🙂

To get you started I’ll give you the k^{th} degree polynomial:

y = a_0 + a_1 x + ... + a_k x^{k} (4)

And substituting into (2), we get the residual:

R^{2} = \sum [y_i - (a_0 + a_1 x + ... + a_k x^{k})]^{2} (5)

The end result should be:

a = (X^{T} X)^{-1} X^{T} y (6)

where X is the matrix of x’s, a is the vector of polynomial coefficients and y is the vector of outcomes.

Judging by equation (6), we know that we can fit a polynomial to data directly using the  closed form that we (hopefully) derived.

If you don’t feel like working with Hessians and taking derivatives of complex polynomials (who doesn’t?), you can use the gradient descent algorithm.

The gradient descent algorithm is an optimization algorithm that takes incremental steps in a direction defined by a gradient. A gradient of a scalar function f(x_1, x_2, x_3, ... x_n) is:

\nabla f = (\frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n})

Of course this means that the function f has to be defined and differentiable in neighborhood of a point p_1 then f(x) decreases fastest if you go from p_1 in the direction of the negative gradient of f at p_1. This negative gradient is \nabla f(a).

So, if you move from p_1 to lets say p_2 in the direction of the gradient, you can express this as:

p_2 = p_1 - \alpha \nabla f(p_1)

where \alpha is a small number known as learning parameter or learning rate.

Now, suppose we have a vector of points P = (p_1, p_2, p_3, ..., p_n) then our update rule is:

p_n+1 = p_n - \alpha_n \nabla f(p_n), n \geq 0.

We also have an artifact of this rule where:

f(p_0) \geq f(p_1) \geq f(p_2) \geq ...

This says that we are guaranteed to converge to the global minimum. However, the above update rule has to look at the entire training set to figure out where to go next. If we have a large vector of parameters to optimize (like most ML problems), this becomes intractable.

The faster cousin of the gradient descent is called the stochastic gradient descent. This one is an on-line algorithm that looks at the training set one example at a time. However, it is not guaranteed to converge at all, much less at a global minimum. I’m reminded of a particular ‘eating cake’ metaphor.

So, let’s try to fit a straight line through our examples (x_1, y_1) ... (x_n, y_n). The straight line formula again is:

y = ax + b

The objective function to be minimized is (taking the derivative of) and repeating until convergence:

\sum_{i=1}^{n} (b + a x_i - y_i)^{2}

for i = 1 ... n:
\left(\begin{array}{c} b \\ a \end{array} \right) = \left(\begin{array}{c} b \\ a \end{array} \right) - 2(b + a x_i - y_i) \left(\begin{array}{c} 1 \\ x_i \end{array} \right) 
until convergence

Testing against a convergence condition is a science in and of itself, but one check we can use is to compare the norms of the parameter vectors after each iteration and if the difference falls below a certain value \epsilon , then we stop and say we have converged.

Posted in Uncategorized | Leave a comment

Lecture 2 – Regression

In this lecture I will be going through the basics of Regression. This is meant to be purely introductory material for later lectures, so I have not included anything about Bayesian regression and so on.

The link to the lecture video can be found here.

*NOTE: I made a mistake in the video. At about 11:15 I say that I’m taking the derivative w.r.t. to the hypothesis function y(w,x). It should be w.r.t to the parameter vector w.

Posted in Uncategorized | Leave a comment

Seminar 1: Bayesian Learning in R

R is the language of statistics and increasingly the ML community is using it for anything and everything. Without making myself appear too much of an R shill, I would like to say that the reason I use it is that it’s open-source, it has a plethora of packages, it’s easy-to-use, intuitive, blah blah blah.

In this course, I will be using R for some of the seminars so that you can actually see some of the theory in action.

From this point, I’m assuming that you can install it and learn some of the basics on your own. If you don’t understand something please ask the community.

Seminar 1: Learning about the proportion of heavy sleepers

What proportion of college students get at least 8 hours of sleep? Here we think of a population consisting of all American college students and let p represent the proportion of this population who sleep (on a typical night during the week) at least eight hours. We are interested in learning about the location of p.

The value of the proportion p is unknown. In the Bayesian viewpoint a person’s beliefs about the uncertainty in this proportion are represented by a probability distribution placed on this parameter. This distribution reflects the person’s subjective prior opinion about plausible values of p.

A random sample of students from a particular university will be taken to learn about this proportion. But first the person does some initial research to learn about the sleeping habits of college students. This research will help her in constructing a prior distribution.
In the Internet article “College Students Don’t Get Enough Sleep” in The Gamecock, the student newspaper of the University of South Carolina (April 20, 2004), the person reads that a sample survey reports that most students spend only six hours sleeping. She reads a second article “Sleep on It: Imple-menting a Relaxation Program into the College Curriculum” in Fresh Writing, a 2003 publication of the University of Notre Dame. Based on a sample of 100 students, “approximately 70% reported receiving only five to six hours of
sleep on the weekdays, 28% receiving seven to eight, and only 2% receiving the healthy nine hours for teenagers.”

Based on this information, this person believes that college students generally get less than eight hours of sleep and so p (the proportion that sleep at least eight hours) is likely smaller than .5. After some reflection, her best guess at the value of p is .3. But it is very plausible that this proportion could be any value in the interval from 0 to .5.

A sample of 27 students is taken – in this group, 11 record that they had at least eight hours of sleep the previous night. Based on the prior information and this observed data, the person is interested in estimating the proportion p. In addition, she is interested in predicting the number of students that get at least eight hours of sleep if a new sample of 20 students is taken.

Suppose that our prior density for p is denoted by g(p). If we regard a “success” as sleeping at least eight hours and we take a random sample with s successes and f failures, then the likelihood function is given by the binomial distribution:

L(p) = p^s(1 - p)^f, 0 < p < 1

The posterior density for p, by Bayes’ rule, is obtained, up to a proportionality constant, by multiplying the prior density by the likelihood.


So, we have a prior and a likelihood, and now we need to calculate the posterior!

There are a few ways to do this, but since the proportion is a continuous parameter, we should construct a Beta density that represents our prior (our initial knowledge of the proportion). The reason that we use the Beta distribution instead of any other one has to do with the likelihood being represented as a binomial distribution. Beta is conjugate to the binomial which (essentially) means that the posterior will have the same distribution as the prior, which is much easier to calculate as you will see shortly.

The Beta distribution is a family of continuous probability distributions defined on the interval (0, 1) parameterized by two positive shape parameters, typically denoted by α and β. In Bayesian statistics, it can be seen as the posterior distribution of the parameter p of a binomial distribution after observing α − 1 independent events with probability p and β − 1 with probability 1 − p, if there is no other information regarding the distribution of p.

The pdf of the Beta distribution is:

f(p; α, β) = \frac{1}{B(\alpha, \beta)}p^{1-\alpha}(1-p)^{1-\beta}

The B(α, β) is a normalizing constant and we can ignore it for now. And this is what it looks like according to some choice of parameters:


Suppose she believes that the proportion is equally likely to be smaller or larger than p = .3. Moreover, she is 90% confident that p is less than .5. So, these can be our parameters for the Beta distribution right? Actually, p = .3 seems to be the mean of a normal  distribution and this doesn’t make any sense when using the Beta. So, we can try out some values for α and β and take a note of the shape. The shape that matches the closest seems to be when α = 3.4 and β = 7.4. try it out!

So why are conjugate distributions easier to work with? Well…

Remember how to calculate the posterior:

g(p|data) \propto p(p)L(p)

In plain English, this is akin to:

posterior ∝ prior * likelihood

Therefore since our likelihood is a binomial and our prior is a Beta, conjugacy says that our posterior will also be a Beta. This means that as we get more information, we can replace our prior with the posterior and do the same calculation again:

g(p|data) \propto (p^{\alpha-1}(1-p)^{\beta-1})(p^{s}(1-p)^{f})

g(p|data) \propto p^{\alpha+s-1}(1-p)^{\beta+f-1}

where 0 < p < 1.

Now we just substitute α + s = 3.4 + 11 and β + f = 7. 4 + 16. Now we’re ready for R, so let’s do it. In your R-shell, input the following:

> p = seq(0, 1, length = 500)
> a = 3.4
> b = 7.4
> s = 11
> f = 16
> prior=dbeta(p,a,b)
> like=dbeta(p,s+1,f+1)
> post=dbeta(p,a+s,b+f)
> plot(p,post,type="l",ylab="Density",lty=2,lwd=3)
> lines(p,like,lty=1,lwd=3)
> lines(p,prior,lty=3,lwd=3)
> legend(.7,4,c("Prior","Likelihood","Posterior"),
+     lty=c(3,1,2),lwd=c(3,3,3))

And we get this distribution:

Notice how the posterior is “between” the prior and the likelihood? This is not a coincidence and it actually makes a lot of sense. You can think of it as using the prior for “smoothing” the likelihood. You can also see how this approach to inference can be deemed controversial because after all, you picked the prior yourself. You would get a slightly different result if you pick another prior. This point is the fundamental argument of frequentists against Bayesians. But I’ll leave that alone.

Now, what is the proportion of heavy sleepers p? And how confident are we? There a two different ways of making an inference into that:

The first way is compute the posterior probability, P(p < 0.5|data), which gives us the probability of p being less than 0.5:

> pbeta(0.5, a + s, b + f)
[1] 0.9315743

This probability is large, so there is a good chance that the true proportion is less than 0.5. Using the 90% confidence interval, where does the true proportion lie? We’ll use the 5th and 95th percentile to get the desired range:

> qbeta(c(0.05, 0.95), a + s, b + f)
[1] 0.2562364 0.5129274

We can also use simulation to estimate the proportion. In R, this is easy to do:

To simulate 1000 random proportions using our Beta parameters:

> ps = rbeta(1000, a + s, b + f)

To calculate the probability of the proportion being smaller than .5:

> sum(ps >= 0.5)/1000

And at our 90% confidence interval:

> quantile(ps, c(0.05, 0.95))
5%       95%
0.2599039 0.5172406

Notice the similarity between the values in the first case with the values in this one!

Example taken from Bayesian Data Analysis in R.

Exercise 1: Estimating a proportion

Bob claims to have ESP. To test this claim, you propose the following experiment. You will select one from four large cards with different geometric figures and Bob will try to identify it. Let p denote the probability that Bob is correct in identifying the figure for a single card. You believe that Bob has no ESP ability (p = .25), but there is a small chance that p is either larger or smaller than .25. After some thought, you place the following discrete prior distribution on p:

p:       0 .125, .250, .375, .500, .625, .750, .875, 1
g(p): .001, .001, .950, .008, .008, .008, .008, .008, .008,

Suppose that the experiment is repeated ten times and Bob is correct six times and incorrect four times. Find the posterior probabilities of these values of p. What is your posterior probability that Bob has no ability?

Posted in Uncategorized | Leave a comment

Lecture 1: Intro to Bayesian Learning

Greetings all and welcome to the first lecture. In this lecture I’ll be explaining the Bayesian world view which lends itself quite nicely for learning and inference tasks.

*An aside:* These lectures will be short and to the point to encourage as much discussion as possible from the community. If you would like to add something that might make the material easier for beginners, please comment. Feedback is always welcome as well.

Ok, so I’m assuming most of you remember Bayes rule from high school. This states that you can obtain the probability of an event A given an event B in terms of the event B given event A:

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

I won’t be deriving this result but suffice it to say that it can be derived quite easily by using the chain rule (for random variables X and their values x):

P(X_1 = x_1, ..., X_n) = \prod_{i=1}^{n} P(X_i = x_i | X_i-1 = x_i-1 , ..., X_1 = x_i)

Now at this point I’m assuming that you know enough probability theory to know about conditional probability, joint probability, marginal probability, how to marginalize random variables, common distributions, etc. If not, there are many introductory statistical texts you can find.

Going back to the Bayes rule, we can (sort of) begin to see how it could be useful for statistical inference. We are updating probabilities to take into account new information. This is called conditioning. More explicitly, given some initial information called the prior we update using the likelihood and as a result we get the posterior probability, where:

P(A) is the prior
P(A|B) is the posterior
P(B|A) is the likelihood
P(B) is the marginal probability of even B. It is used as the normalizing constant (more on that later).

Simply, the prior gets updated to the posterior through the likelihood. But what is the likelihood?

People often make the mistake of using ‘likelihood’ and ‘probability’ interchangeably, but there is a distinction.

Probability gives an outcome based on parameters that are known. Likelihood estimates those parameters based on the outcomes. For example:

A coin is tossed with a probability pH of landing heads up (‘H’), the probability of getting two heads in two trials (‘HH’) is pH2. If pH = 0.5, then the probability of seeing two heads is 0.25.

In symbols, we can say the above as

p(HH|p_H = 0.5) = 0.25 .

Another way of saying this is to reverse it and say that “the likelihood of pH = 0.5, given the observation ‘HH’, is 0.25″, i.e.,

p(p_H = 0.5|HH) = p(HH|p_H = 0.5) = 0.25 .

But this is not the same as saying that the probability of pH = 0.5, given the observation, is 0.25.

The point to take home with this example is that even though probability and likelihood are not the same thing, there is a lot of notation abuse where people will use them interchangeably, especially in the ML community!

So now that we know a little about Bayes’ rule and its constituents, we can get into how we will be using it to perform some elementary calculations. An important form of the Bayes’ theorem that comes up in ML is derived:

First of all, we look at the Law of Total Probability or Rule of Elimination.

Given n mutually exclusive events A_1A_n whose probabilities sum to 1, then if B is another event and P(B|A_i) is the conditional probability of B given an event A_i ,

P(B) = P(B|A_1)P(A_1) + ... + P(B|A_n)P(A_n)

P(B) = \sum_{i=1}^{n} P(B|A_i)P(A_i)

P(B) = E_i[P(B|A_i)]

So, according to the Law of Total Probability:

P(B) = P(A \cap B) + P(\neg A \cap B) = P(B|A)P(A) + P(B|\neg A)P(\neg A)

This results in the analogous form:

P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)}

Because the Law of Total Probability essentially states that:

P(B) = \sum_{i} P(B \cap A_i) = \sum_{i} P(B|A_i)(A_i)

For any $latex A_i $ in the partition of the event A, Bayes’ theorem states that

P(A_i|B) = \frac{P(B|A_i)P(A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{\sum_{j} P(B|A_j)P(A_j)}

Where the denominator is the normalizing constant, which is necessary to give the whole expression a probability.

Example 1 – The Monty Hall Problem:

To make the theory more concrete, we will take a look at the (in)famous Monty Hall problem from a Bayesian perspective.
The Monty Hall problem in based on a famous game show in the 70s I think where the host Monty Hall had the contestants choose between three doors, one of which contained a prize. The contestant would choose a door and then Monty Hall would open another door and ask the contestant to make a decision between the door they originally chose and the door that nobody chose. The question then becomes: Which decision will improve the contestants chances?

Hint: It’s not 50 : 50!


Our events are A_1 , A_2 and A_3 corresponding to opening doors 1, 2 or 3. Therefore we can think of that as our prior information. So, without any further knowledge, we know that P(A_1) = P(A_2) = P(A_3) = \frac{1}{3} .

So, lets say we choose door 1 and Monty has chosen door 2. Let this event be called B. Monty can then choose a door with 3 possibilities:

  1. If the prize is behind door 1, then Monty can open either 2 or 3 with P(B|A_2) = P(B|A_3) = 0.5 probability.
  2. If the prize is behind door 2, Monty must open door 3. Thus P(B|A_2) = 0
  3. If the prize is behind door 3, Monty must open door 2. Thus P(B|A_3) = 1

To summarise, in possibility 1, he has a choice of door, and assuming he chooses uniformly, it’s 0.5. In possibility 2, there is no chance that he will open door 2 because the prize is in it. In possibility 3, he is certain to open door 2 if the prize is behind door 3. Remember, event B states that he opens door 2!

Therefore, we can now calculate the posterior:

P(A_1|B) = \frac{P(B|A_1)P(A_1)}{P(B)} = \frac{1}{3}

P(A_2|B) = \frac{P(B|A_2)P(A_2)}{P(B)} = 0

P(A_3|B) = \frac{P(B|A_3)P(A_3)}{P(B)} = \frac{2}{3}

So, the best outcome that the contestant can hope for would be to switch from door 1 to door 3 if Monty chooses door 2.

Example taken from: http://en.wikipedia.org/wiki/Bayes%27_theorem

Example 2:

Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year. Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn’t rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on the day of Marie’s wedding?


The sample space is defined by two mutually-exclusive events – it rains or it does not rain. Additionally, a third event occurs when the weatherman predicts rain. Notation for these events appears below.

  • Event A1. It rains on Marie’s wedding.
  • Event A2. It does not rain on Marie’s wedding
  • Event B. The weatherman predicts rain.

In terms of probabilities, we know the following:

  • P(A1) = 5/365 =0.0136985 [It rains 5 days out of the year.]
  • P(A2) = 360/365 = 0.9863014 [It does not rain 360 days out of the year.]
  • P(B|A1) = 0.9 [When it rains, the weatherman predicts rain 90% of the time.]
  • P(B|A2) = 0.1 [When it does not rain, the weatherman predicts rain 10% of the time.]

We want to know P(A1|B), the probability it will rain on the day of Marie’s wedding, given a forecast for rain by the weatherman. The answer can be determined from Bayes’ theorem, as shown below.

P(A_1|B) = \frac{P(A_1) P(B|A_1)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2)}

P(A_1|B) = \frac{(0.014)(0.9)}{(0.014)(0.9) + (0.986)(0.1)}

P(A_1|B) = 0.111

Note the somewhat unintuitive result. Even when the weatherman predicts rain, it only rains only about 11% of the time. Despite the weatherman’s gloomy prediction, there is a good chance that Marie will not get rained on at her wedding.

Example taken from: http://stattrek.com/Lesson1/Bayes.aspx.


At the end of each lecture I’ll usually give one or two exercises. I encourage discussion but no spoilers please! I’ll give about a week for completion at which point I’ll put up a solution. I strongly encourage everyone to try them out. If you want to use R or any other kind of statistical software to solve exercises, please do so. I will not be marking anything simply because there’s so gosh darn many of you.

Exercise 1:

Probability calculations for genetics:

Suppose that in each individual of a large population there is a pair of genes that determine eye colour. Each of these genes can be x or X and they follow these rules:

  • xx – blue eyes
  • Xx, xX (heterozygote) – brown eyes
  • XX – brown eyes

The proportion of blue-eyed individuals is p^2 and of heterozygotes it’s 2p(1-p), where 0<p<1.

Each parent transmits one if its own genes to the child. If a parent is a heterozygote, the probability that it transmits the genes of type X is 0.5.

  1. Assuming random mating, show that among brown-eyed children of brown-eyed parents, the expected proportion of heterozygotes is \frac{2p}{1+2p}.
  2. Supposed Judy, a brown-eyed child of brown-eyed parents marries a heterozygote , and they have n children, all of which are brown-eyed. Find the posterior probability that Judy is a heterozygote and the probability that her first grandchild has blue eyes.

Taken from Bayesian Data Analysis [Gelman] – pg. 56.


To use the Bayesian theorem to solve this problem, let’s encode it into a conditional probability:

P(Judy is Xx | child has brown eyes & all previous information)

= \frac{0 * (1-p)^4 + \frac{1}{2} * 4p(1-p)^3 + \frac{1}{2} * 4p^2(1-p)^2}{1 * (1-p)^4 + 1 * 4p(1-p)^3 + \frac{3}{4} * 4p^2(1-p)^2}

= \frac{2p(1-p) + 2p^2}{(1-p)^2 + 4p(1-p) + 3p^2}

= \frac{2p}{1+2p}

To figure out the probability that Judy is a heterozygote, use the above posterior probability as a prior for any new calculation. So, when we get the new information that her grandchildren are brown-eyed and there is n of them (with the father being a heterozygote):

P(Judy is Xx | n children all have brown eyes & all previous information)

\frac{\frac{2p}{1+2p}*(\frac{3}{4})^n}{\frac{2p}{1+2p}(\frac{3}{4})^n + \frac{1}{1+2p}}

Given that Judy’s children are all brown-eyed, her grandchild has blue eyes only if Judy’s child is Xx. We can compute this probability, knowing that the child is brown-eyed and knowing that Judy’s husband is a heterozygote.

P(Judy’s child is Xx | all other information) = P((Judy is Xx & Judy’s child is Xx) or (Judy is XX & Judy’s child is Xx) | all other information)

= (\frac{2}{3})\frac{\frac{2p}{1+2p}*((\frac{3}{4})^n}{\frac{2p}{1+2p}\frac{3}{4})^n + \frac{1}{1+2p}} + (\frac{1}{2})\frac{\frac{2p}{1+2p}*(\frac{3}{4})^n}{\frac{2p}{1+2p}(\frac{3}{4})^n + \frac{1}{1+2p}}

Given that Judy’s child is Xx, the probability of the grandchild having blue eyes is 0, 1/4, or 1/2 id Judy’s husband is XX, Xx or xx, respectively. Given random mating, these events have probability (1-p)^2, 2p(1-p) and p^2 respectively.


P(grandchild is xx | all other information)

= \frac{\frac{2}{3}\frac{2p}{1+2p}(\frac{3}{4})^n+\frac{1}{2}\frac{1}{1+2p}}{\frac{2p}{1+2p}(\frac{3}{4})^n + \frac{1}{1+2p}}(\frac{1}{4}2p(1-p) + \frac{1}{2}p^2)

= \frac{\frac{2}{3}\frac{2p}{1+2p}(\frac{3}{4})^n+\frac{1}{2}\frac{1}{1+2p}}{\frac{2p}{1+2p}(\frac{3}{4})^n + \frac{1}{1+2p}}(\frac{1}{2}p)

Posted in Uncategorized | 2 Comments

Hello world!

Welcome to WordPress.com. This is your first post. Edit or delete it and start blogging!

Posted in Uncategorized | 1 Comment