In my experience, many students start coding near the end of a research project. For example, they might start only after they already have a candidate algorithm and a performance guarantee, at which point the code's only purpose is to benchmark their algorithm against some other methods.
Compare this to the scientific method, where experiments come first and drive the hypotheses.
For those of us in Data Science, which steps should involve coding?
The purpose of this session is to see how to use software, plots, simulation, and Mathematica in conjunction with heuristic thinking to first develop conjectures, and then to guide a proof. We do this with a couple of toy examples that have come up in my research.
Suppose $N \sim \text{Pois}(\lambda)$. We want a bound on $\mathbb E\left[\frac{1}{N+1}\right]$, and a high-probability upper bound on the random variable $\frac{1}{N+1}$ itself.
Developing Intuition

The first expectation, $\mathbb E\left[\frac{1}{N+1}\right]$, is pretty straightforward to evaluate directly and is at most $1/\lambda$. You should do this carefully before moving forward.
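For reference (check your work against this after trying it yourself): the direct calculation is an index shift that reassembles the Poisson pmf,
\begin{align}
\mathbb E\left[\frac{1}{N+1}\right]
= \sum_{n=0}^\infty \frac{1}{n+1} \cdot \frac{e^{-\lambda} \lambda^n}{n!}
= \frac{1}{\lambda} \sum_{n=0}^\infty \frac{e^{-\lambda} \lambda^{n+1}}{(n+1)!}
= \frac{1 - e^{-\lambda}}{\lambda}
\leq \frac{1}{\lambda}.
\end{align}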
For the second part, we could use standard machinery (like the Chernoff bound), but the resulting bound is a bit ugly and difficult to manipulate. I needed this result as part of a much larger proof, so a clean, simple bound (even if looser) was important. (In general, you don't always need the tightest bound on a quantity, just a tight enough bound to get the job done.) So let's think a bit more broadly.
First, the random variable $\frac{1}{N+1}$ is always bounded by $1$. However, this bound is loose when $\lambda \rightarrow \infty$, so our effort should go into improving the bound for large $\lambda$. Intuition suggests that $N = \lambda + O_p(\sqrt{\lambda})$ as $\lambda \rightarrow \infty$, and thus, \begin{align} \frac{1}{N+1} \approx \frac{1}{1 + \lambda + O_p(\sqrt{\lambda})} \approx \frac{1}{\lambda + 1} \cdot \frac{1}{1 + O_p\left( \frac{\sqrt{\lambda}}{\lambda + 1} \right) } \approx \frac{1}{\lambda + 1} \left( 1 + O_p\left( \frac{\sqrt{\lambda}}{\lambda + 1} \right) \right) \end{align}
So we might conjecture that $\frac{\lambda}{N+1} \rightarrow_p 1$ as $\lambda \rightarrow \infty$. How can we turn this conjecture about the asymptotics into a bound? Let's try calculating the $1-\epsilon$ quantile:
using Distributions, PyPlot

# (1-ϵ)-quantile of λ/(N+1): since λ/(N+1) is decreasing in N,
# it equals λ/(q+1), where q is the ϵ-quantile of N.
function eps_quantile(lambda, epsilon)
    dist = Distributions.Poisson(lambda)
    q = quantile(dist, epsilon)
    return lambda / (q + 1)
end

lam = 3
eps_grid = 2 .^ collect(-2:-.05:-8)
ts = map(eps -> eps_quantile(lam, eps), eps_grid)
plot(1 ./ eps_grid, ts, ".-")                    # empirical quantiles
plot(1 ./ eps_grid, log.(1 ./ eps_grid), "red")  # log(1/ϵ) for comparison
Play with the above plot for different values of $\lambda$. You should be able to convince yourself that the quantile is (much) less than $O(\log(1/\epsilon))$. This then seems like a good conjecture to try to prove. Given tails that look like $\log(1/\epsilon)$, one strategy is to use Markov's inequality with an exponential, i.e., \begin{align} \mathbb P\left( \frac{\theta \lambda}{N+1} > \log(1/\epsilon) \right) & \leq \epsilon\, \mathbb E\left[ \exp\left(\frac{\theta \lambda}{N+1}\right)\right] \end{align}
Now let's try to compute this expectation to figure out a good value of $\theta$. First, let's look at some simulations.
using Distributions, PyPlot

# Approximate E[exp(θλ/(N+1))] by summing the series over a window of N.
function e_expon(lambda, theta)
    dist = Distributions.Poisson(lambda)
    # Let's only compute terms within 3 stdevs of the mean... should cover most of the mass
    lb = max(0, floor(Int, lambda - 3 * sqrt(lambda)))
    ub = floor(Int, lambda + 3 * sqrt(lambda))
    out = 0.
    for i in lb:ub
        out += exp(theta * lambda / (i + 1)) * pdf(dist, i)
    end
    return out
end

lam_grid = range(1, 20, length=50)
out = map(lam -> e_expon(lam, 0.5), lam_grid)
plot(lam_grid, out, ".-")
Note that I did some "quick and dirty" things to get the code above to run reasonably fast. Since we're just trying to generate a good conjecture, these sorts of shortcuts are usually a good idea; just make sure you don't confuse yourself later.
So it seems that when $\theta = 0.5$, this expectation is at most $2$. Now we have a concrete conjecture to prove: \begin{align} \mathbb E\left[ \exp\left(\frac{\lambda}{2(N+1)}\right)\right] \leq 2. \end{align}
I leave the details to you, but it can be attacked by simply writing out the expectation carefully and using a trick similar to the one you did for $\mathbb E[1/(N+1)]$.
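Before writing the proof, it's worth a more careful numerical check than the truncated plot above. The sketch below (the function name `e_expon_exact` is my own) sums the series far past the mean so that truncation error is negligible:

```julia
using Distributions

# E[exp(θλ/(N+1))] by (near-)exact summation of the series:
# truncating 20 stdevs past the mean leaves negligible mass behind.
function e_expon_exact(lambda, theta)
    dist = Poisson(lambda)
    ub = ceil(Int, lambda + 20 * sqrt(lambda) + 50)
    return sum(exp(theta * lambda / (i + 1)) * pdf(dist, i) for i in 0:ub)
end

# Scan a grid of λ; the conjecture says these all stay at most 2.
vals = [e_expon_exact(lam, 0.5) for lam in 0.1:0.1:100]
maximum(vals)
```

If the maximum over the grid stays comfortably below $2$, that's good evidence the conjecture isn't an artifact of the earlier truncation.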
Putting it all together: once we have $\mathbb E\left[\exp\left(\frac{\lambda}{2(N+1)}\right)\right] \leq 2$, applying Markov's inequality with threshold $\log(2/\epsilon)$ proves that with probability at least $1-\epsilon$, \begin{align} \frac{1}{N+1} \leq \frac{2 \log(2/\epsilon)}{\lambda} \end{align}
Again, this bound is definitely weaker than what you would get with the Chernoff technique. But it has a very simple form and roughly the right scaling in $\lambda$ for a fixed $\epsilon$. The tail dependence on $\epsilon$ could be improved, but for what I needed in my proof, this was good enough.
Your turn. Apply similar logic to develop a nice high-probability upper bound on $\frac{\log(N+1)}{N+1}$. To get you started, the following graph might be helpful:
using PyPlot

n_grid = range(0, 20, length=50)
plot(n_grid, log.(n_grid .+ 1) ./ (n_grid .+ 1), ".-")
It is probably worth noting that the graph has a finite maximum and is monotonically decreasing after that maximum. How can you leverage this fact to develop the requisite upper bound?
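As a nudge in that direction: a little calculus shows the maximum of $f(x) = \log(x+1)/(x+1)$ sits at $x = e - 1$ (where $f' = (1 - \log(x+1))/(x+1)^2$ vanishes), with value $1/e$. A quick grid search is an easy sanity check:

```julia
# f has derivative (1 - log(x+1))/(x+1)^2, so its max is at x = e - 1.
f(x) = log(x + 1) / (x + 1)
xs = range(0, 20, length=10_001)
x_star = xs[argmax(f.(xs))]
(x_star, f(x_star))  # should be close to (e - 1, 1/e) ≈ (1.718, 0.368)
```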
This one is a bit more interesting.
Let $\Phi(\lambda) \in \mathbb R^p$ be a vector-valued function for $\lambda \in [-1, 1]$, whose $i^\text{th}$ coordinate $\Phi_i(\lambda)$ is the $i^\text{th}$ Legendre polynomial on $[-1, 1]$ (normalized to be orthonormal w.r.t. $\text{Unif}[-1, 1]$).
My coauthors and I developed an algorithm whose complexity scales like $\frac{C}{c}$, where \begin{align} C &= \sup_{\lambda \in [-1, 1]} \sigma_{\max} \left( \Phi(\lambda) \Phi(\lambda)^T \right) \\ c &= \sigma_{\min}\left( \mathbb E_{\tilde \lambda \sim \text{Unif}[-1, 1]}\left[\Phi(\tilde\lambda) \Phi(\tilde\lambda)^T\right] \right) \end{align}
How does the fraction $C/c$ scale as $p \rightarrow \infty$?
This problem seems a lot harder than it actually is. Let's start with the constant $c$. The normalized Legendre polynomials are an orthonormal basis w.r.t. $\text{Unif}[-1, 1]$, so the matrix $\mathbb E_{\tilde \lambda \sim \text{Unif}[-1, 1]}\left[\Phi(\tilde\lambda) \Phi(\tilde\lambda)^T\right]$ is actually the identity! So $c = 1$.
The real work is in evaluating $C$. You should write some code to evaluate it for different values of $p$. Some things to ask yourself:
At which $\lambda \in [-1, 1]$ is the supremum attained?
What's the corresponding eigenvector for the maximal eigenvalue?
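Here is a sketch of how such an experiment might look (the function names `Phi` and `C_and_c` are my own). It builds the orthonormal polynomials $\phi_i = \sqrt{2i+1}\, P_i$ via the Bonnet recurrence $(i+1) P_{i+1}(\lambda) = (2i+1)\lambda P_i(\lambda) - i P_{i-1}(\lambda)$, then approximates the sup and the Gram matrix on a grid:

```julia
using LinearAlgebra

# Φ(λ): normalized Legendre polynomials φ_i = sqrt(2i+1)·P_i for i = 0, …, p-1,
# where the P_i come from the Bonnet recurrence.
function Phi(lambda, p)
    P = ones(p)
    if p > 1
        P[2] = lambda
        for i in 1:(p - 2)
            P[i + 2] = ((2i + 1) * lambda * P[i + 1] - i * P[i]) / (i + 1)
        end
    end
    return sqrt.(2 .* (0:p-1) .+ 1) .* P
end

# Since Φ(λ)Φ(λ)ᵀ is rank one, σ_max(Φ(λ)Φ(λ)ᵀ) = ‖Φ(λ)‖²,
# so C is a sup of a scalar function, approximated here on a grid.
# c is the smallest eigenvalue of the Gram matrix, approximated by
# averaging Φ(λ)Φ(λ)ᵀ over the same grid.
function C_and_c(p; n_grid = 2001)
    grid = range(-1, 1, length = n_grid)
    C = maximum(sum(abs2, Phi(lam, p)) for lam in grid)
    G = sum(Phi(lam, p) * Phi(lam, p)' for lam in grid) / n_grid
    return C, eigmin(Symmetric(G))
end

for p in (1, 2, 5, 10, 20)
    C, c = C_and_c(p)
    println("p = $p:  C ≈ $C,  c ≈ $c")
end
```

Comparing the printed values of $C$ as $p$ grows should make the scaling jump out, and `eigen(Symmetric(G))` will answer the eigenvector question.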
Once you investigate those questions numerically, it should be pretty clear how to craft a proof that $C/c = O(p^2)$ (in fact, I believe, $C/c = p^2$). Once you do that, you can look back at your proof and ask:
What was actually special about the Legendre Polynomials? Can I generalize this result to other basis functions?
I leave the details to you!