1) In section 9, we have a toy example where we’re updating the parameters for each datapoint, so the sum is not necessary. The code in section 11 is written so that we could be working with batches, therefore summing over all the datapoints in the batch becomes necessary.

2) We don’t actually compute argmin, but use gradient descent. We define a differentiable cost function, then calculate gradients for each parameter and update all the values so that they slightly move towards predicting the correct answer each time. If you google, I’m sure you’ll find more in-depth explanations of gradient descent.
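
To make both points concrete, here is a minimal sketch in plain Python rather than Theano. It is not the tutorial's code: the function names, the toy data, and the learning rate are all made up for illustration, and the gradient is written out by hand instead of being derived automatically. It shows the squared error summed over a batch and a few hundred gradient descent steps on a linear model.

```python
# Illustrative sketch only: plain Python, hypothetical names, made-up data.

def predict(weights, features):
    """Linear model: dot product of the weights and one datapoint's features."""
    return sum(w * x for w, x in zip(weights, features))

def batch_cost(weights, batch_features, batch_targets):
    """Squared error summed over every datapoint in the batch."""
    return sum((predict(weights, x) - t) ** 2
               for x, t in zip(batch_features, batch_targets))

def gradient_step(weights, batch_features, batch_targets, learning_rate):
    """One gradient descent update over the whole batch."""
    gradients = [0.0] * len(weights)
    for x, t in zip(batch_features, batch_targets):
        error = predict(weights, x) - t
        for i, x_i in enumerate(x):
            # Derivative of (prediction - target)^2 with respect to weight i.
            gradients[i] += 2.0 * error * x_i
    # Move each weight a small step against its gradient.
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Toy data (assumption): the target equals the first feature.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
targets = [1.0, 0.0, 1.0, 0.0]

weights = [0.0, 0.0]
initial_cost = batch_cost(weights, features, targets)
for _ in range(200):
    weights = gradient_step(weights, features, targets, learning_rate=0.1)
final_cost = batch_cost(weights, features, targets)
```

With a batch of one datapoint the sum in `batch_cost` has a single term, which is why the per-datapoint toy example can leave it out. No argmin is solved directly; the cost simply shrinks as the weights are nudged step by step.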

This tutorial is great! I started studying neural networks a couple of weeks ago and I have some questions for you about your code.

1) In sections 9 and 11 you have this line: theano.tensor.sqr(predicted_value - target_value).sum(). The difference between section 9 and section 11 is that section 11 has the term “sum()”. I can’t understand why you have it there. I know that we have to compute the sum of squared residuals, but if so, why don’t you have this term in section 9 as well?

2) My next question is about training. As far as I know, we have to compute the argmin of the cost function with respect to the weights. Thus, we have to take all the training data and the target values and minimize the sum of squared residuals. However, I can’t understand how training is done in your code.

I think my questions can be explained by the fact that I am misunderstanding the syntax of the code, but I would be really pleased if you could answer me.

One question on the selection of venues: if the conferences of both the American (NAACL) and European (EACL) chapters of ACL are included, why not Asia Pacific (IJCNLP)? Leaving out that conference seems to bias the analysis towards certain regions, especially the author statistics.

I do agree that there are a lot of unnecessary papers in the field, although I’m not sure if their ratio has actually increased or it’s just the result of more people working in that area.