Hello,

I'm trying to implement your code in Keras and achieve the same results as you. I've mimicked the LSTM initialization and checked the math and constants to make them fit yours, but I'm still about 20% away from your results (using your input data). My guess is that the problem lies in the optimizer (it seems different from Keras's Adadelta), and I don't understand what's happening in https://github.com/aditya1503/Siamese-LSTM/blob/master/lstm.py#L289:
```python
gradi = tensor.grad(cost, wrt=tnewp.values())  # /bts
grads = []
l = len(gradi)
for i in range(0, l / 2):
    gravg = (gradi[i] + gradi[i + l / 2]) / (4.0)
    # print i, i+9
    grads.append(gravg)
for i in range(0, len(tnewp.keys()) / 2):
    grads.append(grads[i])
self.f_grad_shared, self.f_update = adadelta(lr, tnewp, grads, emb11, mask11, emb21, mask21, y, cost)
```
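As far as I can tell, the loop pairs each gradient with the gradient of the corresponding parameter in the other LSTM copy, combines them (with an extra factor of 2 in the denominator), and then duplicates the result so both copies receive the same combined gradient. Here is a toy NumPy illustration of my reading of it (not your actual code, and the numbers are made up):

```python
import numpy as np

# Toy stand-ins for the gradients of the two parameter copies (LSTM "A" and LSTM "B").
# In the real code these are symbolic Theano gradients, one entry per parameter.
gradi = [np.array([1.0]), np.array([2.0]), np.array([3.0]),      # copy A
         np.array([10.0]), np.array([20.0]), np.array([30.0])]   # copy B

l = len(gradi)
# Pair each parameter of copy A with its counterpart in copy B and combine them.
grads = [(gradi[i] + gradi[i + l // 2]) / 4.0 for i in range(l // 2)]
# Duplicate the combined gradients so both copies get the same update.
grads = grads + grads

print([g.item() for g in grads])  # [2.75, 5.5, 8.25, 2.75, 5.5, 8.25]
```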
I don't know where to implement this in Keras logic (I presume this code only runs once, when the network is defined), but I've tried:
```python
def get_updates(self, loss, params):
    gradi = self.get_gradients(loss, params)
    grads = []
    l = len(gradi)  # for 2 LSTMs, l = 6, 3 'weights' per each
    half_l = int(l / 2)
    print(half_l)
    for i in range(0, half_l):
        gravg = (gradi[i] + gradi[i + half_l]) / (4.0)
        grads.append(gravg)
    alt_half_l = int(len(params) / 2)
    print(alt_half_l)
    for i in range(0, alt_half_l):
        grads.append(grads[i])
    shapes = [K.int_shape(p) for p in params]
    ...
```
in my own optimizer (based on Keras's original Adadelta, again mimicking your constants).
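For completeness, I plug the custom optimizer in roughly like this (the class name is a placeholder for my Adadelta subclass, and the hyperparameter values and loss are just examples):

```python
# "SiameseAdadelta" is a placeholder name for my Optimizer subclass containing
# the get_updates() above; the values shown here are examples only.
model.compile(optimizer=SiameseAdadelta(rho=0.95, epsilon=1e-6),
              loss='mean_squared_error')
```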
However, the loss/cost per batch went from 0.08 (with a single LSTM applied to both inputs, as Keras suggests for the Siamese setup) to 0.4, so there must be a logic error somewhere.
My guess is that the gradient manipulation is being applied repeatedly in my code, e.g. once per batch (as Keras's logic would have it), while in your code it is applied once, when Adadelta is initialized/defined.
Can someone help me understand what's happening in the code above? What is it for, is it run per batch, and why not share a single LSTM between the two inputs, as Keras suggests in:
https://keras.io/getting-started/functional-api-guide/#shared-layers
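For context, this is roughly the shared-LSTM setup I had in mind from that guide (the dimensions and the exp(-L1) similarity are just my assumptions about your model, not taken from your code):

```python
from keras.layers import Input, LSTM, Lambda
from keras.models import Model
import keras.backend as K

# Placeholder dimensions: variable-length sequences of 300-d word vectors, 50 hidden units.
left_input = Input(shape=(None, 300))
right_input = Input(shape=(None, 300))

shared_lstm = LSTM(50)              # one set of weights, applied to both inputs
left_vec = shared_lstm(left_input)
right_vec = shared_lstm(right_input)

# exp(-L1) similarity between the two sentence encodings (my assumption about the model).
similarity = Lambda(
    lambda v: K.exp(-K.sum(K.abs(v[0] - v[1]), axis=1, keepdims=True))
)([left_vec, right_vec])

model = Model(inputs=[left_input, right_input], outputs=similarity)
model.compile(optimizer='adadelta', loss='mean_squared_error')
```

With this setup the two branches literally share one set of weights, so I would expect no manual gradient averaging to be needed, which is why I don't understand what the averaging in lstm.py buys.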
Best,
Pedro