Understanding gradient manipulation #23

Open · prpfialho opened this issue Aug 23, 2018 · 0 comments

prpfialho commented Aug 23, 2018

Hello,

I'm trying to reimplement your code in Keras and reach the same results as you. I've mimicked your LSTM initialization and checked the math and constants to make mine match yours, but my results are still 20% off from yours (using your input data). My guess is that the problem lies in the optimizer (it seems different from Keras' Adadelta), and I don't understand what's happening in:

https://github.com/aditya1503/Siamese-LSTM/blob/master/lstm.py#L289

    gradi = tensor.grad(cost, wrt=tnewp.values())  # /bts
    grads = []
    l = len(gradi)
    # average the gradient of each parameter in the first LSTM copy with
    # the gradient of its twin in the second copy (note the /4.0)
    for i in range(0, l / 2):
        gravg = (gradi[i] + gradi[i + l / 2]) / 4.0
        # print i, i+9
        grads.append(gravg)
    # duplicate the averaged gradients so the second copy's parameters
    # receive exactly the same update as the first copy's
    for i in range(0, len(tnewp.keys()) / 2):
        grads.append(grads[i])

    self.f_grad_shared, self.f_update = adadelta(lr, tnewp, grads,
                                                 emb11, mask11, emb21, mask21, y, cost)
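
If I read it correctly, the first loop averages the gradient of each parameter in the first LSTM copy with the gradient of its counterpart in the second copy, and the second loop hands that same averaged gradient back to both copies, keeping the two LSTMs tied. A minimal NumPy sketch of just that reshuffling (toy gradients, not the real ones):

    import numpy as np

    # toy stand-ins for the 2 x 3 parameter gradients of the two LSTM copies
    gradi = [np.array([1.0]), np.array([2.0]), np.array([3.0]),   # first copy
             np.array([5.0]), np.array([6.0]), np.array([7.0])]   # second copy

    l = len(gradi)
    grads = []
    for i in range(l // 2):
        # average gradient of parameter i across the two copies (note the /4.0)
        grads.append((gradi[i] + gradi[i + l // 2]) / 4.0)
    for i in range(l // 2):
        # the second copy gets exactly the same averaged gradient
        grads.append(grads[i])

    print(grads)  # first half == second half, so both copies update identically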

I don't know where to implement this in Keras terms (I presume this code only runs once, when the network is defined), but I've tried:

def get_updates(self, loss, params):
    gradi = self.get_gradients(loss, params)

    grads = []
    l = len(gradi)   # for 2 LSTMs, l = 6 (3 weight tensors each)
    half_l = int(l / 2)
    print(half_l)
    # average each gradient with its twin from the second LSTM
    for i in range(0, half_l):
        gravg = (gradi[i] + gradi[i + half_l]) / 4.0
        grads.append(gravg)

    # duplicate the averaged gradients for the second half of params
    alt_half_l = int(len(params) / 2)
    print(alt_half_l)
    for i in range(0, alt_half_l):
        grads.append(grads[i])

    shapes = [K.int_shape(p) for p in params]
    ...

inside my own optimizer (based on Keras' original Adadelta, again mimicking your constants).
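
Concretely, I subclass Keras' Adadelta and plug it in roughly like this (SiameseAdadelta is just my own name for the subclass, and model is whatever Siamese model I've built):

    from keras import optimizers

    class SiameseAdadelta(optimizers.Adadelta):
        def get_updates(self, loss, params):
            # body as shown above: average gradi[i] with gradi[i + half_l],
            # duplicate, then apply the standard Adadelta update rule
            ...

    # get_updates() is called once when the training function is built,
    # but the update ops it returns then run on every batch
    model.compile(loss='mean_squared_error', optimizer=SiameseAdadelta())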

However, the per-batch loss/cost went from 0.08 (with a single LSTM applied to both inputs, as Keras suggests for Siamese setups) to 0.4, so there must be a logic error somewhere.

My guess is that the gradient manipulation is applied repeatedly in my code, e.g. once per batch (per Keras' logic), while in your code it's applied once, when Adadelta is initialized/defined.

Can someone help me understand what's happening in the above code? What is it for, is it run per batch, and why not share a single LSTM across both inputs, as Keras suggests in:
https://keras.io/getting-started/functional-api-guide/#shared-layers
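
By "a single LSTM shared" I mean the pattern from that guide, roughly like this (the sizes and the similarity layer are just placeholders from my setup):

    from keras import backend as K
    from keras.layers import Input, LSTM, Lambda
    from keras.models import Model

    input_a = Input(shape=(20, 300))   # toy shapes
    input_b = Input(shape=(20, 300))

    shared_lstm = LSTM(50)            # one layer, hence one set of weights
    encoded_a = shared_lstm(input_a)  # the same weights see both inputs, so
    encoded_b = shared_lstm(input_b)  # Keras sums their gradients automatically

    # e.g. exp(-L1 distance), the usual Manhattan-LSTM similarity
    similarity = Lambda(
        lambda t: K.exp(-K.sum(K.abs(t[0] - t[1]), axis=1, keepdims=True))
    )([encoded_a, encoded_b])

    model = Model([input_a, input_b], similarity)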

Best,
Pedro
