DC/OS tests with torpedo #63
Conversation
Can you separate out the vendor changes into a separate commit?
@disrani-px do you mean just a separate commit or a separate review as well?
Just a separate commit in the same PR.
Updated branch from 685bf0a to a933249.
Done
deployments/torpedo-dcos-ssh.json
Outdated
"id": "torpedo", | ||
"description": "Run torpedo on DC/OS", | ||
"run": { | ||
"cmd": "docker run -e 'TORPEDO_SSH_PASSWORD=root' --entrypoint ginkgo piyushpx/torpedo:latest --slowSpecThreshold 180 -v bin/basic.test -- --spec-dir ../drivers/scheduler/dcos/specs --scheduler dcos", |
Can we pass in environment variables? Many of the params to ginkgo need to be controlled by the caller (e.g. which tests to skip, scale factor, etc.). See the AWS job for an example: https://github.com/portworx/torpedo/blob/master/deployments/deploy-aws.sh
This is run as a job in DC/OS. The user will have to create it manually using the file and change the appropriate env variables.
For deployment from Jenkins, we will later have a shell script like the one for deploy-aws, which can take multiple params.
return specs, nil
}

func (d *dcos) IsNodeReady(n node.Node) error {
Add TODOs stating that these functions are to be implemented, for all such places in the PR.
done in incremental
drivers/scheduler/dcos/dcos.go
Outdated
nodes, _ := getNodes()

for _, n := range nodes {
	node.AddNode(n)
Add the call to IsNodeReady (even though unimplemented) here.
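Something like this, as a rough sketch against the types in this diff (it assumes the surrounding function returns an error, and IsNodeReady is still a stub at this point):

```go
nodes, err := getNodes()
if err != nil {
	return err
}
for _, n := range nodes {
	// IsNodeReady is unimplemented for now; wiring it in here means the
	// readiness check takes effect as soon as it is filled in.
	if err := d.IsNodeReady(n); err != nil {
		return err
	}
	node.AddNode(n)
}
```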
done in incremental
var nodes []node.Node
nodes = append(nodes, node.Node{
	Name:      "a1.dcos",
	Addresses: []string{"192.168.65.111"},
I'm guessing these are IPs of your private setup? Add a comment if so.
Is there a way to query DC/OS for the nodes in the cluster?
If DCOS does not have such an API, can we pass it through ginkgo params or env variables?
There is a way to get nodes from Mesos, but no way to get the Marathon or DC/OS master. So here the user will have to give us the master IP using an env variable, which we can use to get nodes or communicate with Marathon.
Will be doing the whole thing in the next PR. This PR is mainly focused on getting torpedo to run the setup/teardown test.
Actually, that was wrong info. Mesos provides a nice set of DNS names: we can use "leader.mesos" to get to the Mesos master, and similarly for Marathon.
Marathon is running on the master node, so it'll be the same as leader.mesos.
You can use the Mesos APIs to get the list of nodes, for example using curl: curl http://master.mesos:5050/slaves. The mesos-go library should have APIs for this.
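For reference, a rough self-contained sketch of listing agents via the master's HTTP API instead of hardcoding IPs; the /slaves response fields shown here are from memory, so treat them as an assumption:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// mesosSlave holds the only field we need from each agent entry.
type mesosSlave struct {
	Hostname string `json:"hostname"`
}

// slavesResponse models the top-level shape of GET /slaves.
type slavesResponse struct {
	Slaves []mesosSlave `json:"slaves"`
}

// getAgentHostnames queries the Mesos master (via the master.mesos DNS name)
// and returns the hostnames of all registered agents.
func getAgentHostnames() ([]string, error) {
	resp, err := http.Get("http://master.mesos:5050/slaves")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out slavesResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}

	hostnames := make([]string, 0, len(out.Slaves))
	for _, s := range out.Slaves {
		hostnames = append(hostnames, s.Hostname)
	}
	return hostnames, nil
}

func main() {
	hosts, err := getAgentHostnames()
	if err != nil {
		panic(err)
	}
	fmt.Println(hosts)
}
```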
Yes, the mesos-go library seemed a bit complicated, hence decided to do that in a later commit.
drivers/volume/portworx/portworx.go
Outdated
func (d *portworx) updateNode(n node.Node, pxNodes []api.Node) {
	for _, address := range n.Addresses {
		for _, pxNode := range pxNodes {
			if address == pxNode.DataIp {
Add check for MgmtIP too.
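Rough sketch of the condition with both addresses; it assumes the openstorage api.Node struct exposes MgmtIp alongside DataIp:

```go
if address == pxNode.DataIp || address == pxNode.MgmtIp {
	// ... existing node update logic ...
}
```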
done
},
{
	"key": "volume",
	"value": "size=20,repl=2,name=px_mysql_vol:/var/lib/mysql"
Let's use repl 3. Will give write quorum if we kill one of the nodes.
done
return nil, err
}

_, err := m.client.Application(name)
Can be: if _, err := m.client.Application(name); err != nil
done
// Initialize Marathon client if not initialized
func (m *marathonOps) initMarathonClient() error {
	if m.client == nil {
		marathonURL := "http://192.168.65.90:8080"
How are we eventually going to find the Marathon URL? (For example, k8s gets it from the mounted px-account or the KUBE_CONFIG env variable.)
Use leader.mesos
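Something along these lines (sketch only; the MARATHON_URL override is hypothetical and not part of this PR, and it needs the os import):

```go
// Initialize Marathon client if not initialized
func (m *marathonOps) initMarathonClient() error {
	if m.client == nil {
		// leader.mesos resolves to the current Mesos master, which is where
		// Marathon runs, so no hardcoded IP is needed.
		marathonURL := os.Getenv("MARATHON_URL") // hypothetical override
		if marathonURL == "" {
			marathonURL = "http://leader.mesos:8080"
		}
		// ... existing client construction using marathonURL ...
	}
	return nil
}
```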
return err
}

return m.client.WaitOnApplication(name, 5*time.Minute)
Is this a blocking call?
The callers are not expecting ValidateApplication to be a blocking call. Also the time period can be a constant.
Renamed the methods
return nil, nil
}

if _, err := task.DoRetryWithTimeout(t, 5*time.Minute, 10*time.Second); err != nil {
Maybe we could regulate the timeout and wait-period constants by defining them in a constants file. All these constants should be common to both k8s and DC/OS.
I agree that we should have constants for all these things. There are many such places, let's change it everywhere all at once.
Created an issue to track: #67
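For what it's worth, the shared constants file could be as small as this; the package name and values are placeholders, not something in this PR:

```go
// Package scheduler holds defaults shared by the k8s and DC/OS drivers.
package scheduler

import "time"

const (
	// DefaultRetryInterval is how often validation loops poll for a state change.
	DefaultRetryInterval = 10 * time.Second
	// DefaultTimeout is the overall deadline for those validation loops.
	DefaultTimeout = 5 * time.Minute
)
```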
drivers/volume/portworx/portworx.go
Outdated
t := func() (interface{}, error) {
	clusterManager := d.getClusterManager()
	if status, _ := clusterManager.NodeStatus(); status != api.Status_STATUS_OK {
	if status, _ := d.getClusterManager().NodeStatus(); status != api.Status_STATUS_OK {
I don't understand the use of this function if we have WaitForNode; this function is only going to check the status of the node which is currently the endpoint.
If the intention of this function is to just check cluster status, then use clusterManager.Enumerate(), which returns an api.Cluster object that has the cluster status.
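i.e. roughly this (sketch, assuming the usual openstorage cluster client signatures):

```go
t := func() (interface{}, error) {
	// Enumerate returns the whole cluster, so its Status reflects cluster
	// health rather than just the node behind the current endpoint.
	cluster, err := d.getClusterManager().Enumerate()
	if err != nil {
		return nil, err
	}
	if cluster.Status != api.Status_STATUS_OK {
		return nil, fmt.Errorf("cluster status is %v, expected STATUS_OK", cluster.Status)
	}
	return nil, nil
}
```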
done in incremental
deployments/torpedo-dcos-ssh.json
Outdated
"cpus": 0.5, | ||
"mem": 256, | ||
"docker": { | ||
"image": "piyushpx/torpedo:latest" |
Before final checkin, change to portworx/torpedo:latest
done
drivers/scheduler/dcos/dcos.go
Outdated
if value, ok := opts[scheduler.OptionsWaitForResourceLeakCleanup]; ok && value {
	if err := d.WaitForDestroy(ctx); err != nil {
		return err
	}
Add a TODO to add a call to waitForCleanup here once it's implemented.
done
},
{
	"key": "volume",
	"value": "size=10,repl=3,name=px_mysql_vol:/var/lib/mysql"
Reduce size to 2. 10G will cause issues on dev systems, which might have less storage.
done
tests/common.go
Outdated
Step(fmt.Sprintf("wait for %s app to start running", ctx.App.Key), func() {
	err := Inst().S.WaitForRunning(ctx)
	expect(err).NotTo(haveOccurred())
})

Step(fmt.Sprintf("validate %s app's volumes", ctx.App.Key), func() {
Any reason to move the volume check after the app check? The volume check was before applications since volumes are a prerequisite for an app container to start. If that itself fails, there is no need to WaitForRunning.
Done as discussed.
drivers/volume/portworx/portworx.go
Outdated
cluster, err := task.DoRetryWithTimeout(t, 2*time.Minute, 10*time.Second)
if err != nil {
	return api.Cluster{}, err
In an error case, it's better to return a nil object than a default one.
It was an object, hence could not return nil, but I changed it to a reference.
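i.e. roughly this (sketch; it assumes the function signature now returns *api.Cluster and that t yields an api.Cluster value):

```go
cluster, err := task.DoRetryWithTimeout(t, 2*time.Minute, 10*time.Second)
if err != nil {
	// With a pointer return type the error path can hand back nil
	// instead of a zero-value api.Cluster.
	return nil, err
}
c := cluster.(api.Cluster)
return &c, nil
```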
Thanks for the review guys!
** Hardcoded nodes and marathon leader, will change in next review **
- setup/tear down test works with DC/OS
- creates/supports application using px volumes
- creates application and waits for it to be ready
- deletes and validates deletion of application
- sample mysql app deployed using marathon - no volume inspection
- torpedo runs as a metronome job
Updated branch from 236c18b to e07ad9a.
// ValidateApplication checks if the application is running and healthy
ValidateApplication(string) error
// DeleteApplication deletes the given application
DeleteApplication(string) error
Marathon has 3 different operations: stop, remove, and kill. All 3 behave differently and need to be tested, so there should be interfaces for all of them.
Follow-up on the discussion: there are counter-intuitive APIs for the stop and kill operations. Stop is basically updating the app with an instance count of 0, while the kill operation calls a DELETE on the tasks of that app. We can implement these as we need them going forward.
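Sketch of how stop and kill could be layered on later; the wrapper names are placeholders and the go-marathon calls are from memory, so double-check the signatures:

```go
// StopApplication implements "stop" by scaling the app down to 0 instances.
func (m *marathonOps) StopApplication(name string) error {
	_, err := m.client.ScaleApplicationInstances(name, 0, false)
	return err
}

// KillApplication deletes the app's running tasks without removing the app.
func (m *marathonOps) KillApplication(name string) error {
	_, err := m.client.KillApplicationTasks(name, nil)
	return err
}
```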
"parameters": [ | ||
{ | ||
"key": "volume-driver", | ||
"value": "pxd" |
These specs should be templated out. The driver field should be populated depending on which volume driver is being tested. It looks like it is the same for the k8s specs. So they can be fixed together.
Created an issue to track this: #70
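As an illustration of the templating idea (nothing here is in the PR; it's just Go's text/template over the spec fragment above):

```go
package main

import (
	"os"
	"text/template"
)

// paramsTmpl is the spec fragment with the volume driver templated out.
const paramsTmpl = `"parameters": [
  { "key": "volume-driver", "value": "{{ .VolumeDriver }}" }
]
`

func main() {
	t := template.Must(template.New("params").Parse(paramsTmpl))
	// The driver name would come from the test configuration, e.g. "pxd".
	if err := t.Execute(os.Stdout, struct{ VolumeDriver string }{VolumeDriver: "pxd"}); err != nil {
		panic(err)
	}
}
```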
** Hardcoded nodes and marathon leader, will change in next review **