
DC/OS tests with torpedo #63

Merged: merged 2 commits into portworx:master from dcos on Nov 8, 2017

Conversation

@piyush-nimbalkar (Contributor)

** Hardcoded nodes and marathon leader, will change in next review **

  • setup/teardown test works with DC/OS
  • supports creating applications using px volumes
  • creates application and waits for it to be ready
  • deletes and validates deletion of application
  • sample mysql app deployed using marathon
  • no volume inspection
  • torpedo runs as a metronome job

@disrani-px (Contributor)

Can you separate out the vendor changes into a separate commit?

@piyush-nimbalkar (Contributor Author)

@disrani-px do you mean just a separate commit or a separate review as well?

@disrani-px (Contributor)

Just a separate commit within the same PR

@piyush-nimbalkar (Contributor Author)

Done

"id": "torpedo",
"description": "Run torpedo on DC/OS",
"run": {
"cmd": "docker run -e 'TORPEDO_SSH_PASSWORD=root' --entrypoint ginkgo piyushpx/torpedo:latest --slowSpecThreshold 180 -v bin/basic.test -- --spec-dir ../drivers/scheduler/dcos/specs --scheduler dcos",
Collaborator

Can we pass in environment variables?

Many of the params to ginkgo need to be controlled by the caller (e.g. which tests to skip, scale factor, etc.). See the aws job for an example: https://github.com/portworx/torpedo/blob/master/deployments/deploy-aws.sh

Contributor Author

This is run as a job in dcos. The user will have to create it manually using the file and change the appropriate env variables.
For deployment from jenkins, we will later have a shell script like the one we have for deploy-aws, which can take multiple params.
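For reference, Metronome job definitions support an env map under run, so the tunables could eventually be passed along these lines (the variable names other than TORPEDO_SSH_PASSWORD are hypothetical placeholders, not from this PR):

{
  "id": "torpedo",
  "run": {
    "env": {
      "TORPEDO_SSH_PASSWORD": "root",
      "GINKGO_SKIP": "",
      "SCALE_FACTOR": "1"
    }
  }
}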

return specs, nil
}

func (d *dcos) IsNodeReady(n node.Node) error {
Collaborator

Add TODOs stating that the functions are to be implemented, in all such places in the PR.

Contributor Author

done in incremental

nodes, _ := getNodes()

for _, n := range nodes {
	node.AddNode(n)
Collaborator

Add the call to IsNodeReady (even though unimplemented) here.

Contributor Author

done in incremental

var nodes []node.Node
nodes = append(nodes, node.Node{
	Name:      "a1.dcos",
	Addresses: []string{"192.168.65.111"},
Collaborator

I'm guessing these are IPs of your private setup? Add a comment if so.

Is there a way to query DCOS for nodes in the cluster?

Member

If DCOS does not have such an API, can we pass it through ginkgo params or env variables?

@piyush-nimbalkar (Contributor Author), Nov 7, 2017

There is a way to get nodes from mesos, but no way to get the marathon or dcos master. So the user will have to give us the master IP via an env variable, which we can use to get the nodes or communicate with marathon.
I'll do the whole thing in the next PR. This PR is mainly focused on getting torpedo to run the setup/teardown test.

Contributor Author

Actually, that was wrong info. Mesos provides a nice set of DNS names. We can use "leader.mesos" to get to the mesos master, and similarly for marathon.

@disrani-px (Contributor), Nov 7, 2017

Marathon is running on the master node, so it'll be the same as leader.mesos.

You can use mesos APIs to get the list of nodes, for example using curl: curl http://master.mesos:5050/slaves

The mesos-go library should have the APIs for this.
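For illustration, a minimal Go sketch of listing agents through that endpoint; getMesosAgents is a hypothetical helper, and the JSON field names follow the Mesos /slaves response:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// slavesResponse models the subset of the /slaves JSON we care about.
type slavesResponse struct {
	Slaves []struct {
		ID       string `json:"id"`
		Hostname string `json:"hostname"`
	} `json:"slaves"`
}

// getMesosAgents queries the Mesos master (via its Mesos-DNS name) and
// returns the hostnames of all registered agents.
func getMesosAgents() ([]string, error) {
	resp, err := http.Get("http://master.mesos:5050/slaves")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var sr slavesResponse
	if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
		return nil, err
	}

	hosts := make([]string, 0, len(sr.Slaves))
	for _, s := range sr.Slaves {
		hosts = append(hosts, s.Hostname)
	}
	return hosts, nil
}

func main() {
	hosts, err := getMesosAgents()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(hosts)
}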

Contributor Author

Yes, the mesos-go library seemed a bit complicated, hence I decided to do that in a later commit.

func (d *portworx) updateNode(n node.Node, pxNodes []api.Node) {
	for _, address := range n.Addresses {
		for _, pxNode := range pxNodes {
			if address == pxNode.DataIp {
Collaborator

Add check for MgmtIP too.

Contributor Author

done
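A minimal sketch of the extended check, assuming the openstorage api.Node type carries both DataIp and MgmtIp fields (matchesPxNode is a hypothetical helper, not the PR's code):

// matchesPxNode reports whether any of the scheduler node's addresses
// match the Portworx node's data IP or its management IP.
func matchesPxNode(n node.Node, pxNode api.Node) bool {
	for _, address := range n.Addresses {
		if address == pxNode.DataIp || address == pxNode.MgmtIp {
			return true
		}
	}
	return false
}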

},
{
"key": "volume",
"value": "size=20,repl=2,name=px_mysql_vol:/var/lib/mysql"
Collaborator

Let's use repl 3. That will give write quorum if we kill one of the nodes.

Contributor Author

done

return nil, err
}

_, err := m.client.Application(name)
Collaborator

Can be

if _, err := m.client.Application(name); err != nil 

Contributor Author

done

// Initialize Marathon client if not initialized
func (m *marathonOps) initMarathonClient() error {
	if m.client == nil {
		marathonURL := "http://192.168.65.90:8080"
Collaborator

How are we eventually going to find the marathon URL? (E.g. k8s gets it via the mounted px-account or the KUBE_CONFIG env variable.)

Contributor

Use leader.mesos
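A sketch of the change, assuming Marathon's default port 8080 as in the original URL:

// Reach Marathon via the Mesos-DNS name instead of a hardcoded IP;
// Marathon runs on the master, so leader.mesos resolves to it.
marathonURL := "http://leader.mesos:8080"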


return err
}

return m.client.WaitOnApplication(name, 5*time.Minute)
Member

Is this a blocking call?
The callers are not expecting ValidateApplication to be a blocking call. Also the time period can be a constant.

Contributor Author

Renamed the methods

return nil, nil
}

if _, err := task.DoRetryWithTimeout(t, 5*time.Minute, 10*time.Second); err != nil {
Member

Maybe we could regulate the timeout and wait-period constants by defining them in a constants file. All these constants should be common to both k8s and dcos.

Contributor Author

I agree that we should have constants for all these things. There are many such places; let's change them everywhere all at once.
Created an issue to track: #67
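A possible shape for such a shared constants file; the names are illustrative and the values mirror the timeouts already used in this PR:

// Illustrative shared constants for scheduler drivers (k8s and dcos alike).
package scheduler

import "time"

const (
	// DefaultTimeout bounds retry loops such as DoRetryWithTimeout calls.
	DefaultTimeout = 5 * time.Minute
	// DefaultRetryInterval is the wait between retries.
	DefaultRetryInterval = 10 * time.Second
)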

t := func() (interface{}, error) {
	clusterManager := d.getClusterManager()
	if status, _ := clusterManager.NodeStatus(); status != api.Status_STATUS_OK {
Member

I don't understand the use of this function if we have WaitForNode. This function is only going to check the status of the node which is currently the endpoint.

If the intention of this function is just to check cluster status, then use clusterManager.Enumerate(), which returns an api.Cluster object that has the cluster status.

Contributor Author

done in incremental
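For reference, a sketch of the suggested check; Enumerate returns an api.Cluster whose Status field reflects overall cluster health (the error message is illustrative):

t := func() (interface{}, error) {
	// Check overall cluster status rather than only the endpoint node's.
	cluster, err := d.getClusterManager().Enumerate()
	if err != nil {
		return nil, err
	}
	if cluster.Status != api.Status_STATUS_OK {
		return nil, fmt.Errorf("cluster status not OK: %v", cluster.Status)
	}
	return nil, nil
}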

"cpus": 0.5,
"mem": 256,
"docker": {
"image": "piyushpx/torpedo:latest"
Collaborator

Before final checkin, change to portworx/torpedo:latest

Contributor Author

done

if value, ok := opts[scheduler.OptionsWaitForResourceLeakCleanup]; ok && value {
	if err := d.WaitForDestroy(ctx); err != nil {
		return err
	}
Collaborator

Add a TODO to add call to waitForCleanup here once it's implemented.

Contributor Author

done

},
{
"key": "volume",
"value": "size=10,repl=3,name=px_mysql_vol:/var/lib/mysql"
Collaborator

Reduce size to 2. 10G will cause issues on dev systems, which might have less storage.

Contributor Author

done

tests/common.go
Step(fmt.Sprintf("wait for %s app to start running", ctx.App.Key), func() {
	err := Inst().S.WaitForRunning(ctx)
	expect(err).NotTo(haveOccurred())
})

Step(fmt.Sprintf("validate %s app's volumes", ctx.App.Key), func() {
Collaborator

Any reason to move the volume check after the app check? The volume check was before the application check since volumes are a prerequisite for an app container to start. So if that itself fails, there is no need to WaitForRunning.

Contributor Author

done as discussed.


cluster, err := task.DoRetryWithTimeout(t, 2*time.Minute, 10*time.Second)
if err != nil {
	return api.Cluster{}, err
Collaborator

In an error case, it's better to return a nil object than a default one.

Contributor Author

It was an object, hence I could not return nil, but I changed it to a reference.
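That is, roughly (a sketch; the type assertion assumes the retry callback now returns a *api.Cluster):

// With a *api.Cluster return type, error paths can return nil.
cluster, err := task.DoRetryWithTimeout(t, 2*time.Minute, 10*time.Second)
if err != nil {
	return nil, err
}
return cluster.(*api.Cluster), nil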

@piyush-nimbalkar (Contributor Author)

Thanks for the review guys!

// ValidateApplication checks if the application is running and healthy
ValidateApplication(string) error
// DeleteApplication deletes the given application
DeleteApplication(string) error
Contributor

Marathon has 3 different operations: stop, remove, kill.

All 3 behave differently and need to be tested, so there should be interfaces for all of them.

Contributor Author

Follow-up on the discussion: there are counter-intuitive APIs for the stop and kill operations. Stop is basically updating the app with an instance count of 0, while kill calls a DELETE on the tasks of that app.

We can implement these as we need them going forward.
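For reference, a sketch of the two operations expressed against the Marathon REST API directly; stopApp and killAppTasks are hypothetical helpers, and leaderURL would be e.g. http://leader.mesos:8080:

package marathon

import (
	"net/http"
	"strings"
)

// stopApp "stops" an app by scaling it to 0 instances (PUT /v2/apps/{id}).
func stopApp(leaderURL, appID string) error {
	body := strings.NewReader(`{"instances": 0}`)
	req, err := http.NewRequest(http.MethodPut, leaderURL+"/v2/apps/"+appID, body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// killAppTasks kills an app's tasks (DELETE /v2/apps/{id}/tasks); Marathon
// restarts them unless a scale is requested alongside.
func killAppTasks(leaderURL, appID string) error {
	req, err := http.NewRequest(http.MethodDelete, leaderURL+"/v2/apps/"+appID+"/tasks", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}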


"parameters": [
{
"key": "volume-driver",
"value": "pxd"
Contributor

These specs should be templated out. The driver field should be populated depending on which volume driver is being tested. It looks like it is the same for the k8s specs. So they can be fixed together.

Contributor Author

Created an issue to track this: #70

@piyush-nimbalkar merged commit 906e4e9 into portworx:master on Nov 8, 2017
@piyush-nimbalkar deleted the dcos branch on November 8, 2017 21:49