Gce channel errors out when there are no VMs (instances) in Google cloud #381

Open
StevenCTimm opened this issue Oct 27, 2021 · 10 comments
Labels
prj_testing Issue identified by HEPCloud Phase IV Integration Testing

Comments

@StevenCTimm
Contributor

The Gce_Figure_Of_Merit transform errors out when the GCE_Occupancy data frame is empty.
This happened for the first time today, when GCE had no instances running at all.

Need to handle the error condition correctly.

@StevenCTimm StevenCTimm added the prj_testing Issue identified by HEPCloud Phase IV Integration Testing label Oct 31, 2021
@hyunwoo18
Contributor

hyunwoo18 commented Dec 10, 2021

While migrating to OpenStack, I lost my development virtual machine.
I have a suggestion to handle this exception:
Can you test this in your dev machine?

In decisionengine_modules/GCE/transforms/GceFigureOfMerit.py
From

        for i, row in performance.iterrows():
            az = row["AvailabilityZone"]
            it = row["InstanceType"]
            entry_name = row["EntryName"]

To

        for i, row in performance.iterrows():
            try:
                az = row["AvailabilityZone"]
                it = row["InstanceType"]
                entry_name = row["EntryName"]
            except AttributeError:
                self.logger.error("In GceFigureOfMerit: Input is empty")

Something like this.

But looking at the code again more closely, I am not sure what actually needs to be done when this exception takes place, i.e. when there is no VM.

Steve said, "Need to handle the error condition correctly.",
What is the correct way of handling error here?
That is, can I do this for example?

        for i, row in performance.iterrows():
            try:
                az = row["AvailabilityZone"]
                it = row["InstanceType"]
                entry_name = row["EntryName"]
            except AttributeError:
                self.logger.error("In GceFigureOfMerit: Input is empty")
                fom = 0
                continue

so that the code does not reach the two statements below:

            fom = figure_of_merit(row["PricePerformance"],
                                  occupancy,
                                  max_allowed,
                                  idle,
                                  max_idle)

            figures_of_merit.append({"EntryName": entry_name,
                                     "FigureOfMerit": fom})

Or should I make the code return a null value right away the first time this exception occurs:

        for i, row in performance.iterrows():
            try:
                az = row["AvailabilityZone"]
                it = row["InstanceType"]
                entry_name = row["EntryName"]
            except AttributeError:
                self.logger.error("In GceFigureOfMerit: Input is empty")
                return {'GCE_Price_Performance': None,  'GCE_Figure_Of_Merit': None}

I cannot tell which, because I do not understand which part of the DE code runs after this file (GceFigureOfMerit.py).

@StevenCTimm
Contributor Author

I will have to check. The right behavior is to return a blank GCE_Figure_of_Merit data frame, but
the failure mode is likely that the GCE_Occupancy block is empty, and that is not shown here.
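
A minimal sketch of what "return a blank data frame" could look like, assuming the output columns are the two keys appended in the transform's loop (EntryName and FigureOfMerit). The helper name `build_figure_of_merit_frame` is hypothetical and not part of decisionengine_modules.

```python
import pandas as pd

# Column names taken from the dicts appended inside the transform loop.
FOM_COLUMNS = ["EntryName", "FigureOfMerit"]


def build_figure_of_merit_frame(figures_of_merit):
    """Hypothetical helper: build the output frame with fixed columns.

    Passing an empty list yields a frame with zero rows but the expected
    columns, so downstream code that selects columns by name still works.
    """
    return pd.DataFrame(figures_of_merit, columns=FOM_COLUMNS)


empty = build_figure_of_merit_frame([])
filled = build_figure_of_merit_frame(
    [{"EntryName": "entry_a", "FigureOfMerit": 1.5}]
)
```

Fixing the columns up front means consumers see the same schema whether or not any VMs exist.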

@hyunwoo18
Contributor

As I have more understanding of the DE source code structure, I will resume working on this issue now.

@hyunwoo18
Contributor

I found this evidence from gpde01:/var/log/decision/gp_Gce.log:

2021-10-27T14:29:34-0500 - root - TaskManager - 22421 - GceFigureOfMerit - ERROR - exception from transform GceFigureOfMerit
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 422, in run_transform
    data = transform.worker.transform(data_block)
  File "/usr/lib/python3.6/site-packages/decisionengine_modules/GCE/transforms/GceFigureOfMerit.py", line 51, in transform
    occupancy_df = gce_occupancy[((gce_occupancy.AvailabilityZone == az) &
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/generic.py", line 5141, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'AvailabilityZone'
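
The traceback can be reproduced in isolation. The sketch below assumes the occupancy frame arrives with no columns at all, which is what the error message suggests. It shows that attribute-style access raises AttributeError, while bracket access on the same frame raises KeyError instead. The helper `access_error` is only for illustration.

```python
import pandas as pd


def access_error(fn):
    """Return the name of the exception raised by fn(), or None."""
    try:
        fn()
    except Exception as exc:
        return type(exc).__name__
    return None


# A frame with no rows and, crucially, no columns.
empty = pd.DataFrame()

attr_err = access_error(lambda: empty.AvailabilityZone)
key_err = access_error(lambda: empty["AvailabilityZone"])

# A frame with zero rows but the expected columns does not fail.
with_cols = pd.DataFrame(
    columns=["AvailabilityZone", "InstanceType", "Occupancy"]
)
no_err = access_error(lambda: with_cols.AvailabilityZone)
```

So the failure comes from the missing column, not from the frame having zero rows.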

I also did the following in order to start using Gce channel in my testing VM (fermicloud571):
[hyunwoo@ssilogin02 ~]$ scp root@fermicloud435:/etc/decisionengine/config.d/Gce.jsonnet ./
[hyunwoo@ssilogin02 ~]$ scp Gce.jsonnet root@fermicloud571:/etc/decisionengine/config.d/

Let me see if I can run gce channel

@hyunwoo18
Contributor

Okay,
I enabled the Gce channel on fermicloud571 (my DE dev machine).
I had to renew monitor.json in both credentials.db and /etc/gwms-frontend/credentials/
(I enabled it on Steve's machine too, fermicloud435).

I will start debugging GceFigureOfMerit.py tomorrow.

@hyunwoo18
Contributor

Okay, finally..
I studied these two files
decisionengine_modules/GCE/transforms/GceFigureOfMerit.py
decisionengine_modules/GCE/sources/GceOccupancy.py

and concluded that the following change in decisionengine_modules/GCE/transforms/GceFigureOfMerit.py
should cover the exception (no GCE VMs running)

    def transform(self, data_block):

        self.logger.debug("in GceFigureOfMerit transform")
        performance     = self.GCE_Instance_Performance( data_block)
        performance["PricePerformance"] = np.where( performance["PerfTtbarTotal"] > 0,  (performance["OnDemandPrice"]/performance["PerfTtbarTotal"]),   sys.float_info.max )

        factory_entries = self.Factory_Entries_GCE(      data_block).fillna(0)
        gce_occupancy   = self.GCE_Occupancy(            data_block).fillna(0)

        figures_of_merit = []

        for i, row in performance.iterrows():
            az         = row["AvailabilityZone"]
            it         = row["InstanceType"]
            entry_name = row["EntryName"]
<new code>
            try:
                occupancy_df = gce_occupancy[((gce_occupancy.AvailabilityZone == az) &
                                              (gce_occupancy.InstanceType == it))]
            except AttributeError:
                # Empty GCE_Occupancy frame: no AvailabilityZone column, so treat occupancy as 0.
                occupancy = 0
            else:
                occupancy = float(
                    occupancy_df["Occupancy"].values[0]) if not occupancy_df.empty else 0
</new code>

<original code>
            occupancy_df = gce_occupancy[((gce_occupancy.AvailabilityZone == az) &
                                          (gce_occupancy.InstanceType == it))]
            occupancy = float(
                occupancy_df["Occupancy"].values[0]) if not occupancy_df.empty else 0

</original code>

            max_allowed = max_idle = idle = 0

I tested this change in fermicloud435 (Steve's testing VM) and
at least it does not crash with current input.
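
As a possible alternative to wrapping the lookup in try/except, the occupancy frame could be normalized so that the expected columns always exist. This is only a sketch: the helper `lookup_occupancy` and the constant `OCCUPANCY_COLUMNS` are hypothetical, and the column names are taken from the lookup in the transform.

```python
import pandas as pd

# Columns the lookup relies on (as used in GceFigureOfMerit.py).
OCCUPANCY_COLUMNS = ["AvailabilityZone", "InstanceType", "Occupancy"]


def lookup_occupancy(gce_occupancy, az, it):
    """Hypothetical helper: occupancy for (az, it), or 0.0 if absent."""
    # reindex adds any missing column (filled with NaN) and keeps existing
    # data; other columns are dropped, which is fine for this lookup.
    occ = gce_occupancy.reindex(columns=OCCUPANCY_COLUMNS)
    match = occ[(occ["AvailabilityZone"] == az) & (occ["InstanceType"] == it)]
    return float(match["Occupancy"].values[0]) if not match.empty else 0.0


populated = pd.DataFrame(
    [{"AvailabilityZone": "zone-a", "InstanceType": "type-1", "Occupancy": 3}]
)
hit = lookup_occupancy(populated, "zone-a", "type-1")
miss = lookup_occupancy(pd.DataFrame(), "zone-a", "type-1")
```

With this approach an empty frame and a frame without a matching row take the same code path, so no exception handling is needed.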

Next:
We will have to wait until we have zero VMs in GCE and see whether the GCE channel crashes on all other DE instances
while this instance (fermicloud435) does not.
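
The zero-VM condition might also be approximated offline by feeding an empty frame to the lookup logic. The sketch below is a stand-in for the patched lookup, not the actual transform: the function name `occupancy_for` and the sample values are made up, and catching AttributeError specifically is an assumption based on the traceback above.

```python
import pandas as pd


def occupancy_for(gce_occupancy, az, it):
    """Stand-in for the patched lookup inside the transform loop."""
    try:
        occupancy_df = gce_occupancy[
            (gce_occupancy.AvailabilityZone == az)
            & (gce_occupancy.InstanceType == it)
        ]
    except AttributeError:
        # No AvailabilityZone column: treat as "no VMs running".
        return 0.0
    if occupancy_df.empty:
        return 0.0
    return float(occupancy_df["Occupancy"].values[0])


# Simulated inputs: an empty cloud and a cloud with one instance type.
no_vms = pd.DataFrame()
some_vms = pd.DataFrame(
    [{"AvailabilityZone": "zone-a", "InstanceType": "type-1", "Occupancy": 4}]
)

empty_result = occupancy_for(no_vms, "zone-a", "type-1")
found_result = occupancy_for(some_vms, "zone-a", "type-1")
```

This does not replace the live test against GCE, but it exercises the same branch without waiting for the cloud to empty out.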

In the meantime, I will clone the source code, make the changes and push.

@hyunwoo18
Contributor

Steve said, "We can get zero VMs in GCE by temporarily killing the squid server.
and then starting it back up."

Could you do it, Steve?
I have not connected to GCP/GCE for a long time.
Let me know when you have actually done it.
Thanks!

@StevenCTimm
Contributor Author

StevenCTimm commented Mar 22, 2022 via email

@hyunwoo18
Contributor

Okay, Steve and I ran a test today: Steve deleted the GCP squid VM,
and we compared how the GCE channel behaved on this testing VM (fermicloud435)
and on the other production DE instances.
We concluded that my patch seems to work!
I will proceed to push the changes to the repo.

@StevenCTimm
Contributor Author

StevenCTimm commented Oct 11, 2022 via email

@namrathaurs namrathaurs self-assigned this Jul 20, 2023