
Overhaul database schema #215

Open · wants to merge 7 commits into master
1 change: 1 addition & 0 deletions CHANGELOG.rdoc
@@ -4,6 +4,7 @@ KalibroProcessor is the processing web service for Mezuro.

== Unreleased

* Optimize database structure by adding foreign keys and indexes where needed
* Insert in one query all aggregated MetricResults
* Aggregate values by tree levels
* Enable ModuleResult tree walking by level
103 changes: 103 additions & 0 deletions db/migrate/20160720185407_clean_inconsistencies.rb
@@ -0,0 +1,103 @@
require 'fileutils'

class CleanInconsistencies < ActiveRecord::Migration
Contributor:

Should this script be a migration? We are not changing the database structure, just removing records.

Contributor (Author):

We are semantically changing the database structure, from allowing inconsistencies to not allowing them. We could run this manually, but then anyone using the software would have to do the same; otherwise the next migrations won't work.

Contributor:

I don't think this script is preventing us from creating inconsistencies. It is merely cleaning things up so that the other scripts (migrations) can change the structure (by adding indexes and foreign keys, for example).

Contributor (Author):

Yes, the next scripts add the constraints to prevent the inconsistencies, but they would fail if the inconsistencies already exist. If you can suggest a better method to handle it I'll be happy to use it :)
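To illustrate the failure mode (a sketch, not output from this PR; the constraint name is made up), PostgreSQL aborts outright if orphaned rows are still present when the constraint is added:

ALTER TABLE process_times
  ADD CONSTRAINT fk_process_times_processings
  FOREIGN KEY (processing_id) REFERENCES processings (id);
-- ERROR:  insert or update on table "process_times" violates foreign key
--         constraint "fk_process_times_processings"
-- DETAIL: Key (processing_id)=(123) is not present in table "processings".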

Contributor:

Maybe a rake task can achieve similar goals, with warning messages for those running the migrations. Skipping the migrations if they would fail would be good too.
That way, we don't force anyone to erase data without warning them first.
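Something along these lines, purely as a sketch (the task name, the CONFIRM flag and the cleanup query are illustrative, not code from this PR):

namespace :db do
  desc 'Back up and remove orphaned records before the schema migrations run'
  task clean_inconsistencies: :environment do
    unless ENV['CONFIRM'] == 'yes'
      puts 'WARNING: this task permanently deletes orphaned records.'
      puts 'Re-run with CONFIRM=yes to proceed.'
      next
    end

    # Example cleanup: drop processings whose repository no longer exists
    deleted = Processing.where.not(repository_id: Repository.select(:id)).delete_all
    puts "Deleted #{deleted} orphaned processings"
  end
end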

Contributor:

> Any data that the migration erases was already unusable and/or unreachable anyway. It existed purely due to bugs and poor validations, and can't even be retrieved by the APIs (such as multiple metric results with the same metric in the same module result).

That's true from the KalibroClient's point of view. If we were mining our database for statistics or some kind of study, suddenly losing data would make us desperate! xD
Notice that in those cases it may make sense not to erase a processing just because it is not associated with a repository, for example.

I think that because we are changing data, not structure, this should not be a migration. Adding the extra step is surely more bureaucratic, but I think it's good practice to at least warn anyone who uses this software before erasing anything. Maybe a rake task is not the best option, but I don't think a migration is either (even though the Rails guides say migrations can be used to add or modify data).

Anyway, if I'm the only one not happy with this, we can move on. 😄

Contributor (Author):

> That's true from the KalibroClient's point of view. If we were mining our database for statistics or some kind of study, suddenly losing data would make us desperate! xD

It's true from the point of view of the model that the database is an expression of. If we had made a more faithful conversion, we would have had foreign keys from the start, and the data being deleted now would never have existed. The only reason it does exist is because we failed to keep garbage out. The fact that it got in doesn't make it not garbage.

What meaningful statistics can you gather from data that violates the semantic constraints of your model, and that should never exist? In the particular cases of things this migration deletes, you can't even reference the context that would allow you to make any sense of them.

What worth are process times if you can't even know what was processed? What worth is a kalibro module with no associated module result? What worth is a metric result that would never be looked at, and that is possibly complete garbage because a wrongly generated name associated it with the wrong module result?

> I think that because we are changing data, not structure, this should not be a migration.

I don't see why that would be the case. Migrations are a mechanism for evolving a database progressively without throwing it away. Nothing makes them specific to structure. Also, changing structure often entails changing data to do it properly; this is just another case of that, and it is ugly because the situation is ugly.

> Maybe a rake task is not the best option, but I don't think a migration is either

The migration is the least bad option, as it is at least run in a controlled environment and is part of the known upgrade steps that any admin must run. How would someone upgrading find out that they need to run the rake task, and how to run it? Could we make it work in an automated process without requiring manual intervention?

Contributor:

If you processed a repository and then deleted it, all the relevant data would still be there. Suppose we had 500 repositories processed and then all of them were deleted. If you looked at their process times and saw that, say, the aggregation time was surprisingly high, would you disregard the data just because you can't see the associated repository?

I agree that changing structure often entails changing data. But I think it's just wrong to delete data without even notifying the users. It's not a good policy because the data is not yours, even though you may think it's garbage. In my opinion, we have to make users aware of it even if it means the process is not entirely automatic. I would be happier if we could at least make the migration require some user input, like "This migration is going to delete this kind of data, ok?".
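Roughly what I have in mind, as an illustrative sketch only (not code from this PR):

# Hypothetical confirmation step at the top of the destructive migration
say 'This migration deletes orphaned records (they are backed up first).'
print 'Continue? [y/N] '
unless $stdin.gets.to_s.strip.casecmp('y').zero?
  raise 'Migration aborted: destructive cleanup was not confirmed'
end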

I've come to the conclusion that the migration is not a bad option. However, deleting data without informing the users, in my opinion, definitely is.

Contributor (Author) @danielkza, Jul 28, 2016:

@diegoamc I understand your concerns, but I am confident that everything this migration deletes is data that could not be accessed at all, or that doesn't even make sense. In your particular example, it's hard to judge whether a processing took long when the repository used is unknown. The fact that a processing does not make sense without its repository is encoded in the model: these changes just enforce at the database level what was already supposed to happen.

If there is any data being deleted in a way that is not specified in the model, we should absolutely fix the migration or the model (whichever is wrong).

> But I think it's just wrong to delete data without even notifying the users. It's not a good policy because the data is not yours

We have to keep in mind that we had the responsibility to create valid data and failed at it. We have the knowledge to separate what is garbage (according to the model) from what is not, such that the remaining data can be trusted from now on. I'm not proposing to delete old data or anything like that, but data that should never have existed at all. No judgements of value can be made about its contents because they can be absolute nonsense.

> I would be happier if we could at least make the migration require some user input, like "This migration is going to delete this kind of data, ok?".

I don't know of a good way to do it, since we have no guarantee that the migration is interactive - it may be run in an automated way.

Contributor:

I understood your point too, but I can't accept the PR. I wish I had the time right now to sit down for a while, study the problem, and propose another solution.
If nobody else is unhappy about it, please accept it :)

  def backup_dir
    return @backup_dir if @backup_dir
    @backup_dir = Rails.root.join('db', 'backup', Time.now.strftime('%Y-%m-%d_%H-%M-%S'))
    FileUtils.mkdir_p(@backup_dir)
  end

  def backup(name, query, header: true)
    say_with_time("backup(#{name.inspect}, #{query.inspect})") do
      copy_query = "COPY (#{query}) TO STDOUT WITH DELIMITER ',' CSV #{header ? 'HEADER' : ''}"

      File.open(backup_dir.join("#{name}.csv"), 'a') do |f|
        connection.raw_connection.copy_data(copy_query) do
          while line = connection.raw_connection.get_copy_data
            f.write(line)
          end
        end
      end
    end
  end

  def backup_and_delete_missing(table, exists_query)
    backup(table, "SELECT * FROM \"#{table}\" WHERE NOT EXISTS(#{exists_query})")
    execute "DELETE FROM \"#{table}\" WHERE NOT EXISTS(#{exists_query})"
  end

  def up
    say "WARNING: destructive migration necessary. Deleted data will be backed up to #{backup_dir}"

    # Unset project reference for repositories with non-existing projects
    execute <<-SQL
      UPDATE repositories AS r
Contributor:

Could this generate inconsistencies in ownerships of Prezento? Should we do something similar there?

Contributor (Author) @danielkza, Jul 27, 2016:

I don't see how it could matter considering this only unsets the references to records that do not exist. If Prezento had those same references they would already be broken if anyone tried to actually use them for anything.

IMO we should have proper database constraints everywhere, but adding them after a long time is quite a bit harder. I don't know whether it's worth the extra work. I won't have the free time to do the same for the other services.

      SET project_id = NULL
      WHERE project_id = 0 OR NOT EXISTS (
        SELECT 1 FROM projects AS p WHERE p.id = r.project_id
      )
    SQL

    # Delete processings with non-existing repositories
    backup_and_delete_missing("processings",
      "SELECT 1 FROM repositories AS r WHERE r.id = processings.repository_id")

    # Delete process times with non-existing processings
    backup_and_delete_missing("process_times",
      "SELECT 1 FROM processings AS p WHERE p.id = process_times.processing_id")

    # Delete module results with non-existing processings
    backup_and_delete_missing("module_results",
      "SELECT 1 FROM processings AS p WHERE p.id = module_results.processing_id")

    # Delete kalibro modules with non-existing module results
    backup_and_delete_missing("kalibro_modules",
      "SELECT 1 FROM module_results AS m WHERE m.id = kalibro_modules.module_result_id")

    # Fix up the metric results type, even before backing up, so the backup is cleaner
    execute <<-SQL
      UPDATE metric_results SET "type" = 'TreeMetricResult' WHERE "type" = 'MetricResult'
    SQL

    # Delete metric results with non-existing module results
    backup_and_delete_missing("metric_results",
      "SELECT 1 FROM module_results AS m WHERE m.id = metric_results.module_result_id")

    # Delete duplicate metric_results. Group them by (module_result_id, metric_configuration_id),
    # then delete all but the one with the highest ID. The double wrapping of the inner query is
    # necessary because window functions cannot be used in WHERE in PostgreSQL.
    repeated_metric_result_query = exec_query <<-SQL
      SELECT t.id FROM (
        SELECT metric_results.*, ROW_NUMBER() OVER (
          PARTITION BY module_result_id, metric_configuration_id, "type"
          ORDER BY id DESC) AS rnum
        FROM metric_results
        WHERE "type" = 'TreeMetricResult'
      ) AS t
      WHERE t.rnum > 1
    SQL

    unless repeated_metric_result_query.empty?
      repeated_metric_result_ids = repeated_metric_result_query.rows.flat_map(&:first).join(',')

      # Replace the default messages with custom ones to avoid flooding the screen with the huge query
      say_with_time('backup("metric_results", "SELECT * FROM metric_results WHERE id IN (...)")') do
        suppress_messages do
          backup('metric_results',
                 "SELECT * FROM metric_results WHERE id IN (#{repeated_metric_result_ids})",
                 header: false)
        end
      end

      say_with_time('execute("DELETE FROM metric_results WHERE id IN (...)")') do
        suppress_messages do
          execute "DELETE FROM metric_results WHERE id IN (#{repeated_metric_result_ids})"
        end
      end
    end
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end
6 changes: 6 additions & 0 deletions db/migrate/20160720185408_add_indexes_to_kalibro_modules.rb
@@ -0,0 +1,6 @@
class AddIndexesToKalibroModules < ActiveRecord::Migration
  def change
    add_foreign_key :kalibro_modules, :module_results, on_delete: :cascade
    add_index :kalibro_modules, [:long_name, :granularity]
  end
end
6 changes: 6 additions & 0 deletions db/migrate/20160720185409_add_indexes_to_module_results.rb
@@ -0,0 +1,6 @@
class AddIndexesToModuleResults < ActiveRecord::Migration
  def change
    add_foreign_key :module_results, :module_results, column: 'parent_id'
    add_foreign_key :module_results, :processings, on_delete: :cascade
  end
end
11 changes: 11 additions & 0 deletions db/migrate/20160720185410_add_indexes_to_metric_results.rb
@@ -0,0 +1,11 @@
class AddIndexesToMetricResults < ActiveRecord::Migration
  def change
    add_foreign_key :metric_results, :module_results, on_delete: :cascade
    add_index :metric_results, :type
    add_index :metric_results, :module_result_id
    add_index :metric_results, :metric_configuration_id
    add_index :metric_results, [:module_result_id, :metric_configuration_id],
Contributor @diegoamc, Jul 27, 2016:

What is this line doing?

Contributor (Author) @danielkza, Jul 27, 2016:

It's the database counterpart to the unique validation in TreeMetricResult. It creates a composite index that ensures the uniqueness of the (module_result_id, metric_configuration_id) tuple, but only across records where type = 'TreeMetricResult'.
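In raw PostgreSQL terms, the add_index call above amounts to a partial unique index, roughly:

CREATE UNIQUE INDEX metric_results_module_res_metric_cfg_uniq_idx
  ON metric_results (module_result_id, metric_configuration_id)
  WHERE type = 'TreeMetricResult';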

      unique: true, where: "type = 'TreeMetricResult'",
      name: 'metric_results_module_res_metric_cfg_uniq_idx'
  end
end
6 changes: 6 additions & 0 deletions db/migrate/20160720185411_add_indexes_to_processings.rb
@@ -0,0 +1,6 @@
class AddIndexesToProcessings < ActiveRecord::Migration
  def change
    add_foreign_key :processings, :repositories
    add_foreign_key :processings, :module_results, column: 'root_module_result_id'
  end
end
5 changes: 5 additions & 0 deletions db/migrate/20160720185412_add_indexes_to_process_times.rb
@@ -0,0 +1,5 @@
class AddIndexesToProcessTimes < ActiveRecord::Migration
  def change
    add_foreign_key :process_times, :processings, on_delete: :cascade
  end
end
5 changes: 5 additions & 0 deletions db/migrate/20160720185413_add_indexes_to_repositories.rb
@@ -0,0 +1,5 @@
class AddIndexesToRepositories < ActiveRecord::Migration
  def change
    add_foreign_key :repositories, :projects
  end
end
9 changes: 9 additions & 0 deletions db/migrate/20160720185414_add_indexes_of_foreign_keys.rb
@@ -0,0 +1,9 @@
class AddIndexesOfForeignKeys < ActiveRecord::Migration
  def change
    add_index :module_results, :processing_id
    add_index :kalibro_modules, :module_result_id
    add_index :process_times, :processing_id
    add_index :processings, :repository_id
    add_index :repositories, :project_id
  end
end
24 changes: 23 additions & 1 deletion db/schema.rb
@@ -11,7 +11,7 @@
#
# It's strongly recommended that you check this file into your version control system.

ActiveRecord::Schema.define(version: 20151002172231) do
ActiveRecord::Schema.define(version: 20160720185414) do

  # These are extensions that must be enabled in order to support this database
  enable_extension "plpgsql"
@@ -40,6 +40,9 @@
    t.integer "module_result_id"
  end

  add_index "kalibro_modules", ["long_name", "granularity"], name: "index_kalibro_modules_on_long_name_and_granularity", using: :btree
  add_index "kalibro_modules", ["module_result_id"], name: "index_kalibro_modules_on_module_result_id", using: :btree

  create_table "metric_results", force: :cascade do |t|
    t.integer "module_result_id"
    t.integer "metric_configuration_id"
@@ -52,7 +55,11 @@
    t.integer "related_hotspot_metric_results_id"
  end

  add_index "metric_results", ["metric_configuration_id"], name: "index_metric_results_on_metric_configuration_id", using: :btree
  add_index "metric_results", ["module_result_id", "metric_configuration_id"], name: "metric_results_module_res_metric_cfg_uniq_idx", unique: true, where: "((type)::text = 'TreeMetricResult'::text)", using: :btree
  add_index "metric_results", ["module_result_id"], name: "index_metric_results_on_module_result_id", using: :btree
  add_index "metric_results", ["related_hotspot_metric_results_id"], name: "index_metric_results_on_related_hotspot_metric_results_id", using: :btree
  add_index "metric_results", ["type"], name: "index_metric_results_on_type", using: :btree

  create_table "module_results", force: :cascade do |t|
    t.float "grade"
@@ -63,6 +70,7 @@
  end

  add_index "module_results", ["parent_id"], name: "index_module_results_on_parent_id", using: :btree
  add_index "module_results", ["processing_id"], name: "index_module_results_on_processing_id", using: :btree

create_table "process_times", force: :cascade do |t|
t.string "state", limit: 255
Expand All @@ -72,6 +80,8 @@
t.float "time"
end

add_index "process_times", ["processing_id"], name: "index_process_times_on_processing_id", using: :btree

create_table "processings", force: :cascade do |t|
t.string "state", limit: 255
t.integer "repository_id"
Expand All @@ -81,6 +91,8 @@
t.text "error_message"
end

add_index "processings", ["repository_id"], name: "index_processings_on_repository_id", using: :btree

create_table "projects", force: :cascade do |t|
t.string "name", limit: 255
t.string "description", limit: 255
Expand All @@ -106,5 +118,15 @@
t.string "branch", default: "master", null: false
end

add_index "repositories", ["project_id"], name: "index_repositories_on_project_id", using: :btree

add_foreign_key "kalibro_modules", "module_results", on_delete: :cascade
add_foreign_key "metric_results", "module_results", on_delete: :cascade
add_foreign_key "metric_results", "related_hotspot_metric_results", column: "related_hotspot_metric_results_id"
add_foreign_key "module_results", "module_results", column: "parent_id"
add_foreign_key "module_results", "processings", on_delete: :cascade
add_foreign_key "process_times", "processings", on_delete: :cascade
add_foreign_key "processings", "module_results", column: "root_module_result_id"
add_foreign_key "processings", "repositories"
add_foreign_key "repositories", "projects"
end
5 changes: 2 additions & 3 deletions features/metric_result/module_result.feature
@@ -5,8 +5,7 @@ Feature: ModuleResult retrieval

@clear_repository @kalibro_configuration_restart
Scenario: With a valid MetricResult id
Given I have sample readings
And I have a sample configuration with the Flay hotspot metric
Given I have a sample configuration with the Flay hotspot metric
And I have the kalibro processor ruby repository with revision "v0.11.0"
And I have a processing within the sample repository
And I run for the given repository
@@ -15,5 +14,5 @@
Then I should get the given ModuleResult json

Scenario: With an invalid MetricResult id
When I request for the ModuleResult of the MetricResult with id "42"
When I request for the ModuleResult of the MetricResult with id "0"
Then I should get an error response
2 changes: 1 addition & 1 deletion features/repository/metric_result_history_of.feature
@@ -7,7 +7,7 @@ Feature: Metric Result History Of
Scenario: After processing an existing repository with a kalibro configuration
Given I have sample readings
And I have a sample kalibro configuration with native metrics
And I have a sample repository within the sample project
And I have a sample repository
And I have a processing within the sample repository
And I run for the given repository
When I get the history for the first metric result of the root
4 changes: 2 additions & 2 deletions features/runner.feature
@@ -87,7 +87,7 @@ Feature: Runner run
Given I have sample readings
And I have a sample configuration with the Cyclomatic python native metric
And I add the "Maintainability" native metric to the sample configuration
And I have a sample python repository within the sample project
And I have a sample python repository
And I have a processing within the sample repository
When I run for the given repository
Then the repository code_directory should exist
@@ -107,7 +107,7 @@
And the processing retrieved should have a Root ModuleResult
And the Root ModuleResult retrieved should have a list of MetricResults

@clear_repository @kalibro_configuration_restart @docker
@clear_repository @kalibro_configuration_restart @docker @no_transaction
Scenario: An existing php repository with a configuration with PHPMD (Hotspot Metrics)
Given I have a sample configuration with the PHPMD hotspot metric
And I have a sample php repository