Overhaul database schema #215
base: master
Changes from all commits: c59f4ce, dbca269, 1e1cbb2, f299518, 6a1a8f7, a57e571, a8777c6
@@ -0,0 +1,103 @@
require 'fileutils'

class CleanInconsistencies < ActiveRecord::Migration
  def backup_dir
    return @backup_dir if @backup_dir
    @backup_dir = Rails.root.join('db', 'backup', Time.now.strftime('%Y-%m-%d_%H-%M-%S'))
    FileUtils.mkdir_p(@backup_dir)
    @backup_dir
  end

  def backup(name, query, header: true)
    say_with_time("backup(#{name.inspect}, #{query.inspect})") do
      copy_query = "COPY (#{query}) TO STDOUT WITH DELIMITER ',' CSV #{header ? 'HEADER' : ''}"

      File.open(backup_dir.join("#{name}.csv"), 'a') do |f|
        connection.raw_connection.copy_data(copy_query) do
          while line = connection.raw_connection.get_copy_data
            f.write(line)
          end
        end
      end
    end
  end

  def backup_and_delete_missing(table, exists_query)
    backup(table, "SELECT * FROM \"#{table}\" WHERE NOT EXISTS(#{exists_query})")
    execute "DELETE FROM \"#{table}\" WHERE NOT EXISTS(#{exists_query})"
  end

  def up
    say "WARNING: destructive migration necessary. Deleted data will be backed up to #{backup_dir}"

    # Unset project reference for repositories with non-existing projects
    execute <<-SQL
      UPDATE repositories AS r
      SET project_id = NULL
      WHERE project_id = 0 OR NOT EXISTS (
        SELECT 1 FROM projects AS p WHERE p.id = r.project_id
      )
    SQL

Review comment (on the UPDATE above): Could this generate inconsistencies in ownerships of Prezento? Should we do something similar there?

Reply: I don't see how it could matter, considering this only unsets references to records that do not exist. If Prezento had those same references they would already be broken if anyone tried to actually use them for anything. IMO we should have proper database constraints everywhere, but adding them after a long time is quite a bit harder, and I don't know whether it's worth the extra work. I won't have the free time to do the same for the other services.

    # Delete processings with non-existing repositories
    backup_and_delete_missing("processings",
      "SELECT 1 FROM repositories AS r WHERE r.id = processings.repository_id")

    # Delete process times with non-existing processings
    backup_and_delete_missing("process_times",
      "SELECT 1 FROM processings AS p WHERE p.id = process_times.processing_id")

    # Delete module results with non-existing processings
    backup_and_delete_missing("module_results",
      "SELECT 1 FROM processings AS p WHERE p.id = module_results.processing_id")

    # Delete kalibro modules with non-existing module results
    backup_and_delete_missing("kalibro_modules",
      "SELECT 1 FROM module_results AS m WHERE m.id = kalibro_modules.module_result_id")

    # Fix up metric results type, even before backing up, so the backup is cleaner
    execute <<-SQL
      UPDATE metric_results SET "type" = 'TreeMetricResult' WHERE "type" = 'MetricResult'
    SQL

    # Delete metric results with non-existing module results
    backup_and_delete_missing("metric_results",
      "SELECT 1 FROM module_results AS m WHERE m.id = metric_results.module_result_id")

    # Delete duplicate metric_results. Group them by (module_result_id, metric_configuration_id),
    # then delete all but the one with the highest ID. The double wrapping on the inner query is
    # necessary because window functions cannot be used in WHERE in PostgreSQL.
    repeated_metric_result_query = exec_query <<-SQL
      SELECT t.id FROM (
        SELECT metric_results.*, ROW_NUMBER() OVER (
          PARTITION BY module_result_id, metric_configuration_id, "type"
          ORDER BY id DESC) AS rnum
        FROM metric_results
        WHERE "type" = 'TreeMetricResult'
      ) AS t
      WHERE t.rnum > 1
    SQL

    unless repeated_metric_result_query.empty?
      repeated_metric_result_ids = repeated_metric_result_query.rows.flat_map(&:first).join(',')

      # Replace default messages with custom ones to avoid flooding the screen with the huge query
      say_with_time('backup("metric_results", "SELECT * FROM metric_results WHERE id IN (...)")') do
        suppress_messages do
          backup('metric_results',
            "SELECT * FROM metric_results WHERE id IN (#{repeated_metric_result_ids})",
            header: false)
        end
      end

      say_with_time('execute("DELETE FROM metric_results WHERE id IN (...)")') do
        suppress_messages do
          execute "DELETE FROM metric_results WHERE id IN (#{repeated_metric_result_ids})"
        end
      end
    end
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end
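
Since the deleted rows only live on in the CSV backups, here is a minimal sketch of how one such backup could be loaded back using the pg gem's COPY support. The helper, table name, and path are illustrative only; nothing like this is part of the PR.

# Hypothetical restore helper (not part of this PR). It streams a backup CSV
# produced by the migration back into its table via COPY FROM STDIN.
def restore(connection, table, csv_path)
  raw = connection.raw_connection
  # HEADER matches the default header: true used by backup_and_delete_missing.
  copy_query = "COPY \"#{table}\" FROM STDIN WITH DELIMITER ',' CSV HEADER"

  raw.copy_data(copy_query) do
    File.foreach(csv_path) { |line| raw.put_copy_data(line) }
  end
end

# Example usage (path is illustrative):
# restore(ActiveRecord::Base.connection, 'process_times',
#         Rails.root.join('db', 'backup', '2016-01-01_00-00-00', 'process_times.csv'))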

@@ -0,0 +1,6 @@

class AddIndexesToKalibroModules < ActiveRecord::Migration
  def change
    add_foreign_key :kalibro_modules, :module_results, on_delete: :cascade
    add_index :kalibro_modules, [:long_name, :granularity]
  end
end

@@ -0,0 +1,6 @@

class AddIndexesToModuleResults < ActiveRecord::Migration
  def change
    add_foreign_key :module_results, :module_results, column: 'parent_id'
    add_foreign_key :module_results, :processings, on_delete: :cascade
  end
end

@@ -0,0 +1,11 @@

class AddIndexesToMetricResults < ActiveRecord::Migration
  def change
    add_foreign_key :metric_results, :module_results, on_delete: :cascade
    add_index :metric_results, :type
    add_index :metric_results, :module_result_id
    add_index :metric_results, :metric_configuration_id
    add_index :metric_results, [:module_result_id, :metric_configuration_id],
              unique: true, where: "type = 'TreeMetricResult'",
              name: 'metric_results_module_res_metric_cfg_uniq_idx'
  end
end

Review comment (on the partial unique index): What is this line doing?

Reply: It's the database counterpart to the unique validation in
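
The model-side validation referred to above is not included in this diff. Purely as an illustration (the class and attribute names are assumed from the index definition, not taken from the codebase), it would be something along the lines of:

# Illustrative only: the corresponding model code is not part of this diff.
class TreeMetricResult < MetricResult
  validates :metric_configuration_id, uniqueness: { scope: :module_result_id }
end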

@@ -0,0 +1,6 @@

class AddIndexesToProcessings < ActiveRecord::Migration
  def change
    add_foreign_key :processings, :repositories
    add_foreign_key :processings, :module_results, column: 'root_module_result_id'
  end
end

@@ -0,0 +1,5 @@

class AddIndexesToProcessTimes < ActiveRecord::Migration
  def change
    add_foreign_key :process_times, :processings, on_delete: :cascade
  end
end

@@ -0,0 +1,5 @@

class AddIndexesToRepositories < ActiveRecord::Migration
  def change
    add_foreign_key :repositories, :projects
  end
end

@@ -0,0 +1,9 @@

class AddIndexesOfForeignKeys < ActiveRecord::Migration
  def change
    add_index :module_results, :processing_id
    add_index :kalibro_modules, :module_result_id
    add_index :process_times, :processing_id
    add_index :processings, :repository_id
    add_index :repositories, :project_id
  end
end

Comment: Should this script be a migration? We are not changing the database structure, just removing records.

Comment: We are semantically changing the database structure, from allowing inconsistencies to not allowing them. We can run this manually, but that means anyone using the software will have to do the same, otherwise the next migrations won't work.

Comment: I don't think this script is preventing us from creating inconsistencies. It is merely cleaning things up so other scripts (migrations) change the structure (by adding indexes and foreign keys, for example).

Comment: Yes, the next scripts add the constraints to prevent the inconsistencies, but they would fail if the inconsistencies already exist. If you can suggest a better method to handle it I'll be happy to use it :)
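
To make that failure mode concrete, here is a small check (not part of the PR; it reuses one of the migration's own EXISTS queries) showing why the later foreign-key migrations would abort: PostgreSQL validates existing rows when a constraint is added, so any non-zero count here would make add_foreign_key raise.

# Illustration only: count orphaned rows that would make a later
# `add_foreign_key :process_times, :processings` fail while they still exist.
orphans = ActiveRecord::Base.connection.select_value(<<-SQL).to_i
  SELECT COUNT(*) FROM process_times
  WHERE NOT EXISTS (
    SELECT 1 FROM processings AS p WHERE p.id = process_times.processing_id
  )
SQL
puts "#{orphans} orphaned process_times rows would violate the new foreign key"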

Comment: Maybe a rake task can achieve similar goals, with warning messages for those running the migrations. Skipping them if they should fail would be good too. That way, we don't force anyone to erase data without warning them first.
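
As a sketch of that suggestion (the task name and wording are hypothetical, not code from this PR), such a rake task might look like:

# Hypothetical rake task wrapping the same cleanup, with an explicit warning.
namespace :db do
  desc 'Back up and delete records that violate the intended foreign keys'
  task clean_inconsistencies: :environment do
    puts 'WARNING: this task deletes orphaned records after backing them up under db/backup.'
    puts 'Press Enter to continue or Ctrl-C to abort.'
    STDIN.gets

    # The actual cleanup would run the same backup/DELETE statements
    # used by the CleanInconsistencies migration.
  end
end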

Comment: That's true from KalibroClient's point of view. If we were mining our database for statistics or some kind of study, suddenly losing data would make us desperate! xD Notice that in those cases it may make sense not to erase a processing just because it is not associated with a repository, for example.
I think that because we are changing data, not structure, this should not be a migration. Adding the extra step is surely more bureaucratic, but I think it's good practice to at least warn anyone who uses this software before erasing anything. Maybe a rake task is not the best option, but I don't think a migration is either (even though the Rails guides say migrations can be made to add or modify data).
Anyway, if I'm the only one not happy with this, we can move on. 😄

Comment: It's true from the point of view of the model that the database is an expression of. If we had made a more faithful conversion, we would have had foreign keys from the start, and the data being deleted now would never have existed. The only reason it does exist is that we failed to keep garbage out; the fact that it got in doesn't make it not garbage.
What meaningful statistics can you gather from data that violates the semantic constraints of your model and that should never exist? In the particular cases of the things this migration deletes, you can't even reference the context that would allow you to make any sense of them.
What worth are process times if you can't even know what was processed? What worth is a kalibro module with no associated module result? What worth is a metric result that would never be looked at, and that is possibly complete garbage because it's associated with the wrong module result because we generated the wrong name?
I don't see why that would be the case. Migrations are a mechanism for evolving a database progressively without throwing it away; nothing makes them specific to structure. Also, changing structure often entails changing data to do it properly; this is just another case of that, and it's ugly because the situation is ugly.
The migration is the least bad option, as it at least runs in a controlled environment and is part of the known upgrade steps that any admin must run. How would someone upgrading find out that they need to run the rake task, and how would they run it? Could we make it work in an automated process without requiring manual intervention?

Comment: If you processed a repository and then deleted it, all the relevant data would still be there. Suppose we had 500 repositories processed and then all of them were deleted. If you looked at their process times and saw that, say, the aggregation time was surprisingly high, would you not consider the data just because you can't see the associated repository?
I agree that changing structure many times entails changing data. But I think it's just wrong to delete data without even notifying the users. It's not a good policy, because the data is not yours, even though you may think it's garbage. In my opinion, we have to make users aware of it even if it means the process is not entirely automatic. I would be happier if we could at least make the migration ask for some user input, like "This migration is going to delete this kind of data, ok?".
I've come to the conclusion that the migration is not a bad option. However, deleting data without informing the users, in my opinion, definitely is.
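
A minimal sketch of that kind of prompt, assuming the migration is run from an interactive terminal (the wording is made up, and this is not code from the PR; see the reply below for why interactivity can't be assumed):

# Hypothetical confirmation step inside the migration's up method.
# Assumes an interactive run; automated deploys would hang or fail here.
def up
  say 'This migration deletes orphaned records (they are backed up first). Continue? [y/N]'
  raise 'Aborted by user' unless STDIN.gets.to_s.strip.downcase.start_with?('y')

  # ... the destructive statements from CleanInconsistencies would follow ...
end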

Comment: @diegoamc I understand your concerns, but I am confident that everything this migration deletes is data that could not be accessed at all, or that doesn't even make sense. In your particular example, it's hard to judge whether a processing took long when the repository used is unknown. The fact that a processing does not make sense without its repository is encoded in the model: these changes just enforce, at the database level, what was already supposed to happen.
If there is any data being deleted in a way that is not specified in the model, we should absolutely fix the migration or the model (whichever is wrong).
We have to keep in mind that we had the responsibility to create valid data and failed at it. We have the knowledge to separate what is garbage (according to the model) from what is not, so that the remaining data can be trusted from now into the future. I'm not proposing to delete old data or anything like that, but data that should never have existed at all. No judgements of value can be made about its contents, because they can be absolute nonsense.
As for asking for confirmation, I don't know of a good way to do it, since we have no guarantee that the migration is interactive; it can be run in an automated way.

Comment: I understand your point too, but I can't accept the PR. I wish I had the time right now to sit down for a while, study, and propose another solution.
If nobody else is unhappy about it, please accept it :)