Streamlining Database Updates in OMOP CDM
A Journey through Bulk Inserts and Conditional Queries in PostgreSQL
Introduction
In this post, we're diving into the data trenches, where the mission is clear: efficiently import and integrate substantial datasets into your PostgreSQL database.
More specifically, in my case I had a medical EHR database structured according to the OMOP common data model, which I needed to update with new vocabularies. While this will serve as the running example throughout the post, I'll keep the code as generic as possible, or at least not tied to that specific data model and datasets.
We will walk through the process of preparing the data, utilizing temporary staging tables for bulk operations, and the intricacies of executing conditional inserts to maintain the integrity and consistency of the database.
The input files
In my specific case, the input files were almost ready to be loaded into PostgreSQL, as they were downloaded directly as TSV (tab-separated values) files from the Athena website.
There are a number of files, each of which corresponds to a specific table. Because this is how Athena provides the files, and also because it keeps things very clear, we'll start from the assumption that the file name is also the name of the table into which its data needs to be inserted.
| Filename | Table |
| --- | --- |
| CONCEPT.csv | concept |
| CONCEPT_ANCESTOR.csv | concept_ancestor |
| CONCEPT_CLASS.csv | concept_class |
| CONCEPT_RELATIONSHIP.csv | concept_relationship |
| CONCEPT_SYNONYM.csv | concept_synonym |
| DOMAIN.csv | domain |
| DRUG_STRENGTH.csv | drug_strength |
| RELATIONSHIP.csv | relationship |
| VOCABULARY.csv | vocabulary |
At first sight, it may seem easy to load these files into the database using the COPY statement, but there are several issues to take into account:
- Some of these files contain rows with characters that PostgreSQL refuses to load (or, at least, that seem to break the parser somehow). As these are in charsets and alphabets I know nothing about, I'll simply get rid of those lines; I'm not going to use them anytime soon anyway.
- Other lines are just badly formatted: they don't contain the correct number of fields. These lines go away as well.
- The files are not comma-separated values, as the file extension implies, but tab-separated values.
- The files may contain rows that already exist in the database, and those should not be loaded a second time.
- In the database, these tables have foreign keys, indexes and integrity constraints. This implies that:
  - some records need to be inserted before others;
  - removing rows that cannot be inserted may violate integrity constraints, which needs to be addressed as well.
Handling invalid characters
Filtering the rows of a file is easily handled by the following functions:
def is_allowed_character(ch):
    # Check if the character's Unicode code point is within the allowed ranges
    return (
        0x0020 <= ord(ch) <= 0x007E       # Basic Latin (letters, digits, punctuation)
        or 0x00A0 <= ord(ch) <= 0x024F    # Latin-1 Supplement, Latin Extended-A/B
        or ch in ('\n', '\r', '\t')
    )

def clean_file(input_file_path, output_file_path, separator, expected_separators):
    count = 0
    with open(input_file_path, 'r', encoding='utf-8') as input_file, \
         open(output_file_path, 'w', encoding='utf-8') as output_file:
        for line in input_file:
            # Keep the line only if every character is allowed and the field count is correct
            if all(is_allowed_character(ch) for ch in line) and line.count(separator) == expected_separators:
                # Replace double quotes so they do not interfere with CSV quoting during COPY
                line = line.replace('"', "\\'")
                output_file.write(line)
            else:
                count += 1
    print(f'Filtered {count} lines from {input_file_path} and wrote the rest to {output_file_path}.')

# Example usage
# clean_file('path/to/input.txt', 'path/to/output.txt', separator='\t', expected_separators=9)
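To process all the Athena files in one go, a small driver can loop over them. Here is a minimal sketch, assuming the expected number of separators can be derived from each file's header line; the clean_ prefix for the output files is just an illustration:

import os

# Clean every TSV file in the current directory (sketch).
# The expected number of tabs is taken from the header line of each file.
for filename in [f for f in os.listdir('.') if f.endswith('.csv')]:
    with open(filename, 'r', encoding='utf-8') as f:
        expected = f.readline().count('\t')
    clean_file(filename, f"clean_{filename}", separator='\t', expected_separators=expected)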
Tab-separated values
Now that the files are clean, I'll load them into temporary tables before merging them into their final destination tables. The load itself is performed with the COPY statement, but specifying a tab as the delimiter requires a small trick compared to ordinary single-character delimiters such as ',':
COPY table_name(field1, field2, ...) FROM 'file_path' DELIMITER e'\t' CSV HEADER;
The e prefix before the delimiter string tells PostgreSQL to interpret escape sequences such as \t.
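Note that COPY ... FROM 'file_path' reads the file on the database server. If the cleaned files live on the client machine instead, psql's \copy or psycopg2's copy_expert can do the same job over the connection. A minimal sketch with psycopg2, assuming the concept_temp staging table described below already exists (connection parameters and the file path are placeholders):

import psycopg2

conn = psycopg2.connect(dbname="your_dbname", user="your_user",
                        password="your_password", host="your_host")
with conn, conn.cursor() as cur, open("CONCEPT.csv", "r", encoding="utf-8") as f:
    # FROM STDIN makes the server read the data from the client connection
    # instead of a file on the server's filesystem
    cur.copy_expert("COPY concept_temp FROM STDIN DELIMITER e'\\t' CSV HEADER;", f)
conn.close()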
This brings up two new issues:
- The table needs to exist before loading, and we do not want to insert directly into the final table (because of the duplicate or missing rows, the required insertion order, integrity constraints, etc.). This is easily accomplished by copying the table structure with the following SQL statement:

  CREATE TABLE xxx_temp AS TABLE xxx WITH NO DATA;

  where xxx is the original table. This has to be done for each table. Of course, as always, since I want this to be easily repeatable, it has to be scripted. You could do it in many different ways; I'll go the Python route.
- Each table has a series of fields that I do not really want to look up and copy by hand each time. In this case I only have 9 files, but some of the tables have about 20 columns, which is already quite a lot of field names to copy, and that does not take all tables into account, only those for which I have files now. This really needs to be automated.
Create empty tables with the same structure.
To copy the table structures, I'll loop over the filenames and generate the SQL statements as I go.
import os

def generate_create_statements(filenames):
    """
    Generates SQL statements that create temporary tables
    based on the existing table structures.

    :param filenames: List of filenames, where each filename (without extension)
                      is assumed to be the name of the original table.
    """
    for filename in filenames:
        # Extract the table name from the filename: lowercase, extension removed
        table_name = filename.lower().split('.')[0]
        temp_table_name = f"{table_name}_temp"
        # Generate the SQL statement
        sql_statement = (
            f"DROP TABLE IF EXISTS {temp_table_name}; "
            f"CREATE TABLE {temp_table_name} AS TABLE {table_name} WITH NO DATA;"
        )
        print(sql_statement)

filenames = [f for f in os.listdir('.') if f.endswith(".csv")]
generate_create_statements(filenames)
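For CONCEPT.csv, for instance, this prints something along the lines of:

DROP TABLE IF EXISTS concept_temp; CREATE TABLE concept_temp AS TABLE concept WITH NO DATA;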
Generate the COPY statements.
To create the COPY statements with the field names, we'll use the information schema to extract the relevant column information.
import os
import psycopg2

def generate_copy_statements(dbname, user, password, host, port, filenames):
    """
    Connects to a PostgreSQL database, retrieves the column names for the tables
    matching the given files, and generates COPY statements for loading the data.

    :param dbname: Name of the database.
    :param user: Username for authentication.
    :param password: Password for authentication.
    :param host: Database host address.
    :param port: Database port.
    :param filenames: List of files to generate COPY statements for.
    """
    conn = None
    try:
        # Connect to the database
        conn = psycopg2.connect(
            dbname=dbname,
            user=user,
            password=password,
            host=host,
            port=port
        )
        cur = conn.cursor()
        for filename in filenames:
            table_name = filename.lower().split('.')[0]
            # Fetch the column names for the table
            cur.execute("""
                SELECT column_name FROM information_schema.columns
                WHERE table_name = %s AND table_schema = 'public'
                ORDER BY ordinal_position;
            """, (table_name,))
            columns = [row[0] for row in cur.fetchall()]
            # Generate the COPY command (note the escaped \\t so that the
            # generated SQL contains the literal sequence \t)
            column_list = ', '.join(columns)
            csv_file_path = filename  # Adjust path as necessary
            copy_cmd = (
                f"COPY {table_name}_temp({column_list}) "
                f"FROM '{csv_file_path}' DELIMITER e'\\t' CSV HEADER;"
            )
            print(copy_cmd)
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if conn:
            cur.close()
            conn.close()

# Example usage
filenames = [f for f in os.listdir('.') if f.endswith(".csv")]
generate_copy_statements("your_dbname", "your_user", "your_password", "your_host", "your_port", filenames)
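For CONCEPT.csv, assuming the standard OMOP v5 column set for the concept table, the generated statement would look like:

COPY concept_temp(concept_id, concept_name, domain_id, vocabulary_id, concept_class_id, standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason) FROM 'CONCEPT.csv' DELIMITER e'\t' CSV HEADER;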
At this point, the temporary tables can be loaded in any order since they do not have all the integrity constraints and foreign keys of the destination tables.
Generate the INSERT statements.
Similarly, I do not want to check by hand, for every table, which columns identify a duplicate, so I wrote a script that does it for me:
import os
import psycopg2

def get_identifier_columns(cursor, table_name):
    """
    Retrieves the identifier columns for a given table: the primary key if there is one,
    otherwise a unique constraint or unique index, otherwise all columns.
    Handles composite keys.
    """
    # Attempt to find a primary key or unique constraint
    cursor.execute("""
        SELECT tc.constraint_type, kcu.column_name
        FROM information_schema.table_constraints AS tc
        JOIN information_schema.key_column_usage AS kcu
          ON tc.constraint_name = kcu.constraint_name
        WHERE tc.table_schema = 'public' AND tc.table_name = %s
          AND tc.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
        ORDER BY tc.constraint_type, kcu.ordinal_position;
    """, (table_name,))
    result = cursor.fetchall()
    if result:
        # Prefer the primary key columns if the table has one
        pk_columns = [col for ctype, col in result if ctype == 'PRIMARY KEY']
        return pk_columns if pk_columns else [col for _, col in result]
    # Attempt to find a unique index
    cursor.execute("""
        SELECT ic.relname AS index_name,
               array_agg(a.attname ORDER BY a.attnum) AS columns
        FROM pg_class c
        JOIN pg_index ix ON c.oid = ix.indrelid
        JOIN pg_class ic ON ix.indexrelid = ic.oid
        JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = ANY(ix.indkey)
        WHERE c.relname = %s
          AND c.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'public')  -- adjust the schema name as needed
          AND ix.indisunique = TRUE
        GROUP BY ic.relname
        ORDER BY ic.relname;
    """, (table_name,))
    result = cursor.fetchall()
    if result:
        return result[0][1]
    # Fallback: use all columns if no primary key, unique constraint or unique index is found
    cursor.execute("""
        SELECT column_name
        FROM information_schema.columns
        WHERE table_schema = 'public' AND table_name = %s
        ORDER BY ordinal_position;
    """, (table_name,))
    return [row[0] for row in cursor.fetchall()]

def construct_where_clause(table, identifier_columns):
    """
    Constructs a WHERE NOT EXISTS clause comparing the identifier columns
    of the temporary table with those of the main table.
    """
    conditions = " AND ".join(f"main.{col} = temp.{col}" for col in identifier_columns)
    return f"WHERE NOT EXISTS (SELECT 1 FROM {table} AS main WHERE {conditions})"

def generate_insert_statements(db_params, tables):
    """
    Generates the INSERT statements that copy records from each "_temp" table
    to its corresponding main table, skipping rows that already exist
    (based on the primary key, a unique index, or all columns).
    """
    inserts = []
    conn = None
    try:
        conn = psycopg2.connect(**db_params)
        with conn, conn.cursor() as cur:
            for table in tables:
                temp_table = f"{table}_temp"
                identifier_columns = get_identifier_columns(cur, table)
                where_clause = construct_where_clause(table, identifier_columns)
                insert_sql = f"""
                    INSERT INTO {table}
                    SELECT temp.* FROM {temp_table} AS temp
                    {where_clause};
                """
                inserts.append(insert_sql)  # Or execute it directly: cur.execute(insert_sql)
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if conn:
            conn.close()
    return inserts

# Example usage
db_params = {
    'dbname': 'your_database',
    'user': 'your_user',
    'password': 'your_password',
    'host': 'your_host',
    'port': '5432'
}
tables = [f.lower().split('.')[0] for f in os.listdir('.') if f.endswith(".csv")]
insert_statements = generate_insert_statements(db_params, tables)
print(insert_statements)
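For the concept table, whose primary key is concept_id in the standard OMOP DDL, the generated statement boils down to:

INSERT INTO concept
SELECT temp.* FROM concept_temp AS temp
WHERE NOT EXISTS (SELECT 1 FROM concept AS main WHERE main.concept_id = temp.concept_id);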
Execute the INSERT statements.
Then we need to execute these INSERT statements. Remember, there are integrity constraints, foreign keys, and so on. Instead of computing the ideal insertion order (which might not even exist if there are cycles in the dependency graph), we disable all checks, execute all the inserts, then re-enable the checks. Note that disabling triggers and switching session_replication_role typically requires table-owner or superuser privileges, and of course this approach is not an option if other users are writing to the database at the same time.
import os
import psycopg2

def load_data_with_disabled_checks(db_params, tables, statements):
    """
    Executes the given statements with triggers and foreign key checks disabled.

    :param db_params: Database connection parameters as a dictionary.
    :param tables: List of tables whose triggers should be disabled.
    :param statements: List of SQL statements to execute.
    """
    conn = None
    try:
        conn = psycopg2.connect(**db_params)
        cur = conn.cursor()
        # Disable foreign key checks
        cur.execute("SET session_replication_role = 'replica';")
        # Disable triggers
        for table in tables:
            cur.execute(f"ALTER TABLE {table} DISABLE TRIGGER ALL;")
        # Execute the statements
        for statement in statements:
            cur.execute(statement)
        # Re-enable triggers
        for table in tables:
            cur.execute(f"ALTER TABLE {table} ENABLE TRIGGER ALL;")
        # Re-enable foreign key checks
        cur.execute("SET session_replication_role = 'origin';")
        conn.commit()
    except Exception as e:
        print(f"An error occurred: {e}")
        if conn:
            conn.rollback()
    finally:
        if conn:
            cur.close()
            conn.close()

# Example usage
db_params = {
    'dbname': 'your_database',
    'user': 'your_user',
    'password': 'your_password',
    'host': 'your_host',
    'port': 'your_port'
}
tables = [f.lower().split('.')[0] for f in os.listdir('.') if f.endswith(".csv")]
insert_statements = generate_insert_statements(db_params, tables)
load_data_with_disabled_checks(db_params, tables, insert_statements)
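Because re-enabling the triggers does not re-validate rows that were inserted while the checks were off, it can be worth running a quick sanity check afterwards. A minimal sketch, assuming the standard OMOP foreign key from concept_relationship.concept_id_1 to concept.concept_id:

import psycopg2

conn = psycopg2.connect(**db_params)
with conn, conn.cursor() as cur:
    # Count relationship rows that point to a concept that does not exist
    cur.execute("""
        SELECT count(*)
        FROM concept_relationship cr
        LEFT JOIN concept c ON c.concept_id = cr.concept_id_1
        WHERE c.concept_id IS NULL;
    """)
    print(f"Orphaned concept_relationship rows: {cur.fetchone()[0]}")
conn.close()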
Delete the temporary tables.
Finally, the temporary tables can be dropped:
import os
import psycopg2

def drop_temp_tables(db_params, csv_files_directory):
    """
    Drops the temporary tables corresponding to the CSV files present in the given directory.

    :param db_params: Dictionary with database connection parameters.
    :param csv_files_directory: Directory where the CSV files reside.
    """
    # List CSV filenames in the given directory
    csv_files = [f for f in os.listdir(csv_files_directory) if f.endswith('.csv')]
    # Extract table names from filenames and append '_temp' to form the temporary table names
    temp_table_names = [os.path.splitext(f)[0].lower() + '_temp' for f in csv_files]
    conn = None
    try:
        conn = psycopg2.connect(**db_params)
        with conn, conn.cursor() as cur:
            for temp_table_name in temp_table_names:
                cur.execute(f"DROP TABLE IF EXISTS {temp_table_name};")
                print(f"Dropped temporary table: {temp_table_name}")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if conn:
            conn.close()

# Example usage
db_params = {
    'dbname': 'your_database',
    'user': 'your_user',
    'password': 'your_password',
    'host': 'your_host',
    'port': '5432'
}
csv_files_directory = '.'  # Current directory
drop_temp_tables(db_params, csv_files_directory)
Conclusion and perspectives.
Diving into the deep end of database updates within the OMOP CDM framework can feel a bit like trying to organize a dinner party in a revolving door – challenging, but not without its charms.
We've sliced through the data, diced through some of the intricacies of updating the database, and are simmering on solutions. While the pièce de résistance – a comprehensive tool to make all this as easy as pie – is still baking, fear not. The saga continues, and you're invited to the tasting menu of upcoming blog posts, where we'll serve up every course of its development. Keep your napkins ready and your forks poised; the adventure in data cuisine is just getting started.