In my work, I often need to perform certain data-engineering tasks frequently. One such frequent task is to create one-off database queries( aka sql queries) for data migration purposes. The process is often like this.
- Oh, I need to figure out how to write this query as part of my data migration script due to this new feature I need to add.
- Google and find the right StackOverflow answer.
- Copy, paste, and modify till it works.
- Promptly forget how it’s done.
- When I need the query again, repeat step 1-3.
I want to break the cycle of Google-StackOverflow-Copy-Paste-Forget. Looking ahead, I also expect to talk in greater depth about my work involving data engineering work. Hence, I want to fulfil both functions with this page.
For the time being, I will simply append the queries to this page as and when I encounter the need for one-off sql queries during my work.
When the list gets too long, I may reorganise this page. For now, my cycle is Google => StackOverflow => Copy-Paste => Modify => Append-To-This-Page.
Delete Duplicate Rows
This has its own page. I have yet to figure out whether to have separate pages for individual data engineering use cases. I’m experimenting with having some of the queries be in their own pages. And this is one of them.
*New* Select First Matching Row For All Variations
Another query with its own page.
Get the Auto-increment Identity and Set it to an Arbitrary Value
Situation: You have a table using autoincrement for its id
primary key field. You want to know what’s the next value that comes for the id. Or you want to reset it to an arbitrary value. Sometimes, the auto-increment is also known as identity.
Note: if you want to set the current auto-increment to the max id + 1, see the next section.
Postgres
Postgres uses SEQUENCE as a way to track the autoincrement for the primary key field. The general form is <tablename>_<autoincrement_field>_seq
.
To retrieve the next available value for the autoincrement in table1
and the field is id
:
SELECT nextval('table1_id_seq');
To reset the next value for the same field and table to 1337:
SELECT setval('table1_id_seq', 1337, false);
Note: you need to set the last argument to false, so the next value will be set arbitrarily to 1337. If it’s not included, it will take 1338 instead.
MySQL
MySQL uses AUTOINCREMENT
inside INFORMATION_SCHEMA.TABLES
to track the autoincrement.
To retrieve the next available value for the autoincrement in table1
and the database is database_name
:
SELECT `AUTO_INCREMENT`
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'database_name'
AND TABLE_NAME = 'table1';
To reset the next value for the same table to 1337:
ALTER TABLE table1 AUTO_INCREMENT = 1337;
Reset the Auto-increment to Max ID + 1
Situation: The table has auto-increment for the id field which uses int. You want to reset the auto-increment counter so that it carries on from the maximum id field that’s already being used.
MySQL
The issue here is that you need to use stored procedures. So there are basically two steps.
First, create the stored procedure. Second, run the stored procedure.
Step 1: Create Stored Procedure
DELIMITER // CREATE PROCEDURE reset_autoincrement(IN tablename varchar(200)) BEGIN SET @get_next_inc = CONCAT('SELECT @next_inc := max(id) + 1 FROM ',tablename,';'); PREPARE stmt FROM @get_next_inc; EXECUTE stmt; SELECT @next_inc AS result; DEALLOCATE PREPARE stmt; SET @alter_statement = CONCAT('ALTER TABLE ', tablename, ' AUTO_INCREMENT = ', @next_inc, ';'); PREPARE stmt FROM @alter_statement; EXECUTE stmt; DEALLOCATE PREPARE stmt; END // DELIMITER ;



Step 2: Call the stored procedure
call reset_autoincrement('products');
Postgres
The table is products
and the autoincrement field is id
.
SELECT setval('products_id_seq', (SELECT max(id) from products), true);
Note: you need to set the last argument to true, so the next value will be max(id) + 1.
Insert Into on Duplicate Update with Select
Situation: You want to insert a bunch of records into a table using data from other tables. At the same time, when there’s duplication in terms identity values, you want to use update instead of insert.
MySQL
INSERT INTO table1 (val1, val2) SELECT val_a, val_b FROM table2 ON DUPLICATE KEY UPDATE val1=table2.val_a, val2=table2.val_b;
Postgres
Will add this later.
Insert Into on Duplicate Update with Group By Clause
Situation: Similar situation as the case above. You want to insert a bunch of records into a table using data from other tables. At the same time, when there’s duplication in terms identity values, you want to use update instead of insert. The tricky part is that the inserted is an aggregation of data from other tables. This typically requires Group By clause in the select subquery.
MySQL
INSERT INTO table1(val1, val2) SELECT table2sum.val_a, table2sum.val_b FROM (SELECT SUM(val_a) as val_a, val_b from table2 group by table2.val_b) as table2sum ON DUPLICATE KEY UPDATE val1=table2sum.val_a, val2=table2sum.val_b;
Postgres
Will add this later.
Update One Table with Data From Another Related Table
There are at least 3 variations of this type of query. That includes updating a parent table based on aggregated data in the child table. This now has its own page.
Update Two Tables That Are Related
Situation: You want to update the fields for a particular row in both the parent and child table.
MySQL
UPDATE parent_table p INNER JOIN child_table c ON p.id = c.parent_id SET p.parent_name = 'parent', c.child_name = 'child' WHERE p.serial_number = '1234ST';
Postgres
Postgres does not have a similar update-inner-join
statement like MySQL.
You can do this in two ways. Do it in 2 separate queries. Or do it in 1 query. The 2 queries way is easier to understand.
2 queries
UPDATE parent_table AS p SET parent_name = 'parent' WHERE p.serial_number = '1234ST'; UPDATE child_table AS c SET child_name = 'child' FROM parent_table p WHERE c.parent_id = p.id and p.serial_number = '1234ST';
1 query
The technique below is also known as the writeable common table query but honestly, I find it not as intuitive.
The advantage of this over the 2 queries method is that you only need to state the WHERE clause condition (WHERE serial_number = '1234ST'
) once.
WITH parent_update as ( UPDATE parent_table SET parent_name = 'parent' WHERE serial_number = '1234ST' returning id, serial_number ) UPDATE child_table SET child_name = 'child' WHERE (parent_id, serial_number) IN (select id, serial_number from parent_update);
Order by Length of Field
Situation: You want to see the records of a table (table1) ordered by the string length of a field (field1) which is a text or char field. You want to do it by descending order too.
Postgres
SELECT * from table1 ORDER BY CHAR_LENGTH(field1) desc;
MySQL
SELECT * from table1 ORDER BY LENGTH(field1) desc;
TRUNCATE and CASCADE
Situation: You want to truncate parent table but it has child table using foreign key reference on it. You want to cascade the truncate.
Postgres
TRUNCATE TABLE parent CASCADE;
MySQL
TBD
Return Different String Based on Boolean Field
Situation: You have a boolean field but instead of returning that, you want to display different strings based on the value.
Postgres
TBD
MySQL
SELECT (CASE WHEN column <> 0 THEN 'yes' ELSE 'no' END) As Value
Update Field With String Replacement
Situation: You have a string field and you want to use it to update another string field with replacements for certain characters such as empty space and hyphens.
E.g. I want to remove all empty spaces and hyphens before shortening the string to 6 characters long
Postgres
# | is or operator # g means global # postgres uses 1-indexed UPDATE table_a set shortname = SUBSTR(regexp_replace(name, '( |-)', '', 'g'), 1, 6);
MySQL
TBD