How-To: Extract PHP-Nuke Articles from an SQL INSERT Statement in Python

I found an old SQL dump of this r3dux.org site yesterday which I thought was lost to time. The site ran PHP-Nuke back then, and rather than importing the SQL into a database and fiddling with it to export all the old articles, I thought I’d take the Python hammer to it. It’s a bit of a bodge/hack – but it seems to work well enough.

The idea is that you copy JUST the insert statement which deals with “stories” (nuke’s word for articles / posts) from the SQL dump and put it in a file (mine’s called “nuke_stories.sql”), then run the parse_nuke_stories Python script and it’ll dump out all the articles into separate files with filenames in the format “Article_ID_YEAR_MONTH_DAY.html”.
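
If your dump is large and you don’t fancy hunting for the statement by hand, a few lines of Python can pull it out for you. This is just a rough sketch – the “full_dump.sql” filename is made up, and it assumes the statement starts with INSERT INTO `nuke_stories` and that each row sits on its own line, ending at the first line which finishes with “);”:

# Rough helper to copy just the nuke_stories INSERT statement out of a full SQL dump.
# Assumes one row per line and that the statement ends on the first line finishing with ");".
copying = False
with open("full_dump.sql", "r") as dump_file, open("nuke_stories.sql", "w") as out_file:
    for line in dump_file:
        if line.startswith("INSERT INTO `nuke_stories`"):
            copying = True
        if copying:
            out_file.write(line)
            if line.rstrip().endswith(");"):
                break   # Reached the end of the INSERT statement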

Here’s the top few lines of my SQL insert statement:

INSERT INTO `nuke_stories` (`sid`, `catid`, `aid`, `title`, `time`, `hometext`, `bodytext`, `comments`, `counter`, `topic`, `informant`, `notes`, `ihome`, `alanguage`, `acomm`, `haspoll`, `pollID`, `score`, `ratings`, `associated`) VALUES
(1, 0, 'r3duxAdmin', 'W00t! The site it be a-changin''...', '2004-09-10 12:35:06', 'Well, PHP-Nuke seems to be up and running, so now it''s time to mess w/ it in a huge way and configure all the millions of options needed to make it just the way I want.\r\n

\r\nDaunted? Me? Er, yeah. You could say that...', '', 0, 83, 1, 'r3duxAdmin', '', 1, '', 0, 0, 0, 0, 0, ''),

Here’s what it’ll dump out into a file called Article_001_2004_09_10.html if this was all that was there:

Article ID : 1
Author : r3duxAdmin
Title : W00t! The site it be a-changin’…
Date : 2004-09-10 12:35:06

Well, PHP-Nuke seems to be up and running, so now it’s time to mess w/ it in a huge way and configure all the millions of options needed to make it just the way I want.

Daunted? Me? Er, yeah. You could say that…

And here’s the actual Python (version 3) code to parse the SQL file:

# Function to find a single apostrophe that closes the field and return its location
def find_end_apostrophe(location, story):
    while True:
        # Move along
        location += 1
 
        # Grab two chars
        char1 = story[location:location+1]
        char2 = story[location+1:location+2]
 
        # If they're both apostrophes it's an escaped apostrophe, so skip past the
        # pair (the increment at the top of the loop moves us off the second one)
        if char1 == "'" and char2 == "'":
            location += 1
        # An apostrophe and something else? It's the end apostrophe so bail!
        elif char1 == "'" and char2 != "'":
            break
 
    return location
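
# Quick example (the fragment below is made up, not from the dump): the doubled
# apostrophes in "it''s" get skipped and the index of the single closing
# apostrophe after "done" (i.e. 10) is returned:
#print(find_end_apostrophe(0, "it''s done', 0"))   # -> 10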
 
# Function to parse a story into separate fields and place them into the parsed_story list
def parse_story(story):
 
    parsed_story = []
 
    # Fields:
    # 0 = `sid`, 1 = `catid`, 2 = `aid`, 3 = `title`, 4 = `time`, 5 = `hometext`, 6 = `bodytext`, 7 = `comments`,
    # 8 = `counter`, 9 = `topic`, 10 = `informant`, 11 = `notes`, 12 = `ihome`, 13 = `alanguage`, 14 = `acomm`,
    # 15 = `haspoll`, 16 = `pollID`, 17 = `score`, 18 = `ratings`, 19 = `associated`
 
    # 0 - Add story ID
    temp_end = story.find(",")
    field = story[1:temp_end]
    parsed_story.append(field)
    #print("ID:", field)
 
    # 1 - Add category ID
    temp_start = temp_end + 1
    temp_end = story.find(",", temp_start)
    field = story[temp_start:temp_end]
    parsed_story.append(field)
    #print("Category:", field)
 
    # 2 - Add author
    temp_start = temp_end + 3
    temp_end = story.find("'", temp_start)
    field = story[temp_start:temp_end]
    parsed_story.append(field)
    #print("Author:", field)
 
    # 3 - Add title
    temp_start = temp_end + 4
    temp_end   = find_end_apostrophe(temp_start, story)    
    #print("Title locs:", temp_start, temp_end)
    field = story[temp_start:temp_end].replace("''", "'")
    parsed_story.append(field)
    #print("Title:", field)
 
    # 4 - Add time
    temp_start = temp_end + 4
    temp_end = story.find("'", temp_start)
    #print("Time locs:", temp_start, temp_end)
    field = story[temp_start:temp_end]
    parsed_story.append(field) # initial_story_split[4][2:-1])
    #print("Date:", field)
 
    # 5 - Add home text (i.e. the first paragraph)
    temp_start = temp_end + 4
    temp_end   = find_end_apostrophe(temp_start, story)
    #print("Home locs:", temp_start, temp_end)
    field = story[temp_start:temp_end].replace("''", "'").replace("<br><br>\\r\\n", "<br/><br/>").replace("\\r\\n", " ").replace("  ", "")
    parsed_story.append(field)
    #print("Home:", field)        
 
    # 6 - Add body text
    temp_start = temp_end + 3
    temp_end   = find_end_apostrophe(temp_start, story)
    #print("Body locs:", temp_start, temp_end)
    field = story[temp_start:temp_end].replace("''", "'").replace("<br><br>\\r\\n", "<br/><br/>").replace("\\r\\n", " ").replace("ÂÂ", "") # Last one is 2 'A with a hat' chars used for bullet points
 
    # If there is no body blank the field, otherwise strip the first char
    if field == "'":
        field = ""
    else:
        field = field[1:]        
 
    parsed_story.append(field)
    #print("Body:", field)
 
    # Not bothered about the rest of the fields
 
    return parsed_story
 
# ---------------------- Main --------------------------
 
#import codecs
#sql_file = codecs.open("nuke_stories.sql", encoding="utf-8")
 
# Read the SQL dump of stories into a list of lines where each line is a story
sql_file = open("nuke_stories.sql", "r")
story_list = sql_file.readlines()
sql_file.close()
 
num_stories = len(story_list)
print("Number of stories: ", num_stories, "\n")
 
# Newline char
nl = "<br/>"
 
outfile = 0
for story_number in range(1, num_stories - 1): # Start from 1 because line 0 is the insert statement itself
 
    # Parse the story (i.e. article)
    parsed_story = parse_story( story_list[story_number] )
 
    # Grab the first section of the date only. Format: YYYY-MM-DD
    date_str = parsed_story[4][:parsed_story[4].find(" ")].replace("-", "_")
 
    # Construct the filename then create the file for writing
    filename = "Article_" + parsed_story[0].zfill(3) + "_" + date_str + ".html"
    outfile = open(filename, "w")
 
    # Dump the parsed story details to file
    outfile.write("Article ID : " + parsed_story[0] + nl)
    # Skipping Category ID - don't care.
    outfile.write("Author     : " + parsed_story[2] + nl)
    outfile.write("Title      : " + parsed_story[3] + nl)
    outfile.write("Date       : " + parsed_story[4] + nl + nl)
    outfile.write(parsed_story[5] + nl + nl) # Home text (i.e. first paragraph)
    outfile.write(parsed_story[6])           # Body text (i.e. rest of article)
 
    # Close the file to save it
    outfile.close()
 
print("All done! =D")

As mentioned, it’s a complete hack – but if you need to do something similar it may work as a starter to get you going…

Useful SQL Commands

I recently needed to take a database with separate date and time columns, combine them into Unix time format (i.e. seconds since the epoch [1st Jan 1970]), and add a new column to the database holding that data. It took me a fair bit of googling to get it all working, so here are my notes.

Sorry this isn’t really in an article format – but I need this written down and available for the future… Also, I wrote a small Java app to read the separate date and time text files and spit out a timezone-correct unixtime text file, which I then added to a spreadsheet containing just the primary key rows and exported it as tab-separated values suitable for MySQL (or really, MariaDB in this case) import.
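
I won’t post the Java app here, but the conversion itself only takes a few lines in any language. Here’s a rough Python equivalent as a sketch – the file names, the date/time formats and the Australia/Melbourne timezone below are placeholders for illustration, not taken from the actual data:

# Rough Python equivalent of the Java converter mentioned above: read matching lines
# from separate date and time files, combine them, and write out timezone-aware
# Unix timestamps. File names, formats and timezone are placeholders - adjust to suit.
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

tz = ZoneInfo("Australia/Melbourne")

with open("dates.txt") as dates, open("times.txt") as times, open("UNIX_TIME.txt", "w") as out:
    for date_str, time_str in zip(dates, times):
        dt = datetime.strptime(date_str.strip() + " " + time_str.strip(), "%d/%m/%Y %H:%M:%S")
        out.write(str(int(dt.replace(tzinfo=tz).timestamp())) + "\n")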

Show the table structure for the ‘accidents’ table

DESCRIBE accidents;
	+-------------------+-------------+------+-----+---------+-------+
	| FIELD             | TYPE        | NULL | KEY | DEFAULT | Extra |
	+-------------------+-------------+------+-----+---------+-------+
	| OBJECTID          | VARCHAR(50) | YES  |     | NULL    |       |
	| ACCIDENT_NO       | VARCHAR(50) | NO   | PRI |         |       |
	| ABS_CODE          | VARCHAR(50) | YES  |     | NULL    |       |
	..................................................................
	| STAT_DIV_NAME     | VARCHAR(50) | YES  |     | NULL    |       |
	+-------------------+-------------+------+-----+---------+-------+
	63 ROWS IN SET (0.00 sec)

Count unique values in the ACCIDENT_NO column in the accidents table

SELECT COUNT(DISTINCT ACCIDENT_NO) FROM `accidents`;
	+-----------------------------+
	| COUNT(DISTINCT ACCIDENT_NO) |
	+-----------------------------+
	|                       68272 |
	+-----------------------------+

Find duplicate values in a column (cannot make column a primary key if there are dupes!)

SELECT ACCIDENT_NO FROM accidents GROUP BY ACCIDENT_NO HAVING COUNT(*) > 1;
	Empty SET (0.15 sec)

Make a column be the primary key for a table

ALTER TABLE accidents ADD PRIMARY KEY (ACCIDENT_NO);
	Query OK, 68272 ROWS affected (12.24 sec)              
	Records: 68272  Duplicates: 0  Warnings: 0

Drop the UNIX_TIME column

ALTER TABLE accidents DROP COLUMN UNIX_TIME;

Add a new column UNIX_TIME to the accidents table after the last column (STAT_DIV_NAME)

ALTER TABLE `accidents` ADD `UNIX_TIME` BIGINT UNSIGNED NULL DEFAULT 0 AFTER `STAT_DIV_NAME`;
	Query OK, 0 ROWS affected (9.67 sec)
	Records: 0  Duplicates: 0  Warnings: 0

Add an empty column with no default value

ALTER TABLE `temptable` ADD `UNIX_TIME` BIGINT UNSIGNED NULL;

Create a temporary table in memory with a single column called UNIX_TIME which is of type BIGINT

CREATE TEMPORARY TABLE temptable (UNIX_TIME BIGINT) ENGINE = MEMORY;
	Query OK, 0 ROWS affected (0.00 sec)

Import from local CSV file

I created a tab-separated CSV file which had just the ACCIDENT_NO as primary key and the UNIX_TIME field. We can then import that data into the temptable with:

LOAD DATA LOCAL INFILE '/home/r3dux/accident_no_and_unix_time.csv' INTO TABLE temptable FIELDS TERMINATED BY '\t' (ACCIDENT_NO, UNIX_TIME);
	Query OK, 68272 ROWS affected (0.09 sec)             
	Records: 68272  Deleted: 0  Skipped: 0  Warnings: 0

Load all the single-line data values from the UNIX_TIME.txt file into the UNIX_TIME column of our temp table

LOAD DATA LOCAL INFILE '/home/r3dux/UNIX_TIME.txt' INTO TABLE temptable FIELDS TERMINATED BY ',' (UNIX_TIME);
	Query OK, 68272 ROWS affected (0.03 sec)             
	Records: 68272  Deleted: 0  Skipped: 0  Warnings: 0

Merge the temptable with just the ACCIDENT_NO and UNIX_TIME into the accidents table using ACCIDENT_NO as the primary key

MariaDB [govhack2015]> UPDATE accidents JOIN temptable USING (ACCIDENT_NO) SET accidents.UNIX_TIME = temptable.UNIX_TIME;
	Query OK, 13375 ROWS affected (2.08 sec)
	ROWS matched: 68272  Changed: 13375  Warnings: 0

So in this particular instance where I had a partial update, I was missing 13,375 UNIX_TIME values!

Make sure this worked by setting all the UNIX_TIME values in the accidents table to zero, and then re-importing

MariaDB [govhack2015]> UPDATE accidents SET UNIX_TIME = 0;
	Query OK, 68272 ROWS affected (3.16 sec)
	ROWS matched: 68272  Changed: 68272  Warnings: 0
 
MariaDB [govhack2015]> UPDATE accidents JOIN temptable USING (ACCIDENT_NO) SET accidents.UNIX_TIME = temptable.UNIX_TIME;
	Query OK, 68272 ROWS affected (4.01 sec)
	ROWS matched: 68272  Changed: 68272  Warnings: 0

Now finally, we can drop the in-memory temptable!

DROP TABLE temptable;

Export the database

To export the database we can use the mysqldump utility (this is NOT within the MySQL command line!):

mysqldump -u root -p govhack2015 > govhack2015_database_dump.sql

Then to import this dump (let’s say we don’t yet have a `govhack2015` database), start the MySQL command line:

mysql -u root -p

Followed by (in the MySQL command line):

CREATE DATABASE govhack2015;
USE govhack2015;
SOURCE /home/r3dux/govhack2015_database_dump.sql;
SELECT COUNT(DISTINCT ACCIDENT_NO) FROM `accidents`;
	+-----------------------------+
	| COUNT(DISTINCT ACCIDENT_NO) |
	+-----------------------------+
	|                       68272 |
	+-----------------------------+

It all works! =D

How to stop a MySQL operation which is taking forever… (like when I bodge a merge) – run these from a separate MySQL command-line session

SHOW PROCESSLIST;
KILL the-id-number-of-the-hung-or-otherwise-borked-operation;