Nearly every script I have written for this project has been in Python. And for good reason, too: there are lots of pre-made modules that make certain tasks simpler, Python code is easy and fast to write, and there are tons of example programs online that I can base things on. The one area where Python is known to struggle is execution speed.
The first script I wrote to iterate through every single post in a Reddit JSON scrape took 25 minutes to run. I imported a library to help read JSON data and then parsed each line as JSON to look specifically for posts from the Pirate101 subreddit. I knew this was not the most efficient way to do things, but it was the most straightforward. I could have dealt with the long run time, but I then realized that it would have taken me a few weeks to compile a list of all P101 posts, assuming I was running non-stop.
In an effort to cut down the amount of processing time I opted to rewrite the script in C. The C programming language has been continuously used for over 50 years and its efficiency is one of the main reasons. I started learning C in a class I took a few months ago and I have been looking for an application to practice my skills. I’ll try to explain my code by section.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
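Include statements are C’s version of import. The files that are mentioned are .h header files, which play roughly the same role as Python’s modules: each one declares the functions that a part of the standard library provides (stdio.h for input and output, stdlib.h for general utilities such as memory management, and string.h for string handling).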
int main(int argc, char* argv[]){
    char* in_fname;
    char* out_fname;
    FILE *in_file;
    FILE *out_file;
Whereas Python runs a script from top to bottom, a C program always starts executing in the main function, so code has to be in main (or in a function called from main) in order to run. The int before main is the type of the value that will be returned when main completes. Explicit type declarations like this one are (unlike in Python) needed for every variable that is created, which you can see in the next few lines. The variables argc and argv are the number of arguments and the list of arguments, respectively. The four following lines are all variable declarations, each ending with a semicolon to mark the end of the statement. Because C does not have a built-in string type, char* fills this place as a pointer to the first of a series of characters that comprise a string. FILE* is a pointer to a structure that represents an open file. Here is a much better explanation of pointers than I could ever give.
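To make the idea of char* as "a pointer to the first character" concrete, here is a minimal sketch. The helper names count_chars and print_args are hypothetical and are not part of the script above; the standard library's strlen() already does what count_chars does.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: counts characters by walking a char* forward until
 * it reaches the terminating '\0' that marks the end of every C string. */
size_t count_chars(const char *s) {
    size_t n = 0;
    while (*s != '\0') {  /* *s reads the character the pointer refers to */
        s++;              /* move the pointer one character forward */
        n++;
    }
    return n;
}

/* Hypothetical helper: prints every command-line argument and its length.
 * argv is an array of char*, one per argument, and argc is its length. */
void print_args(int argc, char *argv[]) {
    for (int i = 0; i < argc; i++) {
        printf("argv[%d] = %s (%zu chars)\n", i, argv[i], count_chars(argv[i]));
    }
}
```

Calling print_args(argc, argv) as the first line of main would list the program name followed by whatever was typed after it.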
    if(argc != 3){
        printf("Usage: <in_file> <out_file> \n");
        return 0;
    }
    in_fname = argv[1];
    out_fname = argv[2];
The first thing to do is to check that three command-line arguments have been entered. The first is the program name, which is passed automatically, but then the program expects two more: an input file and an output file. If more or fewer than these two are entered, the program prints what it expects and exits. Assuming that two values were entered, they get stored in in_fname and out_fname.
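The same check can be pulled into a small function, which makes it easier to reuse. This is only a sketch under my own assumptions: parse_args is a hypothetical name, and printing the usage message to stderr and signalling failure with -1 are choices that differ from the script, which prints to stdout and returns 0.

```c
#include <stdio.h>

/* Hypothetical helper: when exactly two arguments follow the program name,
 * stores them through the two output pointers and returns 0. Otherwise it
 * prints a usage message to stderr and returns -1. */
int parse_args(int argc, char *argv[],
               const char **in_fname, const char **out_fname) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <in_file> <out_file>\n",
                argc > 0 ? argv[0] : "extract");
        return -1;
    }
    *in_fname = argv[1];   /* argv[0] is the program name itself */
    *out_fname = argv[2];
    return 0;
}
```

A caller could then write `if (parse_args(argc, argv, &in_fname, &out_fname) != 0) return 1;` so that the shell sees a non-zero exit status when the arguments are wrong.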
    in_file = fopen(in_fname, "r");
    out_file = fopen(out_fname, "a+");
    if (!in_file){
        perror(in_fname);
        return 0;
    }
    if(!out_file){
        perror(out_fname);
        return 0;
    }
The program operates under the assumption that the two filenames the user provides can be opened. It attempts to open them, the input file read-only ("r") and the output file in append mode ("a+") so that valid posts get appended to it. The a already means that the file will be created if it does not exist; the + additionally allows reading from it. In the event that either file cannot be opened (for example, because the input file does not exist), perror prints the reason and the program exits.
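The difference between the modes can be seen in a short sketch. The helper names append_line and count_newlines are hypothetical, and plain "a" is used here instead of "a+" because nothing reads from the output file.

```c
#include <stdio.h>

/* Hypothetical helper: appends one line of text to the file at path.
 * Append mode creates the file when it does not exist yet.
 * Returns 0 on success and -1 on failure. */
int append_line(const char *path, const char *text) {
    FILE *fp = fopen(path, "a");
    if (!fp) {
        perror(path);              /* prints the reason the open failed */
        return -1;
    }
    int ok = fprintf(fp, "%s\n", text) >= 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Hypothetical helper: counts the newline characters in a file.
 * Read mode fails when the file does not exist, in which case -1 is returned. */
long count_newlines(const char *path) {
    FILE *fp = fopen(path, "r");
    if (!fp) {
        return -1;
    }
    long n = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            n++;
        }
    }
    fclose(fp);
    return n;
}
```

Calling append_line twice on a path that does not exist yet creates the file on the first call and adds a second line on the next, which is the behaviour the extraction script relies on when it is run over several input files in a row.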
    char *line = NULL;
    size_t len = 0;
    ssize_t read;
    char *sub_str;
    while((read = getline(&line, &len, in_file)) != -1){
The variables listed above pertain to reading each line in the file. The first, line, is a character pointer that is initialized to NULL; getline() allocates a buffer for it and stores each line of text there as the loop runs. The len variable is an unsigned integer holding the size of that buffer, which getline() grows whenever a line needs more room; it isn’t important for this program as long as it is there. Then read contains the number of characters that getline() read and is -1 when there are no more lines, which will then end the loop.
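The getline() pattern on its own looks like the sketch below. count_lines is a hypothetical helper, not something from the script, and the feature-test macro on the first line is my addition: getline() comes from POSIX rather than standard C, so some compiler settings need it.

```c
#define _POSIX_C_SOURCE 200809L  /* getline() is POSIX, not standard C */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: reads fp line by line, returns the number of lines,
 * and stores the length of the longest line (newline included) in *longest. */
size_t count_lines(FILE *fp, size_t *longest) {
    char *line = NULL;   /* getline() allocates and grows this buffer */
    size_t cap = 0;      /* current buffer capacity, managed by getline() */
    ssize_t nread;       /* characters read, or -1 at end of file */
    size_t count = 0;

    *longest = 0;
    while ((nread = getline(&line, &cap, fp)) != -1) {
        count++;
        if ((size_t)nread > *longest) {
            *longest = (size_t)nread;
        }
    }
    free(line);          /* the same buffer is reused, so one free() is enough */
    return count;
}
```

The free() after the loop is the one piece the script leaves out; for a program that exits right away it makes little practical difference, but it is the usual way to pair with getline().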
        sub_str = strstr(line, "\"subreddit\"");
        if(sub_str != NULL){
            sub_str = sub_str + 13;
            sub_str[strlen(sub_str)-1] = '\0';
This is where the magic happens. The strstr() function takes two strings and returns a pointer to the first place where the search term “needle” (“subreddit”, including its quotation marks) appears in the wider “haystack” (line), or NULL if it does not appear at all. The if statement is to catch cases where there is no “subreddit” key in the entire data line, something that only occurred in pre-2018 data. Adding 13 to the value of sub_str skips past the key, the colon, and the opening quotation mark of the value (13 characters in all), so that the string now starts with the subreddit name. This line only works because of the aforementioned pointers. You see, sub_str is actually the memory address of the substring, not the string itself. By adding 13, the memory address gets shifted over such that it starts 13 characters later. The last line simply sets the last character of the string to the string terminator \0 to remove the trailing newline; because sub_str points into line, this trims line itself as well.
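The pointer arithmetic can be tried in isolation with a sketch like the following. subreddit_name is a hypothetical helper, and it assumes the compact form "subreddit":"Name" with no spaces around the colon. Unlike the script, it copies the name into a separate buffer instead of modifying the line.

```c
#include <string.h>

/* Hypothetical helper: copies the value of the "subreddit" field into out.
 * Returns 0 on success, or -1 if the field is missing or does not fit.
 * Assumes the compact form "subreddit":"Name" with no spaces. */
int subreddit_name(const char *line, char *out, size_t out_size) {
    static const char key[] = "\"subreddit\":\"";
    const char *start = strstr(line, key);   /* NULL when the key is absent */
    if (start == NULL) {
        return -1;
    }
    start += sizeof key - 1;                 /* skip the 13 characters of the key */
    const char *end = strchr(start, '"');    /* closing quote of the value */
    if (end == NULL) {
        return -1;
    }
    size_t n = (size_t)(end - start);        /* pointer difference = name length */
    if (n + 1 > out_size) {
        return -1;
    }
    memcpy(out, start, n);
    out[n] = '\0';
    return 0;
}
```

Writing `sizeof key - 1` instead of a literal 13 keeps the offset tied to the search string, so changing the key cannot leave a stale number behind.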
            if(strncmp(sub_str, "Pirate101", 9) == 0){
                fprintf(out_file,"%s,\n",line);
            }
This if statement compares the first nine characters of sub_str with “Pirate101”. If they match, fprintf writes a line to out_file. The text to write is the entire line, which fills in for the %s, followed by a comma and a newline character. This is so I can read in the entire output file as JSON, something that I could not have done with the original scraped content because it lacked commas between objects.
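One thing to keep in mind with a prefix comparison is that it also accepts any longer name that merely starts with the same nine characters. The sketch below shows a stricter check; value_is is a hypothetical helper, and "Pirate101Memes" in the usage note is a made-up subreddit name used purely for illustration.

```c
#include <string.h>

/* Hypothetical helper: returns non-zero when value starts with name and the
 * very next character is the closing quotation mark of the JSON string. */
int value_is(const char *value, const char *name) {
    size_t n = strlen(name);
    /* If all n characters match, value[n] is still inside the string
     * (at worst it is the terminating '\0'), so reading it is safe. */
    return strncmp(value, name, n) == 0 && value[n] == '"';
}
```

With this check, a value beginning `Pirate101",` matches, while one beginning `Pirate101Memes",` does not.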
        } else {
            sub_str = strstr(line, "Pirate101");
            if(sub_str != NULL){
                fprintf(out_file,"%s,\n",line);
            } else {
                sub_str = strstr(line, "t5_2tzb");
                if(sub_str != NULL){
                    fprintf(out_file,"%s,\n",line);
                }
            }
        }
In the event that the current line does not contain “subreddit”, I check it for “Pirate101”. If it is there, the line is written to the same out_file. Otherwise, I check for the subreddit id “t5_2tzb” and do the same.
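Because both fallback branches write the line in exactly the same way, the two searches can be folded into a single condition. This is a sketch of that idea rather than the script's own code, and fallback_match is a hypothetical name.

```c
#include <string.h>

/* Hypothetical helper: the fallback test for lines without a "subreddit" key.
 * Returns non-zero when the line mentions the subreddit name or its id. */
int fallback_match(const char *line) {
    return strstr(line, "Pirate101") != NULL
        || strstr(line, "t5_2tzb") != NULL;
}
```

The || operator stops as soon as the first search succeeds, so the id is only looked for when the name is absent, the same order the nested if statements use.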
    }
    printf("done\n");
    return 0;
}
Finally, after getting through the entire decompressed file, the program prints “done” and returns 0 (exits).
Coding in C can be frustrating at times. It is much lower-level than Python, meaning that there is more you as the programmer must worry about than with higher-level languages that hide or take care of it for you. For one, many runtime errors show up as nothing more than “Segmentation fault”, with no further information printed to give you an idea of what caused it or how to fix it. But before you can even run a C program you must compile it. This involves using a compiler, such as gcc. Every time you make a change the program must be recompiled, which becomes annoying fast, and if you forget to do this you may think that your changes had no effect.
gcc extract.c -o extract
The real reason I started this script in C was to try to take advantage of multi-threading, a way to do many tasks simultaneously for a potentially astronomical speed boost. I discovered after several days that this program could not be made multi-threaded because of the way the decompressed file is read. There is no way to break the file into approximately equal-sized chunks without cutting an entry in half. Instead, I could have run through up to 16 files at once (the number of cores in my computer) and had them finish all at approximately the same time, but I don’t have enough storage space to fit that many decompressed files at once.