In this month’s project we will implement parallel programming in R:

We achieve this by using the following packages, so please install them on your RStudio IDE.

  • foreach,
  • doParallel
  • parallel. 

 

 Concept of parallel programming:

For beginners, the concept of parallel programming is simple, instead of calculating outputs (from inputs) one at a time, we divide our computation and allocate it to multiple worker connections that work concurrently. ( to retrieve data, process and calculate outputs, etc ) At the end we collate the results into a single dataset (list or dataframe).
This concept assumes that the computations and inputs are exclusive.

Example:
As a simple example, let us assume you are calculating prime factors for an input dataset with a million random, unsorted numbers. We can have a function to calculate factors that takes n seconds to process each number.
In a sequential (non-parallel ) scenario, it would take 1,000,000 * n seconds to process the entire dataset, assuming n seconds for each input.
However, if we had 7 parallel connections, then we could divide the input set into 7 chunks and process 7 numbers ( or datasets) at a time. This reduces the time by a factor of 7.

 

Benefits:

Parallel programming is not really useful if your computation is quick or if you are exploring stuff. But it is a very powerful tool to speed things up when it comes to simulations, machine learning and even time-consuming calculations or data retrievals.

For example, I was recently working  on a project where I was looking at billions of transaction data for sub-optimal price charges (where the actual transaction price was x% or higher than first bid). The logic itself is simple :

 

However, the sheer amount of data was taking the program hours (yes hours) to complete for a single month, even after optimizing the SQL queries. And I had to look at 6 months worth of data! Like looking for a needle in a haystack. So I implemented parallel worker connections and bingo! time reduced by 80%.

Similarly, I have used parallelization to simulate multiple modeling scenarios (what happens when input price changes to m and competition prices changes to y) in parallel and reduce computation time.

 

Code Structure:

The skeleton code for implementation is as follows:

In the above code, you can also embed the SqlQuery() in a function, along with other calculations, and call the function inside the dopar() function call.

Note, the odbc() function is a fake username-password combination to show the format. We need to place it inside the foreach() loop so each worker gets its own database connection to retrieve data in parallel. Assigning it outside as a global variable will NOT work.

We use a custom function to collate the results as follows:

 

Implementation:

The attached code file shows 3 cases for implementation, using the above code structure:

  1. calling a simple SQL query inside the foreach()
  2. calling a simple math-calculation function, without sql queries.
  3. more complex function to perform both sql data retrieval and math functions to compute results.

You can  download the code files from the Projects Page link here, under Apr 2017.

Points To Note:

  • Make sure you run the query on a single value to ensure that the sql query itself is correct. Unfortunately, R gives a very generic “failed to execute” type error irrespective of whether you missed a comma in the select statement or if the variable name is incorrect. This makes debugging very annoying.
  • If you have sql queries, please do optimize it so it only picks up the minimum number of records/columns needed.
  • Parallel programming does NOT mean instant!! So, processing a million rows may still take an hour or more, if you have loads of calculations. Though you will be tons faster than sequential processing.
  • A cluster with 1000 or more parallel worker connections is NOT effective. Unless you have a supercomputer of course. If using a laptop, don’t try more than 10.
  • Don’t put a query that writes to the table, you will lock the table and the process will NEVER ever complete.
    For a sanity check, try inserting a print statement to keep tabs that the function is actually working.

 

Until next time, happy coding.