Vectorization: a super-fast alternative to loops in Python

Feb. 05 6min read

Introduction

Loops are a fundamental, long-established concept in programming, found in virtually every language. They are our go-to tool for repetitive tasks and for walking through data structures such as sets, lists, dictionaries, arrays and even generators. When dealing with massive datasets, however, relying on loops can be inefficient and time-consuming.

This is where the power of Vectorization in Python shines 👊.

So what is vectorization, this strange but interesting-sounding word?

What is Vectorization?

Vectorization means applying an operation to an entire array at once (most commonly with NumPy) instead of writing an explicit Python loop. Unlike traditional loops, which handle one element at a time in the interpreter, a vectorized operation hands the whole array to optimized, compiled code in a single call.
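
To make the contrast concrete, here is a minimal sketch on a tiny array. The array contents and the variable names (values, squared_loop, squared_vec) are made up for illustration; both versions should produce the same result.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: visit one element per iteration
squared_loop = np.empty_like(values)
for i in range(len(values)):
    squared_loop[i] = values[i] ** 2

# Vectorized version: one expression applied to the whole array
squared_vec = values ** 2

print(squared_loop)
print(squared_vec)
```

On four elements the speed difference is negligible; the point is only that the vectorized line replaces the whole loop.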

Note: Vectorization is advisable for large or heavy tasks, because it requires importing libraries from the data science ecosystem, including but not limited to NumPy, pandas and scikit-learn.

In this story, we'll explore scenarios where replacing Python loops with Vectorization enhances efficiency, saving valuable time and boosting coding proficiency.

Example 1: Finding the Sum of numbers

Comparing the speed of summing a large range of numbers using traditional loops vs. vectorization

Now let's look at a fundamental example of finding the sum of numbers using loops and Vectorization in Python.

Using Loops

import time 
start = time.time()
 
# iterative sum
total = 0
# iterating through 1.5 Million numbers
for item in range(0, 1500000):
    total = total + item
 
print('sum is:' + str(total))
end = time.time()
print(end - start)
# sum is:1124999250000
# 0.14 seconds

Using Vectorization

import numpy as np
 
start = time.time()
# vectorized sum - using numpy for vectorization
# np.arange creates the sequence of numbers from 0 to 1499999
print(np.sum(np.arange(1500000)))
end = time.time()
print(end - start)
 
##1124999250000
##0.008 Seconds

Vectorization took ~18x less time than the loop over range. The difference becomes even more significant when working with a pandas DataFrame.
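
Single runs timed with time.time() can be noisy. As a hedged alternative, the sketch below repeats each measurement with the standard timeit module and first checks that both approaches return the same sum. The function names loop_sum and vectorized_sum are illustrative, and dtype=np.int64 is an added assumption to avoid overflow on platforms where NumPy's default integer is 32-bit. The timings will differ from machine to machine.

```python
import timeit

import numpy as np

N = 1_500_000


def loop_sum(n):
    """Sum 0..n-1 with a plain Python loop."""
    total = 0
    for item in range(n):
        total += item
    return total


def vectorized_sum(n):
    """Sum 0..n-1 with NumPy; int64 avoids overflow for large n."""
    return int(np.arange(n, dtype=np.int64).sum())


# Both approaches must agree before the timings mean anything
assert loop_sum(N) == vectorized_sum(N)

# Take the best of three single runs for each approach
loop_time = min(timeit.repeat(lambda: loop_sum(N), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized_sum(N), number=1, repeat=3))
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```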

Example 2: Mathematical Operations (on DataFrame)

In data science, developers often use loops on a pandas DataFrame to create new derived columns with mathematical operations.

In the following example, we can see how easily the loops can be replaced with Vectorization for such use cases.

Creating the DataFrame

A DataFrame is tabular data arranged in rows and columns.

We are creating a pandas DataFrame with 5 million rows and 4 columns filled with random integers from 0 to 49. It is comparable to a matrix in mathematics: a rectangular array of numbers arranged in rows and columns.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 50, size=(5000000, 4)), columns=('a','b','c','d'))
df.shape
# (5000000, 4) tells us the shape or dimension of the dataframe in terms of rows and cols
df.head()  # show the first five rows

We will create a new column ‘ratio’ holding the ratio of column ‘d’ to column ‘c’, scaled by 100.

Using Loops

import time 
start = time.time() #start timer

# Iterating or Looping through DataFrame using for-loop iterrows
for idx, row in df.iterrows():
    # creating a new column 
    df.at[idx,'ratio'] = 100 * (row["d"] / row["c"])  
end = time.time() #end timer
print(end - start) # print the time difference, i.e. the time taken
### 109 Seconds -> results

Using Vectorization

start = time.time()
df["ratio"] = 100 * (df["d"] / df["c"])
end = time.time()
print(end - start)
### 0.12 seconds

We can see a significant improvement with the DataFrame: the vectorized operation is roughly 900x faster than the loop (109 seconds vs. 0.12 seconds).
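
One detail worth keeping in mind: column ‘c’ is drawn from a range that includes 0, so some divisions have a zero denominator. In pandas this does not raise an error; it yields inf (or NaN for 0/0). The sketch below, on a small made-up frame called small, shows that behaviour and one possible way to handle it, assuming you would rather treat those rows as missing. The column name ratio_safe is illustrative.

```python
import numpy as np
import pandas as pd

# Tiny hand-written frame with zero denominators in 'c'
small = pd.DataFrame({"c": [0, 5, 10, 0], "d": [4, 10, 5, 0]})

# Plain vectorized division: 4/0 gives inf, 0/0 gives NaN
small["ratio"] = 100 * (small["d"] / small["c"])

# One option: mask zero denominators as NaN before dividing
safe_c = small["c"].where(small["c"] != 0)
small["ratio_safe"] = 100 * (small["d"] / safe_c)

print(small)
```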

Example 3: If-else Statements (on DataFrame)

Many operations require ‘if-else’ type logic. We can easily replace this logic with vectorized operations in Python.

Let’s look at the following example to understand it better (we will be using the DataFrame that we created in Example 2):

Imagine we want to create a new column ‘e’ based on some conditions on the existing column ‘a’.

Using Loops

import time

start = time.time()
 
# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx,'e'] = row.d    
    elif 0 < row.a <= 25:
        df.at[idx,'e'] = (row.b)-(row.c)    
    else:
        df.at[idx,'e'] = row.b + row.c
end = time.time()
print(end - start)
### Time taken: 177 seconds

Using Vectorization

# using vectorization 
 
start = time.time()
df['e'] = df['b'] + df['c']
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
df.loc[df['a'] == 0, 'e'] = df['d']
end = time.time()
print(end - start)
## 0.28007707595825195 sec

The vectorized operation is roughly 600x faster than the Python loop with if-else statements (177 seconds vs. 0.28 seconds).
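
As a possible alternative to chaining df.loc assignments, NumPy's np.select can express the same branches in one call: it takes a list of conditions and a matching list of choices, and the first condition that holds wins. The sketch below uses a small hand-written frame (small) so the result is easy to check by eye. It is one way to write this logic, not the only one.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "a": [0, 10, 25, 40],
    "b": [7, 8, 9, 10],
    "c": [1, 2, 3, 4],
    "d": [5, 6, 7, 8],
})

# Conditions are checked in order; the first True one picks the value
conditions = [
    small["a"] == 0,   # if a == 0
    small["a"] <= 25,  # elif 0 < a <= 25 (a == 0 is already handled)
    small["a"] > 25,   # else
]
choices = [
    small["d"],
    small["b"] - small["c"],
    small["b"] + small["c"],
]
small["e"] = np.select(conditions, choices)

print(small)
```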

Example 4 (Advanced): Solving Machine Learning/Deep Learning Networks

Deep learning requires us to solve multiple complex equations, often for millions or billions of rows. Running loops in Python to solve these equations is very slow, and vectorization is a far better fit.

For example, to calculate the value of y for millions of rows in the following equation of multiple linear regression:

Linear Regression (Image by Author)

we can replace loops with Vectorization.

The values of m1, m2, m3… are determined by solving the above equation using millions of values corresponding to x1, x2, x3… For simplicity, we will look only at a single multiplication step.
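
For reference, the standard form of multiple linear regression, which the equation above is assumed to follow, is shown here. The intercept name c and the feature count n are assumptions.

```latex
y = m_1 x_1 + m_2 x_2 + m_3 x_3 + \dots + m_n x_n + c
```

The code in this example computes only the sum of products m_1 x_1 + ... + m_5 x_5 for five features and leaves out the intercept.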

Creating the Data

import numpy as np
# setting initial values of m
m = np.random.rand(1,5)
# input values for 5 million rows
x = np.random.rand(5000000,5)

Using Loops

import time
import numpy as np
m = np.random.rand(1,5)
x = np.random.rand(5000000,5)
 
# pre-allocate the output array, one result per row
zer = np.zeros(5000000)
tic = time.process_time()
for i in range(0,5000000):
    # weighted sum of the 5 inputs in row i
    total = 0
    for j in range(0,5):
        total = total + x[i][j]*m[0][j] 
 
    zer[i] = total 
toc = time.process_time()
print ("Computation time = " + str((toc - tic)) + " seconds")
####Computation time = 28.228 seconds

Using Vectorization

tic = time.process_time()

# dot product of x (5000000 x 5) with m transposed (5 x 1)
np.dot(x,m.T) 
toc = time.process_time()
print ("Computation time = " + str((toc - tic)) + " seconds")
####Computation time = 0.107 seconds

np.dot performs vectorized matrix multiplication under the hood. By the timings above (28.228 seconds vs. 0.107 seconds), it is roughly 260x faster than the loops in Python.
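
Before trusting the speed-up, it helps to confirm that both approaches compute the same thing. The sketch below repeats the comparison on a much smaller array (1,000 rows instead of 5 million, an arbitrary size chosen so it runs instantly) and checks the results with np.allclose. The seeded generator and the names loop_result and vec_result are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable
m = rng.random((1, 5))
x = rng.random((1000, 5))

# Loop version: one weighted sum per row
loop_result = np.zeros(len(x))
for i in range(len(x)):
    total = 0.0
    for j in range(x.shape[1]):
        total += x[i, j] * m[0, j]
    loop_result[i] = total

# Vectorized version: (1000, 5) dot (5, 1) gives shape (1000, 1)
vec_result = np.dot(x, m.T)

# Flatten the column vector before comparing with the 1-D loop result
matches = np.allclose(loop_result, vec_result.ravel())
print(vec_result.shape, matches)
```

Note that np.dot returns a column of shape (1000, 1), while the loop fills a flat array, hence the ravel() before comparing.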

Conclusion

Vectorization in Python is super fast and should be preferred over loops whenever we are working with very large datasets.

Start applying it, and over time you will become comfortable thinking about your code in terms of vectorization.

Note by Adril Lee
