Data Hack Tuesday

Python: Three Hidden Ways to Create Potential Bugs with Mutable Data Structures (Part III)

May 4, 2021

We were told tuples are immutable data structures when we first learned Python, but this is not absolute. If you […]

View More

We were told tuples are immutable data structures when we first learned Python, but this is not absolute. If you pass in a mutable data structure, i.e. a list, into a tuple, you will be able to mutate that mutable data structure within the tuple. Please see the following example:

#a tuple with a list in it
tuple_list = ('apple', 'orange', ['grape', 'watermelon'])
#checking the id of the list item within the tuple
id(tuple_list[2])
4350747776

#append an item into the list
tuple_list[2].append('cucumber')
('apple', 'orange', ['grape', 'watermelon', 'cucumber'])
#the id of the list item remains the same after mutation
id(tuple_list[2])
4350747776

#directly assigning the entire list item within a tuple wouldn't work
a[2] = ['banana']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

The idea is that you should be very careful or perhaps avoid passing in a mutable data structure into a tuple.
Hope you enjoy this week’s Data Hack Tuesday! We will see you next week.

Python: Three Hidden Ways to Create Potential Bugs with Mutable Data Structures (Part II)

April 27, 2021

Today we are going to continue our three-part series of Three Hidden Ways to Create Potential Bugs with Mutable Data […]

View More

Today we are going to continue our three-part series of Three Hidden Ways to Create Potential Bugs with Mutable Data Structures with a focus on passing in mutable data structures as a function’s default parameters .

But wait a second. What data structures/objects are mutable and what data structures/objects are immutable in Python? Here is a quick overview:
List: mutable
Dictionary: mutable
Set: mutable
Tuple: immutable
String: immutable
Interger: immutable
Boolean: immutable
Float: immutable

The idea is that we want to avoid passing in mutable data structures as a function’s default parameters as that could create potential problems that are hard to detect. I am borrowing an example from Fluent Python by Luciano Ramalho:

#Passing in an empty list as default parameter
class Fruits:
    def __init__(self, list_of_fruits = []):
        self.list_of_fruits = list_of_fruits
    def add_fruits(self, fruit):
        self.list_of_fruits.append(fruit)

a = Fruits(['apple', 'orange'])
a.list_of_fruits
['apple', 'orange']

#This is mutating on the default list 
b = Fruits()
b.add_fruits('watermelon')
b.list_of_fruits
['watermelon']

# The default list has been mutated, every instance created without passing in 
#a parameter value will have default list parameter as ['watermelon'] rather than an
#empty list
c = Fruits()
c.list_of_fruits
['watermelon']

#This is the appropriate way of passing in a mutable data structure as functional parameter
class Fruits:
    def __init__(self, list_of_fruits = None):
        if list_of_fruits is None:
           self.list_of_fruits = [] # if it's None, then create an empty list
        else:
           self.list_of_fruits = list(list_of_fruits) #creating a copy of passed in list
    def add_fruits(self, fruit):
        self.list_of_fruits.append(fruit)
                  
a = Fruits(['apple', 'orange'])
a.list_of_fruits
['apple', 'orange']

#This doesn't affect the default value, as it's None
b = Fruits()
b.add_fruits('watermelon')
b.list_of_fruits
['watermelon']


#Now future initiated instances don't get affected
c = Fruits()
c.list_of_fruits
[]

Hope you enjoy this week’s Data Hack Tuesday. Will see you next week!

Python: Three Hidden Ways to Create Potential Bugs with Mutable Data Structures (Part I)

April 20, 2021

Operating on mutable data structures can potentially create hidden bugs that are hard to detect if you are not as […]

View More

Operating on mutable data structures can potentially create hidden bugs that are hard to detect if you are not as familiar with some of the characteristics of the data structure. Today we want to share with you the first way that we’ve learned in the past.

List: Reference vs Shallow Copy
It’s very easy to use [list]*n to duplicate your list, forgetting the concept of reference vs copy. When you use [list] * n, you are creating a reference to the same list a. And when you try to mutate one of the elements of that duplicated list, you will end up changing it throughout the duplicated list, as you are just creating n references of the same exact list. In order to not mutate throughout the duplicated list, you would have to do [a[:] for i in range(n)] to create n copies of the list that you would like to duplicate. Please see a quick example below:

#Reference
a = [‘apple’,’orange’,’grape’,’banana’]
b = [‘plum’, ‘watermelon’, ‘cucumber’]

a_reference = [a] * 3
a_reference
[[‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’]]

a_reference[0][0] = b[0]
a_reference
[[‘plum’, ‘orange’, ‘grape’, ‘banana’], [‘plum’, ‘orange’, ‘grape’, ‘banana’], [‘plum’, ‘orange’, ‘grape’, ‘banana’]]

#Shallow Copy
a = [‘apple’,’orange’,’grape’,’banana’]
b = [‘plum’, ‘watermelon’, ‘cucumber’]

a_copy = [a[:] for i in range(3)]
a_copy
[[‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’]]

a_copy[0][0] = b[0]
a_copy
[[‘plum’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’]]

However, this only worked for immutable items, as all items in list a are strings and strings are immutable. You would have to create a deep copy with copy.deepcopy() if you are dealing with list of lists, or list of mutable data structures.

List: Shallow Copy vs. Deep Copy

#Shallow Copy doesn’t work any more in this example

import copy
a = [‘apple’,’orange’,’grape’,’banana’]
b = [‘plum’, ‘watermelon’, ‘cucumber’]

#shallow copy on immutable data structure
a_copy = [a[:] for i in range(3)]
a_copy
[[‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’]]

#shallow copy on mutable data structure
a_shallow_copy = a_copy[:]
a_shallow_copy
[[‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’]]

#deep copy on mutable data structure
a_deep_copy = copy.deepcopy(a_copy)
a_deep_copy
[[‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’]]

#mutating on the list
a_copy[1] += b
a_copy
[[‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’, ‘plum’, ‘watermelon’, ‘cucumber’], [‘apple’, ‘orange’, ‘grape’, ‘banana’]]

#shallow copy gets affected because now items are lists, which are mutable
a_shallow_copy
[[‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’, ‘plum’, ‘watermelon’, ‘cucumber’], [‘apple’, ‘orange’, ‘grape’, ‘banana’]]

#deep copy doesn’t get affected
a_deep_copy
[[‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’], [‘apple’, ‘orange’, ‘grape’, ‘banana’]]

Hope you enjoy this week’s Data Hack Tuesday. Please tune in for the Part II next week!

Pandas: Detect Mixed Data Types and Fix it

April 13, 2021

For data science professionals, mixed data types within a column is a no no, however, we often face that problem […]

View More

For data science professionals, mixed data types within a column is a no no, however, we often face that problem as data tables can be unclean. It is our essential duty to clean the data tables, which often starts with data types. The following are three useful techniques/solutions for cleaning mixed data types within a column, that we would like to share with you today.

  1. Infer Data Types First: df.info() doesn’t always show the correct data type for a column with mixed data types; it might tell you it’s a column with “objects”, but it can be really a column with mixed data types. In turn, you can use pd.api.types.infer_dtypes() for that.

for column in df.columns:
print(pd.api.types.infer_dtypes(df[column])

2. Convert Strings of Numbers to Intergers: df[column].astype(int) can’t really handle a column with a mix of strings of characters and strings of numbers. You can use pd.to_numeric() for that. Don’t forget the errors = ‘ignore’ argument, as you might have the strings of characters within the same column.

df[column] = df[column].apply(lambda x: pd.to_numeric(x, errors = 'ignore'))

3. Investigate/Select Rows by Data Types: Once we are done with some of the cleaning operations, we might want to make sure the column is free of mixed data types. In order to do that, we can select the rows by the data types to see if there are still rows existing for that particular data type. For this, we can use isinstance(x, type) to apply to the column with mixed data types.

df[column][df[column].apply(lambda x: isinstance(x, type))]

Hope you enjoyed this week’s Data Hack Tuesday!

Python: When to Use Set vs. List

March 30, 2021

For many Python programmers, we love to use lists to store data, but when should you choose to use a […]

View More

For many Python programmers, we love to use lists to store data, but when should you choose to use a data structure like a set over a list? The answer is that when it comes to determining if an item is in a data structure, a set is a lot faster than a list. However, when it comes to looping through a data structure, a list is faster. Of course, a list has order, but a set does not, therefore no index either.

Columns

October 13, 2020

Apply separate transformations on every column using Sklearn’s ColumnTransformer https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

View More

Apply separate transformations on every column using Sklearn’s ColumnTransformer https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

Intro to linear regression

October 6, 2020

Perform linear regression by tapping into Python’s indispensable sklearn library https://scikit-learn.org/stable/index.html using the make_regression() function. Now, you can completely control […]

View More

Perform linear regression by tapping into Python’s indispensable sklearn library https://scikit-learn.org/stable/index.html using the make_regression() function. Now, you can completely control your data’s behavior whether you want a mini random data set or need to debug your algorithm.

Descriptive, prescriptive, or predictive?

September 29, 2020

What type of analytics makes the most sense for your organization? Naturally, this depends on the project and on your […]

View More

What type of analytics makes the most sense for your organization? Naturally, this depends on the project and on your goals. The three main types of analytics that you will want to consider are:

  • Descriptive
  • Predictive
  • Prescriptive

Here’s a handy reminder of how to apply each of these.

Prescriptive analytics will help you to answer the question of what your organization should do. Compare this to descriptive analytics which is the method for understanding your current situation or problem and reviewing how it looked in the past. Unlike prescriptive analytics, descriptive will not address a question specifically and will instead present a clear overview of the problem. This differs considerably from predictive analytics which involves making prediction about future behaviors or results based on current data and trends.

Data Security

September 22, 2020

Thanks to the sudden increase in remote work, companies must now confront a new set of security challenges. One small […]

View More

Thanks to the sudden increase in remote work, companies must now confront a new set of security challenges. One small way to help protect company data is to avoid downloading data to spreadsheets and Excel files.

While this is common enough, a few problems can result from downloading data to a spreadsheet:

-It is no longer possible to control how the data is used or with whom it is shared.

-The files could be exploited.

-The risk of confidential information becoming exposed is greatly increased.

Close