For data science professionals, mixed data types within a column are a no-no. We still run into them often, because real-world data tables can be messy. Cleaning those tables is an essential part of the job, and it often starts with data types. Here are three useful techniques for cleaning mixed data types within a column that we would like to share with you today.
1. Infer Data Types First: df.info() doesn’t always tell you enough about a column with mixed data types; it may report the column as “object” without saying what kinds of values the column actually holds. You can use pd.api.types.infer_dtype() for that. It inspects the values themselves and returns a label such as “string” or “mixed-integer”.
for column in df.columns:
    print(column, pd.api.types.infer_dtype(df[column]))
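To make the difference concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are assumptions for illustration only). It compares what the column dtype says with what pd.api.types.infer_dtype() reports.

```python
import pandas as pd

# Hypothetical sample data: "a" mixes integers and strings,
# "b" holds only strings, "c" holds only floats.
df = pd.DataFrame({
    "a": [1, "two", 3],
    "b": ["x", "y", "z"],
    "c": [1.5, 2.5, 3.5],
})

# The dtype of "a" is just "object", which hides the mix of values.
print(df["a"].dtype)

# infer_dtype looks at the values and returns a descriptive label per column.
inferred = {column: pd.api.types.infer_dtype(df[column]) for column in df.columns}
print(inferred)
```

Any label that starts with “mixed” is a good hint that the column needs cleaning before you rely on it.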
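2. Convert Strings of Numbers to Numbers: df[column].astype(int) can’t handle a column that mixes strings of characters with strings of numbers; it raises an error at the first value it can’t parse. You can use pd.to_numeric() instead. Don’t forget the errors='ignore' argument, which leaves a value unchanged when it can’t be converted, as you might have strings of characters within the same column. Note that errors='ignore' is deprecated as of pandas 2.2, so on newer versions you may need to catch the exception yourself.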
df[column] = df[column].apply(lambda x: pd.to_numeric(x, errors = 'ignore'))
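If you would rather not depend on errors='ignore', the same idea can be sketched with an explicit try/except. The helper name to_number_if_possible and the sample values below are hypothetical, chosen only for illustration.

```python
import pandas as pd

def to_number_if_possible(value):
    """Return value as a number, or return it unchanged if it can't be parsed."""
    try:
        return pd.to_numeric(value)
    except (ValueError, TypeError):
        return value

# Hypothetical object column mixing numeric strings and plain text.
s = pd.Series(["1", "2", "abc", "4.5"], dtype=object)

# Convert element by element; "abc" is kept as a string.
converted = s.apply(to_number_if_possible)
print(converted.tolist())
```

If you would prefer missing values over leftover strings, pd.to_numeric(s, errors='coerce') replaces anything it can’t parse with NaN.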
3. Investigate/Select Rows by Data Types: Once we are done with some of the cleaning operations, we might want to make sure the column is free of mixed data types. To do that, we can select rows by data type and check whether any rows of a particular type remain. For this, we can apply isinstance(x, type) to the column, replacing type with the type we are looking for, such as str.
df.loc[df[column].apply(lambda x: isinstance(x, str)), column]  # rows whose value is still a str
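Here is a small sketch of that check on made-up data (the column name "value" and its contents are assumptions). It selects the rows that still hold strings and counts how many values of each Python type the column contains.

```python
import pandas as pd

# Hypothetical column with a mix of ints, floats, and strings.
df = pd.DataFrame({"value": [1, "two", 3.0, "four", 5]})

# Boolean mask: True where the value is a str.
is_str = df["value"].apply(lambda x: isinstance(x, str))

# Rows that still contain strings; empty once the column is clean.
string_rows = df.loc[is_str, "value"]
print(string_rows)

# Quick overview of how many values of each type are present.
type_counts = df["value"].apply(lambda x: type(x).__name__).value_counts()
print(type_counts)
```

One thing to watch for: values produced by pandas or NumPy conversions may be NumPy scalars rather than built-in Python numbers, so a check such as isinstance(x, int) can miss them.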
Hope you enjoyed this week’s Data Hack Tuesday!