The real world is messy, and so too is its data. So messy that a recent survey reported data scientists spend 60% of their time cleaning data. Unfortunately, 57% of them also find it to be the least enjoyable aspect of their job.
Cleaning data may be time-consuming, but lots of tools have cropped up to make this crucial duty a little more bearable. The Python community offers a host of libraries for making data orderly and legible, from styling DataFrames to anonymizing datasets.
Let us know which libraries you find useful. We’re always looking to prioritize which libraries to add to Mode Python Notebooks.
Dora is designed for exploratory analysis; specifically, automating the most painful parts of it, like feature selection and extraction, visualization, and—you guessed it—data cleaning. Cleansing functions include:
- Reading data with missing and poorly scaled values
- Imputing missing values
- Scaling values of input variables
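To make the last two bullets concrete, here is a minimal sketch in plain pandas. It is not Dora's own API (see the repository linked below for that); the column names, mean imputation, and z-score scaling are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: "price" is the output variable, the rest are inputs.
df = pd.DataFrame({
    "sqft": [850.0, None, 1200.0, 1500.0],
    "rooms": [2.0, 3.0, None, 4.0],
    "price": [200.0, 250.0, 310.0, 400.0],
})

inputs = ["sqft", "rooms"]

# Impute: replace each missing input with that column's mean.
df[inputs] = df[inputs].fillna(df[inputs].mean())

# Scale: center each input on 0 with unit standard deviation.
df[inputs] = (df[inputs] - df[inputs].mean()) / df[inputs].std()

print(df.round(2))
```

The output column is left untouched so that predictions stay in their original units.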
Created by: Nathan Epstein
Where to learn more: https://github.com/NathanEpstein/Dora
Surprise, surprise: datacleaner cleans your data, but only once it’s in a pandas DataFrame. From creator Randy Olson: “datacleaner is not magic, and it won’t take an unorganized blob of text and automagically parse it out for you.”
It will, however, drop rows with missing values, replace missing values with the mode or median on a column-by-column basis, and encode non-numeric variables with numerical equivalents. This library is fairly new, but since DataFrames are fundamental to analysis in Python, it’s worth checking out.
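The operations described above can be sketched in plain pandas. This is not datacleaner's API, just an illustration of the same idea; the toy data, and the choice of median for numeric columns and mode for the rest, are assumptions.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one non-numeric column.
df = pd.DataFrame({
    "age": [34.0, None, 29.0, 41.0],
    "city": ["Oslo", "Lima", None, "Lima"],
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Numeric column: fill gaps with the column median.
        df[col] = df[col].fillna(df[col].median())
    else:
        # Non-numeric column: fill gaps with the most frequent value...
        df[col] = df[col].fillna(df[col].mode().iloc[0])
        # ...then encode the sorted unique labels as 0, 1, 2, ...
        df[col] = df[col].astype("category").cat.codes

print(df)
```

Integer codes like these imply an ordering that the original labels may not have, so they suit some models better than others.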
Created by: Randy Olson
Where to learn more: https://github.com/rhiever/datacleaner
DataFrames are powerful, but they don’t produce the kind of tables you’d want to show your boss. PrettyPandas makes use of the pandas Style API to transform DataFrames into presentation-worthy tables. Create summaries, add styling, and format numbers, columns, and rows. Added bonus: robust, easy-to-read documentation.
Created by: Henry Hammond
Where to learn more: https://github.com/HHammond/PrettyPandas
tabulate lets you print small, nice-looking tables with just one function call. It’s handy for making tables more readable with column alignment by decimal, number formatting, headers, and more.
One of the coolest features is the ability to output data in a variety of formats, like HTML, LaTeX, or PHP Markdown Extra, so you can continue working with your tabular data in another tool or language.
Created by: Sergey Astanin
Where to learn more:
Data scientists in fields like healthcare and finance regularly have to anonymize datasets. scrubadub removes personally identifiable information (PII) from free text, such as:
- Names (proper nouns)
- Email addresses
- Phone numbers
- Username/password combinations
- Skype usernames
- Social security numbers
The documentation does a good job of showing ways in which you might want to customize scrubadub’s behavior, like defining new PII types or excluding certain kinds of PII from being scrubbed.
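To give a feel for what scrubbing free text involves, here is a deliberately simple sketch using only the standard library's `re` module. It is not scrubadub's API, and the two patterns and the `{{EMAIL}}` / `{{PHONE}}` placeholders are assumptions for illustration; a real scrubber has to handle far more formats and edge cases.

```python
import re

# Hypothetical, intentionally narrow patterns for the example.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def scrub(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL.sub("{{EMAIL}}", text)
    return PHONE.sub("{{PHONE}}", text)


print(scrub("Reach Ana at ana.diaz@example.com or 555-867-5309."))
# Reach Ana at {{EMAIL}} or {{PHONE}}.
```

Note that the name "Ana" survives: detecting proper nouns needs language processing rather than a regular expression, which is part of why a dedicated library is attractive.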
Created by: Datascope Analytics
Where to learn more:
Let’s be honest: working with dates and times in Python is a pain. Local timezones aren’t automatically recognized. It takes several lines of unpleasant code to convert timezones and timestamps.
Arrow aims to fix these problems and plug functionality gaps to help you handle dates and times with less code and fewer imports. Unlike Python’s standard library, Arrow is time-zone aware and UTC by default. You can convert timezones or parse strings using one line of code.
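For context, this is roughly what the standard-library route looks like. The sketch uses only `datetime` and does not show Arrow's API (see the documentation linked below); the timestamp and the fixed UTC-7 target offset are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# A naive datetime carries no time zone, so it cannot be converted reliably.
naive = datetime(2013, 5, 11, 21, 23, 58)
print(naive.tzinfo)  # None

# Parsing a string with an explicit offset yields an aware datetime.
utc_time = datetime.fromisoformat("2013-05-11T21:23:58+00:00")

# Convert to a fixed UTC-7 offset (hypothetical target zone for the example).
target = timezone(timedelta(hours=-7))
local_time = utc_time.astimezone(target)
print(local_time.isoformat())  # 2013-05-11T14:23:58-07:00
```

A fixed offset ignores daylight saving time; handling named zones correctly takes additional imports, which is the kind of friction described above.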
Created by: Chris Smith
Where to learn more: http://arrow.readthedocs.io/en/latest/
Beautifier’s mission is simple: clean and prettify URLs and email addresses.
You can parse emails by domain and username; URLs by domain and parameters (e.g. UTMs or tokens).
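The same kind of parsing can be sketched with the standard library alone. This is not Beautifier's API; the URL, the email address, and the variable names are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL carrying UTM parameters and a token.
url = "https://example.com/pricing?utm_source=newsletter&utm_campaign=spring&token=abc123"

parts = urlsplit(url)
domain = parts.netloc

# parse_qs returns a list per key; keep the first value of each.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}
utm = {key: value for key, value in params.items() if key.startswith("utm_")}

# Split an email address into username and domain.
email = "jane.doe@example.org"
username, _, email_domain = email.partition("@")

print(domain, utm, username, email_domain)
```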
Created by: Sachin Philip Mathew
Where to learn more: https://github.com/sachinvettithanam/beautifier
ftfy (fixes text for you) takes in bad Unicode and outputs good Unicode. Basically, it fixes all the junk characters. <3
If you work with text on a daily basis, this library is, as one user says, “a handy piece of magic.”
Created by: Luminoso
Where to learn more: https://github.com/LuminosoInsight/python-ftfy
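As a rough illustration of the kind of damage ftfy repairs, the sketch below creates "junk characters" by decoding UTF-8 bytes with the wrong codec, then reverses the mistake by hand. It uses only built-in string methods, not ftfy's API, and it assumes the cause of the damage is already known, which is the part a library has to work out for you.

```python
original = "café crème"

# Simulate the bug: UTF-8 bytes misread as Latin-1 produce mojibake.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Manual repair: undo the wrong decoding, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

This manual round trip only works when you know which two encodings were confused, and it raises an error on text that was never garbled in that way.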
Further resources
Here are a couple of our favorite reads on munging/wrangling/cleansing data.
- What every data scientist should know about data anonymization (Katharina Rasch)
- Cleaning data in Python (University of Toronto Map & Data Library)
- Data Cleaning with Python – MoMA’s Artwork Collection (Dataquest)
How do you clean data in Python?
Pythonic Data Cleaning With Pandas and NumPy:
- Dropping Columns in a DataFrame
- Changing the Index of a DataFrame
- Tidying up Fields in the Data
- Combining str Methods with NumPy to Clean Columns
- Cleaning the Entire Dataset Using the applymap Function
- Renaming Columns and Skipping Rows
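A few of the steps listed above can be sketched on a toy DataFrame. The column names and values are invented for illustration, and this is one possible way to do each step rather than the method of any particular tutorial.

```python
import pandas as pd

# Hypothetical messy data: an unused column, a clumsy header, untidy values.
df = pd.DataFrame({
    "ID": [101, 102, 103],
    "Place of Publication": ["London; Virtue", "Oxford", "london "],
    "Notes": [None, None, None],
})

# Drop a column that carries no information.
df = df.drop(columns=["Notes"])

# Use the unique identifier as the index.
df = df.set_index("ID")

# Rename a column to something shorter.
df = df.rename(columns={"Place of Publication": "place"})

# Tidy a text field with chained str methods:
# keep the part before ";", trim whitespace, normalize capitalization.
df["place"] = df["place"].str.split(";").str[0].str.strip().str.title()

print(df)
```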
Is NumPy used for data cleaning?
Data scientists spend a large amount of their time cleaning datasets so that they’re easier to work with. In fact, the 80/20 rule says that the initial steps of obtaining and cleaning data account for 80% of the time spent on any given project.
Which tool is used for data cleaning?
Melissa Clean Suite
Melissa Clean Suite is a highly targeted data cleaning and management tool. It’s designed specifically to support the Salesforce and Microsoft Dynamics customer relationship management (CRM) systems, which many businesses use.
Is pandas good for data cleaning?
Pandas is a powerful data processing library for the Python programming language. It provides a rich set of functions for reading and processing many file formats from multiple data sources, which makes it especially useful for data scientists working on data cleaning and analysis.