Introduction
I have recently come across a lot of aspiring data scientists wondering why it’s so difficult to import different file formats in Python. Most of you might be familiar with the read_csv() function in Pandas, but things get tricky from there.
How do you read a JSON file in Python? How about an image file? How about multiple files all at once? These are questions you should know the answers to, but they can be difficult to grasp initially.
And mastering these file formats is critical to your success in the data science industry. You'll be working with all sorts of file formats collected from multiple data sources; that's the reality of the modern digital age we live in.
So in this article, I will introduce you to some of the most common file formats that a data scientist should know. We will learn how to read them in Python so that you are well prepared before you enter the battlefield!
Table of Contents
- Extracting from Zip files in Python
- Reading Text files in Python
- Import CSV file in Python using Pandas
- Reading Excel file in Python
- Importing Data from Database using Python
- Working with JSON files in Python
- Reading data from Pickle files in Python
Extracting from Zip files in Python
Zip files are gifts from the coding gods. It is like they have fallen from heaven to save our storage space and time. Old school programmers and computer users will certainly relate to how we used to copy extensive installation files in Zip format!
But technically speaking, ZIP is an archive file format that supports lossless data compression. This means you don’t have to worry about your data being lost in the compression-decompression process (Silicon Valley, anyone?).
Here, let’s look at how you can open a ZIP archive with Python. For this, you will need the zipfile module, which ships with Python's standard library.
I have zipped all the files required for this article in a separate ZIP folder, so let’s extract them!
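A minimal sketch of the extraction step is shown here. The archive name sample_files.zip, the notes.txt member, and the extract_zip() helper are hypothetical and chosen only for illustration; the demo builds a tiny archive in a temporary folder so that it runs anywhere. Swap in the path to your own ZIP file.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of the archive into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(dest_dir)  # creates dest_dir if it does not exist
        return archive.namelist()


# Build a small demo archive (hypothetical name and content, for illustration only).
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / "sample_files.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("notes.txt", "hello from inside the archive\n")

# Extract it and list what came out.
extracted = extract_zip(demo_zip, work_dir / "extracted")
print(extracted)
```

Passing "." as dest_dir would place the extracted files in the current working directory instead.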
Once you run the above code, you can view the extracted files in the same folder as your Python script.
Reading Text Files in Python
Text files are one of the most common file formats to store data. Python makes it very easy to read data from text files.
Python provides the open() function to open files. It takes the file path and the file access mode as its parameters. For reading a text file, the access mode is ‘r’, which is also the default. The other access modes are listed below:
- ‘w’: write to a file, creating it if it does not exist and erasing any existing content.
- ‘r+’ or ‘w+’: read and write to a file (‘w+’ erases the existing content first, ‘r+’ does not).
- ‘a’: append to the end of an already existing file.
- ‘a+’: append to a file and also allow reading from it.
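The access modes above can be sketched as follows. The file name modes_demo.txt and its contents are made up for illustration, and the file is created in a temporary folder so that nothing on your machine is overwritten.

```python
import tempfile
from pathlib import Path

# Hypothetical demo file inside a fresh temporary folder.
path = Path(tempfile.mkdtemp()) / "modes_demo.txt"

with open(path, "w") as f:       # 'w' creates the file (or erases an existing one)
    f.write("first line\n")

with open(path, "a") as f:       # 'a' adds to the end without erasing anything
    f.write("second line\n")

with open(path, "r") as f:       # 'r' is read-only
    content = f.read()

with open(path, "a+") as f:      # 'a+' appends and also allows reading
    f.write("third line\n")
    f.seek(0)                    # move back to the start before reading
    all_lines = f.readlines()

with open(path, "r+") as f:      # 'r+' reads and writes without erasing the file
    f.write("FIRST")             # overwrites the first five characters in place
    f.seek(0)
    first_line = f.readline()

print(content)
print(all_lines)
print(first_line)
```

Using the with statement closes each file automatically, even if an error occurs while reading or writing.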
Python provides us with three functions to read data from a text file:
- read(n): reads n characters from a file opened in text mode (n bytes in binary mode), or the complete contents if no number is specified. The result is a single string that keeps the newline characters, so printing it shows the lines as they appear in the file.
- readline(n): reads at most n characters from the file, but never more than one line. Without n, it returns the whole next line.
- readlines(): reads the complete contents of the file and returns them as a list of lines. Unlike the printed output of read(), the newline character at the end of each line is visible in the list.
Let us see how these functions differ in reading a text file:
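Here is a small sketch comparing the three functions. The file sample.txt and its three lines are invented for this example and written to a temporary folder first, so the snippet is self-contained.

```python
import tempfile
from pathlib import Path

# Hypothetical sample file with three short lines.
path = Path(tempfile.mkdtemp()) / "sample.txt"
path.write_text("Line one\nLine two\nLine three\n")

with open(path, "r") as f:
    everything = f.read()        # the whole file as one string

with open(path, "r") as f:
    first_four = f.read(4)       # only the first four characters

with open(path, "r") as f:
    one_line = f.readline()      # a single line, newline included
    partial = f.readline(3)      # at most three characters of the next line

with open(path, "r") as f:
    lines = f.readlines()        # a list with one string per line

print(everything)
print(first_four)
print(repr(one_line), repr(partial))
print(lines)
```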
The read() function returned all the data in the file as a single string, with the line breaks intact.
By providing a number in the read() function, we were able to extract the specified number of characters from the file.
Using readline(), only a single line from the text file was extracted.
Here, the readlines() function extracted all the text file data in a list format.
Reading CSV Files in Python
Ah, the good old CSV format. A CSV (Comma Separated Value