How to load data from Google Drive to Pandas running in Google Colaboratory

I like Google Colaboratory for multiple reasons.

First of all, the code runs on someone else’s machine, so I can do something else on my laptop while the code is running, and my laptop does not overheat ;)

The second reason is, of course, effortless code sharing. Just click the share button, copy the link, and send it to someone else.

There is only one little problem: loading data into Colaboratory. Fortunately, you can store your dataset in Google Drive and import it quite easily.

Setup

Most of the setup is covered by the predefined code snippet that lists files in Google Drive. We can copy and paste this part:

# Install the PyDrive client library.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import os
import pandas as pd
# Authenticate the currently logged-in Colaboratory user.
auth.authenticate_user()
# Pass the default credentials to PyDrive and create the Drive client.
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

What does it do? It imports the libraries that give us access to Google Drive and authorizes the Google Cloud SDK to access the Google Drive of the currently logged-in user. As a result, you can access your files from Python code running in Colaboratory.



Google Drive id

Unfortunately, I could not find a way to open a file using its full path as we usually do. So if I store a file called test.csv in the directory data/test_dataset, I cannot use the path /data/test_dataset/test.csv to access it.

Google Drive uses file and directory ids to identify locations. Hence, the first step is to find the id of the directory: I open the data/test_dataset directory in my browser and copy the identifier from the URL.
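
The identifier copied from the URL is the path segment that follows "folders". As a minimal sketch, assuming the browser shows an address of the form https://drive.google.com/drive/folders/<id> (the helper name folder_id_from_url is made up for this example and is not part of any library), it can be extracted like this:

```python
from urllib.parse import urlparse


def folder_id_from_url(url):
    """Return the path segment that follows 'folders' in a Drive folder URL."""
    # Split the URL path into its non-empty segments; the query string is ignored.
    segments = [part for part in urlparse(url).path.split('/') if part]
    if 'folders' not in segments:
        raise ValueError('not a Google Drive folder URL: {}'.format(url))
    position = segments.index('folders')
    if position + 1 >= len(segments):
        raise ValueError('the URL does not contain a folder id: {}'.format(url))
    return segments[position + 1]


# The id below is the one used in this article's query example.
url = 'https://drive.google.com/drive/folders/1ANnCDVS281y486EVBqm_MDadxjkelxZM?usp=sharing'
print(folder_id_from_url(url))
```

Copying the id by hand works just as well; the function is only useful when the URL is pasted into a notebook cell as is.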

As far as I know, the identifier of a single file is harder to find. To get it, we must list the files in the directory, using the directory id in the query:

listed = drive.ListFile({'q': "title contains 'test.csv' and '1ANnCDVS281y486EVBqm_MDadxjkelxZM' in parents"}).GetList()
for file in listed:
  print('title {}, id {}'.format(file['title'], file['id']))

The code prints the names and identifiers of the files in the directory whose title contains test.csv. Copy the identifier of the file you want to open. You are going to need it.
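
Because "title contains" can match more than one file, it may help to build the query in one place and then pick the exact title from the results. The following is only a sketch under assumptions: build_query and find_file_id are hypothetical helpers, the entries are assumed to support dictionary-style access with 'title' and 'id' keys (as in the loop above), and only single quotes are escaped in the query value.

```python
def build_query(title_fragment, folder_id):
    """Build the 'q' value for drive.ListFile, limited to one folder."""
    # Single quotes inside the searched value are escaped with a backslash.
    escaped = title_fragment.replace("'", "\\'")
    return "title contains '{}' and '{}' in parents".format(escaped, folder_id)


def find_file_id(files, title):
    """Return the id of the single entry whose title matches exactly."""
    matches = [f['id'] for f in files if f['title'] == title]
    if len(matches) != 1:
        raise LookupError(
            'expected exactly one file named {!r}, found {}'.format(title, len(matches)))
    return matches[0]


# Plain dictionaries stand in for the objects returned by GetList().
sample = [
    {'title': 'test.csv', 'id': 'id-1'},
    {'title': 'old_test.csv.bak', 'id': 'id-2'},
]
print(build_query('test.csv', 'folder-id'))
print(find_file_id(sample, 'test.csv'))
```

In the notebook, the query would be passed as drive.ListFile({'q': build_query('test.csv', folder_id)}).GetList(), and the resulting list handed to find_file_id.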

Now you have everything you need to load data from Google Drive to Pandas.

Copy data from Google Drive to Colaboratory

First of all, let’s create a local directory to store a copy of the file:

download_path = os.path.expanduser('~/data')
os.makedirs(download_path)

There is one little problem with this code. If you rerun the notebook cell that contains it, the code will fail with FileExistsError because the directory already exists. If you want to ignore this error, the code should look like this:

download_path = os.path.expanduser('~/data')
try:
  os.makedirs(download_path)
except FileExistsError:
  pass
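
The same effect can be achieved without the try/except block. This is a minimal sketch, assuming Python 3.2 or newer, where os.makedirs accepts the exist_ok argument; it uses a temporary directory instead of ~/data so that it can run anywhere:

```python
import os
import tempfile

# A temporary base directory keeps the example away from the real home directory.
base = tempfile.mkdtemp()
download_path = os.path.join(base, 'data')

# exist_ok=True makes the call succeed when the directory is already there,
# so the cell can be rerun safely. Calling it twice demonstrates that.
os.makedirs(download_path, exist_ok=True)
os.makedirs(download_path, exist_ok=True)

print(os.path.isdir(download_path))
```

Note that exist_ok=True only ignores an existing directory. If a regular file with the same name exists, the call still raises FileExistsError.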

Now we have the file id and the output directory. We can copy the file from Google Drive:

output_file = os.path.join(download_path, 'test.csv')
temp_file = drive.CreateFile({'id': 'the_file_id'})
temp_file.GetContentFile(output_file)
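
When a notebook cell is rerun, the file gets downloaded again. If that is a concern, the download can be skipped when a local copy already exists. The sketch below is an assumption about how one might do it: download_if_missing is a hypothetical helper, and FakeDriveFile is only a stand-in for the object returned by drive.CreateFile so that the example runs outside Colaboratory.

```python
import os
import tempfile


def download_if_missing(drive_file, output_file):
    """Download the file unless a local copy exists. Return True if downloaded."""
    if os.path.exists(output_file):
        return False
    drive_file.GetContentFile(output_file)
    return True


class FakeDriveFile:
    """Stand-in for the object returned by drive.CreateFile, for illustration only."""

    def GetContentFile(self, filename):
        # Write a tiny CSV instead of fetching anything from Google Drive.
        with open(filename, 'w') as f:
            f.write('id,value\n1,10\n')


output_file = os.path.join(tempfile.mkdtemp(), 'test.csv')
print(download_if_missing(FakeDriveFile(), output_file))  # first call downloads
print(download_if_missing(FakeDriveFile(), output_file))  # second call is skipped
```

In the notebook, temp_file from the code above would take the place of FakeDriveFile(). The trade-off is that a changed file in Google Drive is not picked up until the local copy is deleted.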

Load the file in Pandas

Now it is time for something that looks familiar. Just load the file into a Pandas dataframe:

data = pd.read_csv(output_file)
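
After loading, a quick look at the dataframe confirms that the copy worked. This is a minimal sketch: the file content and the column names id and value are made up for the example, and a temporary file stands in for the one downloaded from Google Drive.

```python
import os
import tempfile

import pandas as pd

# Write a tiny stand-in for the downloaded file; the columns are made up.
output_file = os.path.join(tempfile.mkdtemp(), 'test.csv')
with open(output_file, 'w') as f:
    f.write('id,value\n1,10\n2,20\n')

data = pd.read_csv(output_file)

# Basic sanity checks: dimensions, column names, and the first rows.
print(data.shape)
print(list(data.columns))
print(data.head())
```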


Bartosz Mikulski
Big data engineer, conference speaker, co-founder of Software Craftsmanship Poznan & Poznan Scala User Group