Using Python to query and extract data from Firestore, Storage Buckets and BigQuery
Google Firebase is a set of cloud-based development tools that helps mobile app developers like me build, deploy, and scale our applications. It is easy to use and cost-effective. That is why I've built all our frontend digital infrastructure at KCMHR on Google Firebase, using NodeJS and TypeScript to programme a RESTful API.
Google Firebase is a great platform for rapid development using serverless technology, which means we can focus on frontend development rather than server management. A major bonus is that it is supported by Google, so top-notch security support comes bundled in. But one aspect that is a little difficult is getting your data back out of Google Firebase - it isn't especially easy or straightforward.
That is why I wanted to share three scripts I've created to gather data from Google Firestore, Google Cloud Storage buckets and Google BigQuery - the three main places where you'll store data on the platform.
There are multiple ways in which you can query the data; the approaches below are, for me, the simplest. Below I provide a Gist for each method, with the code commented to outline what is happening.
Each method requires Python 3.5, with the following packages installed as a minimum:
- pandas
- firebase_admin
- google.cloud
- pyrebase (only needed for the Storage Bucket example)
You will also need to download your service account key, either from Google Firebase or the Google Cloud Console.
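Before running any of the scripts, it is worth checking that the key file is where the code expects it and that it belongs to the right project. This is a minimal sketch, assuming the key was saved as ServiceAccountKey.json next to the script; the pip package names in the comment are the usual ones, but check your own environment:

# typical installs (PyPI names): pandas, firebase-admin, google-cloud-bigquery,
# and pyrebase4 (the maintained fork, which still imports as 'pyrebase')
import json
import os

key_path = 'ServiceAccountKey.json'  # adjust if you saved the key elsewhere

if not os.path.exists(key_path):
    raise FileNotFoundError('Download the service account key from the Firebase or Cloud console first')

# the key file is plain JSON; project_id shows which project it belongs to
with open(key_path) as f:
    print(json.load(f)['project_id'])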
Getting data from Google Firestore
import pandas as pd
import firebase_admin
from firebase_admin import credentials, firestore

# used for paging when downloading data e.g. only 1000 documents downloaded per call
limit = 1000

# create the connection
if not firebase_admin._apps:
    print('Setting connection')
    # use the service key to authorise the login
    cred = credentials.Certificate('ServiceAccountKey.json')
    # ensure we have the storage bucket permission.
    # for some reason Firebase wants this even if you don't want to pull the data
    default_app = firebase_admin.initialize_app(cred, {
        'storageBucket': '<STORAGE BUCKET>'
    })

# start the firestore client
store = firestore.client()

# declare a function to stream the documents to download, one page of `limit` documents at a time
def stream_collection_loop(collection, count, cursor=None):
    dict_array = []
    id_array = []
    while True:
        docs = []  # very important: this frees the memory held by the previous page of documents
        if cursor:
            docs = [snapshot for snapshot in
                    collection.limit(limit).order_by('__name__').start_after(cursor).stream()]
        else:
            docs = [snapshot for snapshot in collection.limit(limit).order_by('__name__').stream()]
        for doc in docs:
            dict_array.append(doc.to_dict())
            id_array.append(doc.id)
        if len(docs) == limit:
            cursor = docs[limit - 1]
            continue
        break
    return dict_array, id_array

# example data downloading and streaming
# use the defined function and query the 'users' collection with all its data
dict_array, id_array = stream_collection_loop(store.collection(u'users'), 0)
# once the query has finished we have dict_array and id_array
# dict_array is the data stored within the firestore documents
# id_array is the document ids, often known as doc_id
# now this is where we can start to manage and process the data using pandas
# we create a pandas DataFrame with the data, indexed by the document ids
df = pd.DataFrame(data=dict_array, index=id_array)
# to make sure we have the right format and references I always rename the axis so doc_id is referenced
df.rename_axis("doc_id", inplace=True)
# the data has now been downloaded and can be viewed, saved and processed as needed
# display() assumes a Jupyter notebook; use print(df.head(5)) in a plain script
display(df.head(5))
If the Gist does not load, you can access it here.
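Once the documents are in a DataFrame, the normal pandas workflow applies. As a small usage sketch (the 'createdAt' field is hypothetical - substitute whatever your documents actually contain), you can persist or filter the export:

# save the full export for later analysis
df.to_csv('users_export.csv')

# filter on a field before saving; 'createdAt' is a hypothetical field name
# recent = df[df['createdAt'] > '2023-01-01']
# recent.to_csv('recent_users.csv')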
Getting data from Storage Buckets
import pyrebase
import os
import errno

# define firebase config for the project being queried
firebaseConfig = {
    "apiKey": "",
    "authDomain": "",
    "projectId": "",
    "storageBucket": "",
    "messagingSenderId": "",
    "appId": "",
    "measurementId": "",
    "serviceAccount": "",
    "databaseURL": ""
}

# create app connection using pyrebase based on firebase config
firebase_storage = pyrebase.initialize_app(firebaseConfig)

# create a storage bucket container
storage = firebase_storage.storage()

# obtain a list of files hosted on the storage bucket
all_files = storage.list_files()

# keep a count of the number of files downloaded
cnt = 0

# define base storage location (the local folder the files will be downloaded into)
base_url = '<BASE URL HERE>'

# loop through each file stored in all_files and download it
for file in all_files:
    # check whether the local directory for this file already exists
    if not os.path.exists(os.path.dirname(base_url + '/' + file.name)):
        try:
            # make the directory if it doesn't already exist
            os.makedirs(os.path.dirname(base_url + '/' + file.name))
        except OSError as exc:  # guard against race condition
            if exc.errno != errno.EEXIST:
                raise
    # if the file is already present locally, skip it
    if os.path.isfile(base_url + '/' + file.name):
        print('File present')
    else:
        # file does not currently exist locally - download and store it
        print(file.name + ' ' + str(cnt))
        file.download_to_filename(base_url + '/' + file.name)
        cnt += 1
If the Gist does not load, you can access it here.
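If you would rather stay within the official Google libraries, the same bucket can be read with google-cloud-storage (installed separately) using the service account key from earlier. This is a minimal sketch rather than the method above; the bucket placeholder and the local 'downloads' folder are assumptions:

import os
from google.cloud import storage

# authenticate with the same service account key used in the other scripts
client = storage.Client.from_service_account_json('ServiceAccountKey.json')
bucket = client.bucket('<STORAGE BUCKET>')

target = 'downloads'  # hypothetical local folder to download into

for blob in bucket.list_blobs():
    # blobs ending in '/' are placeholder "folders" with no content to download
    if blob.name.endswith('/'):
        continue
    local_path = os.path.join(target, blob.name)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    blob.download_to_filename(local_path)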
Getting data from BigQuery
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd

# generate a service account credentials object based on the service account key
credentials = service_account.Credentials.from_service_account_file('ServiceAccountKey.json')

# define the project ID
project_id = ''

# create a bigquery client using the credentials and project id
client = bigquery.Client(credentials=credentials, project=project_id)

# define and run the query
# this query selects all the data in the analytics table, but it can be any table
query_job = client.query("""SELECT * FROM `<PROJECT>.analytics_289340186.events_*`""")

# this is the clever bit: we wait for the query to finish and convert the result to a pandas DataFrame
df = (
    query_job.result()
    .to_dataframe(
        # Optionally, explicitly request to use the BigQuery Storage API. As of
        # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
        # API is used by default.
    )
)
display(df.head(10))
# you can then modify the data as required
If the Gist does not load, you can access it here.
It is important to note that when using BigQuery you need to elevate the service account's permissions so that it can run BigQuery jobs - typically by granting it the BigQuery Job User role (plus read access to the dataset) in the Google Cloud Console.
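One further note: the wildcard query above scans every daily events_* shard, which can become slow and expensive on a large analytics export. A minimal sketch of a more constrained query, assuming the same client object and a hypothetical date range, uses BigQuery's _TABLE_SUFFIX pseudo-column to limit which shards are read:

# count events per event_name over a hypothetical date range
query = """
    SELECT event_name, COUNT(*) AS event_count
    FROM `<PROJECT>.analytics_289340186.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230131'
    GROUP BY event_name
    ORDER BY event_count DESC
"""

df_events = client.query(query).result().to_dataframe()
display(df_events.head(10))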