Using Python to query and extract data from Firestore, Storage Buckets and BigQuery

Sep 1, 2023

Google Firebase is a set of cloud-based development tools that helps mobile app developers like me build, deploy, and scale our applications. It is easy to use and cost-effective, which is why I've built all our frontend digital infrastructure at KCMHR on Google Firebase, using NodeJS and TypeScript to program a RESTful API.

Google Firebase is a great platform for rapid development using serverless technology, which means we can focus on frontend development rather than server management. A major bonus is that it is supported by Google, so top-notch security support comes bundled in. One aspect that is less straightforward, however, is getting your data back out of Firebase.

That is why I wanted to share three scripts I've created to gather data from Google Firestore, Google Storage Buckets and Google BigQuery - the three main places where you'll store data on the platform.

There are multiple ways in which you can query the data; the approaches below are, for me, the simplest. Below I provide a Gist for each method, with comments in the code outlining what is happening.

Each method requires Python 3.5 or later, with at least the following packages installed:

  • pandas
  • firebase_admin
  • google.cloud
  • pyrebase (used in the Storage Buckets example)

You will also need to download your service account key either from Google Firebase OR Google Cloud Console.
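If you want to double-check that the key file you downloaded belongs to the right project before running any of the scripts, a quick sanity check like the sketch below will tell you (it assumes the key is saved as ServiceAccountKey.json next to the scripts; the file name is just an example):

import json

# a minimal sanity check: load the downloaded key and confirm which project and
# service account it belongs to (ServiceAccountKey.json is an assumed file name)
with open('ServiceAccountKey.json') as f:
    key = json.load(f)

print(key.get('project_id'))
print(key.get('client_email'))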

Getting data from Google Firestore

import pandas as pd
import firebase_admin
from firebase_admin import credentials, firestore

# set the root folder path
base_url = '<BASE URL HERE>'
# used for paging when downloading data, e.g. only 1000 documents downloaded per call
limit = 1000

# create the connection
if not firebase_admin._apps:
    print('Setting connection')
    # use the service key to authorise the login
    cred = credentials.Certificate('ServiceAccountKey.json')
    # ensure we have the storage bucket permission.
    # for some reason Firebase wants this even if you don't want to pull the data
    default_app = firebase_admin.initialize_app(cred, {
        'storageBucket': '<STORAGE BUCKET>'
    })

# start the Firestore client
store = firestore.client()

# declare a function to stream the documents to download
def stream_collection_loop(collection, count, cursor=None):
    dict_array = []
    id_array = []
    while True:
        docs = []  # Very important. This frees the memory incurred in the recursion algorithm.
        if cursor:
            docs = [snapshot for snapshot in
                    collection.limit(limit).order_by('__name__').start_after(cursor).stream()]
        else:
            docs = [snapshot for snapshot in collection.limit(limit).order_by('__name__').stream()]
        for doc in docs:
            dict_array.append(doc.to_dict())
            id_array.append(doc.id)
        if len(docs) == limit:
            cursor = docs[limit - 1]
            continue
        break
    return dict_array, id_array

# example data downloading and streaming
# use the defined function to query the 'users' collection with all its data, starting from index 0
dict_array, id_array = stream_collection_loop(store.collection(u'users'), 0)
# once the query has finished we have dict_array and id_array
# dict_array holds the data stored within the Firestore documents
# id_array holds the document ids, often known as doc_id
# now we can start to manage and process the data using pandas
# create a pandas DataFrame with the data, indexed by the document ids
df = pd.DataFrame(data=dict_array, index=id_array)
# to make sure we have the right format and references, rename the axis so doc_id is referenced
df.rename_axis("doc_id", inplace=True)
# the data has now been downloaded and can be viewed, saved and processed as needed
display(df.head(5))

If the Gist does not load, you can access it here.
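If you only need a subset of a collection, or you want to keep a local copy of the download, the short sketch below builds on the script above (it assumes store and df already exist from that script; the 'active' field is a hypothetical example of a Firestore filter):

# persist the full download to disk for later analysis
df.to_csv('users.csv')

# pull a filtered subset directly from Firestore instead of the whole collection
active_docs = store.collection(u'users').where(u'active', u'==', True).stream()
for doc in active_docs:
    print(doc.id, doc.to_dict())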

Getting data from Storage Buckets

import os
import errno
import pyrebase

# define the firebase config for the project being queried
firebaseConfig = {
    "apiKey": "",
    "authDomain": "",
    "projectId": "",
    "storageBucket": "",
    "messagingSenderId": "",
    "appId": "",
    "measurementId": "",
    "serviceAccount": "",
    "databaseURL": ""
}

# create the app connection using pyrebase based on the firebase config
firebase_storage = pyrebase.initialize_app(firebaseConfig)
# create a storage bucket container
storage = firebase_storage.storage()
# obtain a list of files hosted on the storage bucket
all_files = storage.list_files()
# an index of files so we keep track of the number downloaded
cnt = 0
# define the base storage location
base_url = '<BASE URL HERE>'

# loop through each file stored in all_files and download it
for file in all_files:
    # check whether the local directory for this file already exists
    if not os.path.exists(os.path.dirname(base_url + '/' + file.name)):
        try:
            # make the directory if it doesn't already exist
            os.makedirs(os.path.dirname(base_url + '/' + file.name))
        except OSError as exc:  # guard against a race condition
            if exc.errno != errno.EEXIST:
                raise
    # if the file already exists locally, report it and move on
    if os.path.isfile(base_url + '/' + file.name):
        print('File present')
    else:
        # the file does not exist locally - download and store it
        print(file.name + ' ' + str(cnt))
        file.download_to_filename(base_url + '/' + file.name)
        cnt += 1

If the Gist does not load, you can access it here.
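If you only want part of a bucket rather than everything, you can also go through firebase_admin and filter by prefix. The sketch below is a minimal alternative, assuming the same ServiceAccountKey.json and storage bucket as before; the 'uploads/' prefix and local 'downloads' folder are hypothetical examples:

import os
import firebase_admin
from firebase_admin import credentials, storage

# initialise the app if it hasn't been already
if not firebase_admin._apps:
    cred = credentials.Certificate('ServiceAccountKey.json')
    firebase_admin.initialize_app(cred, {'storageBucket': '<STORAGE BUCKET>'})

# get a handle on the default bucket
bucket = storage.bucket()

# only list blobs whose path starts with the given prefix
for blob in bucket.list_blobs(prefix='uploads/'):
    # skip folder placeholder objects
    if blob.name.endswith('/'):
        continue
    local_path = os.path.join('downloads', blob.name)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    blob.download_to_filename(local_path)
    print('Downloaded ' + blob.name)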

Getting data from BigQuery

from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd

# generate service account credentials based on the service account key
credentials = service_account.Credentials.from_service_account_file('ServiceAccountKey.json')
# define the project ID
project_id = ''
# create a BigQuery client using the credentials and project id
client = bigquery.Client(credentials=credentials, project=project_id)
# define the query
# this query selects all the data in the analytics table, but it can be any table
query_job = client.query("""SELECT * FROM `<PROJECT>.analytics_289340186.events_*`""")
# this is the clever bit: perform the query and convert the result to a pandas DataFrame
df = (
    query_job.result()
    .to_dataframe(
        # Optionally, explicitly request to use the BigQuery Storage API. As of
        # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
        # API is used by default.
    )
)
display(df.head(10))
# you can then modify the data as required

If the Gist does not load, you can access it here.

It is important to note that when using BigQuery you need to elevate the permissions on the service account so that it can run BigQuery jobs.
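Once the permissions are in place, it is also worth narrowing the query rather than selecting every event, both for speed and cost. The sketch below shows one way to do that with a parameterised query, assuming the same service account key and Analytics dataset as above; the event name and aggregation are purely illustrative:

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file('ServiceAccountKey.json')
client = bigquery.Client(credentials=credentials, project='<PROJECT>')

# restrict the query with a parameter rather than pulling every column of every event
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter('event_name', 'STRING', 'session_start'),
    ]
)
sql = """
    SELECT event_date, COUNT(*) AS events
    FROM `<PROJECT>.analytics_289340186.events_*`
    WHERE event_name = @event_name
    GROUP BY event_date
    ORDER BY event_date
"""
df = client.query(sql, job_config=job_config).result().to_dataframe()
print(df.head())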