Disabling Google App Engine disabled my site’s database


A sad Firebase Firestore story

It turned out that the Firebase Firestore database we were using had been disabled and was not responding to queries.

What happened

We use Firebase and Google Cloud Platform extensively for the https://like.co service. We have separate production and staging clusters, but both live in the same GCP project.

Before the issue occurred, a team member was testing the new GAE standard environment for Node.js (by the way, it is great, but it easily runs out of memory with Nuxt.js SSR). After the prototype was finished, he could not delete the only remaining GAE instance without deleting the whole project, so he decided to just disable GAE instead.

This somehow disabled Firestore access as well and caused many server responses to time out. We had set up a 5xx alert on nginx, but in this case the responses were discarded rather than returned as errors, so unfortunately the alert did not trigger. We eventually found out about the issue from user reports.
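To illustrate why an alert keyed only on 5xx can miss this kind of outage, here is a minimal sketch of a broader failure check. It is not our actual alerting setup. The `AccessLogEntry` shape, the function names, and the thresholds are all hypothetical. It also assumes that hung requests show up in the access log either as very slow requests or with status 499, the non-standard code nginx logs when the client closes the connection; whether our discarded responses were logged that way is an assumption.

```typescript
// Hypothetical shape of one parsed access-log entry (field names are assumptions).
interface AccessLogEntry {
  status: number;     // HTTP status recorded by the proxy
  durationMs: number; // time spent serving the request
}

// Count a request as failed if it is a 5xx, a client-aborted request
// (nginx logs these as 499), or slower than the latency budget.
function isFailure(entry: AccessLogEntry, slowMs: number): boolean {
  return entry.status >= 500 || entry.status === 499 || entry.durationMs > slowMs;
}

// Alert when the share of failed requests in a window exceeds the threshold.
// The default values are illustrative, not recommendations.
function shouldAlert(
  entries: AccessLogEntry[],
  slowMs: number = 10000,
  maxFailureRatio: number = 0.05
): boolean {
  if (entries.length === 0) {
    return false; // nothing to judge in an empty window
  }
  const failures = entries.filter((e) => isFailure(e, slowMs)).length;
  return failures / entries.length > maxFailureRatio;
}
```

With a check like this, a window full of requests that hang on a database call would count as failures even though none of them carries a 5xx status.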

The Problem

It seems that Firebase Firestore relies on Cloud Datastore, and Cloud Datastore relies on GAE being enabled.

You were hinted

The documentation did not mention that Cloud Datastore needs to be enabled, except for this paragraph:

You can’t use both Cloud Firestore and Cloud Datastore in the same project, which might affect apps using App Engine. Try using Cloud Firestore with a different project.

While it says nothing about whether disabling GAE affects Firestore, there was a 2016 issue stating that Datastore needs GAE enabled. Firestore is probably running on Cloud Datastore and inherits this behaviour.

Thus, disabling GAE makes Firestore inaccessible too.

How to fix it

We simply re-enabled Google App Engine (and thus Cloud Datastore) and everything was working again.

Sounds harmless? NO!

Lessons Learned

  • Do not try new features on a production Google Cloud Platform project (how embarrassing…)
  • Split production and development environments not only at the cluster level, but also at the project level
  • Do not disable a seemingly harmless feature on production before testing what will happen in a development environment
  • Set up more generic alerts for monitoring instead of ad-hoc ones (e.g. only 5xx)
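The project-level split could be enforced with a small guard in tooling scripts. The following is only a sketch under assumptions: the `ProjectMap` type, the function names, and the project IDs in the usage note are hypothetical and not part of our real setup.

```typescript
// Hypothetical mapping from environment name to GCP project ID.
type Environment = "production" | "staging" | "development";
type ProjectMap = Record<Environment, string>;

// Throws if any two environments point at the same GCP project.
function assertSeparateProjects(projects: ProjectMap): void {
  const seen = new Map<string, string>(); // project ID -> environment that uses it
  for (const [env, projectId] of Object.entries(projects)) {
    const other = seen.get(projectId);
    if (other !== undefined) {
      throw new Error(`"${env}" and "${other}" share GCP project "${projectId}"`);
    }
    seen.set(projectId, env);
  }
}

// Returns the project ID to use for an experiment, refusing production.
function projectForExperiment(projects: ProjectMap, env: Environment): string {
  assertSeparateProjects(projects);
  if (env === "production") {
    throw new Error("experiments must not target the production project");
  }
  return projects[env];
}
```

For example, with the made-up IDs `example-prod`, `example-staging` and `example-dev`, `projectForExperiment(projects, "staging")` returns `example-staging`, while a map in which staging and production share one ID is rejected before anything runs.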