2022-05-27 Chain Halt Incident Report

On 2022-05-27 at around 02:23 HKT (2022-05-26 18:23 UTC), the LikeCoin chain halted at block height 4011379. The LikeCoin team, together with the Oursky team, investigated the issue and made a hotfix, and with the cooperation of the community, the chain resumed at around 18:01 HKT (2022-05-27 10:01 UTC). This article provides technical details on the cause and outcome of the chain halt.

TL;DR

We missed one part in the Bech32 address prefix migration. That part happened to compare addresses by string comparison, which eventually caused some validators to be unbonded twice, triggering a panic in the chain code.

Change in Bech32 address prefix

In the LaiChiKok upgrade, we migrated from the cosmos address prefix to the like address prefix.

In the blockchain world, an address is roughly address = hash(publicKey), i.e. just some binary bytes. Bech32 is just a way to transform these non-human-friendly bytes into a (sort of) human-friendly string.

In other words, changing the Bech32 prefix only changes the display format; the underlying binary bytes stay the same.

So it should be easy, right? Well, not in Cosmos SDK.

Troubles with Cosmos SDK

In Cosmos SDK, the string form of the address, which is supposed to be used only for human-friendly display, is actually used in quite a lot of places.

For example, when sending a transaction, the user specifies the addresses related to the transaction, e.g. the sender and receiver addresses for a token transfer. These address fields are actually strings instead of binary bytes, even in the internal definition of the transaction.

So instead of letting the frontend (the client web / mobile application, RESTful API server, or gRPC-gateway code) parse and convert the Bech32-encoded address strings into bytes, Cosmos SDK lets the chain code do so.

We are not sure about the design rationale for using strings as the internal format of addresses, but strings are used everywhere as the standard format for addresses in Cosmos SDK modules. Some modules even store the string addresses in the store.

Additionally, in Cosmos SDK, the Bech32 address prefix is a global configuration, meaning that there can be only one Bech32 prefix in the chain. Therefore if we simply modify the configuration and start the chain, it will fail to parse the addresses it has stored and panic. Also, users would be unable to send any transaction using the old cosmos prefixed addresses, while we want a smooth adoption process letting users move to the new like prefixed addresses progressively.

So to change the Bech32 address prefix without breaking the chain, and also being compatible with old addresses, we did the following:

  1. fork Cosmos SDK and change the Bech32-related code to support multiple Bech32 address prefixes
  2. write migration code for the stores to convert all stored string addresses from the old prefix to the new one

Migration for change in Bech32 address prefix

The fork is surprisingly simple.

Originally, we thought that we would need an AnteHandler to enumerate every possible message type when processing a transaction, and then do the corresponding address parsing and conversion for each field.

But then we realized that all address parsing is done in the same function, meaning that we can simply modify that function and all parsing will work automatically. We still need to specify one prefix (like in our case) as the display format, but for parsing we can allow a set of possible prefixes (["like", "cosmos"] in our case) instead of only one. The configuration part also needs to change to specify multiple prefixes, but overall the modification is minimal.

For the migration code, that’s another story.

Since each module defines its own storage format, we needed to find every place where a Cosmos SDK module stores addresses as strings.

Starting from the Stargate upgrade, all binary encoding is done with Protobuf, and the data layout is defined in .proto files, so we searched for all string fields in the .proto files. That resulted in hundreds of matches, which we had to check one by one.

After checking these matches, here is the list of storage we migrated:

  • auth module
    • Account addresses
      • BaseAccount
      • ModuleAccount
      • various types of VestingAccount
  • gov module
    • Depositor addresses in proposals
    • Voter addresses in votes
    • Recipient addresses in community pool spend proposals
  • slashing module
    • Validator addresses in signing info
  • staking module
    • Validator addresses in validator info
    • Delegator and validator addresses in delegation info
    • Delegator, source validator and destination validator addresses in redelegation info
    • Delegator and validator addresses in unbonding delegation info
    • Validator addresses in historical info

During testing, these addresses showed no problem in the chain logic other than the display format in queries, i.e. even if we didn't migrate these addresses, the chain still operated well, with only some queries returning addresses with the old prefix.

Logically speaking, this makes sense: in the chain code, when operating on addresses stored in string format, the code parses the address into binary bytes before further processing. Given that we modified the SDK code for parsing addresses, addresses with the old prefix are parsed into the same binary bytes as the new addresses. Queries are the exception, but they only affect display.

So the code seemed to work well, the migration seemed to be solely for better display in queries, and the testnet also ran properly without any problem. Therefore the code was applied to mainnet in the LaiChiKok upgrade.

The bug

What we missed in the migration is the structure below:

// ValAddresses defines a repeated set of validator addresses.
message ValAddresses {
  option (gogoproto.goproto_stringer) = false;
  option (gogoproto.stringer)         = true;

  repeated string addresses = 1;
}

(source)

which is, well, an array of addresses.

We thought it was just a convenience structure not used for storage, so we didn't migrate it.

But this is in fact a very important storage: this is the data structure for the validator unbonding queue.

In Cosmos SDK, when validator unbonding occurs (because the validator is jailed, or for other reasons), the validator address is stored in the following key-value format: the key is a combination of the unbonding time (21 days after the event that triggered the unbonding) and the unbonding height, while the value is the address of the validator.

(source)

In this format, the chain code can iterate over every key with unbonding time less than the current block time (i.e. validators that should be unbonded now), unbond the validators, and then delete the unbonding queue entries.

Since it is possible to have more than one validator unbonding at the same time and height, the stored value is actually not one address but an array of validator addresses, which is exactly the structure we missed in the migration.

So far this still doesn't matter: the code finds the validator by parsing the address into binary bytes first, so this alone is not a problem.

But the queue doesn't only support appending new addresses; it also supports deletion: when a validator's unbonding is delayed or canceled, the corresponding unbonding queue entry is modified so that the validator address is removed from it. This is done by iterating over the addresses in the array and constructing a new array that does not contain the address to be removed.

The code for deleting address:

// DeleteValidatorQueue removes a validator by address from the unbonding queue
// indexed by a given height and time.
func (k Keeper) DeleteValidatorQueue(ctx sdk.Context, val types.Validator) {
	addrs := k.GetUnbondingValidators(ctx, val.UnbondingTime, val.UnbondingHeight)
	newAddrs := []string{}

	for _, addr := range addrs {
		if addr != val.OperatorAddress {
			newAddrs = append(newAddrs, addr)
		}
	}

	if len(newAddrs) == 0 {
		k.DeleteValidatorQueueTimeSlice(ctx, val.UnbondingTime, val.UnbondingHeight)
	} else {
		k.SetUnbondingValidatorsQueue(ctx, val.UnbondingTime, val.UnbondingHeight, newAddrs)
	}
}

(source)

And the key part is the comparison of addresses:

		if addr != val.OperatorAddress {
			newAddrs = append(newAddrs, addr)
		}

As you can see, the code here compares the addresses as strings.

So if a validator was in the unbonding queue before the Bech32 address prefix change, and after the change it was to be removed from the queue (e.g. because it re-bonded), this code won't work: the two addresses have different Bech32 prefixes, so the stale entry remains and the validator is wrongly unbonded at the original unbonding time.

What's more, if the validator enters the unbonding queue again, then at the second unbonding time the code trying to unbond the validator will find that it has already been unbonded, which triggers a panic:

// UnbondAllMatureValidators unbonds all the mature unbonding validators that
// have finished their unbonding period.
func (k Keeper) UnbondAllMatureValidators(ctx sdk.Context) {
	// ...
	val, found := k.GetValidator(ctx, addr)
	if !found {
		panic("validator in the unbonding queue was not found")
	}

	if !val.IsUnbonding() {
		panic("unexpected validator in unbonding queue; status was not unbonding")
	}
	// ...
}

(source)

which is exactly the cause of the chain halt.

Timeline of the incident at the chain level

(all times are in UTC and extracted from block time)

Block height 3692802, 2022-05-04 15:38:00

LikeCoin chain upgraded to v2.0.0 with the default Bech32 address prefix changed from cosmos to like. At that time, 3 validators were in the unbonding queue with the cosmos address prefix: cosmosvaloper1nuvjepa3hc6av4awukvynwdrn0kjmqdfsdfkuu, cosmosvaloper1dl4s8xxpmt0u6yrrlh5tnmrlgsq0t70uw965u3, cosmosvaloper19cl82dvs83nzafxqc39nrmw5ex2umk67d4ut2u.

Block height 3698579, 2022-05-05 02:53:51

Validator cosmosvaloper19cl82dvs83nzafxqc39nrmw5ex2umk67d4ut2u was to be removed from the unbonding queue, but was not actually removed due to the bug.

Block height 3702301, 2022-05-05 09:43:37

Validator cosmosvaloper1dl4s8xxpmt0u6yrrlh5tnmrlgsq0t70uw965u3 was to be removed from the unbonding queue, but was not actually removed due to the bug.

Block height 3738602, 2022-05-07 22:39:05

Validator cosmosvaloper1nuvjepa3hc6av4awukvynwdrn0kjmqdfsdfkuu was unbonded. This is normal, since there was no attempt to remove this validator from the unbonding queue.

// TODO: time for inserting into unbonding queue

Block height 3904478, 2022-05-19 08:57:59

Validator cosmosvaloper1dl4s8xxpmt0u6yrrlh5tnmrlgsq0t70uw965u3 was unbonded. This is abnormal, since the unbonding entry at this time slot should have been removed.

Block height 3923103, 2022-05-20 15:32:10

Validator cosmosvaloper19cl82dvs83nzafxqc39nrmw5ex2umk67d4ut2u was unbonded. This is abnormal, since the unbonding entry at this time slot should have been removed.

Block height 4011379, 2022-05-26 18:23:52

Validator likevaloper1dl4s8xxpmt0u6yrrlh5tnmrlgsq0t70uwzyar2 was being unbonded. However, likevaloper1dl4s8xxpmt0u6yrrlh5tnmrlgsq0t70uwzyar2 is the same validator as cosmosvaloper1dl4s8xxpmt0u6yrrlh5tnmrlgsq0t70uw965u3, which was already unbonded at block 3904478. The code found that the validator was being unbonded twice, so it panicked, resulting in the chain halt.

Hotfix

Since the bug triggers during re-bonding, when the code tries to remove the affected validator from the unbonding queue, by the time of the panic (the second unbonding time) the validator has already been wrongly unbonded. So all we can do is prevent the chain from panicking.

diff --git a/x/staking/keeper/validator.go b/x/staking/keeper/validator.go
index ca6830697..4b3d67b05 100644
--- a/x/staking/keeper/validator.go
+++ b/x/staking/keeper/validator.go
@@ -428,11 +428,19 @@ func (k Keeper) UnbondAllMatureValidators(ctx sdk.Context) {
                                }
                                val, found := k.GetValidator(ctx, addr)
                                if !found {
-                                       panic("validator in the unbonding queue was not found")
+                                       // due to address Bech32 prefix migration, it is possible that entry with address with old prefix was not
+                                       // deleted (e.g. jailed during unbond), so we don't panic here and instead just skip this validator
+                                       ctx.Logger().Error("validator in the unbonding queue was not found", "validator", valAddr)
+                                       continue
                                }

                                if !val.IsUnbonding() {
-                                       panic("unexpected validator in unbonding queue; status was not unbonding")
+                                       if val.IsUnbonded() {
+                                               // same issue as the comments above
+                                               ctx.Logger().Error("unbonding validator but the status was unbonded", "validator", valAddr)
+                                               continue
+                                       }
+                                       panic("unexpected validator in unbonding queue; status was not unbonding or unbonded")
                                }

                                val = k.UnbondingToUnbonded(ctx, val)

(source)

The fix (which is no longer needed)

As mentioned before, the fix won't help validators that were already wrongly unbonded. Also, since the bug only occurs when there are addresses with the old Bech32 prefix in the queue, and new queue entries all use the new Bech32 prefix, the bug cannot recur once one unbonding period (i.e. 21 days) has passed since the upgrade time.

Anyway, we made a fix for this issue:

diff --git a/x/staking/keeper/validator.go b/x/staking/keeper/validator.go
index 4b3d67b05..e50f54bbd 100644
--- a/x/staking/keeper/validator.go
+++ b/x/staking/keeper/validator.go
@@ -371,9 +371,21 @@ func (k Keeper) DeleteValidatorQueue(ctx sdk.Context, val types.Validator) {
        addrs := k.GetUnbondingValidators(ctx, val.UnbondingTime, val.UnbondingHeight)
        newAddrs := []string{}

+       // since address string may change due to Bech32 prefix change, we parse the addresses into bytes
+       // format for normalization
+       deletingAddr, err := sdk.ValAddressFromBech32(val.OperatorAddress)
+       if err != nil {
+               panic(err)
+       }
+
        for _, addr := range addrs {
-               if addr != val.OperatorAddress {
-                       newAddrs = append(newAddrs, addr)
+               storedAddr, err := sdk.ValAddressFromBech32(addr)
+               if err != nil {
+                       // even if we don't panic here, it will panic in UnbondAllMatureValidators at unbond time
+                       panic(err)
+               }
+               if !storedAddr.Equals(deletingAddr) {
+                       newAddrs = append(newAddrs, storedAddr.String())
                }
        }

(source)

After the incident

We are now checking all string fields defined in the Protobuf definitions of the Cosmos SDK modules we use.

We also rechecked the three affected validators:

  • Validator cosmosvaloper1nuvjepa3hc6av4awukvynwdrn0kjmqdfsdfkuu (2022-05-07 22:39:01.80976696 +0000 UTC, 3439443) OK.
  • Validator cosmosvaloper1dl4s8xxpmt0u6yrrlh5tnmrlgsq0t70uw965u3 (2022-05-19 08:57:56.588790325 +0000 UTC, 3604294) OK.
  • Validator cosmosvaloper19cl82dvs83nzafxqc39nrmw5ex2umk67d4ut2u (2022-05-20 15:32:09.823104385 +0000 UTC, 3622470) OK.

At the halt height 4011379, the affected validator appears in both forms: cosmosvaloper1dl4s8xxpmt0u6yrrlh5tnmrlgsq0t70uw965u3 (old prefix) and likevaloper1dl4s8xxpmt0u6yrrlh5tnmrlgsq0t70uwzyar2 (new prefix).