investigate BEATRICE halt #16

Closed
opened 2020-07-15 06:10:56 +00:00 by bobinson · 6 comments
bobinson commented 2020-07-15 06:10:56 +00:00 (Migrated from gitlab.com)

Beatrice stopped making blocks on July 4.
LIB is 16545027 (2020-07-04T00:59:30) from blckchnd-testnet

After maintenance of 2020-07-04T01:00:00, LIB stop increasing. Last "Got block" at 2020-07-04T01:00:48, and no more blocks. All witness_node continue running.

Part of logs. No blocks after:

3597023ms th_a       application.cpp:499           handle_block         ] Got block: #16545034 time: 2020-07-04T00:59:57 latency: 23 ms from: roelandp  irreversible: 16545026 (-8)
3104ms th_a       application.cpp:499           handle_block         ] Got block: #16545035 time: 2020-07-04T01:00:03 latency: 104 ms from: houdini-testnet  irreversible: 16545027 (-8)
3105ms th_a       db_maint.cpp:1321             process_dividend_ass ] In process_dividend_assets time 2020-07-04T01:00:03
3349ms th_a       db_maint.cpp:1076             schedule_pending_div ] payout_asset_type: 1.3.1 current_balance: 1 previous_balance: 0 
3349ms th_a       db_maint.cpp:1116             schedule_pending_div ] total_fee_per_asset_in_payout_asset: 0 dividend_data.options: {"next_payout_time":"2020-07-12T20:00:00","payout_interval":2592000,"minimum_fee_percentage":1000,"minimum_distribution_interval":259200,"extensions":[]} 
3485ms th_a       application.cpp:499           handle_block         ] Got block: #16545035 time: 2020-07-04T01:00:03 latency: 485 ms from: houdini-testnet  irreversible: 16545027 (-8)
3486ms th_a       application.cpp:499           handle_block         ] Got block: #16545035 time: 2020-07-04T01:00:03 latency: 486 ms from: houdini-testnet  irreversible: 16545027 (-8)
15038ms th_a       application.cpp:499           handle_block         ] Got block: #16545036 time: 2020-07-04T01:00:15 latency: 38 ms from: royal-flush  irreversible: 16545027 (-9)
18022ms th_a       application.cpp:499           handle_block         ] Got block: #16545037 time: 2020-07-04T01:00:18 latency: 22 ms from: roelandp  irreversible: 16545027 (-10)
21021ms th_a       application.cpp:499           handle_block         ] Got block: #16545038 time: 2020-07-04T01:00:21 latency: 21 ms from: eifos-witness  irreversible: 16545027 (-11)
24023ms th_a       application.cpp:499           handle_block         ] Got block: #16545039 time: 2020-07-04T01:00:24 latency: 23 ms from: blckchnd-testnet  irreversible: 16545027 (-12)
27185ms th_a       application.cpp:499           handle_block         ] Got block: #16545040 time: 2020-07-04T01:00:27 latency: 185 ms from: blockops1  irreversible: 16545027 (-13)
33162ms th_a       application.cpp:499           handle_block         ] Got block: #16545041 time: 2020-07-04T01:00:33 latency: 162 ms from: spacecrypt-witness  irreversible: 16545027 (-14)
36023ms th_a       application.cpp:499           handle_block         ] Got block: #16545042 time: 2020-07-04T01:00:36 latency: 23 ms from: alexpu-witness  irreversible: 16545027 (-15)
38988ms th_a       application.cpp:499           handle_block         ] Got block: #16545043 time: 2020-07-04T01:00:39 latency: -11 ms from: taconator-witness  irreversible: 16545027 (-16)
45050ms th_a       application.cpp:499           handle_block         ] Got block: #16545044 time: 2020-07-04T01:00:45 latency: 50 ms from: houdini-testnet  irreversible: 16545027 (-17)
48037ms th_a       application.cpp:499           handle_block         ] Got block: #16545045 time: 2020-07-04T01:00:48 latency: 37 ms from: royal-flush  irreversible: 16545027 (-18)
53577ms th_a       application.cpp:549           handle_transaction   ] Got 1 transactions from network
1790415ms th_a       application.cpp:637           get_item             ] Couldn't find block 0082cafa85f03a85900d54a74000000000000000 -- corresponding ID in our chain is 0082cafa85f03a85900d54a72a5e560139f766f2```
Beatrice stopped making blocks on July 4. LIB is 16545027 (2020-07-04T00:59:30) from blckchnd-testnet After maintenance of 2020-07-04T01:00:00, LIB stop increasing. Last "Got block" at 2020-07-04T01:00:48, and no more blocks. All witness_node continue running. **Part of logs. No blocks after:** ``` 3597023ms th_a application.cpp:499 handle_block ] Got block: #16545034 time: 2020-07-04T00:59:57 latency: 23 ms from: roelandp irreversible: 16545026 (-8) 3104ms th_a application.cpp:499 handle_block ] Got block: #16545035 time: 2020-07-04T01:00:03 latency: 104 ms from: houdini-testnet irreversible: 16545027 (-8) 3105ms th_a db_maint.cpp:1321 process_dividend_ass ] In process_dividend_assets time 2020-07-04T01:00:03 3349ms th_a db_maint.cpp:1076 schedule_pending_div ] payout_asset_type: 1.3.1 current_balance: 1 previous_balance: 0 3349ms th_a db_maint.cpp:1116 schedule_pending_div ] total_fee_per_asset_in_payout_asset: 0 dividend_data.options: {"next_payout_time":"2020-07-12T20:00:00","payout_interval":2592000,"minimum_fee_percentage":1000,"minimum_distribution_interval":259200,"extensions":[]} 3485ms th_a application.cpp:499 handle_block ] Got block: #16545035 time: 2020-07-04T01:00:03 latency: 485 ms from: houdini-testnet irreversible: 16545027 (-8) 3486ms th_a application.cpp:499 handle_block ] Got block: #16545035 time: 2020-07-04T01:00:03 latency: 486 ms from: houdini-testnet irreversible: 16545027 (-8) 15038ms th_a application.cpp:499 handle_block ] Got block: #16545036 time: 2020-07-04T01:00:15 latency: 38 ms from: royal-flush irreversible: 16545027 (-9) 18022ms th_a application.cpp:499 handle_block ] Got block: #16545037 time: 2020-07-04T01:00:18 latency: 22 ms from: roelandp irreversible: 16545027 (-10) 21021ms th_a application.cpp:499 handle_block ] Got block: #16545038 time: 2020-07-04T01:00:21 latency: 21 ms from: eifos-witness irreversible: 16545027 (-11) 24023ms th_a application.cpp:499 handle_block ] Got block: #16545039 time: 2020-07-04T01:00:24 latency: 23 ms from: blckchnd-testnet irreversible: 16545027 (-12) 27185ms th_a application.cpp:499 handle_block ] Got block: #16545040 time: 2020-07-04T01:00:27 latency: 185 ms from: blockops1 irreversible: 16545027 (-13) 33162ms th_a application.cpp:499 handle_block ] Got block: #16545041 time: 2020-07-04T01:00:33 latency: 162 ms from: spacecrypt-witness irreversible: 16545027 (-14) 36023ms th_a application.cpp:499 handle_block ] Got block: #16545042 time: 2020-07-04T01:00:36 latency: 23 ms from: alexpu-witness irreversible: 16545027 (-15) 38988ms th_a application.cpp:499 handle_block ] Got block: #16545043 time: 2020-07-04T01:00:39 latency: -11 ms from: taconator-witness irreversible: 16545027 (-16) 45050ms th_a application.cpp:499 handle_block ] Got block: #16545044 time: 2020-07-04T01:00:45 latency: 50 ms from: houdini-testnet irreversible: 16545027 (-17) 48037ms th_a application.cpp:499 handle_block ] Got block: #16545045 time: 2020-07-04T01:00:48 latency: 37 ms from: royal-flush irreversible: 16545027 (-18) 53577ms th_a application.cpp:549 handle_transaction ] Got 1 transactions from network 1790415ms th_a application.cpp:637 get_item ] Couldn't find block 0082cafa85f03a85900d54a74000000000000000 -- corresponding ID in our chain is 0082cafa85f03a85900d54a72a5e560139f766f2```
bobinson commented 2020-07-15 06:12:04 +00:00 (Migrated from gitlab.com)
List of BEATRICE witnesses https://peerplays.atlassian.net/wiki/spaces/PD/pages/15564866/Witness+Node+Connection+Details
pblockc commented 2020-07-15 14:12:13 +00:00 (Migrated from gitlab.com)

Analysis:
There is high possibility that the system has no votes. First GPOS period(180 days) should be finished by 1st week of July. Since no one has voted in the last sub-period of GPOS period, system has no valid witnesses to produce blocks. Based on current scheduling algorithm, every witnesses who was registered have equal authority to produce blocks.

Analysis: There is high possibility that the system has no votes. First GPOS period(180 days) should be finished by 1st week of July. Since no one has voted in the last sub-period of GPOS period, system has no valid witnesses to produce blocks. Based on current scheduling algorithm, every witnesses who was registered have equal authority to produce blocks.
sierra19XX commented 2020-07-17 07:39:10 +00:00 (Migrated from gitlab.com)

As @pblockc mentioned, chain halt happened because of a lack of votes and witness participation.

Also, below assert failure is identified.

"Assert Exception","stack":[{"context":{"level":"error","file":"db_maint.cpp","line":806,"method":"get_gpos_current_subperiod","hostname":"","thread_name":"th_a","timestamp":"2020-07-17T04:09:33"},"format":"period_start <= now && now <= period_end: ","data":{}}]}

In get_gpos_current_subperiod ,
FC_ASSERT(period_start <= now && now <= period_end); is failing.

This happened because of the missing block at Saturday, 4 July 2020 01:00:00

GPOS period ended at Saturday, 4 July 2020 01:00:00, if we had received the block at that time, we would have rolled over into the next GPOS period without issues.

Next block was received at Saturday, 4 July 2020 01:00:03, so maintenance happened at this point instead of 01.00.00

now <= period_end fails.

I think this issue is not a special case, edge cases were not considered during development.

"active_witnesses": [
    "1.6.1",
    "1.6.2",
    "1.6.3",
    "1.6.4",
    "1.6.5",
    "1.6.6",
    "1.6.7",
    "1.6.8",
    "1.6.9",
    "1.6.10",
    "1.6.11"
  ]

get_block 16545034
{
  "previous": "00fc7509bd9ba3790a9eaa7e157b1ccca06a0f99",
  "timestamp": "2020-07-04T00:59:57",
  "witness": "1.6.18",
  "next_secret_hash": "b661f8ceec8c644315810496285fe8ca2f563a0e",
  "previous_secret": "684a01be66ff66a984b8554892c194c1189265e7",
  "transaction_merkle_root": "0000000000000000000000000000000000000000",
  "extensions": [],
  "witness_signature": "1f51b136c019308650585679dbb589db58f22711d3867cc4b9522363755ac89257470922efa76ac51e82a33d23039a7fc1b5f5a00ef57dc78950fdeba4faef2b04",
  "transactions": [],
  "block_id": "00fc750aebd2206df6ca2e5b21e0ba2ab3da87ba",
  "signing_key": "TEST6yvan9HehRVeJh3JMivXdJAekBQzesUQeMvZ2joK1YK6heNrTf",
  "transaction_ids": []
}
unlocked >>> get_block 16545035
get_block 16545035
{
  "previous": "00fc750aebd2206df6ca2e5b21e0ba2ab3da87ba",
  "timestamp": "2020-07-04T01:00:03",
  "witness": "1.6.20",
  "next_secret_hash": "244ce649c757f5bb1e61e95c9c5943d5e9f62a4d",
  "previous_secret": "1e43cbb03825e1107cc282637af1b2a039c45796",
  "transaction_merkle_root": "0000000000000000000000000000000000000000",
  "extensions": [],
  "witness_signature": "1f3f728e1721ad4fa8cb332f9c32904bd67088cbbc59311e39e746735ed8676e6b5b8d42913f34706328dee42b0b695041a25b9ce4ff61abce5d1505c0e2d0376f",
  "transactions": [],
  "block_id": "00fc750bfa6024548878dc3afeabc87b52fe2be3",
  "signing_key": "TEST5DBceeRiYaxeh3SFQdo3ArTjnCttuVunz8t1jj8wUBmEvbAEYH",
  "transaction_ids": []
}
As @pblockc mentioned, chain halt happened because of a lack of votes and witness participation. Also, below assert failure is identified. `"Assert Exception","stack":[{"context":{"level":"error","file":"db_maint.cpp","line":806,"method":"get_gpos_current_subperiod","hostname":"","thread_name":"th_a","timestamp":"2020-07-17T04:09:33"},"format":"period_start <= now && now <= period_end: ","data":{}}]}` In **get_gpos_current_subperiod** , `FC_ASSERT(period_start <= now && now <= period_end);` is failing. This happened because of the missing block at **Saturday, 4 July 2020 01:00:00** GPOS period ended at **Saturday, 4 July 2020 01:00:00**, if we had received the block at that time, we would have rolled over into the next GPOS period without issues. Next block was received at **Saturday, 4 July 2020 01:00:03**, so maintenance happened at this point instead of **01.00.00** **now <= period_end** fails. I think this issue is not a special case, edge cases were not considered during development. ``` "active_witnesses": [ "1.6.1", "1.6.2", "1.6.3", "1.6.4", "1.6.5", "1.6.6", "1.6.7", "1.6.8", "1.6.9", "1.6.10", "1.6.11" ] get_block 16545034 { "previous": "00fc7509bd9ba3790a9eaa7e157b1ccca06a0f99", "timestamp": "2020-07-04T00:59:57", "witness": "1.6.18", "next_secret_hash": "b661f8ceec8c644315810496285fe8ca2f563a0e", "previous_secret": "684a01be66ff66a984b8554892c194c1189265e7", "transaction_merkle_root": "0000000000000000000000000000000000000000", "extensions": [], "witness_signature": "1f51b136c019308650585679dbb589db58f22711d3867cc4b9522363755ac89257470922efa76ac51e82a33d23039a7fc1b5f5a00ef57dc78950fdeba4faef2b04", "transactions": [], "block_id": "00fc750aebd2206df6ca2e5b21e0ba2ab3da87ba", "signing_key": "TEST6yvan9HehRVeJh3JMivXdJAekBQzesUQeMvZ2joK1YK6heNrTf", "transaction_ids": [] } unlocked >>> get_block 16545035 get_block 16545035 { "previous": "00fc750aebd2206df6ca2e5b21e0ba2ab3da87ba", "timestamp": "2020-07-04T01:00:03", "witness": "1.6.20", "next_secret_hash": "244ce649c757f5bb1e61e95c9c5943d5e9f62a4d", "previous_secret": "1e43cbb03825e1107cc282637af1b2a039c45796", "transaction_merkle_root": "0000000000000000000000000000000000000000", "extensions": [], "witness_signature": "1f3f728e1721ad4fa8cb332f9c32904bd67088cbbc59311e39e746735ed8676e6b5b8d42913f34706328dee42b0b695041a25b9ce4ff61abce5d1505c0e2d0376f", "transactions": [], "block_id": "00fc750bfa6024548878dc3afeabc87b52fe2be3", "signing_key": "TEST5DBceeRiYaxeh3SFQdo3ArTjnCttuVunz8t1jj8wUBmEvbAEYH", "transaction_ids": [] } ```
bobinson commented 2020-07-17 09:05:37 +00:00 (Migrated from gitlab.com)

assigned to @sierra19XX

assigned to @sierra19XX
sierra19XX commented 2020-07-22 07:50:01 +00:00 (Migrated from gitlab.com)

@pblockc

Do you see any side effects with the below change, this would remove the case of GPOS period end spilling over to the next GPOS period and keep the GPOS start and end to the maintenance period.

ceba16abc7

@pblockc Do you see any side effects with the below change, this would remove the case of GPOS period end spilling over to the next GPOS period and keep the GPOS start and end to the maintenance period. https://github.com/peerplays-network/peerplays/commit/ceba16abc73c140bdc5bbfac7d11dad8659f4f17
pblockc commented 2020-07-23 02:21:38 +00:00 (Migrated from gitlab.com)

The second change looks good to me. Even the first change is okay, other way is FC_ASSERT (period_start <= now && now <= period_end); can be replaced with wlog statement as GPOS changes were already went to production, so this check is not necessary.

The second change looks good to me. Even the first change is okay, other way is FC_ASSERT (period_start <= now && now <= period_end); can be replaced with wlog statement as GPOS changes were already went to production, so this check is not necessary.
bobinson (Migrated from gitlab.com) closed this issue 2020-10-20 17:13:28 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Peerplays_Blockchain/peerplays_migrated#16
No description provided.