
The Art of Judging Bug Bounties



In the competitive world of bug bounties, judges play a pivotal role. With both sides (competitors and sponsors) pulling the rope to their own side, judges are the only protectors of balance. Alas, judging is extremely abstract and open to interpretation. In this piece, we'll start by studying the mechanics of judging. Ideas will remain fairly abstract so that they fit most contest platforms. Finally, we'll leverage our experience in submitting thousands of findings in private audits and bounty contests, as well as judging thousands of submissions on Code4rena, to give helpful tips to aspiring judges.


 

The Conflict


Sponsors are incentivized to downplay bug severities and score a better-looking public report. When the prize pool has pay jumps per unlocked severity, downgrading can also mean saving five to six figures (USD) in security spending. Finally, it is not rare to see developers become overly protective of their code to save face with upper management. Conversely, competitors are financially driven to inflate the severities of their findings. Neither side faces any real consequences for pushing submissions too hard. Judges are caught in the crossfire, and at least one of the parties will be displeased with the outcome.


The Judge's Role


A judge must resolve each submission based on:

  1. Drawing inspiration and direction from the judging guidelines of the contest platform

  2. Applying their own common sense and considering the specific context in which the bug manifests

  3. Completely ignoring anything other than the content (identity of the submitter, public expectations, time constraints of judging, etc.)



Sources of Dispute


A rewardable finding should have at minimum:

  • Technical validity

  • Sufficient proof

  • Identification of the root cause of the issue

  • Measurable end impact in line with the severity level

  • A root cause and an impact that both affect in-scope files

  • No overlap with known issues


On top of that, severity is judged by:

  • Likelihood - Privileges required, conditions that must align, attack cost, etc.

  • Impact - Severity of the invariant broken (user loses funds? user cannot use a function? protocol loses X amount of fees?)


Many of the likelihood and impact factors have been standardized in ruling guidelines, yet it is impossible to cover all of them, so novel situations arise quite often. In fact, if that were not the case, judging would be secretarial work rather than the highly specialized role it is.


Many services operate by the famous severity matrix below:
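
As a rough stand-in for the table, here is the commonly used 3x3 layout expressed as a direct lookup; the exact cells vary slightly between platforms, so treat this as an illustration rather than any platform's official mapping:

```python
# A sketch of the classic 3x3 severity matrix as a direct lookup.
# The exact cells are an assumption based on the commonly used layout.
SEVERITY_MATRIX = {
    # (impact, likelihood): resulting severity
    ("high", "high"): "high",
    ("high", "medium"): "high",
    ("high", "low"): "medium",
    ("medium", "high"): "high",
    ("medium", "medium"): "medium",
    ("medium", "low"): "low",
    ("low", "high"): "medium",
    ("low", "medium"): "low",
    ("low", "low"): "low",
}

def matrix_severity(impact: str, likelihood: str) -> str:
    """Naive approach: read the severity straight off the matrix."""
    return SEVERITY_MATRIX[(impact, likelihood)]
```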



However, blindly following this table often does more harm than good. Its main soft spots are:

  • Impact Med, likelihood High => High

  • Impact Low, likelihood High => Med

  • Impact High, likelihood Low => Med


The two distinct problems are:

  • Low is uncapped and can be abused - how low is too low?

  • High likelihood can promote a finding above what its impact alone justifies. Severity should instead be capped by the impact in isolation, then pulled down according to likelihood reasoning: impact describes how bad it would be if X did happen (not whether it might happen), and likelihood fractionalizes that risk.


Several experienced bounty platforms like C4 and Immunefi opt for an impact-anchor, likelihood-pulldown approach.
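
A minimal sketch of that approach, with hypothetical pulldown amounts chosen purely for illustration: the impact in isolation sets a hard ceiling, and likelihood reasoning can only pull the verdict down from it.

```python
# Impact-anchored sketch: impact caps the severity in isolation, and
# likelihood can only pull the verdict down, never push it above the cap.
# The pulldown amounts are hypothetical and chosen only for illustration.
LEVELS = ["low", "medium", "high"]

def anchored_severity(impact: str, likelihood: str) -> str:
    ceiling = LEVELS.index(impact)                      # impact in isolation sets the cap
    pulldown = {"high": 0, "medium": 1, "low": 1}[likelihood]
    return LEVELS[max(ceiling - pulldown, 0)]

# Example: a Medium impact with High likelihood stays Medium here,
# whereas the naive matrix above would promote it to High.
```

Under this scheme, no amount of likelihood can lift a finding above its impact ceiling, which directly addresses the first two soft spots listed earlier.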


Let's discuss how to assess the eligibility criteria:

  • Technical validity

    • Chase down the code path where the issue exists, and pay attention to any hidden conditions / checks

    • Run a coded PoC, if available

  • Sufficient proof

    • For a High, this needs to be a step-by-step written PoC or a coded exploit; for a Medium, a clear explanation of the impact scenario

    • Intended as an up-front cost to prevent asymmetrical effort between contestant and judge

  • Identification of root cause

    • A report must identify the disease, not the symptom. This is almost always where the minimal necessary fix is located.

  • Measurable end impact

    • An impact is a high-level construct, detached from low-level technical jargon.

    • If the report does not demonstrate a dangerous behavior for some actor in the protocol, it has not shown an end impact.

    • DeFi impact tables are fairly well standardized (loss of funds, temporary freeze of funds, theft of yield, etc.). For other protocols, the impact score should be determined by whether the issue breaks a core invariant or a standard invariant of the protocol. It is recommended to think about it in terms of "this behavior should not happen" (Medium) vs. "this behavior must not happen" (High).

  • Root cause and impact are in scope

    • Fairly simple. Occasionally things can get complex when an in-scope file makes use of an out-of-scope file through inheritance or composition; in that case, the official platform guidelines are the source of truth.
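
To tie the criteria together, the eligibility pass can be pictured roughly as the checklist below. The field names are hypothetical and simply mirror the bullets in this post, not any platform's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Submission:
    technically_valid: bool        # code path confirmed, coded PoC runs if provided
    sufficient_proof: bool         # step-by-step PoC / coded exploit as required
    root_cause_identified: bool    # the disease, not the symptom
    end_impact: Optional[str]      # e.g. "loss of funds"; None if no end impact shown
    root_cause_in_scope: bool
    impact_in_scope: bool
    known_issue: bool

def is_rewardable(s: Submission) -> bool:
    """Minimum eligibility check; severity is assessed separately."""
    return (
        s.technically_valid
        and s.sufficient_proof
        and s.root_cause_identified
        and s.end_impact is not None
        and s.root_cause_in_scope
        and s.impact_in_scope
        and not s.known_issue
    )
```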



Escalations


Most platforms allow a window of time to correct any mistake made during initial judging. However, since escalating costs the submitter nothing but time, it is almost always +EV for them to escalate. That is a mechanism-level issue that platforms should solve, but until a better formula is found, judges will continue to combat a high escalation ratio.


Escalations get messy very quickly and often feel like joining a MOBA server: many discussions are being held at once between many parties. The judge's job during escalations is to:

  • Filter out the noise

  • Comment by comment, note down only the new information added

  • Validate each piece of information and evaluate its impact on the verdict

  • Finally, reapply the eligibility and severity criteria
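
One way to picture that workflow is the sketch below; the thread and claim structure are entirely hypothetical, and the point is only the flow: drop repeated noise, verify each new claim, then reapply the original criteria.

```python
# Hypothetical sketch of an escalation pass: keep only new, verified facts,
# then rejudge the submission against the normal criteria.
def process_escalation(thread, known_facts, verify, rejudge):
    new_facts = []
    for comment in thread:                    # comment by comment
        for claim in comment.claims:          # ignore tone, repetition and noise
            if claim in known_facts or claim in new_facts:
                continue                      # only genuinely new information counts
            if verify(claim):                 # validate before it can move the verdict
                new_facts.append(claim)
    return rejudge(known_facts + new_facts)   # reapply eligibility and severity criteria
```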



Pro tips


  • Do NOT be afraid to make what you believe is the right call just because it may be bashed or less popular than the no-brainer ruling.

  • Do NOT hesitate to change a previous ruling due to new information or a realized mistake. If there is even a slight hesitation to admit being wrong, judging is not for you.

  • Have a trusted group of judges ready in DM to weigh in on very borderline issues

  • Try to keep the discussion in public so others can see all information that led to a particular decision.

  • It's not necessary to put the entire line of reasoning for a verdict into writing, but we recommend mentioning how the key characteristics of the submission affected the judging.

  • Assume all parties are aligned with their financial incentives and will misrepresent things to tilt verdicts in their favor. Your job is to extract the raw data from the input, which can then be validated and evaluated.



Outlook


The judging system is still in its infancy; no doubt we will see new methodologies, tools, standardizations, and completely new judging structures as the ecosystem evolves. The most important thing for contest platforms (and the white-hat industry in general) is maintaining the competence and credible neutrality of the judging process. Without those, researchers will stop trusting the process and seek other avenues to get compensated for their work.


That's all for today. Ultimately, judging is one of the most demanding, stressful roles in web3 security, but for those who have the mental fortitude and love being bombarded with intensive decision-making, it could be a dream job.


