CWE-502 deserialization of untrusted data

A worked example — what insecure deserialization looks like in real code, how Sebastion flags it, and the one-line fix. Based on RAGFlow PR

This playbook walks through a single real CWE-502 finding end to end: the vulnerable code, why it is dangerous even when not immediately reachable, how Sebastion surfaces it on a pull request, and what the fix looks like. The case study is a real PR Sebastion shipped to the RAGFlow repo in May 2026: #14803, "security: always use RestrictedUnpickler in deserialize_b64", which was merged the same day.

It is a small change — one insertion, six deletions, scoped to a single function — but it is a useful teaching example because the finding is latent rather than actively reachable, and how you think about latent findings tells you a lot about how to use Sebastion well.

The CWE

CWE-502 — Deserialization of Untrusted Data. In Python, this almost always means a path that calls pickle.loads (or pickle.load, cPickle.loads, dill.loads, joblib.load) on bytes that an attacker can influence.
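To find these calls in your own code, a rough first pass can be sketched with the standard-library ast module. This is a toy illustration only, not how Sebastion detects the pattern: the UNSAFE_CALLS set and the find_unsafe_loads helper are hypothetical names, and the sketch only catches direct module.function(...) calls, so it misses aliases and "from pickle import loads".

```python
import ast

# Illustrative list built from the calls named above; not an exhaustive rule set.
UNSAFE_CALLS = {
    ("pickle", "loads"), ("pickle", "load"), ("cPickle", "loads"),
    ("dill", "loads"), ("joblib", "load"),
}


def find_unsafe_loads(source: str):
    """Return (line, 'module.func') for each direct module.func(...) call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and (func.value.id, func.attr) in UNSAFE_CALLS):
            hits.append((node.lineno, f"{func.value.id}.{func.attr}"))
    return sorted(hits)


sample = (
    "import pickle, json\n"
    "def read(blob):\n"
    "    return pickle.loads(blob)\n"
    "def read_json(text):\n"
    "    return json.loads(text)\n"
)
hits = find_unsafe_loads(sample)
```

On the sample, only the pickle.loads call is reported; json.loads is left alone because JSON parsing cannot invoke callables.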

Pickle is not a data format in the JSON sense. It is a bytecode for a stack machine that can construct arbitrary Python objects, which means it can invoke arbitrary callables. A pickle payload that says "construct an object by calling posix.system('id')" will, on unpickling, call posix.system('id'). There is no fix at the unpickling layer that makes this safe in general — the format itself is the unsafe surface.
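The stack-machine behaviour can be seen with a harmless payload. The sketch below is a minimal illustration, assuming a hypothetical NotData class whose __reduce__ points at the built-in eval with a benign expression in place of posix.system. It shows that unpickling returns whatever the named callable returns, and that the stream contains a REDUCE opcode (the "call this" instruction).

```python
import pickle
import pickletools


class NotData:
    """Hypothetical class: tells pickle to rebuild it by calling eval."""

    def __reduce__(self):
        # (callable, args): the unpickler will run callable(*args) on load.
        return (eval, ("6 * 7",))


payload = pickle.dumps(NotData())

# Loading does not produce a NotData instance; it produces eval's result.
result = pickle.loads(payload)

# List the opcode names in the stream; REDUCE is the call instruction.
opcodes = [op.name for op, arg, pos in pickletools.genops(payload)]
```

Swapping eval for any other importable callable changes what runs, which is why filtering the payload's contents after the fact does not help.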

The Python standard library acknowledges this directly in its pickle docs:

Warning: The pickle module is not secure. Only unpickle data you trust.

The vulnerable code

In RAGFlow's api/utils/configs.py, deserialize_b64 was structured like this:

use_deserialize_safe_module = get_base_config('use_deserialize_safe_module', False)
if use_deserialize_safe_module:
    return restricted_loads(src)
return pickle.loads(src)   # <-- default path

Two things make this a classic CWE-502 sharp edge:

The safe path is gated on a config flag that defaults to False. Operators have to opt in to safety. Defaults matter: in our experience, ~95% of production deployments ship with the defaults.
The flag is not set anywhere in the repo. Not in pyproject.toml, not in any YAML config, not in any environment-driven helper. So the effective behaviour was always the default path, pickle.loads on the raw bytes, regardless of what an operator might think they had configured.

The caller chain made it worse. deserialize_b64 is invoked by SerializedField.python_value in api/db/db_models.py, which is Peewee's deserialiser for columns marked as PICKLE. So any code path that reads a SerializedField(serialized_type=PICKLE) column from MySQL would hit pickle.loads on whatever bytes were stored in that column.
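The read path described above can be sketched without Peewee. This is a simplified stand-in under stated assumptions: UnsafePickleField is a hypothetical class (not Peewee's or RAGFlow's API), sqlite3 stands in for MySQL, and the payload uses a benign eval expression in place of a shell command.

```python
import base64
import pickle
import sqlite3


class UnsafePickleField:
    """Hypothetical stand-in for a pickle-backed ORM field."""

    def db_value(self, value):
        return base64.b64encode(pickle.dumps(value))

    def python_value(self, raw):
        # Mirrors the pre-fix default path: column bytes go straight to pickle.loads.
        return pickle.loads(base64.b64decode(raw))


class Payload:
    def __reduce__(self):
        # Harmless stand-in for posix.system: evaluates an expression on load.
        return (eval, ("'code ran on read'",))


field = UnsafePickleField()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, params BLOB)")

# Whoever can write this column chooses the bytes the application later unpickles.
conn.execute("INSERT INTO task (params) VALUES (?)", (field.db_value(Payload()),))

# An ordinary read of the row is enough to trigger the callable.
(raw,) = conn.execute("SELECT params FROM task").fetchone()
result = field.python_value(raw)
```

Nothing in the read looks dangerous at the call site, which is what makes a pickle-backed column easy to misuse.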

Why "latent" still matters

When we filed this, no in-tree model actually used SerializedField with the default PICKLE type — every existing field used JsonSerializedField. So in practice, today, an attacker could not directly reach this via an HTTP endpoint.

That is exactly the kind of finding that gets dismissed as "not exploitable, don't bother". We argue Sebastion should still flag it, and the maintainers agreed. The reasons:

The default behaviour of a security-sensitive helper sets the ceiling for every future caller. The next contributor who reaches for SerializedField with PICKLE (because it's the default and that's how Peewee fields work) inherits RCE-on-read semantics. They will not know, because the helper does not warn.
The set of "attacker-controllable bytes in a database column" is larger than people think. SQL injection elsewhere in the codebase, a compromised database credential, a backup restored from an untrusted source, a compromised replication peer, a developer using a copy of production with seed data from an external contributor. Pickle in a DB column turns any of these from "bad" into "RCE on the application server".
The fix is one line. When the cost of fixing is roughly zero and the cost of the latent bomb going off is RCE, the trade is easy.

This is the framework you want for every "is this finding worth filing?" question. Reachability is a useful prior, not a final answer.

How Sebastion surfaces it

When this code lands in a PR, Sebastion posts an inline review comment on the offending line. The comment includes:

Severity: high (rising to critical when the path is demonstrably reachable from a request handler).
Rule id: the canonical Sebastion rule for the pickle.loads family (the exact id is shaped like py.<rule-name> and is surfaced inline on the finding).
CWE link: CWE-502.
Prose explanation of why the default path is unsafe and what attacker capabilities turn it into a real exploit.
A ready-to-apply fix, usually as a fenced diff block.

Because this is high severity, the chat-path suppression (@sebastionai ignore) cannot silence it — that requires an operator-added Learning with an audit trail. See severity floor for why.

If you legitimately need the pickle path (e.g. you're round-tripping trusted data with a strict allow-list), the right move is not to suppress the finding yourself from chat; it is to leave the suppression to an operator-added Learning so we can review the reasoning. In most cases, switching to the restricted_loads pattern below is faster than the back-and-forth.

The fix

-    use_deserialize_safe_module = get_base_config(
-        'use_deserialize_safe_module', False)
-    if use_deserialize_safe_module:
-        return restricted_loads(src)
-    return pickle.loads(src)
+    return restricted_loads(src)

Three properties make this a clean fix:

No new code, only removal. restricted_loads already existed in the same file as an Unpickler subclass that whitelists numpy and rag_flow and rejects everything else. The fix is "always use the helper that was already designed for this", not "introduce a new dependency".
No behaviour change for legitimate callers. The pre-fix behaviour for callers who actually wanted safety (use_deserialize_safe_module = True) is preserved. The pre-fix behaviour for callers who left the default (use_deserialize_safe_module = False) is replaced with the safe path — which is what they would have got if they had read the RestrictedUnpickler docstring.
The dead config flag goes with it. Removing the flag and the now-unused get_base_config import prevents future contributors re-introducing the bug by setting the flag back to False.
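For readers who have not seen the pattern, an allow-listing unpickler can be sketched as follows. This is a minimal illustration of the technique, not RAGFlow's exact code: the SAFE_MODULES name and the error message are assumptions, and the allow-list mirrors the two module names mentioned above.

```python
import io
import pickle

# Assumption: the allow-list holds top-level module names.
SAFE_MODULES = {"numpy", "rag_flow"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # find_class is consulted for every global the pickle stream references.
        if module.split(".")[0] in SAFE_MODULES:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Unpickle data, refusing any global outside SAFE_MODULES."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Exploit:
    def __reduce__(self):
        return (eval, ("6 * 7",))  # benign stand-in for a shell command


blocked = False
try:
    restricted_loads(pickle.dumps(Exploit()))
except pickle.UnpicklingError:
    blocked = True  # builtins.eval is not on the allow-list

# Plain containers and scalars reference no globals, so they still load.
plain = restricted_loads(pickle.dumps({"k": [1, 2.5, "text"]}))
```

The check happens when the global is resolved, before the callable is ever invoked, so a rejected payload never gets to run. The allow-list is only as safe as the modules on it: anything importable under an allowed top-level name is reachable.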

Verifying the fix worked

The minimum bar for "did the fix actually fix it":

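import base64
import pickle
import posix  # Unix-only module; this check is meant for a POSIX host

from api.utils.configs import deserialize_b64

# Build a malicious pickle whose __reduce__ resolves to posix.system.
class Exploit:
    def __reduce__(self):
        return (posix.system, ('id',))

# deserialize_b64 takes base64 input (assumed from its name and its pairing with
# serialize_b64), so encode the pickle first. Passing raw pickle bytes would fail
# at the decode step and prove nothing about the unpickler.
payload = base64.b64encode(pickle.dumps(Exploit()))

# Pre-fix: this would execute `id`.
# Post-fix: restricted_loads raises UnpicklingError.
try:
    deserialize_b64(payload)
except pickle.UnpicklingError as e:
    print(f"Blocked: {e}")
else:
    raise AssertionError("malicious payload was not blocked")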

Then prove you have not broken legitimate use:

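import numpy as np
from api.utils.configs import deserialize_b64, serialize_b64

# A numpy array only references numpy globals, which the allow-list permits.
arr = np.array([1.0, 2.0, 3.0])
restored = deserialize_b64(serialize_b64(arr))
assert (restored == arr).all()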

Both of these belong in your test suite alongside the fix, so the guarantee doesn't silently regress.

What this teaches about using Sebastion

Three meta-lessons that generalise beyond this single CWE:

Latent does not mean ignorable. "Not reachable today" says little about tomorrow's behaviour. Fixing a one-line latent finding is usually cheaper than the audit conversation about whether it is reachable.
high severity is intentionally hard to suppress. If you find yourself wanting to suppress a high finding from the chat path and getting blocked, that block is the product working as intended. Either fix the code or get an operator to review your suppression rationale.
Adversarial reasoning belongs in the PR description. The maintainers will not necessarily share your threat model. The PR description for #14803 explicitly listed the two strongest objections (latent today; attacker already has DB write access) and argued why neither was sufficient to leave the bug. Sebastion structures PR bodies this way for exactly this reason.

False positives and Learnings — why high and critical cannot be suppressed from chat.
Suppression strategy — what to do when you legitimately have a high finding that needs an exception.
Sebastion's OSS contributions — every merged PR Sebastion has shipped to public repos, including this one.
