CWE-502 deserialization of untrusted data
A worked example — what insecure deserialization looks like in real code, how Sebastion flags it, and the one-line fix. Based on RAGFlow PR
This playbook walks through a single real CWE-502 finding end to
end: the vulnerable code, why it is dangerous even when not
immediately reachable, how Sebastion surfaces it on a pull request,
and what the fix looks like. The case study is a real PR Sebastion
shipped to the RAGFlow repo
in May 2026 — #14803, "security: always use RestrictedUnpickler in
deserialize_b64",
which was merged the same day.
It is a small change — one insertion, six deletions, scoped to a single function — but it is a useful teaching example because the finding is latent rather than actively reachable, and how you think about latent findings tells you a lot about how to use Sebastion well.
The CWE
CWE-502 — Deserialization
of Untrusted Data. In Python, this almost always means a path that
calls pickle.loads (or pickle.load, cPickle.loads,
dill.loads, joblib.load) on bytes that an attacker can
influence.
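To make "a path that calls pickle.loads" concrete, here is a minimal sketch that walks a module's AST and reports direct calls to the functions listed above. It is not Sebastion's rule implementation: the names `find_risky_deserialization` and `RISKY_CALLS` are hypothetical, and the check only recognises the `module.function(...)` call shape, so aliased imports and `from pickle import loads` slip past it.

```python
import ast

# Hypothetical table of (module, function) pairs taken from the list above.
RISKY_CALLS = {
    ("pickle", "loads"),
    ("pickle", "load"),
    ("cPickle", "loads"),
    ("dill", "loads"),
    ("joblib", "load"),
}


def find_risky_deserialization(source: str) -> list[tuple[int, str]]:
    """Return (line, call) pairs for direct calls such as pickle.loads(...)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and (node.func.value.id, node.func.attr) in RISKY_CALLS
        ):
            hits.append((node.lineno, f"{node.func.value.id}.{node.func.attr}"))
    return sorted(hits)


SAMPLE = '''
import pickle, json
def load(blob):
    return pickle.loads(blob)
def ok(text):
    return json.loads(text)
'''

hits = find_risky_deserialization(SAMPLE)
print(hits)
```

A syntactic match like this only finds the call site. Whether the bytes reaching it are attacker-influenced is a separate data-flow question, which is where the rest of this playbook spends its time.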
Pickle is not a data format in the JSON sense. It is a bytecode for
a stack machine that can construct arbitrary Python objects, which
means it can invoke arbitrary callables. A pickle payload that says
"construct an object by calling posix.system('id')" will, on
unpickling, call posix.system('id'). There is no fix at the
unpickling layer that makes this safe in general — the format
itself is the unsafe surface.
The Python standard library acknowledges this directly in its
pickle docs:
Warning: The pickle module is not secure. Only unpickle data you trust.
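The "bytecode for a stack machine" point can be seen with a harmless payload. The sketch below uses the built-in `len` as a stand-in for `posix.system`; the class name `CallsOnLoad` is made up for illustration. It lists the opcodes in the payload with `pickletools` and then shows that unpickling returns the result of the call rather than an instance of the class.

```python
import pickle
import pickletools


class CallsOnLoad:
    def __reduce__(self):
        # Benign stand-in: tells pickle to rebuild this object by calling len("abc").
        return (len, ("abc",))


payload = pickle.dumps(CallsOnLoad())

# The payload is a small program; REDUCE is the opcode that calls a callable.
opcodes = [op.name for op, _arg, _pos in pickletools.genops(payload)]

# Unpickling runs that program, so the "object" we get back is len("abc").
result = pickle.loads(payload)
print("REDUCE" in opcodes, result)
```

Swapping `len` for any other importable callable changes nothing about the mechanism, which is why the unpickler has to restrict what it is willing to look up.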
The vulnerable code
In RAGFlow's api/utils/configs.py, deserialize_b64 was structured
like this:
use_deserialize_safe_module = get_base_config('use_deserialize_safe_module', False)
if use_deserialize_safe_module:
    return restricted_loads(src)
return pickle.loads(src)  # <-- default path

Two things make this a classic CWE-502 sharp edge:
- The safe path is gated on a config flag that defaults to False. Operators have to opt in to safety. Defaults matter: in our experience, ~95% of production deployments ship with the defaults.
- The flag is not set anywhere in the repo. Not in pyproject.toml, not in any YAML config, not in any environment-driver helper. So the effective behaviour was always the default path (pickle.loads on the raw bytes), regardless of what an operator might think they had configured.
The caller chain made it worse. deserialize_b64 is invoked by
SerializedField.python_value in api/db/db_models.py, which is
Peewee's deserialiser for columns marked as PICKLE. So any code
path that reads a SerializedField(serialized_type=PICKLE) column
from MySQL would hit pickle.loads on whatever bytes were stored
in that column.
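The shape of that caller chain can be sketched without Peewee or MySQL. The example below is an assumption-laden stand-in: it uses an in-memory SQLite table, a hypothetical `python_value` function in place of the real field hook, and `os.getcwd` as a benign substitute for a dangerous callable. It only illustrates that whoever controls the stored bytes controls what runs on the next read.

```python
import os
import pickle
import sqlite3


class AttackerChosenCall:
    def __reduce__(self):
        # Benign stand-in for a dangerous callable such as os.system.
        return (os.getcwd, ())


def python_value(raw: bytes):
    """Hypothetical stand-in for a PICKLE-typed field's read hook."""
    return pickle.loads(raw)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, state BLOB)")

# Anyone able to write this column decides what executes when it is read.
conn.execute(
    "INSERT INTO task (state) VALUES (?)",
    (pickle.dumps(AttackerChosenCall()),),
)

(raw,) = conn.execute("SELECT state FROM task").fetchone()
value = python_value(raw)

# The read returned the result of os.getcwd(), not a stored object.
print(value == os.getcwd())
```

Nothing in the read path looks suspicious in review: it is an ordinary SELECT followed by the ORM's normal conversion step.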
Why "latent" still matters
When we filed this, no in-tree model actually used
SerializedField with the default PICKLE type — every existing
field used JsonSerializedField. So in practice, today, an attacker
could not directly reach this via an HTTP endpoint.
That is exactly the kind of finding that gets dismissed as "not exploitable, don't bother". We argue Sebastion should still flag it, and the maintainers agreed. The reasons:
- The default behaviour of a security-sensitive helper sets the ceiling for every future caller. The next contributor who reaches for SerializedField with PICKLE (because it's the default and that's how Peewee fields work) inherits RCE-on-read semantics. They will not know, because the helper does not warn.
- The set of "attacker-controllable bytes in a database column" is larger than people think. It includes SQL injection elsewhere in the codebase, a compromised database credential, a backup restored from an untrusted source, a compromised replication peer, and a developer using a copy of production with seed data from an external contributor. Pickle in a DB column turns any of these from "bad" into "RCE on the application server".
- The fix is one line. When the cost of fixing is roughly zero and the cost of the latent bomb going off is RCE, the trade is easy.
This is the framework you want for every "is this finding worth filing?" question. Reachability is a useful prior, not a final answer.
How Sebastion surfaces it
When this code lands in a PR, Sebastion posts an inline review comment on the offending line. The comment includes:
- Severity: high (rising to critical when the path is demonstrably reachable from a request handler).
- Rule id: the canonical Sebastion rule for the pickle.loads family (the exact id is shaped like py.<rule-name> and is surfaced inline on the finding).
- CWE link: CWE-502.
- Prose explanation of why the default path is unsafe and what attacker capabilities turn it into a real exploit.
- A ready-to-apply fix, usually as a fenced diff block.
Because this is high severity, the chat-path suppression
(@sebastionai ignore) cannot silence it — that requires an
operator-added Learning with an audit trail. See severity
floor for why.
If you legitimately need the pickle path (e.g. you're round-tripping
trusted data with a strict allow-list), the right move is not to
suppress the finding from chat. Ask an operator to add a Learning
instead, so we can review the reasoning. In most cases, switching to
the restricted_loads pattern below is faster than the
back-and-forth.
The fix
- use_deserialize_safe_module = get_base_config(
-     'use_deserialize_safe_module', False)
- if use_deserialize_safe_module:
-     return restricted_loads(src)
- return pickle.loads(src)
+ return restricted_loads(src)

Three properties make this a clean fix:
- No new code, only removal. restricted_loads already existed in the same file as an Unpickler subclass that whitelists numpy and rag_flow and rejects everything else. The fix is "always use the helper that was already designed for this", not "introduce a new dependency".
- No behaviour change for legitimate callers. The pre-fix behaviour for callers who actually wanted safety (use_deserialize_safe_module = True) is preserved. The pre-fix behaviour for callers who left the default (use_deserialize_safe_module = False) is replaced with the safe path, which is what they would have got if they had read the RestrictedUnpickler docstring.
- The dead config flag goes with it. Removing the flag and the now-unused get_base_config import prevents future contributors re-introducing the bug by setting the flag back to False.
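For readers who have not seen the pattern, here is a minimal sketch of an allow-listing unpickler in the style of the "Restricting Globals" example in the Python pickle documentation. It is an illustration under assumptions, not RAGFlow's actual code: the `ALLOWED_TOP_LEVEL` set simply mirrors the two packages named above, and the error message is invented.

```python
import io
import os
import pickle

# Assumption: the top-level packages the application is willing to trust.
ALLOWED_TOP_LEVEL = {"numpy", "rag_flow"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Every global a pickle references is resolved through this hook.
        if module.split(".")[0] in ALLOWED_TOP_LEVEL:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Drop-in analogue of pickle.loads that enforces the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Exploit:
    def __reduce__(self):
        # Never executed here: the lookup of os.system is refused first.
        return (os.system, ("echo should-not-run",))


# Plain containers need no global lookup, so they still round-trip.
plain = restricted_loads(pickle.dumps({"ids": [1, 2, 3]}))

try:
    restricted_loads(pickle.dumps(Exploit()))
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print(f"Blocked: {exc}")
```

An allow-list keyed on the top-level package trusts every callable inside that package. A tighter variant would list specific (module, name) pairs, at the cost of more maintenance.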
Verifying the fix worked
The minimum bar for "did the fix actually fix it":
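import base64
import pickle
import posix  # Unix-only; os.system resolves to posix.system there

from api.utils.configs import deserialize_b64, serialize_b64

# Build a malicious pickle whose __reduce__ resolves to posix.system.
class Exploit:
    def __reduce__(self):
        return (posix.system, ('id',))

# Assumption: as its name suggests, deserialize_b64 base64-decodes its input
# before unpickling. Encode the raw pickle so the test exercises the
# unpickler instead of failing early at the decode step.
payload = base64.b64encode(pickle.dumps(Exploit()))

# Pre-fix: this would execute `id`.
# Post-fix: restricted_loads raises UnpicklingError.
try:
    deserialize_b64(payload)
except pickle.UnpicklingError as e:
    print(f"Blocked: {e}")

Then prove you have not broken legitimate use: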
import numpy as np

arr = np.array([1.0, 2.0, 3.0])
restored = deserialize_b64(serialize_b64(arr))
assert (restored == arr).all()

Both of these belong in your test suite alongside the fix, so the guarantee doesn't silently regress.
What this teaches about using Sebastion
Three meta-lessons that generalise beyond this single CWE:
- Latent does not mean ignorable. "Not reachable today" is a weak guarantee about tomorrow's behaviour. Fixing a one-line latent finding is usually cheaper than the audit conversation about whether it is reachable.
- high severity is intentionally hard to suppress. If you find yourself wanting to suppress a high finding from the chat path and getting blocked, that block is the product working as intended. Either fix the code or get an operator to review your suppression rationale.
- Adversarial reasoning belongs in the PR description. The maintainers will not necessarily share your threat model. The PR description for #14803 explicitly listed the two strongest objections (latent today; attacker already has DB write access) and argued why neither was sufficient to leave the bug. Sebastion structures PR bodies this way for exactly this reason.
Related
- False positives and Learnings: why high and critical cannot be suppressed from chat.
- Suppression strategy: what to do when you legitimately have a high finding that needs an exception.
- Sebastion's OSS contributions: every merged PR Sebastion has shipped to public repos, including this one.