This is perhaps insanely obvious, but it was a measurement I had to do and it might help you too if you use python-jsonschema a lot.
I have this project with a migration script that needs to transfer about 1M records from one PostgreSQL database to another: read them, transform them a bit, validate them, and store them. The validation step was done like this:
import os

import yaml
from jsonschema import validate

...

with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    SCHEMA = yaml.safe_load(f)["schema"]

...

class Build(models.Model):
    ...

    @classmethod
    def validate_build(cls, build):
        validate(build, SCHEMA)
That works fine when you have a slow trickle of these coming in, with seconds or minutes between them. But when you have to do about 1M of them, the per-call overhead starts to really matter. Granted, in this context it's just a migration which is hopefully only done once, but it helps if it doesn't take too long, since that makes it easier to avoid downtime.
What about python-fastjsonschema?
The name python-fastjsonschema just sounds very appealing, but I wasn't sure how mature it is or what the subtle differences are between it and the more established python-jsonschema, which I was already using.
It can be used in two ways. Either...
fastjsonschema.validate(schema, data)
...or...
validator = fastjsonschema.compile(schema)
validator(data)
That got me thinking: why don't I just do that with regular python-jsonschema! All you need to do is crack open the validate function, and you can re-use one instance for multiple pieces of data:
from jsonschema.validators import validator_for

klass = validator_for(schema)
klass.check_schema(schema)
instance = klass(schema)
instance.validate(data)
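Put together as a tiny runnable sketch (again with a made-up toy schema):

```python
from jsonschema import ValidationError
from jsonschema.validators import validator_for

# Toy schema, purely for illustration
schema = {
    "type": "object",
    "properties": {"version": {"type": "string"}},
    "required": ["version"],
}

klass = validator_for(schema)  # picks the validator class matching the schema's draft
klass.check_schema(schema)     # fail early if the schema itself is broken
instance = klass(schema)       # build the validator once...

instance.validate({"version": "57.0"})  # ...then reuse it for every record

try:
    instance.validate({})  # missing "version"
except ValidationError as e:
    print(e.message)
```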
I rewrote my project's code to this:
import os

import yaml
from jsonschema.validators import validator_for

...

with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    SCHEMA = yaml.safe_load(f)["schema"]

_validator_class = validator_for(SCHEMA)
_validator_class.check_schema(SCHEMA)
validator = _validator_class(SCHEMA)

...

class Build(models.Model):
    ...

    @classmethod
    def validate_build(cls, build):
        validator.validate(build)
How do they compare, performance-wise?
Let this simple benchmark code speak for itself:
from buildhub.main.models import Build, SCHEMA

import fastjsonschema
from jsonschema import validate
from jsonschema.validators import validator_for


def f1(qs):
    for build in qs:
        validate(build.build, SCHEMA)


def f2(qs):
    validator = validator_for(SCHEMA)
    for build in qs:
        validate(build.build, SCHEMA, cls=validator)


def f3(qs):
    cls = validator_for(SCHEMA)
    cls.check_schema(SCHEMA)
    instance = cls(SCHEMA)
    for build in qs:
        instance.validate(build.build)


def f4(qs):
    for build in qs:
        fastjsonschema.validate(SCHEMA, build.build)


def f5(qs):
    validator = fastjsonschema.compile(SCHEMA)
    for build in qs:
        validator(build.build)


import time
import statistics

functions = f1, f2, f3, f4, f5
times = {f.__name__: [] for f in functions}

for _ in range(3):
    qs = list(Build.objects.all().order_by("?")[:1000])
    for func in functions:
        t0 = time.time()
        func(qs)
        t1 = time.time()
        times[func.__name__].append((t1 - t0) * 1000)


def f(ms):
    return f"{ms:.1f}ms"


for name, numbers in times.items():
    print("FUNCTION:", name, "Used", len(numbers), "times")
    print("\tBEST  ", f(min(numbers)))
    print("\tMEDIAN", f(statistics.median(numbers)))
    print("\tMEAN  ", f(statistics.mean(numbers)))
    print("\tSTDEV ", f(statistics.stdev(numbers)))
Basically, for each of the alternative implementations, validate 1,000 JSON blobs (technically Python dicts) of around 1KB each, and repeat that 3 times.
The results:
FUNCTION: f1 Used 3 times
BEST 1247.9ms
MEDIAN 1309.0ms
MEAN 1330.0ms
STDEV 94.5ms
FUNCTION: f2 Used 3 times
BEST 1266.3ms
MEDIAN 1267.5ms
MEAN 1301.1ms
STDEV 59.2ms
FUNCTION: f3 Used 3 times
BEST 125.5ms
MEDIAN 131.1ms
MEAN 133.9ms
STDEV 10.1ms
FUNCTION: f4 Used 3 times
BEST 2032.3ms
MEDIAN 2033.4ms
MEAN 2143.9ms
STDEV 192.3ms
FUNCTION: f5 Used 3 times
BEST 16.7ms
MEDIAN 17.1ms
MEAN 21.0ms
STDEV 7.1ms
Basically, if you use python-jsonschema and create a reusable instance, it's 10 times faster than the "default way". And if you do the same but with python-fastjsonschema, it's 100 times faster.
By the way, version f5 validated 1,000 1KB records in 16.7ms. That's insanely fast!