Screen Scraping in 2025

Dr. Tony Przygienda left a very valid (off-topic) comment to my Breaking APIs or Data Models Is a Cardinal Sin blog post:

If, on the other hand, the customers would not camp for literally tens of years on regex scripts scraping screens, lots of stuff could progress much faster.

He’s right, particularly from Juniper’s perspective; they were the first vendor to use a data-driven approach to show commands. Unfortunately, we’re still not living in a perfect world:

  • Even the best vendors sometimes slip and create a show command that cannot produce JSON or XML output (because it’s faster to sprinkle printf statements throughout the code than doing the right thing). In those cases, screen scraping (collecting the results of a show command and trying to extract interesting bits of data from them) is the only way to go.
  • Many vendors added JSON/XML output as an afterthought, and numerous show commands still cannot generate outputs in one of those formats.
  • Vendors that generate JSON/XML “by hand” (instead of dumping a data structure that was used to generate the show printout) sometimes produce invalid JSON/XML data [1].
  • There are still vendors that haven’t gotten the “JSON is the new SNMP” memo ;)

If you have to implement screen scraping for some devices, you might decide to do it for everything you have to work with as the least common denominator (and the least amount of headache).

However, let’s be positive and assume we want to Do the Right Thing (as opposed to Getting the Job Done and Having a Beer). Some devices can generate structured data in JSON or XML format, others support only JSON or only XML, and some can convert from internal XML representation into JSON with side effects that border on hilarious (unless you have to deal with them).

Structured data is great; every bit of data is properly named/tagged. It’s also bloated. A friend of mine once told me that fetching 100K routes from a device results in 4MB of text, a 100 MB JSON object, or a 500 MB XML object. Parsing a 500 MB XML object might take a bit longer than screen-scraping the text printout.

Speaking of XML: working with XML is a nightmare because you never know whether you’re dealing with lists or dictionaries, and it gets way worse when namespaces are involved. Compare that with the single json.load call you need to get JSON data into a usable data structure. Nobody in their right mind wants to touch XML (unless there is no other option), and finding a programmer who can deal with XML in Python is probably about as easy as finding a COBOL programmer.
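To make the contrast concrete, here’s a minimal sketch (with made-up data and a hypothetical namespace URI): JSON turns into a native data structure in one call, while XML forces you to decide what’s a list and to carry the namespace through every lookup:

```python
import json
import xml.etree.ElementTree as ET

# JSON: one call, and you have native dicts/lists
data = json.loads('{"interfaces": [{"name": "Ethernet1", "up": true}]}')
print(data["interfaces"][0]["name"])  # Ethernet1

# XML: is <interface> a list or a single element? You have to decide,
# and the namespace must be spelled out on every lookup.
doc = ET.fromstring(
    '<rpc-reply xmlns:if="urn:example:interfaces">'
    '<if:interface><if:name>Ethernet1</if:name></if:interface>'
    '</rpc-reply>'
)
ns = {"if": "urn:example:interfaces"}
names = [e.text for e in doc.findall("if:interface/if:name", ns)]
print(names)  # ['Ethernet1']
```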

Working with large JSON objects is no walk in the park, either. Parsing the 100 MB JSON object mentioned above will take a while and result in a data structure that’s at least as large. As anyone who ever had to parse a 500 MB XML object knows, there’s the Right Way of parsing large objects: use a generic JSON/XML parser as a framework and use callbacks/hooks to collect/analyze/store data on the fly (as they’re parsed) without ever generating the final data structure. Unfortunately, that’s not a very common skill either. Most programmers were never forced to look beyond json.load.
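For the record, the streaming approach is available in Python’s standard library; here’s an illustrative sketch using xml.etree.ElementTree.iterparse over a made-up route-table schema, where each record is processed and discarded as it arrives, so the full tree is never built:

```python
import io
import xml.etree.ElementTree as ET

# Pretend this is a multi-hundred-MB routing-table dump (hypothetical schema)
huge_xml = io.BytesIO(
    b"<routes>"
    + b"".join(b"<route><prefix>10.0.%d.0/24</prefix></route>" % i for i in range(3))
    + b"</routes>"
)

count = 0
for event, elem in ET.iterparse(huge_xml, events=("end",)):
    if elem.tag == "route":
        count += 1      # collect/analyze/store the record here...
        elem.clear()    # ...then drop it so memory consumption stays flat
print(count)  # 3
```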

Let’s ignore the details and assume we got the structured data parsed somehow. Now we must navigate those data structures, and things quickly become as “easy” as navigating SNMP MIBs (or reading a James Joyce novel). Here’s what I had to deal with to find out if my device had BFD running for a BGP session:

vrfs.default.ipv4Neighbors["10.1.0.1"].peers.Ethernet1.types.normal.peerStats["10.1.0.2"].status == "up"
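One way to keep an expression like that from blowing up on the first missing key is a small path-walking helper. This is just an illustrative sketch (the dig helper and the abbreviated data structure are made up for the example):

```python
from functools import reduce

# Abbreviated, made-up slice of the EOS-style structure from the path above
state = {"vrfs": {"default": {"ipv4Neighbors": {"10.1.0.1": {
    "peers": {"Ethernet1": {"types": {"normal": {
        "peerStats": {"10.1.0.2": {"status": "up"}}}}}}}}}}}

def dig(obj, *path, default=None):
    """Walk a nested structure one key at a time; return default on any miss."""
    try:
        return reduce(lambda o, k: o[k], path, obj)
    except (KeyError, IndexError, TypeError):
        return default

bfd_up = dig(state, "vrfs", "default", "ipv4Neighbors", "10.1.0.1",
             "peers", "Ethernet1", "types", "normal",
             "peerStats", "10.1.0.2", "status") == "up"
print(bfd_up)  # True
```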

Not good enough? Go play with Nexus OS, where every single interesting bit of information is prefixed by something like:

TABLE_vrf.ROW_vrf.TABLE_addrf.ROW_addrf.TABLE_prefix.ROW_prefix
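If you have to live with that, a recursive helper that strips the TABLE_/ROW_ wrappers (and papers over the quirk that a single row often arrives as a dict instead of a one-element list) makes the result navigable. A hypothetical sketch with made-up data:

```python
def unwrap(obj):
    """Recursively replace TABLE_x/ROW_x wrappers with plain lists."""
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            if key.startswith("TABLE_") and isinstance(value, dict):
                rows = value.get("ROW_" + key[len("TABLE_"):], [])
                if not isinstance(rows, list):  # a single row arrives as a dict
                    rows = [rows]
                out[key[len("TABLE_"):]] = [unwrap(r) for r in rows]
            else:
                out[key] = unwrap(value)
        return out
    if isinstance(obj, list):
        return [unwrap(x) for x in obj]
    return obj

# Made-up fragment mimicking the structure above
nxos = {"TABLE_vrf": {"ROW_vrf": {
    "vrf-name-out": "default",
    "TABLE_prefix": {"ROW_prefix": [{"ipprefix": "10.0.0.0/24"},
                                    {"ipprefix": "10.0.1.0/24"}]}}}}

clean = unwrap(nxos)
print(clean["vrf"][0]["prefix"][1]["ipprefix"])  # 10.0.1.0/24
```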

Finally, there’s the whole morass of what is in that structured data. Everyone will gladly tell you which IETF YANG models their boxes support, while forgetting to mention that you cannot get all the information you need from the standardized models [2] and that even the augmented data model does not contain everything you can get with a show command.

OK, so let’s forget the standardized data models and be happy that the devices we work with can produce some structured data… if only we knew how to fetch it. Everyone seems to support some variant of the show X | format json command that sends you the desired results in a structured format over an SSH session. There’s just a tiny little gotcha: at least one vendor forgot to do control-plane prioritization for SSH data. Fetching a large routing table can kill LACP sessions.

Back to the drawing board: we’ll have to use the management API our vendor decided to embrace. Some use NETCONF, others use REST API, and while NETCONF is pretty standard (but uses XML; see above), vendor-specific REST API could be anything. However, most vendors implemented NETCONF over SSH (while the rest of the world uses HTTP-based API) because the SSH hammer was conveniently close, and while there’s a Python ncclient library you can use to implement a NETCONF client, you won’t find many Python programmers who know how to use it. Oh, and do I have to mention that some vendors happily provide results of show commands via NETCONF as an XML-encapsulated text string?
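When the “structured” reply is just CLI text wrapped in an XML tag, you end up running regexes anyway, only now with an XML parser bolted on top. A sketch with a made-up reply (the output element name and the interface text are hypothetical):

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical NETCONF reply: the "structured" answer is just CLI text in a tag
reply = """<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <output>Interface Ethernet1 is up, line protocol is up
  5 minute input rate 1000 bits/sec</output>
</rpc-reply>"""

ns = {"nc": "urn:ietf:params:xml:ns:netconf:base:1.0"}
text = ET.fromstring(reply).find("nc:output", ns).text

# ...and we're back to screen scraping the embedded text
match = re.search(r"Interface (\S+) is (\S+),", text)
print(match.group(1), match.group(2))  # Ethernet1 up
```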

With all that being said, do you still wonder why some people stubbornly use screen scraping in 2025? As a final nail in this coffin, let’s add insult to injury: I encountered at least one vendor (and heard of another one) that made breaking changes in their structured data while keeping the text printout mostly intact.

It’s easy to go to a World Congress and solve all the networking problems with PowerPoint. The real world is much messier, and every (supposed) attempt to make it more ordered usually ends up wasting energy and generating more entropy (see also: how standards proliferate).


  1. It could be as bad as not escaping single and double quotes in interface descriptions. ↩︎

  2. Remember the days of enterprise SNMP MIBs? We reinvented them in the brave new YANG/XML/JSON world (and renamed them to augmentation), yet again proving the infinite wisdom of RFC 1925 rule 11. ↩︎


3 comments:

  1. That's why I absolutely love EOS, no need to use text_fsm or ntc_templates, just run the command, get output and json.load it :-)

  2. Hi Ivan. Thank you for another great article.

    I mostly agree with your main point: scripting a regex parser for a single use case will often be simpler to get the job done, but it is not scalable to a complex network (with its various and changing vendors/hw/versions) and thus “not the right thing”. But I have to react to some of the more specific points you made.

    1/ XML and JSON are different in nature and both have pros and cons. I believe XML is better than JSON for serializing a complex router configuration, because its tree-like structure better maps to a router configuration model, and it supports namespaces (essential when dealing with native and IETF/OpenConfig models) and tag attributes: you can't even put metadata on a JSON object (“dictionary”), except through a hack such as prepending a sub key/value with “@”. Anyway, you'd rather use XML than JSON with NETCONF, because the former was the initial serialization language, while serious support for the latter came only later (and with the known problems you mentioned) (NB: not an intrinsic argument, but a pragmatic one).

    2/ Yes, if models are poorly structured/written and/or overly verbose, it will be a treasure hunt to find the proper request to get exactly the information you need (NB: no, ChatGPT cannot do that, at least not for SROS), and yes, it is easier to find it using show commands you have already mastered and some regexes. You cannot work as comfortably with protocols and languages like NETCONF/XML/JSON/YANG as with show commands, simply because the former were designed for machine-to-machine communication, while the latter are made for humans. You need tooling to work smartly with those automation things (parsers, requesters, explorers, request generators, etc.).

    3/ A show command does not always have a Model-Driven equivalent because it might be a compilation of diverse elements put on a table for an operator to read with his eyes and process with his brain. Cf. my previous remark on machine-to-machine.

    4/ Yes, the IETF has biases, but HTTP/JSON/Python is not the only hammer out there ;-) Sure, HTTP/TLS could have been chosen instead of ssh. I guess people at the IETF (but also network engineers and technicians) were simply more familiar with ssh than with HTTP, and back in 2006 (RFC 4741) HTTP was not as ubiquitous as it is today. ssh is still ubiquitous for system and router management, though, and does not limit NETCONF.

    Alternatively, gRPC-based protocols such as gNMI (a NETCONF competitor) are based on HTTP/2 and TLS, and work with the same YANG models. BTW, Protobuf (gRPC's serialization format) is described as “XML, but smaller, faster, and simpler”; it gets compared to XML, not JSON… food for thought. Don't even mention RESTCONF; the REST hammer is not suited for the job (bye-bye transactions).

    5/ Yes, a vendor might make some breaking change to its YANG model while leaving its show command intact. That can be anticipated by reading the YANG changes (I agree that in some intricate models this is no simple task). The other way around, if the show command changes, good luck anticipating that: are you running some genAI to parse the vendor PDFs? It might not even be listed anywhere other than a developer's post-it note.

    6/ Yes, NETCONF/YANG borrow from SNMP/MIBs, but overcome many of their limitations, see rfc3535 for reference.

    PS: Oops, sorry, made an extensive comment again... :-) PS2: answering on the NAT articles soon.

    Replies
    1. Hi Bob, thanks a million for such a detailed comment. Love it, and agree with quite a few things you wrote.

      However, as much as I like XML (I'm old enough to be using it before people started calling JavaScript objects JSON), you missed my point: it does not matter how good a technology is if people don't know how to use it, or if they're used to something else. VAX/VMS was the best operating system I've seen, and I don't think anyone remembers what it was. It was just too different from what people were used to.

      It's the same with XML/NETCONF and protobufs. It doesn't matter if they are orders of magnitude better than JSON/HTTP stuff if they're hard to use or unfamiliar to the people you can hire to get the job done. IETF insistence on using what they find convenient as opposed to what the consumers (= developers) prefer is just the icing on the cake (not to mention the irresistible urge to invent yet another schema language).

      Just my cranky perspective after having to deal with way too many "best" technologies for a lifetime ;)

    2. Hi. I confess: I have a bias towards what I believe to be the best solution as opposed to the most easily available one. Sometimes I also feel like the IETF has a not-invented-here complex, and when new ideas are discussed I do not see much effort going into searching the already available tools and practices (I see the same phenomenon with some of my experienced colleagues, who still consider that they are doing network engineering, not computer science… they built a templating engine and language less than 10 years ago… Jinja2, anyone?!). History is full of (supposedly) better solutions eclipsed by more pragmatic, easier-to-start-with alternatives (Linux instead of the yet-to-come GNU Hurd, Python rather than lower-level languages despite the generally poorer performance, not to mention the HTTP hegemony, to name a few).

      XML might be considered old school, but it is still dominant for document serialization, for example (I do not think there is a serious JSON equivalent of EPUB! might be wrong though). Protobuf and gRPC, though, are everywhere in k8s and other complex distributed systems, but rather on the “backend side”.

      And yeah, never heard of VAX/VMS!

      Cheers.

  3. Hi.

    imho the whole NETCONF ecosystem primarily suffers from a tooling problem. Or I haven't found the right tools yet.

    ncclient is, as you mentioned somewhere else, an underdocumented mess. And the documentation that does exist is not even up to date (see https://github.com/ncclient/ncclient/issues/374#issuecomment-595092038). The commit hash at the bottom of the docs page is from 2020… I am amazed that so many people got it working well enough to depend on it in their applications.

    The tools for browsing and inspecting YANG models are also not really there. Everything is abandoned. Cisco has already killed/abandoned two of their projects: https://github.com/CiscoDevNet/yang-explorer is dead, and yangsuite fails to install and has outdated docs (https://developer.cisco.com/docs/yangsuite/welcome-to-cisco-yang-suite/#python-virtualenv-installation talks about Python 3.6 to 3.8, the README at https://github.com/CiscoDevNet/yangsuite is stuck at 3.9+ with 3.10 recommended, and installation fails on more modern versions; all the systems I am working on ship 3.11 or newer by default now). The Docker container was also an insane amount of pain until it worked, and of course the code is not on GitHub (the repo there is just a dummy repo for the Docker container). All I was looking for was a nice way to display what my device supports, and to display and edit the values so that I can see what I am doing in a sea of XML.

    There have been a couple other tools that I have found which were dead even longer.

    Am I missing some nice tool? What are people actually using to get things done with NETCONF?
