How “Validator.nu As A Web Service” Can Help You Compile Your Code

Aug 3, 2021

As we all know, HTML is a markup language, not a programming language. It is parsed and interpreted by the web browser, and no compiler is needed for HTML.
“But still, what if we want to compile our HTML code? Is it possible?”
The answer is yes: we can do it by using “Validator.nu as a web service”.
Before that, let us cover some basic details about compilers and the things required to compile code, because I believe that when you understand what something is, you are better able to figure out how to use it.

“Humans can only understand high-level languages, which are called source code. Computers, on the other hand, can only understand programs written in binary languages, so either an interpreter or compiler is required.”

What is a compiler?
A compiler is computer software that translates (compiles) source code written in a high-level language (e.g., C++) into a set of machine-language instructions that can be understood by a digital computer’s CPU. Compilers are very large programs, with error-checking and other abilities. Some compilers translate the high-level language into an intermediate assembly language, which is then translated (assembled) into machine code by an assembly program, or assembler. Other compilers generate machine language directly. The term compiler was coined by American computer scientist Grace Hopper, who designed one of the first compilers in the early 1950s.
When the source code contains errors, the compiler reports them, with line numbers, at the end of compilation. The errors must be fixed before the compiler can successfully compile the source code.

Warnings:-
Warnings are diagnostic messages about constructions that are not errors, but are often associated with errors. For example, when a variable is declared but never used in a program, it is possible that the programmer overlooked something, so the compiler could issue a warning when it finds such a construction.

A Bit About the Compile and Link Process

When your program is compiled, the very first step the compiler takes is to run the preprocessor, which processes all of the preprocessor directives: those lines that begin with the pound sign ‘#’.

If you are a programmer, you probably know the basics of the C language. Let’s take a look at a basic C program to better understand the compile process:

#include <stdio.h>

int main(void)
{
    printf("Hello world\n");
    return 0;
}

You may have been told that you need that first line, the #include directive (#include <stdio.h>), but you may not really know why it is there.
The reason is that the printf() function is declared in the header stdio.h, and in order for the compiler to check that you are calling it correctly in line 5, it needs to compare the declaration of the printf() function with its use. The compiler just needs to check things such as the number of parameters, their types, and the return value of the function.

The #include directive literally copies the entire contents of stdio.h into your program, starting at line 1. This is done by a program called the C preprocessor. The call to printf() itself must be resolved by the linker.

The linker is a separate program that runs after the compiler. It looks at all of the unresolved symbols in the program, such as printf(), and tries to resolve them by looking up their locations in the software libraries that the program needs.

The C Standard I/O Library is special because it is used in almost every C program, and therefore many C implementations include it within the C runtime library. This makes it possible for the linker to find it easily.

The same discussion would apply if you wrote the above program in C++, as in:

#include <iostream>

int main()
{
    std::cout << "Hello world\n";
    return 0;
}

Only, instead of using the stdio.h header file, it would use the iostream header file, and the iostream library in C++ instead of the C Standard I/O Library.

In summary, a header file contains declarations that the compiler needs, but not implementations. The corresponding library file has those.

“The compiler needs the header files but not the libraries; the linker needs the libraries, not the header files.”

When you run gcc, it usually performs preprocessing, compiling, assembly and linking. There are options to control which of these steps are performed. gcc can also look at the file extension for guidance as to which compiler to use and what kind of output to generate.

What is an Interpreter?:-
An interpreter reads an executable source program written in a high-level programming language as well as data for this program, and it runs the program against the data to produce some results.
In other words, an interpreter transforms or interprets high-level program code into code that can be understood by the machine (machine code) or into an intermediate language that can be easily executed as well. One example is the Unix shell interpreter, which runs operating system commands interactively.

Note that both interpreters and compilers (like any other program) are written in some high-level programming language (which may be different from the language they accept) and they are translated into machine code. For example, a Java interpreter can be written completely in C, or even in Java. The interpreter source program is machine independent since it does not generate machine code. (Note the difference between generating machine code and being translated into it.) A program run by an interpreter is generally slower than the same program compiled, because the interpreter processes and interprets each statement as many times as that statement is evaluated. For example, when a for-loop is interpreted, the statements inside the loop body are analyzed and evaluated on every loop step.

Some languages, such as Java and Lisp, come with both an interpreter and a compiler. Java source programs (Java classes with the .java extension) are translated by the javac compiler into byte-code files (with the .class extension). The Java interpreter, called the Java Virtual Machine (JVM), may interpret the byte codes directly or may internally compile them to machine code and then execute that code (JIT: just-in-time compilation).
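The point about loop bodies being analyzed again on every step can be made concrete with a minimal Python sketch. Python is used here only for illustration, and the names SOURCE, interpret_each_time and compile_once are hypothetical. The first function hands the source text to eval on every loop step, so it is parsed again each time; the second translates it once with the built-in compile (to Python bytecode, not machine code) and then only executes the result.

```python
SOURCE = "x * x + 1"  # a tiny "program" kept as source text


def interpret_each_time(values):
    """Parse and evaluate the source text again on every loop step."""
    return [eval(SOURCE, {}, {"x": v}) for v in values]


def compile_once(values):
    """Translate the source to a code object once, then only execute it."""
    code = compile(SOURCE, "<expr>", "eval")
    return [eval(code, {}, {"x": v}) for v in values]
```

Both functions return the same results; the difference is only in how often the source text has to be analyzed.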

Does HTML Have a Compiler or an Interpreter?:-
There is no interpreter or compiler for HTML. HTML is a language used to design web pages, so it is NOT a programming language. Compilers and interpreters are used with programming languages to convert a high-level language (e.g., C, C++, Java) into a low-level language (i.e., assembly or machine code).
Browsers do, however, contain something similar to an interpreter, called a parser. The parser identifies the various tags, and the browser displays them accordingly; the result depends on how each tag is rendered in that browser. The browser component that performs this task is called the rendering engine.

“The browser just treats tags as symbols for formatting the page. For example, when the browser finds <i>, it formats the text inside it as italic.”

How Browsers Work:-
The browser’s main functionality is to present the web resource you choose, by requesting it from the server and displaying it in the browser window. The resource format is usually HTML, but can also be PDF, an image and more. The location of the resource is specified by the user using a URI (Uniform Resource Identifier).

The way the browser interprets and displays HTML files is specified in the HTML and CSS specifications. These specifications are maintained by the W3C (World Wide Web Consortium) organization, which is the standards organization for the web.
The browser’s main components are:

The user interface: this includes the address bar, back/forward buttons, bookmarking menu etc. It is every part of the browser display except the main window where you see the requested page.
The browser engine: the interface for querying and manipulating the rendering engine.
The rendering engine: responsible for displaying the requested content. For example, if the requested content is HTML, it is responsible for parsing the HTML and CSS and displaying the parsed content on the screen.
Networking: used for network calls, like HTTP requests. It has a platform-independent interface, with a separate implementation underneath for each platform.
UI backend: used for drawing basic widgets like combo boxes and windows. It exposes a generic interface that is not platform specific. Underneath, it uses the operating system’s user interface methods.
JavaScript interpreter: used to parse and execute the JavaScript code.
Data storage: this is a persistence layer. The browser needs to save all sorts of data on the hard disk, for example, cookies. The new HTML specification (HTML5) defines a ‘web database’, which is a complete (although light) database in the browser.

Webkit main flow:-

Parsing — general

Parsing a document means translating it to some structure that makes sense — something the code can understand and use. The result of parsing is usually a tree of nodes that represent the structure of the document. It is called a parse tree or a syntax tree.

Example: parsing the expression “2 + 3 - 1” could return a tree whose leaves are the numbers and whose inner nodes are the operators.
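One possible tree for this expression can be sketched in a few lines of Python. This is a minimal illustration, not how any browser parses: it assumes only integers and the operators + and -, groups them left to right, and represents each inner node as a hypothetical (operator, left, right) tuple.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[+-])")


def tokenize(text):
    """Split an expression such as '2 + 3 - 1' into numbers and operators."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Build a left-associative tree of (operator, left, right) tuples."""
    tokens = tokenize(text)
    if not tokens or not tokens[0].isdigit():
        raise SyntaxError("expression must start with a number")
    tree = int(tokens[0])
    rest = tokens[1:]
    if len(rest) % 2:
        raise SyntaxError("operator without a right-hand operand")
    for op, operand in zip(rest[0::2], rest[1::2]):
        if op not in ("+", "-") or not operand.isdigit():
            raise SyntaxError("expected an operator followed by a number")
        tree = (op, tree, int(operand))
    return tree


def evaluate(tree):
    """Walk the tree bottom-up and compute its value."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    return evaluate(left) - evaluate(right)
```

Under these assumptions, parse("2 + 3 - 1") yields a "-" node whose left child is the "+" node for 2 and 3 and whose right child is the leaf 1. Invalid input such as "2 +" raises SyntaxError, which is the kind of strictness HTML parsers deliberately avoid.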

Grammars:-

Parsing is based on the syntax rules the document obeys, that is, the language or format it was written in. Every format you can parse must have a deterministic grammar consisting of vocabulary and syntax rules. Such a grammar is called a context-free grammar. Human languages are not such languages and therefore cannot be parsed with conventional parsing techniques.

Compilation flow:-

HTML DTD:-

The HTML definition is in DTD format. This format is used to define languages of the SGML family. It contains definitions for all allowed elements, their attributes and hierarchy. Note that the HTML DTD doesn’t form a context-free grammar.

DOM:-
The output tree (the “parse tree”) is a tree of DOM element and attribute nodes. DOM is short for Document Object Model. It is the object presentation of the HTML document and the interface of HTML elements to the outside world, such as JavaScript.

The root of the tree is the “Document” object.

The DOM has an almost one-to-one relation to the markup. For example, take this markup:

<html>
  <body>
    <p>
      Hello World
    </p>
    <div> <img src="example.png"/></div>
  </body>
</html>
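The near one-to-one relation between markup and tree can be sketched with Python’s standard html.parser module. This is only an illustration under stated assumptions: the dictionaries below are a hypothetical stand-in for DOM nodes, the "#document" root label is made up for the sketch, and the sample markup is a small document shaped like the one above.

```python
from html.parser import HTMLParser

MARKUP = """
<html>
  <body>
    <p>Hello World</p>
    <div> <img src="example.png"/></div>
  </body>
</html>
"""


class TreeBuilder(HTMLParser):
    """Build nested {'tag', 'attrs', 'children'} dictionaries from markup."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Keep text nodes, skipping whitespace-only runs between tags.
        if data.strip():
            self._stack[-1]["children"].append(data.strip())


def build_tree(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return one indented line per node; text nodes appear in quotes."""
    if isinstance(node, str):
        return ["  " * depth + repr(node)]
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines
```

Printing "\n".join(outline(build_tree(MARKUP))) shows each element nested under its parent, with the text as a child of the p element and the img element as a child of the div.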

Browsers’ error tolerance:-
You never get an “Invalid Syntax” error on an HTML page. Browsers fix any invalid content and go on.

Take this HTML for example:

<html>
  <mytag>
  </mytag>
  <div>
  <p>
  </div>
    Really lousy HTML
  </p>
</html>

I must have violated about a million rules (“mytag” is not a standard tag, the “p” and “div” elements are wrongly nested, and more), but the browser still shows it correctly and doesn’t complain. So a lot of the parser code is devoted to fixing the HTML author’s mistakes.
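A checker takes the opposite approach and reports such mistakes instead of silently repairing them. The following minimal Python sketch shows the idea; it is not the algorithm used by any browser or by Validator.nu. The class name NestingChecker and the KNOWN_TAGS set (a tiny illustrative subset, not the HTML specification) are assumptions made for this example. Python’s standard html.parser is itself lenient, so it reads the lousy markup without raising an error.

```python
from html.parser import HTMLParser

# Illustrative subset only; the real list of HTML elements is much longer.
KNOWN_TAGS = {"html", "head", "body", "div", "p", "img"}

LOUSY_HTML = """
<html>
  <mytag>
  </mytag>
  <div>
  <p>
  </div>
    Really lousy HTML
  </p>
</html>
"""


class NestingChecker(HTMLParser):
    """Collect problems instead of raising on the first one."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            self.problems.append(f"unknown element <{tag}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        self.problems.append(f"</{tag}> does not match the innermost open element")
        if tag in self.open_tags:
            # Recover like a tolerant parser: close everything up to the match.
            while self.open_tags.pop() != tag:
                pass


def find_problems(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems
```

Calling find_problems(LOUSY_HTML) reports the unknown mytag element and the two end tags that arrive in the wrong order, while still reading the document to the end.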

Let’s see some WebKit error tolerance examples:

</br> instead of <br>

Some sites use </br> instead of <br>. In order to be compatible with IE and Firefox, WebKit treats this like <br>.

The code:

if (t->isCloseTag(brTag) && m_document->inCompatMode()) {
    reportError(MalformedBRError);
    t->beginTag = true;
}

Note: the error handling is internal; it won’t be presented to the user.

CSS Parsing:-

Validator.nu:-

Validator.nu is an HTML checker: it validates the markup of a website using the Nu Html Checker.
Validator.nu can be called as a Web service. Input and output modes can be chosen completely orthogonally. Requests and responses can optionally be compressed (independently of each other).
Serving valid HTML is commonly overlooked these days. Running HTML documents through a checker makes it easier to catch unintended mistakes that might otherwise be missed. Adhering to the W3C’s standards has a lot to offer both developers and web users: it provides better browser compatibility, helps to avoid potential problems with accessibility and usability, and makes future maintenance easier.
The Nu Html Checker (v.Nu) serves as the backend of html5.validator.nu and validator.w3.org/nu. It also provides a web service interface.

Validator.nu has two facets: generic (complex UI) and (X)HTML5 (simple UI).

In the (X)HTML5 facet, the parser and the schema will be chosen based on the HTTP Content-Type of the document. In the generic facet, the parser will be chosen based on the HTTP Content-Type and a preset schema will be chosen based on the root namespace (for XML) or the doctype (for text/html).

Input Modes:-
For most Web service use cases, you should probably POST the document as the HTTP entity body.

Implemented

Document URL as a GET parameter; the service retrieves the document by URL over HTTP or HTTPS.
Document POSTed as the HTTP entity body; parameters in query string as with GET.
Document POSTed as a textarea value.
Document POSTed as a form-based file upload.

Not Implemented

Document in a data: URI as a GET parameter.
application/x-www-form-urlencoded

Output Modes:-

When using Validator.nu as a Web service back end, the XML and JSON output formats are recommended for forward compatibility. The available JSON tooling probably makes consuming JSON easier. The XML format contains XHTML elaborations that are not available in JSON. Both formats are streaming, but streaming XML parsers are more readily available. XML cannot represent some input strings faithfully.

Implemented

HTML with microformat-style class annotations (default output; should not be assumed to be forward-compatibly stable).
XHTML with microformat-style class annotations (append &out=xhtml to URL; should not be assumed to be forward-compatibly stable).
XML (append &out=xml to URL).
JSON (append &out=json to URL).
GNU error format (append &out=gnu to URL).
Human-readable plain text (append &out=text to URL; should not be assumed to be forward-compatibly stable for machine parsing; use the GNU format for that).

Not Implemented

Relaxed-compatible (lacks a spec)
Unicorn-compatible (hoping that Unicorn changes instead)
W3C Validator-compatible SOAP (legacy)
EARL (not implemented; domain modeling mismatch)
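As a rough illustration of consuming the JSON output, the Python sketch below summarizes a response. The sample response is hand-written for this example, not captured from the service, and the field names it uses ("messages", "type", "subType", "lastLine", "message") reflect my understanding of the checker’s JSON format; treat them as assumptions and verify them against the service documentation. The function name summarize is hypothetical.

```python
import json
from collections import Counter

# Hand-written sample; real responses carry more fields and real message text.
SAMPLE_RESPONSE = """
{
  "messages": [
    {"type": "error", "lastLine": 3, "lastColumn": 12,
     "message": "Example error text."},
    {"type": "info", "subType": "warning", "lastLine": 1, "lastColumn": 6,
     "message": "Example warning text."}
  ]
}
"""


def summarize(response_text):
    """Return (counts per kind, one formatted line per message)."""
    messages = json.loads(response_text).get("messages", [])
    counts = Counter()
    lines = []
    for msg in messages:
        kind = msg.get("type", "unknown")
        # Assumption: warnings arrive as type "info" with subType "warning".
        if kind == "info" and msg.get("subType") == "warning":
            kind = "warning"
        counts[kind] += 1
        lines.append(f"{kind} (line {msg.get('lastLine', '?')}): {msg.get('message', '')}")
    return counts, lines
```

A caller could treat a non-zero error count as a failed “compile”, in the same way a build stops when the compiler reports errors.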

To use Validator.nu as a Web service by POSTing an entity body, the client issues an HTTP request either to https://validator.nu/ or https://html5.validator.nu/ using the POST method. The document to check is included as the entity body of the request. The Content-Type request header must be used to communicate the MIME type of the entity body. Common parameters are encoded as query string parameters.
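The request described above can be sketched with Python’s standard library. This is a minimal example under stated assumptions, not an official client: the function names build_request and check are hypothetical, the document is assumed to be UTF-8 encoded text/html, and the service may apply its own limits or reject some clients. The first function only prepares the request, so it can be inspected without any network access.

```python
import json
import urllib.parse
import urllib.request

SERVICE_URL = "https://validator.nu/"


def build_request(document, out="json", content_type="text/html; charset=utf-8"):
    """Prepare (but do not send) a POST with the document as the entity body."""
    # Common parameters travel in the query string, e.g. ?out=json.
    url = SERVICE_URL + "?" + urllib.parse.urlencode({"out": out})
    return urllib.request.Request(
        url,
        data=document.encode("utf-8"),
        # The Content-Type header tells the service the MIME type of the body.
        headers={"Content-Type": content_type},
        method="POST",
    )


def check(document, timeout=30):
    """Send the request and return the decoded JSON response (needs network)."""
    request = build_request(document)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling check("<!DOCTYPE html><title>Test</title><p>Hello") would send the page to the service and return its messages as a Python dictionary, provided the service is reachable.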

Sample Application Screenshots:-