Live 1057

Free Internet Radio Stations
Improving Intermediate Codes – Computerphile

Improving Intermediate Codes – Computerphile


In what we’ve done so far we’ve ranged
over quite an area, because just the act of using the T-diagrams has forced us
to address lots of interesting and profound questions, over the years, about
how compilers actually work. So we’ve looked at the fact that we think of our
programs being written in a high-level language and the brain goes blurry. We
neglect to think, all the time, that although you they were written in C it
doesn’t execute directly. You have to compile the C into binary. So, really, your
beautiful program runs on a suitable architecture in a suitable binary. It has
an input, it has an output. Then we started looking at more advanced things
about … saying: “Well if that’s how a C compiler works why not write another C
compiler, in C, and then compile it with the old compiler?” And I think we just
about emerged from that with brain intact. If you want to go back and revise
some of this stuff before catching up with what we’re going to do today I
would recommend the one which we’ll put a link out to: ‘Self-compiling
Compilers”. There’s another one we put out also, on the UNCOL problem which was this
business about: “Is there a common universal intermediate code?” There’s more
of a consensus now than there used to be. And the LLVM system is a good example. It
won the Turing ACM Award for – not quite for ‘best intermediate code ever’ – but you know …
The reason, I think, that it became less of a problem – not totally solvable
but less of a problem – thinking about this the other night, is that
actually, over the years, certain things about computer architectures have
converged and moved together to consensus. And principally it seems to me
is the idea that your unit of discourse isn’t just the bit, it’s the byte. And the
idea of having, y’know, 8-bit ASCII characters fit into a byte. The idea
of being able to ‘glue two bytes together’ to make a 16-bit entity, or four to make a
32-bit entity. That has become more and more and more
prevalent. The reason why in some of the other videos I’ve said: “Oh! you know there
was such massive differences between machines” There were! In the 1970s. I can
quote [this] to you as a fact. Because there I was at the University of Nottingham working
on a machine with 24-bit words. Not byte addressed at all. What did you put inside
your 24-bit words if you were mad keen on characters? Four 6-bit characters. How
did you dig them out from that word? With great difficulty(!) The word-addressed
machine would give you the whole word. It was your responsibility with bit-shift
operations and a garden spade (more or less!) to dig out four 6-bit characters.
Non-standard, not 8-bit characters. And everybody said: “Ah well, it’s all right for IBM
but it’s so much more *expensive* building byte-addressed machines!” But in the end it
prevailed. And I do believe that things like that about the fundamental way you
address memory and being able to, sort of, put units together, preferably in powers
of two not multiples of two. So, right at the beginning I had a 24-bit
word; Atlas had a 48-bit word; DEC 10s, I think, had a 36-bit word. And Seymour
Cray, CDC, had a 60 bit word. All of those are multiples of 2, but I don’t think
any of them, that I’ve just quoted, are powers of 2. And it really mattered. it
turned out to have a power-of-two basic unit – bigger than a bit – 8-bit bytes.
So, anyway, I use that to introduce the idea of intermediate codes but we’re now
I think – in tying things up – in a situation to revisit that now and say:
“Intermediate codes really are useful”. and come into their own, if you like, with the
idea of wanting to port a compiler from one architecture, perhaps to a very very
different architecture. What I call the ‘semantic gap’ between your high-level
language, and your program eventually that runs on binary,
is HUGE! It’s not so huge in C, you can feel the assembler and the binary
poking up through the C, But you start trying to do a Haskell interprete,r or
compiler, and you’ll soon discover that this thing running down here is – so to
speak – miles away from the abstract stuff you were writing up at the top.
So, everybody started saying don’t we really need intermediate codes to help us bridge the gap? Hence Z-code, Bytecode for Java. All
these kind of things became discussed more and more and more. And I think I,
at one stage, said: “Don’t imagine it’s always fairly close to the hardware”. You could
end up in a situation where C is your intermediate code. Bjarne Stroustrup
got a C++ compiler started by – in its early days – just making it produce C,
which you know you can cope with. What I want to have a look at, today, is this
whole business of: How do intermediate codes help you port a compiler from one
architecture to another?” And you’ve got to remember that, in the worst case, those
machines and architectures could be very very different indeed. I did an example, I
think, of running a C compiler on a PDP-11 in PDP-11 binary, whose
cross-compilation effect was to spit out Z80 binary, which is very very different.
How can you cope, then, with cross-compilation? And how does
cross-compilation lead you on to being able to think about porting your
compiler? Not just producing code for a foreign machine but, in a way to
mount an invasion of the foreign machine and to say: “I’m not just going to push
boatloads of code over, I’m going to set up a bridgehead. We’re going to land and I’m
going to set up a lot of my software tools on the far machine – not just fling raw
binary at it. So let’s start to discover a bit more
about this then. But in order to get into the details I have, at great personal
mental expense, made myself something like 40 or 50 T-diagram blanks and
let us hope that those prove sufficient for the task. I just wanted to talk you
through this, first of all, as being the basis for what we’re going to discuss
and then I’ll put it to one side. But I’ll bring it back if I need to refer to
it again. We are getting in to a land of intermediate codes. I’ve glued four
things to the page. What are they? Well, these two, at the top left, are what I will
call Source Texts for your cross compilation / compiler porting efforts.
What you’re saying is: from now on we don’t compile directly from H, the
high-level language, to produce binary, B, in the code generator. We do it in two
steps. We have an H compiler written in H producing I, the intermediate code, which
I see is not on this list. Let’s add it: “I=Intermediate Code”. Now, on the left
at the top are the source texts for doing this. But, if you consult the
previous things we have done on compilation you will understand that
it’s not directly those that we can use because we can’t directly execute H.
We have to put these through an H compiler. And we end up with the binary executable
versions of them. Now, a little bit of extra notation here. If you see at the
bottom, for this executable B’, I use a single dash (prime) to mean: “The computer I
currently possess, my old computer, the one I do all my work on”. If you see
things like B” that means the new machine that I’m trying to port
things to. So, we’ll eventually get to that stage of getting stuff across and
you’ll see B” is appearing. Inevitably, if you go this route, your
compilation down to binary is now a two-stage process. It may be hidden from
you but it has to be there. The other half, you see, is [that] if
you’re only going half way, to intermedia code, the other half of the journey is to
go from intermediate code down to runnable binary for the whole thing.
There’s your intermediate code interpreter, or compiler, written in B’
running B’, producing B’. So, I’ve numbered these 1, 2, 3 and 4
and eventually I’ll come back and say: “We’ve now created a new number 3”, or
something like that. What I’m going to do in this. Whereas in previous episodes
I’ve talked about recoding a C compiler to make better binary come out I did
that previously by calling it ‘subscript B for better’, but I’ve decided now to
use an asterisk. If you see an asterisk somewhere it means it’s a ‘better’ version
of what went before. And I is ‘intermediate code’. So, let’s put that to
one side to be dragged back as and when we need it. When you write a program you
have in mind a certain input, it executes and it produces a certain form of output.
And you’re very happy. And it all works beautifully. Rather than writing ‘C’, down
here, as I have been doing all along, I’ll try and generalize it a bit, say to some
high-level language (H) that you’re confident with and have used for ages.
But all the while you know. So this is, if you like, ‘Userprog’ here you are. You know
that H can’t execute directly so you rely on the fact that, bootstrapped up
over several generations, we just happen to have an H compiler that runs in B’
and produces B’. By slotting that into there [uses T-diag template] you remember – there’s an
implicit arrow there. That’s H feeds into that one there and the transformation done in this
compiler is to take the H and convert into B’, with a compiler that is now an
executable binary itself. So, the net result of all that, which we’ll show up
here – arrows – is what that makes, once you’ve compiled is of course something
that has your treasured input and output but is running, H running on B’ produces
B’. So, that is the binary executable version of your program. What happens if
your input and output was H itself? Can you write a compiler in itself? Of course
you can! It’s what all this is about. You want to produce an H compiler. We’ll start off
by writing an H compiler in H. So we’ll put this back to one side now these two
[T-shapes]. What happens if I were to say: “Well, I’ve written an H compiler in H and it
produces B’*. What I’m saying is: “I am fed up with plain old B’
because it’s slow and it’s inefficient. And it was wonderful when I first did it
and actually this thing up here has been bootstrapped up through being written in
assembler and heaven knows what”. See previous episodes if that sentence
doesn’t make any sense to you. But now we have got a situation [where] you want to improve
the quality of your binary so here’s a bit of revision. What you do is you say:
“OK I’ll write a better C compiler – better in the sense of ‘better quality binary’ “.
I feed it to the old compiler that we’ve got working already, which takes in H
runs on B’, produces B’. Anbd what does that end you up with? Answer:
an H producing binary. It’s running on old binary that’s not super quality but
it’s ok it doesn’t crash. It’s a bit slow. Just to end this little exercise off,
before we get on to genuine porting compilers., You’ve got that [points at T-diagram] and you
naturally then say: “Well why not feed it to itself again?” And if you get another
version of that, feed one into the other, you can end up with H written in better
binary, producing better binary. And if you look up [the video] “Self-compiling Compilers”
that’s exactly what we do. So it’s all very well doing this simple-minded stuff.
We take one great flying leap in our [previous] compilers and require the code generator
to be able to get very low-level very specific and yet be very
very tight and wonderful for all sorts of different binaries for different
machines B’, B” etc. That’s hard. Sso we’ve decided that
intermediate codes might be the answer. So how do we cope then, using
intermediate codes just with this business of improving your code?
How would you do it? Well it has to be a two-stage process, no question about that.
It has to be. So, this time I will do an H compiler, written in H. And this time I am
going to make it produce better intermediate code than ever before. So
you see … think of it this way. When you upgrade a compiler you’ve now got two
halves to upgrade. Do you want to upgrade the H-to-intermediate-code piece, the
front end, so-called. Or do you want to upgrade the intermediate code
interpreter, going down to binary, the back-end? It’s a mix and match. You
can do either, or both, or in whatever order you like. But you’ve got to
remember it is a two-stage process. Fortunately for me I have, already
working, an old compiler for H, running on B’ – original machine binary – and producing ordinary intermediate code. Not “super*
intermediate code. I’ve written a better version of myself now and I’m upgrading
this front-end. So we’ve got H written in H producing I* – better-quality
intermediate code in some sense – than went before. But the old version of the
compiler I’ve been using for months now just has H running on B’ producing
ordinary intermediate code – not optimized. But it’s good enough for compiling this
one. The only thing that’s different from what we’ve got before is that we don’t
directly end up producing binary as the output. We produce intermediate code as
the output. So this first stage, then, will get me the following: H feeds into H
running on B’ produces I, means that this thing takes in H produces I* and is running on I. OK fine.
So we’ve kind of ‘recompiled the compiler’ but we haven’t gone far enough yet.
And this is the the ball and chain around your ankle when you go for
intermediate codes is you’ve got a better thing but it is reliant on
running – being able to cope with – the fact that’s an intermediate code stage.
OK that doesn’t put us off. What I’ve said, all along, is that these are my key
source texts these are my executables and what I’m pointing out now, is, if you
use intermediate codes you don’t just need a source-end translator you need a
back-end translator. You need a thing that says, you know: “My system produces
intermediate code but I am the back end that takes in intermediate code and
produces real binary that really will run.” So, I want one of those now [points at T-shape (3)]
to put into my diagram. This is what we’re at so far. Let me gobble up yet another T-diagram and to say: “If I put in here what
everybody ought to have available to them, which is an I producing B’
written in B’, I can now slot that in there like that. And just look what
happens. You’ve got your first-stage compilation. It’s doing wonderful things
but it’s executing intermediate code. And maybe I have a test interpreter that
sort of, to show you: “Well that’s kind of working.” But, in the end, you want faster,
more efficient, code. So you decide you will compile your intermediate code into
proper binary. And this kind of choice between: “Do I interpret it slowly and see
if it’s working versus do I in the end commit to compiling it?” is the sort of thing
available in stuff like Java bytecode and so on. Start off by interpreting –
feels a bit slow but it’s looking good, let’s compile it. Now watch what happens
as you trace through here. I goes in here, this is the intermediate code compiler,
if you like, now producing genuine binary. That shoots through there and what you
end up with is H producing I* i.e. r better-quality
intermediate code, running on B’. Right, well we’re getting close. There’s
one final thing, though, that you need to do. What we’ve got ourselves now is a
situation where we’ve got a compiler that takes in statements in H. It is
running on B’ binary, ordinary unimproved binary. but it does produce
better intermediate code. We’re now going to pull the same trick that we’ve done
before in Self-compiling Compilers. The only difference here. though. is we’ve got
this I to cope with this time. But the principles of what we’re doing are just
the same. If I bring this one [T shape] down to stop my having to write out another one,
we’ve now made a binary out of that, feed it to itself. Originally I started with H
I* to H. I compiled it all the way through to intermediate code, as best I
could. It’s good quality intermediate code. Now take that thing, your executable,
feed the original thing to itself and look what happens! The H goes in there
and produces I*. So. you end up with H [producing] I* [written in] I*. Final stage coming up,
I promise. So that, if you like, when you’re through its machinations produces that.
One final stage – and I promise the torture will end – right! Now you might say:
“There – no big deal. I can write an intermediate code interpreter and check
out rough-and-ready. It’ll be slow but [we can] check out whether everything works”. But
then somebody in the end says: “No – come on let’s compile it – it might go a lot faster.”
So if you remember item number 4 in our original toolkit was a thing that
takes I and turns it into binary. So I’m going to put one of those in place now for
that all the way through. What will that produce you? Don’t forget we’re starting
H producing good quality intermediate code so that is what it produces but
what about here? If you trace through, better quality intermediate code should
hopefully give you better quality binary when you compile it. So I can put down
here that this is basically – if there aren’t too many suffixes and
superscripts there – it’s producing you a H to I compiler but running on better
quality binary. So one way or another this is the new 3 because, referring back
to our original map, what i’ve said is how can i improve that and i have
managed to improve it. I’ve got better quality intermediate code and better
quality binary. It is improved but it has been a fairly messy two-stage process
that’s hidden from you. But the idea of that is to give you just the notion of
the fact that there is a penalty to pay, you have to have a two-stage process. So
we’ve expended a lot of brain cells I think between us now discovering how we
can improve our compiler, which we did do before in Self compiling Compilers
episode, but this time it’s different. How can you improve your compiler when
there’s an intermediate code involved? And we’ve done it! And we’ve seen exactly
how we can do it. We feel weary – or at least I do – but that is it. And you might say:
“Why bother with intermediate codes? It’s just producing more stages that we have to go
through?” Well, the answer us I think that it does make life more easy for you
if you say: “OK, instead of improving ourselves now within an intermediate
code context how about we say ‘I don’t want better binary out of it – I want
different binary for another machine’ ! So, we will find that the diagrams that I’ve
just drawn with some subtle adaptation – and me getting all tangled up probably –
can be adapted for producing binary for a completely new architecture which we
will be calling B”.
We’re now going to find in this next one,
we’re doing is we’re just changing the rules slightly. Instead of B* –
improved binary – we’re moving from B’ to B”.

40 comments on “Improving Intermediate Codes – Computerphile

  1. As I understand though Java only does the compilation for code that is frequently run. I understand this is to avoid the cost of compiling code that is rarely used and therefore doesn't affect speed as much as it would take to compile it (JIT compilation). But after seeing this video I am not entirely sure why Java doesn't compile everything before run time instead. Is it because it was originally intended to be run in web applets?

  2. I can see why low level programmers would be upset by python hobbyists standing on shoulders of abstraction of abstraction of abstraction

  3. My favorite part of this is at 4:13– you can feel the assembly and the binary poking up through the C!

  4. IBM COBOL compilers in the 1960s and '70s used to excrete Assembler as the intermediate code.

    A Haskell interpreter played an important role in the development of Perl 6. Now Perl 6 compiles itself.

  5. I literally feel like I spent huge money to get Visa and got admission in University and now sitting in office of this Professor to learn better. This is the best video series on Computerphile.

  6. Love listening to this. As a Systems Architect focusing on high level languages i love going deep like this with architecture and history!

  7. Aside from being one of the most informative sources on the internet he has a good taste of music as well (in my opinion) as demonstrated by the CD on his desk.

  8. Mr Brailsford you are quite amazing at using your profound knowledge to translate some quite damn complex concepts (for me at least) into something that makes sense. Thank you!

  9. This video has helped me, along with the previous computerphile video on recursion so I will share my thought. We are creating computers which do not run on traditional binary, quantum computers. Therefore, Intermediate code compilers will become the bleeding edge in order to nest quantum and traditional machines together. Moreover, as computing advances past quantum computers, as it surely will, intermediate code will still be needed to therefore nest those machines with the latter. This will create a 4D computational architecture with compilers acting in the 4th dimension, as in, over time. This architecture mirrors our current understanding physics and is therefore possible, probable, and dare I say inevitable. sidebar Legacy codes which back date may also be needed; however, this is just a loose end I have been unable to tie thus far.

  10. First of all, good videos! Even though I'm not in the field of computer science I can follow along, but I would like to learn a more about "Creating a compiler" and eventually your own language. Is there a good text book for this/those topic/s one could read?

  11. I think this would have been easier to follow if the 4 starting programs were labeled, for example with "#1"-"#4" circled in the top, to better distinguish the initial starting written programs from the generated programs. That way the fact that all these other programs are being generated using the starting programs would be clearer and easier to follow, rather than tons of programs coming from nowhere.

  12. Currently taking a class on compiler design and programming paradigms. Your insight into these related topics are so valuable to me. Thank you.

  13. What a cliff hanger, i want to know how B double dash compiles B dash interpreter to improve the intermediate code that makes better B double dash compiler on B dash.

  14. Who do amazing people like him that know so much and have experienced so much… Have such a low quality monitor and headphones? Is it really that someone that probably doesnt value picture and sound quality, can be happy with such low quality products?

  15. In practice is I* well defined? What I mean is, are there different implementations of I that I have to worry about, where an I* for one implementation is terrible for another?

  16. Took the Compilers I last Semester. How we were to design the compiler was to have the compiler in "modules" so it had to produce intermediate code then run a routine that would read the intermediate code and it would produce assembly for MIPS architecture. And then if needed we could program another module that would take the intermediate code and translate it on another architecture without changing anything else on the compiler.

  17. I think Dr.Brailsford would love to learn how the compilers in the D language work. There are 3 compilers existing DMD, LDC and GDC. DMD is the reference compiler with its frontend written in D generating intermediate syntax trees. The dmd compiler also has a backend, also written in D but quite limited in scope, x86 and limited optimization. ldc and gdc on the other hand are the same dmd frontend, but connected to the more advanced backends of llvm and gcc respectively. So the interesting part is the intermediate representation of the code, which allows to represent the same code for 3 different backends. What's unusual in that arrangment is that some optimisation passes are already done in the frontend (inlining f.ex.) part so that it's the intermediate code that benefits already from some amelioration.

  18. It's worth noting that C can be pretty different from machine code, in ways that make it hard to write a direct compiler. If the program is something like "while (x > 0 && f(x, y && g(y)) x–;" the flow control and use of registers is very different from how it looks, especially if you want to notice that "y && g(y)" may not have to be repeated, depending on the definition of g().

  19. I you want to watch/rewatch Professor Brailsford's entire series on compilers on Computerphile in order here they are in order:

    Bootstrapping with T-Diagrams – https://www.youtube.com/watch?v=PjeE8Bc96HY
    Self Compiling Compilers – https://www.youtube.com/watch?v=lJf2i87jgFA
    The UNCOL Problem – https://www.youtube.com/watch?v=pP_zIJSp3WA
    Improving Intermediate Codes – https://www.youtube.com/watch?v=TiJn9D6lZ-Y

  20. Love the bit about being able to "feel the machine code poking up through C code". Reminds me of my old prof, Jeff Rohl, saying C was really "glorified machine language", with just a tinge of disgust 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *