Sunday, March 1, 2020

Ruslan Spivak: EOF is not a character

I was reading Computer Systems: A Programmer’s Perspective the other day and in the chapter on Unix I/O the authors mention that there is no explicit “EOF character” at the end of a file.

If you’ve spent some time reading and/or playing with Unix I/O and have written some C programs that read text files and run on Unix/Linux, that statement is probably obvious. But let’s take a closer look at the following two points related to the statement in the book:

  1. EOF is not a character
  2. EOF is not a character you find at the end of a file


1. Why would anyone say or think that EOF is a character? I think it may be because in some C programs you can find code that explicitly checks for EOF using getchar() and getc() routines:

    #include <stdio.h>
    ...
    while ((c = getchar()) != EOF)
      putchar(c);

    OR

    FILE *fp;
    int c;
    ...
    while ((c = getc(fp)) != EOF)
      putc(c, stdout);

And if you check the man page for getchar() or getc(), you’ll read that both routines get the next character from the input stream. That could be what leads to confusion about the nature of EOF, but that’s just me speculating. Let’s get back to the point that EOF is not a character.

What is a character anyway? A character is the smallest component of a text. ‘A’, ‘a’, ‘B’, ‘b’ are all different characters. A character has a numeric value that is called a code point in the Unicode standard. For example, the English character ‘A’ has a numeric value of 65 in decimal. You can check this quickly in a Python shell:

$ python
>>> ord('A')
65
>>> chr(65)
'A'


Or you could look it up in the ASCII table on your Unix/Linux box:

$ man ascii


Let’s check the value of EOF by writing a little C program. In ANSI C, EOF is a macro defined in <stdio.h>, part of the standard library, that expands to a negative integer constant. Its value is usually -1. Save the following code in file printeof.c, compile it, and run it:

#include <stdio.h>

int main(int argc, char *argv[])
{
  printf("EOF value on my system: %d\n", EOF);
  return 0;
}


$ gcc -o printeof printeof.c

$ ./printeof
EOF value on my system: -1

Okay, so on my system the value is -1 (I tested it on both macOS and Ubuntu Linux). Is there a character with a numerical value of -1? Again, you could check the available numeric values in the ASCII table or check the official Unicode page to find the legitimate range of numeric values for representing characters. But let’s fire up a Python shell and use the built-in chr() function to return a character for -1:

$ python
>>> chr(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)

As expected, there is no character with a numeric value of -1. Okay, so EOF (as seen in C programs) is not a character.
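This is also why getchar() and getc() return an int and not a char: the return value has to cover every possible byte (0 to 255) plus one extra value that no byte can take. Here is a rough Python sketch of that idea. The `getc` helper and the `EOF` constant are illustrative stand-ins written for this example; they are not part of Python's standard library.

```python
import io

EOF = -1  # illustrative sentinel; any value outside 0..255 would work


def getc(stream):
    """Return the next byte as an int in 0..255, or EOF when nothing is left."""
    data = stream.read(1)
    if data == b"":
        return EOF
    return data[0]


# io.BytesIO stands in for an opened binary file.
stream = io.BytesIO(b"Hi\xff")
values = []
while True:
    c = getc(stream)
    if c == EOF:
        break
    values.append(c)
final = getc(stream)  # asking again at end of input still yields EOF

print(values)  # [72, 105, 255]
print(final)   # -1
```

Note that the byte 0xff (255) comes through as ordinary data. It can never be mistaken for EOF, because EOF lies outside the range of byte values.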

Onto the second point.


2. Is EOF a character that you can find at the end of a file? I think at this point you already know the answer, but let’s double check our assumption.

Let’s take a simple text file helloworld.txt and get a hexdump of the contents of the file. We can use xxd for that:

$ cat helloworld.txt
Hello world!

$ xxd helloworld.txt
00000000: 4865 6c6c 6f20 776f 726c 6421 0a         Hello world!.

As you can see, the last character at the end of the file is the hex 0a. You can find in the ASCII table that 0a represents nl, the newline character. Or you can check it in a Python shell:

$ python
>>> chr(0x0a)
'\n'
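No terminator byte is needed because the file system records the file's length as metadata, and the kernel uses that length to know where the data stops. The sketch below checks this from Python; it creates its own temporary copy of helloworld.txt (a stand-in for the file used above) so that it can run anywhere.

```python
import os
import tempfile

# Stand-in for helloworld.txt, created in a temporary directory.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "helloworld.txt")
    with open(path, "wb") as fout:
        fout.write(b"Hello world!\n")

    size = os.stat(path).st_size  # length recorded by the file system
    with open(path, "rb") as fin:
        data = fin.read()         # every byte stored in the file

print(size)       # 13
print(len(data))  # 13
print(data[-1:])  # b'\n'
```

The 13 bytes reported by os.stat() are exactly the 13 bytes shown in the hexdump, and the last one is the newline.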


Okay. If EOF is not a character and it’s not a character that you find at the end of a file, what is it then?

EOF (end-of-file) is a condition that can be detected by an application when a read operation reaches the end of a file.
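One way to see that EOF is a condition and not something stored in the file: the condition can go away. In the sketch below (the file name growing.txt is made up for the example), a reader hits end-of-file, another handle appends data, and the same reader then gets the new bytes.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "growing.txt")  # hypothetical file name
    with open(path, "wb") as fout:
        fout.write(b"first\n")

    # buffering=0 gives an unbuffered reader, so each read() goes to the OS.
    with open(path, "rb", buffering=0) as reader:
        before = reader.read(100)  # everything currently in the file
        at_eof = reader.read(100)  # b'' signals the end-of-file condition

        with open(path, "ab") as writer:  # a second handle appends data
            writer.write(b"second\n")

        after = reader.read(100)   # the same reader now sees the new bytes

print(before)  # b'first\n'
print(at_eof)  # b''
print(after)   # b'second\n'
```

This is roughly how tools that follow a growing log file can keep reading after they have reached the end once.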

Let’s see how we can detect the EOF condition in various programming languages when reading a text file using high-level I/O routines provided by the languages. For this purpose, we’ll write a very simple cat version called mcat that reads an ASCII-encoded text file byte by byte (character by character) and explicitly checks for EOF. Let’s write our cat version in the following programming languages:

  • ANSI C
  • Python
  • Go
  • JavaScript (node.js)

You can find source code for all of the examples in this article on GitHub. Okay, let’s get started with the venerable C programming language.

  1. ANSI C (a modified cat version from The C Programming Language book)

    /* mcat.c */
    #include <stdio.h>
    
    int main(int argc, char *argv[])
    {
      FILE *fp;
      int c;
    
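      /* without this check, fopen() would be handed a NULL file name */
      if (argc < 2) {
        fprintf(stderr, "usage: mcat file\n");
        return 1;
      }

      if ((fp = fopen(*++argv, "r")) == NULL) {
        fprintf(stderr, "mcat: can't open %s\n", *argv);
        return 1;
      }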
    
      while ((c = getc(fp)) != EOF)
        putc(c, stdout);
    
      fclose(fp);
    
      return 0;
    }
    

    Compile

    $ gcc -o mcat mcat.c
    

    Run

    $ ./mcat helloworld.txt
    Hello world!
    


    Quick explanation of the code above:

    • The program opens a file passed as a command line argument.
    • The while loop copies data from the file to the standard output one byte at a time until it reaches the end of the file.
    • On reaching EOF, the program closes the file and terminates.
  2. Python 3

    Python doesn’t have a mechanism to explicitly check for EOF like in ANSI C, but if you read a text file one character at a time, you can determine the end-of-file condition by checking if the character read is empty:

    # mcat.py
    import sys
    
    with open(sys.argv[1]) as fin:
        while True:
            c = fin.read(1) # read max 1 char
            if c == '':     # EOF
                break
            print(c, end='')
    


    $ python mcat.py helloworld.txt
    Hello world!
    

    Python 3.8+ (a shorter version of the above using the walrus operator):

    # mcat38.py
    import sys
    
    with open(sys.argv[1]) as fin:
        while (c := fin.read(1)) != '':  # read max 1 char at a time until EOF
            print(c, end='')
    


    $ python3.8 mcat38.py helloworld.txt
    Hello world!
    
  3. Go

    In Go we can explicitly check if the error returned by Read() is EOF.

    // mcat.go
    package main
    
    import (
        "fmt"
        "io"
        "os"
    )
    
    func main() {
        file, err := os.Open(os.Args[1])
        if err != nil {
            fmt.Fprintf(os.Stderr, "mcat: %v\n", err)
            os.Exit(1)
        }
    
        buffer := make([]byte, 1)  // 1-byte buffer
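        for {
            bytesread, err := file.Read(buffer)
            if err == io.EOF {
                break
            }
            if err != nil { // any other error: report it instead of looping forever
                fmt.Fprintf(os.Stderr, "mcat: %v\n", err)
                os.Exit(1)
            }
            fmt.Print(string(buffer[:bytesread]))
        }
        file.Close()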
    }
    


    $ go run mcat.go helloworld.txt
    Hello world!
    
  4. JavaScript (node.js)

    There is no explicit check for EOF, but the end event on a stream is fired when the end of a file is reached and a read operation tries to read more data.

    /* mcat.js */
    const fs = require('fs');
    const process = require('process');
    
    const fileName = process.argv[2];
    
    var readable = fs.createReadStream(fileName, {
      encoding: 'utf8',
      fd: null,
    });
    
    readable.on('readable', function() {
      var chunk;
      while ((chunk = readable.read(1)) !== null) {
        process.stdout.write(chunk); /* chunk is a one-character string (one byte for ASCII) */
      }
    });
    
    readable.on('end', () => {
      console.log('\nEOF: There will be no more data.');
    });
    


    $ node mcat.js helloworld.txt
    Hello world!
    
    EOF: There will be no more data.
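For the Python variant above, there is one more common way to spell the "read until the empty string" loop: the two-argument form of the built-in iter(), which keeps calling a function until it returns a given sentinel value. A small sketch follows; the `chars` helper is a name invented for this example, and io.StringIO stands in for an opened text file.

```python
import io
from functools import partial


def chars(stream):
    """Return an iterator over single characters that stops at EOF."""
    # iter(callable, sentinel) calls stream.read(1) repeatedly and stops
    # as soon as the result equals the sentinel '' (the EOF signal).
    return iter(partial(stream.read, 1), "")


# io.StringIO stands in for an opened text file in this sketch.
fin = io.StringIO("Hello world!\n")
output = "".join(chars(fin))

print(output, end="")  # Hello world!
```

The end-of-file check is the same as before (an empty result from read()); only the loop syntax differs.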
    


How do the high-level I/O routines in the examples above determine the end-of-file condition? On Linux systems the routines use, directly or indirectly, the read() system call provided by the kernel. The getc() function (or macro) in C, for example, is built on read() and returns EOF once read() has indicated the end-of-file condition, which read() does by returning 0. (getc() also returns EOF on a read error; feof() and ferror() tell the two cases apart.)
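To make the relationship between a getc()-style routine and read() more concrete, here is a toy buffered reader in Python. It is only a sketch of the general idea: `ByteReader`, its `getc` method, and the `EOF` constant are names made up for this example, and real stdio implementations differ in many details.

```python
import os
import tempfile

EOF = -1  # illustrative sentinel, mirroring the usual C value


class ByteReader:
    """Toy buffered reader: refills from os.read() and hands out one byte per call."""

    def __init__(self, fd, bufsize=4096):
        self.fd = fd
        self.bufsize = bufsize
        self.buf = b""
        self.pos = 0
        self.read_calls = 0  # how many times we asked the kernel for data

    def getc(self):
        if self.pos == len(self.buf):  # buffer used up: ask the kernel for more
            self.buf = os.read(self.fd, self.bufsize)
            self.pos = 0
            self.read_calls += 1
            if not self.buf:           # read() delivered 0 bytes: end of file
                return EOF
        c = self.buf[self.pos]
        self.pos += 1
        return c


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "helloworld.txt")  # stand-in for the article's file
    with open(path, "wb") as fout:
        fout.write(b"Hello world!\n")

    fd = os.open(path, os.O_RDONLY)
    reader = ByteReader(fd)
    received = bytearray()
    while True:
        c = reader.getc()
        if c == EOF:
            break
        received.append(c)
    os.close(fd)

print(bytes(received))    # b'Hello world!\n'
print(reader.read_calls)  # 2
```

The caller asks for 13 bytes one at a time, but the kernel is asked only twice: once for the data and once more to learn that nothing is left.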

Let’s write a cat version called syscat using Unix system calls only, both for fun and potentially some profit. Let’s do that in C first:

/* syscat.c */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

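int main(int argc, char *argv[])
{
  int fd;
  char c;

  if (argc < 2)
    return 1;

  fd = open(argv[1], O_RDONLY);
  if (fd == -1)
    return 1;

  /* read() returns 0 at EOF and -1 on error; stop in both cases */
  while (read(fd, &c, 1) > 0)
    write(STDOUT_FILENO, &c, 1);

  close(fd);
  return 0;
}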


$ gcc -o syscat syscat.c

$ ./syscat helloworld.txt
Hello world!

In the code above, you can see that we use the fact that the read() function returns 0 to indicate EOF.

And the same in Python 3:

# syscat.py
import sys
import os

fd = os.open(sys.argv[1], os.O_RDONLY)

while True:
    c = os.read(fd, 1)
    if not c:  # EOF
        break
    os.write(sys.stdout.fileno(), c)


$ python syscat.py helloworld.txt
Hello world!

And in Python 3.8+ using the walrus operator:

# syscat38.py
import sys
import os

fd = os.open(sys.argv[1], os.O_RDONLY)

while c := os.read(fd, 1):
    os.write(sys.stdout.fileno(), c)


$ python3.8 syscat38.py helloworld.txt
Hello world!
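Reading one byte per system call keeps the examples simple, but the same EOF check works for larger reads. The sketch below copies a file in chunks with os.read() and os.write(); `copy_fd` is a helper name made up for this example, and the file names are stand-ins.

```python
import os
import tempfile


def copy_fd(src_fd, dst_fd, chunk_size=4096):
    """Copy src_fd to dst_fd until os.read() returns b'' (EOF); return bytes copied."""
    total = 0
    while True:
        chunk = os.read(src_fd, chunk_size)  # may return fewer bytes than requested
        if not chunk:                        # zero bytes: end-of-file condition
            break
        view = memoryview(chunk)
        while len(view) > 0:                 # os.write() may write only part
            written = os.write(dst_fd, view)
            view = view[written:]
        total += len(chunk)
    return total


with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, "helloworld.txt")
    dst = os.path.join(tmpdir, "copy.txt")
    with open(src, "wb") as fout:
        fout.write(b"Hello world!\n" * 1000)

    src_fd = os.open(src, os.O_RDONLY)
    dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    copied = copy_fd(src_fd, dst_fd)
    os.close(src_fd)
    os.close(dst_fd)

    with open(dst, "rb") as fin:
        result = fin.read()

print(copied)  # 13000
```

A short read (fewer bytes than requested) is not EOF. Only a read that returns no bytes at all signals the end of the file.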


Let’s recap the main points about EOF again:

  • EOF is not a character
  • EOF is not a character that you find at the end of a file
  • EOF is a condition provided by the kernel that can be detected by an application when a read operation reaches the end of a file
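
The last point is not limited to regular files. As a small illustration, assuming a Unix-like system, a pipe has no file size at all, yet the kernel still reports the end-of-file condition once the write end has been closed and the buffered data has been consumed:

```python
import os

read_fd, write_fd = os.pipe()

os.write(write_fd, b"Hello world!\n")
os.close(write_fd)  # no writers remain, so the reader can reach end of input

first = os.read(read_fd, 100)   # the bytes that were written to the pipe
second = os.read(read_fd, 100)  # b'': end-of-file condition on the pipe
os.close(read_fd)

print(first)   # b'Hello world!\n'
print(second)  # b''
```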

Happy learning and have a great day!


Resources used in preparation for this article (some links are affiliate links):

  1. Computer Systems: A Programmer’s Perspective (3rd Edition)
  2. C Programming Language, 2nd Edition
  3. The Unix Programming Environment (Prentice-Hall Software Series)
  4. Advanced Programming in the UNIX Environment, 3rd Edition
  5. Go Programming Language, The (Addison-Wesley Professional Computing Series)
  6. Unicode HOWTO
  7. Node.js Stream module
  8. Go io package
  9. cat (Unix)


from Planet Python