KTH PDC [PDC - Center for Parallel Computers, KTH]


Guided Tours: Binary files


Introduction

Most scientists need to transfer their computational data from one computer system to another without losing the ability to re-use the data on the new platform. Generally this is easy to do, as there are numerous ways to transfer files, ranging from Internet-based transfers to moving diskettes or tapes between physical computer systems.

Text-based files are today easy to move between systems for users who create files in western languages. Other file types need special consideration. This document focuses in particular on how to re-use files created on Cray PVP systems (Cray YMP/J90/C90) on the dominant computational platform of today: IEEE floating point based UNIX systems.


File formats

When talking about binary files, confusion often arises between users programming in C and users programming in Fortran. Here we use the following terms:
Binary unblocked files
This is the "standard" C binary file format. On most systems a Fortran file gets this format if it is opened with the specifiers FORM='UNFORMATTED' and ACCESS='DIRECT'. The file consists of an unbroken sequence of bytes representing the data in the file.
Binary blocked files
This is the "standard" format for binary files created in Fortran. It is what you typically get if you open a file in Fortran with the specifier FORM='UNFORMATTED' and either omit the ACCESS specifier or give it as ACCESS='SEQUENTIAL'.
Text files
The text-based formats are character-based files, typically using the ASCII character set or, increasingly, the ISO Latin-1 character set.
High level formats
To make the underlying physical file format transparent to the user, and to make it easier to move files between systems, there are several high level file formats available. These formats are used by calling specific library functions for each file format, instead of the file functions provided by the programming language. Examples of such platform independent file formats are netCDF, HDF and XML, but many others are available as well.

Data formats

Each piece of information in the file is represented as a low-level sequence of bits stored on some kind of persistent media. In the case of character data, this is typically the bit pattern corresponding to each character encoded according to the selected character set. Typically this will be the ASCII coding sequence or ISO Latin-1.

For integer data the bits stored are typically the bit representation of the value encoded in the native integer representation. Today this is almost exclusively a binary system using 2's complement for signed integer values. The number of bits stored reflects the size of the integer format. A typical size is 32 bits, but 64 bits is the default for integers on Cray PVPs, and C programs can make heavy use of 8 bit wide integers (declared as char).
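The effect of the integer width can be illustrated with a small sketch. It uses Python's standard struct module purely as a convenient way to show the stored bytes; the variable names are made up for the illustration, and big-endian byte order is assumed throughout.

```python
import struct

# Pack the same value at three integer widths
# (big-endian byte order, 2's complement).
value = -2
as_int8 = struct.pack(">b", value)    # 8 bits, like a C signed char
as_int32 = struct.pack(">i", value)   # 32 bits, a typical default
as_int64 = struct.pack(">q", value)   # 64 bits, the Cray PVP default width

print(as_int8.hex())    # fe
print(as_int32.hex())   # fffffffe
print(as_int64.hex())   # fffffffffffffffe

# Reading with the wrong width gives no error, only wrong values:
# one 64-bit integer interpreted as two 32-bit integers.
pair = struct.unpack(">2i", as_int64)
print(pair)             # (-1, -2)
```

The last lines show why a reader that assumes 32 bit integers cannot simply read a file of 64 bit integers: the data is accepted, but twice as many values come out and half of them are sign extension.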

Floating point numbers are also typically stored in the file using the same bit pattern as the internal representation of the computer system. Floating point numbers today are typically 32 or 64 bits wide. Most computer systems use the IEEE standardized floating point format, but one important exception is Cray PVP computers.
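As an illustration of what "the same bit pattern" means for IEEE numbers, the following sketch prints the bytes stored for the value 1.0 in the 32 bit and 64 bit formats. It again uses Python's struct module and assumes big-endian byte order; it says nothing about the Cray PVP format, which uses a different layout.

```python
import struct

x = 1.0
single = struct.pack(">f", x)   # 32-bit IEEE floating point
double = struct.pack(">d", x)   # 64-bit IEEE floating point

print(single.hex())   # 3f800000
print(double.hex())   # 3ff0000000000000

# The bytes carry no type information. Interpreting the 64-bit
# floating point pattern as a 64-bit integer is accepted silently.
as_integer = struct.unpack(">q", double)[0]
print(as_integer)     # 4607182418800017408
```

This is why the program reading a binary file has to know the type of every item in advance.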

In this context it is also necessary to mention the concept of "endianness". Typically 8 consecutive bits are treated as an entity called a byte, and most data items (integers and floating point numbers) occupy several consecutive bytes, sometimes referred to as a "word". When such a data item is stored in memory or in a file, its bytes can be stored in either of two orders. If the most significant byte is stored first, the scheme is referred to as "big-endian" (big end first); if the least significant byte is stored first, it is referred to as "little-endian".

Most traditional Unix systems run on "big-endian" machines, but one important exception is x86 based PC systems and x86 compatible systems. This also applies to IPF (Itanium processor family) systems, i.e. the Lucidor system.
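The two byte orders can be made visible with a short sketch. It uses Python's struct module, where the prefixes ">" and "<" select big-endian and little-endian order; the value 0x01020304 is chosen only because its four bytes are easy to tell apart.

```python
import struct
import sys

value = 0x01020304

big = struct.pack(">I", value)      # most significant byte first
little = struct.pack("<I", value)   # least significant byte first

print(big.hex())      # 01020304
print(little.hex())   # 04030201

# A big-endian file read on the assumption that it is little-endian
# yields a different number, with no error reported.
misread = struct.unpack("<I", big)[0]
print(hex(misread))   # 0x4030201

# Byte order of the machine running this sketch: 'little' or 'big'.
print(sys.byteorder)
```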

The Intel Fortran compiler has no fewer than six different methods of specifying non-native (to the Intel Fortran compiler) numeric formats for unformatted data. Further information can be found through the IPF linux software page under Fortran77: navigate to Product Manuals/fortran, and then to 'Data and I/O,' 'Converting Unformatted Data,' 'Methods of Specifying the Data Format: Overview.' We have also found this page from NCSA, Little-endian-to-Big-endian, quite useful. Another option is to look in the documentation at PDC. The command module show i-compilers shows where the documentation can be found.

Binary unblocked files

This file format contains all the written data in one long byte stream. The program reading the file must know not only the exact location of the particular data that it is going to read, but also the type of the data, i.e. whether the data is to be interpreted as an IEEE floating point number, a Cray PVP floating point number, an integer or any other data type.

Each data item directly follows the previous one. If the file is to be accessed from Fortran, the location of data is given in units of the record size specified when the file is opened. In C, file accesses can be positioned by specifying a file offset in bytes.
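Locating an item in an unblocked file is therefore a matter of arithmetic on byte offsets. The sketch below assumes a file that contains nothing but big-endian 64 bit IEEE floating point numbers; the helper read_item and the constant ITEM_SIZE are hypothetical names introduced for the illustration.

```python
import os
import struct
import tempfile

ITEM_SIZE = 8  # bytes per 64-bit float (the assumed "record size")

def read_item(path, index):
    """Return item number `index` (0-based) from a file of big-endian doubles."""
    with open(path, "rb") as f:
        f.seek(index * ITEM_SIZE)          # byte offset of the item
        return struct.unpack(">d", f.read(ITEM_SIZE))[0]

# Create a small unblocked file to read from.
values = [0.5, 1.5, 2.5, 3.5]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack(">4d", *values))
    path = tmp.name

third = read_item(path, 2)
size = os.path.getsize(path)
os.remove(path)

print(third)   # 2.5
print(size)    # 32, i.e. 4 items of 8 bytes and nothing else
```

The file size confirms that the file holds only the data itself, with no header or record information.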

Binary blocked files

This file format organizes the data in entities called "records". Typically the low-level file format is a continuous sequence of records, each consisting of a "block header", followed by a continuous stream of bytes holding the data within the record, and ended by a "block sentinel". The "block header" and "block sentinel" are typically integers of a system dependent size specifying the size of the current record. Thus the header and the sentinel usually have the same binary value, as they refer to the same record. This information is used by the Fortran I/O routines to know the size of the record (and thus the location of the next record), as well as to implement the BACKSPACE statement available in Fortran.

As in the case of unblocked files, the reading routine must know the type of each data item to correctly interpret it upon read.
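The record structure described above can be sketched as follows. The sketch assumes that the block header and block sentinel are 4 byte big-endian integers holding the record length in bytes; the marker size and byte order are system dependent, so this is an assumption rather than a description of any particular compiler. The function names are hypothetical.

```python
import io
import struct

MARKER = ">i"   # assumed: 4-byte big-endian record length

def write_record(stream, payload):
    """Write one record: header, data bytes, sentinel."""
    marker = struct.pack(MARKER, len(payload))
    stream.write(marker + payload + marker)

def read_records(stream):
    """Return the raw data bytes of every record in the stream."""
    records = []
    while True:
        header = stream.read(4)
        if not header:                      # end of file
            break
        (length,) = struct.unpack(MARKER, header)
        payload = stream.read(length)
        (sentinel,) = struct.unpack(MARKER, stream.read(4))
        if sentinel != length:
            raise ValueError("header and sentinel disagree")
        records.append(payload)
    return records

# Two records of different types, as sequential Fortran writes would produce.
buf = io.BytesIO()
write_record(buf, struct.pack(">3d", 1.0, 2.0, 3.0))
write_record(buf, struct.pack(">2i", 7, 8))
buf.seek(0)

records = read_records(buf)
first = struct.unpack(">3d", records[0])    # the reader must know the types
second = struct.unpack(">2i", records[1])
print(first, second)   # (1.0, 2.0, 3.0) (7, 8)
```

The sentinel at the end of each record is what makes it possible to step backwards through the file, which is how an operation like BACKSPACE can find the start of the previous record.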

Text files

Text files store the data in a character representation that can be directly interpreted by humans. The individual data items are typically separated by white space. In the case of Fortran, records are separated by newline characters. The textual representation of floating point numbers is platform independent, so no conversion is needed when files are moved between platforms.

The compatibility issues that arise in the case of text files are of two kinds. The first is that the marking of end-of-line differs between Unix systems, PCs and Macs. Unix workstations use the ASCII character LF as end of line, while Windows/DOS based applications use the two character sequence CR LF. There are utility programs that can convert between these conventions, e.g. dos2unix and unix2dos. The second portability issue concerns characters that are outside the ASCII character set. This includes many language specific characters, such as the umlauted letters of many western European languages. Most Unix systems default to the ISO Latin-1 character set, while Windows machines have a slightly different default character set.
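Both issues can be shown on a few bytes. The sketch below does by hand what a tool like dos2unix does to line endings, and then compares ISO Latin-1 with Windows code page 1252, which is assumed here to be the "slightly different" Windows default. The sample strings are made up for the illustration.

```python
# Line endings: a DOS/Windows text file ends each line with CR LF.
dos_text = b"alpha 1.5\r\nbeta 2.5\r\n"
unix_text = dos_text.replace(b"\r\n", b"\n")   # what dos2unix achieves
print(unix_text)   # b'alpha 1.5\nbeta 2.5\n'

# Character sets: in ISO Latin-1 every character is exactly one byte.
word = "V\u00e4ster\u00e5s"                    # contains a-umlaut and a-ring
latin1_bytes = word.encode("latin-1")
print(len(word), len(latin1_bytes))            # 8 8

# For these letters code page 1252 (assumed Windows default) agrees ...
same_letters = latin1_bytes.decode("cp1252") == word

# ... but the two sets differ for bytes in the range 0x80-0x9F.
differs = b"\x80".decode("latin-1") != b"\x80".decode("cp1252")
print(same_letters, differs)                   # True True
```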

Accessing Cray files on IEEE Unix systems

The typical problem that arises for users moving data from a Cray PVP system to a RISC Unix workstation is reading old binary blocked Fortran files on the new system. Some Fortran compilers have non-standard extensions that allow the application to read the Cray PVP binary formatted file directly. Such options are available on the PDC computer boye.

Unfortunately, no such functionality is available in the current version of the IBM XL Fortran compiler. Instead, the user is required to do explicit conversion using an external library. PDC provides the NCARU library from the University Corporation for Atmospheric Research, UCAR.

In the case where each record consists of only one data type (e.g. 64 bit floating point numbers), such records can be read and converted by a call to the NCARU function CRAYREAD. If a record contains variables of mixed types, the complete record should first be read into a temporary array without conversion. The individual data items can then be converted using calls to the functions ctodpf, ctospf and ctospi.
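To give an idea of what such a conversion involves, the sketch below converts a single 64 bit Cray PVP floating point word to a native number. It is not the NCARU implementation and not a substitute for ctodpf. The layout it assumes (1 sign bit, a 15 bit exponent biased by 16384, and a 48 bit fraction without a hidden bit) comes from general descriptions of the Cray format, not from this document, and the function name cray_to_float is hypothetical.

```python
import math
import struct

def cray_to_float(word_bytes):
    """Convert one 8-byte big-endian Cray PVP float to a Python float.

    Assumed layout: bit 63 sign, bits 62-48 exponent (bias 16384),
    bits 47-0 fraction with the binary point to the left of bit 47.
    """
    (word,) = struct.unpack(">Q", word_bytes)
    sign = -1.0 if word >> 63 else 1.0
    exponent = (word >> 48) & 0x7FFF
    fraction = word & 0xFFFFFFFFFFFF
    if fraction == 0:
        return 0.0
    # value = (fraction / 2**48) * 2**(exponent - 16384)
    # ldexp raises OverflowError for exponents beyond the IEEE range.
    return sign * math.ldexp(fraction, exponent - 16384 - 48)

one = cray_to_float(bytes.fromhex("4001800000000000"))
negative = cray_to_float(bytes.fromhex("c002a00000000000"))
print(one, negative)   # 1.0 -2.5
```

Since the assumed 48 bit fraction fits within the 53 bit mantissa of a 64 bit IEEE number, no precision is lost in this direction; the Cray exponent range is wider, however, so very large or very small values cannot be represented.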

Simple example

A simple example in which each Fortran record is of a homogeneous type is available as inspiration. It can be compiled on the PDC IBM systems using the command
xlf90 -o testncaru testncaru.f -L/usr/local/vol/numlib/NCARU/lib -lncaru
The code is available at http://www.pdc.kth.se/support/testncaru.f.

Documentation of the NCARU library is available as man pages.

man -M /usr/local/vol/numlib/NCARU/man ncaru

Alternative approaches

Converting a Unicos (Cray PVP) blocked file to a Unix blocked file
This can be accomplished on an SGI machine by the following sequence of commands. Note that this only converts the block/record information and does not make any translation between binary representations of numbers.
boye> assign -F cos Cray-pvp.dat
boye> assign -F f77 Unixblocked-PVPreal.dat
boye> fdcp Cray-pvp.dat Unixblocked-PVPreal.dat
Accessing Unicos/PVP files directly
SGI provides a file access mode where binary conversion is made directly by the I/O library. This can be very handy at times, but users should be aware that there is an overhead associated with the conversion. Also, the 64 bit precision Cray floating point numbers will be truncated to 32 bit IEEE numbers on the SGI. In the example below, the user's application reads a file named BINFILE.cray-pvp.dat that was created on a Cray PVP machine.
# Set up a filenv descriptor file
boye> setenv FILENV /scratch/${USER}_filenv
# To get readable error messages from assign
boye> setenv LANG C 
## Cray -> IEEE translation on the fly
boye> assign -s cos -N cray BINFILE.cray-pvp.dat
## Execute the user application
boye> ./my_application_program

Comments about this document

If you have comments on this document, let us know. You can use the help-request form and help us make it better.


pdc-staff, $Date: 2004/09/27 10:24:41 $