Friday, January 15, 2010

MATLAB: Read formatted input file

The data collected from the FIFA Ranking table and the CIA Factbook were entered in a tab-delimited text file with 32 rows and 8 columns, shown below.

[Image: WCTextFileCapture, a screenshot of the text file]

Obviously, the file can be easily manipulated with Excel, but I also want to do some more work that is easier in MATLAB.

It is straightforward to open the file with Excel, copy a column, and paste it as input to a variable in the command window or a file. But that is too crude for my taste, and it does not work for very large files. Plus, MATLAB (7.1 R14) offers a number of functions that can read formatted data. These functions are outlined below. For each case I estimate runtime performance by measuring elapsed time with tic-toc. Prior to running each case I clear the memory using clear all, as shown in Case 1.
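The tic-toc measurement can be repeated and averaged to smooth out run-to-run noise. The following is a minimal sketch only: the repetition count nRuns and the str2num call are placeholders (assumptions, not part of the original measurements), and the workload would be swapped for one of the cases below.

```matlab
% Hypothetical timing harness: average tic-toc over several runs.
nRuns   = 10;                  % assumed number of repetitions
elapsed = zeros(1, nRuns);     % preallocate the vector of timings

for k = 1:nRuns
    tic
    x = str2num('3.14');       % placeholder workload; replace with a real case
    elapsed(k) = toc;          % elapsed time of this run, in seconds
end

avgTime = mean(elapsed);
fprintf('Average elapsed time: %.4f sec over %d runs\n', avgTime, nRuns);
```

Note that calling clear all inside the loop would also wipe the loop counter and the timing vector, so this sketch does not call it.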

Case 1: use IMPORTDATA
Function importdata reads the text file and returns a structure, placing the numeric values in field data and the text in field textdata. It works OK but is not perfect because it can easily confuse data and text.

clear all

tic

M = importdata('WC2010_GK Stats.txt');

fifaRank = str2num( char(M.textdata(:,1)) );
country = M.textdata(:,2);
points = M.data(:,1);
GDPcapita = M.data(:,3);

toc
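Because importdata decides on its own what counts as data and what counts as text, it can help to inspect the result before indexing into it. The sketch below is an illustration under assumptions: it writes a tiny temporary file with made-up rows (not the real table) and only reports what importdata returned, without assuming a particular split between data and textdata.

```matlab
% Write a tiny tab-delimited file with made-up rows (not the real data)
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '1\tSpain\t1622\n2\tBrazil\t1592\n');
fclose(fid);

M = importdata(fileName);

% Inspect what came back before relying on M.data or M.textdata
if isstruct(M)
    disp(fieldnames(M));
    if isfield(M, 'data'),     disp(size(M.data));     end
    if isfield(M, 'textdata'), disp(size(M.textdata)); end
else
    disp(class(M));            % importdata may also return a plain array or cell
end

delete(fileName);              % remove the temporary file
```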

Case 2: use STRREAD
Function strread reads formatted data from a string. It requires a loop to go through the file and some string manipulation. However, it is pretty fast and reliable.

fid = fopen('WC2010_GK Stats.txt', 'r');

n = 1;

% If fgetl encounters EOF indicator, it returns -1
while 1
   tline = fgetl(fid); % return the next line of the file associated w/ fid
   if ~ischar(tline),   break,   end % terminate loop
   dummy = strread(tline, '%s');
   % Preallocating the output variables would speed this up, though for this
   % small file it makes little difference
   fifaRank(n) = str2num( char( dummy(1,:) ) );
   % NOTE: theoretically str2double is faster than str2num but I have never
   % seen any real advantage!

   country(n,:) = dummy(2,:);
   points(n) = str2num( char( dummy(3,:) ) );
   GDPcapita(n) = str2num( char( dummy(5,:) ) );
   n = n + 1;
end

fclose(fid);
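The comments in the loop above mention preallocation and str2double. A minimal sketch of that variant follows, assuming the number of rows is known in advance. To keep it self-contained it parses two made-up sample lines held in a cell array instead of reading the file, and it passes an explicit tab delimiter to strread, which is an assumption about how the file is laid out.

```matlab
% Made-up sample lines standing in for the file contents (not the real data)
lines = {sprintf('1\tSpain\t1622'), sprintf('2\tBrazil\t1592')};
nRows = numel(lines);
tab   = sprintf('\t');               % the tab character used as delimiter

% Preallocate the outputs instead of growing them inside the loop
fifaRank = zeros(nRows, 1);
country  = cell(nRows, 1);
points   = zeros(nRows, 1);

for n = 1:nRows
    dummy = strread(lines{n}, '%s', 'delimiter', tab);  % cell array of fields
    fifaRank(n) = str2double(dummy{1});
    country{n}  = dummy{2};
    points(n)   = str2double(dummy{3});
end
```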

 

Case 3: use TEXTREAD
Function textread reads formatted data from a text file. It is easy to use once you read the (extensive) documentation.

[fifaRank, country, points, x, GDPcapita, y, z, w] = ...
   textread('WC2010_GK Stats.txt', '%d %s %f %f %f %f %f %f');

 

Case 4: use SSCANF
Function sscanf reads data from the MATLAB string s, converts it according to the specified format string, and returns it in matrix A as a column. For a string that mixes numbers and characters the function returns everything as numbers, with each character converted to its numeric code, which can be painful!

fid = fopen('WC2010_GK Stats.txt', 'r');

n = 1;

% If fgetl encounters EOF indicator, it returns -1
while 1
   tline = fgetl(fid); % return the next line of the file associated w/ fid
   if ~ischar(tline),   break,   end % terminate loop

   % Read in the numerical values; use * to ignore character input for the
   % first pass. The reason is that the country name is variable length.
   % Recall that conversion characters marked with an asterisk are NOT
   % returned.

   A = sscanf(tline, '%e %*s %e %e %e %e %e %e', Inf);
   fifaRank(n) = A(1);
   points(n) = A(2);
   GDPcapita(n) = A(4);
  
   % Now do a second pass with sscanf ignoring numerical characters.
   % Then use char to put name together.
   B = sscanf(tline, '%*e %s %*e %*e %*e %*e %*e %*e', inf);
   country(n,:) = cellstr( char(B) );

   n = n + 1;
end

fclose(fid);
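The reason for the two passes can be seen on a single line. In this small sketch (the sample line is made up, not taken from the file), a format that mixes numeric and text conversions makes sscanf return one numeric column in which every letter of the name appears as its character code.

```matlab
tline = sprintf('1\tSpain\t1622');   % made-up sample line
A = sscanf(tline, '%d %s %d');       % mixed numeric and text conversions

% A is numeric: the rank, one character code per letter, then the points
disp(A');

% The name can be recovered by converting the middle entries back to text
name = char(A(2:end-1))';
disp(name);
```

With a fixed number of numeric fields this recovery is manageable, but the number of entries in A changes with the length of the name, which is why the loop above skips the name in the first pass.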

 

Case 5: use FSCANF
Function fscanf handles numerical data well and is pretty fast, but it does not handle character data very gracefully. I would need to write additional code to format the character data properly.

fid = fopen('WC2010_GK Stats.txt', 'r');

[A, count] = fscanf(fid, '%e %*s %e %e %e %e %e %e', [7 32]);

fifaRank = A(1,:);
points = A(2,:);
GDPcapita = A(4,:);

% Need to set the file position indicator to the beginning of the file
frewind(fid)

% and use fscanf a second time
B = fscanf(fid, '%*e %s %*e %*e %*e %*e %*e %*e');

fclose(fid);

Case 6: use TEXTSCAN
Function textscan is a fairly new function intended to replace textread and strread. It reads in the data, converts them according to the format specifiers, and places them in the cells of a cell array.

fid = fopen('WC2010_GK Stats.txt', 'r');

A = textscan(fid, '%d %s %f %f %f %f %f %f'); % data are placed in cell array

fifaRank = A{:,1};
country = A{:,2};
points = A{:,3};
GDPcapita = A{:,5};

fclose(fid);
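Two textscan options may be useful for a file like this one: an explicit tab delimiter, so that a country name containing a space stays in one field, and the %* prefix, which skips a column instead of storing it. The sketch below is an illustration under assumptions: it parses a made-up string rather than the file, and the fourth column is a hypothetical value that is skipped.

```matlab
% Made-up sample text standing in for the file contents (not the real data)
str = sprintf('1\tSpain\t1622\t46.5\n85\tSouth Africa\t377\t49.0\n');

% %*f skips the fourth column; the tab delimiter keeps 'South Africa' whole
C = textscan(str, '%d %s %f %*f', 'Delimiter', '\t');

fifaRank = C{1};     % int32 column
country  = C{2};     % cell array of strings
points   = C{3};     % double column
```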

As mentioned earlier, I timed each case; the machine was an AMD X2 Dual Core Processor 3800+ with 3 GB of RAM. Average runtime results for my small file are tabulated below.

 

Case  Function    Elapsed time [sec]  Comments
1     importdata  0.2177              41x slower. Standard import data function. Need to check data parameters.
2     strread     0.0352              6.6x slower. Pretty reliable.
3     textread    0.0222              4.2x slower. Least code.
4     sscanf      0.0175              3.3x slower. Needs a bit of code and care with text data. Possibly the function I have used the most.
5     fscanf      0.0104              2x slower. Text not properly displayed. More code needed. Very good for numerical data.
6     textscan    0.0053              Fastest! Efficient! Backwards compatibility issue…
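The "x slower" figures in the comments are consistent with dividing each elapsed time by the fastest one (textscan). A short sketch of that arithmetic, using the times from the table:

```matlab
names = {'importdata', 'strread', 'textread', 'sscanf', 'fscanf', 'textscan'};
t     = [0.2177 0.0352 0.0222 0.0175 0.0104 0.0053];   % elapsed times [sec]

slowdown = t / min(t);     % ratio of each time to the fastest time

for k = 1:numel(t)
    fprintf('%-10s  %6.4f sec  %4.1fx\n', names{k}, t(k), slowdown(k));
end
```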
