\documentclass{ictlab}

\RCS $Revision: 1.1 $

\usepackage{alltt,xr,url}
\ifx\pdftexversion\undefined \else
  \usepackage[breaklinks,pdfpagemode=None,pdfauthor={Nick
    Urbanik}]{hyperref}
\fi
% Oh dear, external references don't work when compiling both dvi and
% pdf output.
\externaldocument[lt-]{../../linux_training-plus-config-files-ossi/build/masterfile}

\newcommand*{\labTitle}{Supplementary Assignment: Processing Identical Files}
\renewcommand*{\subject}{Operating Systems and Systems Integration}

\begin{Solutions}%
  \gdef\solution{\paragraph{Solution:}}%
\end{Solutions}

\providecommand*{\MD}{\acro{MD}\xspace}
\renewcommand*{\bs}{\texttt{\char '134}}

\begin{document}
%\tableofcontents


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Submission}%
\label{sec:submission}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\paragraph{Deadline:}

by 4.30\,pm, Tuesday 27 July 2004.  Late submissions are counted as
failure.

\paragraph{Where:}
\begin{itemize}
\item Paper submission at the office, C440, and
\item Online at \url{http://nicku.org/perl2/submit.cgi}.
\end{itemize}

\paragraph{What:}

Submit:
\begin{itemize}
\item A printout of your program to the office
\item A tarball or \acro{ZIP} file submitted online, as described
  above.  The tarball or \acro{ZIP} file should contain two shell
  scripts.
\end{itemize}

\paragraph{Cheating:}

Your work \emph{must} be original.  Copying will be \emph{severely}
dealt with.  I \emph{will} use the plagiarism detection tools at
\url{http://www.cs.berkeley.edu/~aiken/moss.html}, and rigorously
compare your work with all previous assignment submissions.  Of
course, you are welcome to use the code I have provided in lectures
and workshop sessions and emails I have sent you previously.

If your work fails the plagiarism detection, and if I find evidence of
copying, then you will fail the \acro{CA} component of this subject,
and will need to repeat one year of study of the subject.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Assignment Requirements}%
\label{sec:requirements}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This assignment requires you to write two shell scripts.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Delete Files in Current directory of which there are
  Copies Below}%
\label{sec:delete-from-current-dir}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Write a shell script that:
\begin{enumerate}
\item Takes a list of filenames on the command line that exist in the
  current directory;
\item For each of these files, the program:
  \begin{enumerate}
  \item calculates an \acro{MD5} sum of the file contents;
  \item searches the subdirectories below the current directory for a
    file of the same name, and calculates the \acro{MD5} sum of the
    contents of each of the other files with the same name;
  \item prints the details of each pair of files, indicating whether
    their contents are the same or not;
  \item if the contents are the same, the program will delete the copy
    of the file in the \emph{current} directory.
  \end{enumerate}
\end{enumerate}
I wrote a script like this to check that the photos downloaded from my
digital camera really were on the web site before deleting them.

\subsection{Link All Duplicates}
\label{sec:link-all-duplicates}

Write a shell script that:
\begin{enumerate}
\item Searches all files in the current directory and below that are
  on the same partition as the current directory
\item Calculates the \acro{MD5} sum of each file
\item Determines which files have the same file contents
\item For each set of files that have the same file contents and which
  are on the same partition
  \begin{enumerate}
  \item Replaces every other copy of the file with a \emph{hard link}
    to one of the copies.
  \end{enumerate}
\end{enumerate}

Note that the \texttt{-xdev} option to \texttt{find} may be helpful.
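
As a minimal sketch of what a hard link and the \texttt{-xdev} option
do (not a solution to the assignment): the file names below are
arbitrary examples created in a scratch directory, and the sketch
assumes that \texttt{mktemp} is installed and that your shell supports
the \texttt{-ef} test, which Bash does although \POSIX does not
require it.
\begin{verbatim}
#!/bin/sh
# Build a scratch directory containing a file and a separate copy of it.
tmpdir=$(mktemp -d) || exit 1
mkdir "$tmpdir/sub" || exit 1
printf 'same contents\n' > "$tmpdir/original"
cp "$tmpdir/original" "$tmpdir/sub/copy" || exit 1

# List regular files without descending into other mounted filesystems.
find "$tmpdir" -xdev -type f -print

# Replace the copy by a hard link to the original
# (-f removes the existing destination file first).
ln -f "$tmpdir/original" "$tmpdir/sub/copy" || exit 1

# -ef is true when both names refer to the same file (same inode).
if [ "$tmpdir/original" -ef "$tmpdir/sub/copy" ]
then
    linked=yes
else
    linked=no
fi
echo "hard linked: $linked"
rm -r "$tmpdir"
\end{verbatim}
After the \texttt{ln} command, both names refer to one file on disk,
so the data is stored only once.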

A script like this could be useful to save disk space when dealing
with large amounts of data that are being distributed by \HTTP or
\FTP, but where copies of the same files need to appear in different
directories; for example, the source code bundled with binary software
packages for different architectures will for the most part be
identical for each architecture.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{MD5 Sum}%
\label{sec:md5sum}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The \acro{MD5} sum is a \emph{one way hash}, described in \RFC 1321.
It takes any size of input, and calculates a 128-bit value as output.
This calculation is made in such a way that it is extremely unlikely
that two different files have the same \acro{MD5} sum.  Even if one
bit in one byte of the data in a file changes, the \acro{MD5} sum will
be totally different.

Most Linux systems have a command, \texttt{md5sum} (part of \acro{GNU}
coreutils; it is not specified by \POSIX), that calculates the
\acro{MD5} sum in hexadecimal of the contents of the input files.  It
prints one line of output per file: the 32-character \acro{MD5} sum
first, then the file name.

See
\begin{alltt}
$ \textbf{man md5sum}
\end{alltt}%$

Here is example output from \texttt{md5sum}:
\begin{alltt}
$ \textbf{md5sum *}
042e4581dcb7f30a256ab961e6643a2f  assignment-ca-supp-delete.tex
8b977b41c35c11ee0e3e06b951fa2e89  assignment-ca-supp-delete.tex~
91f357e9ce91baf7070b1be6e8744d93  assignment-ca-supp-delete.toc
c09bbef217efa9a3ebeef30f375b6577  Makefile
\end{alltt}
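
The following sketch shows one way to read this output in a script,
and illustrates that identical files give identical sums while a
one-character change gives a different sum.  The file names and
contents are arbitrary examples, and the sketch assumes that
\texttt{md5sum} and \texttt{mktemp} are installed.
\begin{verbatim}
#!/bin/sh
# Scratch files: b.txt is a copy of a.txt; c.txt differs in one character.
tmpdir=$(mktemp -d) || exit 1
printf 'hello world\n' > "$tmpdir/a.txt"
cp "$tmpdir/a.txt" "$tmpdir/b.txt" || exit 1
printf 'hello worle\n' > "$tmpdir/c.txt"

# Each line of md5sum output is: 32 hex digits, two spaces, file name.
# read puts the first field in sum and the rest of the line in name.
listing=$(md5sum "$tmpdir"/*.txt |
    while read -r sum name
    do
        echo "$sum ${name##*/}"
    done)
echo "$listing"
rm -r "$tmpdir"
\end{verbatim}
The lines for \texttt{a.txt} and \texttt{b.txt} should show the same
sum, and the line for \texttt{c.txt} a different one.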

\subsection{Other Useful POSIX Commands}
\label{sec:useful-commands}

Many \POSIX commands may be useful alongside \texttt{md5sum} when
working on this assignment.  Here are a few: \texttt{find},
\texttt{xargs}, \texttt{sort}, \texttt{uniq}, \texttt{grep}, and the
built-in commands \texttt{test} (or \texttt{[\ldots]}) and \texttt{echo}.
You will also find other shell programming structures useful here,
particularly pipes to connect these commands together.
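
As a small illustration of connecting commands with pipes, this sketch
prints each file \emph{name} that occurs more than once below a
directory.  It is only an example of the technique, not part of a
solution; the file names are made up, and \texttt{sed} is an addition
to the list above.
\begin{verbatim}
#!/bin/sh
tmpdir=$(mktemp -d) || exit 1
mkdir "$tmpdir/x" "$tmpdir/y" || exit 1
touch "$tmpdir/x/notes.txt" "$tmpdir/y/notes.txt" "$tmpdir/y/other.txt"

# find lists the paths, sed strips the directory part, sort brings
# equal names together, and uniq -d prints each repeated name once.
repeated=$(find "$tmpdir" -type f | sed 's|.*/||' | sort | uniq -d)
echo "$repeated"
rm -r "$tmpdir"
\end{verbatim}
No temporary file is needed: each command reads the output of the
previous one directly.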

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% \section{Examples}%
%% \label{sec:examples}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Guide to Testing}%
\label{sec:guide-to-testing}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

If you want to fail this subject, the best thing to do is to never
test your work, and blindly submit it and hope it works.  It probably
doesn't.

Test \emph{incrementally} (as you write the program).  As you build
each part, test it and make sure it does what you expect.  Don't just
sit down, write your whole program, and only then start testing it.

You may use the \texttt{-x} option to the shell to enable tracing of
your program as it executes.
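
For example, you can trace a whole script by running it with
\texttt{bash -x} followed by the name of the script, or trace only
part of it with \texttt{set -x} and \texttt{set +x}, as in this
minimal sketch (the variable and message are arbitrary examples):
\begin{verbatim}
#!/bin/sh
set -x          # start tracing: each command is printed before it runs
name=example
echo "processing $name"
set +x          # stop tracing
echo "tracing is now off"
\end{verbatim}
The trace is written to standard error, each traced command normally
prefixed with a plus sign, so it does not mix with data flowing
through a pipe.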

Create a deeply nested directory structure, and copy lots of files into
this.  Make copies of various files into different locations in this
directory structure.  Keep a record of all the files you have created,
and record what you expect your program to do.  Record the \acro{MD5}
sums of all the files.
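
One possible way to set up such a test area is sketched below.  All
the directory and file names are arbitrary examples, and the sketch
assumes that \texttt{mktemp} and \texttt{md5sum} are installed.
\begin{verbatim}
#!/bin/sh
# Scratch area for testing; remove it with rm -r when you have finished.
testdir=$(mktemp -d) || exit 1
mkdir -p "$testdir/a/b/c" "$testdir/d/e" || exit 1
printf 'first file\n'  > "$testdir/one.txt"
printf 'second file\n' > "$testdir/two.txt"

# An identical copy, and a file with the same name but different contents.
cp "$testdir/one.txt" "$testdir/a/b/c/one.txt" || exit 1
printf 'changed\n' > "$testdir/d/e/two.txt"

# Record the MD5 sums before running the program under test.
find "$testdir" -type f -exec md5sum {} + | sort > "$testdir.sums"
cat "$testdir.sums"
\end{verbatim}
The record is written beside the test directory rather than inside it,
so that it does not become one of the files being tested.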

Predict what your program should do, and write that down.

Run your program, and verify that it behaved as you expect.

Change the directory structure, change some of the files, 
and test again.

Test as much as you can.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Marking Scheme}%
\label{sec:marking-scheme}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Your submission will be marked as follows.  

For the first program, 
\par\bigskip\par%
\noindent%
\begin{tabularx}{\linewidth}{@{}Yl@{}}
  \toprule%
  \textbf{criteria} & \textbf{mark} \\
  \midrule%
  Handle the list of filenames on the command line, and calculate the
  \acro{MD5} sum of each, ensuring that they exist & 10\\
  Search for matching files and calculate the \acro{MD5} sum of
  matches & 30 \\
  Display details of matching files & 20 \\
  Delete the files if this is appropriate, and never delete the
  wrong file & 40 \\
  \bottomrule
\end{tabularx}

\par\bigskip\par%
For the second program, 
\par\bigskip\par%
\noindent%
\begin{tabularx}{\linewidth}{@{}Yl@{}}
  \toprule%
  \textbf{criteria} & \textbf{mark} \\
  \midrule%
  Find all matching files that have the same file contents & 50 \\
  Ensure that they are on the same partition (filesystem) & 10 \\
  Replace each copy with a link to one of the copies & 40 \\
  \bottomrule
\end{tabularx}

\par\bigskip\par%
The marks for each program are equal, and the marks listed above for
meeting requirements constitute 70\% of the marks for this assignment.
The remaining 30\% shall be determined by the quality of the
submission according to the following scheme:

\par\bigskip\par%
\noindent%
\begin{tabularx}{\linewidth}{@{}Yl@{}}
  \toprule%
  \textbf{criteria} & \textbf{mark} \\
  \midrule%
  Elegant design that is as simple as possible & 40 \\
  Use of pipes to connect the commands together, avoiding temporary
  files & 30 \\
  Robust design, never making any system call unchecked & 15 \\
  Good structure of program: code divided into simple functions that
  each do one simple, easily specified action, identifiers have
  meaningful
  names,\,\ldots & 10 \\
  Additional flexibility (i.e., behaviour can be changed using
  options) & 5 \\
  \bottomrule
\end{tabularx}


\section{Getting Help}%
\label{sec:help}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Documentation}%
\label{sec:documentation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

All these commands are documented with \texttt{man} pages, as well as
(in more detail) in \texttt{info} format.  I usually use Emacs to read
\texttt{info} documentation, as I described in
%section~\vref{lt-sec:emacs-info}
section~6.9 on page 192 of my workshop notes (see below for the link
to the workshop notes).  The documentation for the Bash shell itself
is in one big \texttt{man} page, or alternatively, available in
\texttt{info} format.

There are at least two very useful books available online at
\url{http://tldp.org/guides.html}.  There is the \emph{Bash Guide for
Beginners} available in \HTML at
\url{http://tldp.org/LDP/Bash-Beginners-Guide/html/index.html} and in
other formats too.  There is also the \emph{Advanced Bash-Scripting
  Guide} available in \HTML at
\url{http://tldp.org/LDP/abs/html/index.html}.  Both contain plenty of
examples.

Of course, my workshop notes may be helpful here:
\url{http://nicku.org/ossi/lab/workshop-notes.pdf} as
well as my lecture notes on shell programming:
\url{http://nicku.org/ossi/lectures/shell/shell-slides.pdf}.

See Module 5 of my workshop notes for details about hard links.  Also
see the notes at
\url{http://nicku.org/ossi/lab/sym-link/sym-link.pdf}
for more about hard and soft links.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Asking Questions}%
\label{sec:asking-questions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

If you have any questions about the assignment, please ask them in
person, by phone to the office (2436\,8576) or by email.  I will send
the reply to all students if that seems helpful, but I will conceal
the identity of the person who asked the question.  I welcome any
questions.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{I am Not Available in the Office After 16 July 2004}
\label{sec:asking-questions-before-16-july}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Please note that I will not be available in the office after 16 July
2004.  However, I will read your email and answer it with care.
\end{document}

find / -xdev -type f 2> /dev/null | xargs md5sum 2> /dev/null \
 | sort > /tmp/sortedlist.txt

uniq -w 32 -D /tmp/sortedlist.txt

These did not show all the matching files, only one file name for each
match:
uniq -w 32 -c /tmp/sortedlist.txt | egrep -v '^ *1[^0-9]'
uniq -w 32 -d /tmp/sortedlist.txt

So, in one fell swoop:
find . -xdev -type f 2> /dev/null | xargs md5sum 2> /dev/null \
 | sort \
 | uniq -w 32 -D

uniq -w 32 -D /tmp/sortedlist.txt \
| { while read md5 file;do if [ "$md5" = "$last" ];then
line="$line $file";else [ "$line" ] && echo $line;last=$md5;line=$file;fi;done
[ "$line" ] && echo $line; }

So the whole shebang:
find . -xdev -type f 2> /dev/null | xargs md5sum 2> /dev/null \
 | sort \
 | uniq -w 32 -D |
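    {
        # Print each group of identical files on one line.
        while read -r md5 file
        do
            if [ "$md5" = "$last" ]
            then
                line="$line $file"
            else
                # A new group starts: print the previous one, if any,
                # and begin the new line with the first file of the group.
                [ "$line" ] && echo "$line"
                last=$md5
                line=$file
            fi
        done
        # Print the final group, which no later line would flush.
        [ "$line" ] && echo "$line"
    }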

Here we sort by the number of duplicates:
uniq -w 32 -D /tmp/sortedlist.txt | { while read md5 file;do if [ "$md5" = "$last" ];then line="$line $file";((++i));else [ "$line" ] && echo $i $line;last=$md5;line=$file;i=1;fi;done;[ "$line" ] && echo $i $line; } | sort -k1n,1 -k2