\documentclass{ictlab}
\RCS $Revision: 1.1 $
\usepackage{alltt,xr,url}
\ifx\pdftexversion\undefined
\else
  \usepackage[breaklinks,pdfpagemode=None,pdfauthor={Nick Urbanik}]{hyperref}
\fi
% Oh dear, external references don't work when compiling both dvi and
% pdf output.
\externaldocument[lt-]{../../linux_training-plus-config-files-ossi/build/masterfile}

\newcommand*{\labTitle}{Supplementary Assignment: Processing Identical Files}
\renewcommand*{\subject}{Operating Systems and Systems Integration}

\begin{Solutions}%
  \gdef\solution{\paragraph{Solution:}}%
\end{Solutions}

\providecommand*{\MD}{\acro{MD}\xspace}
\renewcommand*{\bs}{\texttt{\char '134}}

\begin{document}

%\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Submission}%
\label{sec:submission}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\paragraph{Deadline:} by 4.30\,pm, Tuesday 27 July 2004.  Late
submissions are counted as failure.

\paragraph{Where:}
\begin{itemize}
\item Paper submission at the office, C440, and
\item Online at \url{http://nicku.org/perl2/submit.cgi}.
\end{itemize}

\paragraph{What:} Submit:
\begin{itemize}
\item A printout of your program to the office
\item A tarball or \acro{ZIP} file submitted online, as described
  above.  The tarball or \acro{ZIP} file should contain two shell
  scripts.
\end{itemize}

\paragraph{Cheating:} Your work \emph{must} be original.  Copying will
be \emph{severely} dealt with.  I \emph{will} use the plagiarism
detection tools at \url{http://www.cs.berkeley.edu/~aiken/moss.html},
and rigorously compare your work with all previous assignment
submissions.  Of course, you are welcome to use the code I have
provided in lectures and workshop sessions and emails I have sent you
previously.
If your work fails the plagiarism detection, and if I find evidence of
copying, then you will fail the \acro{CA} component of this subject,
and will need to repeat one year of study of the subject.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Assignment Requirements}%
\label{sec:requirements}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This assignment requires you to write two shell scripts.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Delete Files in the Current Directory That Have Copies Below}%
\label{sec:delete-from-current-dir}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Write a shell script that:
\begin{enumerate}
\item Takes a list of filenames on the command line that exist in the
  current directory;
\item For each of these files, the program:
  \begin{enumerate}
  \item calculates an \acro{MD5} sum of the file contents;
  \item searches the subdirectories below the current directory for a
    file of the same name, and calculates the \acro{MD5} sum of the
    contents of each of the other files with the same name;
  \item prints the details of each pair of files, indicating whether
    their contents are the same or not;
  \item if the contents are the same, deletes the copy of the file in
    the \emph{current} directory.
  \end{enumerate}
\end{enumerate}

I wrote a script like this to check that the photos downloaded from my
digital camera really were on the web site before deleting them.
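
The search-and-compare steps above can be sketched as follows.  This
is only an illustration under stated assumptions, not a model answer:
the function name \texttt{compare\_below} is hypothetical, it assumes
that \texttt{md5sum} is installed and that \texttt{find} supports
\texttt{-mindepth}, and it deliberately stops short of deleting
anything.

```shell
#!/bin/sh
# compare_below NAME: hypothetical helper, not a complete solution.
# Compares ./NAME with every file called NAME in the subdirectories
# below the current directory and reports "same" or "differ".
# Assumes md5sum is available and that find supports -mindepth
# (GNU and BSD find do; it is not required by POSIX).
compare_below() {
    name=$1
    if [ ! -f "./$name" ]; then
        echo "$name: no such regular file in current directory" >&2
        return 1
    fi
    # Reading from standard input keeps the file name out of the output.
    sum=$(md5sum < "./$name" | cut -d ' ' -f 1)
    # -mindepth 2 skips ./NAME itself; note that -name treats NAME as
    # a shell pattern, so names containing * ? [ need extra care.
    find . -mindepth 2 -type f -name "$name" |
    while IFS= read -r copy
    do
        other=$(md5sum < "$copy" | cut -d ' ' -f 1)
        if [ "$other" = "$sum" ]; then
            echo "same: ./$name $copy"
        else
            echo "differ: ./$name $copy"
        fi
    done
}
```

Calling \texttt{compare\_below pic.jpg} prints one line for each
file named \texttt{pic.jpg} found below the current directory.  A real
solution still has to decide when deletion is safe, and must check the
result of every command it relies on.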
\subsection{Link All Duplicates}
\label{sec:link-all-duplicates}

Write a shell script that:
\begin{enumerate}
\item Finds all files in the current directory and below that are on
  the same partition as the current directory
\item Calculates the \acro{MD5} sum of each file
\item Determines which files have the same file contents
\item For each set of files that have the same file contents and that
  are on the same partition
  \begin{enumerate}
  \item Replaces every other copy of the file with a \emph{hard link}
    to one of the copies of the file.
  \end{enumerate}
\end{enumerate}
Note that the \texttt{-xdev} option to \texttt{find} may be helpful.

A script like this could be useful to save disk space when dealing
with large amounts of data that are being distributed by \HTTP or
\FTP, but where copies of the same files need to appear in different
directories; for example, source code bundled with binary software
packages for different architectures will for the most part be
identical for each architecture.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{MD5 Sum}%
\label{sec:md5sum}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The \acro{MD5} sum is a \emph{one way hash}, described in \RFC 1321.
It takes any size of input, and calculates a 128-bit value as output.
This calculation is made in such a way that it is extremely unlikely
that two different files have the same \acro{MD5} sum by chance.  Even
if one bit in one byte of the data in a file changes, the \acro{MD5}
sum will be totally different.

Linux systems have a command, \texttt{md5sum} (part of the \acro{GNU}
core utilities rather than of the \POSIX standard), that calculates
the \acro{MD5} sum in hexadecimal of the contents of the input files.
It displays the results so that the 32-character \acro{MD5} sum is
displayed first, then the file name, one per line of output.
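
As a small illustration of this output format, the sketch below
extracts just the 32-character sum so that two files can be compared.
The function names \texttt{md5\_of} and \texttt{same\_md5} are
hypothetical helpers introduced here for illustration; the sketch
assumes only that \texttt{md5sum} and \texttt{cut} are available.

```shell
#!/bin/sh
# md5_of FILE: print only the 32-character hexadecimal MD5 sum of FILE.
# (md5_of is a hypothetical helper name, not a standard command.)
# Reading from standard input makes md5sum print "-" instead of the
# file name; cut keeps only the first space-separated field, the sum.
md5_of() {
    md5sum < "$1" | cut -d ' ' -f 1
}

# same_md5 FILE1 FILE2: succeed only if both sums could be calculated
# and are equal.  An unreadable file yields an empty sum, so the
# comparison fails rather than reporting a false match.
same_md5() {
    sum1=$(md5_of "$1")
    sum2=$(md5_of "$2")
    [ -n "$sum1" ] && [ -n "$sum2" ] && [ "$sum1" = "$sum2" ]
}
```

For example, \texttt{same\_md5 a.txt b.txt \&\& echo identical} prints
\texttt{identical} only when the two files have the same \acro{MD5}
sum.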
See
\begin{alltt}
$ \textbf{man md5sum}
\end{alltt}%$
Here is example output from \texttt{md5sum}:
\begin{alltt}
$ \textbf{md5sum *}
042e4581dcb7f30a256ab961e6643a2f  assignment-ca-supp-delete.tex
8b977b41c35c11ee0e3e06b951fa2e89  assignment-ca-supp-delete.tex~
91f357e9ce91baf7070b1be6e8744d93  assignment-ca-supp-delete.toc
c09bbef217efa9a3ebeef30f375b6577  Makefile
\end{alltt}

\subsection{Other Useful POSIX Commands}
\label{sec:useful-commands}

Besides \texttt{md5sum}, many \POSIX commands may be useful when
working on this assignment.  Here are a few: \texttt{find},
\texttt{xargs}, \texttt{sort}, \texttt{uniq}, \texttt{grep}, and the
built-in commands \texttt{test} (or \texttt{[\ldots]}) and
\texttt{echo}.  You will find that other programming structures are
useful here, particularly pipes to connect these commands together.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% \section{Examples}%
%% \label{sec:examples}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Guide to Testing\label{sec:guide-to-testing}}%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

If you want to fail this subject, the best thing to do is to never
test your work, and blindly submit it and hope it works.  It probably
won't.

Test \emph{incrementally} (as you write the program).  As you build
each part, test it and make sure it does what you expect.  Don't just
sit down, write your whole program, then start testing it.

You may use the \texttt{-x} option to the shell to enable tracing of
your program as it executes.

Create a deeply nested directory structure, and copy lots of files
into this.  Make copies of various files into different locations in
this directory structure.
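
A test tree of this kind might be created with a sketch such as the
following.  The function name \texttt{make\_test\_tree} and all the
file and directory names are hypothetical, chosen only for
illustration; a thorough test would use many more files and deeper
nesting.

```shell
#!/bin/sh
# make_test_tree DIR: hypothetical helper that builds a small nested
# directory tree under DIR containing identical and differing copies.
make_test_tree() {
    root=$1
    mkdir -p "$root/a/b/c" "$root/x/y" || return 1
    echo "first file"  > "$root/one.txt"
    echo "second file" > "$root/two.txt"
    # Identical copy of one.txt, deep in the tree.
    cp "$root/one.txt" "$root/a/b/c/one.txt" || return 1
    # Same name as two.txt, but different contents.
    echo "second file, edited" > "$root/x/y/two.txt"
    # Same contents as one.txt, but a different name.
    cp "$root/one.txt" "$root/x/copy-of-one.txt" || return 1
}
```

Building the tree in a scratch directory, for example with
\texttt{make\_test\_tree /tmp/testtree}, makes it cheap to throw the
tree away and rebuild it before each test run.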
Keep a record of all the files you have created, and record what you
expect your program to do.  Record the \acro{MD5} sums of all the
files.  Predict what your program should do, and write that down.  Run
your program, and verify that it behaved as you expected.

Change the directory structure, change some of the files, and test
again.  Test as much as you can.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Marking Scheme}%
\label{sec:marking-scheme}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Your submission will be marked as follows.  For the first program,
\par\bigskip\par%
\noindent%
\begin{tabularx}{\linewidth}{@{}Yl@{}}
  \toprule%
  \textbf{criteria} & \textbf{mark} \\
  \midrule%
  Handle the list of filenames on the command line, and calculate the
  \acro{MD5} sum of each, ensuring that they exist & 10 \\
  Search for matching files and calculate the \acro{MD5} sum of
  matches & 30 \\
  Display details of matching files & 20 \\
  Delete the files if this is appropriate, and never delete the wrong
  file & 40 \\
  \bottomrule
\end{tabularx}
\par\bigskip\par%
For the second program,
\par\bigskip\par%
\noindent%
\begin{tabularx}{\linewidth}{@{}Yl@{}}
  \toprule%
  \textbf{criteria} & \textbf{mark} \\
  \midrule%
  Find all matching files that have the same file contents & 50 \\
  Ensure that they are on the same partition (filesystem) & 10 \\
  Replace each copy with a link to one of the copies & 40 \\
  \bottomrule
\end{tabularx}
\par\bigskip\par%
The marks for each program are equal, and the marks listed above for
meeting requirements constitute 70\% of the marks for this assignment.
The remaining 30\% shall be determined by the quality of the
submission according to the following scheme:
\par\bigskip\par%
\noindent%
\begin{tabularx}{\linewidth}{@{}Yl@{}}
  \toprule%
  \textbf{criteria} & \textbf{mark} \\
  \midrule%
  Elegant design that is as simple as possible & 40 \\
  Use of pipes to connect the commands together, avoiding temporary
  files & 30 \\
  Robust design, never making any system call unchecked & 15 \\
  Good structure of program: code divided into simple functions that
  each do one simple, easily specified action, identifiers have
  meaningful names,\,\ldots & 10 \\
  Additional flexibility (i.e., behaviour can be changed using
  options) & 5 \\
  \bottomrule
\end{tabularx}

\section{Getting Help}%
\label{sec:help}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Documentation}%
\label{sec:documentation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

All these commands are documented with \texttt{man} pages, as well as
(in more detail) in \texttt{info} format.  I usually use Emacs to read
\texttt{info} documentation, as I described in
%section~\vref{lt-sec:emacs-info}
section~6.9 on page 192 of my workshop notes (see below for the link
to the workshop notes).

The documentation for the Bash shell itself is in one big \texttt{man}
page, or alternatively, available in \texttt{info} format.

There are at least two very useful books available online at
\url{http://tldp.org/guides.html}.  There is the \emph{Bash Guide for
  Beginners} available in \HTML at
\url{http://tldp.org/LDP/Bash-Beginners-Guide/html/index.html} and in
other formats too.  There is also the \emph{Advanced Bash-Scripting
  Guide} available in \HTML at
\url{http://tldp.org/LDP/abs/html/index.html}.  Both contain plenty of
examples.

Of course, my workshop notes may be helpful here:
\url{http://nicku.org/ossi/lab/workshop-notes.pdf} as well as my
lecture notes on shell programming:
\url{http://nicku.org/ossi/lectures/shell/shell-slides.pdf}.
See Module 5 of my workshop notes for details about hard links.  Also
see the notes at \url{http://nicku.org/ossi/lab/sym-link/sym-link.pdf}
for more about hard and soft links.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Asking Questions}%
\label{sec:asking-questions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

If you have any questions about the assignment, please ask them in
person, by phone to the office (2436\,8576) or by email.  I will send
the reply to all students if that seems helpful, but I will conceal
the identity of the person who asked the question.  I welcome any
questions.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{I am Not Available in the Office After 16 July 2004}
\label{sec:asking-questions-before-16-july}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Please note that I will not be available in the office after 16 July
2004.  However, I will read your email and answer it with care.

\end{document}

find / -xdev -type f -print0 2> /dev/null | xargs -0 md5sum 2> /dev/null \
    | sort > /tmp/sortedlist.txt
uniq -w 32 -D /tmp/sortedlist.txt

These did not show all the matching files, only one file name for each
match:

uniq -w 32 -c /tmp/sortedlist.txt | grep -Ev '^ *1[^0-9]'
uniq -w 32 -d /tmp/sortedlist.txt

So, in one fell swoop:

find . -xdev -type f -print0 2> /dev/null | xargs -0 md5sum 2> /dev/null \
    | sort \
    | uniq -w 32 -D

Print each set of duplicates on one line (dup is set only once a second
file with the same sum has been seen; the last set is printed after the
loop ends):

cat /tmp/sortedlist.txt \
    | { while read -r md5 file; do if [ "$last" ] && [ "$md5" = "$last" ]; then line="$line $file"; dup=yes; else [ "$dup" ] && echo "$line"; last=$md5; line=$file; dup=; fi; done; [ "$dup" ] && echo "$line"; }

So the whole shebang:

find . -xdev -type f -print0 2> /dev/null | xargs -0 md5sum 2> /dev/null \
    | sort \
    | uniq -w 32 -D \
    | {
        while read -r md5 file
        do
            if [ "$last" ] && [ "$md5" = "$last" ]
            then
                line="$line $file"
            else
                # uniq -D passes only duplicated sums, so every
                # completed line holds at least two file names
                [ "$line" ] && echo "$line"
                last=$md5
                line=$file
            fi
        done
        # print the final set, which no later line triggers
        [ "$line" ] && echo "$line"
    }

Here we sort by the number of duplicates:

cat /tmp/sortedlist.txt \
    | { while read -r md5 file; do if [ "$last" ] && [ "$md5" = "$last" ]; then line="$line $file"; i=$((i + 1)); else [ "${i:-0}" -gt 1 ] && echo "$i $line"; last=$md5; line=$file; i=1; fi; done; [ "${i:-0}" -gt 1 ] && echo "$i $line"; } \
    | sort -k1n,1 -k2