
How to read a large file in Java

Problem with reading a large file

A large file is any plain text or binary file that is too big to fit in JVM memory at once. For example, if a Java application is allocated 256 MB of memory and tries to load a whole file whose size is close to or above that limit, it may throw an OutOfMemoryError.
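A rough way to spot the problem in advance is to compare the file size with the maximum heap the JVM may use. The sketch below is only an illustration: the class `HeapCheck`, the method `fitsInHeap` and the safety factor are hypothetical names and values, not part of the original example. The factor is there because text loaded as strings often needs more heap than its size on disk.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original example.
final class HeapCheck {

    // Returns true when fileSizeBytes * factor is below the given heap limit.
    static boolean fitsInHeap(long fileSizeBytes, long maxHeapBytes, double factor) {
        return fileSizeBytes * factor < maxHeapBytes;
    }

    // Convenience overload that asks the file system and the running JVM.
    static boolean fitsInHeap(Path file, double factor) throws IOException {
        return fitsInHeap(Files.size(file), Runtime.getRuntime().maxMemory(), factor);
    }
}
```

With a 256 MB heap and a 500 MB file the check fails for any factor of 1 or more, which is the situation described above.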

Points to be remembered

  • Never read the whole file at once.
  • Read the file line by line or in chunks, for example a few lines at a time from a text file or a few bytes at a time from a binary file.
  • Do not keep all the data in memory, for example by reading all lines and storing them as strings.
Java offers many ways to read a file in chunks or line by line, such as BufferedReader, Scanner, BufferedInputStream and the NIO API. We will use the NIO API to read the file and Java streams to process it. We will also see how to spread the processing across multiple threads to make it faster.
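The points above can be sketched with a BufferedReader for text and a fixed-size buffer for binary data. This is a minimal illustration rather than the approach used in the rest of this post; the class `ChunkedRead` and its method names are hypothetical, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class used only for illustration.
final class ChunkedRead {

    // Counts lines while holding only one line in memory at a time.
    static long countLines(Path file) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    // Counts the bytes of a binary file, reading through a fixed-size buffer.
    static long countBytes(Path file, int bufferSize) throws IOException {
        long total = 0;
        byte[] buffer = new byte[bufferSize];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        return total;
    }
}
```

In both methods the memory needed depends on the line length or buffer size, not on the size of the file.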

CSV file

In this example I am going to read a CSV file which is around 500 MB in size. A sample is given below.
"2018-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",1,0,"Provisional"
"2019-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",0,0,"Provisional"

File reading and counting year wise

We will read the CSV file and count the rows per year, using the first column of the file. We will do it in two different ways: synchronously, and asynchronously using CompletableFuture. The code is shown in the next sections.

Instance Variables

    private final long mb = 1024*1024;
    private final String file = "/Users/Downloads/sample.csv";

Common Methods

The method below collects the year-wise count in a map.
    public void yearCount(String line, Map<String, Integer> countMap){
        // characters 1..4 of the line are the year, e.g. "2018-06" -> 2018
        String key = line.substring(1, 5);
        if(countMap.containsKey(key)) {
            countMap.put(key, countMap.get(key)+1);
        } else {
            countMap.put(key, 1);
        }
    }
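The same counting step can be written more compactly with `Map.merge`. This is an optional variation, not the code used for the measurements in this post; the class name `YearCounter` is hypothetical.

```java
import java.util.Map;

// Hypothetical wrapper class; only the method body differs from the version above.
final class YearCounter {

    // Assumes every data line starts with a quoted "yyyy-MM" value, as in the sample rows.
    static void yearCount(String line, Map<String, Integer> countMap) {
        String key = line.substring(1, 5);
        // Inserts 1 for a new year, otherwise adds 1 to the existing count.
        countMap.merge(key, 1, Integer::sum);
    }
}
```

Both versions throw a StringIndexOutOfBoundsException for a line shorter than five characters, so a blank line in the file would need to be filtered out first.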

I have annotated the method below with EventListener so that it is invoked automatically once the application is up and ready. It also calculates the memory consumption and execution time.
    @EventListener(ApplicationReadyEvent.class)
    public void testLargeFile() throws Exception{
        long premem = Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();
        long start = System.currentTimeMillis();

        System.out.println("Used memory pre run (MB): "+(premem/mb));
//        System.out.println("Year count: "+simpleYearCount(file));//process file synchronously and print details
//        System.out.println("Year count: "+asyncYearCount(file));//process file asynchronously and print details
        long postmem = Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();

        System.out.println("Used memory post run (MB): "+(postmem/mb));
        System.out.println("Memory consumed (MB): "+(postmem-premem)/mb);
        System.out.println("Time taken in MS: "+(System.currentTimeMillis()-start));
    }
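The memory figure used above can be pulled into a small helper. The sketch below is an illustration with hypothetical names (`UsageProbe`, `usedMemoryBytes`, `toMb`). Note that `totalMemory() - freeMemory()` also counts objects that are garbage but not yet collected, so the values are approximate.

```java
// Hypothetical helper, shown only to isolate the measurement used above.
final class UsageProbe {

    // Heap currently in use, including garbage that has not been collected yet.
    static long usedMemoryBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Integer division, matching the (value / mb) arithmetic in the method above.
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }
}
```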

Synchronous processing 

Below is the code which reads the file using the NIO API and calculates the year count synchronously. The code is short and simple.
    public Map<String, Integer> simpleYearCount(String file) throws IOException {
        Map<String, Integer> yearCountMap = new HashMap<>();
        // Files.lines reads lazily; try-with-resources closes the file afterwards
        try (Stream<String> lines = Files.lines(Paths.get(file))) {
            lines.skip(1)//skip first line
                 .forEach(s -> yearCount(s, yearCountMap));
        }
        return yearCountMap;
    }


Used memory pre run (MB): 41
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 304
Memory consumed (MB): 262
Time taken in MS: 1971
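The synchronous count can also be expressed purely with stream collectors instead of a mutable map. This is an alternative sketch, not the code that produced the output above; the class `StreamYearCount` and method `countByYear` are hypothetical names, and the counts come back as `Long` rather than `Integer`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical alternative to simpleYearCount.
final class StreamYearCount {

    // Assumes a header row followed by lines that start with a quoted "yyyy-MM" value.
    static Map<String, Long> countByYear(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.skip(1) // skip the header row
                        .collect(Collectors.groupingBy(
                                line -> line.substring(1, 5), // year part of the first column
                                Collectors.counting()));
        }
    }
}
```

The file is still read lazily, so only the per-year counters are kept in memory.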

Asynchronous processing 

Here we read the file using the NIO API and then process it asynchronously using CompletableFuture. For example, we read 10000 lines and process them asynchronously, then the next 10000, and so on. See the code below.
    public Map<String, Integer> asyncYearCount(String file) throws IOException, InterruptedException, ExecutionException {
        try (Stream<String> lines = Files.lines(Paths.get(file))) {
            List<CompletableFuture<Map<String, Integer>>> futures = new ArrayList<>();
            List<String> items = new ArrayList<>();
            lines.skip(1)//skip first line
                 .forEach(line -> {
                items.add(line);
                if(items.size()%10000==0) {
                    //add completable task for each of 10000 rows
                    futures.add(CompletableFuture.supplyAsync(yearCountSupplier(new ArrayList<>(items), new HashMap<>())));
                    items.clear();//start collecting the next batch
                }
            });
            if(items.size()>0) {
                //add completable task for remaining rows
                futures.add(CompletableFuture.supplyAsync(yearCountSupplier(items, new HashMap<>())));
            }
            return CompletableFuture.allOf(futures.toArray(new CompletableFuture[futures.size()]))
                    .thenApply(v -> {
                    //join all task to collect result after all tasks completed
                    Map<String, Integer> yearCountMap = new HashMap<>();
                    futures.stream().map(CompletableFuture::join).forEach(map -> {
                        //merge the result of all the tasks
                        map.forEach((key, val)->{
                            if(yearCountMap.containsKey(key)) {
                                yearCountMap.put(key, yearCountMap.get(key)+val);
                            } else {
                                yearCountMap.put(key, val);
                            }
                        });
                    });
                    return yearCountMap;
                    }).get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return new HashMap<>();
    }
    //Supplier method to count the year in given rows
    public Supplier<Map<String, Integer>> yearCountSupplier(List<String> items, Map<String, Integer> map){
        return ()->{
            items.forEach(line -> yearCount(line, map));
            return map;
        };
    }


Used memory pre run (MB): 120
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 262
Memory consumed (MB): 142
Time taken in MS: 1549


We have now seen how to read and process a huge file, both synchronously and asynchronously. Comparing the two outputs, the async run was faster (1549 ms against 1971 ms). In this run it also reported less memory consumed (142 MB against 262 MB), although async execution may use more memory in general because multiple threads are processing batches at the same time. The difference is likely to be more significant with heavier files.
I would suggest synchronous execution if you have little memory, and async execution otherwise for better performance. You can also use async execution with little memory, but it may not be as beneficial because of the small chunks and too many threads.
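One way to keep the number of threads under control is to pass an explicit executor to `CompletableFuture.supplyAsync`, which otherwise typically runs tasks on the common ForkJoinPool. The sketch below is an assumption-laden illustration, not the code measured above: the class `BoundedAsyncCount` and its methods are hypothetical, and the batches are passed in as lists so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical variation of the async count with a fixed-size thread pool.
final class BoundedAsyncCount {

    // Counts years per batch on a pool of the given size, then merges the partial maps.
    static Map<String, Integer> count(List<List<String>> batches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Map<String, Integer>>> futures = new ArrayList<>();
            for (List<String> batch : batches) {
                futures.add(CompletableFuture.supplyAsync(() -> countBatch(batch), pool));
            }
            Map<String, Integer> total = new HashMap<>();
            for (CompletableFuture<Map<String, Integer>> future : futures) {
                // join() blocks until that batch is done, then its counts are merged
                future.join().forEach((year, n) -> total.merge(year, n, Integer::sum));
            }
            return total;
        } finally {
            pool.shutdown(); // let the worker threads exit
        }
    }

    // Assumes each line starts with a quoted "yyyy-MM" value, as in the sample rows.
    static Map<String, Integer> countBatch(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line.substring(1, 5), 1, Integer::sum);
        }
        return counts;
    }
}
```

The pool size limits how many batches are processed at once, but it does not limit how many batches are waiting in memory, so with little memory the reader would still need to slow down when too many batches are pending.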

